<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Byte Size Biology &#187; Bioinformatics</title>
	<atom:link href="http://bytesizebio.net/index.php/category/science/biology/bioinformatics/feed/" rel="self" type="application/rss+xml" />
	<link>http://bytesizebio.net</link>
	<description>The musings and ravings of a computational biologist about science, computers, music and, you know, stuff</description>
	<lastBuildDate>Fri, 18 May 2012 18:10:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Job opening: Scientific Curator at the Jackson Laboratory</title>
		<link>http://bytesizebio.net/index.php/2012/05/18/job-opening-scientific-curator-at-the-jackson-laboratory/</link>
		<comments>http://bytesizebio.net/index.php/2012/05/18/job-opening-scientific-curator-at-the-jackson-laboratory/#comments</comments>
		<pubDate>Fri, 18 May 2012 18:09:13 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[curator]]></category>
		<category><![CDATA[gene annotation]]></category>
		<category><![CDATA[genome annotation]]></category>
		<category><![CDATA[Jackson Lab]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[mouse]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=6130</guid>
		<description><![CDATA[Scientific Curator – Bioinformatics Interested individuals should apply on-line at www.jax.org/careers, referring to job posting #3256.  Contact Jeannine Ross at ext. 6045 with questions. The incumbent in this position plays a critical role in data annotation and curation for the Gene Ontology (GO) and Protein Ontology (PRO) programs at The Jackson Laboratory in Bar Harbor [...]]]></description>
			<content:encoded><![CDATA[<blockquote>
<div><strong>Scientific Curator – Bioinformatics</strong></div>
<div>
<p>Interested individuals should apply on-line at <a href="http://www.jax.org/careers" target="_blank">www.jax.org/careers</a>, referring to job posting #3256.  Contact Jeannine Ross at ext. 6045 with questions.</p>
</div>
<div>
<p>The incumbent in this position plays a critical role in data annotation and curation for the Gene Ontology (GO) and Protein Ontology (PRO) programs at The Jackson Laboratory in Bar Harbor, Maine, through diverse activities to gather, analyze, evaluate and integrate information and analysis results using biomedical ontologies.  Activities include, but are not limited to, obtaining data via literature or electronic-based means, determining data object identity/uniqueness, judging information or analyses for appropriateness of incorporation into GO and PRO resources, and evaluating and applying biomedical ontologies.  This individual must keep abreast of new scientific developments that are relevant to functional genomics, and should attend group meetings and seminars, as well as present posters/platform sessions at conferences.  Team participation in project development and software testing is expected, as well as collaborations with outside research groups and international bioinformatics communities.  Assisting with training new curation staff, authoring project proposals, and writing/maintaining curational documentation are some of the additional roles that may be played by scientific curators.</p>
<p>Required:</p>
<p>·       advanced knowledge in mouse as an experimental organism</p>
<p>·       expert knowledge in specific data areas of biochemistry as well as functional and comparative genomics</p>
<p>·       broad understanding of database principles, biomedical ontologies, and skills with computational analysis techniques and data interpretation</p>
<p>·       exceptional communication and organizational skills</p>
<p>Experience/Education:</p>
<p>·       requires a Doctoral degree in the Life Sciences, and</p>
<p>·       a minimum of 1 – 3 years of experience</p>
</div>
</blockquote>
<p>&nbsp;</p>
<div id="attachment_6131" class="wp-caption alignnone" style="width: 394px"><a href="http://bytesizebio.net/wp-content/uploads/2012/05/mouse-annotations.jpg"><img class=" wp-image-6131" title="mouse-annotations" src="http://bytesizebio.net/wp-content/uploads/2012/05/mouse-annotations.jpg" alt="" width="384" height="288" /></a><p class="wp-caption-text">Credit: Mr.Thomas, Flickr</p></div>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2012/05/18/job-opening-scientific-curator-at-the-jackson-laboratory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>It&#8217;s a smORF world, after all?</title>
		<link>http://bytesizebio.net/index.php/2012/04/27/its-a-smorf-world-after-all/</link>
		<comments>http://bytesizebio.net/index.php/2012/04/27/its-a-smorf-world-after-all/#comments</comments>
		<pubDate>Fri, 27 Apr 2012 19:25:05 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Biology]]></category>
		<category><![CDATA[Evolution]]></category>
		<category><![CDATA[Genomics]]></category>
		<category><![CDATA[drosophila]]></category>
		<category><![CDATA[fly]]></category>
		<category><![CDATA[genomics]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5774</guid>
		<description><![CDATA[Here is a study that looked for a type of gene that the authors felt was neglected by classic genomic annotation. The research shows how they employed concepts in molecular evolution to validate the existence of these genes. Some background: the first question we ask after assembling a genome is: &#8220;where are the genes&#8221;? Not [...]]]></description>
			<content:encoded><![CDATA[<p><span style="float: left; padding: 5px;"><a href="http://www.researchblogging.org"><img style="border: 0;" src="http://www.researchblogging.org/public/citation_icons/rb2_large_gray.png" alt="ResearchBlogging.org" /></a></span></p>
<p>Here is a study that looked for a type of gene that the authors felt was neglected by classic genomic annotation. The research shows how they employed concepts in molecular evolution to validate the existence of these genes.</p>
<p>Some background: the first question we ask after assembling a genome is: &#8220;where are the genes&#8221;? Not an easy question to answer, since a gene is classically defined as a <em>unit of heredity</em>. It may code for RNA, protein, or sometimes, nothing at all. The actual implementation of the &#8220;unit of heredity&#8221; can take several physical forms, each one of them different. Therefore, the algorithms for finding genes depend on exactly which type of gene one is looking for.</p>
<p>A somewhat more tractable question is: &#8220;where are the open reading frames&#8221;? Open reading frames or ORFs are those stretches of DNA that code for proteins.  Indeed, most gene calling software actually identifies ORFs. There are many attributes that go into an ORF calling algorithm: the frequency of the bases (or <em>k</em>-mers of bases) in the suspected coding regions, the signals for the beginning and ends of introns, the existence of non-coding regions that aid transcription such as promoters and enhancers, the location on the chromosome with relation to other ORFs, and the length of the final product. The latter criterion is actually quite important, as many ORF-calling algorithms will discount anything coding for a protein that is shorter than 100 amino acids as being &#8220;too short&#8221;. The reason for employing this length cutoff is that the number of false positives increases dramatically when ORFs coding for proteins shorter than 100aa (or 300 nucleotides) are called. Therefore, most gene-callers simply discard any short peptides.</p>
<p>But throwing away the baby with the bathwater is not a good solution, since short peptides are known to be responsible for many of life&#8217;s activities: mating pheromones, small compound transporters, hormones, neurotransmitters and regulation of other proteins&#8217; activities, to name a few. Many of these short peptides are the result of the cleavage of larger proteins, which means that the ORFs encoding them are originally longer than 300bp.  But some may actually have their own ORFs, coding only for them. How can we find those small ORFs, or <strong>smORFs</strong>? How many of them are there? Is the number of smORFs large enough to make it worth re-annotating genomes?</p>
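<p>To get a feel for why short ORFs produce so many false positives, here is a back-of-the-envelope sketch. It assumes that all 64 codons are equally likely in random sequence, which real genomes are not, so treat the numbers as an illustration only. The function name is made up for this example.</p>

```python
# Chance that a random reading frame of n codons contains no stop codon.
# Assumption: all 64 codons are equally likely (real genomes have biased
# base composition, so this is only a rough illustration).
STOP_CODONS = 3       # TAA, TAG and TGA in the standard genetic code
TOTAL_CODONS = 64


def p_open_by_chance(n_codons):
    """Probability that n consecutive random codons are all non-stop."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


for n in (10, 33, 100):
    print(n, "codons:", round(p_open_by_chance(n), 4))
```

<p>Under this simple model a stretch of 10 stop-free codons turns up by chance most of the time, while a stretch of 100 is rare, which is the intuition behind a length cutoff.</p>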
<div class="wp-caption alignnone" style="width: 310px"><a href="http://bytesizebio.net/wp-content/uploads/2012/04/1209px-Gene2-plain.svg_.png"><img class="size-medium wp-image-6032" title="1209px-Gene2-plain.svg" src="http://bytesizebio.net/wp-content/uploads/2012/04/1209px-Gene2-plain.svg_-300x254.png" alt="" width="300" height="254" /></a>
<p class="wp-caption-dd">Click to enlarge. Gene structure. Source: Wikimedia Commons. Credit: Forluvoft</p>
</div>
<p>Emmanuel Ladoukakis from the University of Crete and colleagues from the University of Sussex, UK have set up a bioinformatic pipeline to look for smORFs in the <em>Drosophila melanogaster</em> genome. Bear with me, there are a few steps in this pipeline. But there&#8217;s a lot to learn about genomics just from looking at what they did, and why they took those steps.</p>
<p>Here&#8217;s what they did: <strong>1) Find smORF candidates:</strong> they looked for all potential smORFs (starting with a start codon and ending with an in-frame stop codon, 30-300bp long) in those parts of <em>D. melanogaster</em>&#8217;s genome that were annotated as non-coding. To keep things simple, they looked only for intron-less smORFs: smORFs that are encoded consecutively in the DNA.  They found 593,586 potential sequences. <strong>2) Remove transposons: </strong>they then removed all those that had a similarity to transposons. Transposons are DNA elements that multiply in the chromosome: something like an internal virus, only usually benign. They may carry bits of other genes they &#8220;grab&#8221; on the way, but they are not functional. They were left with 556,554 sequences. <strong>3) Big step: look for homologs in another fly species: </strong>they then looked for smORFs with similar translated amino-acid sequences in <em>D. pseudoobscura</em>, which diverged from <em>D. melanogaster</em> 25 to 55 million years ago. The reason they looked for similar amino-acid sequences was that if there is selection to conserve a smORF, it would act at the protein level, and not at the DNA level. This step reduced the number of smORF candidates by 93%: from 556,554 down to 43,210.  <strong>4) Look only for global alignments (another big step):</strong> by requiring alignments of whole smORF sequences, not only partial local similarities, they found 4,561 smORF candidates. This reduced the number of candidates by 72% from step (3). We are now down to 0.8% of the original 593,586 smORF candidates.</p>
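<p>Step 1 is easy to picture in code. What follows is a minimal sketch, not the authors&#8217; actual pipeline: it scans only the forward strand, assumes ATG as the only start codon, and assumes the 30-300bp range includes the stop codon. The function and variable names are made up for this example.</p>

```python
import re

# Thresholds taken from the text; whether the authors counted the stop
# codon in the 30-300 bp length is an assumption made here.
MIN_LEN, MAX_LEN = 30, 300
STOPS = {"TAA", "TAG", "TGA"}


def find_smorfs(seq):
    """Return (start, end, orf) for intron-less smORF candidates.

    Only the forward strand is scanned. In-frame ATGs inside a longer
    candidate are reported as separate, nested candidates.
    """
    seq = seq.upper()
    hits = []
    for match in re.finditer("ATG", seq):
        start = match.start()
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                end = pos + 3
                if end - start in range(MIN_LEN, MAX_LEN + 1):
                    hits.append((start, end, seq[start:end]))
                break
    return hits


# Toy sequence: a 36 bp ORF (ATG, ten GCT codons, TAA) with flanking bases.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
for start, end, orf in find_smorfs(toy):
    print(start, end, orf)
```

<p>A real search would also scan the reverse complement and restrict itself to regions annotated as non-coding, as the authors did.</p>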
<p>Quite a filtering process. Note the huge elimination: 99.2% of all initial smORF candidates are gone. I believe that they decided to sacrifice sensitivity in favor of specificity.</p>
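<p>The overall figure is simple arithmetic on the counts quoted above. A small sketch, with stage labels that are just shorthand for this example:</p>

```python
# Candidate counts as quoted in the post, in pipeline order.
stages = [
    ("all smORF candidates", 593586),
    ("after transposon removal", 556554),
    ("homologs in D. pseudoobscura", 43210),
    ("global alignments", 4561),
]


def percent_remaining(count, initial):
    """Share of the initial candidates still present, in percent."""
    return 100.0 * count / initial


initial = stages[0][1]
for name, count in stages:
    share = percent_remaining(count, initial)
    print(f"{name}: {count} ({share:.1f}% of initial)")
```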
<p>So they had 4,561 smORF candidates conserved between two flies. Still, how many ORFs got in by chance? Hard to know, but they continued to rely on evolutionary conservation as a guideline. There may be smORFs that appeared independently in <em>melanogaster</em> and <em>pseudoobscura</em> after they separated 25 to 55 million years ago, but the main evidence for true smORFs would be their evolutionary conservation between the two fly species.</p>
<p>To get even more specific, they now <strong>5) looked for <a href="http://en.wikipedia.org/wiki/Synteny#Shared_synteny">shared synteny</a></strong>: conservation not only of sequence, but also of the genomic context, that is, the sequences surrounding each smORF. That brought the number down to 3,314.</p>
<p>OK, so they looked for conservation based on homology and based on synteny. Anything more? Well, yes. The next step would be to <strong>6) look for evolutionarily selected smORFs</strong>. The two evolutionary criteria they used until now were homology and synteny. Now comes a third: selection. If smORF candidates are actually coding, they will be subject to purifying selection, that is, to selection that eliminates deleterious mutations. This is evident in a low rate of non-synonymous <em>vs</em>. synonymous substitutions, or a <a href="http://en.wikipedia.org/wiki/Ka/Ks_ratio" target="_blank">Ka/Ks ratio</a> of &lt;&lt; 1. (Read more about Ka/Ks ratios <a href="http://www.sciencedirect.com/science/article/pii/S0168952502027221" target="_blank">here</a>.) <strong>7) Looking at what actually gets transcribed in <em>Drosophila</em></strong> (based on transcriptome data), they whittled this number down to a final <span style="text-decoration: underline;">401</span>.</p>
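<p>To make the synonymous vs. non-synonymous distinction concrete, here is a crude sketch that simply counts codon differences of each kind between two aligned coding sequences. This is <em>not</em> a Ka/Ks estimator: a real one normalizes by the number of synonymous and non-synonymous sites and corrects for multiple substitutions, and this is not the method used in the paper. It assumes the standard genetic code and gap-free, in-frame sequences; the function name is made up for this example.</p>

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_changes(seq1, seq2):
    """Count (synonymous, non-synonymous) codon differences.

    A crude proxy only: both sequences must be aligned, gap-free and
    in frame, and no per-site normalization is applied.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free and in frame")
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: non-synonymous change
    return syn, nonsyn


# GCT and GCC both code for alanine; AAA (lysine) vs AGA (arginine) differ.
print(count_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```

<p>Under purifying selection you would expect the non-synonymous count to be small relative to the synonymous one, once both are properly normalized.</p>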
<div class="mceTemp">
<dl id="attachment_6039" class="wp-caption alignnone" style="width: 203px;">
<dt class="wp-caption-dt"><a href="http://bytesizebio.net/wp-content/uploads/2012/04/smorf-pipeline.jpg"><img class="size-medium wp-image-6039" title="smorf-pipeline" src="http://bytesizebio.net/wp-content/uploads/2012/04/smorf-pipeline-193x300.jpg" alt="" width="193" height="300" /></a><p class="wp-caption-text">Click to enlarge. Search pipeline for Drosophila smORFs. Diagram of the smORF search pipeline followed in this study. The percentages of smORFs passing each filter are indicated. For full details, see Results and Materials and methods. CDS, coding DNA sequence; Dm, Drosophila melanogaster; Dp, Drosophila pseudoobscura; Ka/Ks, ratio of non-synonymous (Ka) to synonymous (Ks) nucleotide substitution. Ladoukakis et al. Genome Biology 2011 12:R118   doi:10.1186/gb-2011-12-11-r118</p></dt></dl></div>
<p>So the chosen 401 smORFs are evolutionarily conserved, both in sequence and in synteny, subject to purifying selection (by Ka/Ks ratio) and produce a transcript. The authors obviously went for specificity over sensitivity: they looked for &#8220;good bet&#8221; smORFs rather than a large number of candidates. What I like about this study is the way the authors used a large number of evolutionary traits as attributes for identifying smORFs. They were also careful to rule out, as much as possible, that these smORFs may be the result of a larger transcript. This is really nice molecular evolution work. There is no experimental evidence yet of the functionality of these smORFs: that is left to future proteomics researchers and fly geneticists. But the idea of a small(er) world of genes, hiding in plain sight among the more familiar large ones, does have its appeal, and may yield some surprises about how our genomes are structured.</p>
<p>Finally, for the evolutionary biologists: read the <a href="http://genomebiology.com/2011/12/11/R118" target="_blank">paper</a>; there is quite a lot more to it than what I wrote. I just gave the highlights.</p>
<p>&nbsp;</p>
<hr />
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.jtitle=Genome+Biology&amp;rft_id=info%3Adoi%2F10.1186%2Fgb-2011-12-11-r118&amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;rft.atitle=Hundreds+of+putatively+functional+small+open+reading+frames+in+Drosophila&amp;rft.issn=1465-6906&amp;rft.date=2011&amp;rft.volume=12&amp;rft.issue=11&amp;rft.spage=0&amp;rft.epage=&amp;rft.artnum=http%3A%2F%2Fgenomebiology.com%2F2011%2F12%2F11%2FR118&amp;rft.au=Ladoukakis%2C+E.&amp;rft.au=Pereira%2C+V.&amp;rft.au=Magny%2C+E.&amp;rft.au=Eyre-Walker%2C+A.&amp;rft.au=Couso%2C+J.&amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CBioinformatics%2C+%2C+Genetics+%2C+Evolutionary+Biology%2C+Genomics">Ladoukakis, E., Pereira, V., Magny, E., Eyre-Walker, A., &amp; Couso, J. (2011). Hundreds of putatively functional small open reading frames in Drosophila <span style="font-style: italic;">Genome Biology, 12</span> (11) DOI: <a href="http://dx.doi.org/10.1186/gb-2011-12-11-r118" rev="review">10.1186/gb-2011-12-11-r118</a></span></p>
<p>&nbsp;</p>
<p><a href="http://genomebiology.com/2011/12/11/R118/abstract">http://genomebiology.com/2011/12/11/R118/abstract</a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2012/04/27/its-a-smorf-world-after-all/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Biocuration 2012</title>
		<link>http://bytesizebio.net/index.php/2012/04/06/biocuration-2012/</link>
		<comments>http://bytesizebio.net/index.php/2012/04/06/biocuration-2012/#comments</comments>
		<pubDate>Fri, 06 Apr 2012 15:03:25 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Biology]]></category>
		<category><![CDATA[blogging]]></category>
		<category><![CDATA[Social media]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[annotation]]></category>
		<category><![CDATA[biocuration]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[DC]]></category>
		<category><![CDATA[protein function prediction]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5977</guid>
		<description><![CDATA[&#160; Great meeting:  Biocuration 2012, Georgetown University, DC.  When I leave a meeting with my head exploding with new ideas and a need to try them all out at once, I know I got my money&#8217;s worth, and then some. Even a three hour flight delay followed by discovering my car with a dead battery [...]]]></description>
			<content:encoded><![CDATA[<p>&nbsp;</p>
<p>Great meeting:  <a href="http://pir.georgetown.edu/biocuration2012/">Biocuration 2012</a>, Georgetown University, DC.  When I leave a meeting with my head exploding with new ideas and a need to try them all out at once, I know I got my money&#8217;s worth, and then some. Even a three-hour flight delay followed by discovering my car with a dead battery at 1am at the deserted Dayton Airport parking lot did not dampen my enthusiasm upon return. I will make sure my dome light is off before I leave my car the next time though. What follows are bits and pieces from the meeting that I enjoyed. I&#8217;m doing this mostly from memory, two days later, so I may have an addendum once I get my notes together.</p>
<p>What is biocuration? Well, anything that has to do with annotating, labeling, indexing, identifying biological entities. Almost exclusively genes in this conference. Genome databases, especially those of model organisms, employ curators to annotate, check and re-annotate the genomic data. Here&#8217;s a more elaborate explanation, <a href="http://biocurator.org/what.shtml" target="_blank">taken</a> from the website of the <a href="http://biocurator.org/home.shtml" target="_blank">International Society for Biocuration</a>:</p>
<blockquote><p>Biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets. Accurate and comprehensive representation of biological knowledge, as well as easy access to this data for working scientists and a basis for computational analysis, are primary goals of biocuration.</p>
<p>The goals of biocuration are achieved thanks to the convergent endeavors of biocurators, software developers and researchers in bioinformatics. Biocurators provide essential resources to the biological community such that databases have become an integral part of the tools researchers use on a daily basis for their work.</p></blockquote>
<p><a href="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses.jpg"><img class="aligncenter" src="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses-296x300.jpg" alt="" width="178" height="180" /></a></p>
<p>&nbsp;</p>
<p><strong>Day 1</strong> started off with many community annotation tools. I thought that the Wikipedia model for annotation was dead, but maybe I&#8217;m wrong. Many community efforts use a large number of experts, as opposed to a huge number of non-experts, which is what the speakers at the first session were discussing. <a href="http://www.pombase.org/" target="_blank">Pombase</a> (whose title drew some chuckles from the French speakers at my table), the <a href="http://ciliate.org/index.php/home/welcome" target="_blank">Tetrahymena Genome Database</a> Wiki and the <a href="http://en.wikipedia.org/wiki/Gene_Wiki" target="_blank">Gene Wiki</a> were presented. The Gene Wiki, presented by <a href="http://sulab.org/" target="_blank">Andrew Su</a> from TSRI is a <em>bona-fide</em> crowdsourcing approach, not just Wikipedia-like but actually comprised of a set of 10,000 gene definition stubs folded into Wikipedia. Jennifer Harrow from Sanger presented a poster with an accession model of annotations: the &#8220;blessed annotator&#8221; who has been trained for 3 months and has the run of the wiki, and the &#8220;gatekeeper&#8221;, who has been trained in a 2-day workshop, and whose contributions need to be monitored. Lots of talks about trusted annotators, etc. Perhaps we should look to cryptography&#8217;s &#8220;circles of trust&#8221; to enable trusted annotations yet increase the number of curators. (I use &#8220;curation&#8221; and &#8220;annotation&#8221; interchangeably throughout.)</p>
<p>An afternoon workshop discussed <a href="http://database.oxfordjournals.org/content/2012/bar059.abstract" target="_blank">who are biocurators</a>. If you are a biocurator, there&#8217;s a good probability you are 31-50 years young (80%), female (60%), with a PhD (76%), been through the academic mill and found it to be a bad fit for one reason or the other. You like your work, you rarely burn out, it is challenging and stimulating, you are not in it for the money. (Few people in non-industry science are.)  Actually, since non-profit science is run on soft money, funding is a serious concern, and your job may have a shorter half-life than you would care for it to have, as you are probably employed on a 3-5 year contract. Your boss is rarely a biocurator her/himself, which may mean that your job description may sometimes be ill-defined.</p>
<p>After  that, there was a  whole session devoted to curation workflows and tools. If  you are setting up your own genomic database, check these out: <a href="http://gmod.org/wiki/WebApollo" target="_blank">WebApollo</a>,  <a href="http://database.oxfordjournals.org/content/2012/bas001.short" target="_blank">CvManGO</a> and the <a href="http://www.reactome.org/" target="_blank">Reactome</a>. <a href="http://pimm.wordpress.com/about/" target="_blank">Attila Csordas</a> from EBI presented <a href="http://www.ebi.ac.uk/pride/" target="_blank">PRIDE</a>, a tool for curating proteomic data. While proteomic data are growing, there are few choices of software tools to annotate them. So PRIDE is a welcome player in the field.</p>
<p style="text-align: center;"><a href="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses.jpg"><img class="wp-image-5996 aligncenter" src="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses-296x300.jpg" alt="" width="178" height="180" /></a></p>
<p><strong> Day 2</strong> had a &#8220;Genomics, metagenomics comparative genomics&#8221; session, only without the metagenomics. <img src='http://bytesizebio.net/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' />   What I really liked was the <a href="http://viralzone.expasy.org/" target="_blank">ViralZone</a> resource for viral genomes, out of SIB. High time someone did this for the most abundant biological particle on Earth, and the one responsible for most diversity in life.</p>
<p>The breakout sessions were my favorite, getting a chance to interact with like-minded people interested in similar questions. (That is, those that share my prejudices.) I went to the one organized by <a href="http://www.unil.ch/dee/page22707_en.html" target="_blank">Marc Robinson-Rechavi</a> and <a href="http://www.unil.ch/dee/page48559_en.html">Frederic Bastian</a> which dealt with the question of quality in gene annotation.  Here is the problem: when we annotate a gene with a function (or functions), we also need to say what evidence led us to think that this gene does what it does. The most popular vocabulary for annotating genes is the <a href="http://www.geneontology.org/" target="_blank">Gene Ontology</a> or GO. GO provides us with <a href="http://www.geneontology.org/GO.evidence.shtml" target="_blank">evidence codes</a> which allow the curator to state the evidence for the function they assign to a gene. Those range from experimental evidence codes such as &#8220;inferred from mutant phenotype&#8221;, which are always entered by a human curator, to &#8220;Inferred from Electronic Annotation&#8221;, which has no human oversight. These evidence codes are used as a proxy for quality: people generally tend to accept that evidence from an experiment is stronger evidence that a gene does what it does than an electronic inference. That may not necessarily be true. For example, high-throughput experiments result in many genes being assigned annotations wholesale. Even with an (uncharacteristically low) 5% error rate, a single paper used as a source from which 5,000 genes are annotated would result in 250 wrongly annotated genes.  In addition, these types of experiments supply annotations that are not very specific, such as &#8220;protein binding&#8221; or &#8220;embryonic development&#8221;, terms that in many cases are too general to be useful. 
On the other hand, Nives Škunca of ETH Zurich presented a beautiful study showing that fully automated annotations may not be as inferior to human-curated ones as most people think, with some caveats. (Note: Nives also showed her work in a poster that won the best poster award at the meeting, and this work has just been accepted to <em>PLoS Computational Biology</em>. I will try to blog more about it once it&#8217;s published, it&#8217;s really brilliant.) The discussion revolved around how we should ascertain the quality of annotations, what would be considered a useful annotation, and how we can establish trustworthiness. Seems like there is quite a bit of work to be done, as people are only beginning to realize that this is a more complex problem than we thought. A major player in this will be the Evidence Ontology or <a href="http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=ECO">ECO</a>, an elaborate ontology in the making describing lines of evidence for gene annotation.</p>
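<p>For readers who have not worked with evidence codes, here is a small sketch of how annotations might be split by evidence class. The code sets are only a subset of the GO evidence codes, the gene names are placeholders, and the records are made up for illustration; the function names are hypothetical too.</p>

```python
# A subset of GO evidence codes. EXP, IDA, IPI, IMP, IGI and IEP are
# experimental codes; IEA is "Inferred from Electronic Annotation".
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC = {"IEA"}


def evidence_class(code):
    """Map an evidence code to a coarse class."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code in ELECTRONIC:
        return "electronic"
    return "other"


def split_by_evidence(annotations):
    """Group (gene, go_term, evidence_code) records by evidence class."""
    groups = {"experimental": [], "electronic": [], "other": []}
    for record in annotations:
        groups[evidence_class(record[2])].append(record)
    return groups


# Made-up records; the gene names are placeholders, not real genes.
annotations = [
    ("geneA", "GO:0005515", "IMP"),   # protein binding, mutant phenotype
    ("geneB", "GO:0005515", "IEA"),   # protein binding, electronic
    ("geneC", "GO:0003677", "ISS"),   # DNA binding, sequence similarity
]
groups = split_by_evidence(annotations)
for name, records in groups.items():
    print(name, len(records))
```

<p>Splitting like this is exactly the kind of proxy-for-quality shortcut discussed above, so it is a starting point for asking questions about annotation quality rather than an answer.</p>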
<p><a href="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses.jpg"><img class="aligncenter" src="http://bytesizebio.net/wp-content/uploads/2012/04/Solar-and-Lunar-eclipses-296x300.jpg" alt="" width="178" height="180" /></a></p>
<p><strong>Day 3</strong>: Attila Csordas, whom I mentioned earlier, organized an <a href="http://en.wikipedia.org/wiki/Unconference" target="_blank">unconference</a> session early morning. A few of us gave brief talks there. Ben Good from Andrew Su&#8217;s lab talked about biocuration through games. The idea is to do for biocuration what <a href="http://fold.it" target="_blank">fold.it</a> has done for protein folding. The <a href="http://sulab.org/2011/11/learning-from-the-dizeez-game/" target="_blank">Dizeez</a> game quizzes you about diseases related to genes, and scores you according to how well you link genes to diseases. But as Andrew says on his <a href="http://sulab.org/2011/11/learning-from-the-dizeez-game/" target="_blank">blog</a>:</p>
<blockquote><p> Generally, the gene-disease links in structured databases will be reasonably correct (though likely not at all complete). When we analyze the game logs in aggregate, we expect that players’ answers will generally reinforce what’s already known. But given enough game player data, also expect that we’ll see multiple instances of gene-disease links that <em>aren’t</em> reflected in current annotation databases. And these are candidate novel annotations.</p></blockquote>
<p>So there may be something there, although it is not the &#8220;wisdom of the crowds&#8221; that is being exploited, since I imagine that only people with advanced degrees in their field can contribute to Dizeez. You can see games from the Su lab on <a href="http://genegames.org/" target="_blank">genegames.org</a>. Sean Mooney from Buck talked about the <a href="http://www.mooneygroup.org/stop/input" target="_blank">Statistical Tracking of Ontological Phrases</a> (STOP) project. The idea here is to automatically enrich GO annotation of genes with other ontologies, to get a more comprehensive description of their function, especially when it comes to disease.  I talked about the <a href="http://bytesizebio.net/index.php/2011/07/02/cafa-update/" target="_blank">Critical Assessment of Function Annotations</a> (we finally submitted the paper, yay!).  Attila talked about annotating proteomic data.</p>
<p>Great meeting. A big thank you to the <a href="http://pir.georgetown.edu/biocuration2012/organizers.html" target="_blank">organizers</a>, it went without a hitch.  Logistics, food, coffee were all fantastic. Looking forward to Cambridge next year! <strong>EDIT</strong>: a <a href="http://www.oxfordjournals.org/our_journals/databa/biocuration_virtual_issue.html" target="_blank">virtual special issue of <em>Database</em></a> has been published for this meeting. Some of the talks are there as papers. Open Access, of course.</p>
<p>Finally, my favorite promotional item from the meeting:</p>
<p><a href="http://bytesizebio.net/wp-content/uploads/2012/04/2012-04-03-19.09.47.jpg"><img class="alignnone size-medium wp-image-6000" title="2012-04-03 19.09.47" src="http://bytesizebio.net/wp-content/uploads/2012/04/2012-04-03-19.09.47-225x300.jpg" alt="" width="225" height="300" /></a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2012/04/06/biocuration-2012/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>You. Want. This. Job.</title>
		<link>http://bytesizebio.net/index.php/2012/03/27/you-want-this-job/</link>
		<comments>http://bytesizebio.net/index.php/2012/03/27/you-want-this-job/#comments</comments>
		<pubDate>Tue, 27 Mar 2012 15:47:51 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[evolution]]></category>
		<category><![CDATA[genomics]]></category>
		<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5961</guid>
		<description><![CDATA[NSF grant funded, woohoo! Now I am hiring a programmer. So if you want to be part of a dynamic, growing lab, do lots of interesting stuff and upgrade yourself from just a great bioinformatician to a super-bioinformatician, this job&#8217;s for you.  You&#8217;ll be working primarily on microbial genome evolution, including setting up a kick-butt [...]]]></description>
			<content:encoded><![CDATA[<p>NSF grant funded, woohoo! Now I am hiring a programmer. So if you want to be part of a dynamic, growing lab, do lots of interesting stuff and upgrade yourself from just a great bioinformatician to a super-bioinformatician, this job&#8217;s for you.  You&#8217;ll be working primarily on microbial genome evolution, including setting up a kick-butt multi-genome database, and all sorts of interesting distractions.  See below for the nitty-gritty. Original ad here: <a href="https://www.miamiujobs.com" target="_blank">https://www.miamiujobs.com</a>, job posting number: <strong>0001377</strong> . Pass on to interested parties. Three year position, renewable annually.</p>
<blockquote><p><strong>Microbiology</strong>: Scientific Programmer/Specialist to implement and maintain a genomic database web site; implement data management tools including relational database management applications for efficient storage and retrieval of genomic data; perform other duties as related to the position such as data and project management to ensure data are being processed in an efficient and timely manner; contribute to writing scientific manuscripts.</p>
<p><strong>Required qualifications</strong>: BS or BA in Computer Science, bioinformatics, or a related discipline; demonstrated programming experience, particularly in Python and SQL databases; demonstrated web programming experience; knowledge of Linux/Unix; excellent spoken and written communication and documentation skills.</p>
<p><strong>Preferred qualifications</strong>: Advanced degree (M.Sc. or Ph.D) or equivalent in Computer Science, Bioinformatics, Molecular Biology or a related discipline; experience in development of bioinformatic algorithms; knowledge of R programming; experience in development of or contribution to open source projects; experience in collaborative software development such as the use of version control software, writing and following software specifications, participation in code review; knowledge of basic molecular biology; experience with genomic browser programming, such as GMOD or equivalent.</p>
<p>Candidates should send a CV or resume and have three letters of reference sent separately to Dr. Iddo Friedberg at <a href="http://is.gd/40N6zn">Friedberg.lab.jobs &#8216;at&#8217; gmail &#8216;dot&#8217; com</a>. Screening of applications begins April 14, 2012 and will continue until the position is filled.</p>
<p>Miami University is an affirmative action/equal opportunity employer with smoke-free campuses. Consumer Information http://www.miami.muohio.edu/about-miami/publications-and-policies/student-consumer-info/. Hard copy upon request.</p></blockquote>
<p><span style="font-family: Arial,sans-serif;"><a href="http://bytesizebio.net/wp-content/uploads/2012/03/job-ad-programmer.pdf" target="_blank">Ad in PDF</a>.<br style="font-family: Arial,sans-serif;" /></span></p>
<p style="margin-bottom: 0in;">
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2012/03/27/you-want-this-job/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Wikipedia pages on protein function prediction</title>
		<link>http://bytesizebio.net/index.php/2012/02/01/wikipedia-pages-on-protein-function-prediction/</link>
		<comments>http://bytesizebio.net/index.php/2012/02/01/wikipedia-pages-on-protein-function-prediction/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 15:55:20 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Free Culture]]></category>
		<category><![CDATA[Writing]]></category>
		<category><![CDATA[function-prediction]]></category>
		<category><![CDATA[protein-function]]></category>
		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5861</guid>
		<description><![CDATA[I just received an email from Julian Gough , one of last year&#8217;s CAFA participants. He started a Wikipedia initiative on protein function prediction, which are barely stubs at the moment. EDIT: He alerted me to the fact that protein function prediction has virtually no presence on Wikipedia. So all you protein function predictors out there, please contribute. Yes, [...]]]></description>
			<content:encoded><![CDATA[<p>I just received an email from <a href="http://www.cs.bris.ac.uk/~gough/" target="_blank">Julian Gough</a> , one of last year&#8217;s <a href="http://bytesizebio.net/index.php/2011/07/02/cafa-update/" target="_blank">CAFA</a> participants.<span style="color: #000000;"> <del>He started a Wikipedia initiative on protein function prediction, which are barely stubs at the moment</del>.</span> <span><span><strong style="color: #000000; text-decoration: underline;">EDIT</strong><span style="text-decoration: underline;">: He alerted me to the fact that protein function prediction has virtually no presence on Wikipedia</span></span><span style="color: #800000;">.</span></span> So all you protein function predictors out there, please contribute. Yes, you too!</p>
<p>I guess that as a CAFA organizer, I should really contribute to the second page. And I will. But I really don&#8217;t mind if someone else jump-starts it. <img src='http://bytesizebio.net/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><a href="http://en.wikipedia.org/wiki/Protein_function_prediction" target="_blank">http://en.wikipedia.org/wiki/<wbr>Protein_function_prediction</wbr></a></p>
<p><a href="http://en.wikipedia.org/wiki/Critical_Assessment_of_Function_Annotation" target="_blank">http://en.wikipedia.org/wiki/<wbr>Critical_Assessment_of_<wbr>Function_Annotation</wbr></wbr></a></p>
<p>&nbsp;</p>
<p><a href="http://bytesizebio.net/wp-content/uploads/2012/02/Wikipedia-logo.png"><img class="alignnone size-full wp-image-5862" title="Wikipedia-logo" src="http://bytesizebio.net/wp-content/uploads/2012/02/Wikipedia-logo.png" alt="" width="200" height="200" /></a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2012/02/01/wikipedia-pages-on-protein-function-prediction/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Circumcision, preventing fraud, and icky toilets. You know you&#8217;re going to read this.</title>
		<link>http://bytesizebio.net/index.php/2011/12/04/circumcision-preventing-fraud-and-icky-toilets-you-know-youre-going-to-read-this/</link>
		<comments>http://bytesizebio.net/index.php/2011/12/04/circumcision-preventing-fraud-and-icky-toilets-you-know-youre-going-to-read-this/#comments</comments>
		<pubDate>Sun, 04 Dec 2011 18:23:02 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Free Culture]]></category>
		<category><![CDATA[Metagenomics]]></category>
		<category><![CDATA[Microbiology]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Science publication]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5710</guid>
		<description><![CDATA[In no particular order or ranking, recent and not-so-recent articles from PLoS-1. The common thread (if any): I thought they were pretty cool in one way or another. &#160; 1. Men don&#8217;t tell the truth about their penis. No kidding? But this is somewhat more serious. It has been accepted for some time that male [...]]]></description>
			<content:encoded><![CDATA[<p>In no particular order or ranking, recent and not-so-recent articles from PLoS-1. The common thread (if any): I thought they were pretty cool in one way or another.</p>
<hr/>
<p>&nbsp;</p>
<p>1.<strong> Men don&#8217;t tell the truth about their penis.</strong> No kidding? But this is somewhat more serious. It has been accepted for some time that male circumcision dramatically reduces the rate of HIV infection. But recently, some reports have shown that high rates of infection prevail among circumcised men as well. But since circumcision is usually self-reported, could there be a problem there? This study shows that in a cross-sectional (sorry&#8230;) study among recruits to the Lesotho Defense Force, 50% of the men that reported they were circumcised were, in fact, partially (27%) or completely (23%) not circumcised. The researchers conclude that biases in the self-reporting of male circumcision may lead to erroneous reports that show high HIV infection rates among circumcised men.</p>
<p><span style="text-decoration: underline;">Concluding quote:</span></p>
<blockquote><p>&#8230;until further research can document improved methods for obtaining accurate self-reported MC [male circumcision <em>I.F.</em>] data, all assessments of MC and HIV prevalence, as well as projections for VMMC [voluntary male medical circumcision <em>I.F.</em>] interventions, should be informed by physical-exam-based data [as opposed to self-reporting, <em>I.F.</em>].</p></blockquote>
<p><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027561">http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027561</a></p>
<p><span style="float: center; padding: 5px;"><a href="http://www.researchblogging.org"><img style="border: 0;" src="http://www.researchblogging.org/public/citation_icons/rb2_large_gray.png" alt="ResearchBlogging.org" /></a></span></p>
<hr/>
<p>2. <strong>Share your data or GTFO. </strong></p>
<p>Can sharing data help prevent errors and fraud?</p>
<p>From the abstract:</p>
<blockquote><p><strong>Background</strong>: The widespread reluctance to share published research data is often hypothesized to be due to the authors&#8217; fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically</p></blockquote>
<p>So <a href="http://wicherts.socsci.uva.nl/" target="_blank">Jelte Wicherts</a> and his colleagues from the University of Amsterdam wanted to see whether sharing data was related to the number of statistical analysis errors in a paper. So, to phrase this as a null and alternative hypothesis:</p>
<p><strong>H0: There is no difference in the number of statistical errors between papers whose authors are willing to share data and papers whose authors are unwilling to do so.</strong></p>
<p><strong>H1 (one-sided): papers whose authors are unwilling to share data show weaker evidence and more statistical errors than papers whose authors are willing to share data.</strong></p>
<p>Wicherts and colleagues contacted authors of 141 papers published in five journals of the American Psychological Association, requesting their data. Trouble is, they could not get enough authors to share data to make their own study significant: in a <a href="http://psycnet.apa.org/journals/amp/61/7/726/" target="_blank">previous study</a>, some 73% of the authors contacted were unwilling to share data. Wow.</p>
<p>However, authors publishing in two of these journals, <em>Journal of Personality and Social Psychology (JPSP)</em> and <em>Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC),</em> were somewhat more forthcoming. Wicherts and colleagues therefore limited their analysis to a subset of 49 papers published in those journals. (Note that sometimes lack of data sharing is due to legitimate considerations, such as being part of an ongoing study, or third-party proprietary rights. However, those were not considerations in the 49 papers analyzed here.)</p>
<p>Wicherts then checked for specific types of statistical errors in these papers, and compared the number of errors in papers whose authors were willing to share data to the number in papers whose authors were not. Here are some of the findings:</p>
<div id="attachment_5719" class="wp-caption alignnone" style="width: 624px"><a href="http://bytesizebio.net/wp-content/uploads/2011/12/data-errors.png"><img class="size-large wp-image-5719 " title="data-errors" src="http://bytesizebio.net/wp-content/uploads/2011/12/data-errors-1024x962.png" alt="" width="614" height="577" /></a><p class="wp-caption-text">Distribution of the number of errors in the reporting of p-values for 28 papers from which the data were not shared (left column) and 21 from which the data were shared (right column) for all misreporting errors (upper row), larger misreporting errors at the 2nd decimal (middle row), and misreporting errors that concerned statistical significance (p&lt;.05; bottom row). doi:10.1371/journal.pone.0026828.g001</p></div>
<p>&nbsp;</p>
<p>Pretty clear picture: those papers where the authors were willing to share data were less prone to statistical errors.</p>
<p>Concluding quote:</p>
<blockquote><p>In this sample of psychology papers, the authors&#8217; reluctance to share data was associated with more errors in reporting of statistical results and with relatively weaker evidence (against the null hypothesis). The documented errors are arguably the tip of the iceberg of potential errors and biases in statistical analyses and the reporting of statistical results. It is rather disconcerting that roughly 50% of published papers in psychology contain reporting errors <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3174372" target="_blank">[33]</a> and that the unwillingness to share data was most pronounced when the errors concerned statistical significance.</p></blockquote>
<p>Although note that Wicherts is very careful about drawing conclusions:</p>
<blockquote><p>Although our results are consistent with the notion that the reluctance to share data is generated by the author&#8217;s fear that reanalysis will expose errors and lead to opposing views on the results, our results are correlational in nature and so they are open to alternative interpretations. Although the two groups of papers are similar in terms of research fields and designs, it is possible that they differ in other regards. Notably, statistically rigorous researchers may archive their data better and may be more attentive towards statistical power than less statistically rigorous researchers. If so, more statistically rigorous researchers will more promptly share their data, conduct more powerful tests, and so report lower p-values. However, a check of the cell sizes in both categories of papers (see Text S2) did not suggest that statistical power was systematically higher in studies from which data were shared.</p></blockquote>
<p>&nbsp;</p>
<p>In fact, Wicherts also wrote a <a href="http://www.nature.com/news/psychology-must-learn-a-lesson-from-fraud-case-1.9513" target="_blank">piece in <em>Nature</em></a> where he argued that sharing data can help avoid fraud, such as in the recent <a href="http://www.nature.com/news/2011/111101/full/479015a.html" target="_blank">infamous case of Diederik Stapel</a>, a highly regarded psychologist at Tilburg University in the Netherlands.</p>
<p><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828">http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828</a></p>
<hr/>
<p>3. <strong>Toilet paper. </strong>A study of surfaces of public restrooms has shown that they are covered with bacteria, mainly the kind that is known to live on and in humans. So now we have a somewhat broader view of the species living in restrooms, including the uncultured ones.</p>
<p>Two interesting quotes from the paper:</p>
<blockquote><p>Although many of the source-tracking results evident from the restroom surfaces sampled here are somewhat obvious, this may not always be the case in other environments or locations.</p></blockquote>
<p>Not sure about this bit: if the sources here are obvious, then is this paper a proof-of-concept?</p>
<p>Also:</p>
<blockquote><p>Unfortunately, previous studies have documented that college students (who are likely the most frequent users of the studied restrooms) are not always the most diligent of hand-washers.</p></blockquote>
<p>No shit! (Pun intended).</p>
<p>Concluding quote:</p>
<blockquote><p>Although the methods used here did not provide the degree of phylogenetic resolution to directly identify likely pathogens, the prevalence of gut and skin-associated bacteria throughout the restrooms we surveyed is concerning since enteropathogens or pathogens commonly found on skin (e.g. <em>Staphylococcus aureus</em>) could readily be transmitted between individuals by the touching of restroom surfaces.</p></blockquote>
<p>Translation:</p>
<p><a href="http://bytesizebio.net/wp-content/uploads/2011/12/washhands.jpg"><img class="alignnone size-full wp-image-5718" title="washhands" src="http://bytesizebio.net/wp-content/uploads/2011/12/washhands.jpg" alt="" width="342" height="477" /></a></p>
<p><a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028132">http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0028132</a></p>
<hr />
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.jtitle=PLoS+ONE&amp;rft_id=info%3Adoi%2F10.1371%2Fjournal.pone.0027561&amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;rft.atitle=Voluntary+Medical+Male+Circumcision%3A+A+Cross-Sectional+Study+Comparing+Circumcision+Self-Report+and+Physical+Examination+Findings+in+Lesotho&amp;rft.issn=1932-6203&amp;rft.date=2011&amp;rft.volume=6&amp;rft.issue=11&amp;rft.spage=0&amp;rft.epage=&amp;rft.artnum=http%3A%2F%2Fdx.plos.org%2F10.1371%2Fjournal.pone.0027561&amp;rft.au=Thomas%2C+A.&amp;rft.au=Tran%2C+B.&amp;rft.au=Cranston%2C+M.&amp;rft.au=Brown%2C+M.&amp;rft.au=Kumar%2C+R.&amp;rft.au=Tlelai%2C+M.&amp;rfe_dat=bpr3.included=1;bpr3.tags=Medicine%2CPsychology%2CHealth%2CEpidemiology%2C+Public+Health%2C+Human+Factors">Thomas, A., Tran, B., Cranston, M., Brown, M., Kumar, R., &amp; Tlelai, M. (2011). Voluntary Medical Male Circumcision: A Cross-Sectional Study Comparing Circumcision Self-Report and Physical Examination Findings in Lesotho <span style="font-style: italic;">PLoS ONE, 6</span> (11) DOI: <a href="http://dx.doi.org/10.1371/journal.pone.0027561" rev="review">10.1371/journal.pone.0027561</a></span></p>
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.jtitle=PLoS+ONE&amp;rft_id=info%3Adoi%2F10.1371%2Fjournal.pone.0026828&amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;rft.atitle=Willingness+to+Share+Research+Data+Is+Related+to+the+Strength+of+the+Evidence+and+the+Quality+of+Reporting+of+Statistical+Results&amp;rft.issn=1932-6203&amp;rft.date=2011&amp;rft.volume=6&amp;rft.issue=11&amp;rft.spage=0&amp;rft.epage=&amp;rft.artnum=http%3A%2F%2Fdx.plos.org%2F10.1371%2Fjournal.pone.0026828&amp;rft.au=Wicherts%2C+J.&amp;rft.au=Bakker%2C+M.&amp;rft.au=Molenaar%2C+D.&amp;rfe_dat=bpr3.included=1;bpr3.tags=Mathematics%2CPsychology%2CHuman+Factors%2C+Quantitative+Psychology%2C+Probability+and+Statistics">Wicherts, J., Bakker, M., &amp; Molenaar, D. (2011). Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results <span style="font-style: italic;">PLoS ONE, 6</span> (11) DOI: <a href="http://dx.doi.org/10.1371/journal.pone.0026828" rev="review">10.1371/journal.pone.0026828</a></span></p>
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.jtitle=PLoS+ONE&amp;rft_id=info%3Adoi%2F10.1371%2Fjournal.pone.0028132&amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;rft.atitle=Microbial+Biogeography+of+Public+Restroom+Surfaces&amp;rft.issn=1932-6203&amp;rft.date=2011&amp;rft.volume=6&amp;rft.issue=11&amp;rft.spage=0&amp;rft.epage=&amp;rft.artnum=http%3A%2F%2Fdx.plos.org%2F10.1371%2Fjournal.pone.0028132&amp;rft.au=Flores%2C+G.&amp;rft.au=Bates%2C+S.&amp;rft.au=Knights%2C+D.&amp;rft.au=Lauber%2C+C.&amp;rft.au=Stombaugh%2C+J.&amp;rft.au=Knight%2C+R.&amp;rft.au=Fierer%2C+N.&amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CMedicine%2CHealth%2CMicrobiology+%2C+Epidemiology%2C+Bioinformatics%2C+Metagenomics">Flores, G., Bates, S., Knights, D., Lauber, C., Stombaugh, J., Knight, R., &amp; Fierer, N. (2011). Microbial Biogeography of Public Restroom Surfaces <span style="font-style: italic;">PLoS ONE, 6</span> (11) DOI: <a href="http://dx.doi.org/10.1371/journal.pone.0028132" rev="review">10.1371/journal.pone.0028132</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2011/12/04/circumcision-preventing-fraud-and-icky-toilets-you-know-youre-going-to-read-this/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Short bioinformatics hacks: reading mate-pairs from a fastq file</title>
		<link>http://bytesizebio.net/index.php/2011/11/10/short-bioinformatics-hacks-reading-mate-pairs-from-a-fastq-file/</link>
		<comments>http://bytesizebio.net/index.php/2011/11/10/short-bioinformatics-hacks-reading-mate-pairs-from-a-fastq-file/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 15:55:15 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Biopython]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5618</guid>
		<description><![CDATA[If you have a merged file of paired-end reads, here is a quick way to read them using Biopython: from Bio import SeqIO from itertools import izip_longest # Loop over pairs of reads readiter = SeqIO.parse(open(inpath), "fastq") for rec1, rec2 in izip_longest(readiter, readiter): print rec1.id # do something with rec1 print rec2.id # do something [...]]]></description>
			<content:encoded><![CDATA[<p>If you have a merged file of paired-end reads, here is a quick way to read them using <a href="http://biopython.org">Biopython</a>:</p>
<pre class="brush:python">from Bio import SeqIO
from itertools import izip_longest
# Loop over pairs of reads
readiter = SeqIO.parse(open(inpath), "fastq")
for rec1, rec2 in izip_longest(readiter, readiter):
    print rec1.id  # do something with rec1
    print rec2.id  # do something with rec2
    # ... any further processing of the pair goes here
</pre>
<p>izip_longest is fed the same iterator, readiter, twice. However, readiter.next(), which advances the iterator, is called on the first argument and then on the second argument. Since next() is being called on the same iterator, successive records are yielded.</p>
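<p>To see the trick in isolation, here is a minimal sketch that does not need Biopython. It assumes Python 3, where izip_longest is called zip_longest; the helper name pairwise_records and the read names are made up for illustration:</p>

```python
from itertools import zip_longest


def pairwise_records(records):
    """Return an iterator over consecutive (first, second) pairs."""
    it = iter(records)
    # Both arguments are the SAME iterator object, so each tuple is built
    # by advancing it twice: once for the first slot, once for the second.
    return zip_longest(it, it)


# Hypothetical read names standing in for parsed fastq records
names = ["frag1/1", "frag1/2", "frag2/1", "frag2/2"]
pairs = list(pairwise_records(names))

# An odd number of items leaves the last pair padded with None
odd_pairs = list(pairwise_records(["a", "b", "c"]))
```

<p>Note the padding in the odd case: if the file is missing a mate at the end, the second record of the last pair will be None, so it is worth checking for that before using it.</p>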
<p>By &#8220;merged file&#8221; I mean a fastq file where the mate-pairs are one after the other, as in:</p>
<pre><strong>@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/1</strong>
GTTTGTTTTAATTTCAGTGATTCATCAATTTTAAAAAAAGATGAGAATAATAACTATTATAAAAAGATAAATAAATGTGAAATTTATATTTCAAATTCAA
+
@:DGBGDDD@GGGDGDGDDGD@GGGGE@GGG?EBGGGADDDDGEG4?3BA*::7:GEGGGG&gt;EDDDDAG@G&gt;&lt;ADDGBGGGGEGGGGDGGGFEGGGEFDE
<strong>@HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/2</strong>
AATGAATTGAATAAATATAAGAAGGATGATTAATAATAATTCTTGAATTTGAAATATAAATTTCACATTTATTTATCTTTTTATAATAGTTATTATTCTC
+
D?DB:@8EBDB&gt;GG:=&lt;DED79&gt;&gt;A8CEC8DGDGG8CEC&lt;BGGG+BAAEA@D&lt;2D71;:8AG&lt;ABBEEEEBEDC?C&gt;AACDDDCD&gt;AD&lt;@EFFDDDECBB
<strong>@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/1</strong>
CCTCAGTTAGCTTCTATTGGTATTAACATGGGTGAATTTACTAAACAATTTAATGACCAAACTAAAGATAAAAATGGTGAAGTTATACCTTGTATAATTA
+
GFGGGHHGHHHHHHGHHHHHGHHHHHHHFBGDBGEHHHHFHHEHHHHDFHCGFFFHHHHHHHGHHGGEBHEEFFCEE@E&gt;A&gt;&gt;8A@EBE@BBB&gt;BGEEDB
<strong>@HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/2</strong>
AACTGGAGTTGTTTTAATTTCAAAAGTAAAAGATTTATCTTTAAATGCTGTAATTATACAAGGTATAACTTCACCATTTTTATCTTTAGTTTGGTCATTA
+
IIIIIIIIIIGIIIDHHIIIIDIHD8CGGGGDADEIIIIIIIHIIGBGD&gt;DGDGGDGIGIIIIBGDG@GFHIIII&lt;C&lt;CCGHHHIHIBGDEEB3BEDEE@
</pre>
<p>The solution is derived from <a href="http://stackoverflow.com/questions/1657299/how-do-i-read-two-lines-from-a-file-at-a-time-using-python">this Stackoverflow entry</a>.</p>
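<p>Reading pairs this way trusts the file to be properly interleaved, so a quick sanity check on the read names may be worthwhile. The sketch below assumes the /1 and /2 suffix convention seen in the example above; same_fragment is a hypothetical helper, not part of Biopython:</p>

```python
def same_fragment(id1, id2):
    """True when two read ids differ only in their /1 and /2 suffixes."""
    # rpartition splits on the LAST slash, so slashes elsewhere are harmless
    base1, sep1, mate1 = id1.rpartition("/")
    base2, sep2, mate2 = id2.rpartition("/")
    has_suffixes = sep1 == "/" and sep2 == "/"
    return has_suffixes and base1 == base2 and (mate1, mate2) == ("1", "2")


# Read ids taken from the example file above
first = "HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/1"
second = "HWUSI-EAS687_112864999:8:1:1980:1055#CGAGAA/2"
other = "HWUSI-EAS687_112864999:8:1:2274:1058#CGAGAA/2"

ok = same_fragment(first, second)
mismatch = same_fragment(first, other)
```

<p>Inside the loop, something along these lines could be applied to rec1.id and rec2.id, raising an error as soon as a pair does not match.</p>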
<p>Of course, if the mate-pair files are not merged, you can use this script to merge them. It also illustrates using iterators from two different files in one <font type="Monospace12"><strong>for</strong></font> loop:</p>
<pre class="brush:python">
#!/usr/bin/env python
from Bio import SeqIO
import itertools
import sys
import os
def merge_fastq(fastq_path1, fastq_path2, outpath):
    # Interleave the records of two fastq files: mate 1, mate 2, mate 1, ...
    outfile = open(outpath,"w")
    fastq_iter1 = SeqIO.parse(open(fastq_path1),"fastq")
    fastq_iter2 = SeqIO.parse(open(fastq_path2),"fastq")
    # izip stops as soon as the shorter of the two files runs out
    for rec1, rec2 in itertools.izip(fastq_iter1, fastq_iter2):
        SeqIO.write([rec1,rec2], outfile, "fastq")
    outfile.close()

if __name__ == '__main__':
    outpath = "%s.merged.fastq" % os.path.splitext(sys.argv[1])[0]
    merge_fastq(sys.argv[1],sys.argv[2],outpath)
</pre>
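<p>For comparison, here is a rough sketch of the same interleaving idea without Biopython. It assumes Python 3 and that every fastq record takes exactly four lines (no wrapped sequence lines), and it swaps SeqIO for plain line grouping; read_records and interleave_fastq are names I made up for illustration:</p>

```python
from itertools import islice


def read_records(lines, lines_per_record=4):
    """Group an iterable of lines into lists of four lines, one per record."""
    it = iter(lines)
    while True:
        record = list(islice(it, lines_per_record))
        if not record:
            return
        yield record


def interleave_fastq(lines1, lines2):
    """Yield lines with records alternating: mate 1, mate 2, mate 1, ..."""
    # zip stops at the shorter input, like itertools.izip in Python 2
    for rec1, rec2 in zip(read_records(lines1), read_records(lines2)):
        for line in rec1 + rec2:
            yield line


# Tiny made-up inputs standing in for the two mate files
mates1 = ["@r1/1\n", "ACGT\n", "+\n", "IIII\n", "@r2/1\n", "TTTT\n", "+\n", "HHHH\n"]
mates2 = ["@r1/2\n", "CCCC\n", "+\n", "GGGG\n", "@r2/2\n", "AAAA\n", "+\n", "FFFF\n"]
merged = list(interleave_fastq(mates1, mates2))
```

<p>Open file objects are iterables of lines too, so the same function should work on two opened fastq files, with the result passed to the output file's writelines method. It does no validation of the records, which is the main thing SeqIO gives you on top of this.</p>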
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2011/11/10/short-bioinformatics-hacks-reading-mate-pairs-from-a-fastq-file/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Friedberg Lab is Recruiting Graduate Students</title>
		<link>http://bytesizebio.net/index.php/2011/10/18/the-friedberg-lab-is-recruiting-graduate-students/</link>
		<comments>http://bytesizebio.net/index.php/2011/10/18/the-friedberg-lab-is-recruiting-graduate-students/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 15:03:43 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Evolution]]></category>
		<category><![CDATA[Metagenomics]]></category>
		<category><![CDATA[Microbiology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Biopython]]></category>
		<category><![CDATA[graduate school]]></category>
		<category><![CDATA[jobs]]></category>
		<category><![CDATA[lab recruitment]]></category>
		<category><![CDATA[web tool]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5549</guid>
		<description><![CDATA[&#160; The Friedberg Lab is recruiting graduate students, for both Master&#8217;s and Ph.D. WE ARE:  A dynamic young lab  interested in gene, gene cluster and genome evolution, understanding microbial communities and microbe-host interactions by metagenomic analyses, developing algorithms for understanding gene cluster evolution, and prediction of protein function from protein sequence and structure. YOU ARE: [...]]]></description>
			<content:encoded><![CDATA[<p>&nbsp;</p>
<p>The Friedberg Lab is recruiting graduate students, for both Master&#8217;s and Ph.D.</p>
<p><strong>WE ARE</strong>:  A dynamic young lab  interested in gene, gene cluster and genome evolution, understanding microbial communities and microbe-host interactions by metagenomic analyses, developing algorithms for understanding gene cluster evolution, and prediction of protein function from protein sequence and structure.</p>
<p><strong>YOU ARE</strong>: an independent, hard-working, problem-solving, energetic and motivated scientist-to-be. You have graduated or are about to graduate in computer science and/or biology or related fields. The Friedberg Lab is a &#8220;dry&#8221; lab, so some programming skills are required (Python preferred).</p>
<p>Existing and planned projects include:</p>
<p>1. Computational protein function prediction and assessment of function prediction algorithms. The Friedberg Lab is among the leaders of the <a href="http://bytesizebio.net">Critical Assessment of Function Annotations</a> (CAFA), an international effort of dozens of research groups to assess and improve function prediction algorithms. We are looking for students that are excited about prediction of protein function from sequence and structure. Also, how well can we assess how well our algorithms are doing? The next CAFA meeting will take place in Berlin, July 2013 and the Friedberg Lab will play a central role in answering these questions.</p>
<p>2. <a href="http://en.wikipedia.org/wiki/Metagenomics" target="_blank">Metagenomics</a>: we are studying the interaction between the microbiome and the host using metagenomic and metatranscriptomic data. We are looking at how the human microbiome affects gene expression in the host. Together with Robb Chapkin&#8217;s lab at Texas A&amp;M we are analyzing microbial genomes and their effect on transcription in the human gut. We are also developing algorithms for context-based function prediction in metagenomic data. Simply put: how well can we predict the function of a gene from its neighbors? Since many of the genes in metagenomic data have no known homologs, we are developing creative ways to computationally discover their function.</p>
<p>3. <span style="text-decoration: underline;">Microbial Evolution</span>: we are researching the evolution of Mycoplasma, a bacterial genus which serves us as a model clade for understanding genome evolution. Mycoplasma have the smallest genomes of any organism, and being parasitic evolve quickly. Together with the Balish Lab we expect to sequence several new species and strains in the next year, and we are developing computational methods and a central community database for analyzing the Mycoplasma tree of life. Besides the biological aspect, <strong>this project is also a great opportunity to get into web programming, database design, and learn how to design and code community-based scientific software. </strong></p>
<p>4. <a href="http://biopython.org/" target="_blank">Biopython</a>: Biopython is a set of freely available tools for biological computation written in <a title="http://www.python.org" href="http://www.python.org/" rel="nofollow">Python</a> by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics. If you would like to become a Biopython developer, part of an international community of open-source scientific software developers, the Friedberg Lab is the place for you. This option is especially attractive for Master&#8217;s students seeking to enter bioinformatics in Industry.</p>
<p>5. Insert your brilliant idea here! I love new projects!</p>
<p>The lab is equipped with its own 10-node cluster computer, several workstations, and has access to <a href="http://www.units.muohio.edu/uit/research/high-performance-computing/redhawk-cluster">Miami University&#8217;s Supercomputing Center</a>, and the <a href="http://www.osc.edu/" target="_blank">Ohio Supercomputer Center</a> at Ohio State University.  Students have an excellent research environment, and many opportunities to collaborate with labs on and off campus.</p>
<p>Students can apply to the Friedberg Lab via the following graduate programs at Miami University:</p>
<p>1. <a href="http://microbiology.muohio.edu/grad/" target="_blank">Microbiology</a> (Master&#8217;s and PhD).</p>
<p>2. <a href="www.cas.muohio.edu/cmsb" target="_blank">Cell, Molecular and Structural Biology</a> (PhD only).</p>
<p>3. <a href="http://www.eas.muohio.edu/departments/cse/cse/" target="_blank">Computer Science</a> (Master&#8217;s only).</p>
<p>You are welcome and encouraged  to inquire further. I love talking with prospective students. If you would like to set up a phone/Skype chat please send your CV to:</p>
<p>friedberg.lab.jobs &#8216;at&#8217; gmail &#8216;dot&#8217; com</p>
<p>Looking forward to hearing from you.</p>
<p>&nbsp;</p>
<p><a href="http://iddo-friedberg.net" target="_blank">Iddo Friedberg</a>, PhD</p>
<p>Assistant Professor, Microbiology and Computer Science (affiliate)</p>
<p>Miami University</p>
<p>Oxford, OH, USA</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2011/10/18/the-friedberg-lab-is-recruiting-graduate-students/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Friday fun story: extreme bug hunting on MIRA</title>
		<link>http://bytesizebio.net/index.php/2011/09/02/5389/</link>
		<comments>http://bytesizebio.net/index.php/2011/09/02/5389/#comments</comments>
		<pubDate>Fri, 02 Sep 2011 20:16:31 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[Funny]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[MIRA]]></category>
		<category><![CDATA[short read sequencing]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5389</guid>
		<description><![CDATA[MIRA is a really cool sequence assembly software, developed and maintained by Bastien Chevreux. MIRA has a large and active community, led by the funny and gracious Bastien, for whom no problem is too small, or too large. Recently MIRA seemed to have developed a stochastic bug, one of those which are a serious headache [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://chevreux.org/projects_mira.html" target="_blank">MIRA</a> is a really cool sequence assembly software, developed and maintained by <a href="http://chevreux.org/" target="_blank">Bastien Chevreux</a>. MIRA has a large and active community, led by the funny and gracious Bastien, for whom no problem is too small, or too large.</p>
<p>Recently MIRA seemed to have developed a stochastic bug, one of those which are a serious headache to track down. Bastien called upon the MIRA community to help him. A couple of weeks ago, the &#8220;bug&#8221; was resolved to everyone&#8217;s relief. It was not a bug at all, but &#8230; well, I&#8217;ll let you read Bastien&#8217;s letter. Probably the funniest and geekiest error report I have seen since, well, ever. Reproduced here from the <a href="http://www.freelists.org/archive/mira_talk" target="_blank">mira_talk</a> email list with Bastien&#8217;s permission. <b>WARNING:</b> fairly geeky and fairly long. Not for everyone. But if you, like me, enjoy a good story about the travails of extreme bug hunting, I guarantee you will not be disappointed. (Because we have all been there, although personally I don&#8217;t recall encountering a problem <i>that</i> frustrating). Teaser: it was not a bug.</p>
<p><font face="Courier"><br />
Dear all,</p>
<p>my warmest thanks to the numerous people who all donated time and computing power to hunt down a &#8220;bug&#8221; (see http://www.freelists.org/post/mira_talk/Call-for-help-bughunting) which, in the end, turned out to be a RAM defect on my development machine.</p>
<p>This is the story of how the problem got nailed. It involves lots of hot electrons, a lot fewer electrons without spin which keel over, the end of a hunt for invisibugs of the imaginary sort, 454, mutants (but no zombies), Illumina, some spider monkeys, PacBio, a chat with Sherlock and, of course, an anthropomorphed star.</p>
<p>In short: don&#8217;t read if you&#8217;ve got more interesting things to do on a Friday morning or afternoon.</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br />
Life&#8217;s a rollercoaster and there are days &#8211; or weeks &#8211; where morale is on a pretty hefty ride: ups and downs in fast succession &#8230; and the occasional looping here and there.</p>
<p>Today was a day where I had &#8211; the first time ever &#8211; ups and downs occurring absolutely simultaneously. Something which is physically impossible, I know, but don&#8217;t tell any physicist or astronomer about that or else they&#8217;ll drag you into a lengthy discussion on how isochronicity is a myth by telling you stories about lightning, thunder and two poor sobs at the ends of a 300,000 km long train. But I digress &#8230;</p>
<p>So, my lowest low and highest high today were at 09:17 this morning when I was getting ready to leave for work (hey, it&#8217;s vacation time, almost everyone else is out and I can go a bit later than usual, right?). A few minutes earlier I had just told MIRA to run on the very same PacBio test set she had successfully worked on the night before to see how stable assemblies with this kind of data are (quite well so far, thank you for asking).</p>
<p>As I reached out to switch off the monitor and leave, MIRA suddenly came back with a warm and cosy little error message of the kind she has lately taken a mischievous pleasure in presenting. This time, she claimed there had been an illegal base in the FASTQ file.</p>
<p>&#8220;Hey, MIRA, wait a minute!&#8221; I thought. &#8220;Yesterday and tonight you ran on the very same data file with the very same parameters for two times three hours and even gave me back some nice assembly results. And now you claim that the INPUT data has errors?! Come on, you&#8217;re not serious, are you?&#8221;</p>
<p>As a side note: she then just gave me back &#8220;that look&#8221;, you know, the one with those big open eyes behind long, dark lashes and slightly flushed cheeks accompanied by pointed lips &#8230; as if she wanted to say &#8220;I *am* innocent and *I* did nothing wrong you disbeliever!&#8221; (http://24.media.tumblr.com/tumblr_lj3efmmDL01qasfhmo1_400.png). This usually announces a major pouting round of hers, something which I&#8217;m not looking forward to, I can tell you.</p>
<p>Two restarts later with the same negative result (MIRA can be quite stubborn at times) I had to give in and decided to sit down again and investigate the problem.</p>
<p>&#8220;So &#8230; read number 317301 at base position 246, eh? Let&#8217;s have a look.&#8221;</p>
<p>*clickedyclick*</p>
<p>&#8220;Read 317299, 317300 &#8230; 317301 &#8230; there we are.&#8221;</p>
<p>*hackedyhack*</p>
<p>&#8220;Base position 239, 240 &#8230; now: C G G G T C F A A &#8230; wait! What? &#8216;F&#8217; &#8230; &#8216;F&#8217;?!? It&#8217;s not even an IUPAC code. What&#8217;s a frakking &#8216;F&#8217; doing in the FASTQ input file?!&#8221; (CSFW: http://www.youtube.com/watch?v=r7KcpgQKo2I )</p>
<p>Indeed, it is not. Even more mysterious to me was the fact that just the night before it apparently had not been there. Or had it? I now was pretty unsure where this path would lead me, as if I had unlocked a door with the key of imagination. Beyond it: another dimension &#8211; a dimension of sound, a dimension of sight, a dimension of mind. I was moving into a land of both shadow and substance, of things and ideas. I just crossed over into &#8230; the Twilight Zone (&#8220;G#-A-G#-E-G#-A-G#-E&#8221; at 128 bpm, for more info see http://www.youtube.com/watch?v=zi6wNGwd84g).</p>
<p>Where was I? Ah, yes, the &#8216;F&#8217;.</p>
<p>So, how did that &#8216;F&#8217; appear in the FASTQ, and where had it been the night before? Out to town, ashamed of not being a nucleotide and getting a hangover without telling anyone up-front? Or did it surreptitiously sneak in from the outside, murdering an innocent base and taking its place in the hope no one would notice? I didn&#8217;t have the slightest clue, but I was determined to find that out.</p>
<p>First thing to check: the log files of the successful runs the previous night. MIRA&#8217;s very chatty at times and tidying up after her has always been a chore, but now was one of those occasions where not gagging her paid out as poking around the files she left behind proved to be interesting. Read 317301 showed the following at the position in doubt: &#8220;C G G G T C ___G___ A A&#8221; Without question: a &#8216;G&#8217;, and no &#8216;F&#8217; in sight!</p>
<p>So MIRA had been right and the &#8216;G&#8217; in the sequence of the file mysteriously mutated into an &#8216;F&#8217; overnight. I must admit that I had grown suspicious of her in the past few weeks as she had seemed to become uncooperative at times. In particular she had been screaming at me a couple of times during rehearsal of combined 454 and Illumina assemblies for the premiere of her new 3.4.0 show. She claimed that some uninvited spider monkeys (http://dict.leo.org/ende?search=Klammeraffe) had frightened her so much she refused to continue to work and simply scribbled the &#8216;@&#8217; sign all over her error messages. I had not been able to find out how those critters entered MIRA&#8217;s data and had even enrolled a few volunteers to rehearse different assemblies with MIRA &#8230; to no avail as she&#8217;d performed without flaws there.</p>
<p>While reconsidering all these things, something suddenly made *click*.</p>
<p>The character &#8216;G&#8217; has the hexadecimal ASCII table code 0x47 (or in 8-bit binary: 01000111). &#8216;F&#8217;, as preceding character of &#8216;G&#8217; and the table having some logic behind it, has the hex code 0x46, which is 01000110 in 8-bit binary.</p>
<p>The ATINSEQ-bug (@-in-seq) I had been desperately hunting in the past few weeks (and which had held up the release of MIRA 3.4.0) was due to the &#8220;@&#8221; character sometimes mysteriously appearing in sequences during the assembly of MIRA. The &#8216;@&#8217; sign in the ASCII table has the hex code 0x40 (binary: 01000000). In the ASCII table, there is one important character for DNA assembly which is very near to the &#8216;@&#8217; character &#8230;, so near that it is the successor of it: the &#8216;A&#8217; character. Hexadecimal 0x41, binary 01000001.</p>
<p>I had always thought that a bug in MIRA somehow corrupted the sequence, but what if &#8230; what if MIRA was actually really innocent?! I had never taken this possibility into account, as any other explanation would have seemed too far-fetched.</p>
<p>But now I had a similar effect *outside* of MIRA, in the Linux filesystem!</p>
<p>Filesystem&nbsp;&nbsp;&nbsp;&nbsp;MIRA<br />
G 01000111&nbsp;&nbsp;&nbsp;&nbsp;A 01000001<br />
F 01000110&nbsp;&nbsp;&nbsp;&nbsp;@ 01000000</p>
<p>In both cases the difference between the characters is exactly one bit, and it&#8217;s even at the same position (the last one in the byte) and changing in the same direction (from &#8216;1&#8217; to &#8216;0&#8217;).</p>
<p>I was now sure I was on to something: bit decay (http://en.wikipedia.org/wiki/Bit_rot)</p>
<p>But how could I prove it? Well, elementary my dear Watson: When you have eliminated the impossible, whatever remains, however improbable, must be the truth.</p>
<p>Suspects:<br />
- the problem is caused either by MIRA or one of the components of the<br />
computer: CPU, disk, disk/dma controller, RAM.</p>
<p>Facts:<br />
- an artefact was very sporadically observed during MIRA runs where sequences<br />
(containing lots of &#8216;A&#8217;) suddenly contained at least one &#8216;@&#8217;. This occurred<br />
after several passes, i.e., not on loading.<br />
- an artefact was observed in the Linux filesystem where a &#8216;G&#8217; mutated<br />
suddenly and overnight to an &#8216;F&#8217;.<br />
- both artefacts are based on one bit flipping, perhaps even in the same<br />
direction all the time.<br />
- when loading data, MIRA does not use mmap() to mirror data from disk, but<br />
physically creates a copy of that data.<br />
- MIRA loaded the data twice flawlessly before the artefact in the filesystem<br />
occurred.</p>
<p>Deduction 1:<br />
- MIRA is innocent. The artefact in the filesystem happened outside of the<br />
address space of MIRA and therefore outside her control. MIRA cannot be<br />
responsible as the Linux kernel would have prevented her from writing to<br />
some memory she was not allowed to.</p>
<p>Further facts:<br />
- the system MIRA ran on had 24 GiB RAM<br />
- even with a KDE desktop, KMail, Firefox, Emacs and a bunch of terminals<br />
open, there is still a lot of free RAM (some 22 to 23 GiB).<br />
- Linux uses free RAM to cache files</p>
<p>Deduction 2:<br />
- when loading the small FASTQ input file in the morning, Linux put it into<br />
the file cache in RAM. As MIRA almost immediately stopped without taking<br />
much memory, the file stayed in cache.</p>
<p>Further facts:<br />
- the drive with the FASTQ file is run in udma6 mode. That is, when loading<br />
data the controller moves the data directly from disk to RAM without going<br />
via the processor<br />
- subsequent &#8220;loading&#8221; of the same FASTQ into MIRA or a text viewer like &#8216;less&#8217;<br />
showed the &#8216;F&#8217; character always appearing at the same place.</p>
<p>Deduction 3:<br />
- the CPU is innocent! It did not touch the data while it was transferred from<br />
disk to RAM and afterwards it always shows the same data.<br />
- the disk and UDMA controllers are innocent! Some of the glitches observed in<br />
previous weeks occurred during runs of MIRA, inside the MIRA address space,<br />
long after initial loading, when UDMA had already finished its job.</p>
<p>From deductions 1, 2 &#038; 3 follows:<br />
- it&#8217;s not MIRA, not the CPU, nor the disk &#038; UDMA controller</p>
<p>Suspects left:<br />
- RAM<br />
- Disk</p>
<p>Well, that can be easily tested: shut down the computer, restart it and subsequently look at the file again. No file cache in RAM can survive that procedure. Yes, I know, there are some magic incantations one can chant to force Linux to flush all buffers and clear all caches, but in that situation I was somehow feeling conservative.</p>
<p>Lo and behold, after the above procedure the FASTQ file showed an all regular, good old nucleic acid &#8216;G&#8217; in the file again. No &#8216;F&#8217; to be seen anywhere.</p>
<p>Deduction 4:<br />
- the disk is innocent.</p>
<p>Deduction 5:<br />
- as all other components have been ruled out, the RAM is faulty.</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-</p>
<p>As I wrote: life&#8217;s a rollercoaster.</p>
<p>Up: MIRA is innocent! There, she&#8217;s giving me &#8220;that look&#8221; again and one would<br />
have to be blind to overlook the &#8220;told you so&#8221; she&#8217;s sending over with<br />
it.<br />
Down: My RAM&#8217;s broken and I need to replace it. Bought it only last May,<br />
should still be under guarantee, but still &#8230; time and effort.<br />
Up: I did not sell my old RAMs, so I can continue to work<br />
Down: 12 GiB feels soooooo tight after having had 24.<br />
Up: I can wrap up 3.4.0 end of this week with good conscience!<br />
Down: How the hell am I gonna tie all loose bits and pieces in the<br />
documentation in the next 24 to 48 hours?<br />
Looping: today MIRA again helped me at work to locate a mutation important for<br />
one of our Biotech groups. Boy, do I love sequencing and MIRA.</p>
<p>Have a nice Friday and a good week-end,<br />
Bastien</p>
<p>PS: while celebrating with MIRA tonight, I expressed my fear that some people<br />
might find it strange that I anthropomorphise her. They could think I went<br />
totally nuts or that I needed an extended vacation (which I do btw). She<br />
reassured me that no one would dare think I was insane &#8230; and if so,<br />
she would come over to their place and give them &#8220;that look.&#8221;</p>
<p>How utterly reassuring.</p>
<p></font><br />
</p>
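<p>For anyone who wants to check the arithmetic behind Bastien&#8217;s &#8220;click&#8221; moment, here is a minimal Python sketch. It is my own illustration, not part of the letter and not MIRA code; the helper name <code>flipped_bits</code> is hypothetical. It XORs the ASCII codes of the expected and the observed character and lists which bit positions differ.</p>

```python
def flipped_bits(good, bad):
    """Return the bit positions (0 = least significant) where two characters differ.

    Hypothetical helper for illustration only; it is not part of MIRA.
    """
    # XOR leaves a 1 exactly where the two ASCII codes disagree.
    diff = format(ord(good) ^ ord(bad), "08b")
    # Walk the bit string from its last character so that index 0 is the lowest bit.
    return [i for i, bit in enumerate(reversed(diff)) if bit == "1"]


# The two pairs from the letter: what should have been there, and what showed up.
for source, good, bad in [("filesystem", "G", "F"), ("MIRA", "A", "@")]:
    print(
        source,
        good, format(ord(good), "08b"),
        bad, format(ord(bad), "08b"),
        "flipped bits:", flipped_bits(good, bad),
    )
```

<p>Both pairs come back with a single flipped bit at position 0, the last bit of the byte, which is the pattern Bastien spotted. A genuinely different character pair, say &#8216;G&#8217; versus &#8216;A&#8217;, differs in more than one bit.</p>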
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2011/09/02/5389/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Postdoc positions available at Rutgers University</title>
		<link>http://bytesizebio.net/index.php/2011/08/31/postdoc-positions-available-at-rutgers-university/</link>
		<comments>http://bytesizebio.net/index.php/2011/08/31/postdoc-positions-available-at-rutgers-university/#comments</comments>
		<pubDate>Wed, 31 Aug 2011 15:01:14 +0000</pubDate>
		<dc:creator>Iddo</dc:creator>
				<category><![CDATA[Bioinformatics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[jobs]]></category>

		<guid isPermaLink="false">http://bytesizebio.net/?p=5369</guid>
		<description><![CDATA[Postdoctoral Research Scientist Rutgers University Joint Appointment: Institute of Marine and Coastal Sciences, BioMaPS and Dept. of Biochemistry and Microbiology Two 2-3 year Postdoctoral Research Scientist positions are available. We are looking for young scholars with experience in the areas of computational biology. In the scope of this project, we will uncover how the metal-containing [...]]]></description>
			<content:encoded><![CDATA[<h2>Postdoctoral Research Scientist</h2>
<h2>Rutgers University</h2>
<h6>Joint Appointment: Institute of Marine and Coastal Sciences, BioMaPS and<br />
Dept. of Biochemistry and Microbiology</h6>
<p>Two 2-3 year Postdoctoral Research Scientist positions are available.<br />
We are looking for young scholars with experience in the areas of<br />
computational biology. In the scope of this project, we will uncover how<br />
the metal-containing enzymes responsible for the critical electron<br />
transfer reactions that turn basic elements such as H, O, C, S, and N<br />
into biologically active molecules have evolved. The position will<br />
involve developing new sequence and/or structure based bioinformatic<br />
approaches to (1) mine available databases for proteins responsible for<br />
bio-catalyzed electron transfer reactions, (2) establish evolutionary<br />
relationships between extracted sequences and structures and (3)<br />
generate hypotheses for how the electron transfer circuitry arose and<br />
now functions. Candidates should have a PhD in Computational Biology or<br />
Bioinformatics. Candidates with degrees in related fields (e.g. biology,<br />
computer science) and possessing the necessary skill-sets are welcome to<br />
apply. We strongly encourage applications from recent PhD graduates.<br />
Strong programming skills (at least one of: Perl, Python, or Java) are<br />
essential for these positions, as well as some familiarity with the<br />
major bioinformatics tools and databases. Experience in machine<br />
learning algorithms is desired, but not required. Candidates should be<br />
fluent in spoken and written English and should be able to communicate<br />
ideas and results to colleagues from across the life sciences.<br />
The ability to integrate into a team is as essential as that to complete<br />
a project without constant supervision.</p>
<p>Interested persons should e-mail a cover letter and C.V. to:</p>
<p>Dr. Yana Bromberg,<br />
Dept. of Biochemistry and Microbiology,<br />
Rutgers University<br />
e-mail: yanab &#8216;at&#8217; rci &#8216;dot&#8217; rutgers &#8216;dot&#8217; edu</p>
]]></content:encoded>
			<wfw:commentRss>http://bytesizebio.net/index.php/2011/08/31/postdoc-positions-available-at-rutgers-university/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

