Friday fun story: extreme bug hunting on MIRA

MIRA is a really cool sequence assembly software, developed and maintained by Bastien Chevreux. MIRA has a large and active community, led by the funny and gracious Bastien, for whom no problem is too small, or too large.

Recently MIRA seemed to have developed a stochastic bug, one of those which are a serious headache to track down. Bastein called upon the MIRA community to help him. A couple of weeks ago, the “bug” was resolved to everyone relief. It was not a bug at all, but … well, I’ll let you read Bastien’s letter. Probably th funniest and geekiest error report I have seen since, well, ever. Reproduced here from the mira_talk email list with Bastien’s permission. WARNING: fairly geeky and fairly long. Not for everyone. But if you, like me, enjoy a good story travails of extreme bug hunting, I guarantee you will not be disappointed. (Because we have all been there, although personally I don’t recall encountering a problem that frustrating). Teaser: it was not a bug.


Dear all,

my warmest thanks to the numerous people who all donated time and computing power to hunt down a “bug” (see http://www.freelists.org/post/mira_talk/Call-for-help-bughunting) which. in the end, turned out to be a RAM defect on my development machine.

This is the story on how the problem got nailed. It involves lots of hot electrons, a lot less electrons without spin which keel over, the end of a hunt for invisibugs of the imaginary sort, 454, mutants (but no zombies), Illumina, some spider monkeys, PacBio, a chat with Sherlock and, of course, an anthropomorphed star.

In short: don’t read if you’ve got more interesting things to do on a Friday morning or afternoon.

—————————————————————————-
Life’s a rollercoaster and there are days – or weeks – where morale is on a pretty hefty ride: ups and downs in fast succession … and the occasional looping here and there.

Today was a day where I had – the first time ever – ups and downs occuring absolutely simultaneously. Something which is physically impossible, I know, but don’t tell any physicist or astronomist about that or else they’ll embark you in a lengthy discussion on how isochronicity is a myth by telling you stories on lightning, thunder and two poor sobs at the ends of a 300,000 km long train. But I digress …

So, my lowest low and highest high today were at 09:17 this morning when I prepared leaving for work (hey, it’s vacation time, almost everyone else is out and I can go a bit later than usual, right?). A few minutes earlier I had just told MIRA to run on the very same PacBio test set she had successfully worked on the night before to see how stable assemblies with this kind of data are (quite well so far, thank you for asking).

Reaching out to switch off the monitor and leave, MIRA suddenly came back with a warm and cosy little error message which she’s taken the habit lately to have a mischievous pleasure to present. This time, she claimed there had been an illegal base in the FASTQ file.

“Hey, MIRA, wait a minute!” I thought. “Yesterday and tonight you ran on the very same data file with the very same parameters for two times three hours and even gave me back some nice assembly results. And now you claim that the INPUT data has errors?! Come on, you’re not serious, are you?”

As a side note: she then just gave me back “that look”, you know, the one with those big open eyes behind by long, dark lashes and slightly flushed cheeks accompanied by pointed lips … as if she wanted to say “I *am* innocent and *I* did no nothing wrong you disbeliever!” (http://24.media.tumblr.com/tumblr_lj3efmmDL01qasfhmo1_400.png). This usually announces a major pouting round of hers, something which I’m not looking forward to, I can tell you.

Two restarts later with the same negative result (MIRA can be quite stubborn at times) I had to give in and decided to sit down again and investigate the problem.

“So … read number 317301 at base position 246, eh? Let’s have a look.”

*clickedyclick*

“Read 317299, 317300 … 317301 … there we are.”

*hackedyhack*

“Base position 239, 240 … now: C G G G T C F A A … wait! What? ‘F’ … ‘F’?!? It’s not even an IUPAC code. What’s a frakking ‘F’ doing in the FASTQ input file?! (CSFW: http://www.youtube.com/watch?v=r7KcpgQKo2I )

Indeed, it is not. Even more mysterious to me was the fact that just the night before it apparently had not been there. Or had it? I now was pretty unsure where this path would lead me, as if I had unlocked a door with the key of imagination. Beyond it: another dimension – a dimension of sound, a dimension of sight, a dimension of mind. I was moving into a land of both shadow and substance, of things and ideas. I just crossed over into … the Twilight Zone (“G#-A-G#-E-G#-A-G#-E” at 128 bpm, for more info see http://www.youtube.com/watch?v=zi6wNGwd84g).

Where was I? Ah, yes, the ‘F’.

So, how did that ‘F’ appear in the FASTQ, and where had it been the night before? Out to town, ashamed of not being a nucleotide and getting a hangover without telling anyone up-front? Or did it subreptitiously sneak in from the outside, murdering an innnocent base and taking its place in hope no one would note? I didn’t have the slightest clue, but I was determined to find that out.

First thing to check: the log files of the successful runs the previous night. MIRA’s very chatty at times and tidying up after her has always been a chore, but now was one of those occasions where not gagging her paid out as poking around the files she left behind proved to be interesting. Read 317301 showed the following at the position in doubt: “C G G G T C ___G___ A A” Without question: a ‘G’, and no ‘F’ in sight!

So MIRA had been right and the ‘G’ in the sequence of the file mysteriously mutated into an ‘F’ overnight. I must admit that I had grown suspicious of her in the past few weeks as she had seemed to become uncooperative at times. In particular she had been screaming at me a couple of times during rehearsal of combined 454 and Illumina assemblies for the premiere of her new 3.4.0 show. She claimed that some uninvited spider monkeys (http://dict.leo.org/ende?search=Klammeraffe) had frightened her so much she refused to continue to work and simply scribbled the ‘@’ sign all over her error messages. I had not been able to find out how those critters entered MIRA’s data and had even enrolled a few volunteers to rehearse different assemblies with MIRA … to no avail as she’d performed without flaws there.

While reconsidering all these things, something suddenly made *click*.

The character ‘G’ has the hexadecimal ASCII table code 0x47 (or in 8-bit binary: 01000111). ‘F’, as preceding character of ‘G’ and the table having some logic behind it, has the hex code 0x46, which is 01000110 in 8-bit binary.

The ATINSEQ-bug (@-in-seq) I had been desperately hunting in the past few weeks (and which had held up the release of MIRA 3.4.0) was due to the “@” character sometimes mysteriously appearing in sequences during the assembly of MIRA. The ‘@’ sign in the ASCII table has the hex code 0x40 (binary: 01000000). In the ASCII table, there is one important character for DNA assembly which is very near to the ‘@’ character …, so near that it is the successor of it: the ‘A’ character. Hexadecimal 0x41, binary 01000001.

I had always thought that a bug in MIRA somehow corrupted the sequence, but what if … what if MIRA was actually really innocent?! I had never taken this possibility into account as this other explanation attempts would have seemed to far stretched.

But now I had a similar effect *outside* of MIRA, in the Linux filesystem!

Filesystem MIRA
G 01000111 A 01000001
F 01000110 @ 01000000

The difference between the characters is in both cases exactly 1 bit which changes, and it’s even at the same position (last one in a byte) and changing into the same direction (from ‘1’ to ‘0’.

I was now sure I was on to something: bit decay (http://en.wikipedia.org/wiki/Bit_rot)

But how could I prove it? Well, elementary my dear Watson: When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

Suspects:
– the problem is caused either by MIRA or one of the components of the
comeputer: CPU, disk, disk/dma controller, RAM.

Facts:
– an artefact was very sporadically observed during MIRA runs where sequences
(containing lot’s of ‘A’) suddenly contained at least one ‘@’. This occured
after several passes, i.e., not on loading.
– an artefact was observed in the Linux filesystem where a ‘G’ mutated
suddenly and overnight to a ‘F’.
– both artefacts are based on one bit flipping, perhaps even to the same
direction all the time.
– when loading data, MIRA does not use mmap() to mirror data from disk, but
physically creates a copy of that data.
– MIRA loaded the data twice flawlessly before the artefact in the filesystem
occured.

Deduction 1:
– MIRA is innocent. The artefact in the filesystem happened outside of the
address space of MIRA and therefore outside her control. MIRA cannot be
responsible as the Linux kernel would have prevented her from writing to
some memory she was not allowed to.

Further facts:
– the system MIRA ran on had 24 GiB RAM
– even with a KDE desktop, KMail, Firefox, Emacs and a bunch of terminals
open, there is still a lot of free RAM (some 22 to 23).
– Linux uses free RAM to cache files

Deduction 2:
– when loading the small FASTQ input file in the morning, Linux put it into
the file cache in RAM. As MIRA almost immediately stopped without taking
much memory, the file stayed in cache.

Further facts:
– the drive with the FASTQ file is run in udma6 mode. That is, when loading
data the controller moves the data directly from disk to RAM without going
via the processor
– subsequent “loading” of the same FASTQ into MIRA or text viewer like ‘less’
showed the ‘F’ character always appearing at the same place.

Deduction 3:
– the CPU is innocent! It did not touch the data while it was transferred from
disk to RAM and it afterwards shows always the same data.
– the disk and UDMA controllers are innocent! Some of the glitches observed in
previous weeks occured during runs of MIRA, inside the MIRA address space,
long after initial loading, when UDMA had already finished their job.

From deductions 1, 2 & 3 follows:
– it’s not MIRA, not the CPU, nor the disk & UDMA controller

Suspects left:
– RAM
– Disk

Well, that can be easily tested: shut down the computer, restart it and subsequently look at the file again. No file cache in RAM can survive that procedure. Yes, I know, there are some magic incantations one can chant to force Linux to flush all buffers and clear all caches, but in that situation I was somehow feeling conservative.

Low and behold, after the above procedure the FASTQ file showed an all regular, good old nucleic acid ‘G’ in the file again. No ‘F’ to be seen anywhere.

Deduction 4:
– the disk is innocent.

Deduction 5:
– as all other components have been ruled out, the RAM is faulty.

—————————————————————————-

As I wrote: life’s a rollercoaster.

Up: MIRA is innocent! There, she’s giving me “that look” again and one would
have to be blind to oversee the “told you so” she’s sending over with
it.
Down: My RAM’s broken and I need to replace it. Bought it only last May,
should still be under guarantee, but still … time and effort.
Up: I did not sell my old RAMs, so I can continue to work
Down: 12 GiB feels soooooo tight after having had 24.
Up: I can wrap up 3.4.0 end of this week with good conscience!
Down: How the hell am I gonna tie all loose bits and pieces in the
documentation in the next 24 to 48 hours?
Looping: today MIRA again helped me at work to locate a mutation important for
one of our Biotech groups. Boy, do I love sequencing and MIRA.

Have a nice Friday and a good week-end,
Bastien

PS: while celebrating with MIRA tonight, I expressed my fear that some people
might find it strange that I anthropomorphise her. They could think I went
totally nuts or that I needed an extended vacation (which I do btw). She
reassured me that no one would dare thinking I were insane … and if so,
she would come over to their place and give them “that look.”

How utterly reassuring.


Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

Comments are closed.