Short Bioinformatics Hacks, ch. 1
In any programming gig, and that includes bioinformatics, a lot of repeat scriptology comes cropping up. I decided to share some of that, pro publico bono, and also because I hope to start some sort of ongoing cookbook for short bioinformatics hacks. If you have any cool short scripts you like to share, please email them to me and once I collected a few I will place them in a future post. The programing language does not matter, as long as you provide a full working code and license it under an open source license. I’m not saying I will publish everything you send me though.
Okay, let’s get started.
The hack I use most often is to find out how many sequences I have in a FASTA file. The FASTA file header should be one line, and always start with a “>”. therefore, counting the number of sequence records is as easy as:
% grep ">" mysequences.fa | wc -l
grep -c "^>" mysequences.fa (with thanks to Paulo Nuin for the simplification).
Simply using Linux’s built in wc or word count grep, with a “-l” “-c” for number of matching lines.
Using the same idea, here is a sequence length printer. The goal is to print the length of each sequence from a FASTA format file that contains many sequences:
1 % gawk '/>/ {$0=substr($0,2); sid=$1;next} \ 2 {sl[sid]+=length($0)} \ 3 END {for (i in sl) printf "%s\t%d\n",i,sl[i]}' \ 4 mysequences.fa > dist.csv
“gawk” is GNU-awk, the workhorse for text file manipulation before Perl and such came along. I still like using it for quick throwaway coding. A gawk program is composed of a series of rules , consisting of a pattern and an action in the following manner:
pattern1 {action1} pattern2 {action2} . . .
The code is executed on the entire input file (“mysequences.fa”), one line at a time. If the pattern matches the input line, then the appropriate action is taken. A rule can exist without a pattern, in which case the action is always executed.
Line 1 in the source code above shows a rule that checks for the “>” character. If it exists, that means the current line is a FASTA header line. gawk considers input lines to be records composed of fields. Fields are separated by whitespace characters, records by newlines or carriage-returns. The fields in any current record are named $1, $2, $3… and the entire record is referred to as $0. Using substr (substring) we chop off the leftmost character in the FASTA header, which is the ubiquitous “>”. (Note that in gawk, indexing starts at 1, not 0, so “>” is in position 1 in $0). We then use the first field in the modified FASTA header line as a sequence id key, sid. Finally, the command “next” tells gawk to move to the next input line, skipping all the other rules.
Moving to the next rule, the rule in line 2 only gets executed if we are already in a sequence line in FASTA. This rule has no pattern qualifying it, but if we have reached this rule, it means we cannot be in a header line, which leaves us in a sequence line. “sl” is an associative array, which uses the sequence ID generated in line 1 as its key. This rule simply increments the size of each entry in sl by the length of the current line. Thus eventually storing the sequence length for each sequence in an associative array, with the key being the sequence ID
After the entire FASTA file has been read, we would like print out a table with the sequence IDs and their lengths. This is where line #3 comes in: one special rule in gawk is the rule with the END pattern. This rule gets executed one time only, and only after the entire file has been read. The for loop goes through the associative array sl, and prints consisting of tab-separated fields: the first is the sequence ID, the second is its length.
You can now import dist.csv into your favorite spreadsheet program, or into R, and examine the distribution of sequence lengths.
A word of caution: for this to work, the first field in the FASTA file should be unique. This is according to the NCBI specifications of FASTA files, but some places do not follow this rule, so take care. Here is another gawk script to warn you if your FASTA file is breaking this rule:
gawk '/>/ {$0=substr($0,2); sl[$1]+=1} \ END {for (i in sl) \ {if (sl[i]>1) printf "WARNING. Non unique id %s\n",i}}' \ mysequences.fasta
Finally, in the same vein, here is a GC percentage calculator:
1 gawk '/>/ {$0=substr($0,2); sid=$1; next} \ 2 {sl[sid]+=length($0); gccount[sid]+=gsub(/[GgCc]/,"X",$0)} \ 3 END {for(i in gccount) printf "%s\t%.2f\n",i,gccount[i]/sl[i]}' \ 4 mysequences.fasta > gc_count.csv
The only major change here is in line 2. The gsub command replaces all the G, g, C and c characters with an X. That really does not matter to us, as this is done on the fly, and the Xs are not written into the the original FASTA file. But gsub also returns the number of replacements made, and that is actually how we count “G”s and “C”s (or “g”s and “c”s in case you decide to be case insensitive, like I decided here). The for loop in the END rule simply divides the GC count for each sequence by that sequence’s length.
This is about as complex as you want to go for command line throwaway scripts, especially in gawk which is prone to obfuscation. For anything more complex, it’s a good idea to start using a structured language, preferably with its own Bio* package. There are two reasons for that. First, even if you are a hot-shot coder, and you trust your h@x0r sk1llz with writing one-shot one-liners, this is not the place: no one is infallible. You need to keep a record in of the source code for future reproducibility. Also, in case something goes wrong down the data manipulation pipeline, and you need to backtrace the bug. A one-line throwaway is only as long- lived as the history cache of the shell in which it was written.
Next time I will do some longer stuff. Until then, if you like this piece, send me your scripts. They can be as short as the ones here, or longer, and in any language. Please test them before you send, and OSS license them.
Thanks for sharing – I know most bioinformaticians have many small hacks just like these. It would be nice if they were shared more widely. Perhaps gists at Github is a good way to do this?
grep “>” mysequences.fa | wc -l
can be rewritten as
grep -c “^>” mysequences.fa
Using ^ and not tells you if the fasta format is valid.
@Neil
Good idea. You want to get on it? You can start with the stuff from here, I’ll publicise it at the next posting.
@PN
D’oh! thanks. I’ve been using an alias to my more cumbersome version since i don’t know when… I updated the post with your more elegant version.
HI, Nice to see the code, I have been using the same trick. I wish to share another one to find the number of sequences in a fasta file. Open the file using microsoft word or other word processing application. Use ‘find and replace’ option there. Find > and replace all with >. then it will show how many replacement it did.
Chears..
riju aikkal
Bioinformatics centre, Indian Institute of Spices Research, India
Why not just use EMBOSS infoseq e.g.
$ infoseq -auto -stdout -only -length mysequences.fa > dist.csv
# And for GC
$ infoseq -auto -stdout -only -pgc mysequences.fa
And it should be a whole lot faster than gawk
@Zee
Why not indeed…. 🙂