Short bioinformatics hacks pt. 3: more FASTA counting

By Iddo on August 30th, 2009

A few one-liners to kick off the workweek:

To order a set of fasta files by the number of sequences each one contains. If anyone knows how to put a tab as the output delimiter, please let us know:

grep -c ">" fasta-files/*.fna | cut --fields=1,2 -d ":" --output-delimiter="  " | sort -k 2 -nr | less

The same, but for the total length of sequences. My gawk-fu is somewhat stronger than my shell-fu, so this is mostly in gawk. But you can cut and paste this line directly into your shell line, or shell script file, and it should work. If anybody knows how to do this as a pure shell script, please comment.

[sourcecode language=”plain” gutter=”false”]
gawk ‘ />/ {next} {a[FILENAME]+=length($0)} END {for ( i in a ) print i "\t" a[i]}’ fasta-files/*.fna | sort -k 2 -nr | less
[/sourcecode]
This is a bit weird, but I needed it, and some of you doing multiple metagenomic analyses might too. This gives the total sequence length, and the average sequence length for each file, as a gawk one-liner:

gawk '/>/ {next} {a[FILENAME]+=length($0); b[FILENAME]++} END {for ( i in a ) print i "\t" a[i] "\t" a[i]/b[i]}' fasta-files/*.fna

Comments, gripes and improvements are welcome.

Broken shell with a bug. Credit: geograph.co.uk

Share and Enjoy:

Categorized under: Bioinformatics.
Tagged with: no tags.

8 Responses to “Short bioinformatics hacks pt. 3: more FASTA counting”

Neil says:

30-August-2009 at 6:17 PM

If anyone knows how to put a tab as the output delimiter, please let us know

Option 1. Between the quotes, type Ctrl-V, then the tab key.

Option 2. –output-delimiter=$’\t’

Thanks to Pierre and Brian at my blog post.
Iddo says:

30-August-2009 at 7:01 PM

Thanks Neil (& Pierre & Brian). Option 2 works, option 1 does not seem to work in tcsh, might be a bash only thing.
Peter says:

31-August-2009 at 5:37 AM

Hi Iddo,

Depending on the rest of your shell command, having extra regular expression matches may or may not be a problem, but I would use “^>” rather than “>” as the regular expression (meaning a greater than sign at the start of a line). Consider this example:

>Test length=4
ACGT

Peter
Peter says:

31-August-2009 at 6:01 AM

The HTML formatting spoilt that – let’s try again:
>Test <this is made up> length=4
ACGT
Paul Gardner says:

31-August-2009 at 10:24 AM

I’d use Sean Eddy’s ‘seqstat’ function, part of the squid package that ships with older versions of HMMer and Infernal. The successor in more recent versions is ‘esl-seqstat’:

Format: FASTA
Type (of 1st seq): RNA
Number of sequences: 9
Total # residues: 1028
Smallest: 104
Largest: 119
Average length: 114.2

It does a bit more validation on the files than you’re doing here. Just in case something has gone horribly wrong. ‘sreformat’ is another tool that I’ve found fantastically useful.
Paul Gardner says:

31-August-2009 at 10:28 AM

I’ve found Sean Eddy’s ‘seqstat’ to be incredibly useful for this sort of thing. It ships as part of squid with older versions of HMMer and Infernal. The successor is esl-seqstat:

Format: FASTA
Type (of 1st seq): RNA
Number of sequences: 9
Total # residues: 1028
Smallest: 104
Largest: 119
Average length: 114.2

It does a bit more validation on the files than what you’re doing. Just in case something has gone horribly wrong. ‘sreformat’ is also very useful.
Paul Gardner says:

31-August-2009 at 10:30 AM

Grrrr. Got an error with the first submission. Re-typed it. Only to find it worked the first time.
Iddo says:

31-August-2009 at 12:33 PM

@Paul Gardner
Thanks Paul! Yes, there are always better tools around than simple one liners. I was thinking of the quick and dirty work I like to do on my netbook sometimes, which is not exactly loaded with software suites.

Byte Size Biology

The musings and ravings of a computational biologist about science, computers, music and, you know, stuff

Short bioinformatics hacks pt. 3: more FASTA counting

8 Responses to “Short bioinformatics hacks pt. 3: more FASTA counting”

Categories

Tags

Recent Posts

Recent Comments

Other stuff I read

Science blogs I like to read

Twitter