Short bioinformatics hacks pt. 3: more FASTA counting
A few one-liners to kick off the workweek:
To order a set of fasta files by the number of sequences each one contains. If anyone knows how to put a tab as the output delimiter, please let us know:
grep -c ">" fasta-files/*.fna | cut --fields=1,2 -d ":" --output-delimiter=" " | sort -k 2 -nr | less
The same, but for the total length of sequences. My gawk-fu is somewhat stronger than my shell-fu, so this is mostly in gawk. But you can cut and paste this line directly into your shell line, or shell script file, and it should work. If anybody knows how to do this as a pure shell script, please comment.
[sourcecode language=”plain” gutter=”false”]
gawk ‘ />/ {next} {a[FILENAME]+=length($0)} END {for ( i in a ) print i "\t" a[i]}’ fasta-files/*.fna | sort -k 2 -nr | less
[/sourcecode]
This is a bit weird, but I needed it, and some of you doing multiple metagenomic analyses might too. This gives the total sequence length, and the average sequence length for each file, as a gawk one-liner:
gawk '/>/ {next} {a[FILENAME]+=length($0); b[FILENAME]++} END {for ( i in a ) print i "\t" a[i] "\t" a[i]/b[i]}' fasta-files/*.fna
Comments, gripes and improvements are welcome.
If anyone knows how to put a tab as the output delimiter, please let us know
Option 1. Between the quotes, type Ctrl-V, then the tab key.
Option 2. –output-delimiter=$’\t’
Thanks to Pierre and Brian at my blog post.
Thanks Neil (& Pierre & Brian). Option 2 works, option 1 does not seem to work in tcsh, might be a bash only thing.
Hi Iddo,
Depending on the rest of your shell command, having extra regular expression matches may or may not be a problem, but I would use “^>” rather than “>” as the regular expression (meaning a greater than sign at the start of a line). Consider this example:
>Test length=4
ACGT
Peter
The HTML formatting spoilt that – let’s try again:
>Test <this is made up> length=4
ACGT
I’d use Sean Eddy’s ‘seqstat’ function, part of the squid package that ships with older versions of HMMer and Infernal. The successor in more recent versions is ‘esl-seqstat’:
Format: FASTA
Type (of 1st seq): RNA
Number of sequences: 9
Total # residues: 1028
Smallest: 104
Largest: 119
Average length: 114.2
It does a bit more validation on the files than you’re doing here. Just in case something has gone horribly wrong. ‘sreformat’ is another tool that I’ve found fantastically useful.
I’ve found Sean Eddy’s ‘seqstat’ to be incredibly useful for this sort of thing. It ships as part of squid with older versions of HMMer and Infernal. The successor is esl-seqstat:
Format: FASTA
Type (of 1st seq): RNA
Number of sequences: 9
Total # residues: 1028
Smallest: 104
Largest: 119
Average length: 114.2
It does a bit more validation on the files than what you’re doing. Just in case something has gone horribly wrong. ‘sreformat’ is also very useful.
Grrrr. Got an error with the first submission. Re-typed it. Only to find it worked the first time.
@Paul Gardner
Thanks Paul! Yes, there are always better tools around than simple one liners. I was thinking of the quick and dirty work I like to do on my netbook sometimes, which is not exactly loaded with software suites.