Short bioinformatics hacks pt. 3: more FASTA counting

A few one-liners to kick off the workweek:

To order a set of fasta files by the number of sequences each one contains. If anyone knows how to put a tab as the output delimiter, please let us know:

grep -c ">" fasta-files/*.fna | cut --fields=1,2 -d ":" --output-delimiter="  " | sort -k 2 -nr | less

The same, but for the total length of sequences. My gawk-fu is somewhat stronger than my shell-fu, so this is mostly in gawk. But you can cut and paste this line directly into your shell line, or shell script file, and it should work. If anybody knows how to do this as a pure shell script, please comment.

[sourcecode language=”plain” gutter=”false”]
gawk ‘ />/ {next} {a[FILENAME]+=length($0)} END {for ( i in a ) print i "\t" a[i]}’ fasta-files/*.fna | sort -k 2 -nr | less
[/sourcecode]
This is a bit weird, but I needed it, and some of you doing multiple metagenomic analyses might too. This gives the total sequence length, and the average sequence length for each file, as a gawk one-liner:

gawk '/>/ {next} {a[FILENAME]+=length($0); b[FILENAME]++} END {for ( i in a ) print i "\t" a[i] "\t" a[i]/b[i]}' fasta-files/*.fna

Comments, gripes and improvements are welcome.

Broken shell with a bug. Credit: geograph.co.uk

Broken shell with a bug. Credit: geograph.co.uk

Share and Enjoy:
  • Fark
  • Digg
  • Technorati
  • del.icio.us
  • StumbleUpon
  • Facebook
  • Reddit
  • Twitter
  • FriendFeed
  • PDF
  • email
  • Print
  • Google Bookmarks

8 Responses to “Short bioinformatics hacks pt. 3: more FASTA counting”

  1. Neil says:

    If anyone knows how to put a tab as the output delimiter, please let us know

    Option 1. Between the quotes, type Ctrl-V, then the tab key.

    Option 2. –output-delimiter=$’\t’

    Thanks to Pierre and Brian at my blog post.

  2. Iddo says:

    Thanks Neil (& Pierre & Brian). Option 2 works, option 1 does not seem to work in tcsh, might be a bash only thing.

  3. Peter says:

    Hi Iddo,

    Depending on the rest of your shell command, having extra regular expression matches may or may not be a problem, but I would use “^>” rather than “>” as the regular expression (meaning a greater than sign at the start of a line). Consider this example:

    >Test length=4
    ACGT

    Peter

  4. Peter says:

    The HTML formatting spoilt that – let’s try again:
    >Test <this is made up> length=4
    ACGT

  5. Paul Gardner says:

    I’d use Sean Eddy’s ‘seqstat’ function, part of the squid package that ships with older versions of HMMer and Infernal. The successor in more recent versions is ‘esl-seqstat’:

    Format: FASTA
    Type (of 1st seq): RNA
    Number of sequences: 9
    Total # residues: 1028
    Smallest: 104
    Largest: 119
    Average length: 114.2

    It does a bit more validation on the files than you’re doing here. Just in case something has gone horribly wrong. ‘sreformat’ is another tool that I’ve found fantastically useful.

  6. Paul Gardner says:

    I’ve found Sean Eddy’s ‘seqstat’ to be incredibly useful for this sort of thing. It ships as part of squid with older versions of HMMer and Infernal. The successor is esl-seqstat:

    Format: FASTA
    Type (of 1st seq): RNA
    Number of sequences: 9
    Total # residues: 1028
    Smallest: 104
    Largest: 119
    Average length: 114.2

    It does a bit more validation on the files than what you’re doing. Just in case something has gone horribly wrong. ‘sreformat’ is also very useful.

  7. Paul Gardner says:

    Grrrr. Got an error with the first submission. Re-typed it. Only to find it worked the first time.

  8. Iddo says:

    @Paul Gardner
    Thanks Paul! Yes, there are always better tools around than simple one liners. I was thinking of the quick and dirty work I like to do on my netbook sometimes, which is not exactly loaded with software suites.