Total waste of time, ep. 1

Warning: frivolously geeky and technical post, best described as “science methodology esoterica”, from which you can learn absolutely nothing useful. If you don’t get what’s going on, it’s probably for the best, because this is a complete waste of time.

Specific Aim 1: Find the longest English word composed only of letters from the 20-letter protein alphabet.

Method: I like gawk for quick & dirty text processing:

gawk 'BEGIN {daword="a"} \
/[BbJjOoUuXx]/ {next} \
length($1) > length(daword) {daword=$1} \
END  {print daword}' /usr/share/dict/web2

acetylphenylhydrazine

OK, this kinda sucks. I want a real word in English, not a chemical portmanteau. Let’s see what a top 10 list looks like:

gawk 'BEGIN {for(i=1;i<=10;i++) daword[i]="a"} \
/[BbJjOoUuXx]/ {next} \
{for (i in daword) {if (length($1) > length(daword[i])) {daword[i]=$1;break}}} \
END  {for (i=1;i<=10;i++) print length(daword[i]), daword[i]}' \
/usr/share/dict/web2 | sort -nr

And the result:

21 pentamethylenediamine
21 acetylphenylhydrazine
20 paraphenylenediamine
20 metaphenylenediamine
20 interparenthetically
19 transcendentalistic
19 semiantiministerial
19 platymesaticephalic
19 peripachymeningitis
19 misapprehensiveness

Interparenthetically. How lovely if you do your bioinformatics in Lisp.
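
The top-10 one-liner above overwrites the first slot that holds a shorter word, and awk does not guarantee the traversal order of `for (i in daword)`, so it can drop words that belong in the list. Below is a minimal sketch of a stricter variant, assuming a POSIX awk, sort, and head: it prints every qualifying word with its length and lets sort do the ranking. The function name `longest_words` and its two parameters are hypothetical. It also whitelists the 20 standard amino acid letters instead of blacklisting B, J, O, U, and X, which means Z (an ambiguity code, not one of the 20) is rejected too, so its output may differ from the list above.

```shell
# Hypothetical helper: print the N longest words read from stdin that use
# only letters from ALLOWED (default: the 20 standard amino acid codes).
# Usage: longest_words [N] [ALLOWED] < wordlist
longest_words() {
    n="${1:-10}"
    allowed="${2:-ACDEFGHIKLMNPQRSTVWY}"
    awk -v ok="$allowed" '
        # A word is rejected if it contains any character outside ALLOWED.
        BEGIN { reject = "[^" toupper(ok) "]" }
        # Skip blank lines and rejected words (comparison is case-insensitive).
        $1 == "" || toupper($1) ~ reject { next }
        # Emit "length word" so that sort can rank by length.
        { print length($1), $1 }
    ' | LC_ALL=C sort -k1,1nr -k2,2 | head -n "$n"
}
```

Something like `longest_words 10 < /usr/share/dict/web2` would rank the protein-alphabet words, and passing `ACGT` as the second argument restricts the search to the DNA alphabet. Sorting the whole filtered list is slower than keeping ten slots, but for a dictionary-sized file the difference should be negligible, and ties are broken alphabetically so the output is reproducible.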

Specific Aim 2: Let’s BLAST this

Method: NCBI TBLASTN:

>emb|CAK04910.1| novel protein similar to vertebrate Hermansky-Pudlak syndrome
3 (HPS3) [Danio rerio]
Length=1041

 GENE ID: 563666 LOC563666 | similar to LOC398456 protein [Danio rerio]

 Score = 30.3 bits (64),  Expect =    22
 Identities = 9/10 (90%), Positives = 10/10 (100%), Gaps = 0/10 (0%)

Query  2    NTERPARENT  11
            NTERPAR+NT
Sbjct  505  NTERPARKNT  514

>ref|XP_664219.1| hypothetical protein AN6615.2 [Aspergillus nidulans FGSC A4]
 sp|Q5AYL5.1|SEC16_EMENI  RecName: Full=COPII coat assembly protein sec16; AltName: Full=Protein
transport protein sec16
 gb|EAA58144.1| hypothetical protein AN6615.2 [Aspergillus nidulans FGSC A4]
Length=1947

 GENE ID: 2870538 AN6615.2 | hypothetical protein [Aspergillus nidulans FGSC A4]
(10 or fewer PubMed links)

 Score = 30.3 bits (64),  Expect =    22
 Identities = 10/13 (76%), Positives = 10/13 (76%), Gaps = 0/13 (0%)

Query  1   INTERPARENTHE  13
           INTE PARE T E
Sbjct  61  INTESPAREETAE  73

>ref|XP_001707965.1| hypothetical protein [Giardia lamblia ATCC 50803]
 gb|EDO80291.1| Hypothetical protein GL50803_14341 [Giardia lamblia ATCC 50803]
Length=247

 GENE ID: 5700874 GL50803_14341 | hypothetical protein
[Giardia lamblia ATCC 50803] (10 or fewer PubMed links)

 Score = 30.3 bits (64),  Expect =    22
 Identities = 9/12 (75%), Positives = 10/12 (83%), Gaps = 0/12 (0%)

Query  4    ERPARENTHETI  15
            ER ARE THE+I
Sbjct  221  EREAREKTHESI  232

>ref|YP_002191813.1| conserved hypothetical protein [Streptomyces clavuligerus ATCC
27064]
 gb|EDY50943.1| conserved hypothetical protein [Streptomyces clavuligerus ATCC
27064]
Length=565

 GENE ID: 6836469 SSCG_04068 | hypothetical protein
[Streptomyces clavuligerus ATCC 27064]

 Score = 29.5 bits (62),  Expect =    39
 Identities = 12/20 (60%), Positives = 13/20 (65%), Gaps = 1/20 (5%)

Query  1    INTERPARENTHETICALLY  20
            I  ERP R +T E I ALLY
Sbjct  219  ITAERPQRTDT-EAIGALLY  237

Interesting, but the E-values are not significant: an Expect of 22 or 39 means that many hits scoring this well are expected by chance alone. PSI-BLAST, and BLASTP against metagenomic sequences in CAMERA, both came up with zip.

Conclusion: I totally wasted my time doing this, and yours reading this. Therefore, I need more funding to check the other words on the list.

2 Responses to “Total waste of time, ep. 1”

  1. widdowquinn says:

    Have you tried asking Wolfram Alpha? 😉

  2. Iddo says:

    W|A only does DNA sequences, exact matches, and the human genome. The longest word in my Linux dictionary composed of the DNA alphabet is TATTA: “A bamboo frame or trellis hung at a door or window of a house, over which water is suffered to trickle, in order to moisten and cool the air as it enters”. The next word is TACT. Both are shorter than the word size of 7 that BLAST uses for DNA. So they are not on W|A, nor are they BLASTable.