Total waste of time, ep. 1
Warning: frivolously geeky and technical post, which can be best defined as “science methodology esoterica”, and from which you can learn absolutely nothing useful. If you don’t get what’s going on, then it’s probably for the best, because this is a complete waste of time.
Specific Aim 1: find the longest word in English composed of the Protein 20-letter alphabet.
Method: I like gawk for quick & dirty text processing:
gawk 'BEGIN {daword="a"}
/[BbJjOoUuXx]/ {next}
length($1) > length(daword) {daword=$1}
END {print daword}' /usr/share/dict/web2
acetylphenylhydrazine
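For anyone who would rather not read awk, here is a minimal Python sketch of the same filter. It is not the gawk one-liner verbatim: the function name `longest_protein_word` is made up for illustration, and it checks each word against the 20 standard one-letter amino acid codes directly. That also rejects Z (an ambiguity code, not one of the 20), which the regex above lets through, as well as anything containing hyphens or apostrophes.

```python
# The 20 standard one-letter amino acid codes (no B, J, O, U, X or Z).
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def longest_protein_word(words):
    """Return the first longest word spelled only with amino acid letters.

    `words` is any iterable of strings, e.g. an open dictionary file.
    Hypothetical helper, written for illustration only.
    """
    best = ""
    for word in words:
        word = word.strip()
        # Subset test: every letter of the word must be in the alphabet.
        if word and set(word.upper()) <= PROTEIN_ALPHABET and len(word) > len(best):
            best = word
    return best


# Tiny in-memory demo; swap in open("/usr/share/dict/web2") to scan a real
# dictionary (path assumed to exist on your system).
sample = ["box", "cat", "transcendentalistic", "acetylphenylhydrazine"]
print(longest_protein_word(sample))
```

With the stricter alphabet, the hydrazine entry drops out of the running because of its Z.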
OK, this kinda sucks. I want a real word in English, not a chemical portmanteau. Let’s see what a top 10 list looks like:
gawk 'BEGIN {for (i=1; i<=10; i++) daword[i]="a"}
/[BbJjOoUuXx]/ {next}
{for (i in daword) {if (length($1) > length(daword[i])) {daword[i]=$1; break}}}
END {for (i=1; i<=10; i++) print length(daword[i]), daword[i]}' \
/usr/share/dict/web2 | sort -nr
And the result:
21 pentamethylenediamine
21 acetylphenylhydrazine
20 paraphenylenediamine
20 metaphenylenediamine
20 interparenthetically
19 transcendentalistic
19 semiantiministerial
19 platymesaticephalic
19 peripachymeningitis
19 misapprehensiveness
Interparenthetically. How lovely if you do your bioinformatics in Lisp.
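A hedged aside on the top 10: the gawk loop overwrites the first slot it finds that holds a shorter word, so it may evict a long word while a shorter one survives in another slot. The sketch below swaps in a different technique, `heapq.nlargest` from the Python standard library, which does keep the n longest candidates. The function name `top_protein_words` is hypothetical, and the alphabet is the strict 20-letter one (so Z is rejected, unlike in the regex above).

```python
import heapq

# The 20 standard one-letter amino acid codes (no B, J, O, U, X or Z).
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def top_protein_words(words, n=10):
    """Return the n longest words spelled only with amino acid letters.

    Hypothetical helper for illustration; `words` is any iterable of strings.
    """
    stripped = (w.strip() for w in words)
    valid = (w for w in stripped if w and set(w.upper()) <= PROTEIN_ALPHABET)
    # nlargest returns the results sorted from longest to shortest.
    return heapq.nlargest(n, valid, key=len)


# In-memory demo; pass an open dictionary file instead to scan a real word list.
for word in top_protein_words(["cat", "dog", "misapprehensiveness", "tact"], n=3):
    print(len(word), word)
```

Because the generator expressions are lazy, the whole dictionary never has to sit in memory at once; only the n current leaders do.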
Specific Aim 2: Let’s BLAST this
Method: NCBI TBLASTN:
> emb|CAK04910.1| novel protein similar to vertebrate Hermansky-Pudlak syndrome 3 (HPS3) [Danio rerio]
Length=1041
GENE ID: 563666 LOC563666 | similar to LOC398456 protein [Danio rerio]
Score = 30.3 bits (64), Expect = 22
Identities = 9/10 (90%), Positives = 10/10 (100%), Gaps = 0/10 (0%)
Query  2    NTERPARENT  11
            NTERPAR+NT
Sbjct  505  NTERPARKNT  514

> ref|XP_664219.1| hypothetical protein AN6615.2 [Aspergillus nidulans FGSC A4]
sp|Q5AYL5.1|SEC16_EMENI RecName: Full=COPII coat assembly protein sec16; AltName: Full=Protein transport protein sec16
gb|EAA58144.1| hypothetical protein AN6615.2 [Aspergillus nidulans FGSC A4]
Length=1947
GENE ID: 2870538 AN6615.2 | hypothetical protein [Aspergillus nidulans FGSC A4] (10 or fewer PubMed links)
Score = 30.3 bits (64), Expect = 22
Identities = 10/13 (76%), Positives = 10/13 (76%), Gaps = 0/13 (0%)
Query  1    INTERPARENTHE  13
            INTE PARE T E
Sbjct  61   INTESPAREETAE  73

> ref|XP_001707965.1| hypothetical protein [Giardia lamblia ATCC 50803]
gb|EDO80291.1| Hypothetical protein GL50803_14341 [Giardia lamblia ATCC 50803]
Length=247
GENE ID: 5700874 GL50803_14341 | hypothetical protein [Giardia lamblia ATCC 50803] (10 or fewer PubMed links)
Score = 30.3 bits (64), Expect = 22
Identities = 9/12 (75%), Positives = 10/12 (83%), Gaps = 0/12 (0%)
Query  4    ERPARENTHETI  15
            ER ARE THE+I
Sbjct  221  EREAREKTHESI  232

> ref|YP_002191813.1| conserved hypothetical protein [Streptomyces clavuligerus ATCC 27064]
gb|EDY50943.1| conserved hypothetical protein [Streptomyces clavuligerus ATCC 27064]
Length=565
GENE ID: 6836469 SSCG_04068 | hypothetical protein [Streptomyces clavuligerus ATCC 27064]
Score = 29.5 bits (62), Expect = 39
Identities = 12/20 (60%), Positives = 13/20 (65%), Gaps = 1/20 (5%)
Query  1    INTERPARENTHETICALLY  20
            I  ERP R +T E I ALLY
Sbjct  219  ITAERPQRTDT-EAIGALLY  237
Interesting, but the e-values are insignificant. PSI-BLAST and BLASTP against metagenomic sequences in CAMERA both came up with zip.
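For the curious, the "Identities" figures in the hits are just a count of alignment columns where query and subject carry the same residue. A minimal sketch, assuming the two aligned strings have equal length and use "-" for gaps; the function name `identities` is hypothetical. "Positives" would additionally need a substitution matrix, which is left out here.

```python
def identities(query, subject, gap="-"):
    """Count identical columns in a pairwise alignment.

    Returns (matches, alignment_length). Hypothetical helper for
    illustration; both strings must already be aligned to equal length.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A column counts only if both residues agree and neither is a gap.
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != gap)
    return matches, len(query)


# The Danio rerio hit from the output above.
matches, length = identities("NTERPARENT", "NTERPARKNT")
print(f"Identities = {matches}/{length}")  # prints "Identities = 9/10"
```

Feeding it the Streptomyces alignment, gap included, gives 12 matching columns out of 20, in line with the reported 12/20.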
Conclusion: I totally wasted my time doing this, and yours reading this. Therefore, I need more funding to check the other words on the list.
Have you tried asking Wolfram Alpha? 😉
W|A only does DNA sequences, exact matches, and the human genome. The longest word in my Linux dictionary composed of the DNA alphabet is TATTA: “A bamboo frame or trellis hung at a door or window of a house, over which water is suffered to trickle, in order to moisten and cool the air as it enters”. The next word is TACT. Both are shorter than the word size of 7 that BLAST uses for DNA, so they are not on W|A, nor are they BLASTable.
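The DNA version of the search is the same trick with a four-letter alphabet. A minimal sketch, with a hypothetical function name `dna_words`, and with the word size of 7 taken from the comment above rather than checked against any BLAST documentation:

```python
DNA_ALPHABET = set("ACGT")
WORD_SIZE = 7  # figure quoted in the comment above; treated as an assumption


def dna_words(words):
    """Return the words spelled only with A, C, G and T, longest first.

    Hypothetical helper for illustration; `words` is any iterable of strings.
    """
    stripped = (w.strip() for w in words)
    valid = [w for w in stripped if w and set(w.upper()) <= DNA_ALPHABET]
    return sorted(valid, key=len, reverse=True)


# In-memory demo; pass an open dictionary file to scan a real word list.
for word in dna_words(["tatta", "tact", "cat", "dog"]):
    status = "long enough" if len(word) >= WORD_SIZE else "too short"
    print(word.upper(), len(word), status)
```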