GOA, the Gene Ontology Annotation, provide Gene Ontology annotation to proteins in UniProt. It also provides GO annotations to several genome projects: Chicken, Arabidopsis, Fly, Human, Mouse, Rat and Cow. Anyone working on any of those genomes, or on UniProt and is interested in annotation, would most likely need to query GOA once in a while. Here I will show some to read and interrogate GOA files.
Each GOA-annotated genome has two files associated with it. One is gene_association.* and the other is *.xrefs. So for example, the Mouse genome has the two files gene_association.goa_mouse and mouse.xrefs
The gene_association file is a tab-delimited file,with each row containing a gene and its GO associations. Each gene may have more than a single GO annotation, so multiple rows for the same gene exist. The fields are tab-delimited.
def read_goa_associations(inpath):
goa_records = {}
for inline in file(inpath):
# Read a single GOA record
db, db_object_id, db_object_symbol, qualifier, go_id, \
db_reference, evidence, withit, aspect, \
db_object_name, synonym, db_object_type, \
taxon_id, date, assigned_by = inline.strip().split('\t')
# The dictionary key is a concatenation of the gene-id and the database
key = db+":"+db_object_id
goa_records.setdefault(key,[]).append({'db':db,
'db_object_id':db_object_id,
'db_object_symbol': db_object_symbol,
'qualifier': qualifier,
'go_id': go_id,
'db_reference': db_reference,
'evidence': evidence,
'with': withit,
'aspect': aspect,
'db_object_name': db_object_name,
'synonym': synonym,
'db_object_type': db_object_type,
'taxon_id': taxon_id,
'date': date,
'assigned_by': assigned_by})
return goa_records
Let’s break this one down. The records will be stored in the dictionary goa_records, which is initialized in line 2. We loop through the lines in in the files in line 3. Lines 5-8 are broken down physically, but they form a single command line to parse the 15 tab-delimited fields.
Now, as I mentioned, there may be more than one GO association with a single gene. So the value of each dictionary entry is actually a list which may contain one or more GO associations. Line 12 shows us how to do that. The format:
dict.setdefault(dakey,[]).append(davalue)
tells us that for dictionary dict, if it does not have the key dakey then that key is added, and the value associated with it is an empty list []. Otherwise, if the dictionary already has the key dakey, then the value davalue is appended to the list. This one-liner allows us to add multiple values associated with one key.
In the command stretching from line 12 to line 26, we add a value which is actually a dictionary by itself. Each key in the dictionary is a field of the gene_associations records. Each value is the value read in that field in lines 5-8.
So what have we got now in goa_records?
Let’s try running this on an Arabidopsis gene_associations file. Download the file:
gene_association.goa_arabidopsis.51.gz
unzip it:
gunzip gene_association.goa_arabidopsis.51.gz
this is is the Arabidopsis GOA from 6-OCT-2009. It has 109339 lines which are 22778 annotated genes. How do I know that? easy:
Number of lines uniqueified by field #2, which is the gene id.
gawk 'BEGIN {FS="\t"} {print $1":"$2}' gene_association.goa_arabidopsis.51 | sort | uniq | wc -l
Also, download the code file, and call it gene_assoc.py
Now run the code, Open a Python shell:
$ python
>>> import gene_assoc as ga
>>> goa_arabid = ga.read_goa_associations("gene_association.goa_arabidopsis.51")
>>>
go_arabid now contains the records. We can do a few simple statistics. For example, how many genes are annotated as hydrolases? For that, we need to know that the hydrolase_activity GO accession number is “GO:0016787”. We can now ask our question in the Python shell:
>>> n=0
>>> for i in goa_arabid:
... for j in goa_arabid[i]:
... if j['go_id'] == "GO:0016787": n+=1
...
>>> print n
975
OK, this may seem a bit silly. We could have just as easily used a gawk one-liner to search for all the lines containing “GO:0016787”. Why go into all the trouble of a Python data structure? The answer is, we are just getting started.
When genes are annotated, they are annotated by different strengths of evidence. Many genes are annotated simply by sequence similarity to other annotated genes. This is fair evidence, but not as good as, say, experimental evidence. When curators annotate genes, they also add an “evidence code” that tells s how good is the evidence that the assigned GO term is actually true. Let’s find how many hydrolases we have whose annotations were inferred by experimental evidence. See the Guide to GO evidence codes for a full description.
Experimental Evidence Codes
EXP: Inferred from Experiment
IDA: Inferred from Direct Assay
IPI: Inferred from Physical Interaction
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IEP: Inferred from Expression Pattern
Computational Analysis Evidence Codes
ISS: Inferred from Sequence or Structural Similarity
ISO: Inferred from Sequence Orthology
ISA: Inferred from Sequence Alignment
ISM: Inferred from Sequence Model
IGC: Inferred from Genomic Context
RCA: inferred from Reviewed Computational Analysis
Author Statement Evidence Codes
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement
Curator Statement Evidence Codes
IC: Inferred by Curator
ND: No biological Data available
Automatically-assigned Evidence Codes
IEA: Inferred from Electronic Annotation
Obsolete Evidence Codes
NR: Not Recorded
>>> n=0
>>> for i in goa_arabid:
... for j in goa_arabid[i]:
... if j['go_id'] == "GO:0016787" and j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print n
1
Uh, oh. Seems like only one gene was inferred to be a hydrolase by experimental evidence?
Note another problem: we only checked for genes annotated with “hydrolase_activity” GO term. But a hydrolase is a very generic term. This does not mean that other genes are not hydrolases. Hydrolases are a very large enzymatic class that includes many different enzymes: any enzyme that uses water to break a covalent bond, basically. For a complete list of hydrolase subclasses in GO see: here. Note that there are more GO terms under the fold (just click on the nodes annotated with ‘+’). So we are definitely not looking at all the enzymes that have hydrolase activity, only those that are annotated with the words “hydrolase_activity”. For example, all phosphatases are hydrolases, but if a protein is annotated with the GO term “phosphatase_activity”, it won’t show up on our search. How do we handle that problem?
Wait for the next installment.