Going to GOA, pt. 2: children of a lesser GO

By Iddo on January 1st, 2010

The source file associated with this post can be downloaded here.

The last time I talked about how to read a GOA gene_associations file into a Python dictionary data structure. Our goal was to find all genes that are annotated as hydrolases in the GOA gene_associations file. The tricky part is, most enzymes are not annotated with a general term such as “hydrolase_activity”, but with a more specific term, such “phosphatase_activity” or “tyrosine_phosphatase_activity”. “phosphatase_activity” can be viewed as a “private case” or in ontological terms a descendant of hydrolase_activity. “tyrosine_phosphatase_activity” is even more specific, being a descendant of “phosphatase_activity” and, by transitivity of “hydrolase_activity”. The bottom line is, we need to know all the descendants of our query term to get the properly annotated genes form the GOA gene_associations file.

To do that, we would need to access the GO database. You can import it from the Gene Ontology FTP site as a MySQL dump, and install it locally. This would probably make your query faster. However, you would need to worry about maintaining and regularly updating your database from GO. Instead, it might make sense to connect to one of the mirrors of the GO MySQL database, and query it remotely. This is what I will do here.

To do that, I will create a class that does two things: (1) opens (and closes) a MySQL connection to a GO mirror site and (2) reads the GOA gene_associations file. We saw how to read the file in the previous part. What I am doing here is simply rolling that code into a method (a function called by a class).

import UserDict
import MySQLdb

class GOA_ga(UserDict.UserDict):
    def __init__(self):
        self.data = {}

    def godb_open(self):
        self.dbcon = MySQLdb.connect(host="mysql.ebi.ac.uk", user="go_select",
                   passwd="amigo", db="go_latest", port=4085)
        self.dbcursor = self.dbcon.cursor()

    def godb_close(self):
        self.dbcursor.close()

    def read_gene_assoc(self,inpath):
        for inline in file(inpath):
            db, db_object_id, db_object_symbol, qualifier, go_id, \
            db_reference, evidence, withit, aspect, \
            db_object_name, synonym, db_object_type, \
            taxon_id, date, assigned_by = inline.strip().split('\t')
            key = db+":"+db_object_id
            self.data.setdefault(key,[]).append({'db':db,'db_object_id':db_object_id,
                                            'db_object_symbol': db_object_symbol,
                                            'qualifier': qualifier,
                                            'go_id': go_id,
                                            'db_reference': db_reference,
                                            'evidence': evidence,
                                            'with': withit,
                                            'aspect': aspect,
                                            'db_object_name': db_object_name,
                                            'synonym': synonym,
                                            'db_object_type': db_object_type,
                                            'taxon_id': taxon_id,
                                            'date': date,
                                            'assigned_by': assigned_by})

What we have here is a class with four methods. The class itself is an instantiation of the Python UserDict class. This means it is primarily a dictionary, and can be used as such. But there are other methods too.

Lines 5-6: The __init__ method describes what happens when the class is instantiated, and is actually superfluous here. I just like to put it as good programming practice. Instantiating GOA_ga created an empty dictionary. That’s it.

Lines 8-11: The second method will open a MySQL connection to a remote server. dbcon and dbcursor will be used to access the database from Python, as we shall see soon. I import the Python MySQLdb module which contains functions to access MySQL through Python. You can get it from SourceForge, and it is also in the repositories of most Linux distributions. To install in Ubuntu Linux simply type at the command-line:

sudo aptitude install python-mysqldb

Lines 13-14: The go_dbclose method simply closes the database connection. It is good practice to do so after all queries are done.

Lines 16-36: this is basically the code to read the GOA gene_associations file that we saw in part 1.
The file go_associations.py contains this code (and more below). You can download it, import it, and run the following on your GOA gene association file of choice, using the Python shell. I chose arabidopsis in this example.

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA

So what we have is a class that lets us read GOA gene_associations file, and also query the GOA MySQL database.

Remember the original goal: get all the records associated with a certain function, or its children. The MySQL code for that is:

SELECT
  rchild.acc
FROM
  term AS rchild, term AS ancestor, graph_path
WHERE
  graph_path.term2_id = rchild.id and
  graph_path.term1_id = ancestor.id and
  ancestor.acc = 'GO:0016787'

This mysql query will return a list of all the GO accession codes that are the descendants of the go accession code “GO:0016187”. How do you know which GO accession code belongs to which term? You can use the keyword search feature in AmiGO. By the way, I pulled this query from the GO Example Queries page on the GO site. There are many useful queries there!

So how do we insert this into Python code? Here it is:

def find_go_descendants(goa_ga, parent_go_id):
   find_descendants_go = """
   SELECT
     rchild.acc
   FROM
     term AS rchild, term AS ancestor, graph_path
   WHERE
     graph_path.term2_id = rchild.id and
     graph_path.term1_id = ancestor.id and
     ancestor.acc = '%s'
   """
   go_children = {}
   found_genes = {}
   # query GO database for all the child terms of the parent_go_id
   goa_ga.dbcursor.execute(find_descendants_go % parent_go_id)
   rpl = goa_ga.dbcursor.fetchall()
   # Flatten the list. Put it in dictionary keys for faster searching.
   for i in rpl:
      go_children[i[0]] = None
   for gene_id in goa_ga.keys():
      for goa_rec in goa_ga[gene_id]:
         if goa_rec['go_id'] in go_children.keys():
            found_genes.setdefault(gene_id,[]).append(goa_rec)
   return found_genes

Lines 2-11 are the templated MySQL query to get all the descendants of the provided GO access number.
Line 12: go_children is a dictionary whose keys are all the GO accession numbers that are the descendants of our query. The values are null. It is faster to search a dictionary keys than a list.
Line 13: found_genes is a GOA_ga dictionary which will contain the found go_associations entries. The keys are the UniProt IDs, and the values are GOA_ga dictionary entries
Lines 15-16: Now we start working. First, we query the GO MySQL database for all the descendant GO terms of go_id. The returned value is a tuple of tuples which looks something like this:

(('GO:0012345',), ('GO:0034567',),('GO:003617',),....)

That is simply how MySQLdb’s fetchall method works: each MySQL record fetched is a tuple. Since we are fetching only one field with our query, each tuple element is We now need to flatten this tuple — remove the superfluous parentheses. Lines 18-19 do that. They also put the descendant GO terms a dictionary keys structure, rather than a list structure. A dictionary keys structure is faster to search as we do repeatedly in line 23.

Line 20 is the outer loop of the search. We loop over all UniProt gene IDs in the goa_associations file we read into our goa_ga data structure.
Line 21 is the inner loop. Each Uniprot ID may appear in more than one record, as the protein it represents may be associated with more than one GO term.

Lines 22-23: if the associated GO term is in our list of descendant, add that GOA record to our found_genes dictionary, and key it with the proper UniProt ID.

Line 24: return the dictionary of all GOA records that are annotated as the descendants of the query term, or as the query term itself.

Putting it all together:

>>> import go_associations as ga
>>> my_goa = ga.GOA_ga() # create a class instantiation
>>> my_goa.read_gene_assoc("gene_association.goa_arabidopsis")  # read all GOA
>>> my_goa.godb_open()     # open a connection to the GO database
>>> all_hydrolases  = ga.find_go_descendants(my_goa,"GO:0016787") # find the descendants of "hydrolase_activity"

all_hydrolases now contains the GOA records of all the arabidopsis proteins annotated as hydrolases. How many are there?

>>> len(all_hydrolases)
3617
>>>

(This may vary when the latest version of this file changes. GOA is regularly updated). Some genes may be annotated as hydrolases more than once. This might either be an annotation error, as the is annotated twice with the same function, or the protein might be promiscuous with more than one function, or it may have two domains, each with a specific function. To find out all the hydrolase annotations:

>>> n=0
>>> for i in all_hydrolases:
...      n += len(all_hydrolases[i])
>>> print n
5753
>>>

How many of those annotations have a good evidence code?

>>> n=0
>>> for i in all_hydrolases:
...     for j in all_hydrolases[i]:
...             if  j['evidence'] in ('EXP','IDA','IPI','IMP','IGI','IEP'): n += 1
...
>>> print n
416
>>>

416 out of 5763: just over 7%. Keep that in mind next time you delve into an annotation database. (Exercise: find out how many enzymes have been annotated with good evidence in arabidopsis, or in another GOA-annotated organism).

Have fun GOing, and getting there!

Share and Enjoy: