December 2007

December 31, 2007

Reflections on 2007

I have recently been reflecting on the status of the Project to Annotate 1000 Genomes, and in this short essay I will argue that it has been an overwhelming success, for reasons that became apparent only as the project progressed. I am writing this on the last day of 2007, which certainly proved to be a remarkable year. A thousand more-or-less complete genomes now exist, a framework for rapidly annotating new genomes with remarkable accuracy is now functioning, and we are on the verge of another major shift in the world of annotations. Let me try to clarify these remarks before proceeding to the celebration this evening.

The Production of Accurate Annotations

The efforts required to establish a framework for high-volume, accurate annotation are substantial. I believe that it is important that we reflect on what we have learned about the factors that determine productivity. So, what have we learned from the project?

First, subsystem-based annotation is the key to accuracy. While numerous efforts still focus on annotating a single genome, two principles are widely accepted at this stage: that comparative analysis is the key to everything, and that accuracy comes from focusing on the variations of a single component of cellular machinery as they are manifested over the entire collection of existing genomes. Manual subsystem creation and maintenance is the rate-limiting component of successful annotation efforts, and the factors that constrain this process are at the heart of the matter. We have understood this for some time now.

I am going to argue a new position in this short essay:

  1. There are three distinct components that make up our strategy for rapid accurate annotation: subsystems-based annotation, FIGfams as a framework for propagating the subsystems annotations, and RAST as a technology for using FIGfams and subsystems to consistently propagate annotations to newly-sequenced genomes.
  2. These three components form a cycle (subsystems => FIGfams => RAST technology => subsystems). This cycle creates a feedback that rapidly accelerates the productivity achievable in all three components. Further, failure in any of these components impairs productivity dramatically in the others. Understanding this cycle will be the key to supporting higher productivity in subsystem maintenance and creation.
  3. To understand the dependencies, we need to consider each of the components:
    • The key to accurate FIGfam creation and maintenance is to couple it directly to subsystem maintenance. Now that the initial release of the FIGfams has been created, updates occur automatically based on changes in the subsystem collection. Thus, FIGfams are automatically split, merged and added as the subsystem collection is maintained. There remains one area of substantial cost in FIGfam development: the creation of family-dependent decision procedures, which are occasionally needed to achieve the required accuracy. At this point we have approximately 10,000 subsystem-based FIGfams, although the overall collection contains over 100,000 families (the majority containing only 2-3 members).
    • RAST has a central dependency on FIGfams for assertion of function to newly-recognized genes. In this sense, the main dependency of RAST is on the FIGfam collection. The more accurate the FIGfams and their associated decision procedures, the more accurate the assignments of function made to genes in genomes processed by RAST.
    • Finally, the central costs of maintaining subsystems are cleaning up errors in existing subsystems (often indicated by multiple genes having the same function) and adding new genomes to existing subsystems. Once a subsystem has reached an acceptable level of accuracy (and many are not there yet), the central cost is integration of new genomes after annotation by RAST. The speed with which new genomes can be added depends on how well RAST assigns gene function (and, secondarily, on how accurately these RAST-based annotations can be used to infer operational variants of subsystems).
    The main costs of increasing the speed and accuracy of annotations split into two categories: those relating to maintenance of existing subsystems, and those relating to generation of new subsystems. The maintenance costs are containable, if the cycle is established and functions smoothly. Otherwise, I suspect they inevitably grow rapidly.
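
The error signal mentioned above (multiple genes in one genome carrying the same function) is simple to check mechanically. Here is a minimal sketch in Python, assuming a hypothetical list of (genome, gene, functional role) tuples. The identifiers and the data layout are invented for illustration and are not the SEED's actual format.

```python
from collections import defaultdict


def find_duplicate_roles(assignments):
    """Return {(genome, role): [genes]} for every functional role that is
    assigned to more than one gene within the same genome."""
    genes_by_key = defaultdict(list)
    for genome, gene, role in assignments:
        genes_by_key[(genome, role)].append(gene)
    return {key: genes for key, genes in genes_by_key.items() if len(genes) > 1}


# Hypothetical example data (all IDs are made up).
assignments = [
    ("genome_A", "gene_1", "Threonine synthase"),
    ("genome_A", "gene_2", "Threonine synthase"),
    ("genome_A", "gene_3", "Homoserine kinase"),
    ("genome_B", "gene_4", "Threonine synthase"),
]

duplicates = find_duplicate_roles(assignments)
for (genome, role), genes in duplicates.items():
    print(f"{genome}: '{role}' assigned to {', '.join(genes)}")
```

A hit from a check like this is only a flag for an annotator to look at. Genuine duplicates do occur, so the output is a list of candidates rather than a list of errors.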

I have argued that progress toward rapid, accurate annotations is limited by the rate at which subsystems can be maintained and created. I place maintenance ahead of creation at this stage. As the collection grows (it now contains over 600 subsystems with over 6800 distinct functional roles), costs of maintenance will tend to dominate. The creation of new subsystems will always be a critical activity, but each new subsystem will impact smaller sets of genomes as we "move into the tail of the distribution".

The costs relating to subsystem maintenance, which will quickly dominate, depend critically on how smoothly the cycle I described functions. We have just established the complete cycle, which is arguably the major achievement of 2007.

The two central costs that cannot be avoided will be creation of FIGfam-dependent decision procedures and the creation of new subsystems. We are currently beginning focused efforts to increase annotator productivity relating to both activities. The required technology in these cases is less dramatic and relates to development of better tools (to support a set of possible decision procedures in the case of FIGfams, and to resolve inconsistencies in subsystem initiation/expansion).

More Effective Integration of Existing Annotation Efforts

In the section above, I reflected on the cycle that we shall depend upon for supporting increased volume and accuracy of our own efforts. Other groups are certainly experimenting with their own solutions, and in some cases with clear successes. I have no desire to rate these competing efforts. We have a new year coming, and I sincerely believe that cooperative activity is the key to enhanced achievements by everyone. However, effective cooperation is often elusive. I think that we have put in place an extremely important mechanism for making cooperation much easier, and the benefits more compelling.

Anyone working for one of the main annotation efforts realizes that it is not easy to really benefit from access to the annotation efforts of other groups. The efforts required to characterize discrepancies between local annotations and those produced externally often outweigh any benefits that result.

Two events of major importance have occurred:

  1. Both PIR and the SEED Project decided to build correspondences between IDs used by different annotation projects. The PIR effort produced BioThesaurus and the SEED effort produced the Annotation Clearing House. The fact that it will become trivial to reconcile IDs between the different annotation efforts will undoubtedly support rapid increases in cross-linking entries. The SEED is working with UniProt to cross-link proteins from all of our complete genomes, and I am sure similar efforts are happening between the other major annotation efforts.
  2. Within the Annotation Clearing House, a project to allow experts to assert that specific annotations are reliable (using whatever IDs they wish) has been initiated. This has led to many tens of thousands of assertions that specific annotations are highly reliable. PIR is preparing a list of assertions that they consider highly reliable, and both institutions are making these lists openly available.
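
To give a feel for what reconciling IDs involves, here is a minimal sketch in Python. It assumes the correspondences arrive as pairs of IDs asserted to name the same protein, and it groups them into sets of equivalent IDs with a small union-find. The ID strings are invented, and this is not a description of how BioThesaurus or the Annotation Clearing House is actually implemented.

```python
def group_equivalent_ids(id_pairs):
    """Group IDs into equivalence classes, given pairs of IDs that are
    asserted to refer to the same protein."""
    parent = {}

    def find(x):
        # Locate the representative of x's group, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in id_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


# Hypothetical correspondences between three annotation projects.
id_pairs = [
    ("projA:1", "projB:x"),
    ("projB:x", "projC:7"),
    ("projA:2", "projB:y"),
]

groups = group_equivalent_ids(id_pairs)
```

With the groups in hand, an annotation attached to any one ID can be compared against annotations attached to the other IDs in the same group.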

To see the utility of exchanging expert assertions in a framework in which it is easy to compare the results, let me describe how we intend to use these assertions:

  1. We begin with a 3-column table of reliable annotations containing [ProteinID,AssertedFunction,IDofExpert]
  2. We then use the protein IDs to look up our own annotations and construct a 2-column table [FIG-function,AssertedFunction]. This table gives a correspondence between each of our functional roles and the functional roles used by the expert making the assertion of reliability.
  3. Then, we go through this correspondence table (using both tools and manual inspection) and split it into one set in which we believe both columns are essentially identical and a second set that we believe represents errors (either our own or those of the expert asserting reliability). We anticipate that in most cases the expert assertion will be accurate, which is what makes this exercise so beneficial to ourselves.
  4. We take the table of "essentially the same" assertions and distribute it as a table of synonyms (which we consider to be a very useful resource).
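
The four steps above might be sketched as follows in Python. Everything here is an assumption made for illustration: the protein IDs, the function names, the `fig_functions` lookup, and above all the `normalize` comparison, which is a crude automated first pass standing in for the tools and manual inspection described in step 3.

```python
def normalize(function_name):
    """Crude normalization: lowercase and collapse runs of whitespace."""
    return " ".join(function_name.lower().split())


def build_correspondence(expert_rows, fig_functions):
    """Steps 1-2: pair our function with the expert-asserted function for
    every protein that appears in both sources."""
    pairs = set()
    for protein_id, asserted_function, _expert_id in expert_rows:
        fig_function = fig_functions.get(protein_id)
        if fig_function is not None:
            pairs.add((fig_function, asserted_function))
    return pairs


def split_pairs(pairs):
    """Step 3, first pass only: pairs that match after normalization go to
    the 'same' set; everything else is queued for manual inspection."""
    same = sorted(p for p in pairs if normalize(p[0]) == normalize(p[1]))
    review = sorted(p for p in pairs if normalize(p[0]) != normalize(p[1]))
    return same, review


# Hypothetical 3-column table [ProteinID, AssertedFunction, IDofExpert].
expert_rows = [
    ("prot1", "L-threonine synthase", "expertX"),
    ("prot2", "threonine  synthase", "expertX"),
    ("prot3", "Homoserine kinase", "expertY"),
    ("prot9", "Aspartate kinase", "expertY"),  # not among our proteins
]

# Hypothetical lookup from protein ID to our own functional role.
fig_functions = {
    "prot1": "Threonine synthase",
    "prot2": "Threonine synthase",
    "prot3": "Homoserine kinase",
}

pairs = build_correspondence(expert_rows, fig_functions)
same, review = split_pairs(pairs)
```

In this toy run the pair ("Threonine synthase", "L-threonine synthase") lands in the review queue even though a curator would probably call it a synonym. That is the point of the manual step: the useful entries in the synonym table of step 4 are mostly the ones a simple string comparison cannot settle.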

We are strongly motivated to resolve differences between our annotations and high-reliability assertions made by experts. The table of synonyms both reduces the effort needed to redo such a comparison in the future and is a major asset by itself. I am confident that any serious annotation group that participates will benefit, and I believe that these exchanges will accelerate in 2008 and 2009.

Summary

I have tried to explain why I think that 2007 was a watershed year. The creation of the subsystems => FIGfams => RAST => subsystems cycle was not precisely planned, but its achievement has made its value obvious. That, coupled with the growing tendency to cross-link and exchange reliable assertions, will lead to rapidly improving annotations over the next 2-3 years.

Although we now probably have over a thousand sequenced genomes, we have not yet integrated that many into the SEED (and I doubt that we would have access to that many right now). However, it seems very, very likely that we will have that many by sometime in 2008, and it also seems likely that we will be in a position to provide pretty decent initial annotations. I would anticipate completing the original project next year, and it is now time to plan the next stage.
