November 2003 Newsletter
- The SEED
A great deal has been happening. We have settled on "The SEED" as a name for the new system, and here is how things stand on SEED development:
- We have built a "SEED disk" for Mac OS X machines. We have built six copies so far; debugging and improvements are progressing very quickly.
- We have gotten our new dual-processor G5 server. It is a wonderful machine. We are using it for development and the SEED runs well on it. We will put up a public server as soon as we load it with completely current copies of all of the 140-150 published genomes.
- The SEED will be shown at Supercomputing 2003 (a big yearly show in the world of supercomputers) later this month. We will have two G5 Macs and two Linux machines (one running Postgres, and one running MySQL). As of today they support peer-to-peer (p2p) updating of code, annotations and assignments.
- By next week we plan to let users add their own genomes, and probably to acquire initial copies over the internet (we will not offer this service immediately; it will take at least a month to get the initial release ready).
- We have a problem getting data entered initially. We have found two or three groups who will help us: one by helping acquire and prepare published genomes, one that might help us call prokaryotic genes, and a third that might help us organize the eukaryotic data. I hesitate to announce names, since the commitments are not firm yet (I believe that everyone is waiting for us to organize our side of the collaborations). However, we do deeply appreciate the offers of help, and we believe that the pipelines will be flowing at high volume by the end of the year.
- The SEED is now actively in use by a large-scale sequencing effort. It will be used as the framework for analysis of genes and development of metabolic overviews. We have been approached by a number of other groups who plan to use the system if their grants go through. These are all important, since each such group will become a valuable source of criticism. There is a real difference between the demands of people who just occasionally use the system and those who use it on a daily basis. The speed of improvement will be directly proportional to the number of serious users.
- The system is being used to support a class at the University of Chicago. The real impact of this effort was to force a bit more organization in our development. Having a class of students dependent on the system is exposing bugs and shortcomings pretty quickly. FIG is actively planning to help develop course materials and to aid in sharing them. We feel that it is important that we build up a collection of web-based "lectures" and "assignments". Initially, we will focus on developing three 2-week "modules" that would be suitable for use in any number of biology courses. The modules will be:
- Searching for Missing Genes
- How to Annotate a Genome
- How to Annotate a Subsystem
So, the SEED now runs on Linux and Macs. The next month will be spent in getting installation procedures and instructions simplified, as well as loading the data for the initial release. Once the initial release occurs, we expect a very rapid propagation. If our ability to support peer-to-peer updates of code is solid, we will be able to handle bug fixes and new features without it being a big deal. If not, we will rapidly be overwhelmed. There is, of course, an ongoing tension between making things solid and adding new features; it always exists, and we are trying to make the right choices.
The "annotate a 1000 genomes" project is moving from the discussion stage to becoming a clear goal of FIG. The basic position may be summarized as follows:
- Think of a spreadsheet with each row corresponding to a genome, each column corresponding to a subsystem (e.g., "synthesis of histidine"), and each cell corresponding to a summary of how the given subsystem is implemented in the specified organism. The number of rows will grow at ever increasing speeds, but the number of columns (i.e., subsystems) is relatively constant.
- Most current annotation efforts focus on annotating a single genome or a small set of closely related genomes. The key to "high-throughput annotation" is not the pipeline approach that is widely employed, but rather the development of a detailed understanding of the variants of a specific subsystem, followed by consistent annotation of that subsystem through all existing genomes (i.e., annotation of columns, rather than rows).
- There are on the order of 80-120 subsystems that make up the "core machinery" of the cell. Here, we explicitly exclude a few key categories that should be treated separately (e.g., signalling, transport, and regulatory genes). The study of each of these subsystems needs to be done systematically.
- One of the best examples of how this should be done, in our opinion, was published in the September 2003 issue of MMBR. The paper, "Ancient Origin of the Tryptophan Operon and the Dynamics of Evolutionary Change" by Roy Jensen's team, is a vivid portrait of exactly what is needed for each of the key subsystems. We highly recommend it; it is an excellent review of the tryptophan biosynthesis subsystem, but even more importantly it is a precise template of what is needed for each subsystem.
- The speed with which an analysis of the sort discussed in the tryptophan review can be constructed is highly dependent on the tools the analyst uses. A good biologist might well also suggest that it depends on insight formed from years of work on a specific subsystem, but it seems clear to us that the availability of appropriate tools is a major issue. The right tools do not yet exist (at least in our view). FIG will participate actively in the development of these tools.
- FIG does not intend to limit its participation to tool building. Indeed, several of us are deeply interested in actively taking subsystems, performing the needed analysis, cleaning up the annotations, and producing the corresponding reviews.
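For readers who think in code, the spreadsheet picture above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not SEED code: the genome names, the `annotate_column` and `add_genome` helpers, and the variant labels are all made up for the example. It shows the idea of filling in one column (one subsystem) across every genome with a single consistent rule, while new rows (genomes) arrive empty.

```python
# Illustrative only: none of these names come from the SEED itself.
genomes = ["genome_A", "genome_B", "genome_C"]
subsystems = ["synthesis of histidine", "synthesis of tryptophan"]

# One row per genome, one column per subsystem; None marks an unannotated cell.
sheet = {g: {s: None for s in subsystems} for g in genomes}


def annotate_column(sheet, subsystem, classify):
    """Annotate one subsystem across every genome using the same rule."""
    for genome, row in sheet.items():
        row[subsystem] = classify(genome)


def add_genome(sheet, genome, subsystems):
    """Add a new row; its cells stay empty until each column is revisited."""
    sheet[genome] = {s: None for s in subsystems}


# Hypothetical knowledge about which variant of the subsystem each genome has.
histidine_variants = {"genome_A": "variant 1", "genome_B": "variant 2"}

# Column-wise annotation: one subsystem, all genomes.
annotate_column(
    sheet,
    "synthesis of histidine",
    lambda g: histidine_variants.get(g, "not found"),
)

# The number of rows keeps growing; the set of columns stays the same.
add_genome(sheet, "genome_D", subsystems)

for genome, row in sheet.items():
    print(genome, row)
```

The point of the sketch is the shape of the loop: the analyst's understanding of a subsystem lives in one place (here, the hypothetical `classify` rule) and is applied down a whole column, rather than being rediscovered genome by genome.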
The FIG team is growing rapidly (in the sense that a lot of interesting people are starting to closely collaborate on this strange effort). It is most encouraging.
More in a few weeks.
Take care,
The FIG Team


