July 2003

July 15, 2003

July 2003 Newsletter

Comments on technical progress

Last month was fairly productive. We did not meet all of our objectives, but we did get a few key things accomplished, and our strategic goals are falling into place. Our first concern at this point must be to get the addition of organisms running smoothly. Since this is a computational burden for Argonne and a logistical problem for me (gathering genomes, parsing GenBank entries, and so on all take time), it must become scheduled and routine. My belief is that we need to be able to handle 15-20 organisms per month almost immediately.

Last month, Natalia Maltsev's group at Argonne National Lab succeeded in running similarities for our complete nonredundant protein db (about 1.8 million entries). This is a major accomplishment. It is the first step in constructing a pipeline in which new genomes are added on a monthly basis. We are now running more similarities for the first set of genomes to be added (13 new genomes came in while we were computing similarities for the original set). FIG and Argonne are establishing a regular schedule for adding genomes and computing the incremental similarities. We are close to having a smooth update cycle in which new genomes are constantly added to the FIG db. We are also going back and accumulating attributions for the data and annotations, which we will need before we can actually distribute things. It is all time-consuming, but important.
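To make the idea of "incremental similarities" concrete, here is a minimal sketch in Python. It is not the code running at Argonne. It assumes that "nonredundant" means exact sequence identity and that each update only needs to compare the newly added sequences against the existing set and against each other. The function names, the dictionary layout, and the choice of an MD5 digest as the key are all hypothetical, and the similarity tool itself is left out entirely.

```python
import hashlib


def add_to_nonredundant(db, proteins):
    """Add (protein_id, sequence) pairs to db, keyed by a digest of the sequence.

    Assumption: two entries are redundant only if their sequences are identical.
    Returns the keys that were not already present, i.e. the entries that
    still need similarities computed.
    """
    new_keys = []
    for protein_id, sequence in proteins:
        normalized = sequence.upper()
        key = hashlib.md5(normalized.encode("ascii")).hexdigest()
        if key not in db:
            db[key] = {"sequence": normalized, "ids": []}
            new_keys.append(key)
        # Every source id is kept, even when the sequence is a duplicate.
        db[key]["ids"].append(protein_id)
    return new_keys


def incremental_pairs(old_keys, new_keys):
    """Yield only the comparisons an update needs: new-vs-old and new-vs-new."""
    for i, new in enumerate(new_keys):
        for old in old_keys:
            yield (new, old)
        for other in new_keys[i + 1:]:
            yield (new, other)
```

Under these assumptions, the pairs already computed for the original set are never repeated; a monthly update costs roughly (new entries) x (all entries) comparisons rather than a full all-against-all run.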

As genomes pour into the FIG db, the issue of curating the function assignments naturally arises. We are taking the following position for now:

  1. We will build a relatively straightforward automated annotation tool to make reasonable, but far from optimal, assignments. This will allow us to continually gather new genomes, identify genes (for prokaryotes only at this stage), and make an initial set of assignments.
  2. We are actively moving ahead with the computation of new protein families, but this will take time. It represents a serious research effort. Natalia Maltsev's group at Argonne is approaching the same problem, and I suspect a number of other groups are, as well. We will need to start comparing results, exchanging insights and data, and so forth; it seems likely that this will be a relatively long-range effort that will eventually play a central role in FIG's future. For now, I think prudence dictates that we not commit to any deliverables. This means that a really good automated annotation system (which will probably depend on the availability of an evolving collection of protein families, alignments, and trees) cannot be accurately scheduled at this point.
  3. Within 4-6 months, FIG will deliver an initial browser/curation environment. This is the tool that people will need to curate the functions assigned to genes. I wish that we could deliver a version more quickly, but I think the end of the year is my best estimate of when it will become available.

Ross has been spending a significant percentage of his time working on the architecture for the new FIG system. He is pretty excited by the way it is going. It will be a radical departure from the WIT/ERGO design (at the internal level -- the design of the user interface in the form of the browser/curation tool remains unspecified at this point). We have now worked out the detailed design of what we call "the seed", which will be an interesting prototype. It will test a number of key design goals (the ability to install easily over the network, the ability to support incremental updates as "synchronization of independent installations", maintenance of access restrictions to subsets of the data, and so forth). We are not prepared to give a date for the original release of the seed; that will be a key objective, and we hope to be able to give details in the next newsletter.

Goals of FIG

There is a constant ongoing discussion relating to the objectives of FIG, both technical and financial. Here are some unstructured comments on the situation:

  1. There is no agreement in detail about the exact objectives of FIG. We are actively hoping for a collaboration between people with diverse interests. At times, it seems like we are formulating schedules of deliverables in response to a growing set of chaotic pressures. However, there are a number of positions that impose a more coherent overall picture. One key position that is emerging is, roughly, a commitment to support characterization of the gene pool. This will necessarily require:
    • development of detailed hierarchical decompositions for each of thousands of organisms,
    • encoding detailed reaction networks for these organisms, and
    • attaching accurate function assignments to the genes in these organisms.

    We have decided to call the hierarchical decompositions "informal metabolic reconstructions" to distinguish them from encodings of the metabolic reaction network, which we are now calling "formal metabolic reconstructions". We view FIG as focusing on development of tools to provide these three basic deliverables. We intend to make all of FIG's deliverables completely open.
  2. There is a growing view that FIG needs three major components:
    • We will generate an open integration with hundreds (and thousands before long) of genomes. We view this as a resource that is required to support our research.
    • There is a component of FIG that is deeply interested in promoting wet lab confirmations of conjectures. This aspect of FIG will support generation of conjectures, prioritizing the results, posting the prioritized list, and setting up relationships with key labs to help facilitate progress.
    • Finally, there is a small subset of participants that are deeply interested in education. They push for developing software that would allow the use of the FIG db in supporting teaching (in areas like biochemistry, microbial physiology, molecular biology, etc.).

    It is not yet clear how to get all three components progressing (at this stage, we are still struggling to get the basic tools ready and to get a few key proposals written).
  3. There is a growing need for academics with new sequencing projects to get software support for annotation. We considered ERGO to be the premier system for supporting annotation, but most academics do not have access to it. It is possible that ERGO will become accessible to the academic community, but that whole issue depends on Integrated Genomics and we have little to say about the situation at this point. We will almost certainly build a tool to support annotation of at least prokaryotic genomes, but it will not be fully functional in less than 4-6 months. Partial components will be running before then, and we will do what we can to help academic projects.
  4. Financially, FIG will inevitably start very slowly. If we get a big grant, great; we will add staff to support both the programming and annotation efforts. If not, we will support ourselves for a while doing small contracts. The core of FIG intends to leverage its limited resources by forming many relationships with other groups. We have been offered a substantial amount of programming support by individuals who would just like to cooperate in building a great system, and it seems likely that a similar situation will exist for annotations once the software framework to support a large-scale effort is in place.

Goals for July

  1. We need to get a smooth update cycle for adding new genomes going. They are pouring in, and the work of processing them is getting pretty time-consuming.
  2. We need to get everything in order to support distribution of our non-redundant protein db. We hope to have a version ready by mid-August.
  3. Ross is building a wonderful application to support the search for missing genes. He realized that this can be done without building much infrastructure. With luck he will have a web site running with this application by the end of July or early August.
  4. We have built a tool called the Resolution Center for displaying/resolving the decisions relating to the genes in a prokaryotic organism (which ORFs contain a gene and where the start codon resides). We will try to get this installed on our web site (and maybe ANL's) by August. It was basically done a month ago, but there are still details that need to be straightened out before we release it.
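For readers unfamiliar with the two decisions mentioned in item 4 (which ORFs contain a gene, and where the start codon resides), the following Python sketch shows the raw material such a tool has to work with. It is purely illustrative and is not how the Resolution Center works; the function name, the `min_codons` parameter, and the restriction to the forward strand are assumptions made to keep the example short. It uses the standard stop codons (TAA, TAG, TGA) and the start codons commonly seen in prokaryotes (ATG, GTG, TTG).

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}


def orfs_with_starts(dna, min_codons=2):
    """List forward-strand ORFs together with their candidate start positions.

    Each result is (orf_begin, stop_pos, starts), where orf_begin is the first
    base after the previous in-frame stop (or the frame offset), stop_pos is
    the position of the terminating stop codon, and starts holds the 0-based
    positions of every in-frame candidate start codon in between.
    """
    dna = dna.upper()
    results = []
    for frame in range(3):
        begin = frame
        starts = []
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon in STOPS:
                # Keep the ORF only if its longest reading is long enough.
                if starts and (pos - starts[0]) // 3 >= min_codons:
                    results.append((begin, pos, starts))
                begin = pos + 3
                starts = []
            elif codon in STARTS:
                starts.append(pos)
    return results
```

Each returned ORF usually carries several candidate starts, which is exactly why the choice of start codon needs to be displayed and resolved rather than taken for granted. A real tool would also scan the reverse strand and weigh additional evidence.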

FIG Special News

HAPPY 75th CARL!
