June 2003 Newsletter
- Monthly Newsletter
- Proposals/Contracts
- FIG: the First Release
- FIG Deliverables for June
We are starting a monthly newsletter to help keep participants in FIG aware of what's going on and to suggest monthly goals. For those who are not completely familiar with what FIG is, here is a brief summary:
- FIG stands for the Fellowship for Interpretation of Genomes. It is a non-profit corporation. Veronika Vonstein is President. Michael Fonstein, Andrei Osterman, Ross Overbeek, and Yakov Kogan were the other founding members.
- It was formed to support the development of a completely open source set of tools for the interpretation of genomes (and, hence, for interpretation of functional data, since figuring out what genes do is at the core of interpreting genomes).
- At this very initial stage, there is a loose center of gravity around the Chicago suburbs and Argonne National Lab, with significant outposts in Urbana, Tennessee and California. The hope is that this will be an international effort to produce something really good. It remains to be seen whether or not that will happen.
- Ralph Butler has helped develop a simple approach allowing the FIG tools to work under both Postgres and MySQL. Eventually, we will make the FIG tools all work under at least Postgres, MySQL, Oracle, and DB2. By early June, we will have everything working properly under both Postgres and MySQL. Next, we need to get one of them installed on the Macs (which Gary Olsen and Ross are using as development environments). This should be straightforward, so let's plan on getting this done by mid-June.
- Ross has implemented most of the "Resolution Center". We should target an early implementation set up at Argonne National Lab to support gene calls in a number of the organisms funded for the Genomes To Life project under DOE. FIG will be seeking funding to support Synechocystis research, so that organism will probably get the first attention. The goal here should be to offer a fully functional site at ANL by the end of June for at least Synechocystis, E. coli and B. subtilis. Then, we should add organisms at the rate of one or two per month. This needs to be almost completely automated, without requiring ongoing efforts by FIG participants, but we are hoping that Ralph Butler will participate heavily in helping develop tools to improve starts and gene calls.
- Natalia Maltsev's team at Argonne is helping us by running the first set of similarities under blast (i.e., all of the new nr against itself). This is a huge computation, but we believe that it will be done before the end of June. Once this first version is completed, relatively small amounts will need to be run each week. Natalia's team is setting up a fully automated pipeline to handle our needs. Hence, we can realistically plan on maintaining a state-of-the-art nr with up-to-date similarities by sometime in July.
- It is clear that we need to move away from relying on similarities between protein sequences. Rather, we need to move to comprehensive protein families structured as trees. We held the first workshop on how we are going to proceed in May. Gary Olsen is participating heavily in this work, which will begin in earnest once we have the similarities from ANL. We hope to be in high gear on this project by the third week in June. Gary is developing a Perl module for tree manipulation, and we will scrap the old one Ross did at Argonne and move to Gary's as soon as possible.
- We are attempting to get a small workshop scheduled quickly to focus on resolving the precise definitions for "informal metabolic reconstruction" and "formal metabolic reconstruction". We have lined up people who would do the work of developing formal metabolic reconstructions for a number of organisms, if we had the software to support them. Ross is planning on trying to get an initial set of tools up by mid to late July.
- running Resolution Center based at Argonne National Lab by the end of June
- complete similarities for the initial nr (~ 1.8 million sequences) by the end of June
- Gary Olsen's new tree manipulation routines by the end of June
- Initial CGI scripts to support the development of formal metabolic reconstructions by mid to end of July. In use by 2-3 users by mid August.
- first complete FIG release supporting a tool for automated assignment of function to genes by mid September
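To make the note on similarities a little more concrete, here is a rough back-of-the-envelope sketch of why the weekly updates should be small compared with the initial all-against-all run. It is only an illustration under stated assumptions: it counts every (query, subject) sequence pair as one unit of work, which ignores sequence length and how blast actually spends its time, and the weekly figure of 10,000 new sequences is a made-up number, not an estimate from Natalia's team. The function names are hypothetical.

```python
def initial_comparisons(n: int) -> int:
    """Pairs searched when each of n sequences is run against the full set."""
    return n * n


def incremental_comparisons(existing: int, added: int) -> int:
    """Extra pairs needed once `added` sequences join a set of `existing` ones.

    Covers new vs. old, old vs. new, and new vs. new; pairs among the
    existing sequences were already computed and are not repeated.
    """
    total = existing + added
    return total * total - existing * existing


if __name__ == "__main__":
    nr_size = 1_800_000   # approximate size of the initial nr (from the deliverables list)
    weekly_new = 10_000   # ASSUMPTION: hypothetical number of new sequences per week

    full = initial_comparisons(nr_size)
    update = incremental_comparisons(nr_size, weekly_new)
    print(f"initial run:   {full:.3e} pairs")
    print(f"weekly update: {update:.3e} pairs")
    print(f"update / initial = {update / full:.4f}")
```

Under these assumptions, a weekly update costs roughly one percent of the initial computation, which is the sense in which "relatively small amounts" need to be run once the first version is done.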
Proposals/Contracts
Let us begin by stating that we just submitted a large proposal with Argonne National Lab (ANL) to NIH. In our judgment, it is a really good proposal and stands a reasonable chance of being funded. If it does get funded, FIG is solid for the next five years.
We have also participated (in a fairly minor role) with Wim Vermass of Arizona State University in a second proposal. Again, we are offering to provide bioinformatics support.
This experience has gradually made us aware that there is now a huge demand for bioinformatics support for projects, and we are in a good position to participate in a number of proposals. The beauty is that any work done on any of the grants/contracts strengthens the overall performance on all of them, since all of the work will be open source. One problem is that Ross cannot become overcommitted, so we will need a few more senior software people if we do proceed with more large efforts.
FIG: the First Release
We are shooting for a first real release of FIG software by the end of the summer. It is too early to fix a deadline yet. We would like to provide roughly the following functionality:
| Resolution Center | An environment for resolving issues relating to the identification of genes in prokaryotic genomes (both genes and starts). |
| A larger protein nr | It turns out that we can probably provide an nr with about 20-25% more sequences than the one available from NCBI. There are that many genomes out there that have some level of restrictions on use (e.g., the Sanger and JGI centers essentially restrict whole genome publications for a year). We will need to ensure that participants that take copies of the distribution agree to the restrictions, or we can release the code with instructions on how to download data and make their own nr. |
| Protein families | We have started a project to build yet another set of protein families. We believe that there are good reasons to do our own. |
| Automated annotation | This will be the main thrust of the first release. We will offer a free service for calling genes in prokaryotes and providing automated assignments. |
Needless to say, we will be scheduling work on the Browser/Curation Tool, the Data Mining Tool, and both informal and formal metabolic reconstructions, as well. We hope to be supporting several efforts to develop formal metabolic reconstructions within 2-3 months, but they will not be part of the release.
FIG Deliverables for June
One goal of each newsletter will be to list the detailed work that is going on during the coming month. Here are some notes:
So, KEY DELIVERABLES:
More in a month...


