<?xml version="1.0" encoding="utf-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>FIG</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/news/" />
<modified>2008-01-07T20:17:48Z</modified>
<tagline>Home of the Project to Annotate 1000 Genomes</tagline>
<id>tag:www.figresearch.com,2008:/news//3</id>
<generator url="http://www.movabletype.org/" version="4.01">Movable Type</generator>
<copyright>Copyright (c) 2007, Ross Overbeek</copyright>

<entry>
<title>Reflections on 2007</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2007/12/reflections-on.html" />
<modified>2008-01-07T20:17:48Z</modified>
<issued>2007-12-31T19:20:49Z</issued>
<id>tag:www.figresearch.com,2007:/news//3.3242</id>
<created>2007-12-31T19:20:49Z</created>
<summary type="text/plain">I have recently been reflecting on the status of the Project to Annotate 1000 Genomes and in this short essay I will argue that it has been an overwhelming success due to issues that became apparent only as the project...</summary>
<author>
<name>Ross Overbeek</name>

<email>ross@thefig.info</email>
</author>

<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<p>I have recently been reflecting on the status of the <a href="http://www.figresearch.com/archives/2004/01/manifesto_1.html">Project to Annotate 1000 Genomes</a>
and in this short essay I will argue that it has been an
overwhelming success due to issues that became apparent only as the
project progressed. &nbsp; I am writing this on the last day of
2007, which certainly proved to be a remarkable year. &nbsp; A
thousand more-or-less complete genomes now exist, a framework for
rapidly annotating new genomes with remarkable accuracy is now
functioning, and we are on the verge of another major shift in the
world of annotations. &nbsp;Let me try to clarify these remarks
before proceeding to the celebration this evening.</p>
]]>
<![CDATA[<h4>The Production of Accurate Annotations</h4>
<p>The efforts required to establish a framework for high-volume, accurate
annotation are substantial.  I believe that it is important
that we reflect on what we have learned about the factors that
determine productivity.  So, what have we learned from the
project?</p>
<p>First, <strong><em>subsystem-based
annotation</em> is the key to accuracy</strong>. While numerous efforts
still focus on annotation of a single genome, two principles are
widely accepted at this stage: that comparative analysis is the key to
everything, and that accuracy comes from focusing on the variations of a
single component of cellular machinery as they are manifested over the
entire collection of existing genomes.  <strong>Manually-based subsystem creation
and maintenance is the rate-limiting component</strong> of
successful annotation efforts, and the factors that constrain this
process are at the heart of the matter.  We have understood
this for some time now.</p>
<p>I am going to argue a new position in this short essay:</p>
<ol>
<li>There are three distinct components that make up our
strategy for rapid accurate annotation: <strong>subsystems-based annotation</strong>,
<strong>FIGfams</strong>
as a framework for propagating the subsystems annotations, and <strong>RAST</strong> as a technology
for using FIGfams and subsystems to consistently propagate annotations
to newly-sequenced genomes.</li>
<li>These three components form a cycle (subsystems =&gt;
FIGfams =&gt; RAST technology =&gt; subsystems). This
cycle creates a feedback that rapidly accelerates the productivity
achievable in all three components. Further, failure in any
of these components impairs productivity dramatically in the others.
Understanding this cycle will be the key to supporting higher
productivity in subsystem maintenance and creation.</li>
<li>To understand the dependencies, we need to consider each of the components:
<ul><li><strong>The key to
accurate FIGfam creation and maintenance is to couple it directly to
subsystem maintenance</strong>. Once the initial release
of the FIGfams had been created, updates began to occur automatically
based on changes in the subsystem collection.  Thus, FIGfams are
automatically split, merged and added as the subsystem
collection is maintained.  There remains one area of
substantial cost in FIGfam development -- creation of family-dependent
decision procedures that are occasionally required to achieve the
required accuracy.  At this point we have approximately 10,000
subsystem-based FIGfams, although the overall collection contains over
100,000 families (the majority containing only 2-3 members).</li>
<li><strong>RAST has a
central dependency on FIGfams</strong> for assertion of function to
newly-recognized genes.  In this sense, the main dependency of
RAST is on the FIGfam collection.  The more accurate the
FIGfams and their associated decision procedures, the more accurate the
assignments of function made to genes in genomes processed by RAST.</li>
<li>Finally, the central costs of maintenance of subsystems
are cleaning up errors in existing subsystems (often indicated by
multiple genes having the same function) and adding new genomes to
existing subsystems.  Once a subsystem has reached an
acceptable level of accuracy (and many are not there yet), <strong>the central cost is integration
of new genomes after annotation by RAST.</strong>  The
speed with which new genomes can be added depends on how well RAST
assigns gene function (and, secondarily, on how accurately these
RAST-based annotations can be used to  infer operational
variants of subsystems).</li>
</ul>
The main costs of increasing the speed and accuracy of
annotations split into two categories: those relating to maintenance of
existing subsystems, and those relating to generation of new
subsystems.
The maintenance costs are containable, if the cycle is established and
functions smoothly.  Otherwise, I suspect they inevitably grow
rapidly.</li>
</ol>
<p>I have argued that the achievement of rapid, accurate annotations is
limited by the rate at which subsystems can be maintained and created.
 I place the maintenance ahead of creation at this stage.
 As the collection grows (it now contains over 600
subsystems with over 6800 distinct functional roles), costs of maintenance will tend to dominate.  The
creation of new subsystems will always be a critical activity, but each
new subsystem will impact smaller sets of genomes as we "move into the
tail of the distribution".</p>
<p>The costs relating to subsystem maintenance, which will quickly
dominate, depend critically on how smoothly the cycle I described
functions.  We have just established the complete cycle, which
is arguably the major achievement of 2007.</p>
<p>The two central costs that cannot be avoided will be creation of
FIGfam-dependent decision procedures and the creation of new
subsystems.  We are currently beginning focused efforts to
increase annotator productivity relating to both activities.  
The required technology in these cases is less dramatic and relates to
development of better tools (to support a set of possible decision
procedures in the case of FIGfams, and to resolve inconsistencies in
subsystem initiation/expansion).</p>
<h4>More Effective Integration of Existing Annotation Efforts</h4>
<p>In the section above, I reflected on the cycle that we shall depend
upon for supporting increased volume and accuracy of our own efforts.
 Other groups are certainly experimenting with their own
solutions, and in some cases with clear successes.  I have no
desire to rate these competing efforts.  We have a new year
coming, and I sincerely believe that cooperative activity is the key to
enhanced achievements by everyone.  However, effective
cooperation is often elusive.  I think that we have put in
place an extremely important mechanism for making cooperation much
easier, and the benefits more compelling.</p>
<p>Anyone working for one of the main annotation efforts realizes that it
is not easy to really benefit from access to the annotation efforts of
other groups.  The efforts required to characterize
discrepancies between local annotations and those produced externally
often outweigh any benefits that result.</p>
<p>Two events of major importance have occurred:</p>
<ol><li>Both PIR and the SEED Project decided to build
correspondences between IDs used by different annotation projects.
 The PIR effort produced <a href="http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml">BioThesaurus</a>
and the SEED effort produced <a href="http://clearinghouse.nmpdr.org/aclh.cgi">the
Annotation Clearing House</a>.  The fact that it will
become trivial to reconcile IDs between the different annotation
efforts will undoubtedly support rapid increases in cross-linking
entries.  The SEED is working with UniProt to cross-link
proteins from all of our complete genomes, and I am sure similar
efforts are happening between the other major annotation efforts.</li>
<li>Within the Annotation Clearing House, a project to allow
experts to assert that specific annotations are reliable (using
whatever IDs they wish) has been initiated.  This has led to
many tens of thousands of assertions that specific annotations are
highly reliable.  PIR is preparing a list of assertions that
they consider highly reliable, and both institutions are making these
lists openly available.</li>
</ol>
<p>To see the utility of exchanging expert assertions in a framework in
which it is easy to compare the results, let me describe how we intend
to use these assertions:</p>
<ol>
<li>We begin with a 3-column table of reliable annotations
containing <em>[ProteinID,AssertedFunction,IDofExpert]</em></li>
<li>We
then take our IDs and construct a 2-column table <em>[FIG-function,AssertedFunction].</em>
 This table gives a correspondence between each of our <em>functional roles</em>
and the functional roles used by the expert making the assertion of
reliability.</li>
<li>Then, we go through this correspondence table (using both
tools and manual inspection) and split it into one set in which we
believe both columns are essentially identical and a second set that we
believe represents errors (either our own or those of the expert
asserting reliability).  We anticipate that in most cases the
expert assertion will be accurate, which is what makes this exercise so
beneficial to ourselves.</li>
<li>We take the table of "essentially the same" assertions and
distribute it as a table of synonyms (which we consider to be a very
useful resource).</li>
</ol>
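<p>As a rough illustration of the four steps above, here is a minimal Python sketch. The sample rows, the <code>fig_function_of</code> lookup, and the normalization rule used to decide that two roles are "essentially the same" are all assumptions made for this example. They are not the tools or data formats actually used by the SEED or the Annotation Clearing House, and in practice the split also involves manual inspection.</p>

```python
# Hypothetical sample data; real tables would come from the exchanged assertion lists.
# Step 1: 3-column table [ProteinID, AssertedFunction, IDofExpert].
assertions = [
    ("prot1", "Threonine synthase (EC 4.2.3.1)", "expertA"),
    ("prot2", "threonine  synthase (ec 4.2.3.1)", "expertB"),
    ("prot3", "Homoserine kinase", "expertA"),
]

# Assumed lookup from a protein ID to our own functional role.
fig_function_of = {
    "prot1": "Threonine synthase (EC 4.2.3.1)",
    "prot2": "Threonine synthase (EC 4.2.3.1)",
    "prot3": "Homoserine dehydrogenase",
}


def normalize(role):
    """Crude stand-in for 'essentially identical': ignore case and spacing."""
    return " ".join(role.lower().split())


def correspondence_table(assertions, fig_function_of):
    """Step 2: build the 2-column table [FIG-function, AssertedFunction]."""
    pairs = set()
    for protein_id, asserted, _expert in assertions:
        if protein_id in fig_function_of:
            pairs.add((fig_function_of[protein_id], asserted))
    return sorted(pairs)


def split_table(pairs):
    """Step 3: split into likely synonyms and pairs needing manual review."""
    synonyms, conflicts = [], []
    for fig_role, asserted in pairs:
        if normalize(fig_role) == normalize(asserted):
            synonyms.append((fig_role, asserted))
        else:
            conflicts.append((fig_role, asserted))
    return synonyms, conflicts


pairs = correspondence_table(assertions, fig_function_of)
synonyms, conflicts = split_table(pairs)
# Step 4: 'synonyms' is the table that would be distributed;
# 'conflicts' lists discrepancies for an annotator to resolve.
```

<p>The automatic comparison here only pre-sorts the pairs; anything placed in the second set would still need a person to decide whose annotation is in error.</p>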
<p>We are strongly motivated to resolve differences between our
annotations and high-reliability assertions made by experts.
 The production of the table of synonyms not only reduces the
effort to redo such a comparison in the future, but is also a major
asset in itself.  I am confident that any serious annotation
group that participates will benefit, and I believe that these
exchanges will accelerate in 2008 and 2009.</p>
<h4>Summary</h4>
<p>I have tried to explain why I think that 2007 was a watershed year.
 The creation of the <strong>subsystems
=&gt; FIGfams =&gt; RAST =&gt; subsystems </strong> cycle
was not precisely planned, but its achievement has made its value
obvious.  That, coupled with the growing tendency to
cross-link and exchange reliable assertions, will lead to rapidly
improving annotations over the next 2-3 years. <br><br>Although
we now probably have over a thousand sequenced genomes, we have not yet
integrated that many into the SEED (and I doubt that we would have
access to that many right now).  However, it seems very, very
likely that we will have that many by sometime in 2008, and it also
seems likely that we will be in a position to provide pretty decent
initial annotations.  I would anticipate completing the original
project next year, and it is now time to plan the next stage.</p>]]>
</content>
</entry>

<entry>
<title>History of FIG is Available</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2005/02/history-of-fig.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2005-02-28T17:25:36Z</issued>
<id>tag:www.figresearch.com,2005:/news//3.425</id>
<created>2005-02-28T17:25:36Z</created>
<summary type="text/plain">A Timeline showing the intertwined careers of the various FIG Fellows and a partial list of the papers on which they collaborated is now available on our new history page....</summary>
<author>
<name>Ferdinand T Cat</name>
<url>http://www.conservativecat.com/Ferdy</url>
<email>ferdy@conservativecat.com</email>
</author>
<dc:subject>Announcements</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<p>A Timeline showing the intertwined careers of the various FIG Fellows and a partial list of the papers on which they collaborated is now available on our new <a href="http://www.figresearch.com/History.html">history page</a>.</p>]]>

</content>
</entry>

<entry>
<title>Web Site Goes Live</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2005/02/web-site-goes-l.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2005-02-10T20:41:22Z</issued>
<id>tag:www.figresearch.com,2005:/news//3.355</id>
<created>2005-02-10T20:41:22Z</created>
<summary type="text/plain">The new FIG web site is up and running!...</summary>
<author>
<name>FIG</name>
<url>http://www.figresearch.com</url>
<email>bruce@gigabarb.com</email>
</author>
<dc:subject>Announcements</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<p>The new FIG web site is up and running!</p>]]>

</content>
</entry>

<entry>
<title>Late January 2005 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2005/02/late-january-20-1.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2005-02-01T06:15:20Z</issued>
<id>tag:www.figresearch.com,2005:/news//3.326</id>
<created>2005-02-01T06:15:20Z</created>
<summary type="text/plain"> Development of the NMPDR The Manifesto The SEED Developers Meeting in Chicago in October Workshops, Tutorials, etc. New Releases, etc. The More Important Release: the Subsystems...</summary>
<author>
<name>Ferdinand T Cat</name>
<url>http://www.conservativecat.com/Ferdy</url>
<email>ferdy@conservativecat.com</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>Development of the NMPDR</li>
<li>The Manifesto</li>
<li>The SEED Developers Meeting in Chicago in October</li>
<li>Workshops, Tutorials, etc.</li>
<li>New Releases, etc.</li>
<li>The More Important Release: the Subsystems</li>
</ul>]]>
<![CDATA[<p>A great deal has happened during the last three months, so we decided to write another newsletter.</p>

<h4>Development of the NMPDR</h4>

<p>As we mentioned in our last newsletter, FIG received a sizable portion of a 5-year grant to build a National Microbial Pathogen Database Resource.  This grant is providing a welcome stability.  In addition, the NMPDR is a completely open source project, and the extensions to the existing SEED technology that are under development will benefit
everyone.   Many aspects of the basic design are being refined and
cleaned up.  The first release is only a few weeks away.  While based on SEED technology, the team doing the development has introduced a number of key innovations that we will discuss in the next newsletter.</p>

<h4>The Manifesto</h4>

<p>A discussion has been taking place relating to the appropriate objectives of FIG and the SEED project.  Now that FIG has solid funding due to its participation in the development of the NMPDR, there is a welcome chance to re-examine goals.  There are a number of ways things might develop, and as one might expect, different views are constantly being expressed.  Ross attempted to formulate his view, which was posted at <a href="http://TheSEED.uchicago.edu/FIG/Html/1KG.html">http://TheSEED.uchicago.edu/FIG/Html/1KG.html</a>.
The focus he advocates is largely based on annotations: success should be measured by FIG's ability to advance the development of subsystems and detailed encodings of metabolism (in the form of stoichiometric matrixes).</p>

<h4>The SEED Developers Meeting in Chicago in October</h4>

<p>The SEED Developers meeting was held on Oct 24-25.  It focused on preparing a distribution version of a merged environment that would support both the SEED and GenDB.  Teams at both Bielefeld and Argonne National Lab put substantial effort into merging the systems.  A prototype was produced, DVDs were created (shortly after the meeting), and the result was shown at Supercomputing 2004, which occurred in early November.  The merged system is not completely distributable yet for two reasons:</p>
<ol>
<li>The installation scripts are not solid.  We can manually do an
        install by working through the steps, but working out the details of
        making things work properly in arbitrary environments is taking time.</li>
<li>It became clear that we wanted to include a number of genomes
        in the GenDB side that had all of the precomputed data needed
        to support analysis (rather than making users begin by doing a fairly substantial amount of computation to prepare their
        genomes of interest).</li>
</ol>

<p>The effort to make both systems available in an integrated environment will continue during the first quarter of 2005.  If anyone really wants a copy right away for production use (as opposed to just evaluation), we will make an effort to help you install it.  However, the major release scheduled for the end of the first quarter will include hundreds of genomes already preloaded into the GenDB component, tens of thousands of newly-called genes, and on the order of a hundred additional genomes.  We sincerely believe that it will be worth the wait.</p>

<h4>Workshops, Tutorials, etc.</h4>

<p>We held tutorial/workshops at MIT and in Mexico, and things went extremely well.  We have another scheduled in early March at the University of Florida, and Andrei will be giving one in the Netherlands in May.  We have been reconsidering the most effective way to spread the technology, given that everyone is already hugely overcommitted.</p>

<p>One observation has been that, in the few cases in which the SEED technology for annotation of genes and subsystems has actually been used in classrooms, we got more benefit for our efforts.  Most notably, we had a really productive experience helping introduce the technology in Bernhard Palsson's graduate class at UCSD.  The students took it very seriously, they built a number of well-done subsystems, and it established the foundation for ongoing use and collaboration.</p>

<p>The position that is emerging is that advancing the effort to produce accurate, well supported subsystems should be the main goal; all tutorial efforts would be judged on the odds that the expended effort would advance that goal.  Hence, tutorials given to experts planning on using the technology to support development of reviews or in their own research would get highest priority, graduate classes would get next highest priority, etc.</p>

<h4>New Releases, etc.</h4>

<p>The effort required to add hundreds of genomes at an increasing pace is draining.  Just trying to help people make new installations and update their old ones is time consuming.  It is becoming very clear that we need to establish a strategy that scales, and we need to implement it very quickly.  The current plan is really pretty exciting.  The key points are as follows:</p>
<ol>
<li>The major release that we will make about the end of the
               first quarter will be the "first and only more-or-less
               official release of the SEED/GenDB data".  Code will be constantly updated and installed over the network as it
               is done now.  However, we will stop constructing
               "current releases".</li>
<li>Rather than new releases of an ever-increasing
               collection of genomes, we will support incremental
               addition of single genomes.  Users can download genomes, similarities, improved gene calls, etc. from the
               clearinghouse.  Users will download and install genomes the way they now download and install subsystems.</li>
<li>A number of participants will be helping to prepare
               genomes for addition, and they will all be depositing
               them into the clearinghouse.</li>
<li>For new participants, we will provide the single
               release, and we will implement the ability to "clone"
               an existing SEED/GenDB system.  This allows a new user to acquire a "clone" from someone who has made the
               investment to download whatever genomes are desired,
               install whatever subsystems they trust, etc.  It also
               frees us from having to worry about how to help people
               get started.</li>
</ol>
<p>There are numerous aspects of this strategy that have not been properly implemented yet.  We will do our best to have everything fully functional by the big release.</p>

<h4>The More Important Release: the Subsystems</h4>

<p>The big release will include not just the SEED/GenDB integration, but an initial set of subsystems, as well.  These represent the output of the "prototyping and evaluation" stage.  There will be on the order of 100-150 subsystems in differing stages of development.  Some were done by experts, have lovely diagrams, and reflect major amounts of effort. Many were done by enthusiastic participants with less knowledge and they reflect it.  There are many ways that one might measure the utility of 
this initial batch.  We believe that it represents a major milestone.  Much of the next two months will be spent organizing and doing quality control for this initial release.</p>


<p>Well, that is about it for now.  There are probably many things that we forgot to cover, but we did hit the major topics.  We will try to get out another newsletter next month, but it is not clear that we will take the time to do so before "the big release", which should be about the start of March.  In any event,

we wish you well and hope you prosper,</p>

<p>the team at FIG</p>]]>
</content>
</entry>

<entry>
<title>In-Between Fall 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/10/inbetween-fall.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-10-17T20:13:12Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.303</id>
<created>2004-10-17T20:13:12Z</created>
<summary type="text/plain"> Developers Meeting Meeting Agenda...</summary>
<author>
<name>FIG</name>
<url>http://www.figresearch.com</url>
<email>bruce@gigabarb.com</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>Developers Meeting</li>
<li>Meeting Agenda</li>
</ul>]]>
<![CDATA[<p>We are writing this short newsletter to update people on a number of issues, most notably the upcoming SEED Developers Meeting.  The important topic that we need to point out is this: it appears that somewhat more people than expected will be coming, it would be a strain to fit them into the facilities at the University of Chicago during the part of the week we were going to run there (Oct. 27-29), and so we are going to run the WHOLE meeting at Argonne National Lab.</p>

<p>We had not originally planned on doing the whole thing at ANL because of worries about being able to get people into the lab (we must get people cleared by ANL security).  We have checked, and it appears that everyone (that we know about) coming to the second half can easily be cleared.  So, it is critical that, if you have not discussed your attendance with us, but you would like to come, you contact us immediately.  If it emerges that there are people who would like to come, but cannot get clearance, we will run a small parallel version at the University of Chicago (or, we will help them get set up for a second tutorial that we will be holding at MIT in Boston on Nov 1-2).</p>
<p>So, that is the important message: PLEASE MAKE SURE THAT WE KNOW IF YOU ARE COMING AND MAKE SURE THAT WE HAVE GOTTEN CLEARANCE FOR YOU.</p>

<h4>Agenda for the Developers Meeting (Oct 25-26)</h4>

<p>The SEED Developers Meeting will be held at Argonne National Lab on Oct 25-26.  There are good cheap hotels available (on site for about $65/night; contact Veronika or Cheryl Zidel).</p>

<p>These meetings are intended to be largely free format with intense exchanges between individuals extending the SEED.  Inevitably small subgroups will form and work on specific technical issues.  However, we will consider the following to be a draft agenda:</p>

<h4>October 25</h4>

<p>Building 221, room A216, Mathematics and Computer Science Division, Argonne National Lab (wireless connections are available, so bring your laptops)</p>

<table border="2" width="100%">
<tr><td>9:30 - 10:30</td>
<td><p>Summary of the Key Topic: Preparing a DVD Distribution that Contains Communicating Versions of GenDB and SEED.</p>
<p>One group of individuals will spend much of the 
entire week attempting to prepare this release.</p></td>
</tr><tr>
<td>11:00 - 12:00</td>
<td><p>Overall Development Plans for The SEED (Where We Will Be Going Over the Next 6 Months)</p></td>
</tr><tr>
<td>12:00 - 1:30</td>
<td><p>Lunch, walks around the grounds, getting to know one 
another</p></td>
</tr><tr>
<td>1:30 - 2:30</td>
<td><p>What it takes to install and maintain a version of 
the SEED.  This discussion, like many of the discussions that will take place over the two days, will include a discussion of the same topic as it relates to GenDB. Our joint goal with the group at the University of Bielefeld is to produce easily installed versions of our open source software.</p></td>
</tr><tr>
<td>2:30 - 3:00</td>
<td><p>Coffee, discussions</p></td>
</tr><tr>
<td>3:00 - 5:00</td>
<td><p>Adding new genomes to the SEED</p></td>
</tr></table>

<p>We must stress that not everyone will be attending these discussions. One group will organize itself to focus on the development of the distribution DVDs.  Members will come and go.  It will (intentionally) be a somewhat unstructured, hopefully intensely productive, exchange.</p>

<h4>October 26</h4>

<table border="2" width="100%">
<tr><td>9:30 - 10:30</td>
<td><p>Issues Relating to Preparing Genomes for Entry to 
the SEED</p></td>
</tr><tr>
<td>10:30 - 11:00</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>11:30 - 12:30</td>
<td><p>Improving Gene Calls (both protein encoding and RNA 
encoding genes)</p></td>
</tr><tr>
<td>12:30 - 2:00</td>
<td><p>Lunch</p></td>
</tr><tr>
<td>2:00 - 5:00</td>
<td><p>Status of the Subsystem Support Components of the SEED. This discussion will focus on what support now exists for developing 
subsystems, what is planned, and some obvious uses of the developed 
subsystems.</p></td>
</tr></table>

<blockquote>The three days that follow (Oct 27-29) are intended to be less 
technical, more tutorial, but also somewhat unstructured.  We 
anticipate that many users will come only for these three days.</blockquote>

<h4>October 27:  What Can One Do With the SEED?</h4>

<table border="2" width="100%">
<tr><td>9:30 - 11:00</td>
<td><p>An Introduction to Annotation/Subsystem Construction</p></td>
</tr><tr>
<td>11:00 - 11:30</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>11:30 - 12:30</td>
<td><p><strong>Some Examples</strong></p>
<p>This tutorial session will include numerous points where we simply hand out IDs of genes we consider interesting and ask people to figure out whatever they can about them in 15-20 minutes.  Then, we will discuss some approaches to answering such questions using the SEED. These short "challenge questions" are designed to be fun exercises which allow people to play with the system.</p></td>
</tr><tr>
<td>12:30 - 2:00</td>
<td><p>Lunch</p></td>
</tr><tr>
<td>2:00 - 3:00</td>
<td><p>Starting a Subsystem</p></td>
</tr><tr>
<td>3:00 - 3:30</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>3:30 - 5:00</td>
<td><p>Users Will Either Start or Extend Subsystems of 
their Choice.</p>
<p>We are going to try to help people learn to work on development/annotation of subsystems.  People with experience will help beginners get started.</p></td>
</tr></table>

<h4>October 28</h4>

<table border="2" width="100%">
<tr><td>9:30 - 10:30</td>
<td><p>Challenge Problems</p></td>
</tr><tr>
<td>10:30 - 11:00</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>11:30 - 12:30</td>
<td><p>Looking at Some Detailed Problems/Inconsistencies in Existing Subsystems</p></td>
</tr><tr>
<td>12:30 - 2:00</td>
<td><p>Lunch</p></td>
</tr><tr>
<td>2:00 - 4:00</td>
<td><p>Working on Subsystems</p></td>
</tr><tr>
<td>4:00 - 4:30</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>4:30 - 5:00</td>
<td><p>Challenge problems</p></td>
</tr></table>
<h4>October 29</h4>

<table border="2" width="100%">
<tr><td>9:30 - 11:00</td>
<td><p>Extending Subsystems</p>
<p>Once a subsystem has been carefully constructed to include 10-20 diverse entries, how can one extend it to 200 organisms?</p></td>
</tr><tr>
<td>11:00 - 11:30</td>
<td><p>Coffee</p></td>
</tr><tr>
<td>11:30 - 12:00</td>
<td><p>Challenge Problems</p></td>
</tr><tr>
<td>12:00 - 2:00</td>
<td><p>Lunch</p></td>
</tr><tr>
<td>2:00 - 5:00</td>
<td><p>Work on subsystems, discuss the status of the 
subsystems effort, and what uses can be made of subsystems</p></td>
</tr></table>

<p>That is approximately what we plan to do.</p>

<p>Hope you can make it.</p>]]>
</content>
</entry>

<entry>
<title>August 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/09/august-2004-new-1.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-09-02T17:52:08Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.115</id>
<created>2004-09-02T17:52:08Z</created>
<summary type="text/plain"> The SEED Developers Meeting in Chicago in October The Subsystems Effort: How it is Shaping Up The iBook: a Minor Technical Note Leading to a Proposed Annotation Environment Tutorials The Funding Situation...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>The SEED Developers Meeting in Chicago in October</li>
<li>The Subsystems Effort: How it is Shaping Up</li>
<li>The iBook: a Minor Technical Note</li>
<li>Leading to a Proposed Annotation Environment</li>
<li>Tutorials</li>
<li>The Funding Situation</li>
</ul>]]>
<![CDATA[<h4>The SEED Developers Meeting in Chicago in October</h4>

<p>The next SEED Developers Meeting will be in Chicago during the week of
October 24-28.  The meeting will be split between Argonne National Lab
(Oct 24-25) and the University of Chicago (October 26-28).</p>  

<p>The initial period at Argonne will focus on preparing a distribution
version of the GenDB/SEED merged environment.  This should produce a
set of DVDs with both GenDB and the SEED in the distribution.  The
merged environment is built on technology developed at the University
of Bielefeld in Germany and the combination should offer capabilities
that are attractive to many groups.  In particular, it should support
straightforward editing of genes (adding CDSs, deleting them, and
changing start locations), as well as comparative capabilities.  This
part of the meeting will be quite technical.  Anyone planning on
attending should contact Ross Overbeek (Ross@TheFIG.info) or Mike
Kubal (mkubal@mcs.anl.gov) by September 15th. Entrance to Argonne is
restricted, and we need to apply for guest passes for visitors well in
advance (especially foreign nationals).</p>

<p>The second part of the meeting will be held at the University of
Chicago and will focus on the Project to Annotate 1000 Genomes (i.e.,
subsystems annotation).  This will include both a tutorial for people
that wish to understand how to annotate using the SEED, as well as an
exchange of views and status by people already quite active in the
effort.  For those wishing to attend this part, please contact either
Mike or Ross, and feel free to do so right up to the meeting.</p>  

<h4>The Subsystems Effort: How it is Shaping Up</h4>

<p>The subsystems effort is beginning to take shape.  It now involves a
number of people, some real experts and some relatively inexperienced.
Some people work on the SEED server at the University of Chicago
(http://TheSEED.uchicago.edu/FIG), but more work on their own personal
systems.  Periodically, they deposit versions of their analysis on a
server from which others can download as desired (this is called
"publishing" the subsystem).  Any user of the SEED can download any
or all of these deposited subsystems.</p>

<p>A Wiki is used to coordinate who is working on what
(<a href="http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SubsystemBulletinBoard">http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SubsystemBulletinBoard</a>).
A Forum (<a href="http://subsys.info">http://subsys.info</a>) has been set up as a framework for posting
notes of general interest, predictions, and general comments.  The
forum is intended to become a vehicle for opening up the effort to an
unlimited set of potential users.</p>

<p>We have found that there has been a bit of effort required to get
people acquainted with use of the Wiki and the Forum, and use of the
software to support development of a subsystem does require a
tutorial.  There are basic inadequacies in the tools at all levels,
and this can prove to be quite frustrating.</p>

<p>However, these last three weeks (as vacation time comes to an
end) have seen an amazing amount of progress, and things are now
beginning to take off.  A number of subsystems that cover thousands of
assignments have been published in these last few weeks, and we know
of many more in progress.  We are very enthusiastic at this point.</p>

<h4>The iBook: a Minor Technical Note</h4>

<p>Sveta Gerdes recently joined FIG and has begun working on encoding
subsystems.  Due to some lovely technical work by Bob Olson and Ed
Frank, the entire SEED can now be run on a laptop without an external
drive.  It does consume 20-30 gigabytes of the internal disk.  Sveta
now uses an iBook (basic cost of about $1499 + $129 for an additional
512 MB of internal memory) that supports a full version of the SEED
on the internal disk.</p>

<h4>Leading to a Proposed Annotation Environment</h4>

<p>This leads us to briefly discuss the basic working environment that we
are proposing for annotation teams:</p><ol>

    <li> A central SEED server is optional, but is often useful for two
       reasons: it can be used to support people that have only really
       inexpensive laptops or desktop computers, and it can be used as
       a framework for offering access to the world at large.  Most
       annotation efforts will include one, and it can be run on
       anything from a $1500 Linux box to a $4000 Mac G5.</li>

    <li> Some members of the annotation team will have moderate laptops
       ($1600-3000 machines) running local copies of the SEED. Others
       will use personal computers to work over the network on the
       central SEED server.</li>

    <li> Function assignments and annotations are exchanged between
       individual SEED versions of members of the team (and the
       central server, if it exists) each evening.</li></ol>

<p>The currently available peer-to-peer tools in the SEED are not
adequate to effectively support this style of synchronization.  We
consider it a high priority to get them to this point in the next
month or so.</p>

<p>This leads us to a central tenet of the proposed environment:
assignments and annotations are the output of an annotation effort,
and there must be an external representation that defines exactly what
is meant by their content.  In the presence of such a definition,
individual users can use whatever annotation software they wish, as
long as the results can be integrated and synchronized with those of
the entire team.</p>

<p>Once environments are established that contain multiple annotation
systems, all synchronizing their results on a daily basis, systems
will evolve to take over niches (sorry, perhaps the level of
enthusiasm is clouding our judgement here...).</p>

<h4>Tutorials</h4>

<p>Ross is planning on teaching a 3-day course in annotations and
subsystems technology at UCSD on Oct 4-6.</p>

<p>We plan on having a subsystems tutorial and general discussion of SEED
technology at MIT in Oct (probably Oct 18-19).</p>

<p>The cancelled tutorial at Los Alamos will be rescheduled for Santa Fe
in November.</p>

<p>If you wish to know about any of the details, please check the FIG
Forum or contact Veronika (Veronika@TheFIG.info).</p>


<h4>The Funding Situation</h4>

<p>During FIG's first year, the funding situation might reasonably be
characterized as somewhat precarious.  We survived due to a few
relatively small contracts, and some donations from people and
institutions.  We are deeply grateful to those who helped support our
goals during that period.  The plan was always to establish a basic
open source platform and then to use it to secure portions of grants
that would build upon those capabilities.  Numerous grants were
submitted in which FIG was included.</p>

<p>Finally, in August FIG signed a subcontract finalizing its
participation in a 5-year project led by PI Rick Stevens and
co-directed by Ross to build a National Microbial Pathogen Database
(<a href="http://www.niaid.nih.gov/dmid/genomes/brc/awards.htm">http://www.niaid.nih.gov/dmid/genomes/brc/awards.htm</a>).  This is an
$18M grant made to the University of Chicago. We intend to participate
in a number of grants; in some we play a very minor role helping to
develop independent teams based on SEED technology, and in others we
will play a more active role.  The SEED technology will prove useful
in a number of the emerging application areas ranging from pathogen
databases to analysis of environmental samples, and we encourage
groups to consider basing long-range developments on it.</p>
]]>
</content>
</entry>

<entry>
<title>July 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/07/july-2004-newsl.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-07-19T17:31:52Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.111</id>
<created>2004-07-19T17:31:52Z</created>
<summary type="text/plain"> The SEED Developers Meeting in Bielefeld Subsystem Tutorials The Subsystems Forum Gene Calling Computational Servers...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>The SEED Developers Meeting in Bielefeld</li>
<li>Subsystem Tutorials</li>
<li>The Subsystems Forum</li>
<li>Gene Calling</li>
<li>Computational Servers</li>
</ul>]]>
<![CDATA[<h4>The SEED Developers Meeting in Bielefeld</h4>

<p>We have just returned from the SEED Developers Meeting held in
Bielefeld, Germany on July 5-9.  It was a remarkably good experience.
There were three distinct events going on:</p><ol>

      <li> On July 5-6 a tutorial was held on how to annotate subsystems
         using the SEED.  Participants from Germany, France, Denmark, the
         Netherlands, Poland, and Russia were there.  It was a
         productive exchange, and it seems very likely that a number
         of expert biologists will complete their versions of
         subsystems and publish them to the clearinghouse.</li>

     <li> On July 5-9, Ross taught a course in use of the SEED to a
         number of students and post-docs from Bielefeld.  It has
         become apparent that the right way to teach a course about
         use of the SEED is to focus on development of subsystems.
         Before we had the subsystems-encoding component of the SEED
         operational, we taught several courses in use of the SEED for
         comparative analysis, and we sincerely believe they were
         quite successful.  However, it is now clear that developing
         a 4-lecture course with assignments focusing on the topic of
         encoding subsystems is needed badly; this is clearly the
         emphasis that should exist in a "short module" that could be
         included in standard courses in molecular biology,
         biochemistry, and microbiology.</li>

      <li> Finally, on July 7-9 a major effort was initiated to link
         the SEED with GenDB.  GenDB is a serious annotation system
         developed by the team at Bielefeld.  A substantial amount of
         effort and resources have gone into a system that offers many
         of the features needed to carefully annotate prokaryotic
         genomes.  Many of these features are missing in the
         SEED.  On the other hand, the SEED offers many services in
         comparative analysis which are missing in GenDB.  The systems
         complement one another in several ways.  Hence, we decided to
         consider the question "Can they easily be coupled?".  The
         team at Bielefeld had already spent time developing a
         detailed protocol for linking major components of their
         system (allowing almost independent development, with each
         component having access to the data maintained by other
         components).  In short, they had thought about the general
         issues that inevitably come up in such exercises.  Within the
         3-day session,</li><ol>

                <li> GenDB was installed at Argonne (the SEED was
                   already running on the Bielefeld server under
                   Solaris). </li>

                <li> The access protocols were rapidly installed by the
                   Bielefeld team, which allowed GenDB to access data
                   stored in the SEED.  This allowed a user running
                   GenDB to "import" one of the many genomes stored in
                   the SEED into GenDB.  It is hard to convey the
                   level of enthusiasm that led to this rapid
                   integration.  Among other things, this will allow
                   users to perform detailed editing of gene locations
                   (adding missed genes, adjusting starts, deleting
                   genes that are apparently not real, etc.)  This
                   type of detailed editing will be necessary for work
                   with many types of features (e.g., transposons and
                   regulatory sites), and the SEED simply lacks the
                   capabilities to do it properly at this point.</li>

                <li> Detailed plans for modifying the SEED to accept
                   updates relating to features, to smooth out the
                   details of gene calling, and to get computational
                   servers up and running (at both Bielefeld and
                   Argonne) were all discussed.  It seems very likely
                   that we will be able to distribute DVDs containing
                   both systems, which we believe will be a major step
                   forward for support of annotation projects, within
                   just a few months.</li>

                <li> Meetings over the internet using the Access Grid
                   were used on a daily basis to support coordinating
                   the Argonne, FIG and Bielefeld teams.  These will
                   continue on a weekly basis.</li></ol></ol>

<p>A great deal was accomplished in a very short time.  The only known
failure related to Ross' promising to locate a "missing gene" in
Corynebacterium glutamicum.  He concedes that the group could not find
one in time, but still insists that one will result from the exercises
that people began.  His optimism may exceed his judgement.</p>

<p>It was overall a truly magnificent meeting.</p>

<h4>Subsystem Tutorials</h4>

<p>The 2-day tutorial held in Bielefeld was completely different from the
1-day one held in Chicago.  Although both events were fun and
productive, the addition of the extra day allowed people to get
much deeper into the topic.  The plan called for people to start
working on their subsystems of interest by noon of the first day,
which allowed people to confront central issues more quickly.  By the
end of the second day, everyone was fairly far along, and a few had
relatively large spreadsheets.  While it is certainly true that it
takes months to accurately analyze most of these systems (if not
years), it is also true that the existing functional assignments can
be dramatically improved with just a day or two of effort by someone
interested in participating in this project.</p>

<p>We now believe that annotation of subsystems needs to be the major
focus of our efforts over the coming 4-6 months.  We must extend the
software, offer as many tutorials as possible, and actively seek
experts who are willing to participate.  The next subsystems tutorial
will be held at Los Alamos on August 26-27.  Two more are in the very
loose planning stage, but we expect to be able to clarify those plans
by the time we send the next newsletter.</p>

<h4>The Subsystems Forum</h4>

<p>Over beers, a number of us discussed the issue of recording the events
that will be taking place over the next 2-3 years.  If the genes
implementing the central functional roles of cellular subsystems are
actually worked out, it would be desirable to keep an electronic
record of the key events.  In addition, it would be nice to have a
means of rapidly exchanging information on the status of conjectures,
wet lab confirmations, and so forth.  Hence, we funded the development
of a "Subsystems Forum", which was started just before the Bielefeld
meeting.  It will be hosted at the University of Chicago.  We are
urging those people developing annotated subsystems to post both
questions and "open problems".  The URL for the forum is </p>

    <a href="http://TheSEED.uchicago.edu/SubsystemsForum">http://TheSEED.uchicago.edu/SubsystemsForum</a>

<p>or  <a href="http://subsys.info">http://subsys.info</a></p>


<h4>Gene Calling</h4>

<p>The subject of gene calling was discussed at the developers meeting.
The Bielefeld team has put up a server which already calls protein-encoding
genes and tRNAs.  It can be accessed via</p>

        <a href="https://www.cebitec.uni-bielefeld.de/software/gendb/cgi-bin/seed_upload.cgi">https://www.cebitec.uni-bielefeld.de/software/gendb/cgi-bin/seed_upload.cgi</a>
<p>The general plan is as follows:</p><ol>

    <li> We will develop the tools to construct a SEED-formatted
       organism from the output of the Bielefeld gene calling server,
       including the rRNA caller developed by Niels Larsen.</li>

    <li> GenDB offers a framework for examining and evaluating the
       relative merits of different gene calling algorithms.  We will
       work on making it possible to easily compare the output from
       different systems.</li>

    <li> Ralph Butler and Ross will be working on a system to re-call
       starts for genes from a single subsystem.</li>

    <li> It is important that, if we succeed in improving calls
       substantially, we make the results available to NCBI so
       that the improvements are not lost.</li></ol>

<p>While the existing gene calls are excellent for some organisms, they
are truly awful for others. Hopefully we can make rapid progress in
improving the situation over the coming months.</p>

<h4>Computational Servers</h4>

<p>ANL wrote a grant in which it was proposed to (among other things)
supply computational servers to support adding new organisms to the
SEED.  It is not clear exactly what servers are needed to support
things like gene calling, adding organisms to GenDB and adding
organisms to the SEED.  We will work the details out over the coming
few months, coordinate with efforts that already exist to provide such
services, and hopefully have things running smoothly by the fall.</p>

]]>
</content>
</entry>

<entry>
<title>June 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/06/june-2004-newsl.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-06-25T18:58:27Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.123</id>
<created>2004-06-25T18:58:27Z</created>
<summary type="text/plain"> Subsystem Development and the Project to Annotate 1000 Genomes SEED Developers Meeting Funding of SEED Development...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>Subsystem Development and the Project to Annotate 1000 Genomes</li>
<li>SEED Developers Meeting</li>
<li>Funding of SEED Development</li>
</ul>]]>
<![CDATA[<h4>Subsystem Development and the Project to Annotate 1000 Genomes</h4>

<p>The major SEED development effort for the last two months has centered
on annotation of subsystems.  The basic goal is to develop and support
the tools needed by an expert to annotate/analyze a specific metabolic
or nonmetabolic subsystem over a large collection of genomes.  We
believe that development of these tools and the framework needed to
effectively support comparative analysis is essential to straightening
out the existing annotations.</p>

<p>The initial set of tools was released in March, and a small group of
people began using them.  The main objective was simply to make them
reasonably reliable, to identify exactly what functionality was most
needed, and to establish a framework for saving and exchanging
results.  By May, we had demonstrated the basic utility of the tools.
In at least two cases, the subsystem annotations that were developed
during the shakedown phase resulted in serious conjectures relating to
"missing genes".  One of these conjectures has evidently been proven
wrong, but it did result in a rapid identification of the correct gene
(this is a tale best told over beers).</p>

<p>On June 18, we had a 1-day tutorial on use of the SEED to annotate
subsystems, and it was a remarkably pleasant event.  We had close to
30 people attend.  The main complaint was "so little time, so much to
do... we need two days".  Hence, in the future subsystem/SEED
tutorials will be 2-day affairs.  The next one will occur on July 5-6
at the University of Bielefeld in Germany, and it will be held as part
of the SEED Developers Meeting (which will be July 5-9).  We are
planning on having one at Los Alamos in August, if schedules permit,
and one in Boston in the fall.  Planning on these last two needs to
progress, and we will try to keep everyone posted.</p>

<p>There is a great deal that could be said about the significance of the
subsystem annotation effort.  While many SEED users and developers are
more interested in annotating single genomes or using the SEED for
other purposes, we at FIG view the annotation of subsystems to be the
essential core of The Project to Annotate 1000 Genomes, and as such it
is one of the most critical and exciting components of the SEED
collaboration.  We believe that this will become obvious and accepted
during the next 4-6 months.  Time will tell.</p>

<h4>SEED Developers Meeting</h4>

<p>The next major SEED event will be the meeting at Bielefeld on July
5-9.  There are a number of goals for that gathering:</p><ol>

      <li> We will hold the subsystems tutorial on the first 2 days, and
         it appears that a number of people will attend just that
         component of the gathering.</li>

      <li> Ross is supposed to teach a class for graduate students
         during the week.  There will clearly be a major overlap in
         content between the subsystems tutorial and the class (which
         will, hopefully, include sections on searching for missing
         genes and where the technology is going).</li>

      <li> On Wed-Friday, we plan on focusing on issues relating to
         supporting the development and maintenance of a set of
         communicating systems.  Initially, we will focus on GenDB
         (the system developed and supported by the team at Bielefeld)
         and the SEED (the system FIG/ANL/Burnham have been
         developing).  Both systems are open source, both groups are
         deeply interested in creating a joint framework that supports
         effective use of a loose integration of the systems, and both
         groups are forging ahead as quickly as possible.</li></ol>

<p>The Bielefeld team has developed and is now shaking down a web service
that will call genes in prokaryotic genomes.  We have found this
extremely useful in our efforts to produce a new release of the SEED.
Ralph Butler is working on a tool to help predict more accurate start
locations for CDSs, and we are slowly improving our installed
genomes.  Of course, we are taking RefSeq from NCBI, and they are
making substantial and constant progress in cleaning up their
collection.  The issues of how everything fits together, how users can
install their own data on local copies of the SEED, and how exchange
of data occurs will all be major topics at Bielefeld.</p>


<h4>Funding of SEED Development</h4>

<p>Until very recently, support for development of the SEED came from a
few consulting contracts obtained by FIG and from a few subcontracts
that were (and are) deeply appreciated.  This got FIG through last
year, and it got the SEED to the point where it has demonstrable
utility.</p>

<p>The basic model for funding SEED development goes as follows:</p><ol>

    <li> We hope to have many institutions participating.  Each is
       responsible for getting its own funding.</li>

    <li> Any institution that wishes can base projects (and proposals)
       on the SEED.  They may, or may not, include FIG as a
       participant or subcontractor.  No one should feel any pressure
       to include FIG, since the SEED technology is completely
       available free of charge to anyone.</li>

    <li> FIG has played a leadership role until now due to the fact that
       Ross Overbeek did the majority of design and implementation
       during the first year.  Argonne National Lab has begun to play
       a major role, the University of Chicago is building a major
       project upon the SEED, and it seems likely that actual
       leadership of the effort will occur on a largely informal basis
       and be shared between a growing number of senior participants
       (or junior, for that matter, if they demonstrate they can do
       it).</li></ol>

<p>FIG and the Computational Institute at University of Chicago have been
awarded a sizable five-year contract to construct a National Pathogen
Database.  This immediately solidifies the future of the SEED effort,
since that project will certainly be based upon SEED technology.  It
will also ensure that rapid development of new capabilities (most
notably inclusion of expression, structural, and variation data) will
occur.  We encourage other organizations to build upon the SEED (thus,
amortizing development costs over more projects).</p>

<p>That is all for now.  We plan to write a more extensive newsletter
after the Bielefeld meeting.  There are a number of exciting things
that are planned for that gathering, and we anticipate that there will
be much to report.</p>]]>
</content>
</entry>

<entry>
<title>April 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/04/april-2004-news.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-04-29T15:38:17Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.105</id>
<created>2004-04-29T15:38:17Z</created>
<summary type="text/plain"> The next meeting of SEED Developers Subsystem Annotations...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>The next meeting of SEED Developers</li>
<li>Subsystem Annotations</li>
</ul>]]>
<![CDATA[<h4>The Next Meeting of SEED Developers</h4>

<p>The next meeting of the SEED Developers will be in Bielefeld, Germany
on July 5-10.  The meeting will be split into two parts:</p>
<h5>July 5-6 will be devoted to Annotation of Subsystems</h5>

	<p>The main activity during these two days will be an intensive
	tutorial directed towards researchers that wish to annotate
	specific subsystems using the SEED.  We will help people begin
	to analyze their subsystem of interest as it manifests itself
	in the 250+ genomes in the current SEED.  This tutorial, as
	well as the whole topic of subsystem annotation, will be
	discussed below.</p>

	<p>A second component of these two days will involve an
	open discussion between researchers currently using the SEED
	to annotate subsystems.  The topics will include
<ol>
	   <li> Which subsystems are now being analyzed?</li>

	   <li> How should we prioritize the work?</li>

	   <li> How should we be presenting the research conjectures that are
	      already apparent from the existing efforts?</li></ol></p>

   <h5>July 7-10 will be largely devoted towards linking systems</h5>

        <p>There are at least three systems that we will be working with.
	 GenDB (the annotation system being developed at Bielefeld), the
	SEED, and Niels Larsen's "tree viewer".  An over-simplified 
	analysis would be that </p>

	       <blockquote>GenDB is a "DNA-oriented" system, allowing one to
	       identify and manipulate features on chromosomes (or
	       contigs), the SEED is largely "protein-oriented", and
	       Niels has focused on tools for presenting
	       data.</blockquote>

        <p>The merge of the capabilities of GenDB and the SEED will be
	achieved (we hope) by architecting a fairly general interface
	for both systems (allowing GenDB to interface with other
	"protein-oriented" systems and the SEED to interface with
	other "DNA-oriented" systems).  Hopefully, a user will be able
	to flip back and forth between environments, and work done in
	each environment will be communicated back to the other.</p>

	<p>Niels' interface tools will be needed to display and browse
	large trees.  This is particularly useful when overlaying gene
	sets onto functional overviews (e.g., during interpretation of
	microarray data, examination of complementary metabolisms
	between host and symbiont, and so forth).  Niels is also
	building open source tools, so we believe that this represents
	a good overlap in interests.  He will be visiting from Denmark.</p>

<p>The success of the efforts to merge the systems, get everyone familiar
with the new Bielefeld server for calling genes in prokaryotes, and
discuss servers for computing similarities will depend to some extent
on how much time we devote in June to get ready.  We are planning on
having weekly discussions via conference calls (access grid
communication is definitely needed and hopefully will become the basis
for these meetings before too long).</p>

<p>It is likely that in parallel to these activities Ross, with help from
his friends, will offer a tutorial in use of the SEED for students at
Bielefeld.  We previously had a very good experience with a 2-day
intensive class at Franklin and Marshall College.  This time, we plan
on making it less intensive (1 hour per day) over a somewhat longer
period.  The basic idea would be to introduce the students to the
SEED, suggest ways to find research topics, and so forth.</p>

<p>We really do not have an accurate idea yet who will be coming.  It
seems likely that there will be a number of people coming for only the
first few days.  In any event, please contact <a href="mailto:Alice.McHardy@CeBiTec.Uni-Bielefeld.DE">Alice McHardy</a>, <a href="mailto:fm@Genetik.Uni-Bielefeld.DE">Folker Meyer</a>, or <a href="mailto:Ross@TheFIG.info">Ross Overbeek</a> if
you plan on attending.  Everyone will be welcome, but we do have to
get a sense of who will be there in order to handle local
arrangements.  Alice can help you by suggesting a hotel. </p>

<h4>Subsystem Annotations</h4>


<p>Annotation of subsystems across many genomes is rapidly becoming a
central component of the FIG collaboration.  Researchers from at least
six distinct institutions are now actively working on the project.
This is pretty exciting, since the initial tools were released only
last month.</p>

<p>We believe that the current tools are working well in the sense that

   <ol><li> backups now occur automatically, </li>
      
   <li> the peer-to-peer exchange of subsystems between completely
      different versions of the SEED (e.g., with different versions of
      RefSeq) seems to work well, and</li>

   <li> weekly updates introduce features that are rapidly reducing the
      effort required to construct a reasonable spreadsheet for a
      subsystem.</li></ol></p>

<p>We believe that major new tools that will significantly increase the
productivity of participants should be available within two weeks.
After those become available, tested, and incrementally improved, we
will make the entire set of tools available to researchers on a public
server (probably at the University of Chicago, initially).</p>

<p>The key idea of subsystem annotation is simple: a person wishing to
study a subsystem using comparative analysis analyzes exactly which
organisms it occurs in, which alternative variants can be
distinguished, which functional roles are present in each of the
organisms, and which genes implement those roles.  This is roughly
the raw data behind what you see in many biological review articles.
By organizing the data in a form that can be supported within the
SEED, it becomes much easier to analyze new genomes as they become
available, to clarify outstanding uncertainties, and to curate the
system over time.  We believe that something like what we are now
implementing will be the foundation for many, many researchers'
analyses over the coming years.  As thousands of diverse genomes
become available, this becomes the framework for a person to maintain
an accurate portrait of the system or systems with which he works.</p>

<p>We will hold a tutorial on how to use the system in early July in
Bielefeld, as we mentioned above.  We will probably hold one at
Argonne National Lab in Chicago before then.  The tutorial will
include a general overview of how to use the SEED, so we suspect that
it will be of general interest.  If you would be interested in
attending a 2-3 day tutorial in use of the SEED and annotation of
subsystems, please contact <a href="mailto:Veronika@TheFIG.info">Veronika</a> or <a href="mailto:Ross@TheFIG.info">Ross</a>.</p>
]]>
</content>
</entry>

<entry>
<title>March 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/03/march-2004-news-1.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-03-03T17:46:31Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.108</id>
<created>2004-03-03T17:46:31Z</created>
<summary type="text/plain"> The End of the Development Stage The Initial Deployment Stage The Project to Annotate 1000 Genomes...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>The End of the Development Stage</li>
<li>The Initial Deployment Stage</li>
<li>The Project to Annotate 1000 Genomes</li>
</ul>]]>
<![CDATA[<p>Well, the SEED Developers' meeting in San Diego has completed.  With
the start of these meetings, the SEED Project is now moving from a
development stage into what might be called an "initial deployment
stage".   The development stage succeeded in producing a system that
has some remarkable properties:</p>
<ol><li>It does support comparative analysis of a rich set of
       genomes.  The nonredundant protein database used to develop
       similarities now has 1.92 million entries, and the entire
       database contains over 270 more or less complete genomes.
</li><li>
    It is being used in at least two sequencing/annotation efforts
       as a framework to analyze the data.
</li><li>
    It has been used successfully in several short courses as a
       framework for students to explore genomic data.  We found that
       within two days, the central functionality could be conveyed,
       and students were able to focus on meaningful analysis issues.
       In one case, several potential research topics were exposed.
</li><li>
    It is good enough to justify initiating what we consider a key
       FIG effort: the Project to Annotate 1000 Genomes.
</li><li>
    The system can be easily installed.  The process is not yet
       perfect; problems arise in almost every case, but we can
       usually have a system up with only a few minutes of human
       effort (the time to transmit copies over the network and to
       build the database is substantial, but normally requires no
       intervention).  It runs on both Macs and Linux systems using
       either Postgres or MySQL.
</li><li>
    It supports a rudimentary p2p update/data exchange capability.
       This part still needs work, but not too much.  It will, like
       the configuration/installation process, become solid fairly
       rapidly as deployment begins.
</li></ol><p>
We view the next stage as potentially quite exciting, and certainly
pretty demanding.  The most important development has been our
increase in manpower: the SEED effort now includes four part-time
researchers and we have increasing help from collaborators (perhaps,
"participants in the project" is a better term).  We are receiving
help in enhancing the system from a number of individuals, and in some
cases efforts have already begun to make it possible for independently
developed tools to be easily integrated within the SEED.  So,
development and manpower seem likely to explode.  Controlling the
results to produce a reliable, distributable system that is useful to
everyone should not be too difficult.
</p>
<h4>The Initial Deployment Stage</h4>


<p>So, what is going to happen during "initial deployment", and how will
we know when that stage completes?  Here is how we see it now:
</p><ol><li>
   We will put up a public server in March.  We plan on
      cross-linking with all of the larger integrations (and as many
      of the useful smaller efforts as we can).  The original WIT
      effort at Argonne National Lab had to be brought down due to
      issues that arose in a hacker attack, and we plan on using this
      initial public SEED as a replacement to which other projects
      will link.  This requires defining a protocol for exposing
      links.  This should be completed in March.
</li><li>
   We will finally offer versions to any sequencing/annotation
      efforts that wish to use it (with the understanding that there
      will be some initial learning and help required to get things
      going smoothly).  The ANL/FIG team will do its best to help this
      deployment, but we want everyone to realize that this effort is
      largely a volunteer one.  We welcome the experience, since
      exposing shortcomings and correcting them is exactly what is
      needed at this stage, but we also expect users to view
      themselves as participating in the development.
</li><li>
   We will begin the Project to Annotate 1000 Genomes.
</li><li>
   A number of individuals cooperating within the SEED effort will
      together construct a web service to provide gene calls for
      prokaryotic sequence.  We believe that these will be relatively
      accurate.  The service is planned to reside at the University of
      Bielefeld in Germany.  A second server to provide computation of
      similarities for genomes to be added to the SEED is planned at
      Argonne National Lab.  Along with the SEED itself, these servers
      will make it possible for university sequencing/annotation
      projects to have access to state-of-the-art annotation tools.
</li><li>
   We will call an end to "initial deployment" when we <ul>
<li>
      have successfully installed five systems in a row by having
         users just follow instructions (i.e., without our help),
</li><li>
      have sequencing/annotation efforts that can routinely add new
         genomes and update versions of old ones, 
</li><li>
      have over 20 users that synchronize weekly using p2p
         operations, and
</li><li>
      have a set of at least 10 annotated systems that are
         routinely distributed and updated via p2p operations.
</li></ul></li></ol>

<h4>The Project to Annotate 1000 Genomes</h4>

<p>
There are differences of opinion concerning what makes the SEED
Project important.  Ross' view is that the SEED should be thought of
as a workbench for producing "subsystem annotations", and that these
annotations will eventually be understood to be the most important
development growing out of the SEED Project.  The roles of 
</p><ol><li>
	supporting initial annotations and 
</li><li>
	helping individual researchers explore genomic data 
</li></ol><p>
are important, but not nearly as important as the role of supporting
subsystem annotations.  At this point, we estimate that roughly 50% of
genes in the public archives have solid function assignments, about
20% are completely uncharacterized, and about 30% have either very
broad class characterizations or are overspecified.  Many of the
genes within this last 30% have been given accurate characterizations
in review articles, but the contents from these review articles often
fail to reach the public sequence archives.  We propose to use the
SEED as a framework for supporting development of "subsystem
annotations", which can be viewed as the organized data to be included
in a review article.  From this perspective, reviews form the
essential cutting edge for annotations, and the SEED should
become a vehicle for reducing the effort to produce a review.  By
providing this service, along with the capability to easily exchange
and export subsystem annotations, the SEED will facilitate the flow of
assignments from reviews to sequence annotations.  A researcher basing
his career on analysis of a specific subsystem will have a framework
for producing a sequence of reviews, using tools that support
maintenance of the subsystem annotation (largely automating the
addition of new genomes as they become available).
</p><p>
We believe that there are many, many individuals with extensive
experience in a given subsystem that would be willing to produce and
maintain  one of these "subsystem annotations".  Each such curated
subsystem would amount to the raw data standing behind a detailed
review article.  The key issues that must be addressed are roughly
these:
</p><ul><li>
	Once a detailed subsystem has been carefully constructed,
	   it must not be lost.  This is the most common worry.  An
	   expert using a system like the SEED is always concerned
	   that some shift of IDs, new release or whatever will result
	   in loss of the expert's work.  By making it straightforward
	   to export annotated subsystems and exchange them via p2p
	   operations, we believe that we have addressed this issue.
</li><li>
        The system must significantly reduce the effort required to
           extract and relate relevant data.  It is not unusual for an
           expert to spend years in developing a detailed picture of a
           subsystem; we believe that this can be dramatically reduced
           by the development of appropriate tools, but it is
           essential that the individuals developing tools participate
           closely in the annotation process; otherwise, it is likely
           that powerful tools will be developed that do not address
           the rate-limiting operations.
</li></ul>

<p>A minimal notion of "curated subsystem" would include</p>
<ol><li>
	a list of the functional roles included in the subsystem
           (for metabolic subsystems, this amounts to a list of
           catalytic domains) and
</li><li>
        a spreadsheet with genomes along one axis, and functional
           roles along the other.  Each cell would contain a list of
           the genes in the given organism that implement a specific
           functional role.
</li></ol>
<p>The rows of the spreadsheet each represent the genes implementing the
set of functional roles in a given organism, while each column
presents a reliable set of genes implementing a specific functional
role.
</p><p>
An extended notion would include a number of other items as well:
</p><ol start="3"><li>
        a diagram representing the relationships between the
           functional roles (in the case of a pathway, a depiction of
           the reactions that make up the pathway),
</li><li>
	a discussion of the "variants" represented by the entries
           in the spreadsheet, and
</li><li>
	a detailed commentary on each set of genes implementing a
           functional role (describing what can be inferred about the
           evolutionary origins of the set of genes).
</li></ol>
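<p>The minimal notion of a curated subsystem described above (a list of
functional roles plus a genome-by-role spreadsheet) can be sketched in
a few lines of Python.  This is only an illustration, not the SEED's
actual data model: the class name, method names, tab-separated export
format, and all genome and gene IDs are assumptions made up for the
example.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CuratedSubsystem:
    """Hypothetical minimal curated subsystem: roles plus a spreadsheet."""
    name: str
    roles: List[str]
    # cells[genome][role] holds the genes implementing that role in that genome
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene: str) -> None:
        """Record that a gene implements a functional role in a genome."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        self.cells.setdefault(genome, {}).setdefault(role, []).append(gene)

    def row(self, genome: str) -> Dict[str, List[str]]:
        """One spreadsheet row: genes for every role in a single organism."""
        by_role = self.cells.get(genome, {})
        return {role: list(by_role.get(role, [])) for role in self.roles}

    def column(self, role: str) -> Dict[str, List[str]]:
        """One spreadsheet column: genes implementing a role, per genome."""
        return {g: list(r[role]) for g, r in self.cells.items() if r.get(role)}

    def to_tsv(self) -> str:
        """Export as tab-separated text (an assumed, illustrative format)."""
        lines = ["\t".join(["genome"] + self.roles)]
        for genome in sorted(self.cells):
            row = self.row(genome)
            lines.append("\t".join([genome] + [",".join(row[r]) for r in self.roles]))
        return "\n".join(lines)
```

<p>A plain-text export of this kind is one way the spreadsheet could be
written out and exchanged; the newsletter does not specify the format
actually used for p2p exchange.</p>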
<p>Recently a major effort has been launched to include within the SEED
the capability of curating these subsystem annotations and exchanging
them via p2p operations.</p>

<p>In March, Ross, Andrei, Veronika and Gary Olsen will begin curation of
specific subsystems, synchronizing all assignments and annotations on
a weekly basis.  Once this initial effort is running smoothly, we
intend to rapidly expand the set of individuals participating.
</p><p>
So, in the next newsletter, expect a detailed discussion of the
gene-calling and similarity servers, along with a discussion of the
outcome of the initial efforts to begin annotation of subsystems.
</p><p>
Finally, our sincere thanks to Dusko Ehrlich and Barny Whitman for
sending statements supporting the utility of the SEED.  We owe you.</p>]]>
</content>
</entry>

<entry>
<title>February 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/02/february-2004-n.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-02-13T18:08:27Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.110</id>
<created>2004-02-13T18:08:27Z</created>
<summary type="text/plain"> Meeting in San Diego A Separate Matter...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>Meeting in San Diego </li>
<li>A Separate Matter</li>
</ul>]]>
<![CDATA[<p>We are writing this short newsletter to make some comments on the
upcoming SEED Developers Meeting in San Diego (Feb 22-24) and to cover
a small number of miscellaneous items.</p>

<p>For those coming to the meeting, it is critical that you establish
contact with Andrei Osterman or Ross Overbeek.  Their email addresses
are as follows:</p>

<table><tr><td>Ross Overbeek</td><td>&nbsp;</td><td>         Andrei Osterman</td></tr>
   <tr><td> Ross@TheFIG.info  </td><td>&nbsp;</td>            <td>Andrei@TheFIG.info</td></tr>
    </table>

<p>The meeting will start at the North Campus of Burnham Institute at
10am on Sunday, Feb 22.  The address is</p>

     <table><tr><td>The Burnham Institute</td></tr>
     <tr><td>10901 North Torrey Pines Road </td></tr>
     <tr><td>La Jolla, California, 92037</td></tr>
     <tr><td>858-646-3100</td></tr></table>

<p>However, we strongly recommend that you link up a bit before that.
There is nothing worse than being somewhat disoriented in a strange
town.   The recommended list of hotels is as follows:</p>

<table><tr><td>Best Western Stratford Inn</td></tr>
<tr><td>710 Camino Del Mar</td></tr>
<tr><td>Del Mar, California  92014</td></tr>
<tr><td>Tel.:        858-755-1501, x.121 (Dana Hill, Director of Sales)</td></tr>
<tr><td>Fax:        858-755-4704</td></tr>
<tr><td>Toll:        1-800-446-7229</td></tr>
<tr><td>Web:        www.pacificahost.com</td></tr>
<tr><td>E-mail:    dana@pacificahost.com</td></tr>
<tr><td>Rate:        $79 up</td></tr>
<tr><td>&nbsp;</td></tr>
<tr><td>Del Mar Inn</td></tr>
<tr><td>720 Camino Del Mar</td></tr>
<tr><td>Del Mar, California 92014</td></tr>
<tr><td>Tel.:      858-755-9765</td></tr>
<tr><td>Fax:      858-792-8196</td></tr>
<tr><td>Toll:      1-800-451-4515</td></tr>
<tr><td>Web:     www.delmarinn.com</td></tr>
<tr><td>Rate:     $99 up</td></tr>
<tr><td>&nbsp;</td></tr>
<tr><td>Hampton Inn</td></tr>
<tr><td>11920 El Camino Real</td></tr>
<tr><td>San Diego, California  92130</td></tr>
<tr><td>Tel.:      858-792-5557 (Mike Macklosky, Sales Manager)</td></tr>
<tr><td>Fax:      858-792-7263</td></tr>
<tr><td>Web:     www.hamptoninndelmar.com</td></tr>
<tr><td>Rate:     $94</td></tr></table>

<p>There is a free shuttle service to the Burnham Institute from these
hotels, which is what makes them particularly attractive (the prices
are pretty good, as well).  Ross is planning on staying at the Best
Western Stratford Inn and suggests that anyone who wishes meet him for
breakfast on Sunday morning at 8am.</p>

<p>This will be a fairly unstructured meeting.  The basic rules will be
as follows:</p><ol>

   <li> Participants should come with specific objectives (things they
      want done or data they want to walk away with).  It is their
      responsibility to formulate a strategy to meet those
      objectives.   For example, Ross has roughly these goals:<ol>

      <li> He will try to set up a collaboration to call genes
         (including starts) accurately for prokaryotes.  The initial
         emphasis will probably be for organisms in which some
         objective (e.g., proteomic) data exists. </li>

      <li> The SEED is weak on DNA-related issues.  The team at
         Bielefeld University has built an open source annotation
         system that is quite strong in these areas.  A major topic
         will be the issue of what it would take to merge systems,
         exchange tools, or whatever.  No serious discussions have yet
         taken place, but we hope to have some at the meeting.</li>

      <li> Ross will arrive with the pre-release version of the SEED
         that will be modified for an official release about mid
         March.  He will offer copies to those who wish them.</li>

      <li> He will demonstrate the new annotation tools/strategy and
         wishes to seriously discuss how to launch the major
         annotation effort in March.</li></ol>

      These goals may not match very closely with what other
      participants want.  That is ok.</li>

   <li> We will occasionally have large discussions covering topics of
      general interest, but it is expected that smaller groups
      focusing on specific issues will emerge.</li>

  <li> If there is a key topic that you wish to speak on (for, say, up
      to 30 minutes), feel free to prepare a talk.  However, we will
      not be scheduling talks, and our view is that they should happen
      spontaneously and may be for limited groups.</li>

   <li> For those people with wireless cards (airports), if you plan on
      using them at the Burnham, you must send your ID to Andrei
      Osterman.  On a Mac, you can get this from the System
      preferences/Network settings display.  This is critical if you
      plan to use the airport on Sunday.  Andrei needs to get his
      system support people to help set this up the week before (i.e.,
      next week). </li></ol>

<p>Ross has tried this unstructured approach in the past, and it has
worked well.  Some of the rest of us are a bit nervous, but we shall
see. </p>


<h4>A Separate Matter</h4>

<p>FIG was recently asked (by a granting agency) to justify efforts to
build integrations.  The precise request was as follows:</p>
   <blockquote>Please describe the usefulness of KEGG, WIT, ECOCYC, and ERGO to
    the microbiology community</blockquote>
<p>We have already written about what advantages such systems provide, but
we believe that the thrust was more "what have they actually done to
help the community?"  To answer this, we need microbiologists who have
actually benefited in some way from one or more of these systems.  If
any such person would like to write 2-3 sentences, we would be
grateful.  Please note that we are not asking for comparisons of the
systems, just general support for the view that integrations to
support comparative analysis will become increasingly significant.</p>]]>
</content>
</entry>

<entry>
<title>January 2004 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/01/january-2004-ne-1.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-01-13T19:52:25Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.122</id>
<created>2004-01-13T19:52:25Z</created>
<summary type="text/plain"> SEED Tech-Developers Meeting...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>SEED Tech-Developers Meeting</li>
</ul>]]>
<![CDATA[<p>This newsletter is coming out after only a short break, since we
wanted to notify people about the first planned SEED Tech-Developers
Meeting.  It will be held in San Diego at the Burnham Institute on
February 22-24.  We realize that this is terribly short notice, but
plans were not final until today.</p>

<p>The meeting will be for people actually writing code for the
SEED; we intend to hold these meetings to help finalize what goes into
the next release.  The topics now on the agenda are as follows:</p><ol>

    <li> We have made a beta release (0.0.0) available to a few people,
       and comments/criticisms are coming in.  These will be reviewed.</li>

    <li> It has become essential that we immediately address the issue
       of adding genomes to the SEED, since we now have 20-40 that
       need to be added as soon as possible.  This will include<ol>

	    <li> construction of a server to call genes in prokaryotes,</li>

	    <li> reformatting data from several key sources (i.e.,
	       building tools to simplify additions from these major
	       sources),</li>

	    <li> deciding how much effort to put into initial automated
	       annotations (and what is wrong with the existing tool),
	       and</li>

	    <li> dividing up the work and responsibilities.</li></ol></li>

    <li> We need to finalize the additions to the SEED needed to support
       the annotation of subsystems.  Ross wants to launch the
       subsystem annotation effort in March, so it is critical that
       these tools be finalized and blessed at the meeting (they will
       be reviewed and corrected in response to criticisms from the
       annotators, but we need an initial tool that is basically
       adequate to get things moving).</li>

    <li> We will begin discussions on the following two topics (the
       actual coding/integration will begin after the Feb meeting and,
       hopefully, will move to center stage for the April or June
       meetings):<ol>

	    <li> The SEED does not handle processing DNA well.  We need
	       facilities for displaying/aligning/editing regions of
	       DNA.  Several users have pointed towards capabilities
	       offered by Artemis as an example.  Bielefeld's system
	       has addressed these issues.  We need to talk about what
	       capabilities are needed and how to best introduce
	       them.</li>

	    <li> It is clear that the SEED will be used heavily in
	       interpretation of expression data (ESTs, microarrays, and
	       proteomic data).  The plans on how this topic will be
	       handled will be talked about over beers.</li></ol></li></ol>

<p>Other topics can be added, but these should consume most of the three
days.  We recommend that those who come spend all three full days, but
it is not essential.  Let us know what your plans/restrictions are,
and we will try to work things out.  Again, this is a working meeting
for groups/individuals who intend to actively participate in SEED
development.  We are going to schedule another series of meetings
targeted towards SEED users and annotators.  These should start by
March; we will keep you posted.</p>

<p>If you have questions or need help, please contact <a href="mailto:Ross@TheFIG.info">Ross</a>, <a href="mailto:Veronika@TheFIG.info">Veronika</a>, or <a href="mailto:Andrei@TheFIG.info">Andrei</a>.</p>
]]>
</content>
</entry>

<entry>
<title>Manifesto</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2004/01/manifesto-1.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2004-01-01T20:37:30Z</issued>
<id>tag:www.figresearch.com,2004:/news//3.86</id>
<created>2004-01-01T20:37:30Z</created>
<summary type="text/plain">The Project to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions, and Construct the Corresponding Stoichiometric Matrices...</summary>
<author>
<name>Ross Overbeek</name>

<email>ross@thefig.info</email>
</author>
<dc:subject>General</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<b>The Project to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions,  and Construct the Corresponding Stoichiometric Matrices</b>
]]>
<![CDATA[<h4>Introduction</h4>
<p>In December 2003, the Fellowship for Interpretation of Genomes (FIG) initiated the Project to Annotate 1000 Genomes.  The explicit goal was to develop a technology for more accurate, high-volume annotation of genomes and to use this technology to provide superior annotations for the first 1000 sequenced genomes.  Members of FIG were convinced that the current approaches for high-throughput annotation, based on protein families and automated pipelines that processed genomes sequentially, would ultimately fail to produce annotations of the desired accuracy.  We believe that the key to development of high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes.  The existing annotation approaches, in which teams analyze a whole genome at a time, ensure that annotators have no special expertise relating to the vast majority of genes they annotate.  By having individuals annotate single subsystems over a large collection of genomes, we allow individuals with expertise in specific pathways (or, more generally, subsystems) to perform their task with relatively high accuracy. 
</p><p>
The early stages of the effort began at FIG, but quickly spread to a number of cooperating institutions, most notably Argonne National Lab.  During the first year of the project, we have developed detailed encodings of subsystems that include a majority of the genes from subsystems that make up the core cellular machinery.  More importantly, we have developed the initial versions of technology needed to support the project. 
</p><p>
The Project to Annotate 1000 Genomes has reached the stage where it is clear that it will very shortly produce what we call informal metabolic reconstructions that cover the majority of central metabolism as it is implemented in the close to 300 more-or-less complete genomes that are now available.  We think of an informal metabolic reconstruction as a partitioning of the cellular machinery into subsystems, the specification of the functional roles that make up each subsystem, and the inventory of which genes in a specific organism implement the functional roles.  What is needed to support both qualitative analysis and effective quantitative modeling is to convert these informal metabolic reconstructions into formal metabolic reconstructions. By a formal reconstruction, we mean an accurate encoding of the metabolic network.  The goal of such an encoding is to construct a list of metabolites and a detailed reaction network that is internally consistent (in the sense that metabolites that are produced by reactions are connected as substrates to other reactions or to specific transporters,  and that all metabolites that act as substrates are produced by other reactions or provided by transporters).  Perhaps, a better way to put this is that all apparent anomalies are highlighted as such, and the essential components of the metabolic network are accurately encoded.  The output of such an effort is normally what is termed a
stoichiometric matrix, the basic resource required to support stoichiometric modeling.  One of the central goals of this enlarged effort is to develop accurate stoichiometric matrices for each of the 1000 genomes; we refer to this component of the effort as The Project to Produce 1000 Stoichiometric Matrices.
</p><p>
It is our belief that the development of the technology required to mass-produce accurate genome annotations will ultimately allow fully automated annotation pipelines to achieve relatively high accuracy.  Similarly, the existence of 1000 accurate formal metabolic reconstructions would constitute a resource that would allow rapid and accurate development of stoichiometric matrices for newly-sequenced genomes.  That is, besides producing accurate annotations, informal metabolic reconstructions, formal metabolic reconstructions, and stoichiometric matrices for a large collection of diverse genomes, we believe that the expanded project will produce technology that will support nearly automatic, very rapid characterization of new genomes.
</p><p>
All of the encoded subsystems, metabolic reconstructions and stoichiometric matrices will be made freely available on open web sites.  In addition, the software environments used to develop the encoded subsystems and stoichiometric matrices will be developed and supported as open source software.  By making the fundamental data items, the encoded subsystems and stoichiometric matrices, freely available to the community, we expect to stimulate development of alternative software systems to support curation and maintenance of these items.
</p>
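<p>The internal-consistency requirement for a formal metabolic
reconstruction described in this introduction (every metabolite that is
produced is consumed elsewhere or exported, and every substrate is
produced elsewhere or imported) can be illustrated with a small Python
sketch.  It is a toy under stated assumptions, not the project's
software: the function names, the reaction and metabolite names, and
the treatment of transporters as simple sets of imported and exported
metabolites are all invented for the example, and reversible reactions
are not modeled.</p>

```python
from typing import Dict, Iterable, List, Tuple

# reaction name -> {metabolite: coefficient}; negative = consumed, positive = produced
Reactions = Dict[str, Dict[str, int]]


def stoichiometric_matrix(reactions: Reactions) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (metabolites, reaction names, matrix) with one row per metabolite."""
    metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
    names = sorted(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def anomalies(reactions: Reactions,
              imported: Iterable[str] = (),
              exported: Iterable[str] = ()) -> Dict[str, str]:
    """Flag metabolites that break the consistency requirement."""
    imported, exported = set(imported), set(exported)
    metabolites, _, matrix = stoichiometric_matrix(reactions)
    problems = {}
    for metabolite, row in zip(metabolites, matrix):
        produced = any(c > 0 for c in row) or metabolite in imported
        consumed = any(c < 0 for c in row) or metabolite in exported
        if produced and not consumed:
            problems[metabolite] = "produced but never consumed or exported"
        elif consumed and not produced:
            problems[metabolite] = "consumed but never produced or imported"
    return problems


# Toy linear pathway A -> B -> C (hypothetical names).
toy = {"R1": {"A": -1, "B": 1}, "R2": {"B": -1, "C": 1}}
```

<p>In this toy pathway, supplying A through an importer while leaving C
without an exporter causes C to be reported as an anomaly, which
corresponds to the idea that apparent anomalies should be highlighted
rather than hidden.</p>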
<h4>The Project to Annotate 1000 Genomes</h4>
<p>We have chosen to conceptually break the Project to Annotate 1000 Genomes into three stages.  We discuss these stages as if they occurred sequentially; in fact, all three stages are now in progress.  To understand the three stages, the reader must have at least a rudimentary grasp of what we mean by an encoded subsystem and an informal metabolic reconstruction.  When we speak of a subsystem, we think of a set of related functional roles.  In a specific organism, a set of genes implement these roles, and we think of those genes as constituting the subsystem in that organism.  That is, we are really dealing with an abstract notion of subsystem (in which the subsystem is a set of functional roles) and instances of the subsystem in a specific organism (in which a set of genes implements the abstract functional roles).  Precisely the same subsystem and functional roles exist in distinct organisms, although obviously the genes are unique to each organism.
</p><p>
Subsystems are thought of as possibly having multiple variants.  Organisms that have operational versions of a subsystem may well have genes that implement slightly different subsets of the functional roles that make up the subsystem.  Each subset of functional roles that exists in at least one organism with an operational version of the subsystem constitutes an operational variant.
</p><p>
We think of an informal metabolic reconstruction for an organism as a set of operational variants of subsystems that are believed to exist for the organism.  In this conceptualization, one does not have a meaningful functional hierarchy or DAG; rather, we simply have an inventory of functional roles that are implemented in the organism, along with the variants of subsystems that they implement.  We do believe that the task of imposing an actual hierarchy is relatively straightforward in comparison with the effort required to construct the set of operational variants.   In some contexts, we have included a functional overview in which the subsystems are embedded at the lowest levels.  It is clear that, given a diverse collection of informal metabolic reconstructions, appropriate functional hierarchies can be generated with relatively few resources.
</p><p>
Our encoding of a subsystem can now be reduced to</p>
<ol><li>
a specification of a set of functional roles (this amounts to the abstract subsystem) and
</li><li>sets of genes which implement the operational variants in a number of genomes.  These genes are given as a subsystem spreadsheet in which each row corresponds to a single genome, each column corresponds to a single functional role, and each cell contains the set of genes in that genome that are believed to implement the given functional role.</li>
</ol>
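<p>To make the notions of operational variant and informal metabolic
reconstruction concrete, here is a minimal Python sketch that works
from a subsystem spreadsheet of the kind just described.  It rests on
a simplifying assumption that is not part of the project's definition:
any genome with at least one gene in the spreadsheet is treated as
having an operational version of the subsystem, whereas in practice
that judgment belongs to the curator.  The function names and all
genome, role, and gene IDs are hypothetical.</p>

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List

# genome -> functional role -> genes believed to implement that role
Spreadsheet = Dict[str, Dict[str, List[str]]]


def roles_present(row: Dict[str, List[str]]) -> FrozenSet[str]:
    """Subset of functional roles that have at least one gene in a row."""
    return frozenset(role for role, genes in row.items() if genes)


def operational_variants(sheet: Spreadsheet) -> Dict[FrozenSet[str], List[str]]:
    """Group genomes by the subset of roles they implement (one group per variant)."""
    variants = defaultdict(list)
    for genome in sorted(sheet):
        present = roles_present(sheet[genome])
        if present:  # assumption: an empty row means the subsystem is absent
            variants[present].append(genome)
    return dict(variants)


def informal_reconstruction(genome: str,
                            subsystems: Dict[str, Spreadsheet]) -> Dict[str, FrozenSet[str]]:
    """Inventory of subsystem variants (as role subsets) found in one genome."""
    inventory = {}
    for name, sheet in subsystems.items():
        present = roles_present(sheet.get(genome, {}))
        if present:
            inventory[name] = present
    return inventory
```

<p>Note that the result is a flat inventory per organism, with no
functional hierarchy imposed, which matches the conceptualization
given above.</p>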
<p>The Project to Annotate 1000 Genomes amounts to an effort to produce detailed and comprehensive encodings of several hundred subsystems, which will impose assigned functions on genes in each of the genomes.  The total percent of genes that can be assigned functions this way is probably on the order of 50-70% in most genomes (in large eukaryotic genomes the total is obviously substantially lower).   The percent will grow as our understanding grows.  What should be noted is that the accuracy of these assignments will be substantially better than that of current assignments, and the conserved cellular machinery almost all falls within the projected subsystems. 
</p><p>
Once we have produced our initial set of annotations, we believe that automated pipelines and protein families are excellent tools for propagating them.  Protein families are, in fact, a key component of annotation and provide the fundamental mechanism for projection of function between genes. The added dimension provided by subsystems, along with the manual curation required to develop accurate initial encodings of subsystems, is an essential technology for increasing the accuracy and effectiveness of protein families.  Ultimately the encoded subsystems will be used to make incremental, essential corrections to collections of protein families (like those supported by UniProt and COGs), and a basis for much more accurate annotation will emerge.
</p><p>
We now proceed to describe the details of the three stages.
</p>
<h5>Stage 1: Development of Initial Encodings of Subsystems</h5>
<p>The initial stage of the project will involve development of approximately 100-150 subsystems that will cover most of the conserved cellular machinery in prokaryotes (and all of the central metabolic machinery in eukaryotes).  This work will be done largely by trained annotators who achieve a limited mastery of specific subsystems via review articles and detailed analysis of the collection of genomes.  These individuals can define the abstract subsystems and add most genomes to the emerging spreadsheets, but not without error.  They are necessarily far less skilled than experts who have invested tens of years in study of specific subsystems.
</p><p>
These initial subsystems will have many uses.  They can be used to enhance sets of curated protein families, to clarify identification of gene starts, and to develop a consistent set of annotations.  They will form the basis of informal metabolic reconstructions, and will be used to support the development of formal metabolic reconstructions.  However, given the relative lack of expertise of these initial annotators and the fact that they will seldom have access to the wet lab facilities needed to remove ambiguities in assignments, errors will inevitably remain.
</p>
<h5>Stage 2: The Use of True Experts and the Wet Lab to Refine the Encodings</h5>
<p>
The second stage will involve the gradual refinement and enhancement of the original subsystem encodings by domain experts.  Almost every subsystem spreadsheet makes it clear that numerous detailed questions remain to be answered.   These questions relate to correcting gene calls, correcting frameshifts, refining function assignments, and removing ambiguities (either via bioinformatics-based analysis or through actual wet lab efforts).
</p><p>
The participation of domain experts will be critical, but it seems most likely that a relatively small set will choose to get involved until the utility of the approach becomes obvious.  We already have some domain experts (in translation, transcription, and  a limited number of metabolic subsystems) participating in the effort.  We believe that this number will grow rapidly over the next 2-3 years.
</p><p>
It should be emphasized that upon completion of Stage 2 we will have accurate annotations and a solid foundation for the construction of stoichiometric matrices.
</p>
<h5>Stage 3: Understanding the Evolutionary History of the Genes within the Subsystem</h5>
<p>
The third stage involves determination of the evolutionary history of the genes within the subsystem.  To understand what this involves and the utility of this type of analysis, we simply recommend two papers by the team led by Roy Jensen:
</p><ol>
<li>Ancient origin of the tryptophan operon and the dynamics of evolutionary change by Xie, Keyhani, Bonner, Jensen, Microbiol Mol Biol Rev. 2003 Sep;67(3):303-42</li>
<li>Inter-genomic displacement via lateral transfer of bacterial trp operons in an overall context of vertical genealogy, by Xie, Song, Keyhani, Bonner, Jensen, BMC Biology, 2004, 2:15</li>
</ol>
<p>These papers elegantly display the exact style of analysis required to uncover and clarify the evolutionary history of the relevant genes.    Essentially, trees must be built containing all of the genes implementing each specific functional role (multiple trees may be needed for distinct forms).  Those trees that display a common topology indicate which columns in the spreadsheet can be used to infer the most probable vertical  history of the subsystem.  Once the overall history has been clarified, it becomes possible to attempt clarification of horizontal transfers, to reconstruct the history of clusters on the chromosome, and in some cases to tie the analysis to regulatory issues.
</p><p>
The effort required to do this style of analysis well is high.  While we expect the initial efforts to go slowly, we also expect experience and advances in tools to dramatically reduce the required effort.  In any event, it is clear that this stage will not be completed in the next few years, but will undoubtedly stimulate large amounts of related research.
</p>
<h5>Filling in the Missing Pieces</h5>
<p>The encoded subsystems produced by the Project to Annotate 1000 Genomes offer a detailed picture of exactly what components have been identified and are present in each genome.  Perhaps as significant, they vividly display exactly what is missing or ambiguous, allowing one to arrive at an accurate inventory of gaps in our understanding.
</p><p>
The issue of how best to address these gaps is an integral part of the project.  The technology that is emerging is what we refer to as the bioinformatics-driven wet lab.  This concept refers to the development of a wet lab that utilizes conventional biochemical and genetic techniques in a framework designed to maximize the overall number of confirmations.  It is driven by predictions arising from the analysis of subsystems, and it targets a prioritized list of conjectures.  That is, the explicit goal is to fill in as many gaps and remove as many ambiguities as possible for resources consumed.
</p><p>
Although it is inconceivable that one experimental group would be able to assess all of the functional predictions, we believe that integrating an experimental component into our annotation/modeling effort will directly support our main goal.  In addition to verification of key predictions and removal of central ambiguities, it will validate the overall approach and set an example for other groups worldwide.
</p>
<h4>The Project to Develop 1000 Stoichiometric Matrices</h4>
<p>
We believe that the informal metabolic reconstructions are of substantial value by themselves.    Indeed, numerous applications are quite obvious.  However, they are not enough to support quantitative modeling.    Whole genome modeling will require development of stoichiometric matrices, an effort that will pay many dividends.  The most immediate payout is as quality control on the informal metabolic reconstruction.   Just as the use of subsystems imposes a critical set of consistency checks on the assignment of function to genes, an attempt to develop an internally consistent reaction network imposes a strong consistency check on both the annotations and assertions of the presence of specific  subsystems.
</p><p>
Over the last 4-5 years, the success of stoichiometric modeling has set the stage for large-scale employment of the technology.   The key limiting factor is the development of the stoichiometric matrix itself.  This is a time-consuming task that frequently requires on the order of a year for a skilled practitioner.  Many actual modeling efforts have foundered on just the technical difficulties in producing this basic datum.  Bernhard Palsson has pioneered much of the key research that has led to the recent successes.  Spending large amounts of effort, his team has built a very few of these stoichiometric matrices, iteratively improving their accuracy.
  They have successfully used these matrices to support initial modeling efforts on the organisms, and the results have gained international recognition. 
</p><p>
Palsson's team originated the Project to Produce 1000 Stoichiometric Matrices, and they will play the lead role in converting the informal metabolic reconstructions into formal reconstructions and producing the matrices.   The team at FIG and Argonne National Laboratory will participate in the effort, coordinating closely with Palsson's team.   At this point, the Palsson team and the teams at FIG, ANL, and The Burnham Institute are all working on tools to automate the generation of matrices from informal metabolic reconstructions.
</p>
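<p>The consistency check that a reaction network imposes on a reconstruction can be illustrated with a small sketch.  It assumes a toy network with made-up metabolites (A through D) and reaction names, treats every reaction as irreversible, and flags "dead-end" metabolites that are only produced or only consumed.  This is one simple, commonly used kind of gap check, offered as an illustration only; it is not a description of the tools used by any of the teams named above.</p>

```python
# Toy reaction list (hypothetical): negative = consumed, positive = produced.
reactions = {
    "uptake_A": {"A": 1},
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "R3": {"C": -1, "D": 1},
}


def stoichiometric_matrix(reactions):
    """Build the matrix with one row per metabolite and one column per reaction."""
    metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
    names = sorted(reactions)
    matrix = [[reactions[r].get(m, 0) for r in names] for m in metabolites]
    return metabolites, names, matrix


def dead_ends(metabolites, matrix):
    """Return metabolites that are produced but never consumed, or vice versa."""
    flagged = []
    for metabolite, row in zip(metabolites, matrix):
        produced = max(row) > 0
        consumed = 0 > min(row)
        if produced != consumed:
            flagged.append(metabolite)
    return flagged


metabolites, names, matrix = stoichiometric_matrix(reactions)
print(dead_ends(metabolites, matrix))  # ['D']
```

<p>In this toy network D is produced by R3 but nothing consumes or exports it, so the check points at either a missing reaction or a questionable assertion that the pathway is present.</p>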
<h4>The Participants</h4>
<p>We expect participants in both projects from many institutions worldwide, probably with both academic and commercial interests.   Initially, it is likely that the effort will be led from FIG, ANL and Palsson's team at UCSD.    We are planning on Roy Jensen playing a role relating to quality control and development of tools to support Stage 3 analysis.  Andrei Osterman from the Burnham Institute will lead wet lab efforts to challenge in silico predictions.
</p><p>
If the effort is successful, we would hope to stimulate numerous research efforts worldwide, and we expect leadership and participation to broaden rapidly.  We welcome broad participation.
</p>
<h4>A Proposed Schedule</h4>
<p>Let us begin by estimating the point at which 1000 genomes will become available.  One simple approach would go as follows:</p>
<ol>
<li>The number of genomes will double approximately every 18 months.</li>
<li>We now have about 300 more-or-less complete genomes.</li>
<li>Therefore, we should have approximately 1000 genomes in just a bit under 3 years (by sometime in 2007)</li>
</ol>
<p>There is a great deal in this analysis that is far from certain.  However, let us use this estimate as a working hypothesis.</p>
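<p>The arithmetic behind this estimate can be checked with a short sketch.  It assumes steady exponential growth using the figures stated above (about 300 genomes now, doubling every 18 months); the function and variable names are illustrative only.</p>

```python
import math


def months_until(target, current, doubling_months):
    """Months needed for `current` to grow to `target` under steady doubling."""
    return doubling_months * math.log2(target / current)


months = months_until(target=1000, current=300, doubling_months=18)
years = months / 12
print(f"{months:.1f} months (about {years:.1f} years)")  # 31.3 months (about 2.6 years)
```

<p>Two full doublings (36 months) would give 1200 genomes, so 1000 is reached somewhat earlier, which is consistent with "just a bit under 3 years".</p>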
<h5>2005</h5>
<p>During 2005, Stage 1 will be completed for the vast majority of subsystems.  Stage 2 will be initiated for 30-50 subsystems.  Fewer than 10 will move deeply into Stage 3.</p>
<p>We will actively attempt to produce 10-15 stoichiometric matrices.  We will focus on diverse organisms of interest to DOE and a set of gram-positive pathogens.</p>
<p>We will begin a detailed review for quality assurance by a small number of expert biochemists and microbiologists.</p>
<p>We expect wet lab confirmations to begin, but this is one area in which funding plays an essential role.  We expect funding to support targeted confirmation/rejection of the numerous conjectures arising from the bioinformatics to begin in 2005-2006.  It is possible to fairly accurately predict the potential flow of confirmations, but we cannot predict available funding. We believe that the bioinformatics-driven wet lab, in which conjectures are prioritized and grouped, would allow a relatively small group (of 3-4 postdocs and a technician) to characterize up to 50 novel gene families encoding the most important functional roles in central metabolic subsystems of diverse organisms per year.</p>
<h5>2006</h5>
<p>During 2006,  the vast majority of subsystems will enter Stage 2.  We will attempt to move a large number into Stage 3 (this is truly difficult to predict; it depends hugely on success with the early attempts, our ability to reduce the required effort, and the research aims of the participants).
</p><p>
We would plan on completing at least 200 more stoichiometric matrices.
</p><p>
If the wet lab component of the effort is fully functional, we would expect a steady stream of confirmations, and (based on our past experience) we would project roughly that 75-90% of the tested conjectures will be validated.
</p>
<h5>2007</h5>
<p>During 2007 we would plan on pushing Stage 2 and 3 analysis as far as possible.  We believe that we will have the encodings needed to cover the vast majority of well-understood subsystems and many that are not well understood.  
</p><p>
We would plan on completing initial stoichiometric matrices for several hundred more genomes.  Since the majority of the genomes will not become available until this year, of necessity many of the stoichiometric matrices will not be reasonably complete before sometime in 2008 or 2009.
</p><p>
If the wet lab component of the effort is fully functional, we would expect the stream of successful conjectures to stimulate numerous labs to join the effort.  Ultimately, the role of the wet lab component that is tightly-coupled to the project is to demonstrate the huge improvement in efficiency that can be attained by coupling the wet lab effort to well-chosen, targeted conjectures generated from the subsystems. 
</p>
<h4>A Short Note on the Analysis of Environmental Samples</h4>
<p>It is becoming clear that analysis of environmental samples will become increasingly significant.   Consider a framework in which we have 1000 genomes and detailed informal metabolic reconstructions for all of them.  We believe that, given a substantial environmental sample,
</p><ol>
<li>it will be possible to produce accurate estimates of which organisms are present (where an "organism" in this context should probably be viewed as "some organism within a very constrained phylogenetic neighborhood"),</li>
<li>it will be possible to produce fairly precise estimates of the metabolism of the organisms believed to be present, and</li>
<li>it will be possible to compare the predicted metabolism with the actual enzymes detected in the environmental sample.</li>
</ol>
<p>The hope is clearly that we will be able to make accurate estimates, given 1000 well-annotated genomes.</p>
<h4>Summary</h4>
<p>The value of a collection of 1000 genomes depends directly on the quality of the annotations, the corresponding metabolic reconstructions, and the extent to which the foundations of modeling have been established.
</p><p>
The Project to Annotate 1000 Genomes is based directly on the notion of building a collection of carefully created and curated subsystems.  The fact that the individuals who encode these subsystems annotate the same subsystem over a broad collection of genomes allows them to gain an understanding of detailed variation and at least a minimal grasp of the review literature.  They will be annotating genes for which they develop some detailed familiarity.  We place this technology in direct opposition to the existing approaches, in which individuals annotate complete genomes (assuring an almost complete lack of familiarity with the majority of genes being annotated) and in which automated pipelines are badly limited by the ambiguities and errors in existing annotations.
</p><p>
The Project to Produce 1000 Stoichiometric Matrices has the potential of laying the foundations for quantitative modeling.  Many, if not most, existing modeling efforts are dramatically hampered by the fact that very, very few stoichiometric matrices  now exist, and the cost of developing more using existing approaches is quite high.
</p><p>
The development of a wet lab component that challenges a carefully prioritized set of conjectures flowing from both the subsystems analysis and the initial quantitative modeling is essential.  It will confirm the relative efficiency of this approach (which might reasonably be characterized as "picking the low-hanging fruit"), and in the process establish a paradigm that directly challenges the more common approach to establishing priorities.
</p><p>
We claim to understand the key technology needed for high-throughput development of annotations, metabolic reconstructions, and stoichiometric matrices.  By the summer of 2005, this should be completely obvious.
</p>]]>
</content>
</entry>

<entry>
<title>December 2003 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2003/12/december-2003-n.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2003-12-20T17:15:31Z</issued>
<id>tag:www.figresearch.com,2003:/news//3.106</id>
<created>2003-12-20T17:15:31Z</created>
<summary type="text/plain"> General Comments on the Schedule for Delivering the SEED An Increase in Programming Support Peer-to-Peer Updating FIG Development Meetings The &quot;Annotate a Thousand Genomes&quot; Effort...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>General Comments on the Schedule for Delivering the SEED</li>
<li>An Increase in Programming Support</li>
<li>Peer-to-Peer Updating</li>
<li>FIG Development Meetings</li>
<li>The "Annotate a Thousand Genomes" Effort</li>
</ul>]]>
<![CDATA[ 
<h4>General Comments on the Schedule for Delivering the SEED</h4>


<p>We promised timely reports on progress, so we thought that it would be
a good idea to send out one last newsletter this year.  Progress has
been very rapid.  The SEED was shown at the SC 2003 supercomputing
conference, and everything went amazingly smoothly.  Early versions of
the peer-to-peer updating capability were shown, and a genome was
added in real-time.  That is, we took a genome with just gene calls,
added it to the SEED, computed similarities, made automated
assignments, and displayed the results during the demo period.  Many
thanks to the brave souls that made it happen -- there were points
where we wondered whether it could all be made solid quickly enough.</p>

<p>The alpha-test version of the SEED has been installed and operational
at several sites for over a month now.  We are learning a lot about
its utility and what is needed to improve it.  Natalia Maltsev is
hosting a class on Jan 8-9 that will include two parts: it will
present the tools developed by her team to support high-throughput
annotation of genomes using grid-technology, and it will present the
initial release of the SEED.  It will be a small meeting to help make
sure that a few sequencing/annotation projects that intend to use the
Argonne/FIG annotation tools see what is involved.</p>

<p>We are now working on making a set of DVDs that contain both Mac and
Linux versions of the SEED with an install script.  We have
experienced some difficulties in getting installations to go
smoothly, and it is critical that we get the installation script
working reliably before a large-scale distribution occurs.  In
addition, we are finishing the implementation of protocols for sharing
annotations.  This must all be done and ready to hand out at Natalia's
workshop in January.</p>

<p>For those initial users of Mac OS X versions of the SEED, a number of
minor problems arose when people converted from Jaguar to Panther.  We
believe that we understand exactly how to help people make the
transition, but be aware that initial versions of the SEED may well
stop functioning if you make the switch.  Needless to say, we have
learned a number of lessons from helping people make the switch, and
the versions that will be released in January should run fine under
either Jaguar or Panther.</p>

<p>Later in January a class in looking for missing genes will be taught
at Franklin and Marshall College in Pennsylvania.</p>

<h4>An Increase in Programming Support</h4>

<p>Most of the SEED development was done by Ross Overbeek with help from
a number of friends.  This month Terry Disz and Bob Olson from Argonne
National Lab began putting in substantial amounts of effort, and
next month FIG will hire one full-time and one part-time programmer.
Part of this expansion is due to FIG receiving its first grant (from
DOE).  This is a big moment, and we anticipate that it will be just
the first of several grants we receive this year.</p>

<p>This represents a huge increase in our ability to support and
accelerate development of the SEED.  Even so, it is quite likely that
a majority of the major advances will come from collaborators over the
coming year.  This means that you can expect progress to rapidly
accelerate (and, really, it has not been so slow up to now!).</p>

<h4>Peer-to-Peer Updating</h4>


<p>The concept of developing a system that supports peer-to-peer updating
is really pretty interesting.  The standard way of constructing and
deploying systems is based on a central source of both the initial
system and updates.  The peer-to-peer model is one in which any two
users can share and update either data or code.</p>

<p>The classical system lends itself to a hierarchical structure: the
main source supplies systems to, say, sequencing/annotation projects.
These projects install a single local system that collects and manages
function assignments and annotations.  Periodically, the project site
synchronizes annotations with the main site, propagating work up and
then back down through the hierarchy.</p>

<p>In the peer-to-peer model, everyone has a local version of the system,
and everyone has the ability to exchange/synchronize with anyone else.
Informal "hubs" may form at nodes attempting to maintain current,
comprehensive collections of genomes, but everyone has and maintains
their own research environment.  The way this would work for the
average sequencing project would be:
<ol>
	<li> someone would download (or take from DVDs) an initial SEED,</li>
	<li> the local genomes would be loaded,</li>
	<li> other members of the project would download copies from the
	   initial version (over a network, through DVDs, or by
	   simply copying between hard drives),</li>
	<li> periodically (either daily or weekly), members of the
	   project would exchange annotations and assignments,</li>
	<li> as updated versions of the SEED become available, the new
	   code propagates in a peer-to-peer manner, and</li>
	<li> new genomes are integrated and exchanged.</li></ol></p>
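<p>The exchange step in this list can be sketched as a two-way merge between any two peers.  The sketch below is only an illustration: the record layout, the gene identifiers, and the "newest assignment wins" rule are assumptions made for the example, not a description of the actual SEED protocols.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    function: str   # function assigned to the gene
    annotator: str  # who made the assignment
    timestamp: int  # when it was made (any increasing clock)


def synchronize(peer_a, peer_b):
    """Two-way exchange: both peers end up with the newest assignment per gene.

    "Newest wins" is an assumed conflict rule chosen for this sketch.
    """
    merged = {}
    for gene in set(peer_a) | set(peer_b):
        candidates = [peer[gene] for peer in (peer_a, peer_b) if gene in peer]
        merged[gene] = max(candidates, key=lambda a: a.timestamp)
    peer_a.update(merged)
    peer_b.update(merged)


# Hypothetical local copies held by two members of a project.
alice = {"gene_1": Assignment("histidinol dehydrogenase", "alice", 100)}
bob = {
    "gene_1": Assignment("hypothetical protein", "bob", 50),
    "gene_2": Assignment("tryptophan synthase alpha chain", "bob", 120),
}

synchronize(alice, bob)
print(alice == bob)               # True
print(alice["gene_1"].function)   # histidinol dehydrogenase
```

<p>No central server appears anywhere in the sketch: any pair of copies can be reconciled directly, which is the property the peer-to-peer model depends on.  A real system would need a more careful policy for conflicting assignments than a timestamp comparison.</p>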

<p>Such an "anarchistic" model introduces numerous questions and
potential conflicts, but it does completely remove dependencies on
central sources; if you want a new genome immediately, you just add it
to your copy (or get it from someone who has already added it).  If
you have 200 kb of new sequence, you are not faced with sending it to
a central source and waiting for them to create an updated version
that is returned to you; rather, you just add the data to your copy,
do your analysis, and exchange it (or not) with whomever you wish.</p>

<p>This is basically the model we are proposing for the SEED, and the
initial "launch" is very close.</p>

<h4>FIG Development Meetings</h4>

<p>The FIG developers will meet at least three times per year (one-week
meetings) to integrate code and prepare releases.  We anticipate
having two meetings per year in the US (one in Chicago) and one per
year in Europe.  Because there is so much to do initially, we are
proposing to meet three times in the next 6-7 months.  The first
meeting will probably be in the second half of February, the second in
Europe in April, and the third in Chicago in June.  The locations
of the first and second meetings are still not decided, although the
second will probably be in Bielefeld, Germany.  The first cannot be in
Chicago for two reasons: the weather in February can be terribly
unpredictable, and it takes 90 days to get clearances for foreign
visitors.  We plan on having these meetings be completely informal
code integration efforts.  People will focus on defining the future
additions to the FIG software, integrating software from other
independent efforts, and preparing new releases.  We anticipate
releasing new versions of the FIG software about a week after each
meeting.  If you have suggestions regarding schedule or location,
comments would be welcome.</p>


<h4>The "Annotate a Thousand Genomes" Effort</h4>

<p>Developing the SEED is a major effort initiated by FIG, and we believe
that it will be of utility for numerous projects.  Initially, we
believe that it will be used largely by sequencing/annotation teams
and individual investigators wishing to perform comparative analysis using
large collections of genomes.  However, a number of FIG fellows wish
to focus on the problem of developing reliable annotations, and they
believe that they finally see exactly how to do it.  We have discussed
this briefly in past newsletters.  The project will involve gathering
experts in specific subsystems and supporting analysis of each
subsystem across the entire collection of organisms.  We would add
whatever functionality is needed to the SEED to support these experts,
and the explicit goal would be accurate annotations for the thousand
genomes we believe will be available by the end of about three years.</p>

<p>This effort, too, is beginning to make progress.  Most notably, a
number of biologists with specific areas of expertise have expressed a
desire to participate actively, as soon as the project is a bit more
well-defined, and the tools are in place.  We intend to officially
launch things in February.</p>

<p>Our first step is to define precisely what "deliverables" we would
want from each expert.  This may sound like an insignificant issue,
yet a number of us are now convinced that getting it right is the key
to saving large amounts of effort.  We are actively discussing the
point, and we will offer an initial position in the next newsletter.</p>

<p>Once the initial release of the SEED occurs (hopefully within a very,
very short time now), expect to see focus rapidly shift to defining
and starting the annotations effort.</p>
]]>
</content>
</entry>

<entry>
<title>November 2003 Newsletter</title>
<link rel="alternate" type="text/html" href="http://www.figresearch.com/archives/2003/11/november-2003-n.html" />
<modified>2006-12-10T13:36:36Z</modified>
<issued>2003-11-17T18:45:01Z</issued>
<id>tag:www.figresearch.com,2003:/news//3.112</id>
<created>2003-11-17T18:45:01Z</created>
<summary type="text/plain"> The SEED...</summary>
<author>
<name>Veronika Vonstein</name>

<email>veronika@thefig.info</email>
</author>
<dc:subject>Newsletters</dc:subject>
<content type="text/html" mode="escaped" xml:lang="en" xml:base="http://www.figresearch.com/news/">
<![CDATA[<ul>
<li>The SEED</li>
</ul>]]>
<![CDATA[<p>A great deal has been happening.  We have settled on "The SEED" as a
name for the new system, and here is how things stand on SEED
development: </p><ol>


    <li> We have built a "SEED disk" for Mac OS X machines.  We have
       built six copies so far; debugging and improvements are
       progressing very quickly.</li>

    <li> We have gotten our new dual-processor G5 server.  It is a
       wonderful machine.  We are using it for development and the
       SEED runs well on it.  We will put up a public server as soon
       as we load it with completely current copies of all of the 140-150
       published genomes.</li>  

    <li> The SEED will be shown at Supercomputing 2003 (a big yearly
       show in the world of supercomputers) later this month.  We will
       have two G5 Macs and two Linux machines (one running Postgres,
       and one running MySQL).  As of today they support peer-to-peer (p2p) 
       updating of code, annotations and assignments.</li>

    <li> By next week we plan on supporting the ability for users to add
       their own genomes, and probably the ability to acquire initial
       copies over the internet (we will not immediately offer this
       service -- it will take at least a month to get the initial
       release ready).</li>

    <li> We have a problem getting data entered initially.  We have
       found at least 2-3 groups who will help us; one by helping
       acquire and prepare published genomes, one might help us call
       prokaryotic genes, and a third might help us organize the
       eukaryotic data.  I hesitate to announce names, since the
       commitments are not firm yet (I believe that everyone is
       waiting for us to organize our side of the collaborations).
       However, we do deeply appreciate the offers of help, and we
       believe that the pipelines will be flowing at high volume by
       the end of the year.</li>

    <li> The SEED is now actively in use by a large-scale sequencing
       effort.  It will be used as the framework for analysis of genes
       and development of metabolic overviews.  We have been
       approached by a number of other groups who are now planning on
       using the system, if their grants go through.  These are all
       important, since each such group will become a valuable source
       of criticism.  There is a real difference between the demands
       of people who just occasionally use the system and those that
       use it on a daily basis.  The speed of improvement will be
       directly proportional to the number of serious users.</li>

    <li> The system is being used to support a class at the University
       of Chicago.  The real impact of this effort was to force a bit
       more organization in our development.  Having a class of
       students dependent on the use of the system is exposing bugs
       and shortcomings pretty quickly. FIG is actively planning on
       helping to develop course materials and to aid in sharing of
       course materials.  We feel that it is important that we build
       up a collection of web-based "lectures" and "assignments".
       Initially, we will focus on developing three 2-week "modules"
       that would be suitable for use in any number of biology
       courses.  The modules will be<ul> 

                  <li>Searching for Missing Genes</li>

                  <li>How to Annotate a Genome</li>

                  <li>How to Annotate a Subsystem</li></ul></li></ol>


<p>So, the SEED now runs on Linux and Macs.  The next month will be spent
in getting installation procedures and instructions simplified, as
well as loading the data for the initial release.  Once the initial
release occurs, we expect a very rapid propagation.  If our ability to
support peer-to-peer updates of code is solid, we will be able to
handle bug fixes and new features without it being a big deal.  If
not, we will rapidly be overwhelmed.  There is, of course, an ongoing
tension between making things solid and adding new features; it always
exists, and we are trying to make the right choices.</p>

<p>The "annotate a 1000 genomes" project is moving from the discussion
stage to becoming a clear goal of FIG.  The basic position may be
summarized as follows:</p><ol>

           <li> Think of a spreadsheet with each row corresponding to a
              genome, each column corresponding to a subsystem (e.g.,
              "synthesis of histidine"), and each cell corresponding
              to a summary of how the given subsystem is implemented
              in the specified organism.  The number of rows will grow
              at ever increasing speeds, but the number of columns
              (i.e., subsystems) is relatively constant.</li>

           <li> Most current annotation efforts focus on annotating a
              single genome or a small set of closely related genomes.
              The key to "high-throughput annotation" is not the
              pipeline approach that is widely employed, but rather
              the development of a detailed understanding of the
              variants of a specific subsystem, followed by consistent
              annotation of that subsystem through all existing
              genomes (i.e., annotation of columns, rather than rows).</li>

           <li> There are on the order of 80-120 subsystems that make up
              the "core machinery" of the cell.  Here, we explicitly
              exclude a few key categories that should be treated
              separately (e.g., signalling, transport, and regulatory
              genes).  The study of each of these subsystems needs to
              be done systematically.</li>

           <li> One of the best examples of how this should be done, in
              our opinion, was published in the September 2003 issue of
              MMBR.  The paper, "Ancient Origin of the Tryptophan
              Operon and the Dynamics of Evolutionary Change" by Roy
              Jensen's team, is a vivid portrait of exactly what
              is needed for each of the key subsystems.  We highly
              recommend it; it is an excellent review of the
              tryptophan biosynthesis subsystem, but even more
              importantly it is a precise template of what is needed
              for each subsystem.</li>

           <li> The speed with which an analysis of the sort discussed
              in the tryptophan review can be constructed is highly
              dependent on the tools the analyst uses.  A good
              biologist might well also suggest that it depends on
              insight formed from years of work on a specific
              subsystem, but it seems clear to us that the
              availability of appropriate tools is a major issue.  The
              right tools do not yet exist (at least in our view).
              FIG will participate actively in the development of these
              tools.</li>

           <li> FIG does not intend to limit its participation to tool
              building.  Indeed, several of us are deeply interested
              in actively taking subsystems, performing the needed
              analysis, cleaning up the annotations, and producing the
              corresponding reviews.</li></ol>
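<p>The spreadsheet picture in the first two points above can be sketched as a small data structure.  The genome names and variant labels below are placeholders invented for the example; only the rows-are-genomes, columns-are-subsystems layout and the idea of annotating by column come from the text.</p>

```python
def annotate_column(spreadsheet, subsystem, variant_by_genome):
    """Fill one column: record how each genome implements a single subsystem."""
    for genome, variant in variant_by_genome.items():
        spreadsheet.setdefault(genome, {})[subsystem] = variant


def unannotated(spreadsheet, subsystem):
    """List the genomes (rows) whose cell for this subsystem is still empty."""
    return sorted(g for g, row in spreadsheet.items() if subsystem not in row)


# One row per genome; every row starts with no subsystem cells filled in.
spreadsheet = {"genome_1": {}, "genome_2": {}, "genome_3": {}}

# A single analyst works down one column across all genomes.
annotate_column(
    spreadsheet,
    "synthesis of histidine",
    {"genome_1": "variant A", "genome_2": "variant B"},
)

print(unannotated(spreadsheet, "synthesis of histidine"))  # ['genome_3']
```

<p>Adding a newly sequenced genome only adds a row, while the set of columns stays roughly fixed, which is why the column-wise approach scales as the number of genomes grows.</p>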

<p>The FIG team is growing rapidly (in the sense that a lot of
interesting people are starting to closely collaborate on this strange
effort).  It is most encouraging.</p>

<p>More in a few weeks.</p>

<p>Take care,</p>

<p>The FIG Team</p>
]]>
</content>
</entry>

</feed>