
December 2006 Archives

December 1, 2006

Final Search Fixes for Version 17

The Attribute Situation

Attributes now has its own category on the development blog.

The Even Newer Attribute System is now operating on the SEED instance running on the Development Server. It has been tested to make sure all the little bits work, and will be moved to the attribute server and integrated into the annotator SEED next month. In the meantime, you can see the attribute console at http://web-1.nmpdr.org/next/FIG/Attributes.cgi.

The most important change that still needs to be made is getting the select.cgi collections converted. This is a self-inflicted wound, because I wanted the subsystem attributes to be attached to the actual subsystems instead of a thing called "Subsystem". In order to do that, I need to make changes to select.cgi.

The Even Newer Attribute System differs from the New Attribute System in several ways. There is now a single table of attributes implemented as a relationship between TargetObjects and AttributeKeys. Target objects are identified by ID only, rather than ID and type, which makes the system more like the Old Attribute System.

The TargetObject entity is virtual, which means that there is no data in the TargetObject table. There is, however, another entity called AttributeGroup that allows arbitrary grouping of attribute keys. There is only one level of grouping, but an attribute can belong to many different groups.
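As a rough illustration of that layout, here is a small in-memory sketch in Python. It is not the actual ERDB schema or any SEED code: the class name, method names, and group names are hypothetical. It only shows the shape of the idea: one table of (object ID, key, value) rows, target objects that exist only as IDs, and a single level of groups that a key may belong to more than once.

```python
from collections import defaultdict


class AttributeStore:
    """Toy model of the attribute layout (hypothetical names, not the real schema)."""

    def __init__(self):
        self.keys = set()               # stands in for the AttributeKey entity
        self.groups = defaultdict(set)  # stands in for AttributeGroup: group -> key names
        self.rows = []                  # the single relationship table: (object_id, key, value)

    def add_key(self, key, groups=()):
        """Define an attribute key and optionally place it in one or more groups."""
        self.keys.add(key)
        for group in groups:
            self.groups[group].add(key)

    def add_value(self, object_id, key, value):
        """Attach a value to an object. The object is identified by ID only,
        and nothing is stored for the object itself (TargetObject is virtual)."""
        if key not in self.keys:
            raise KeyError(f"attribute key {key!r} has not been defined")
        self.rows.append((object_id, key, value))

    def keys_in_group(self, group):
        """Return the key names in a group, sorted for stable output."""
        return sorted(self.groups.get(group, ()))

    def values_for(self, object_id):
        """Return (key, value) pairs for one object ID."""
        return [(k, v) for (oid, k, v) in self.rows if oid == object_id]
```

Because the object is just an ID string, the same table can hold attributes for genomes, features, or subsystems without a separate type column.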

The AttrDBRefresh script is used to do batch attribute processing. It has options for backing up attributes to a tab-delimited file, loading attributes from a tab-delimited file, and migrating attributes from an instance of the SEED.

The attribute backup and load files are expected to contain an object ID, an attribute key name, and one or more values in each line. There is also a facility for uploading a single attribute from the web. In this case, the file must still be tab-delimited, but you specify the columns containing the object ID and the attribute value in the upload form.
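A minimal Python sketch of reading and writing one line in that layout follows. It assumes only what is stated above (tab-delimited, object ID first, key name second, one or more values after that). The function names and the sample values are hypothetical, and this is not the AttrDBRefresh code.

```python
def parse_attribute_line(line):
    """Split one backup/load line into (object_id, key, [values])."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected an object ID, a key name, and at least one value")
    object_id, key, *values = fields
    return object_id, key, values


def format_attribute_line(object_id, key, values):
    """Inverse of parse_attribute_line: join the fields with tabs and add a new-line."""
    return "\t".join([object_id, key, *values]) + "\n"
```

For example, parsing a line containing `fig|100226.1.peg.3361`, `GO`, and two value fields yields the ID, the key, and a two-element value list.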

December 2, 2006

More Minor Changes

  • The GO attributes from Ross and Kaitlyn have been loaded into the attribute system on the Development Server and copied into the NMPDR database. An example GO attribute can be seen on the Streptomyces coelicolor peg 3361 page.
  • The feature filter rows have been moved to the bottom of the signature genes form.
  • The drug target page has been updated. I attempted to combine independent changes from both Matt and Leslie.

Version 17 on Staging Server

NMPDR version 17 is now on the staging server.

December 4, 2006

A Search Improvement

The text index on the Feature table was missing, which was making all the searches very slow. I have rebuilt the indexes on both the Development and Staging servers, and search speed has improved.

December 5, 2006

Final Push for Version 17

  • The text search index somehow got dropped, making keyword searches very slow. This has been fixed.
  • I have modified the BBH server to accept batch requests. When requesting BBH data, you can specify an SQL pattern instead of a feature ID. So, for example, fig|100226.1.% would compute BBHs for all features of Streptomyces coelicolor. To control the bandwidth in cases like this, you can also specify a list of target genomes. Only BBHs that land in the target genomes will be returned by the server. This dramatically improves the performance of the Signature Genes Tool.
  • The mini-forms on the search results page (NMPDR and GBrowse) have been converted to fake buttons so that they work after a copy and paste and may be right-clicked to open the results in a new window or tab. They look like real buttons, but do not have an animated click effect.
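The batch BBH request in the second item can be pictured with a small Python sketch. This is not the BBH server code: it filters a list of already-known (query, hit, score) tuples rather than computing anything, the function names are hypothetical, and it assumes feature IDs have the form fig|genome.type.number so that the genome can be read off the ID.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL LIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def genome_of(feature_id):
    """Assumed ID layout: fig|<genome>.<type>.<number>; return the genome part."""
    return feature_id.split("|", 1)[1].rsplit(".", 2)[0]


def select_bbhs(bbh_rows, query_pattern, target_genomes=None):
    """Keep rows whose query matches the pattern and whose hit lands in a target genome.

    With target_genomes=None, no genome filtering is applied.
    """
    matcher = like_to_regex(query_pattern)
    targets = None if target_genomes is None else set(target_genomes)
    return [
        (query, hit, score)
        for (query, hit, score) in bbh_rows
        if matcher.fullmatch(query)
        and (targets is None or genome_of(hit) in targets)
    ]
```

Restricting the hits to a list of target genomes is what keeps the response small when the pattern covers a whole genome.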

These fixes, along with a whole bunch of web page improvements by Leslie, have been moved to the staging server.

December 6, 2006

Processing Tab-Delimited Files

The Tracer.pm module contains several utilities for processing tab-delimited files. Although none of these methods do anything spectacular, they take some of the tedium out of this kind of programming.

  • Open opens a file and dies with an error message if the open fails.
  • GetLine reads a line from the file, chops off the new-line, and splits it into pieces on tab boundaries.
  • PutLine takes a list of text elements, joins them together with tabs, appends a new-line, and writes the result. This is basically the opposite of GetLine.
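For readers who do not have Tracer.pm handy, the three utilities behave roughly like the Python functions below. This is a hedged analogue written from the descriptions above, not a translation of the Perl source. The function names are hypothetical, and returning None at end of file is an assumption of this sketch.

```python
import io


def open_or_die(path, mode="r"):
    """Open a file, or stop the program with an error message if the open fails."""
    try:
        return open(path, mode, newline="")
    except OSError as err:
        raise SystemExit(f"Could not open {path}: {err}")


def get_line(handle):
    """Read one line, chop off the new-line, and split it on tabs.

    Returns None at end of file (an assumption of this sketch).
    """
    line = handle.readline()
    if line == "":
        return None
    return line.rstrip("\r\n").split("\t")


def put_line(handle, fields):
    """Join the fields with tabs, append a new-line, and write the result."""
    handle.write("\t".join(str(field) for field in fields) + "\n")


# Round trip through an in-memory file to show that the two calls are opposites.
buffer = io.StringIO()
put_line(buffer, ["fig|100226.1.peg.3361", "GO", "example value"])
buffer.seek(0)
fields = get_line(buffer)
at_end = get_line(buffer)
```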

A sample program using these methods is shown in the full entry, "Processing Tab-Delimited Files".

General Cleanup of Drug Target Files

I have created a new script, DrugClean, which prepares a drug target output file for NMPDR. The script is invoked as follows:

DrugClean -macFile fileName1 fileName2 ... fileNameN

The macFile switch is only necessary if the input files are all in Macintosh format.

The script will remove duplicate entries and entries for PEGs that are not in the current version of the Sprout database. It also converts the file to Unix format. I have changed targets.cgi to expect Unix files, so when we get new drug targets files this script must be run on them or they won't work.
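The cleanup steps can be sketched in Python as follows. This is an illustration only, not the DrugClean source. It assumes that the PEG ID is the first tab-delimited column of each line (the file layout is not described here), that "Macintosh format" means bare carriage-return line endings, and that the set of known PEG IDs has already been fetched from the database. The function name is hypothetical.

```python
def clean_drug_targets(raw_text, known_pegs, mac_format=False):
    """Return Unix-format text with duplicates and unknown PEGs removed.

    raw_text   -- contents of one drug target file
    known_pegs -- set of PEG IDs present in the current database (assumed given)
    mac_format -- True if lines end in a bare carriage return
    """
    if mac_format:
        raw_text = raw_text.replace("\r", "\n")
    seen = set()
    kept = []
    for line in raw_text.split("\n"):
        if not line:
            continue                      # skip blank lines and the trailing empty piece
        peg_id = line.split("\t", 1)[0]   # assumption: the PEG ID is the first column
        if peg_id not in known_pegs or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "".join(line + "\n" for line in kept)
```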

I also fixed a performance problem with the organism files, and they now load in around 10 seconds instead of 50.

Version 17 is Now Live

Bigger, faster, and richer in content, version 17 is now live.

To celebrate, I will be spending the next few hours huddled in a corner whimpering and shaking uncontrollably.

December 11, 2006

Some Drug Target Fixes

Some improvements have been made to the drug target pages in version 17.

  • The performance of the organism lists has been greatly improved. Previously, the category for a peg (drug, toxin, or vaccine) was determined by searching the category files for each peg. Now the category files are read once and the category information is kept in memory.
  • There is now a column for a peg's best hit to the human genome. Only hits with a score of 1e-15 or lower are shown. The column contains the hit score, and clicking on it will bring up the protein page for the specified human peg.
  • The header columns no longer appear when there is no data to display.
  • The template has been changed to a wide display so more data can fit on a screen.
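The first two items amount to a lookup table built once and a simple score cutoff. A Python sketch of both ideas follows. The names are hypothetical, the category files are represented by a plain dictionary rather than real files, and the actual page code is Perl.

```python
HUMAN_HIT_CUTOFF = 1e-15  # hits are shown only at this score or lower


def load_categories(category_members):
    """Build a PEG -> set-of-categories table in one pass.

    category_members maps a category name (drug, toxin, vaccine) to the PEG IDs
    listed in that category's file. Doing this once replaces a file search per PEG.
    """
    table = {}
    for category, pegs in category_members.items():
        for peg in pegs:
            table.setdefault(peg, set()).add(category)
    return table


def show_human_hit(score):
    """Decide whether the best human hit is good enough to display."""
    return score is not None and score <= HUMAN_HIT_CUTOFF
```

After the table is built, finding the categories for a PEG is a single dictionary lookup, which is where the speedup in the organism lists comes from.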

Because these are version 17 changes (version 18 doesn't exist yet), they were slipstreamed into the live NMPDR.

December 14, 2006

Strangeness is Due Soon

I have finished testing the Even Newer Attribute System, and I will be putting it in place on the annotator SEED tonight. Things may be a bit slow while this task is in progress.

December 15, 2006

Attribute Server has Cut Over

The new attribute server is now running, and is being used as the attribute repository on the annotator seed.

The biggest difference between the old system and the new one is that you must create an attribute key before you can assign any values to it. New attributes can be defined at http://anno-3.nmpdr.org/attrib_server/Attributes.cgi. To use the attribute server, you need the following line in your FIG_Config.pm file:

$attrURL = "http://anno-3.nmpdr.org/attrib_server/AttribXMLRPC.cgi";

Experience has shown us that for best results you need to reduce the number of calls made to the server. For this reason, in your get_attributes call, you may specify a list reference for either of the first two parameters, and it will return values that match anything in the list.

December 18, 2006

Version 18

This morning I will begin building version 18. There are several important changes I need to make before version 18 can be loaded, so for a while there will be no official data in the Development NMPDR; however, I don't want to make radical code changes in version 17 now that it's live, so we will have to limp along for a while.

December 20, 2006

Progress Report on Version 18

  • The Vibrio group now consists only of pathogenic Vibrio and other Vibrio. Previously, non-pathogenic Vibrio were further specialized, but since there were only two of them, it didn't make a lot of sense.
  • The bug in ERDB that prevented the Feature text search index from building has been fixed.
  • The search URL is always displayed, and there is no longer a Show URL checkbox.
  • If the user presses ENTER on a search form and we are not displaying results, an attempt is made to search. This will not make any difference in Firefox, but it means that IE users do not need to manually click the GO button.

The load files have been created for all the tables except the Feature table. The Feature table load stalled because of the latency required to communicate with the attribute server. I changed it to retrieve attributes once per genome instead of once per feature and it is now moving considerably faster.
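The effect of that change is easy to see with a stand-in server that counts its calls. The Python sketch below is purely illustrative: CountingServer and both loader functions are hypothetical, and the real load runs in Perl against the attribute server over XML-RPC. The only behavior borrowed from the real system is that an object ID ending in % is treated as a prefix match.

```python
class CountingServer:
    """Stand-in for the attribute server that records how many calls it receives."""

    def __init__(self, rows):
        self.rows = rows   # list of (object_id, key, value)
        self.calls = 0

    def get_attributes(self, object_pattern):
        self.calls += 1
        if object_pattern.endswith("%"):
            prefix = object_pattern[:-1]
            return [row for row in self.rows if row[0].startswith(prefix)]
        return [row for row in self.rows if row[0] == object_pattern]


def load_one_by_one(server, feature_ids):
    """One round trip per feature: the slow approach."""
    return {fid: server.get_attributes(fid) for fid in feature_ids}


def load_by_genome(server, genome_id, feature_ids):
    """One round trip per genome, then distribute the rows in memory."""
    by_feature = {fid: [] for fid in feature_ids}
    for row in server.get_attributes(f"fig|{genome_id}.%"):
        if row[0] in by_feature:
            by_feature[row[0]].append(row)
    return by_feature
```

Both loaders return the same data. The second pays the network latency once per genome instead of once per feature.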

Update to Attribute Methods

It is now possible to specify a regular expression as a value pattern in get_attributes.

Unlike the Original Attribute System or the New Attribute System, the Even Newer Attribute System does not allow general SQL wildcards. Instead, a sort of generic search is provided: if the last character of a pattern is %, then it will be treated as an SQL wildcard character. So, you can specify fig|100226.1.peg.% as an object ID to retrieve attributes for all the PEGs of Streptomyces coelicolor, but you cannot do %.rna.% to retrieve attributes for all the RNAs in the system. There are two reasons for this. First, in order to satisfy the latter query, MySQL will end up reading every single row in the attribute table. Second, the underscore is a wildcard character in SQL, and we have them all over the place. Only recognizing a percent sign at the end made things much less messy.

The values are filtered in-memory instead of via SQL, so it is possible to allow fancier capabilities for them. Thus, for attribute values, and only for attribute values, you can specify a regular expression in addition to the single-percent generic pattern.

Some examples may help clarify all this.

  • $fig->get_attributes("fig|$genomeID.peg%", "PUBMED%") will retrieve all the PUBMED attributes for the PEGs of the given genome. There are three PUBMED attributes that will match the second operand: PUBMED, PUBMED_CURATED_RELEVANT, and PUBMED_CURATED_NOTRELEVANT.
  • $fig->get_attributes(undef, "PUBMED_CURATED%", "/^[^,]+,$id,/") will retrieve all curated PUBMED attributes for a given document number. In a curated PUBMED attribute, the document information consists of multiple comma-separated fields. The second field is the document number, so the Perl pattern is designed to match only if the specified number is between the first and second commas of the value field.
  • $fig->get_attributes([$genomeID, "fig|$genomeID%"], undef, "/^http:\/\/\w*\.?nih\.gov/i") will return all attributes related to the given genome that have an associated URL pointing to the NIH web site. Note that we can't use the Perl m operator: the attribute engine can only recognize a regular expression if it's enclosed in slashes. It does, however, allow modifiers at the end. In this case the i modifier is used to make the match case-insensitive. This call also uses a list in the first operand to ask for attributes of the genome itself (first element in the list) as well as all the genome's features (second element in the list).
  • $fig->get_attributes("/$genomeID/", 'PUBMED') is an attempt to get all PUBMED attributes related to the specified genome, but it will not work because regular expressions are only allowed for values, not for object IDs (or attribute keys for that matter). To get this effect, you must use the list-based approach shown above.
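The matching rules in these examples can be summarized with a small Python model. It is a sketch of the rules as described in this entry, not the attribute engine itself. It filters an in-memory list of rows, all function names are hypothetical, and it uses Python's regular expression syntax, which differs from Perl's in some details.

```python
import re

_MODIFIERS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}
_SLASHED = re.compile(r"^/(.*)/([a-z]*)$", re.DOTALL)


def match_generic(pattern, text):
    """Exact match, unless the pattern ends in %, which makes it a prefix match."""
    if pattern is None:
        return True
    if pattern.endswith("%"):
        return text.startswith(pattern[:-1])
    return text == pattern


def match_id_or_key(spec, text):
    """Object IDs and keys: a single generic pattern or a list of them. No regexes."""
    if spec is None:
        return True
    patterns = spec if isinstance(spec, (list, tuple)) else [spec]
    return any(match_generic(pattern, text) for pattern in patterns)


def match_value(pattern, value):
    """Values only: /regex/modifiers is honored; anything else is a generic pattern."""
    if pattern is None:
        return True
    slashed = _SLASHED.match(pattern)
    if slashed:
        flags = 0
        for letter in slashed.group(2):
            flags |= _MODIFIERS.get(letter, 0)  # unknown modifiers are ignored here
        return re.search(slashed.group(1), value, flags) is not None
    return match_generic(pattern, value)


def get_attributes(rows, object_spec=None, key_spec=None, value_pattern=None):
    """Filter (object_id, key, value) rows using the rules above."""
    return [
        row for row in rows
        if match_id_or_key(object_spec, row[0])
        and match_id_or_key(key_spec, row[1])
        and match_value(value_pattern, row[2])
    ]
```

In this model, a slash-delimited string passed as an object ID is compared literally, which is why the last example above returns nothing.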

About December 2006

This page contains all entries posted to NMPDR Development Blog in December 2006. They are listed from oldest to newest.
