FIG

FIG Genome Annotation System

Introduction

This is the main object for access to the SEED data store. The data store
itself is a combination of flat files and a database. The flat files can
be moved easily between systems and the database rebuilt as needed.

A reduced set of this object's functions is available via SFXlate.

The key to making the FIG system work is proper configuration of the
FIG_Config.pm file. This file contains names and URLs for the key
directories as well as the type and login information for the database.

FIG was designed to operate as a series of peer instances. Each instance is
updated independently by its owner, and the instances can be synchronized
using a process called a peer-to-peer update. The terms
SEED instance and peer are used more or less interchangeably.

The POD documentation for this module is still in progress, and is provided
on an AS IS basis without warranty. If you have a correction and you're
not a developer, email the details to bruce@gigabarb.com and I'll fold
it in.

NOTE: The usage example for each method specifies whether it is static
(as in FIG::min(@x)) or dynamic (as in $fig->db_handle). If the method is
static and has no parameters (FIG::something()), it can
also be invoked dynamically. This is a general artifact of the
way Perl implements object-oriented programming.

Hiding/Caching in a FIG object

We save the DB handle, cache taxonomies, and put a few other odds and ends in the
FIG object. We expect users to invoke these services using the object $fig constructed
using:

    use FIG;
    my $fig = new FIG;

$fig is then used as the basic mechanism for accessing FIG services. It is, of course,
just a hash that is used to retain/cache data. The most commonly accessed item is the
DB filehandle, which is accessed via $self->db_handle.

We cache genus/species expansions, taxonomies, (very crudely estimated) distances
between genomes, and a variety of other things.

Public Methods

new

    my $fig = FIG->new();

This is the constructor for a FIG object. It uses no parameters. If tracing
has not yet been turned on, it will be turned on here. The tracing level and
type are specified by the configuration variables $FIG_Config::trace_levels and
$FIG_Config::trace_type. These defaults can be overridden using the
environment variables Trace and TraceType, respectively.

CacheTrick

    my $value = $fig->CacheTrick($self, $field => $evalString);

This is a helper method used to create simple field caching in another object. If the
named field is found in $self, then it will be returned directly. Otherwise, the eval
string will be executed to compute the value. The value is then cached in the $self
object so it can be retrieved easily when needed. Use this method to make a FIG
data-access object more like an object created by PPO or ERDB.

self
    Hash or blessed object containing the cached fields.

field
    Name of the field desired.

evalString
    String that can be evaluated to compute the field value.

RETURN
    Returns the value of the desired field.

go_number_to_term

db_handle

    my $dbh = $fig->db_handle;

Return the handle to the internal DBrtns object. This allows direct access to
the database methods.

cached

    my $x = $fig->cached($name);

Return a reference to a hash containing transient data. If no hash exists with the
specified name, create an empty one under that name and return it.

The idea behind this method is to allow clients to cache data in the FIG object for
later use. (For example, a method might cache feature data so that it can be
retrieved later without using the database.) This facility should be used sparingly,
since different clients may destroy each other's data if they use the same name.

name
    Name assigned to the cached data.

RETURN
    Returns a reference to a hash that is permanently associated with the specified name.
    If no such hash exists, an empty one will be created for the purpose.

get_system_name

    my $name = $fig->get_system_name;

Returns seed, indicating that this object is using the SEED
database. The same method on an SFXlate object will return sprout.

DESTROY

The destructor releases the database handle.

same_seqs

    my $sameFlag = FIG::same_seqs($s1, $s2);

Return TRUE if the specified protein sequences are considered equivalent and FALSE
otherwise. The sequences should be presented in nr-analysis form, which is in
reverse order and upper case with the stop codon omitted.

The sequences will be considered equivalent if the shorter matches the initial
portion of the longer one and is no more than 30% smaller. Since the sequences are
in nr-analysis form, equivalent start portions mean that the sequences
have the same tail. The importance of the tail is that the stop point of a PEG
is easier to find than the start point, so a same tail means that the two
sequences are equivalent except for the choice of start point.

s1
    First protein sequence, reversed and with the stop codon removed.

s2
    Second protein sequence, reversed and with the stop codon removed.

RETURN
    Returns TRUE if the two protein sequences are equivalent, else FALSE.

is_locked_fid

    $fig->is_locked_fid($fid);

Returns 1 if and only if $fid is locked.

unlock_fid

    $fig->unlock_fid($user, $fid);

Sets an unlock on annotations for $fid.

delete_genomes

    $fig->delete_genomes(\@genomes);

Delete the specified genomes from the data store. This requires making
system calls to move and delete files.

add_genome

    my $ok = $fig->add_genome($genomeF, $force, $skipnr);

Add a new genome to the data store. A genome's data is kept in a directory
by itself, underneath the main organism directory. This method essentially
moves genome data from an external directory to the main directory and
performs some indexing tasks to integrate it.

genomeF
    Name of the directory containing the genome files. This should be a
    fully-qualified directory name. The last segment of the directory
    name should be the genome ID.

force
    This will ignore errors thrown by verify_genome_directory. This is bad, and you should
    never do it, but I am in the situation where I need to move a genome from one machine
    to another, and although I trust the genome I can't.

skipnr
    We don't always want to add the proteins into the NR database, for example with a
    metagenome that has been called by blastx. Setting this skips appending the proteins
    to the NR file.

RETURN
    Returns TRUE if successful, else FALSE.

parse_genome_args

    my ($mode, @genomes) = FIG::parse_genome_args(@args);

Extract a list of genome IDs from an argument list. If the argument list is empty,
return all the genomes in the data store.

This is a function that is performed by many of the FIG command-line utilities. The
user has the option of specifying a list of specific genome IDs or specifying none
in order to get all of them. If your command requires additional arguments in the
command line, you can still use this method if you shift them out of the argument list
before calling. The $mode return value will be all if the user asked for all of
the genomes or some if a list of IDs was specified. This is useful to know if,
for example, we are loading a table. If we're loading everything, we can delete the
entire table; if we're only loading some genomes, we must delete them individually.

This method uses the genome directory rather than the database because it may be used
before the database is ready.

args1, args2, ... argsN
    List of genome IDs. If all genome IDs are to be processed, then this list should be
    empty.

RETURN
    Returns a list. The first element of the list is all if the user is asking for all
    the genome IDs and some otherwise. The remaining elements of the list are the
    desired genome IDs.

reload_table

    $fig->reload_table($mode, $table, $flds, $xflds, $fileName, $keyList, $keyName);

Reload a database table from a sequential file. If $mode is all, the table
will be dropped and re-created. If $mode is some, the data for the individual
items in $keyList will be deleted before the table is loaded. Thus, the load
process is optimized for the type of reload.

mode
    all if we are reloading the entire table, some if we are only reloading
    specific entries.

table
    Name of the table to reload.

flds
    String defining the table columns, in SQL format. In general, this is a
    comma-delimited set of field specifiers, each specifier consisting of the
    field name followed by the field type and any optional qualifiers (such as
    NOT NULL or DEFAULT); however, it can be anything that would appear
    between the parentheses in a CREATE TABLE statement. The order in which
    the fields are specified is important, since it is presumed to be the
    order in which they appear in the load file.

xflds
    Reference to a hash that describes the indexes. The hash is keyed by index name.
    The value is the index's field list. This is a comma-delimited list of field names
    in order from most significant to least significant. If a field is to be indexed
    in descending order, its name should be followed by the qualifier DESC. For
    example, the following $xflds value will create two indexes, one for name followed
    by creation date in reverse chronological order, and one for ID.

        { name_index => "name, createDate DESC", id_index => "id" }

fileName
    Fully-qualified name of the file containing the data to load. Each line of the
    file must correspond to a record, and the fields must be arranged in order and
    tab-delimited. If the file name is omitted, the table is dropped and re-created
    but not loaded.

keyList
    Reference to a list of the IDs for the objects being reloaded. This parameter is
    only used if $mode is some.

keyName (optional)
    Name of the key field containing the IDs in the keylist. If omitted, genome is
    assumed.

enqueue_similarities

    FIG::enqueue_similarities(\@fids);

Queue the passed Feature IDs for similarity computation. The actual
computation is performed by create_sim_askfor_pool. The queue is a
persistent text file in the global data directory, and this method
essentially writes new IDs on the end of it.

fids
    Reference to a list of feature IDs.

export_similarity_request

Creates a similarity computation request from the queued similarities and
the current NR.

We keep track of the exported requests in case one gets lost.

create_sim_askfor_pool

    $fig->create_sim_askfor_pool($chunk_size);

Creates an askfor pool, which is a snapshot of the current NR and similarity
queue. This process clears the old queue.

The askfor pool needs to keep track of which sequences need to be
calculated, which have been handed out, etc. To simplify this task we
chunk the sequences into fairly small numbers (20k characters) and
allocate work on a per-chunk basis. We make use of the relational
database to keep track of chunk status as well as the seek locations
into the file of sequence data. The initial creation of the pool
involves indexing the sequence data with seek offsets and lengths and
populating the sim_askfor_index table with this information and with
initial status information.

chunk_size
    Number of features to put into a processing chunk. The default is 15.

get_sim_work

    my ($nrPath, $fasta) = $fig->get_sim_work();

Get the next piece of sim computation work to be performed. Returned are
the path to the NR and a string containing the fasta data.

sim_work_done

    $fig->sim_work_done($pool_id, $chunk_id, $out_file);

Declare that the work in pool_id/chunk_id has been completed, and output written
to the pool directory (get_sim_work gave it the path).

pool_id
    The ID number of the pool containing the work that just completed.

chunk_id
    The ID number of the chunk completed.

out_file
    The file into which the work was placed.

schedule_sim_pool_postprocessing

    $fig->schedule_sim_pool_postprocessing($pool_id);

Schedule a job to do the similarity postprocessing for the specified pool.

pool_id
    ID of the pool whose similarity postprocessing needs to be scheduled.

postprocess_computed_sims

    $fig->postprocess_computed_sims($pool_id);

Set up to reduce, reformat, and split the similarities in a given pool. We build
a pipe to this pipeline:

Then we put the new sims in the pool directory, and then copy to NewSims.

pool_id
    ID of the pool whose similarities are to be post-processed.

get_active_sim_pools

    @pools = $fig->get_active_sim_pools();

Return a list of the pool IDs for the sim processing queues that have
entries awaiting computation.

compute_clusters

    my @clusterList = $fig->compute_clusters(\@pegList, $subsystem, $distance);

Partition a list of PEGs into sections that are clustered close together on
the genome. The basic algorithm used builds a graph connecting PEGs to
other PEGs close by them on the genome. Each connected subsection of the graph
is then separated into a cluster. Singleton clusters are thrown away, and
the remaining ones are sorted by length. All PEGs in the incoming list
should belong to the same genome, but this is not a requirement. PEGs on
different genomes will simply find themselves in different clusters.

pegList
    Reference to a list of PEG IDs.

subsystem
    Subsystem object for the relevant subsystem. This parameter is not used, but is
    required for compatibility with Sprout.

distance (optional)
    The maximum distance between PEGs that makes them considered close. If omitted,
    the distance is 5000 bases.

RETURN
    Returns a list of lists. Each sub-list is a cluster of PEGs.

get_sim_pool_info

    my ($total_entries, $n_finished, $n_assigned, $n_unassigned) = $fig->get_sim_pool_info($pool_id);

Return information about the given sim pool.

pool_id
    Pool ID of the similarity processing queue whose information is desired.

RETURN
    Returns a four-element list. The first is the number of features in the
    queue; the second is the number of features that have been processed; the
    third is the number of features that have been assigned to a
    processor, and the fourth is the number of features left over.

get_local_hostname

    my $result = FIG::get_local_hostname();

Return the local host name for the current processor. The name may be
stored in a configuration file, or we may have to get it from the
operating system.

get_hostname_by_adapter

    my $name = FIG::get_hostname_by_adapter();

Return the local host name for the current network environment.

get_seed_id

    my $id = FIG::get_seed_id();

Return the Universally Unique ID for this SEED instance. If one
does not exist, it will be created.

get_release_info

    my ($name, $id, $inst, $email, $parent_id, $description) = FIG::get_release_info();

Return the current data release information.

The release info comes from the file FIG/Data/RELEASE. It is formatted as:

For instance:

If no RELEASE file exists, this routine will create one with a new unique ID. This
lets a peer optimize the data transfer by being able to cache ID translations
from this instance.

Title

    my $title = $fig->Title();

Return the title of this database. For SEED, this will return SEED, for Sprout
it will return NMPDR, and so forth.

FIG

    my $realFig = $fig->FIG();

Return this object. This method is provided for compatibility with SFXlate.

get_peer_last_update

    my $date = $fig->get_peer_last_update($peer_id);

Return the timestamp from the last successful peer-to-peer update with
the given peer. If the specified peer has made updates, comparing this
timestamp to the timestamp of the updates can tell you whether or not
the updates have been integrated into your SEED data store.

We store this information in FIG/Data/Global/Peers/<peer-id>.

peer_id
    Universally Unique ID for the desired peer.

RETURN
    Returns the date/time stamp for the last peer-to-peer update performed
    with the identified SEED instance.

set_peer_last_update

    $fig->set_peer_last_update($peer_id, $time);

Manually set the update timestamp for a specified peer. This informs
the SEED that you have all of the assignments and updates from a
particular SEED instance as of a certain date.

clean_spaces

    my $input = $fig->clean_spaces($cgi->param('input'));

Remove any extra spaces from input fields. This will (currently) remove ^\s, \s$, and
collapse multiple spaces into one.

cgi_url

    my $url = $fig->cgi_url();

Return the URL for the CGI script directory.

top_link

    my $url = FIG::top_link();

Return the relative URL for the top of the CGI script directory.

We determine this based on the SCRIPT_NAME environment variable, falling
back to FIG_Config::cgi_base if necessary.

temp_url

    my $url = FIG::temp_url();

Return the URL of the temporary file directory.

plug_url

    my $url2 = $fig->plug_url($url);

Change the domain portion of a URL to point to the current domain. This essentially
relocates URLs into the current environment.

url
    URL to relocate.

RETURN
    Returns a new URL with the base portion converted to the current operating host.
    If the URL does not begin with http://, the URL will be returned unmodified.

file_read

    my $text = $fig->file_read($fileName);
or
    my @lines = $fig->file_read($fileName);
or
    my $text = FIG::file_read($fileName);
or
    my @lines = FIG::file_read($fileName);

Read an entire file into memory. In a scalar context, the file is returned
as a single text string with line delimiters included. In a list context, the
file is returned as a list of lines, each line terminated by a line
delimiter. (For a method that automatically strips the line delimiters,
use Tracer::GetFile.)

fileName
    Fully-qualified name of the file to read.

RETURN
    In a list context, returns a list of the file lines. In a scalar context, returns
    a string containing all the lines of the file with delimiters included.

file_head

    my $text = $fig->file_head($fileName, $count);
or
    my @lines = $fig->file_head($fileName, $count);
or
    my $text = FIG::file_head($fileName, $count);
or
    my @lines = FIG::file_head($fileName, $count);

Read a portion of a file into memory. In a scalar context, the file portion is
returned as a single text string with line delimiters included. In a list
context, the file portion is returned as a list of lines, each line terminated
by a line delimiter.

fileName
    Fully-qualified name of the file to read.

count (optional)
    Number of lines to read from the file. If omitted, 1 is assumed. If the
    non-numeric string * is specified, the entire file will be read.

RETURN
    In a list context, returns a list of the desired file lines. In a scalar context, returns
    a string containing the desired lines of the file with delimiters included.

min

    my $min = FIG::min(@x);
or
    my $min = $fig->min(@x);

Return the minimum numeric value from a list.

x1, x2, ... xN
    List of numbers to process.

RETURN
    Returns the numeric value of the list entry possessing the lowest value. Returns
    undef if the list is empty.

max

    my $max = FIG::max(@x);
or
    my $max = $fig->max(@x);

Return the maximum numeric value from a list.

x1, x2, ... xN
    List of numbers to process.

RETURN
    Returns the numeric value of the list entry possessing the highest value. Returns
    undef if the list is empty.

between

    my $flag = FIG::between($x, $y, $z);
or
    my $flag = $fig->between($x, $y, $z);

Determine whether or not $y is between $x and $z.

x
    First edge number.

y
    Number to examine.

z
    Second edge number.

RETURN
    Return TRUE if the number $y is between the numbers $x and $z. The check
    is inclusive (that is, if $y is equal to $x or $z the function returns
    TRUE), and the order of $x and $z does not matter. If $x is lower than
    $z, then the return is TRUE if $x <= $y <= $z. If $z is lower,
    then the return is TRUE if $x >= $y >= $z.

standard_genetic_code

    my $code = FIG::standard_genetic_code();

Return a hash containing the standard translation of nucleotide triples to proteins.
Methods such as translate can take a translation scheme as a parameter. This method
returns the default translation scheme. The scheme is implemented as a reference to a
hash that contains nucleotide triplets as keys and has protein letters as values.

translate

    my $aa_seq = &FIG::translate($dna_seq, $code, $fix_start);

Translate a DNA sequence to a protein sequence using the specified genetic code.
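
As a rough illustration of the translation scheme in this entry, the following self-contained sketch builds a codon table of the kind standard_genetic_code is described as returning and translates a sequence, with optional special handling of an initial TTG or GTG. It is not FIG's implementation: the names build_standard_code and translate_sketch are hypothetical, the table is generated from the conventional TCAG codon ordering, and rendering unknown codons as x is an assumption made only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: build the standard genetic code as a hash reference
# keyed by upper-case nucleotide triplets. The amino acid string follows the
# conventional T, C, A, G ordering of the three codon positions.
sub build_standard_code {
    my @bases = qw(T C A G);
    my @amino = split //,
        'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
    my %code;
    my $i = 0;
    for my $b1 (@bases) {
        for my $b2 (@bases) {
            for my $b3 (@bases) {
                $code{"$b1$b2$b3"} = $amino[$i++];
            }
        }
    }
    return \%code;
}

# Hypothetical stand-in for a translate routine: $code defaults to the
# standard table, and a true $fix_start turns an initial TTG or GTG into M.
sub translate_sketch {
    my ($dna_seq, $code, $fix_start) = @_;
    $code ||= build_standard_code();
    my $dna     = uc $dna_seq;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length $dna; $pos += 3) {
        my $codon = substr($dna, $pos, 3);
        # Assumption: unknown codons become 'x' rather than raising an error.
        my $aa = exists $code->{$codon} ? $code->{$codon} : 'x';
        $aa = 'M' if $fix_start && $pos == 0 && $codon =~ /^[TG]TG$/;
        $protein .= $aa;
    }
    return $protein;
}

print translate_sketch('TTGAAATAA'), "\n";            # initial TTG read as L
print translate_sketch('TTGAAATAA', undef, 1), "\n";  # initial TTG read as M
```

Passing the table as a hash reference keeps the sketch compatible with alternative genetic codes: a caller can copy the standard table, change the entries that differ, and hand the result in as the second argument.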
If $fix_start is TRUE, an initial TTG or GTG codon is translated to
M. (In the standard genetic code, these two combinations normally translate
to L and V, respectively.)

dna_seq
    DNA sequence to translate. Note that the DNA sequence can only contain
    known nucleotides.

code
    Reference to a hash specifying the translation code. The hash is keyed by
    nucleotide triples, and the value for each key is the corresponding protein
    letter. If this parameter is omitted, the standard_genetic_code will be
    used.

fix_start
    TRUE if the first triple is to get special treatment, else FALSE. If TRUE,
    then a value of TTG or GTG in the first position will be translated to
    M instead of the value specified in the translation code.

RETURN
    Returns a string resulting from translating each nucleotide triple into a
    protein letter.

reverse_comp

    my $dnaR = FIG::reverse_comp($dna);
or
    my $dnaR = $fig->reverse_comp($dna);

Return the reverse complement of the specified DNA sequence.

NOTE: for extremely long DNA strings, use rev_comp, which allows you to
pass the strings around in the form of pointers.

dna
    DNA sequence whose reverse complement is desired.

RETURN
    Returns the reverse complement of the incoming DNA sequence.

rev_comp

    my $dnaRP = FIG::rev_comp(\$dna);
or
    my $dnaRP = $fig->rev_comp(\$dna);

Return the reverse complement of the specified DNA sequence. The DNA sequence
is passed in as a string reference rather than a raw string for performance
reasons. If this is unnecessary, use reverse_comp, which processes strings
instead of references to strings.

dna
    Reference to the DNA sequence whose reverse complement is desired.

RETURN
    Returns a reference to the reverse complement of the incoming DNA sequence.

verify_dir

    FIG::verify_dir($dir);
or
    $fig->verify_dir($dir);

Ensure that the specified directory exists. If it must be created, the permissions will
be set to 0777.

run

    FIG::run($cmd);
or
    $fig->run($cmd);

Run a command. If the command fails, the error will be traced.

run_gathering_output

    FIG::run_gathering_output($cmd, @args);
or
    $fig->run_gathering_output($cmd, @args);

Run a command, gathering the output. This is similar to the backtick
operator, but it does not invoke the shell. Note that the argument list
must be explicitly passed one command line argument per argument to
run_gathering_output.

If the command fails, the error will be traced.

interpret_error_code

    ($exitcode, $signal, $msg) = &FIG::interpret_error_code($rc);

Determine if the given result code was due to a process exiting abnormally
or by receiving a signal.

augment_path

    FIG::augment_path($dirName);

Add a directory to the system path.

This method adds a new directory to the front of the system path. It looks in the
configuration file to determine whether this is Windows or Unix, and uses the
appropriate separator.

dirName
    Name of the directory to add to the path.

read_fasta_record

    my ($seq_id, $seq_pointer, $comment) = FIG::read_fasta_record(\*FILEHANDLE);
or
    my ($seq_id, $seq_pointer, $comment) = $fig->read_fasta_record(\*FILEHANDLE);

Read and parse the next logical record of a FASTA file. A FASTA logical record
consists of multiple lines of text. The first line begins with a > symbol
and contains the sequence ID followed by an optional comment. (NOTE: comments
are currently deprecated, because not all tools handle them properly.) The
remaining lines contain the sequence data.

This method uses a trick to smooth its operation: the line terminator character
is temporarily changed to \n> so that a single read operation brings in
the entire logical record.

FILEHANDLE
    Open handle of the FASTA file. If not specified, STDIN is assumed.

RETURN
    If we are at the end of the file, returns undef. Otherwise, returns a
    three-element list. The first element is the sequence ID, the second is
    a pointer to the sequence data (that is, a string reference as opposed to
    a string), and the third is the comment.

display_id_and_seq

    FIG::display_id_and_seq($id_and_comment, $seqP, $fh);
or
    $fig->display_id_and_seq($id_and_comment, $seqP, $fh);

Display a fasta ID and sequence to the specified open file. This method is designed
to work well with read_fasta_record and rev_comp, because it takes as
input a string pointer rather than a string. If the file handle is omitted it
defaults to STDOUT.

The output is formatted into a FASTA record. The first line of the output is
preceded by a > symbol, and the sequence is split into 60-character
chunks displayed one per line. Thus, this method can be used to produce
FASTA files from data gathered by the rest of the system.

id_and_comment
    The sequence ID and (optionally) the comment from the sequence's FASTA record.

seqP
    Reference to a string containing the sequence. The sequence is automatically
    formatted into 60-character chunks displayed one per line.

fh
    Open file handle to which the ID and sequence should be output. If omitted,
    \*STDOUT is assumed.

display_seq

    FIG::display_seq($seqP, $fh);
or
    $fig->display_seq($seqP, $fh);

Display a fasta sequence to the specified open file. This method is designed
to work well with read_fasta_record and rev_comp, because it takes as
input a string pointer rather than a string. If the file handle is omitted it
defaults to STDOUT.

The sequence is split into 60-character chunks displayed one per line for
readability.

seqP
    Reference to a string containing the sequence.

fh
    Open file handle to which the sequence should be output. If omitted,
    STDOUT is assumed.

flatten_dumper

    FIG::flatten_dumper( $perl_ref_or_object_1, ... );
or
    $fig->flatten_dumper( $perl_ref_or_object_1, ... );

Takes a list of perl references or objects, and "flattens" their Data::Dumper() output
so that it can be printed on a single line.

ec_name

    my $enzymatic_function = $fig->ec_name($ec);

Returns the enzymatic name corresponding to the specified enzyme code.

ec
    Code number for the enzyme whose name is desired. The code number is actually
    a string of digits and periods (e.g. 1.2.50.6).

RETURN
    Returns the name of the enzyme specified by the indicated code, or a null string
    if the code is not found in the database.

all_roles

    my @roles = $fig->all_roles;

Return a list of the known roles. Currently, this is a list of the enzyme codes and names.

The return value is a list of list references. Each element of the big list contains an
enzyme code (EC) followed by the enzymatic name.

expand_ec

    my $expanded_ec = $fig->expand_ec($ec);

Expands "1.1.1.1" to "1.1.1.1 - alcohol dehydrogenase" or something like that.

clean_tmp

    FIG::clean_tmp();

Delete temporary files more than two days old.

We store temporary files in $FIG_Config::temp. There are specific classes of files
that are created and should be saved for at least a few days. This routine can be
invoked to clean out those that are over two days old.

genomes

    my @genome_ids = $fig->genomes($complete, $restrictions, $domain);

Return a list of genome IDs. If called with no parameters, all genome IDs
in the database will be returned.

Genomes are assigned ids of the form X.Y where X is the taxonomic id maintained by
NCBI for the species (not the specific strain), and Y is a sequence digit assigned to
this particular genome (as one of a set with the same genus/species). Genomes also
have versions, but that is a separate issue.

complete
    TRUE if only complete genomes should be returned, else FALSE.

restrictions
    TRUE if only restriction genomes should be returned, else FALSE.

domain
    Name of the domain from which the genomes should be returned. Possible values are
    Bacteria, Virus, Eukaryota, unknown, Archaea, and
    Environmental Sample. If no domain is specified, all domains will be
    eligible.

RETURN
    Returns a list of all the genome IDs with the specified characteristics.

is_complete

    my $flag = $fig->is_complete($genome);

Return TRUE if the genome with the specified ID is complete, else FALSE.

genome
    ID of the relevant genome.

RETURN
    Returns TRUE if there is a complete genome in the database with the specified ID,
else FALSE.

is_genome

    my $flag = $fig->is_genome($genome);

Return TRUE if the specified genome exists, else FALSE.

genome
ID of the genome to test.

RETURN
Returns TRUE if a genome with the specified ID exists in the data store, else FALSE.

assert_genomes

    $fig->assert_genomes(gid, gid, ...);

Assert that the given list of genomes does exist, and allow is_genome() to succeed for them.

This is used in FIG-based computations in the context of the RAST genome-import code, so that
genomes that currently exist only in RAST are treated as present for the purposes of FIG.pm-based
code.

genome_counts

    my ($arch, $bact, $euk, $vir, $env, $unk) = $fig->genome_counts($complete);

Count the number of genomes in each domain. If $complete is TRUE, only complete
genomes will be included in the counts.

complete
TRUE if only complete genomes are to be counted, FALSE if all genomes are to be
counted.

RETURN
A six-element list containing the number of genomes in each of six categories:
Archaea, Bacteria, Eukaryota, Viral, Environmental, and Unknown, respectively.

genome_domain

    my $domain = $fig->genome_domain($genome_id);

Find the domain of a genome.

genome_id
ID of the genome whose domain is desired.

RETURN
Returns the name of the genome's domain (archaea, bacteria, etc.), or undef if
the genome is not in the database.

genome_pegs

    my $num_pegs = $fig->genome_pegs($genome_id);

Return the number of protein-encoding genes (PEGs) for a specified
genome.

genome_id
ID of the genome whose PEG count is desired.

RETURN
Returns the number of PEGs for the specified genome, or undef if the genome
is not indexed in the database.

genome_rnas

    my $num_rnas = $fig->genome_rnas($genome_id);

Return the number of RNA-encoding genes for a genome if
"$genome_id" is indexed in the "genome" database, and undef otherwise.

genome_id
ID of the genome whose RNA count is desired.

RETURN
Returns the number of RNAs for the specified genome, or undef if the genome
is not indexed in the database.

genome_szdna

    my $szdna = $fig->genome_szdna($genome_id);

Return the number of DNA base-pairs in a genome's contigs.

genome_id
ID of the genome whose base-pair count is desired.

RETURN
Returns the number of base pairs in the specified genome's contigs, or undef.

genome_version

    my $version = $fig->genome_version($genome_id);

Return the version number of the specified genome.

Versions are incremented for major updates. They are put in as major
updates of the form 1.0, 2.0, ...

Users may do local "editing" of the DNA for a genome, but when they do,
they increment the digits to the right of the decimal. Two genomes remain
comparable only if the versions match identically. Hence, minor updating should be
committed only by the person/group responsible for updating that genome.

We can, of course, identify which genes are identical between any two genomes (by matching
the DNA or amino acid sequences). However, the basic intent of the system is to
support editing by the main group issuing periodic major updates.

genome_id
ID of the genome whose version is desired.

RETURN
Returns the version number of the specified genome, or undef if the genome is not in
the data store or no version number has been assigned.

genome_md5sum

    my $md5sum = $fig->genome_md5sum($genome_id);

Returns the MD5 checksum of the specified genome.

The checksum of a genome is defined as the checksum of its signature file. The signature
file consists of tab-separated lines, one for each contig, ordered by the contig ID.
Each line contains the contig ID, the length of the contig in nucleotides, and the
MD5 checksum of the nucleotide data, with uppercase letters forced to lower case.

The checksum is indexed in the database. If you know a genome's checksum, you can use
the genome_with_md5sum method to find its ID in the database.

genome
ID of the genome whose checksum is desired.

RETURN
Returns the specified genome's checksum, or undef if the genome is not in the
database.

genome_with_md5sum

    my $genome = $fig->genome_with_md5sum($cksum);

Find a genome with the specified checksum.

The MD5 checksum is computed from the content of the genome (see genome_md5sum). This method
can be used to determine if a genome already exists for a specified content.

cksum
Checksum to use for searching the genome table.

RETURN
The ID of a genome with the specified checksum, or undef if no such genome exists.

contig_md5sum

    my $cksum = $fig->contig_md5sum($genome, $contig);

Return the MD5 checksum for a contig. The MD5 checksum is computed from the content
of the contig. This method retrieves the checksum stored in the database. The checksum
can be compared to the checksum of an external contig as a cheap way of seeing if they
match.

genome
ID of the genome containing the contig.

contig
ID of the relevant contig.

RETURN
Returns the checksum of the specified contig, or undef if the contig is not in the
database.

genus_species

    my $gs = $fig->genus_species($genome_id);

Return the genus, species, and possibly also the strain of a specified genome.

This method converts a genome ID into a more recognizable species name. The species name
is stored directly in the genome table of the database. Essentially, if the strain is
present in the database, it will be returned by this method, and if it's not present,
it won't.

genome_id
ID of the genome whose name is desired.

RETURN
Returns the scientific species name associated with the specified ID, or undef if the
ID is not in the database.

set_genus_species

    my $gs = $fig->set_genus_species($genome_id, $genus_species_strain);

Sets the contents of the GENOME file of the specified genome ID.

Does not (currently) update the relational DB.

genome_id
ID of the genome whose name is to be set.

genus_species_strain
The new biological name that will correspond to the genome_id.

RETURN
Returns 1 if the write was successful, and undef if the write fails.

org_of

    my $org = $fig->org_of($prot_id);

Return the genus/species name of the organism containing a protein. Note that in this context
protein is not a certain string of amino acids but a protein-encoding region on a specific
contig.

For a FIG protein ID (e.g. fig|134537.1.peg.123), the organism and strain
information is always available. In the case of external proteins, we can usually
determine an organism, but not anything more precise than genus/species (and
often not that). When the organism name is not present, a null string is returned.

prot_id
Protein or feature ID.

RETURN
Returns the displayable scientific name (genus, species, and strain) of the organism containing
the identified PEG. If the name is not available, returns a null string. If the PEG is not found,
returns undef.

orgid_of_orgname

    my $genomeID = $fig->orgid_of_orgname($genomeName);

Return the ID of the genome corresponding to the specified organism name, or a
null string if the genome is not found.

genomeName
Name of the organism, consisting of the organism's genus, species, and
unique characterization, separated by spaces.

RETURN
Returns the genome ID number for the named organism, or an empty string if
the genome is not found.

orgname_of_orgid

    my $genomeName = $fig->orgname_of_orgid($genomeID);

Return the name of the genome corresponding to the specified organism ID.

genomeID
ID of the relevant genome.

RETURN
Returns the name of the organism, consisting of the organism's genus, species, and
unique characterization, separated by spaces, or a null string if the genome is not
found.

genus_species_domain

    my ($gs, $domain) = $fig->genus_species_domain($genome_id);

Returns a genome's genus and species (and strain if that has been properly
recorded) in a printable form, along with its domain. This method is similar
to genus_species, except it also returns the domain name (archaea,
bacteria, etc.).

genome_id
ID of the genome whose species and domain information is desired.

RETURN
Returns a two-element list. The first element is the species name and the
second is the domain name.

domain_color

    my $web_color = FIG::domain_color($domain);

Return the web color string associated with a specified domain. The colors are
extremely subtle (86% luminance), so they absolutely require a black background.
Archaea are slightly cyan, bacteria are slightly magenta, eukaryota are slightly
yellow, viruses are slightly silver, environmental samples are slightly gray,
and unknown or invalid domains are pure white.

domain
Name of the domain whose color is desired.

RETURN
Returns a web color string for the specified domain (e.g. #FFDDFF for
bacteria).

org_and_color_of

    my ($org, $color) = $fig->org_and_color_of($prot_id);

Return the best-guess organism and domain HTML color string of an organism.
In the case of external proteins, we can usually determine an organism, but not
anything more precise than genus/species (and often not that).

prot_id
Relevant protein or feature ID.

RETURN
Returns a two-element list. The first element is the displayable organism name, and the second
is an HTML color string based on the domain (see domain_color).

partial_genus_matching

Return a list of genome IDs that match a partial genus.

For example, partial_genus_matching("Listeria") will return the IDs of all genomes whose names begin with Listeria. The search can also be restricted to complete genomes with a second argument, like this: partial_genus_matching("Listeria", 1).

abbrev

    my $abbreviated_name = FIG::abbrev($genome_name);

or

    my $abbreviated_name = $fig->abbrev($genome_name);

Abbreviate a genome name to 10 characters or less.

For alignments and such, it is very useful to be able to produce an abbreviation of genus/species.
That's what this does. Note that multiple genus/species might reduce to the same abbreviation, so
be careful (disambiguate them, if you must).

The abbreviation is formed from the first three letters of the species name followed by the
first three letters of the genus name followed by the first three letters of the species name and
then the next four nonblank characters.

genome_name
The name to abbreviate.

RETURN
An abbreviated version of the specified name.

wikipedia_link

    my $wikipedia_link = $fig->wikipedia_link($genome_name);

Check if Wikipedia has a page about this genome. If so, return its URL.

genome_name
The genome to find.

RETURN
The URL of the Wikipedia page.

organism_directory

    my $organism_directory = $fig->organism_directory($genome_id);

Get the directory that contains the organism data. This is just like the
FIGV version.

genome_id
The ID of the organism, e.g. 83333.1.

RETURN
A string containing the path to the organism directory.

ncbi_contig_description

    my $name = ncbi_contig_description($contig_id);

Looks up the NCBI description line for this contig identifier. Values are cached
in the directory $FIG_Config::var/ncbi_contigs.

ftype

    my $type = FIG::ftype($fid);

or

    my $type = $fig->ftype($fid);

Returns the type of a feature, given the feature ID. This just amounts
to lifting it out of the feature ID, since features have IDs of the form

    fig|x.y.f.n

where

x.y is the genome ID
f is the type of feature
n is an integer that is unique within the genome/type

fid
FIG ID of the feature whose type is desired.

RETURN
Returns the feature type (e.g. peg, rna, pi, or pp), or undef if the
feature ID is not a FIG ID.

genome_of

    my $genome_id = $fig->genome_of($fid);

or

    my $genome_id = FIG::genome_of($fid);

Return the genome ID from a feature ID.

fid
ID of the feature whose genome ID is desired.

RETURN
If the feature ID is a FIG ID, returns the genome ID embedded inside it; otherwise, it
returns undef.

genome_and_peg_of

    my ($genome_id, $peg_number) = FIG::genome_and_peg_of($fid);

    my ($genome_id, $peg_number) = $fig->genome_and_peg_of($fid);

Return the genome ID and peg number from a feature ID.

prot_id
ID of the feature whose genome and PEG number are desired.

RETURN
Returns the genome ID and peg number associated with a feature if the feature
is represented by a FIG ID, else undef.

by_fig_id

    my @sorted_by_fig_id = sort { FIG::by_fig_id($a,$b) } @fig_ids;

Compare two feature IDs.

This function is designed to assist in sorting features by ID. The sort is by
genome ID followed by feature type and then feature number.

a
First feature ID.

b
Second feature ID.

RETURN
Returns a negative number if the first parameter is smaller, zero if both parameters
are equal, and a positive number if the first parameter is greater.

by_genome_id

    my @sorted_by_genome_id = sort { FIG::by_genome_id($a,$b) } @genome_ids;

Compare two genome IDs.

This function is designed to assist in sorting genomes by ID.

a
First genome ID.

b
Second genome ID.

RETURN
Returns a negative number if the first parameter is smaller, zero if both parameters
are equal, and a positive number if the first parameter is greater.

genes_in_region

    my ($features_in_region, $beg1, $end1) = $fig->genes_in_region($genome, $contig, $beg, $end, $size_limit);

Locate features that overlap a specified region of a contig. This includes features that begin or end
outside that region, just so long as some part of the feature can be found in the region of interest.

It is often important to be able to find the genes that occur in a specific region of
a chromosome. This routine is designed to provide this information. It returns all genes
that overlap positions from $beg through $end in the specified contig.

The $size_limit parameter limits the search process. It is presumed that no features are longer than the
specified size limit. A shorter size limit means you'll miss some features; a longer size limit significantly
slows the search process. For prokaryotes, a value of 10000 (the default) seems to work best.

genome
ID of the genome containing the relevant contig.

contig
ID of the relevant contig.

beg
Position of the first base pair in the region of interest.

end
Position of the last base pair in the region of interest.

size_limit
Maximum allowable size for a feature. If omitted, 10000 is assumed.

RETURN
Returns a three-element list. The first element is a reference to a list of the feature IDs found. The second
element is the position of the leftmost base pair in any feature found. This may be well before the region of
interest begins or it could be somewhere inside. The third element is the position of the rightmost base pair
in any feature found. Again, this can be somewhere inside the region or it could be well to the right of it.

regions_spanned

    my ( [ $contig, $beg, $end ], ... ) = $fig->regions_spanned( $loc );

or

    my ( [ $contig, $beg, $end ], ... ) = FIG::regions_spanned( $loc );

The location of a feature in a scalar context is a comma-delimited string of contigID_beg_end specifiers (see feature_location). This routine takes as input a scalar location in the above form
and reduces it to one or more regions spanned by the gene. This
involves combining regions in the location string that are on the
same contig and going in the same direction. Unlike boundaries_of,
which returns one region in which the entire gene can be found,
regions_spanned handles wrapping through the origin, features
split over contigs and exons that are not ordered nicely along
the chromosome (ugly but true).

loc
The location string for a feature.

RETURN
Returns a list of list references. Each inner list contains a contig ID, a starting
position, and an ending position. The starting position may be numerically greater
than the ending position (which indicates a backward-traveling gene). It is
guaranteed that the entire feature is covered by the regions in the list.

filter_regions

    my @regions = FIG::filter_regions( $contig, $min, $max, @regions );

or

    my $regions = FIG::filter_regions( $contig, $min, $max, @regions );

or

    my @regions = FIG::filter_regions( $contig, $min, $max, \@regions );

or

    my $regions = FIG::filter_regions( $contig, $min, $max, \@regions );

Filter a list of regions to those that overlap a specified section of a
particular contig. Region definitions correspond to those produced
by regions_spanned. That is, [contig,beg,end].
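
As an illustration of the [contig,beg,end] convention, the following is a minimal sketch of an overlap test on such triples. It is not the FIG.pm implementation: the helper name region_overlaps and the sample data are hypothetical, and the sketch assumes that an undefined contig, minimum, or maximum places no restriction on the match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FIG.pm): does a [contig, beg, end] region
# overlap positions $min..$max on $contig? An undef argument matches anything.
sub region_overlaps {
    my ($region, $contig, $min, $max) = @_;
    my ($r_contig, $beg, $end) = @$region;
    # A backward-traveling region has beg > end, so normalize to left/right.
    my ($left, $right) = $beg <= $end ? ($beg, $end) : ($end, $beg);
    return 0 if defined $contig && $r_contig ne $contig;
    return 0 if defined $min && $right < $min;
    return 0 if defined $max && $left > $max;
    return 1;
}

# Made-up sample regions for illustration only.
my @regions = (
    [ 'NC_003904', 100,  900 ],
    [ 'NC_003904', 7000, 6000 ],   # backward-traveling
    [ 'NC_000001', 4500, 5200 ],
);

# Keep regions touching the first 5000 base pairs of any contig.
my @hits = grep { region_overlaps($_, undef, 1, 5000) } @regions;
print join(",", map { "$_->[0]_$_->[1]_$_->[2]" } @hits), "\n";
```

Normalizing to leftmost and rightmost positions first means backward-traveling regions need no separate branch in the comparison.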
In the function call, either $contig or $min and $max can be
undefined (permitting anything). So, for example,

    my @regions = FIG::filter_regions(undef, 1, 5000, $regionList);

would return all regions in $regionList that overlap the first
5000 base pairs in any contig. Conversely,

    my @regions = FIG::filter_regions('NC_003904', undef, undef, $regionList);

would return all regions on the contig NC_003904.

contig
ID of the contig whose regions are to be passed by the filter, or undef

min
Leftmost position of the region used for filtering. Only regions which contain
at least one base pair at or beyond this position will be passed. A value
of undef is equivalent to zero.

max
Rightmost position of the region used for filtering. Only regions which contain
at least one base pair on or before this position will be passed. A value
of undef is equivalent to the length of the contig.

regionList
A list of regions, or a reference to a list of regions. Each region is a
reference to a three-element list, the first element of which is a contig
ID, the second element of which is the start position, and the third
element of which is the ending position. (The ending position can be
before the starting position if the region is backward-traveling.)

RETURN
In a scalar context, returns a reference to a list of the filtered regions.
In a list context, returns the list itself.

close_genes

    my @features = $fig->close_genes($fid, $dist);

Return all features within a certain distance of a specified other feature.

This method is a quick way to get genes that are near another gene. It calls
boundaries_of to get the boundaries of the incoming gene, then passes
the region computed to genes_in_region.

So, for example, if the specified $dist is 500, the method would select
a region that extends 500 base pairs to either side of the boundaries of
the gene $fid, and pass it to genes_in_region for analysis. The
features returned would be those that overlap the selected region. Note
that the flaws inherent in genes_in_region are also inherent in this
method: if a feature is more than 10000 base pairs long, it may not
be caught even though it has an overlap in the specified region.

fid
ID of the relevant feature.

dist
Desired maximum distance.

RETURN
Returns a list of feature IDs for genes that overlap or are close to the boundaries
for the specified incoming feature.

adjacent_genes

    my ($left_fid, $right_fid) = $fig->adjacent_genes($fid, $dist);

Return the IDs of the genes immediately to the left and right of a specified
feature.

This method gets a list of the features within the specified distance of
the incoming feature (using close_genes), and then chooses the two
closest to the feature found. If the incoming feature is on the + strand,
these are features to the left and the right. If the incoming feature is
on the - strand, the features will be returned in reverse order.

fid
ID of the feature whose neighbors are desired.

dist
Maximum permissible distance to the neighbors.

RETURN
Returns a two-element list containing the IDs of the features on either side
of the incoming feature.

compute_genome_similarity

Compute a rough estimate of "similarity" between genomes using the following algorithm:

if the % identity > 70, count a "too-similar{Genome2}"
else count a "not-too-similar{Genome2}"

For each Genome2, if the "too-similar{Genome2}" count > "not-too-similar{Genome2}" count,
the Genome1-Genome2 matches are too similar.
else, they are not

Used for filtering candidate PCHs in remove_clustered_pchs2.pl.

univ_hash
Hash where the keys are the annotations for the universal proteins to be used
in the similarity computation.

match_len
Minimum length of similarity match required to be considered for genome similarity.

num_genes
Number of genes to consider for the computation.

RETURN
List of lists of the form [genome2, is-similar, count of too-similar hits, count of not-too-similar hits]

feature_location

    my $loc = $fig->feature_location($fid);

or

    my @loc = $fig->feature_location($fid);

Return the location of a feature. The location consists
of a list of (contigID, begin, end) triples encoded
as strings with an underscore delimiter. So, for example,
NC_002755_100_199 indicates a location starting at position
100 and extending through 199 on the contig NC_002755. If
the location goes backward, the start location will be higher
than the end location (e.g. NC_002755_199_100).

In a scalar context, this method returns the locations as a
comma-delimited string.

In a list context, the locations are returned as a list.

fid
ID of the feature whose location is desired.

RETURN
Returns the locations of a feature, either as a comma-delimited
string or a list.

contig_of

    my $contigID = $fig->contig_of($location);

Return the ID of the contig containing a location.

This method only works with SEED-style locations (contigID_beg_end).
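
To make the SEED-style format concrete, here is a minimal sketch of splitting such a location into its parts. It is not FIG.pm's own parsing code: the helper name parse_seed_location is hypothetical, and deriving the strand from whether the begin point exceeds the end point is an assumption based on the backward-location convention described for feature_location.

```perl
use strict;
use warnings;

# Hypothetical helper (not FIG.pm's own code): split one SEED-style location
# into (contig, beg, end, strand). Returns an empty list on malformed input.
sub parse_seed_location {
    my ($location) = @_;
    # Only the first location of a comma-delimited list is examined.
    my ($first) = split /,/, $location;
    return unless defined $first;
    # Greedy (.+) keeps underscores inside the contig ID (e.g. NC_002755).
    my ($contig, $beg, $end) = $first =~ /^(.+)_(\d+)_(\d+)$/
        or return;
    # Assumption: begin <= end means forward (+), otherwise backward (-).
    my $strand = $beg <= $end ? '+' : '-';
    return ($contig, $beg, $end, $strand);
}

my @fwd = parse_seed_location('NC_002755_100_199');
my @rev = parse_seed_location('NC_002755_199_100,NC_002755_90_50');
print "@fwd\n";   # NC_002755 100 199 +
print "@rev\n";   # NC_002755 199 100 -
```

The pattern is anchored on the last two numeric fields because contig IDs can themselves contain underscores, so splitting on the first underscore would give the wrong contig.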
For more comprehensive location parsing, use the Location object.

location
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.

RETURN
Returns the contig ID from the first location in the incoming string.

beg_of

    my $beg = $fig->beg_of($location);

Return the beginning point of a location.

This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.

location
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.

RETURN
Returns the beginning point from the first location in the incoming string.

end_of

    my $end = $fig->end_of($location);

Return the ending point of a location.

This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.

location
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.

RETURN
Returns the ending point from the first location in the incoming string.

upstream_of

    my $dna = $fig->upstream_of($peg, $upstream, $coding);

Return the DNA immediately upstream of a feature. This method contains code lifted from
the upstream.pl script.

peg
ID of the feature whose upstream DNA is desired.

upstream
Number of base pairs considered upstream.

coding
Number of base pairs inside the feature to be included in the upstream region.

RETURN
Returns the DNA sequence upstream of the feature's begin point and extending into the coding
region. Letters inside a feature are in upper case and inter-genic letters are in lower case.
A hyphen separates the true upstream letters from the coding region.

strand_of

    my $strand = $fig->strand_of($location);

Return the strand (+ or -) of a location.

This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.

location
A comma-delimited list of SEED-style locations (contigID_beg_end).

RETURN
Returns + if the list describes a forward-oriented location, and - if the list
describes a backward-oriented location.

find_contig_with_checksum

    my $contigID = $fig->find_contig_with_checksum($genome, $checksum);

Find a contig in the given genome with the given checksum.

This method is useful for determining if a particular contig has already been
recorded for the given genome. The checksum is computed from the contig contents,
so a matching checksum indicates that the contigs may have the same content.

genome
ID of the genome whose contigs are to be examined.

checksum
Checksum value for the desired contig.

RETURN
Returns the ID of a contig in the given genome that has the caller-specified checksum,
or undef if no such contig exists.

contig_checksum

    my $checksum = $fig->contig_checksum($genome, $contig);

or

    my @checksum = $fig->contig_checksum($genome, $contig);

Return the checksum of the specified contig. The checksum is computed from the
contig's content in a parallel process. The process returns a space-delimited list
of numbers. The numbers can be split into a real list if the method is invoked in
a list context. For b

read_contig

Read a single contig from the contigs file.

boundaries_of

    usage: ($contig,$beg,$end) = $fig->boundaries_of($loc)

The location of a feature in a scalar context is a comma-delimited string of contigID_beg_end specifiers (see feature_location). This routine takes as input such a location and reduces it to a single
description of the entire region containing the gene.

all_features_detailed

    my $featureList = $fig->all_features_detailed($genome);

Returns a list of all features in the designated genome, with their location, alias,
and type information included. This is used in the GenDB import and Sprout load to
speed up the process.

Deleted features are not returned!

genome
ID of the genome whose features are desired.

RETURN
Returns a reference to a list of tuples. Each tuple consists of four elements: (1) the feature
ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature
aliases (as a comma-delimited list of named aliases), and (4) the feature type.

all_features_detailed_fast

    my $featureList = $fig->all_features_detailed_fast($genome, $min, $max, $contig);

Returns a list of all features in the designated genome, with various useful information
included.

Deleted features are not returned!

genome
ID of the genome whose features are desired.

min (optional)
If specified, the minimum contig location of interest. Features not entirely to the right
of this location are ignored.

max (optional)
If specified, the maximum contig location of interest. Features not entirely to the left
of this location are ignored.

contig (optional)
If specified, the contig of interest. Features not on this contig are ignored.

RETURN
Returns a reference to a list of tuples. Each tuple consists of nine elements: (1) the feature
ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature
aliases (as a comma-delimited list of named aliases), (4) the feature type, (5) the leftmost
index of the feature's first location, (6) the rightmost index of the feature's last location,
(7) the current functional assignment, (8) the user who made the assignment, and (9) the
quality of the assignment (which is usually an empty string).

all_features

    my @fidList = $fig->all_features($genome,$type);

Returns a list of all feature IDs of a specified type in the designated genome. You would
usually use just pegs_of or rnas_of, which simply invoke this routine.

genome
ID of the genome whose features are desired.

type (optional)
Type of feature desired (peg, rna, etc.). If omitted, all features are returned.

RETURN
Returns a list of the IDs for the desired features.

pegs_of

    usage: $fig->pegs_of($genome)

Returns a list of all PEGs in the specified genome. Note that order is not
specified.

rnas_of

    usage: $fig->rnas_of($genome)

Returns a list of all RNAs for the given genome.

feature_aliases

    usage: @aliases = $fig->feature_aliases($fid) OR
           $aliases = $fig->feature_aliases($fid)

Returns a list of aliases (gene IDs, arbitrary numbers assigned by authors, etc.) for the feature.
These must come from the tbl files, so add them there if you want to see them here.

In a scalar context, the aliases come back with commas separating them.

uniprot_aliases_bulk

    my $hash = $fig->uniprot_aliases_bulk(\@fids, $no_del_check);

Return a hash mapping the specified feature IDs to lists of their UniProt
aliases.

fids
A list of FIG feature IDs.

no_del_check
If TRUE, deleted feature IDs will not be removed from the feature ID list
before processing. The default is FALSE, which means deleted feature IDs
will be removed before processing.

RETURN
Returns a hash mapping each feature ID to a list of its UniProt aliases.

rewrite_db_xrefs_brc

Convert an alias to a db_xref. This uses the BRC format db_xref, which is a conglomeration of NCBI, GO, and BioMoby.

This method will return a correctly formatted db_xref if the argument is one of our currently recognized formats; otherwise it returns undef.

This example code should provide the functions you want:

    foreach my $alias ($fig->feature_aliases($peg)) {
        if (my $dbxref = $fig->rewrite_db_xrefs_brc($alias)) {
            print "The dbxref is $dbxref\n";
        }
        else {
            print "The alias is $alias\n";
        }
    }

For a list of approved dbxrefs, see http://www.brc-central.org/cgi-bin/brc-central/dbxref_list.cgi

by_alias

    usage: $peg = $fig->by_alias($alias)

Returns a FIG ID if the alias can be converted. Right now we convert aliases
of the form NP_* (RefSeq IDs), gi|* (GenBank IDs), sp|* (Swiss Prot), uni|* (UniProt),
kegg|* (KEGG) and maybe a few more.

by_raw_alias

    usage: $peg = $fig->by_raw_alias($alias)

Returns all FIG IDs having the given alias. Unlike by_alias, we do not attempt any
kind of normalization. I'm not sure this function is needed, but by_alias searches
only in ext_alias table whereas here I'm searching in the features table. ext_alias
does not have all external aliases which is keeping my code from working. In particular,
it lacks EnsemblGene. It would be nice to combine these two functions. -E

    sub by_raw_alias {
        my($self,$alias) = @_;
        my($rdbH,$relational_db_response);
        my ($peg);

        $rdbH = $self->db_handle;
        if (($relational_db_response = $rdbH->SQL("SELECT id FROM features WHERE aliases LIKE \'%,$alias,%\'")) && (@$relational_db_response > 0)) {
            if (@$relational_db_response == 1) {
                $peg = $relational_db_response->[0]->[0];
                return wantarray() ? ($peg) : $peg;
            } elsif (wantarray()) {
                return map { $_->[0] } @$relational_db_response;
            }
        }
        return wantarray() ? () : "";
    }

    sub to_alias {
        my($self,$fid,$type) = @_;
        my @aliases = $self->feature_aliases($fid);
        if ($type) {
            @aliases = grep { $_ =~ /^$type\|/ } @aliases;
        }
        if (wantarray()) {
            return @aliases;
        }
        elsif (@aliases > 0) {
            return $aliases[0];
        }
        else {
            return "";
        }
    }

possibly_truncated

    usage: $fig->possibly_truncated($feature_id) or $fig->possibly_truncated($genome, $loc)

Returns true iff the feature or location occurs near the end of a contig.

map_peg_to_ids

    my ($gnum, $pnum) = $fig->map_peg_to_ids($peg);

Map a peg ID to a pair of numbers describing that peg.

In order to conserve storage and increase performance for some operations (the
functional coupling computation, for instance), we provide a mechanism by which a full peg
(of the form fig|X.Y.peg.Z) is mapped to a pair of integers: a genome number and a PEG
index. We maintain a table genome_mapping that retains the mapping between genome ID
and local genome number. No effort is expended to ensure this mapping is at all coherent
between SEED instances; this is purely a local mechanism for performance enhancement.

$peg
ID of the peg to be mapped.

RETURN
A pair of numbers ($gnum, $pnum)

abstract_coupled_to

    my @coupled_to = $fig->abstract_coupled_to($peg);

Return a list of functionally coupled PEGs.

peg
ID of the protein-encoding gene whose functionally-coupled proteins are desired.

RETURN
Returns a list of 4-tuples, each consisting of the ID of a coupled
PEG, a score, a "type" which indicates the method that produced the
score, and "extra data" in the form of a pointer to a list. If there
are no PEGs functionally coupled to the incoming PEG, it will return
an empty list. If the PEG data is not present, it will return an empty list.

coupled_to

    my @coupled_to = $fig->coupled_to($peg);

Return a list of functionally coupled PEGs.

The new form of coupling and evidence computation is based on precomputed data.
The old form took minutes to dynamically compute things when needed. The old form
still works, if the directory Data/CouplingData is not present. If it is present,
it is assumed to contain comprehensive coupling data in the form of precomputed scores
and PCHs.

If Data/CouplingData is present, this routine returns a list of 2-tuples [Peg,Sc]. It
returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData
is not there.

peg
ID of the protein-encoding gene whose functionally-coupled proteins are desired.

RETURN
Returns a list of 2-tuples, each consisting of the ID of a coupled PEG and a score. If
there are no PEGs functionally coupled to the incoming PEG, it will return an empty
list. If the PEG data is not present, it will return undef.

coupling_evidence

    usage: @evidence = $fig->coupling_evidence($peg1,$peg2)

The new form of coupling and evidence computation is based on precomputed data.
The old form took minutes to dynamically compute things when needed. The old form
still works if the directory Data/CouplingData is not present. If it is present,
it is assumed to contain comprehensive coupling data in the form of precomputed scores
and PCHs.

If Data/CouplingData is present, this routine returns a list of 3-tuples [Peg3,Peg4,Rep].
Here, Peg3 is similar to Peg1, Peg4 is similar to Peg2, and Rep == 1 iff this is a
"representative pair". That is, we take all pairs and create a representative set
in which each pair is not "too close" to any other pair in the representative set.
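
As a small illustration of the Rep flag, the sketch below keeps only the representative pairs from a list of [Peg3,Peg4,Rep] tuples. It runs on made-up data rather than on a FIG object; the helper name representative_pairs and the sample IDs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of FIG.pm): keep only the tuples whose
# third element, Rep, equals 1.
sub representative_pairs {
    my (@evidence) = @_;
    return grep { $_->[2] == 1 } @evidence;
}

# Made-up evidence in the documented [Peg3,Peg4,Rep] shape.
my @evidence = (
    ['fig|100226.1.peg.5',  'fig|100226.1.peg.6',  1],
    ['fig|100227.1.peg.9',  'fig|100227.1.peg.10', 0],
    ['fig|224308.1.peg.20', 'fig|224308.1.peg.21', 1],
);

my @reps = representative_pairs(@evidence);
printf "%d of %d pairs are representative\n", scalar(@reps), scalar(@evidence);
```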
Think of "too close" as being roughly 95% identical at the DNA level. This keeps (usually)
a single pair from a set of different genomes from the same species.

It returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData
is not there.

coupling_and_evidence

usage: @coupling_data = $fig->coupling_and_evidence($fid,$bound,$sim_cutoff,$coupling_cutoff,$keep_record)

A computation of couplings and evidence starts with a given peg and produces a list of
3-tuples. Each 3-tuple is of the form [Score,CoupledToFID,Evidence].

Evidence is a list of 2-tuples of FIDs that are close in other genomes (producing
a "pair of close homologs" of [$peg,CoupledToFID]). The maximum score for a single
PCH is 1, but "Score" is the sum of the scores for the entire set of PCHs.

NOTE: once the new version of precomputed coupling is installed (i.e., when Data/CouplingData
is filled with the precomputed relations), the parameters on computing evidence are ignored.

If $keep_record is true, the system records the information, asserting coupling for each
of the pairs in the set of evidence, and asserting a pin from the given $fid through all
of the PCH entries used in forming the score.

add_chr_clusters_and_pins

usage: $fig->add_chr_clusters_and_pins($peg,$hits)

The system supports retaining data relating to functional coupling. If a user
computes evidence once and then saves it with this routine, data relating to
both "the pin" and the "clusters" (in all of the organisms supporting the
functional coupling) will be saved.

$hits must be a pointer to a list of 3-tuples of the sort returned by
$fig->coupling_and_evidence.

translatable

usage: $fig->translatable($prot_id)

The system takes any number of sources of protein sequences as input (and builds an nr
for the purpose of computing similarities). For each of these input fasta files, it saves
(in the DB) a filename, seek address and length so that it can go get the translation if
needed. This routine simply returns true iff info on the translation exists.

translation_length

usage: $len = $fig->translation_length($prot_id)

The system takes any number of sources of protein sequences as input (and builds an nr
for the purpose of computing similarities). For each of these input fasta files, it saves
(in the DB) a filename, seek address and length so that it can go get the translation if
needed. This routine returns the length of a translation. This does not require actually
retrieving the translation.

get_translation

my $translation = $fig->get_translation($prot_id);

The system takes any number of sources of protein sequences as input (and builds an nr
for the purpose of computing similarities). For each of these input fasta files, it saves
(in the DB) a filename, seek address and length so that it can go get the translation if
needed. This routine returns the stored protein sequence of the specified PEG feature.

prot_id
ID of the feature (PEG) whose translation is desired.

RETURN
Returns the protein sequence string for the specified feature.

mapped_prot_ids

usage: @mapped = $fig->mapped_prot_ids($prot)

This routine is at the heart of maintaining synonyms for protein sequences. The system
determines which protein sequences are "essentially the same". These may differ in length
(presumably due to miscalled starts), but the tails are identical (and the heads are not "too" extended).
Anyway, the set of synonyms is returned as a list of 2-tuples [Id,length] sorted
by length.

function_of

my $function = $fig->function_of($id, $user);

or

my @functions = $fig->function_of($id);

In a scalar context, returns the most recently-determined functional
assignment of a specified feature by a particular user. In a list
context, returns a list of 2-tuples, each consisting of a user ID
followed by a functional assignment by that user. In this case,
the list contains all the functional assignments for the feature.

id
ID of the relevant feature.

user
ID of the user whose assignment is desired (scalar context only).

RETURN
Returns the most recent functional assignment by the given user in scalar
context, and a list of functional assignments in list context. Each assignment
in the list context is a 2-tuple of the form [$user, $assignment].

function_of_bulk

my $functionHash = $fig->function_of_bulk(\@fids, $no_del_check);

Return a hash mapping the specified proteins to their master functional assignments.

fids
Reference to a list of feature IDs.

no_del_check
If TRUE, then deleted features will not be removed from the list. The default
is FALSE, which means deleted features will be removed from the list.

RETURN
Returns a reference to a hash mapping feature IDs to their main functional assignments.

translated_function_of

usage: $function = $fig->translated_function_of($peg,$user)

You get just the translated function.

translate_function

usage: $translated_func = $fig->translate_function($func)

Translates a function based on the function.synonyms table.

assign_function

usage: $fig->assign_function($peg,$user,$function,$confidence)

Assigns a function. Note that confidence can (and should be if unusual) be included.
Note that no annotation is written. This should normally be done in a separate
call of the form

    $userR = $user;
    $userR =~ s/^master://;
    $fig->add_annotation($fid,$userR,"Set master function to\n$function\n");

nsims

New sims code.

This code takes advantage of a network similarity server if it is available.

We gather sims in the following manner:

osims

usage: @sims = $fig->osims($peg,$maxN,$maxP,$select,$max_expand, $filters)

Returns a list of similarities for $peg such that

By "expanded", we refer to taking a "raw similarity" against an entry in the non-redundant
protein collection, and converting it to a set of similarities (one for each of the
proteins that are essentially identical to the representative in the nr).

Each entry in @sims is a reference to an array. These are the values in each array position:

bbhs

my @bbhList = $fig->bbhs($peg, $cutoff);

Return a list of the bi-directional best hits relevant to the specified PEG.

peg
ID of the feature whose bidirectional best hits are desired.

cutoff
Similarity cutoff. If omitted, 1e-10 is used.

RETURN
Returns a list of 3-tuples. The first element of each tuple is the best-hit PEG; the second element is the score. A lower score indicates a better match. The third element is the normalized bit score for the pair, and is normalized to the length of the protein.

bbh_list

my $bbhHash = $fig->bbh_list($genomeID, \@featureList);

Return a hash mapping the features in a specified list to their bidirectional best hits
on a specified target genome.

(Modeled after the Sprout call of the same name.)

genomeID
ID of the genome from which the best hits should be taken.

featureList
List of the features whose best hits are desired.

RETURN
Returns a reference to a hash that maps the IDs of the incoming features to the best hits
on the target genome.

dsims

usage: @sims = $fig->dsims($peg,$maxN,$maxP,$select)

Returns a list of similarities for $peg such that

By "expanded", we refer to taking a "raw similarity" against an entry in the non-redundant
protein collection, and converting it to a set of similarities (one for each of the
proteins that are essentially identical to the representative in the nr).

The "dsims" or "dynamic sims" are not precomputed. They are computed using a heuristic which
is much faster than blast, but misses some similarities. Essentially, you have an "index" of
representative sequences, a quick blast is done against it, and if there are any hits these are
used to indicate which sub-databases to blast against.

in_cluster_with

usage: @pegs = $fig->in_cluster_with($peg)

Returns the set of pegs that are thought to be clustered with $peg (on the
chromosome).

add_chromosomal_clusters

usage: $fig->add_chromosomal_clusters($file)

The given file is supposed to contain one predicted chromosomal cluster per line (either
comma or tab separated pegs). These will be added (to the extent they are new) to those
already in $FIG_Config::global/chromosomal_clusters.

in_pch_pin_with

usage: $fig->in_pch_pin_with($peg)

Returns the set of pegs that are believed to be "pinned" to $peg (in the
sense that PCHs occur containing these pegs over significant phylogenetic
distances).

add_pch_pins

usage: $fig->add_pch_pins($file)

The given file is supposed to contain one set of pinned pegs per line (either
comma or tab separated pegs). These will be added (to the extent they are new) to those
already in $FIG_Config::global/pch_pins.

add_annotation

my $okFlag = $fig->add_annotation($fid, $user, $annotation, $time_made);

Add an annotation to a feature.

fid
ID of the feature to be annotated.

user
Name of the user making the annotation.