Note:
The following terms are synonymous in the context of data files: field, column, attribute, and slot. Most data file slots correspond one-to-one with a PGDB slot; their semantics are explained in the "Guide to the Pathway Tools Schema", which appears as a file entitled "Pathway-Tools-Schema.pdf" in the data file download directory. The following data file slots are not described in the Pathway Tools Schema:
| Format | Description |
|---|---|
| Tabular | Each tabular file contains data for one class of objects, such as reactions or pathways. This type of file contains a single table of tab-delimited columns and newline-delimited rows. The first row contains headers that describe the data beneath them. Each of the remaining rows represents an object, and each column is an attribute of the object. Column names that would otherwise be the same contain a number x having values 1, 2, 3, etc. to distinguish them. Comment lines can be anywhere in the file and must begin with the following symbol: # |
| Attribute-Value | Each attribute-value file contains data for one class of objects, such as genes or proteins. A file is divided into entries, where one entry describes one database object. An entry consists of a set of attribute-value pairs, which describe properties of the object and relationships of the object to other objects. Each attribute-value pair typically resides on a single line of the file, although a value that is a long string may span multiple lines. An attribute-value pair consists of an attribute name, followed by the string " - " and a value, for example: LEFT - NADP. A value that requires more than one line is continued by a newline followed by a /. Thus, literal slashes at the beginning of a line must be escaped as //. A line that contains only // separates objects. Comment lines can be anywhere in the file and must begin with the following symbol: #. Starting in version 6.5 of Pathway Tools, attribute-value files can also contain annotation-value pairs. Annotations are a mechanism for attaching labeled values to specific attribute values. For example, we might want to specify a coefficient for a reactant in a chemical reaction. An annotation refers to the attribute value that immediately precedes it. An annotation-value pair consists of a caret symbol "^" (which points upward to indicate that the annotation annotates the preceding attribute value), followed by the annotation label, followed by the string " - ", followed by a value. The same attribute name or annotation label with different values can appear any number of times in an object. An example annotation-value pair that refers to the preceding attribute-value pair is: LEFT - NADP |
| FASTA | Each object (either a polypeptide or a polynucleotide) in the file begins with a line that begins with the following symbol: >. On the same line as the > is a comment describing the object. The remaining lines contain the object's sequence. The sequence is typically broken into multiple lines, each of which must have the same arbitrary length, except the last line, which may be shorter. NCBI describes the FASTA format in more detail. |
| Ocelot | The file contains a Lisp format dump of the entire database. |
| BioPAX | The files contain Biological Pathway Exchange (BioPAX) Level 2 and Level 3 dumps of the database. |
| SBML | The file contains an SBML dump of the reaction network within the database. |

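The attribute-value rules in the table above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the reader used by Pathway Tools: the function name `parse_entries` and the returned structure are illustrative, the `COEFFICIENT` label and the sample identifiers are hypothetical, continuation lines are joined with a newline, and lines without a " - " separator are skipped.

```python
def parse_entries(lines):
    """Yield one dict per entry: {attribute: [{"value": str, "annotations": {label: [str]}}]}."""
    entry = {}      # attributes collected for the current object
    record = None   # most recent attribute value, the target of "^" annotations
    cont = None     # (holder, key) of the string that a "/" line continues
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line == "//":                  # a line holding only // separates objects
            if entry:
                yield entry
            entry, record, cont = {}, None, None
        elif line.startswith("/"):        # continuation; "//text" carries a literal "/text"
            if cont is not None:
                holder, key = cont
                holder[key] += "\n" + line[1:]
        else:
            name, sep, value = line.partition(" - ")
            if not sep:
                continue                  # assumption: ignore lines that are not pairs
            if name.startswith("^"):      # annotation on the preceding attribute value
                if record is None:
                    continue
                values = record["annotations"].setdefault(name[1:], [])
                values.append(value)
                cont = (values, len(values) - 1)
            else:
                record = {"value": value, "annotations": {}}
                entry.setdefault(name, []).append(record)
                cont = (record, "value")
    if entry:                             # tolerate a missing final separator
        yield entry


# Hypothetical sample text, not taken from a real PGDB.
sample = """\
# a comment line
UNIQUE-ID - RXN-EXAMPLE
LEFT - NADP
^COEFFICIENT - 2
COMMENT - first line
/second line
//
UNIQUE-ID - RXN-OTHER
//
"""

for obj in parse_entries(sample.splitlines()):
    print(obj["UNIQUE-ID"][0]["value"], sorted(obj))
```

Because the same attribute name can appear any number of times in an object, each attribute maps to a list of values rather than to a single value.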
| File Name | Brief Description | Format |
|---|---|---|
| enzymes.col | Enzymatic reactions and enzymes | Tabular |
| genes.col | Genes | Tabular |
| pathways.col | Pathways and the genes that encode enzymes in each pathway | Tabular |
| protcplxs.col | Protein complexes and the genes that encode each subunit in a complex | Tabular |
| transporters.col | Transporters, their subunit structures, and what they transport | Tabular |
| func-associations.col | Various functional associations between genes (not available for most organisms) | Tabular |
| bindrxns.dat | Binding reactions between proteins and DNA sites | Attribute-Value |
| classes.dat | PGDB classes and their relationships | Attribute-Value |
| compounds.dat | Chemical compounds | Attribute-Value |
| dnabindsites.dat | DNA binding sites | Attribute-Value |
| enzrxns.dat | Enzymatic reactions | Attribute-Value |
| genes.dat | Genes | Attribute-Value |
| pathways.dat | Pathways, including relationships among reactions | Attribute-Value |
| promoters.dat | Promoters | Attribute-Value |
| protein-features.dat | Protein features (for example, active sites) | Attribute-Value |
| proteins.dat | Proteins | Attribute-Value |
| protligandcplxes.dat | Complexes between proteins and small-molecule ligands | Attribute-Value |
| pubs.dat | Publications | Attribute-Value |
| reactions.dat | Chemical reactions | Attribute-Value |
| regulation.dat | Regulatory interactions of all types | Attribute-Value |
| regulons.dat | Transcription factors | Attribute-Value |
| species.dat | List of all species (this file appears only in the MetaCyc DB) | Attribute-Value |
| terminators.dat | Terminators | Attribute-Value |
| transunits.dat | Transcription units | Attribute-Value |
| protseq.fsa | Protein sequences (renamed from protseq.fasta in version 13.5) | FASTA |
| dnaseq.fsa | Nucleotide sequences for all genes (new in version 13.5) | FASTA |
| [xxx]base.ocelot | Entire database | Ocelot |
| biopax-level2.owl | All pathways, reactions, etc. that can be represented in BioPAX level 2 format | BioPAX |
| biopax-level3.owl | All pathways, reactions, etc. that can be represented in BioPAX level 3 format | BioPAX |

| File Name | Description |
|---|---|
| enzymes.col | For each enzymatic reaction in the PGDB, the file lists the reaction equation, up to 4 pathways that contain the reaction, up to 4 cofactors for the enzyme, up to 4 activators, up to 4 inhibitors, and the subunit structure of the enzyme. Columns (multiple columns are indicated in parentheses): |
| genes.col | For each gene in the PGDB, the file lists its names (including up to 4 synonyms), location, product, and up to 4 parent classes (types). Note: Gene Type is a class in the gene ontology designed by Dr. M. Riley. Columns (multiple columns are indicated in parentheses): |
| pathways.col | For each pathway in the PGDB, the file lists the genes that encode the enzymes in that pathway. Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all pathways in the PGDB): |
| protcplxs.col | For each protein complex in the PGDB, the file lists the genes that encode the subunits of the complex. Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all protein complexes in the PGDB): |
| transporters.col | For each transporter in the PGDB, the file lists the transport reaction equation and the transporter's subunit composition. Columns: |
| func-associations.col | This file contains all functional associations among genes in the EcoCyc Pathway/Genome Database. Three types of functional associations are included, in sections separated by two rows that each start with a '#': pathway functional associations, protein complex functional associations, and transcription factor/regulated gene pairs. |

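The tabular (.col) layout described earlier (tab-delimited columns, a header row, and comment lines that begin with #) can be read with Python's standard `csv` module. The sketch below is illustrative only: the function name `read_col_file` and the sample column names are hypothetical, and quote characters are assumed to carry no special meaning in these files.

```python
import csv


def read_col_file(lines):
    """Return a list of dicts, one per data row, keyed by the header row."""
    # Drop comment lines and blank lines before handing the rest to csv.
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample; real files have their own column names.
sample = [
    "# comment lines may appear anywhere",
    "ID\tNAME\tSYNONYM-1\tSYNONYM-2",
    "G1\tabc\tx\ty",
    "# another comment",
    "G2\tdef\t\t",
]

for row in read_col_file(sample):
    print(row["ID"], row["NAME"], row["SYNONYM-1"])
```

For a file such as func-associations.col, whose sections are separated by comment rows, this sketch would merge all sections into one list; splitting on the comment rows first would keep them apart.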
The meanings of the attributes are explained in the chapter "Guide to the Pathway Tools Schema" in the Pathway Tools User's Guide, which is available as part of the Pathway Tools software distribution.

| File Name | Description |
|---|---|
| classes.dat | For each class, the file lists its names and its parent classes (types). This file covers every class in the Pathway Tools ontology. Attributes: |
| bindrxns.dat | This file lists binding reactions between proteins and DNA binding sites such as promoters. Attributes: |
| compounds.dat | This file lists all chemical compounds in the PGDB. Attributes: |
| dnabindsites.dat | This file lists all DNA binding sites in the PGDB. Attributes: |
| enzrxns.dat | This file lists all enzymatic reactions in the PGDB. Attributes: |
| genes.dat | This file lists all genes in the PGDB. Attributes: |
| pathways.dat | This file lists all pathways in the PGDB. Attributes: |
| promoters.dat | This file lists all promoters in the PGDB. Attributes: |
| protein-features.dat | This file lists all the protein features (such as active sites) in the PGDB. Attributes: |
| proteins.dat | This file lists all proteins in the PGDB. Attributes: |
| protligandcplxes.dat | This file lists all the complexes of proteins with small-molecule ligands in the PGDB. Attributes: |
| pubs.dat | This file lists all non-PubMed publications referenced in the PGDB. Attributes: |
| reactions.dat | This file lists all chemical reactions in the PGDB. Attributes: |
| regulation.dat | This file lists all the regulatory relationships in the PGDB. Attributes: |
| regulons.dat | This file lists all transcription factors in the PGDB and the genes that they regulate by binding upstream of the transcription unit containing those genes. Attributes: |
| terminators.dat | This file lists all terminators in the PGDB. Attributes: |
| transunits.dat | This file lists all transcription units in the PGDB. Attributes: |
| protseq.fsa | This file lists the amino acid sequence of each protein monomer in the PGDB. (Renamed from protseq.fasta in version 13.5.) |
| dnaseq.fsa | This file lists the DNA nucleotide sequence of each gene in the PGDB. (New in version 13.5.) Includes RNAs. The extent of each sequence is the coding region, on its coding strand. |

Many classes of objects (e.g., enzymatic reactions, pathways, transcription units, and promoters) can have associated evidence codes, which are defined in the Pathway Tools Evidence Ontology. There are four top-level codes. All experimental evidence codes start with EV-EXP, and all computational evidence codes start with EV-COMP. The other top-level codes are EV-AS (author statement) and EV-IC (inferred by curator).
Because evidence codes are typically associated with citations, they are stored in the CITATIONS slot. Values of the CITATIONS slot can take the following formats:
The CITATIONS slot is included in most of the attribute-value files. Evidence for enzyme or transporter function is found in enzrxns.dat. Evidence for non-enzymatic function of a protein is found in proteins.dat. Examples of CITATIONS lines include:
To strip out objects with only computational evidence, search for all the top-level code prefixes in the CITATIONS line. Any given object may have both experimental and computational evidence, so it is not sufficient just to look for the EV-COMP tag; you must also make sure that the object lacks an EV-EXP tag. Some objects may have no attached evidence codes. In EcoCyc, it is probably safe to assume that these are experimentally determined, as the assignments probably predated our use of evidence codes and were added at a time when EcoCyc contained only experimentally determined data. In other PGDBs, of course, that assumption does not hold.
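The check described above can be sketched as a pair of small Python functions. This is only an illustration: the function names are hypothetical, the sample CITATIONS values are made up, and the code assumes nothing about the value format beyond the evidence code appearing somewhere in the string.

```python
TOP_LEVEL_CODES = ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC")


def evidence_kinds(citations):
    """Return the set of top-level evidence prefixes found in CITATIONS values."""
    return {code for code in TOP_LEVEL_CODES
            if any(code in value for value in citations)}


def is_computational_only(citations):
    """True when EV-COMP is the only top-level code present."""
    return evidence_kinds(citations) == {"EV-COMP"}


# Hypothetical CITATIONS values; real values may carry more fields.
print(is_computational_only(["REF1:EV-COMP-X"]))                    # True
print(is_computational_only(["REF1:EV-COMP-X", "REF2:EV-EXP-Y"]))   # False
print(evidence_kinds([]))                                           # set()
```

An object with no evidence codes yields an empty set, so it can be handled separately, for example by treating it as experimental in EcoCyc only.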
For example, to find all EcoCyc pathways with experimental evidence, you would look in the pathways.dat file for all pathways that have at least one CITATIONS line with an EV-EXP tag (you would also have to decide whether or not to accept EV-AS and EV-IC tags). To find all transcription units with experimental evidence, you would do the same with the transunits.dat file. However, you should bear in mind that not all experimental evidence is of equal value. In particular, many transcription units are predicted based on high-throughput gene expression analysis which, although experimental in nature, is not generally considered high-quality. Before extracting your data, consider carefully which evidence codes you wish to include or exclude.
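As a rough sketch of that search, the function below scans the lines of an attribute-value file and keeps the identifiers of entries that have at least one CITATIONS line containing an accepted tag. The function name `ids_with_evidence` is hypothetical, the sample lines are made up, and the code assumes that each entry names its object with a UNIQUE-ID attribute, which is not stated above.

```python
def ids_with_evidence(lines, accepted=("EV-EXP",)):
    """Return IDs of entries with a CITATIONS line containing any accepted tag."""
    kept = []
    unique_id, keep = None, False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "//":                          # end of the current entry
            if keep and unique_id is not None:
                kept.append(unique_id)
            unique_id, keep = None, False
        elif line.startswith("UNIQUE-ID - "):     # assumed identifier attribute
            unique_id = line[len("UNIQUE-ID - "):]
        elif line.startswith("CITATIONS - "):
            if any(tag in line for tag in accepted):
                keep = True
    if keep and unique_id is not None:            # entry without a final separator
        kept.append(unique_id)
    return kept


# Hypothetical sample lines in the attribute-value layout.
sample = [
    "UNIQUE-ID - PWY-A", "CITATIONS - R1:EV-EXP-X", "//",
    "UNIQUE-ID - PWY-B", "CITATIONS - R2:EV-COMP-X", "//",
    "UNIQUE-ID - PWY-C", "CITATIONS - R3:EV-AS", "//",
]
print(ids_with_evidence(sample))
print(ids_with_evidence(sample, accepted=("EV-EXP", "EV-AS", "EV-IC")))
```

The `accepted` parameter makes the decision about EV-AS and EV-IC explicit. Narrowing it to specific sub-codes is one way to exclude lower-quality experimental evidence.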
To find all EcoCyc genes with experimentally determined functions, the situation is more complicated still, as the evidence codes are not assigned to the genes themselves. To determine whether a particular gene has an experimentally determined function, you will need to look at its product (identified by the PRODUCT attribute) or a complex that includes the product (identified by the COMPONENT-OF attribute on the polypeptide). An evidence code may be attached directly to the protein in the proteins.dat file. However, if the gene codes for an enzyme, the evidence code will instead be found attached to the enzymatic-reaction entry or entries (identified by the CATALYZES attribute on the protein) in the enzrxns.dat file. If the gene is a transcription factor, the evidence code will be found in the corresponding entries (identified by the REGULATES attribute) in the regulation.dat file. Thus, you may need to follow several links and use several files to find the full list of evidence codes for any given gene. To extract all genes with experimental evidence, you might find it easiest to work backwards -- extract all objects from enzrxns.dat, regulation.dat and proteins.dat with experimental evidence codes, and follow them backwards (via the ENZYME, REGULATOR, COMPONENTS and GENE attributes) to get the list of relevant genes.
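The backward traversal can be sketched as follows, assuming the three files have already been parsed into dictionaries that map each object ID to a dictionary of attribute name to list of values. The function name, that in-memory layout, and all sample IDs are hypothetical; only the attribute names ENZYME, REGULATOR, COMPONENTS, GENE, and CITATIONS come from the text above.

```python
def genes_with_experimental_evidence(enzrxns, regulation, proteins, tag="EV-EXP"):
    """Follow ENZYME, REGULATOR, COMPONENTS and GENE links back to gene IDs."""
    def has_tag(entry):
        return any(tag in value for value in entry.get("CITATIONS", []))

    # Seed with proteins that carry the evidence directly or through a
    # tagged enzymatic reaction or regulatory interaction.
    pending = [pid for pid, entry in proteins.items() if has_tag(entry)]
    for entry in enzrxns.values():
        if has_tag(entry):
            pending.extend(entry.get("ENZYME", []))
    for entry in regulation.values():
        if has_tag(entry):
            pending.extend(entry.get("REGULATOR", []))

    genes, seen = set(), set()
    while pending:
        pid = pending.pop()
        if pid in seen or pid not in proteins:
            continue                      # already visited, or not a protein
        seen.add(pid)
        entry = proteins[pid]
        genes.update(entry.get("GENE", []))
        pending.extend(entry.get("COMPONENTS", []))   # descend into complexes
    return genes


# Hypothetical parsed data.
proteins = {
    "CPLX1": {"COMPONENTS": ["MONOMER1", "MONOMER2"]},
    "MONOMER1": {"GENE": ["G1"]},
    "MONOMER2": {"GENE": ["G2"]},
    "MONOMER3": {"GENE": ["G3"], "CITATIONS": ["R:EV-COMP-X"]},
    "TF1": {"GENE": ["G4"]},
}
enzrxns = {"ENZRXN1": {"ENZYME": ["CPLX1"], "CITATIONS": ["R:EV-EXP-X"]}}
regulation = {"REG1": {"REGULATOR": ["TF1"], "CITATIONS": ["R:EV-EXP-Y"]}}

print(sorted(genes_with_experimental_evidence(enzrxns, regulation, proteins)))
# ['G1', 'G2', 'G4']
```

Regulators or components that are not proteins (for example, small molecules) are simply skipped because they have no entry in `proteins`.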