Data used in the paper "A Systematic Study of Genome Context Methods"


Gold Standard

Each file contains a list of gene pairs from a certain genome from version 13.1 of BioCyc, along with a label of "1" if the pair is considered to be functionally related or "-1" otherwise. A pair of genes is considered to be functionally related if the products of the two genes catalyze reactions in the same metabolic pathway, belong to the same protein complex, or take part in the same signaling pathway, as found in version 13.1 of BioCyc.
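As an illustration, a gold standard file could be read with a short Python sketch like the one below. The exact file layout is not described here, so this assumes one pair per line with three whitespace-separated fields (first gene ID, second gene ID, label) inside a gzip-compressed text file. The function names and the gene IDs in the usage note are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str, int]


def parse_gold_standard(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield (gene_a, gene_b, label) tuples.

    Assumption: each line has the form 'GENE_A GENE_B LABEL',
    where LABEL is "1" (functionally related) or "-1" (otherwise).
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        gene_a, gene_b, label = fields
        if label not in ("1", "-1"):
            continue  # skip header lines or unexpected labels
        yield gene_a, gene_b, int(label)


def read_gold_standard(path: str) -> Iterator[Pair]:
    """Read a gzip-compressed gold standard file (assumed to be plain text inside)."""
    with gzip.open(path, "rt") as handle:
        yield from parse_gold_standard(handle)
```

For example, `list(parse_gold_standard(["GENE-A GENE-B 1", "GENE-A GENE-C -1"]))` gives two tuples, one with label 1 and one with label -1. If the real files use a different delimiter or column order, only the `split` and unpacking lines need to change.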

The full set of samples contains all pairs of genes in the corresponding genome. This set is not used in the paper to report results because of the likely high ratio of mislabeled negative samples. The known-function set is designed to discard all genes for which no knowledge of the function is available. These genes would result in mislabeled negative examples: because their function is not known, a positive label would never be assigned to any pair involving them. The sm-enzyme set of samples contains pairs only for genes whose product is an enzyme in a small-molecule reaction.
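The relationship between the full set and the two restricted sets can be pictured as a filter over gene pairs. The sketch below is not the procedure used to build the files. It assumes that discarding a gene means dropping every pair that involves it, and that a set of retained gene IDs (genes with known function, or genes whose product is a small-molecule enzyme) is already available. The name `restrict_pairs` and the gene IDs are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str, int]  # (gene_a, gene_b, label)


def restrict_pairs(pairs: Iterable[Pair], kept_genes: Set[str]) -> List[Pair]:
    """Keep only the pairs in which both genes belong to kept_genes."""
    return [
        (gene_a, gene_b, label)
        for gene_a, gene_b, label in pairs
        if gene_a in kept_genes and gene_b in kept_genes
    ]


# Hypothetical example: GENE-D has no known function, so its pair is dropped.
full_set = [
    ("GENE-A", "GENE-B", 1),
    ("GENE-A", "GENE-C", -1),
    ("GENE-B", "GENE-D", -1),
]
known_function = {"GENE-A", "GENE-B", "GENE-C"}
known_function_set = restrict_pairs(full_set, known_function)
```

In this toy example the pair involving GENE-D is removed, which is the kind of possibly mislabeled negative that the restricted sets are meant to avoid.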

The gene IDs used in these files are the BioCyc gene IDs. For Escherichia coli K-12, the UniProt IDs for the products of the genes are given in a separate file. This file lists the EcoCyc gene ID, the EcoCyc ID of its product, and the UniProt ID of this product (if any). A gene might have more than one product, in which case the file contains more than one line for that gene.
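Because a gene can appear on several lines of the mapping file, a one-to-many structure is a natural way to load it. The following sketch assumes tab-separated columns in the order given above (gene ID, product ID, UniProt ID) and assumes that a missing UniProt ID shows up as an empty or absent third column. Neither detail is confirmed here, and the IDs in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Product = Tuple[str, Optional[str]]  # (product_id, uniprot_id or None)


def parse_mapping(lines: Iterable[str]) -> Dict[str, List[Product]]:
    """Map each gene ID to the list of its products and their UniProt IDs."""
    mapping: Dict[str, List[Product]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        gene_id, product_id = fields[0], fields[1]
        # The UniProt column is optional: treat empty or missing as None.
        uniprot_id = fields[2].strip() if len(fields) > 2 else ""
        mapping[gene_id].append((product_id, uniprot_id or None))
    return dict(mapping)
```

With the hypothetical input lines `"GENE-A\tPROD-1\tQ00001"` and `"GENE-A\tPROD-2"`, the result maps `"GENE-A"` to two products, the second of which has no UniProt ID.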

The full database for each organism is available on the BioCyc download page.


Organism                                      Full Set  Known Function Set  Sm Enzyme Set  Mapping
Escherichia coli K-12 substr. MG1655          gz        gz                  gz             txt
Escherichia coli O157:H7 EDL933               gz        gz                  gz
Escherichia coli CFT073                       gz        gz                  gz
Shigella flexneri 2a str. 2457T               gz        gz                  gz
Vibrio cholerae O1 biovar El Tor str. N16961  gz        gz                  gz
Caulobacter crescentus CB15                   gz        gz                  gz
Mycobacterium tuberculosis CDC1551            gz        gz                  gz
Mycobacterium tuberculosis H37Rv              gz        gz                  gz
Francisella tularensis tularensis SCHU S4     gz        gz                  gz
Helicobacter pylori 26695                     gz        gz                  gz

Lists of reference genomes

The following are the lists of reference genomes used in the paper. Each list contains the BioCyc IDs of the genomes and their full names.

List of size 51
List of size 72
List of size 114
List of size 166
List of size 216
List of size 279
List of size 343
List of size 398
List of size 460 (full list)