Data sets for "The landscape of microbial phenotypic traits and associated genes". Maria Brbic, Matija Piskorec, Vedrana Vidulin, Anita Krisko, Tomislav Smuc, Fran Supek. Nucleic Acids Research (2016). doi:10.1093/nar/gkw964

ProTraits_precisionScores.txt contains predictions of 424 phenotypic traits for 3,046 bacterial and archaeal species. The table contains precision scores (equivalent to 1-FDR) obtained from 11 individual text mining and comparative genomics data sources, each of which was used to train Support Vector Machine (text) or Random Forest (genomics) classifiers. The precision scores were obtained by calibrating the classifier confidence scores using precision-recall curves obtained in cross-validation; see Methods in our publication for details.

Precision is provided separately for the positive (+) and the negative (-) class of a phenotypic trait. These scores need not add up to 1.0, since separate precision-recall curves were used to calibrate the scores for the positive and the negative phenotype for each trait. For instance, in cases where there is substantial uncertainty about the prediction, both the (+) and the (-) score reported will be low.

These are the precision scores browsable on http://protraits.irb.hr/ (the web site reports only the prediction for the minority class of a given phenotype). We validated these scores by extensive manual curation; please see the Brbic et al. publication for details.

The last two columns provide an integrated score obtained using the 'two-votes' scheme, meaning that two independent classifiers must support the given inference at that level of confidence. We recommend these scores for general use, based on their high coverage (~308,000 predictions) and excellent support in validation data (actual precision of the 11 data sources was 0.911-0.934 at nominal precision ≥ 90%).
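As an illustration of the two-votes idea, the sketch below takes the per-source precision scores for one organism, one trait and one class, and returns the highest precision level that at least two sources reach, which is simply the second-highest score. This is one reading of the description above, not necessarily the exact procedure used to build the file; the function name `two_votes_score` and the use of `None` for a source with no prediction are assumptions made for this example. See Methods in the publication for the authoritative definition.

```python
def two_votes_score(source_scores):
    """Highest precision level supported by at least two data sources.

    source_scores: iterable of precision scores in [0, 1] for a single
    organism/trait/class combination; None marks a source with no
    prediction (an assumption of this sketch, not the file's encoding).
    Returns None when fewer than two sources provide a score.
    """
    available = sorted(
        (s for s in source_scores if s is not None), reverse=True
    )
    if len(available) < 2:
        return None
    # The second-highest score is the largest level p such that
    # at least two sources have precision >= p.
    return available[1]


# Hypothetical scores from four sources, one of them missing.
integrated = two_votes_score([0.95, 0.60, 0.92, None])
```

With the hypothetical scores above, two sources reach 0.92, so `integrated` is 0.92 even though one source alone reaches 0.95.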

ProTraits_adjustedWaldConfInt.txt - same as above, but reports precision scores and their 95% confidence intervals. These are obtained by applying the adjusted Wald method (Agresti & Coull, 1998) to the appropriate cut-off points in the cross-validation precision-recall curves. The Wilson point estimate of the precision score is provided here.
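For readers unfamiliar with the adjusted Wald method, the following is a minimal sketch of the standard Agresti-Coull calculation for a binomial proportion. It assumes that precision at a cut-off is treated as the number of correct predictions out of all predictions at or above that cut-off; the function and argument names are hypothetical, and the file itself may have been produced with different software or rounding.

```python
import math


def adjusted_wald_interval(correct, total, z=1.959964):
    """Adjusted Wald (Agresti-Coull) interval for a proportion.

    correct: number of correct predictions at the cut-off (assumed input)
    total:   number of predictions at the cut-off (assumed input)
    z:       standard normal quantile; 1.959964 gives a 95% interval
    Returns (point_estimate, lower, upper), where the point estimate is
    the centre of the Wilson interval.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    n_adj = total + z ** 2
    p_adj = (correct + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1], since the unclipped interval can overshoot.
    return p_adj, max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical cut-off with 90 correct predictions out of 100.
estimate, lower, upper = adjusted_wald_interval(90, 100)
```

Note that the point estimate is pulled slightly towards 0.5 relative to the raw proportion, which is why the precision values in this file can differ a little from the raw cross-validation precision at the same cut-off.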

ProTraits_binaryIntegratedPr0.90.txt - a convenient tab-separated table with binarized predictions, requiring precision ≥ 0.9 (equivalent to FDR ≤ 10%) using only the integrated score (obtained via the two-votes scheme, as above). The value "1" denotes that a positive label was assigned to that phenotypic trait, while "0" denotes that a negative label was assigned. A "?" denotes that neither positive nor negative label could be assigned at precision ≥ 0.9.

In the extremely rare cases where both precision scores were greater than 0.9, the value of the class with the higher precision was assigned; in the case of ties, the value of the minority class was assigned.
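The binarization rule described above can be sketched as a small function. This is an illustration only: the function name, the argument names and the way the minority class is passed in are assumptions, and the comparison uses ≥ as in the file description.

```python
def binarize(pos_precision, neg_precision, minority_label, threshold=0.9):
    """Turn a pair of integrated precision scores into "1", "0" or "?".

    pos_precision, neg_precision: integrated scores for the (+) and (-) class
    minority_label: "1" or "0", the rarer class for this trait; it must be
        supplied by the caller (assumption of this sketch)
    threshold: required precision, e.g. 0.9 or 0.95
    """
    pos_ok = pos_precision >= threshold
    neg_ok = neg_precision >= threshold
    if pos_ok and neg_ok:
        # Rare conflict: prefer the class with the higher precision,
        # and fall back to the minority class on an exact tie.
        if pos_precision > neg_precision:
            return "1"
        if neg_precision > pos_precision:
            return "0"
        return minority_label
    if pos_ok:
        return "1"
    if neg_ok:
        return "0"
    return "?"


# Hypothetical score pairs.
labels = [
    binarize(0.95, 0.20, minority_label="1"),
    binarize(0.30, 0.93, minority_label="1"),
    binarize(0.50, 0.50, minority_label="1"),
]
```

Passing `threshold=0.95` to the same function gives the stricter rule, under which a score of 0.93 no longer yields a label.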

ProTraits_binaryIntegratedPr0.95.txt - as above, but requires a more stringent threshold of precision ≥ 0.95 (FDR ≤ 5%).
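A minimal sketch of reading one of the binarized tables with the Python standard library follows. The column layout is an assumption, since it is not specified here: the sketch expects a header row, an organism identifier in the first column and one column per trait. The sample rows and the names `organism`, `trait_a`, `trait_b`, `sp1` and `sp2` are made up for illustration; check the actual header of the file before relying on this.

```python
import csv
import io


def read_binary_table(handle):
    """Parse a tab-separated table of "1" / "0" / "?" calls.

    Assumed layout (hypothetical): header row, first column holds the
    organism identifier, every other column is one trait.
    Returns {organism: {trait: True | False | None}}.
    """
    decode = {"1": True, "0": False, "?": None}
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    traits = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # A KeyError here signals a value other than 1, 0 or ?.
        table[row[0]] = {t: decode[v.strip()] for t, v in zip(traits, row[1:])}
    return table


# In-memory stand-in for the real file; the contents are invented.
sample = io.StringIO("organism\ttrait_a\ttrait_b\nsp1\t1\t?\nsp2\t0\t1\n")
calls = read_binary_table(sample)
```

For the real data, the `io.StringIO` stand-in would be replaced by an open file handle, for example `open("ProTraits_binaryIntegratedPr0.90.txt", newline="")`. Mapping "?" to `None` keeps unassigned labels distinct from negative ones.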

In addition to the above files with phenotype predictions, please see other data sets provided in the Supplementary Material of Brbic et al.:

Table S1. Supporting information regarding the curation of the known phenotype labels from existing databases.
Table S2. Discovery of phenotypic concepts from free text using non-negative matrix factorization (NMF).
Table S3. Accuracy of phenotype prediction from text and genomic data sources.
Table S4. The sets of comparative genomics features with positive Random Forest feature importance scores, broken down by individual phenotypic traits.
Table S5. Detailed statistics describing the validation of the inferred phenotypes via literature searches by two curators.
Table S6. Gene-trait associations detected after controlling for confounders (phylogenetic relatedness, genome size and G+C content) and their enrichment in Gene Ontology functional categories.