Knowledge discovery in variant databases using inductive logic programming

Bioinform Biol Insights. 2013 Mar 18;7:119-31. doi: 10.4137/BBI.S11184. Print 2013.

Hoan Nguyen¹, Tien-Dao Luu, Olivier Poch, Julie D Thompson

PMID: 23589683
PMCID: PMC3615990
DOI: 10.4137/BBI.S11184

Hoan Nguyen et al. Bioinform Biol Insights. 2013.

Affiliation

¹ Laboratoire de Bioinformatique et Génomique Intégratives, Institut de Génétique et de Biologie Moléculaire et Cellulaire Illkirch, France.

Abstract

Understanding the effects of genetic variation on the phenotype of an individual is a major goal of biomedical research, especially for the development of diagnostics and effective therapeutic solutions. In this work, we describe the use of a recent knowledge discovery from databases (KDD) approach, based on inductive logic programming (ILP), to automatically extract knowledge about human monogenic diseases. We extracted background knowledge from MSV3d, a database of all human missense variants mapped to 3D protein structures. We identified 8,117 mutations in 805 proteins with known three-dimensional structures that are known to be involved in human monogenic disease. Our results help to improve our understanding of the relationships between structural, functional or evolutionary features and deleterious mutations. Our inferred rules can also be applied to predict the impact of any single amino acid replacement on the function of a protein. The interpretable rules are available at http://decrypthon.igbmc.fr/kd4v/.

Keywords: SNP prediction; genotype-phenotype relation; human monogenic disease; inductive logic programming.

Figures

**Figure 1**
Main steps of an ILP application: (i) mutation selection from MSV3d, (ii) definition of negative/positive examples in the training set, (iii) background knowledge creation, (iv) selection of the ILP system, (v) selection of the ILP parameters (number of nodes, noise, etc.) and optimization of the predicates in the background knowledge, (vi) model evaluation using K-fold cross validation, and (vii) the final rules used for interpretation.
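Step (vi) can be illustrated with a minimal sketch of K-fold splitting. This is not the authors' evaluation code: the function name `k_fold_splits`, the default `k=10`, the fixed seed and the unstratified shuffling are all assumptions made for illustration, since the caption does not state the value of K or how the folds were built.

```python
import random


def k_fold_splits(examples, k=10, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Hypothetical helper: k, seed and plain (unstratified) shuffling
    are illustrative choices, not taken from the paper.
    """
    shuffled = list(examples)
    if not 2 <= k <= len(shuffled):
        raise ValueError("k must be between 2 and the number of examples")
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


# Toy usage: 20 placeholder example identifiers, 5 folds.
for train, test in k_fold_splits(range(20), k=5):
    print(len(train), len(test))
```

Each example appears in exactly one test fold, so every mutation is used once for evaluation and k-1 times for rule induction.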

**Figure 2**
Definition of neighbouring residues. **Notes:** For the mutated residue, Asn180 of protein Q13496, a sphere of radius 10 Å is drawn with the residue at the centre. Any residues that lie within the sphere are defined as neighbours.
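The neighbour definition amounts to a distance cutoff, which can be sketched as follows. The caption does not say which atom represents each residue, so this sketch assumes one representative coordinate per residue (for example the C-alpha atom); the function name `neighbours_of`, the residue labels other than Asn180 and all coordinates are hypothetical.

```python
import math


def neighbours_of(target, coords, radius=10.0):
    """Return residues within `radius` angstroms of `target`.

    `coords` maps a residue label to one representative (x, y, z)
    position; using a single point per residue is an assumption.
    """
    centre = coords[target]
    return sorted(
        name
        for name, xyz in coords.items()
        if name != target and math.dist(centre, xyz) <= radius
    )


# Toy coordinates (hypothetical, in angstroms).
coords = {
    "Asn180": (0.0, 0.0, 0.0),
    "Res_A": (3.0, 4.0, 0.0),    # 5 angstroms away: inside the sphere
    "Res_B": (0.0, 0.0, 10.5),   # 10.5 angstroms away: outside
}
print(neighbours_of("Asn180", coords))
```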

**Figure 3**
Mutation data model. **Notes:** Each missense mutation is characterised by physico-chemical features (size, charge, polarity, hydrophobicity, etc.), evolutionary information and 3D structural features. In addition, it may have one or more neighbouring residues, each of which belongs to a single class, based on Koolman’s classification.

**Figure 4**
Construction of background knowledge from MSV3d. **Notes:** Each mutation in the database is identified by a unique identifier ‘id’ and the values of each. Modeh defines the head of a hypothesised clause, while Modeb declares the predicates that can occur in the body of a hypothesised clause. The asterisk * in the mode declarations indicates that the corresponding predicate can be called many times during the construction of a hypothesised clause.

**Figure 5**
Part of a screenshot with four induced rules obtained using Aleph with noise = 0.5%, minpos = 5, nodes = 50,000. **Notes:** Users can click on the + icon to see the covered examples. The keyword “sub_family_conservation” was used as a filter in this screenshot.
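The notion of "covered examples" can be sketched by treating a rule body as a conjunction of conditions that must all hold for a mutation. This is only an illustration and not Aleph itself: the rule, the feature names (`conservation`, `charge_change`) and the example records are hypothetical, and the `minpos` check shown here reflects the Aleph setting that places a lower bound on the number of positive examples a clause must cover.

```python
def covers(rule, mutation):
    """A rule covers a mutation when every condition in its body holds."""
    return all(condition(mutation) for condition in rule)


def covered_examples(rule, examples):
    """Return the ids of the examples covered by the rule."""
    return [ex["id"] for ex in examples if covers(rule, ex)]


# Hypothetical rule body: two conditions over made-up features.
rule = [
    lambda m: m["conservation"] == "high",
    lambda m: m["charge_change"],
]

# Hypothetical positive (deleterious) examples.
positives = [
    {"id": "m1", "conservation": "high", "charge_change": True},
    {"id": "m2", "conservation": "low", "charge_change": True},
    {"id": "m3", "conservation": "high", "charge_change": False},
]

covered = covered_examples(rule, positives)
print(covered)

# A clause is only kept if it covers at least `minpos` positives.
minpos = 5
print(len(covered) >= minpos)
```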

**Figure 6**
Part of the clustering of the full set of 173 generated rules. **Notes:** We performed rule alignment on each subfamily (indicated by red rectangles in the dendrogram). Two interesting rules are indicated by (*).
