Bioinform Biol Insights. 2013 Mar 18;7:119-31.
doi: 10.4137/BBI.S11184. Print 2013.

Knowledge discovery in variant databases using inductive logic programming

Hoan Nguyen et al. Bioinform Biol Insights.

Abstract

Understanding the effects of genetic variation on the phenotype of an individual is a major goal of biomedical research, especially for the development of diagnostics and effective therapeutic solutions. In this work, we describe the use of a recent knowledge discovery from databases (KDD) approach, based on inductive logic programming (ILP), to automatically extract knowledge about human monogenic diseases. We extracted background knowledge from MSV3d, a database of all human missense variants mapped to 3D protein structure. In this study, we identified 8,117 mutations in 805 proteins with known three-dimensional structures that were known to be involved in human monogenic disease. Our results help to improve our understanding of the relationships between structural, functional or evolutionary features and deleterious mutations. Our inferred rules can also be applied to predict the impact of any single amino acid replacement on the function of a protein. The interpretable rules are available at http://decrypthon.igbmc.fr/kd4v/.

Keywords: SNP prediction; genotype-phenotype relation; human monogenic disease; inductive logic programming.

Figures

Figure 1
Main steps for an ILP application include: (i) mutation selection from MSV3d, (ii) definition of negative/positive examples in the training set, (iii) background knowledge creation, (iv) selection of the ILP system, (v) selection of the ILP parameters (number of nodes, noise, etc.) and optimization of the predicates in the background knowledge, (vi) model evaluation using k-fold cross-validation, and (vii) the final rules used for interpretation.
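Step (vi) above can be illustrated with a minimal sketch of how a training set might be partitioned for k-fold cross-validation. This is not the authors' evaluation code: the function name `k_fold_splits`, the shuffling seed, and the example identifiers are all hypothetical, and the caption does not state the value of k that was used.

```python
import random


def k_fold_splits(examples, k, seed=0):
    """Yield (train, test) pairs so that each example is tested exactly once.

    Assumption: examples are shuffled once with a fixed seed, then dealt
    round-robin into k folds. The paper's actual procedure is not given here.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


# Hypothetical mutation identifiers, for illustration only.
mutation_ids = [f"mut{n}" for n in range(10)]
for train, test in k_fold_splits(mutation_ids, k=5):
    print(len(train), "training examples;", len(test), "held out")
```

In an ILP setting, the rules would be induced from each `train` portion and then scored on the corresponding `test` portion.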
Figure 2
Definition of neighbouring residues. Notes: For the mutated residue, Asn180 of protein Q13496, a sphere of radius 10 Å is drawn with the residue in the centre. Any residues that lie within the sphere are defined as neighbours.
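The neighbour definition in this caption amounts to a distance test, which can be sketched as follows. The caption does not say which point represents a residue (for example an alpha carbon or the nearest atom) or whether the boundary is inclusive, so the single coordinate per residue and the `<=` comparison are assumptions, and the coordinates and residue names are made up for illustration.

```python
import math

RADIUS_ANGSTROM = 10.0  # sphere radius stated in the caption


def neighbours(centre, residues, radius=RADIUS_ANGSTROM):
    """Return the names of residues lying within `radius` of `centre`.

    Assumption: each residue is represented by one (x, y, z) point and the
    sphere boundary counts as inside.
    """
    return [name for name, xyz in residues.items()
            if math.dist(centre, xyz) <= radius]


# Made-up coordinates (in angstroms), not taken from any real structure.
mutated_residue = (0.0, 0.0, 0.0)
other_residues = {
    "Gly10": (3.0, 4.0, 0.0),
    "Leu55": (6.0, 8.0, 1.0),
    "Ser99": (0.0, 0.0, 10.0),
}
print(neighbours(mutated_residue, other_residues))
```

`math.dist` requires Python 3.8 or later.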
Figure 3
Mutation data model. Notes: Each missense mutation is characterised by physico-chemical features (size, charge, polarity, hydrophobicity, etc.), evolutionary information and 3D structural features. In addition, it may have one or more neighbouring residues, each of which belongs to a single class, based on Koolman’s classification.
Figure 4
Construction of background knowledge from MSV3d. Notes: Each mutation in the database is identified by a unique identifier ‘id’ and the values of each. Modeh defines the head of a hypothesised clause, while Modeb declares the predicates that can occur in the body of a hypothesised clause. The asterisk * in the mode declarations indicates that the corresponding predicate can be called many times during the construction of a hypothesised clause.
Figure 5
Part of a screenshot with four induced rules obtained using Aleph with noise = 0.5%, minpos = 5, nodes = 50,000. Notes: Users can click on the + icon to see the covered examples. The keyword “sub_family_conservation” was used as a filter in this screenshot.
Figure 6
Part of the clustering of the full set of 173 generated rules. Notes: We performed rule alignment on each subfamily (indicated by red rectangles in the dendrogram). Two interesting rules are indicated by (*).
