Efficient mining for structurally diverse subgraph patterns in large molecular databases

Maunz, Andreas; Helma, Christoph; Kramer, Stefan

doi:10.1007/s10994-010-5187-6

Published: 19 May 2010

Machine Learning, volume 83, pages 193–218 (2011)

Abstract

We present a new approach to large-scale graph mining based on so-called backbone refinement classes. The method efficiently mines tree-shaped subgraph descriptors under minimum frequency and significance constraints, using classes of fragments to reduce feature set size and running times. The classes are defined in terms of fragments sharing a common backbone. The method is able to optimize structural inter-feature entropy as opposed to purely occurrence-based criteria, which are characteristic of open or closed fragment mining. We first give an intuitive explanation of why backbone refinement class features lead to a set of relevant features that are suitable for classification, in particular in the area of structure-activity relationships (SARs). We then show that backbone refinement classes yield a high compression in the search space of rooted perfect binary trees. We conduct several experiments to evaluate our theoretical insights in practice: a visualization suggests low co-occurrence and high entropy of backbone refinement class features. In comparison with a class of patterns sampled from the maximal patterns previously introduced by Al Hasan et al., we find a favorable tradeoff between the structural similarity and the resources needed to compute the descriptors. Cross-validation shows that classification accuracy is similar to that of the complete set of trees but significantly better than that of open trees, while feature set size is reduced by >90% and >30% compared to complete tree mining and open tree mining, respectively. Furthermore, compared to open or closed pattern mining, a large part of the search space can be pruned due to an improved statistical constraint (dynamic upper bound adjustment). This is confirmed experimentally by running times reduced by more than 60% compared to ordinary (static) upper bound pruning.
The application of our method to the largest datasets that have been used in correlated graph mining so far indicates robustness against the minimum frequency parameter, and a cross-validation run on this data confirms that the novel descriptors render large training sets feasible, which previously might have been intractable.

A C++ implementation of the mining algorithm is available at http://www.maunz.de/libfminer-doc. Animated figures, links to datasets, and further resources are available at http://www.maunz.de/mlj-res.
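
As a rough illustration of the ordinary (static) upper bound pruning that the abstract uses as its baseline, the following Python sketch computes a chi-square score for a fragment and the bound on that score for any refinement of the fragment, in the style of Morishita and Sese (2000). It is not taken from libfminer: the function names, the argument names (`n`, `m`, `x`, `y`), and the choice of chi-square as the significance measure are assumptions made for illustration, and the dynamic upper bound adjustment itself is not described in this abstract and is therefore not shown.

```python
def chi_square(n, m, x, y):
    """Chi-square of a 2x2 table: fragment present/absent vs. active/inactive.

    n: number of compounds, m: number of active compounds,
    x: compounds containing the fragment, y: active compounds containing it.
    (All names are hypothetical, chosen for this sketch.)
    """
    if x == 0 or x == n or m == 0 or m == n:
        return 0.0  # a degenerate margin carries no class information
    observed = [y, x - y, m - y, (n - m) - (x - y)]
    expected = [x * m / n, x * (n - m) / n,
                (n - x) * m / n, (n - x) * (n - m) / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def static_upper_bound(n, m, x, y):
    """Bound on chi-square for any refinement of a fragment with counts (x, y).

    A refinement can only lose occurrences, so its counts (x', y') satisfy
    y' <= y and x' - y' <= x - y. The bound evaluates the two extreme cases:
    the refinement keeps only the active occurrences, or only the inactive ones.
    """
    return max(chi_square(n, m, y, y), chi_square(n, m, x - y, 0))


def can_prune(n, m, x, y, threshold):
    """True if no refinement of the fragment can reach the significance threshold."""
    return static_upper_bound(n, m, x, y) < threshold
```

For example, with 100 compounds of which 50 are active, a fragment occurring in 20 compounds (15 of them active) scores 6.25, while the bound for its refinements is about 17.65, so the branch would be kept for any threshold below that value and cut otherwise.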

References

Al Hasan, M., Chaoji, V., Salem, S., Besson, J., & Zaki, M. (2007). Origami: mining representative orthogonal graph patterns. In Seventh IEEE international conference on data mining (ICDM 2007) (pp. 153–162). Washington: IEEE Computer Society.
Benigni, R., & Bossa, C. (2008). Structure alerts for carcinogenicity, and the salmonella assay system: a novel insight through the chemical relational databases technology. Mutation Research/Reviews in Mutation Research, 659(3), 248–261.
Bringmann, B., Zimmermann, A., Raedt, L. D., & Nijssen, S. (2006). Don’t be afraid of simpler patterns. In Proceedings 10th PKDD (pp. 55–66). Berlin: Springer.
Chi, Y., Muntz, R. R., Nijssen, S., & Kok, J. N. (2001). Frequent subtree mining—an overview.
Helma, C. (2006). Lazy structure-activity relationships (Lazar) for the prediction of rodent carcinogenicity and salmonella mutagenicity. In Molecular diversity (pp. 147–158).
Jahn, K., & Kramer, S. (2005). Optimizing gSpan for molecular datasets. In Proceedings of the third international workshop on mining graphs, trees and sequences (MGTS-2005).
Kramer, S., De Raedt, L., & Helma, C. (2001). Molecular feature mining in HIV data. In KDD ’01: proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 136–143). New York: ACM.
Maunz, A., Helma, C., & Kramer, S. (2009). Large-scale graph mining using backbone refinement classes. In KDD ’09: proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 617–626). New York: ACM.
Morishita, S., & Sese, J. (2000). Traversing itemset lattice with statistical metric pruning. In Proceedings of the 19th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems (pp. 226–236). New York: ACM.
Nijssen, S., & Kok, J. N. (2004). A quickstart in frequent structure mining can make a difference. In KDD ’04: proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 647–652). New York: ACM.
Nijssen, S., & Kok, J. N. (2006). Frequent subgraph miners: runtimes don’t say everything. In Proceedings of the international workshop on mining and learning with graphs (MLG 2006) (pp. 173–180). Berlin, Germany.
OpenTox: A predictive toxicology framework. http://www.opentox.org. See also: Hardy, B., Douglas, N., Helma, C., et al.: Collaborative development of predictive toxicology applications. In Fifth international symposium on computational methods in toxicology and pharmacology integrating internet resources (CMTPI 2009) (to appear). London: Taylor & Francis.
Rückert, U., & Kramer, S. (2007). Optimizing feature sets for structured data. In Proceedings of the 18th European conference on machine learning (ECML07) (pp. 716–723). Berlin: Springer-Verlag.
Schulz, H., Kersting, C., & Karwath, A. ILP, the blind, and the elephant: Euclidean embedding of co-proven queries. In 19th international conference on inductive logic programming (ILP 2009). http://www.cs.kuleuven.be/dtai/ilp-mlg-srl/dokuwiki/doku.php?id=paper:ilp:33.
Székely, L., & Wang, H. (2005). On subtrees of trees. Advances in Applied Mathematics, 34(1), 138–155. doi:10.1016/j.aam.2004.07.002.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
Wörlein, M., Meinl, T., Fischer, I., & Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, ffsm, and Gaston. In Proceedings of PKDD (pp. 392–403). Berlin: Springer-Verlag.
Yan, X., & Han, J. (2002). gSpan: graph-based substructure pattern mining. In ICDM ’02: proceedings of the 2002 IEEE international conference on data mining (ICDM’02) (p. 721). Washington: IEEE Computer Society.
Yan, X., & Han, J. (2003). Closegraph: mining closed frequent graph patterns. In KDD ’03: proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 286–295). New York: ACM.

Author information

Authors and Affiliations

Machine Learning Lab, Universität Freiburg, Georges-Köhler-Allee 79, 79110, Freiburg i. Br., Germany
Andreas Maunz
in-silico Toxicology, Altkircherstr. 4, 4054, Basel, Switzerland
Christoph Helma
Institut für Informatik/I12, Boltzmannstr. 3, 85748, Garching b. München, Germany
Stefan Kramer

Corresponding author

Correspondence to Andreas Maunz.

Additional information

Editors: Hendrik Blockeel, Karsten Borgwardt, and Xifeng Yan.

This research was supported by the EU seventh framework programme under contract No. Health-F5-2008-200787 (OpenTox 2009).

About this article

Cite this article

Maunz, A., Helma, C. & Kramer, S. Efficient mining for structurally diverse subgraph patterns in large molecular databases. Mach Learn 83, 193–218 (2011). https://doi.org/10.1007/s10994-010-5187-6

Received: 23 September 2009
Revised: 01 February 2010
Accepted: 19 April 2010
Published: 19 May 2010
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10994-010-5187-6
