Journal of Chemical Information and Modeling

Publication date: 2006-01-01
Volume: 46 Pages: 2432 - 2444
Publisher: American Chemical Society

Author:

Karwath, Andreas
De Raedt, Luc

Keywords:

Science & Technology; Life Sciences & Biomedicine; Physical Sciences; Technology; Chemistry, Medicinal; Chemistry, Multidisciplinary; Computer Science, Information Systems; Computer Science, Interdisciplinary Applications; Pharmacology & Pharmacy; Chemistry; Computer Science; AUTOMATED STRUCTURE EVALUATION; MUTAGENICITY; SYSTEM; Algorithms; Biotransformation; Computer Simulation; Databases, Factual; Drug Evaluation, Preclinical; Estrogens; Humans; Kinetics; Ligands; Models, Chemical; Mutagenicity Tests; ROC Curve; Receptors, Estrogen; Salmonella; Software; Stereoisomerism; Structure-Activity Relationship; 0304 Medicinal and Biomolecular Chemistry; 0307 Theoretical and Computational Chemistry; 0802 Computation Theory and Mathematics; Medicinal & Biomolecular Chemistry; 3404 Medicinal and biomolecular chemistry; 3407 Theoretical and computational chemistry

Abstract:

Most approaches to structure-activity relationship (SAR) prediction proceed in two steps. In the first step, a typically large set of fingerprints, or fragments of interest, is constructed, either by hand or by recent data mining techniques. In the second step, machine learning techniques are applied to obtain a predictive model. The resulting model is often highly accurate but hard to interpret. In this paper, we demonstrate the capabilities of a novel SAR algorithm, SMIREP, which tightly integrates the fragment and model generation steps and yields simple models in the form of a small set of IF-THEN rules. These rules contain SMILES fragments, which are easy for the computational chemist to understand. SMIREP combines ideas from the well-known IREP rule learner with a novel fragmentation algorithm for SMILES strings. SMIREP has been evaluated on three problems: the prediction of binding activities for the estrogen receptor, using the National Center for Toxicological Research estrogen receptor (NCTRER) database from the Environmental Protection Agency's (EPA's) Distributed Structure-Searchable Toxicity (DSSTox) collection; the prediction of mutagenicity, using the Carcinogenic Potency Database (CPDB); and the prediction of biodegradability, on a subset of the Environmental Fate Database (EFDB). In these applications, SMIREP has the advantage of producing easily interpretable rules while achieving predictive accuracies comparable to those of alternative state-of-the-art techniques.
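
To make the idea of IF-THEN rules over SMILES fragments concrete, the following is a minimal sketch of how such a rule set might be applied to a molecule. It is not the SMIREP implementation: the `Rule` class, the `predict` function, the rules in `EXAMPLE_RULES`, and the labels "active" and "inactive" are all hypothetical and invented for illustration. Fragment matching is also simplified to a plain substring test on the SMILES text, which is only an assumption made to keep the example self-contained; it says nothing about how the rules are learned.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """IF every fragment occurs in the molecule THEN predict `label`."""
    fragments: Tuple[str, ...]
    label: str

    def fires(self, smiles: str) -> bool:
        # Simplifying assumption: a fragment "occurs" if it is a substring
        # of the SMILES text. Real substructure matching works on the
        # molecular graph, not on the string.
        return all(fragment in smiles for fragment in self.fragments)


def predict(rules: List[Rule], smiles: str, default: str = "inactive") -> str:
    """Return the label of the first rule that fires, else the default."""
    for rule in rules:
        if rule.fires(smiles):
            return rule.label
    return default


# Hypothetical rules, invented for illustration; not taken from the paper.
EXAMPLE_RULES = [
    Rule(fragments=("N(=O)=O",), label="active"),
    Rule(fragments=("c1ccccc1", "Cl"), label="active"),
]

if __name__ == "__main__":
    for smiles in ["c1ccccc1N(=O)=O", "Clc1ccccc1", "CCO"]:
        print(smiles, "->", predict(EXAMPLE_RULES, smiles))
```

Each rule in this sketch is a conjunction of fragment tests, and the rule set is read top to bottom with a default label when no rule fires, so the reason for a prediction can be traced to the single rule that matched.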