Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships.
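TL;DR: A classification method that fits descriptor splines to activities, with descriptors selected by a genetic algorithm (SFGA), is compared to the well-established techniques of recursive partitioning and soft independent modeling by class analogy; a consensus approach that combines all three methods outperforms the single best model for all data sets.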
Abstract: Classification methods allow for the development of structure-activity relationship models when the target property is categorical rather than continuous. We describe a classification method which fits descriptor splines to activities, with descriptors selected using a genetic algorithm. This method, which we identify as SFGA, is compared to the well-established techniques of recursive partitioning (RP) and soft independent modeling by class analogy (SIMCA) using five series of compounds: cyclooxygenase-2 (COX-2) inhibitors, benzodiazepine receptor (BZR) ligands, estrogen receptor (ER) ligands, dihydrofolate reductase (DHFR) inhibitors, and monoamine oxidase (MAO) inhibitors. Only 1-D and 2-D descriptors were used. Approximately 40% of compounds in each series were assigned to a test set, "cherry-picked" from the complete set such that they lie outside the training set as much as possible. SFGA produced models that were more predictive for all but the DHFR set, for which SIMCA was most predictive. RP gave the least predictive models for all but the MAO set. A similar trend was observed when using training and test sets to which compounds were randomly assigned and when gradually eliminating compounds from the (designed) training set. The stability of models was examined for the random and reduced sets, where stability means that classification statistics and the selected descriptors are similar for models derived from different sets. Here, SIMCA produced the most stable models, followed by SFGA and RP. We show that a consensus approach that combines all three methods outperforms the single best model for all data sets.
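The consensus idea in the abstract can be illustrated with a short sketch. The abstract does not say how the three models are combined, so the example below assumes plain majority voting over per-compound class labels; the function name `consensus_predict`, the label strings, and the toy predictions are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def consensus_predict(predictions_by_model):
    """Combine class predictions from several models by majority vote.

    ASSUMPTION: simple majority voting; the paper's actual consensus
    rule is not described in the abstract.

    predictions_by_model maps a model name to a list of predicted class
    labels, one per compound, in the same compound order for every model.
    Returns one label per compound, or None when the top vote is tied.
    """
    if not predictions_by_model:
        raise ValueError("at least one model is required")

    lengths = {len(p) for p in predictions_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of compounds")
    n_compounds = lengths.pop()

    consensus = []
    for i in range(n_compounds):
        votes = Counter(preds[i] for preds in predictions_by_model.values())
        ranked = votes.most_common()
        # A tie for first place means there is no majority label.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus.append(None)
        else:
            consensus.append(ranked[0][0])
    return consensus


# Toy predictions (made up for illustration) for three compounds.
toy_predictions = {
    "SFGA": ["active", "inactive", "active"],
    "RP": ["active", "active", "inactive"],
    "SIMCA": ["inactive", "inactive", "active"],
}
print(consensus_predict(toy_predictions))
```

With three models and two classes a tie cannot occur, so every compound receives a label; the `None` branch only matters for an even number of models or for more than two classes.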
Citations
•Posted Content
SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules
TL;DR: The fact that multiple SMILES strings can represent the same molecule is explored as a data augmentation technique for a molecular QSAR dataset modeled by a neural network based on long short-term memory (LSTM) cells.
Chemical Informatics Functionality in R
TL;DR: The rcdk package is described, which gives the R user access to the CDK, a Java framework for cheminformatics, making it possible to read a variety of molecular formats, calculate molecular descriptors, and evaluate fingerprints.
Bioactive molecule prediction using extreme gradient boosting
Ismail B. Mustapha, Faisal Saeed +1 more
TL;DR: Extreme gradient boosting (XGBoost), an ensemble of classification and regression trees and a variant of the gradient boosting machine, was investigated for predicting biological activity from a quantitative description of the compound’s molecular structure, and showed remarkable performance on both high- and low-diversity datasets.
A Survey on Graph Kernels
TL;DR: Graph kernels have become an established and widely used technique for solving classification tasks on graphs; this survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years.
•Posted Content
Adversarial Attack and Defense on Graph Data: A Survey.
TL;DR: This work systematically organizes the considered works based on the features of each topic and provides a unified formulation for adversarial learning on graph data that covers most adversarial learning studies on graphs.
References
•Book
Classification and regression trees
Leo Breiman
- 01 Jan 1983
TL;DR: The methodology used to construct tree-structured rules is the focus of this monograph, which covers the use of trees as a data analysis method and, in a more mathematical framework, proves some of their fundamental properties.
Pattern recognition by means of disjoint principal components models
TL;DR: Pattern recognition based on modelling each separate class by a separate principal components (PC) model is discussed and these PC models are shown to be able to approximate any continuous variation within a single class.
Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships
David Rogers, Anton J. Hopfinger +1 more
TL;DR: The genetic function approximation (GFA) algorithm is applied to three published data sets to demonstrate it is an effective tool for doing both QSAR and QSPR.
Assessing model fit by cross-validation.
TL;DR: It is shown by theoretical argument and empiric study of a large QSAR data set that when the available sample size is small, holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, but ensure that this is done properly.