Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships.

doi:10.1021/CI034143R

Open Access • Journal Article • 10.1021/CI034143R


Jeffrey J. Sutherland, +2 more

- 21 Oct 2003

- Journal of Chemical Information and Comp...

- Vol. 43, Iss: 6, pp 1906-1915

237

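TL;DR: A classification method that fits descriptor splines to activities, with descriptors selected by a genetic algorithm (SFGA), is compared to the well-established techniques of recursive partitioning and soft independent modeling by class analogy. A consensus approach that combines all three methods outperforms the single best model for all data sets.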

Abstract: Classification methods allow for the development of structure-activity relationship models when the target property is categorical rather than continuous. We describe a classification method which fits descriptor splines to activities, with descriptors selected using a genetic algorithm. This method, which we identify as SFGA, is compared to the well-established techniques of recursive partitioning (RP) and soft independent modeling by class analogy (SIMCA) using five series of compounds: cyclooxygenase-2 (COX-2) inhibitors, benzodiazepine receptor (BZR) ligands, estrogen receptor (ER) ligands, dihydrofolate reductase (DHFR) inhibitors, and monoamine oxidase (MAO) inhibitors. Only 1-D and 2-D descriptors were used. Approximately 40% of compounds in each series were assigned to a test set, "cherry-picked" from the complete set such that they lie outside the training set as much as possible. SFGA produced models that were more predictive for all but the DHFR set, for which SIMCA was most predictive. RP gave the least predictive models for all but the MAO set. A similar trend was observed when using training and test sets to which compounds were randomly assigned and when gradually eliminating compounds from the (designed) training set. The stability of models was examined for the random and reduced sets, where stability means that classification statistics and the selected descriptors are similar for models derived from different sets. Here, SIMCA produced the most stable models, followed by SFGA and RP. We show that a consensus approach that combines all three methods outperforms the single best model for all data sets.
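The abstract does not say how the consensus of SFGA, RP, and SIMCA is formed. As a minimal sketch only, assuming a simple majority vote over per-compound class labels (the function name `consensus_predict` and the example labels are hypothetical and not taken from the paper), the idea could look like this:

```python
from collections import Counter
from typing import List, Sequence


def consensus_predict(*model_predictions: Sequence[int]) -> List[int]:
    """Combine class labels from several models by majority vote.

    Each argument holds one model's predicted labels for the same
    compounds, in the same order. Majority voting is an assumption
    here; the paper's actual combination rule may differ.
    """
    if not model_predictions:
        raise ValueError("need predictions from at least one model")
    n_compounds = len(model_predictions[0])
    if any(len(p) != n_compounds for p in model_predictions):
        raise ValueError("all models must predict the same compounds")

    consensus = []
    for votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count;
        # with three models and two classes a tie cannot occur.
        label, _count = Counter(votes).most_common(1)[0]
        consensus.append(label)
    return consensus


# Hypothetical labels (1 = active, 0 = inactive) for five test compounds.
sfga_labels = [1, 0, 1, 1, 0]
rp_labels = [1, 1, 0, 1, 0]
simca_labels = [0, 0, 1, 1, 1]

consensus_labels = consensus_predict(sfga_labels, rp_labels, simca_labels)
print(consensus_labels)
```

With three models and a two-class problem, every compound gets an unambiguous majority label. With more classes or an even number of models, a tie-breaking rule would have to be chosen explicitly.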



