Predicting classifier performance with a small training set: Applications to computer-aided diagnosis and prognosis

doi:10.1109/ISBI.2010.5490373

Proceedings Article•10.1109/ISBI.2010.5490373

Ajay Basavanhally, +2 more

- 14 Apr 2010

- pp 229-232

18

TL;DR: A power-law model is used to evaluate and compare three classifiers (Support Vector Machine, C4.5 decision tree, k-nearest neighbor) on four distinct CAD problems; the results suggest that, given sufficient training data, SVMs tend to be the best classifiers.

Abstract: Selection of an appropriate classifier for computer-aided diagnosis (CAD) applications has typically been an ad hoc process. It is difficult to know a priori which classifier will yield high accuracies for a specific application, especially when well-annotated data for classifier training is scarce. In this study, we utilize an inverse power-law model of statistical learning to predict classifier performance when only a limited amount of annotated training data is available. The objectives of this study are to (a) predict classifier error in the context of different CAD problems when larger data cohorts become available, and (b) compare classifier performance and trends (both at the sample/patient level and at the pixel level) as additional data is accrued (such as in a clinical trial). In this paper we utilize a power-law model to evaluate and compare various classifiers (Support Vector Machine (SVM), C4.5 decision tree, k-nearest neighbor) for four distinct CAD problems. The first two datasets deal with sample/patient-level classification for distinguishing (1) high-grade from low-grade breast cancers and (2) high from low levels of lymphocytic infiltration in breast cancer specimens. The other two datasets are pixel-level classification problems for discriminating between cancerous and non-cancerous regions on prostate (3) MRI and (4) histopathology. Our empirical results suggest that, given sufficient training data, SVMs tend to be the best classifiers. This was true for datasets (1), (2), and (3), while the C4.5 decision tree was the best classifier for dataset (4). Our results also suggest that classifier comparisons made on small data cohorts should not be assumed to hold when large amounts of data become available.
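
The learning-curve idea in the abstract can be illustrated with a small sketch. It assumes the inverse power-law form e(n) = a * n^(-alpha) + b that is common in the learning-curve literature; the abstract itself does not state the exact parameterization, so this form, the function names, and the grid-search fitting procedure are assumptions made for illustration, not the authors' implementation. The two synthetic curves (`clf_a`, `clf_b`) are hypothetical and are not results from the paper; they only show how a ranking observed on small training sets can reverse once the fitted curves are extrapolated.

```python
def fit_inverse_power_law(sizes, errors, alphas=None):
    """Fit e(n) = a * n**(-alpha) + b to observed (size, error) pairs.

    Assumed model form (not taken from the paper). For each candidate
    alpha on a grid, a and b follow from ordinary least squares on the
    transformed variable x = n**(-alpha); the alpha with the smallest
    sum of squared errors is kept.
    """
    if alphas is None:
        alphas = [i / 100 for i in range(1, 301)]  # 0.01 .. 3.00
    m = len(sizes)
    mean_e = sum(errors) / m
    best = None
    for alpha in alphas:
        x = [n ** (-alpha) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue  # degenerate case: all training sizes identical
        sxy = sum((xi - mean_x) * (ei - mean_e) for xi, ei in zip(x, errors))
        a = sxy / sxx
        b = mean_e - a * mean_x
        sse = sum((a * xi + b - ei) ** 2 for xi, ei in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    _, a, alpha, b = best
    return a, alpha, b


def predict_error(n, a, alpha, b):
    """Extrapolate the fitted curve to a training set of size n."""
    return a * n ** (-alpha) + b


if __name__ == "__main__":
    # Hypothetical error rates measured on small training sets only.
    sizes = [10, 20, 40, 80]
    clf_a = [1.0 * n ** -0.6 + 0.05 for n in sizes]  # worse when data is scarce
    clf_b = [0.3 * n ** -0.3 + 0.10 for n in sizes]  # better when data is scarce

    params_a = fit_inverse_power_law(sizes, clf_a)
    params_b = fit_inverse_power_law(sizes, clf_b)

    # Compare the two classifiers at an observed size and an extrapolated one.
    for n in (10, 1000):
        err_a = predict_error(n, *params_a)
        err_b = predict_error(n, *params_b)
        print(f"n={n}: clf_a={err_a:.3f}, clf_b={err_b:.3f}")
```

With real data, the `errors` list would hold error rates estimated by repeated subsampling at each training-set size, and the fitted asymptote `b` would serve as a rough estimate of the error a classifier approaches as more annotated data is accrued.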

Citations

Journal Article•10.15282/IJSECS.1.2015.6.0006

Evaluating the effect of dataset size on predictive model using supervised learning technique

Adeleke Raheem Ajiboye, +3 more

- 01 Feb 2015

TL;DR: Findings from this study reveal that the data partitioned for training must be representative of the entire set and sufficient to span the input space, and that the learning model with the largest training set appears to be the most accurate and consistently delivers better and more stable results.

99

Journal Article•10.1109/TIP.2013.2273669

3D Lacunarity in Multifractal Analysis of Breast Tumor Lesions in Dynamic Contrast-Enhanced Magnetic Resonance Imaging

Filipe Soares, +4 more

- 01 Nov 2013

- IEEE Transactions on Image Processing

TL;DR: A novel method of 3D multifractal analysis is presented to characterize the spatial complexity (spatial arrangement of texture) of breast tumors at multiple scales; it confirms the presence of multifractality in DCE-MR volumes of the breast, whereby multiple degrees of self-similarity prevail at multiple scales.

46

Journal Article•10.1016/J.IPM.2021.102616

Improving classifier training efficiency for automatic cyberbullying detection with Feature Density

Juuso Kalevi Kristian Eronen, +5 more

- 01 Sep 2021

- Information Processing and Management

TL;DR: It is hypothesized that estimating dataset complexity allows the number of required experiment iterations to be reduced, which can optimize the resource-intensive training of ML models. This training cost is becoming a serious issue due to increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks.

39

Journal Article•10.1109/JSYST.2013.2284101

Classification of Breast Masses on Contrast-Enhanced Magnetic Resonance Images Through Log Detrended Fluctuation Cumulant-Based Multifractal Analysis

Filipe Soares, +4 more

- 01 Sep 2014

- IEEE Systems Journal

TL;DR: The results suggest that the log-cumulant C2 can be effective in classifying typically biopsy-recommended cases and can contribute to novel feature classification techniques to aid radiologists every time there is a change in the clinical course, namely, when biopsy should be considered.

26

Journal Article•10.1371/JOURNAL.PONE.0117900

Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

Ajay Basavanhally, +2 more

- 18 May 2015

- PLOS ONE

TL;DR: This paper presents a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy and suggests that this approach consistently yields error rates with lower variability.

12

References

Journal Article•10.1038/BJC.1957.43

Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years.

H. J. G. Bloom, +1 more

- 01 Sep 1957

- British Journal of Cancer

TL;DR: A study relating histological grading to prognosis in breast cancer, based on 1409 cases, 359 of which were followed for 15 years.

3.1K

Journal Article•10.1089/106652703321825928

Estimating dataset size requirements for classifying DNA microarray data.

Sayan Mukherjee, +7 more

- 01 Jan 2003

- Journal of Computational Biology

TL;DR: A statistical methodology for estimating dataset size requirements for classifying microarray data using learning curves is introduced, based on fitting inverse power-law models to construct empirical learning curves.

316

Journal Article•10.1109/TBME.2009.2035305

Computerized Image-Based Detection and Grading of Lymphocytic Infiltration in HER2+ Breast Cancer Histopathology

Ajay Basavanhally, +7 more

- 01 Mar 2010

- IEEE Transactions on Biomedical Engineer...

TL;DR: A computer-aided diagnosis (CADx) scheme is presented to automatically detect and grade the extent of lymphocytic infiltration in digitized HER2+ BC histopathology; it will potentially help clinicians determine disease outcome and allow them to make better therapy recommendations for patients with HER2+ BC.

298

Proceedings Article•10.1117/12.811899

Integrating structural and functional imaging for computer assisted detection of prostate cancer on multi-protocol in vivo 3 Tesla MRI

Satish Viswanath, +9 more

- 27 Feb 2009

- Proceedings of SPIE

TL;DR: A novel comprehensive computer-aided scheme for CaP detection from high resolution in vivo multi-protocol MRI by integrating functional and structural information obtained via dynamic-contrast enhanced (DCE) and T2-weighted (T2-w) MRI, respectively is presented.

52

Proceedings Article•10.1109/ISBI.2009.5193186

Computer-aided prognosis of ER+ breast cancer histopathology and correlating survival outcome with Oncotype DX assay

Ajay Basavanhally, +3 more

- 28 Jun 2009

TL;DR: A novel computer-aided prognosis (CAP) scheme that employs quantitatively derived image information to predict patient outcome analogous to the Oncotype DX Recurrence Score (RS), with high RS implying poor outcome and vice versa, is presented.

42