Self-supervised cross validation using data generation structure

doi:10.1002/asmb.2701

Journal Article

Ron S. Kenett, +3 more

- 27 Jun 2022

- Applied Stochastic Models in Business an...

- Vol. 38, Iss: 5, pp 750-765

5

TL;DR: In this paper, the authors discuss an overlooked gap in machine learning and predictive modeling: the role of data structure and the data generation process in partitioning observational data into training and hold-out datasets.

Abstract: Modern statistics and machine learning typically involve large amounts of data coupled with computationally intensive methods. In a predictive modeling context, one seeks models that achieve high predictive accuracy on new datasets. This is typically implemented by partitioning the data into training and hold-out datasets. The allocation is often conducted randomly, at the row level of the data matrix. In this work, we discuss an overlooked gap in machine learning and predictive modeling: the role of data structure and the data generation process in the partitioning of observational data into training and hold-out datasets. Ignoring such structures can lead to deficiencies in model generalizability and operationalization. We highlight that explicitly embracing the data generation structure when partitioning the data for validating predictive models is essential to the success of data science projects. The proposed approach is called befitting cross validation (BCV). It relies on an information quality perspective of analytics. This requires an assessment with inputs from domain experts, in contrast to automated approaches that are purely data driven. BCV is motivated by the objective of generating information quality with data and modeling. Two case studies illustrate the proposed approach. One is based on a 96-h burn-in process applied to electro-mechanical devices, implemented in order to reduce early failures at the customer site. The goal was to shorten the burn-in process with a predictive model applied at 20 h. The other case study combines tablet dissolution profiles and designed mixture experiments. The goal there was to match the dissolution profiles of the tablet under test with a brand tablet reference profile. These case studies demonstrate the methodological points made with BCV, which are generic in nature. We suggest that BCV principles should always be considered in the development of data-driven predictive models.
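The contrast the abstract draws between row-level random allocation and a partition that follows the data generation structure can be illustrated with a small sketch. This is not the authors' BCV procedure, which relies on domain-expert assessment; it only shows one simple way a split can respect a known grouping. The `device` grouping, the field names, and the helper functions are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def row_level_split(rows, holdout_frac=0.25, seed=0):
    """Randomly allocate individual rows, ignoring any grouping."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def group_level_split(rows, group_key, holdout_frac=0.25, seed=0):
    """Allocate whole groups, so no group spans both partitions."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    group_ids = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(group_ids)
    n_holdout = max(1, round(len(group_ids) * holdout_frac))
    holdout_ids = set(group_ids[:n_holdout])
    train = [r for g in group_ids if g not in holdout_ids for r in groups[g]]
    holdout = [r for g in group_ids if g in holdout_ids for r in groups[g]]
    return train, holdout


def shared_groups(train, holdout, group_key):
    """Return the group ids that appear in both partitions."""
    return {r[group_key] for r in train} & {r[group_key] for r in holdout}


# Hypothetical data: 8 devices with 5 repeated measurements each.
rows = [
    {"device": d, "hour": h, "reading": d * 10 + h}
    for d in range(8)
    for h in range(5)
]

train_g, hold_g = group_level_split(rows, "device")
train_r, hold_r = row_level_split(rows)

# Group-level split: no device contributes rows to both partitions.
print(len(shared_groups(train_g, hold_g, "device")))  # prints 0
# Row-level split: devices typically appear on both sides of the split.
print(len(shared_groups(train_r, hold_r, "device")))
```

When repeated measurements from the same unit land in both partitions, the hold-out score can overstate how well the model will perform on units it has never seen. Which grouping is appropriate depends on how the data were generated, which is where the domain input described in the abstract comes in.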

Citations

Journal Article•10.3390/make5020035

A Survey of Deep Learning for Alzheimer’s Disease

Qinghua Zhou, +4 more

- 09 Jun 2023

- Machine learning and knowledge extractio...

TL;DR: In this article, a survey of deep learning literature related to Alzheimer's disease, mild cognitive impairment, and related diseases from 2010 to early 2023 is presented, identifying the major types of unsupervised, supervised and semi-supervised methods developed for various tasks in this field, including the most recent developments, such as the application of recurrent neural networks, graph-neural networks, and generative models.

19

Journal Article•10.1016/j.buildenv.2023.110047

Alert-based wearable sensing system for individualized thermal preference prediction

Yanxiao Feng, +3 more

- 01 Jan 2023

- Building and Environment

TL;DR: In this paper, an alert-based, occupant-responsive framework is presented for thermal comfort-related data and for predicting individualized thermal preference by leveraging wearable sensors and computing technologies. The major issue for such model development is the complexity of individual data collection and the neglect of micro-environmental parameters.

11

Journal Article•10.48550/arXiv.2301.07520

Adversarial AI in Insurance: Pervasiveness and Resilience

Elisa Luciano, +2 more

- 17 Jan 2023

- arXiv.org

TL;DR: In this paper, the authors provide examples of attacks on insurance AI applications, categorize them, and discuss defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabels.

1

Journal Article•10.1002/qre.3449

An analytic journey in an industrial classification problem: How to use models to sharpen your questions

Ron Kenett, +2 more

- 19 Sep 2023

- Quality and Reliability Engineering Inte...

TL;DR: A journey between questions, models and data analysis to reach specific goals in industrial, engineering, biology and social science applications is described and several innovative methods and insights are developed that can prove useful in other contexts.

1

Journal Article•10.1007/978-3-031-28482-3_8

Cybermanufacturing and Digital Twins

- 01 Jan 2023

- Statistics for industry, technology, and...

TL;DR: In this article, the authors introduce the main elements of cybermanufacturing as background and context to modern industrial analytics and present an integration of models for enhanced information quality, the Bayesian flow analysis, and the Open ML community where datasets and data flows are uploaded in the spirit of open data.
