Journal Article, DOI: 10.1002/asmb.2701
Self-supervised cross validation using data generation structure
5
TL;DR: In this paper, the authors discuss an overlooked gap in machine learning and predictive modeling: the role of data structure and the data generation process in partitioning observational data into training and hold-out datasets.
Abstract: Modern statistics and machine learning typically involve large amounts of data coupled with computationally intensive methods. In a predictive modeling context, one seeks models that achieve high predictive accuracy on new datasets. This is typically implemented by partitioning the data into training and hold-out datasets. The allocation is often conducted randomly, at the row level of the data matrix. In this work, we discuss an overlooked gap in machine learning and predictive modeling: the role of data structure and the data generation process in the partitioning of observational data into training and hold-out datasets. Ignoring such structures can lead to deficiencies in model generalizability and operationalization. We highlight that explicitly embracing the data generation structure when partitioning the data to validate predictive models is essential to the success of data science projects. The proposed approach is called befitting cross validation (BCV). It relies on an information quality perspective of analytics. This requires an assessment with inputs from domain experts, in contrast to automated approaches that are purely data driven. BCV is motivated by the objective of generating information quality with data and modeling. Two case studies illustrate the proposed approach. One is based on a 96-h burn-in process applied to electro-mechanical devices, implemented in order to reduce early failures at the customer site. The goal was to shorten the burn-in process with a predictive model applied at 20 h. The other case study combines tablet dissolution profiles and designed mixture experiments. The goal there was to match the dissolution profiles of the tablet under test with a brand tablet reference profile. These case studies demonstrate the methodological points made with BCV, which are generic in nature. We suggest that BCV principles should always be considered in the development of data-driven predictive models.
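The contrast the abstract draws between row-level random allocation and allocation that respects the data generation structure can be illustrated with a small sketch. This is not the authors' BCV procedure, which depends on an assessment by domain experts; it only shows one simple structure-aware split, assuming the relevant structure is a grouping variable. The device labels, the 25% hold-out fraction, and the function names `random_row_split`, `group_split`, and `shared_groups` are all hypothetical.

```python
import random
from collections import defaultdict


def random_row_split(n_rows, holdout_frac=0.25, seed=0):
    """Baseline: allocate individual rows to train/hold-out at random."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_hold = int(round(n_rows * holdout_frac))
    return sorted(idx[n_hold:]), sorted(idx[:n_hold])


def group_split(groups, holdout_frac=0.25, seed=0):
    """Structure-aware split: whole groups go to train or to hold-out."""
    rng = random.Random(seed)
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    unique = sorted(rows_by_group)
    rng.shuffle(unique)
    # Hold out at least one complete group.
    n_hold = max(1, int(round(len(unique) * holdout_frac)))
    hold_groups = set(unique[:n_hold])
    train = [i for i, g in enumerate(groups) if g not in hold_groups]
    hold = [i for i, g in enumerate(groups) if g in hold_groups]
    return train, hold


def shared_groups(groups, train, hold):
    """Groups that contribute rows to both partitions (potential leakage)."""
    return {groups[i] for i in train} & {groups[i] for i in hold}


# Hypothetical data: 8 devices with 5 measurement rows each.
groups = [f"device_{d}" for d in range(8) for _ in range(5)]

row_train, row_hold = random_row_split(len(groups))
grp_train, grp_hold = group_split(groups)

print("devices on both sides, row split:  ", len(shared_groups(groups, row_train, row_hold)))
print("devices on both sides, group split:", len(shared_groups(groups, grp_train, grp_hold)))
```

With a row-level split, the same device will usually appear on both sides of the partition, so hold-out accuracy can overstate how well the model generalizes to devices it has never seen. The group-level split rules this out by construction. Which grouping variable (if any) reflects the data generation process is the domain question that BCV asks.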
Citations
A Survey of Deep Learning for Alzheimer’s Disease
TL;DR: In this article, a survey of deep learning literature related to Alzheimer's disease, mild cognitive impairment, and related diseases from 2010 to early 2023 is presented, identifying the major types of unsupervised, supervised and semi-supervised methods developed for various tasks in this field, including the most recent developments, such as the application of recurrent neural networks, graph neural networks, and generative models.
19
Alert-based wearable sensing system for individualized thermal preference prediction
TL;DR: In this paper, an alert-based occupant-responsive framework for collecting thermal comfort-related data and predicting individualized thermal preference by leveraging wearable sensors and computing technologies is presented. However, the major issue for such model development is related to the complexity of individual data collection and the neglect of micro-environmental parameters.
11
Adversarial AI in Insurance: Pervasiveness and Resilience
TL;DR: In this paper, the authors provide examples of attacks on insurance AI applications, categorize them, and discuss defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabels.
1
An analytic journey in an industrial classification problem: How to use models to sharpen your questions
Ron Kenett, Chris Gotwalt, Jean-Michel Poggi +2 more
TL;DR: A journey between questions, models and data analysis to reach specific goals in industrial, engineering, biology and social science applications is described and several innovative methods and insights are developed that can prove useful in other contexts.
1
Cybermanufacturing and Digital Twins
TL;DR: In this article, the authors introduce the main elements of cybermanufacturing as background and context to modern industrial analytics and present an integration of models for enhanced information quality, the Bayesian flow analysis, and the Open ML community where datasets and data flows are uploaded in the spirit of open data.
References
Inference and missing data
TL;DR: In this article, it was shown that ignoring the process that causes missing data when making sampling distribution inferences about the parameter of the data, θ, is generally appropriate if and only if the missing data are missing at random and the observed data are observed at random, and then such inferences are generally conditional on the observed pattern of missing data.
10K
Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation
Peter Craven, Grace Wahba +1 more
TL;DR: In this paper, a method for estimating the optimum amount of smoothing from the data is presented, based on smoothing splines, which is well known to provide nice curves which smooth discrete, noisy data.
3.2K
Computer Aided Design of Experiments
Robert W. Kennard, L. A. Stone +1 more
TL;DR: A computer oriented method which assists in the construction of response surface type experimental plans takes into account constraints met in practice that standard procedures do not consider explicitly.
3.1K
Cluster-wise assessment of cluster stability
TL;DR: Cluster-wise stability assessment is used to distinguish between meaningful stable clusters and spurious ones, but it is also shown that clusters are sometimes stable only because of the inflexibility of certain clustering methods.
603