Top 55 Applied statistics papers published in 2022

Flexible domain prediction using mixed effects random forests

26 Jan 2022-Applied statistics

TL;DR: A coherent framework based on mixed effects random forests for estimating small area averages is provided, and a non-parametric bootstrap estimator for assessing the uncertainty of the estimates is proposed.

Abstract: This paper promotes the use of random forests as versatile tools for estimating spatially disaggregated indicators in the presence of small area‐specific sample sizes. Small area estimators are predominantly conceptualised within the regression‐setting and rely on linear mixed models to account for the hierarchical structure of the survey data. In contrast, machine learning methods offer non‐linear and non‐parametric alternatives, combining excellent predictive performance and a reduced risk of model‐misspecification. Mixed effects random forests combine advantages of regression forests with the ability to model hierarchical dependencies. This paper provides a coherent framework based on mixed effects random forests for estimating small area averages and proposes a non‐parametric bootstrap estimator for assessing the uncertainty of the estimates. We illustrate advantages of our proposed methodology using Mexican income‐data from the state Nuevo León. Finally, the methodology is evaluated in model‐based and design‐based simulations comparing the proposed methodology to traditional regression‐based approaches for estimating small area averages.

23 citations

Journal Article•10.1111/rssc.12566•

A unifying framework for flexible excess hazard modelling with applications in cancer epidemiology

Alessia Eletti, Giampiero Marra, Manuela Quaresma, Rosalba Radice, Francisco J. Rubio

11 Apr 2022-Applied statistics

TL;DR: In this article, a link-based additive modeling framework for the excess hazard was proposed, which allows for the inclusion of many types of covariate effects, including spatial and time-dependent effects, using any type of smoother, such as thin plate, cubic splines, tensor products and Markov random fields.

Abstract: Excess hazard modelling is one of the main tools in population‐based cancer survival research. Indeed, this setting allows for direct modelling of the survival due to cancer even in the absence of reliable information on the cause of death, which is common in population‐based cancer epidemiology studies. We propose a unifying link‐based additive modelling framework for the excess hazard that allows for the inclusion of many types of covariate effects, including spatial and time‐dependent effects, using any type of smoother, such as thin plate, cubic splines, tensor products and Markov random fields. In addition, this framework accounts for all types of censoring as well as left truncation. Estimation is conducted by using an efficient and stable penalized likelihood‐based algorithm whose empirical performance is evaluated through extensive simulation studies. Some theoretical and asymptotic results are discussed. Two case studies are presented using population‐based cancer data from patients diagnosed with breast (female), colon and lung cancers in England. The results support the presence of non‐linear and time‐dependent effects as well as spatial variation. The proposed approach is available in the R package GJRM.

12 citations

Journal Article•10.1111/rssc.12582•

Utility‐based Bayesian personalized treatment selection for advanced breast cancer

Juhee Lee, Peter F. Thall, Bora Lim, Pavlos Msaouel

09 Sep 2022-Applied statistics

TL;DR: In this article, a Bayesian method is proposed for personalized treatment selection in settings where data are available from a randomized clinical trial with two or more outcomes, and a new patient's treatment can be selected based on the posterior predictive utility distribution.

Abstract: A Bayesian method is proposed for personalized treatment selection in settings where data are available from a randomized clinical trial with two or more outcomes. The motivating application is a randomized trial that compared letrozole plus bevacizumab to letrozole alone as first‐line therapy for hormone receptor‐positive advanced breast cancer. The combination treatment arm had larger median progression‐free survival time, but also a higher rate of severe toxicities. This suggests that the risk‐benefit trade‐off between these two outcomes should play a central role in selecting each patient's treatment, particularly since older patients are less likely to tolerate severe toxicities. To quantify the desirability of each possible outcome combination for an individual patient, we elicited from breast cancer oncologists a utility function that varied with age. The utility was used as an explicit criterion for quantifying risk‐benefit trade‐offs when making personalized treatment selections. A Bayesian nonparametric multivariate regression model with a dependent Dirichlet process prior was fit to the trial data. Under the fitted model, a new patient's treatment can be selected based on the posterior predictive utility distribution. For the breast cancer trial dataset, the optimal treatment depends on the patient's age, with the combination preferable for patients 70 years or younger and the single agent preferable for patients older than 70.
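
As a rough illustration of the selection step described in the abstract, the sketch below picks the treatment whose posterior predictive utility draws have the highest mean. The function name `select_treatment`, the treatment labels and the numbers are hypothetical, and choosing by the posterior mean is an assumption made here for simplicity; the paper's dependent Dirichlet process model and the elicited age-dependent utility are not reproduced.

```python
from statistics import fmean


def select_treatment(utility_draws):
    """Return the treatment label with the highest mean utility draw.

    utility_draws maps a treatment label to a list of posterior predictive
    utility draws for one patient (hypothetical inputs, not trial data).
    """
    if not utility_draws:
        raise ValueError("at least one treatment is required")
    # Summarise each treatment's utility distribution by its mean (an assumption).
    means = {label: fmean(draws) for label, draws in utility_draws.items()}
    return max(means, key=means.get)


# Made-up draws for a single hypothetical patient.
draws_for_patient = {
    "combination": [62.0, 58.0, 65.0],
    "single_agent": [60.0, 55.0, 59.0],
}
print(select_treatment(draws_for_patient))
```

Other summaries of the utility distribution (for example a lower quantile) could be substituted for the mean if a more risk-averse rule is wanted.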

10 citations

Journal Article•10.1111/rssc.12602•

Bayesian modelling strategies for borrowing of information in randomised basket trials

Luke Ondijo Ouma, Michael J. Grayling, James Wason, Haiyan Zheng

28 Oct 2022-Applied statistics

TL;DR: In this article, the authors extend a previous analysis model based on distributional discrepancy for borrowing over the subtrial treatment effects (TEB) to borrowing over subtrial groupwise responses (TRB), and demonstrate that both modelling strategies provide substantial gains over an approach with no borrowing.

Abstract: Basket trials are an innovative precision medicine clinical trial design evaluating a single targeted therapy across multiple diseases that share a common characteristic. To date, most basket trials have been conducted in early‐phase oncology settings, for which several Bayesian methods permitting information sharing across subtrials have been proposed. With the increasing interest in implementing randomised basket trials, information borrowing could be exploited in two ways: by considering the commensurability of either the treatment effects or the outcomes specific to each of the treatment groups between the subtrials. In this article, we extend a previous analysis model based on distributional discrepancy for borrowing over the subtrial treatment effects (‘treatment effect borrowing’, TEB) to borrowing over the subtrial groupwise responses (‘treatment response borrowing’, TRB). Simulation results demonstrate that both modelling strategies provide substantial gains over an approach with no borrowing. TRB outperforms TEB on all operational characteristics, especially when subtrial sample sizes are small, while TEB has considerable gains in performance over TRB when subtrial sample sizes are large, or the treatment effects and groupwise mean responses are noticeably heterogeneous across subtrials. Further, we notice that TRB and TEB can potentially lead to different conclusions in the analysis of real data.

10 citations

Journal Article•10.1111/rssc.12576•

Nowcasting COVID-19 Deaths in England by Age and Region

15 Jun 2022-Applied statistics

TL;DR: In this article, the authors estimate the number of COVID-19 deaths per day in five age strata within seven English regions, using a Bayesian model that accounts for reporting-day effects and longer-term changes in the delay distribution.

Abstract: Understanding the trajectory of the daily number of COVID-19 deaths is essential to decisions on how to respond to the pandemic, but estimating this trajectory is complicated by the delay between deaths occurring and being reported. In England the delay is typically several days, but it can be weeks. This causes considerable uncertainty about how many deaths occurred in recent days. Here we estimate the deaths per day in five age strata within seven English regions, using a Bayesian model that accounts for reporting-day effects and longer-term changes in the delay distribution. We show how the model can be fitted in a computationally efficient way when the delay distribution is the same in multiple strata, for example, over a wide range of ages.

9 citations

Journal Article•10.1111/rssc.12557•

Bayesian imputation of COVID‐19 positive test counts for nowcasting under reporting lag

23 Apr 2022-Applied statistics

TL;DR: In this article, the authors adopt a Bayesian approach that provides for subjective priors on parameters and a hierarchical structure for an underlying latent intensity process for the infection counts, which results in a smoothed time-series representation nowcasting the expected number of daily counts of positive tests with uncertainty bands that can be used to aid decision making.

Abstract: Obtaining up-to-date information on the number of UK COVID-19 regional infections is hampered by the reporting lag in positive test results for people with COVID-19 symptoms. In the UK, for 'Pillar 2' swab tests for those showing symptoms, it can take up to five days for results to be collated. We make use of the stability of the under-reporting process over time to motivate a statistical temporal model that infers the final total count given the partial count information as it arrives. We adopt a Bayesian approach that provides for subjective priors on parameters and a hierarchical structure for an underlying latent intensity process for the infection counts. This results in a smoothed time-series representation nowcasting the expected number of daily counts of positive tests with uncertainty bands that can be used to aid decision making. Inference is performed using sequential Monte Carlo.
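
To make the idea of inferring a final total from a partial count concrete, here is a deliberately simple sketch that scales a partial count by an assumed cumulative reporting fraction at a given lag. This is not the paper's Bayesian model or its sequential Monte Carlo inference, and it yields no uncertainty bands; the function name `nowcast_final_count` and the fractions are hypothetical.

```python
def nowcast_final_count(partial_count, lag_days, cumulative_fraction):
    """Point estimate of the final count from a partial count.

    cumulative_fraction[d] is the assumed share of the final total that has
    been reported d days after the test date (hypothetical values). Beyond
    the last listed lag, reporting is treated as complete.
    """
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")
    if lag_days >= len(cumulative_fraction):
        return float(partial_count)
    fraction = cumulative_fraction[lag_days]
    if fraction <= 0:
        raise ValueError("reporting fraction must be positive")
    # Scale up: if only half of results are in, the final total is about double.
    return partial_count / fraction


# Assumed reporting profile over lags 0..4 days (made-up numbers).
fractions = [0.25, 0.5, 0.8, 0.9, 1.0]
print(nowcast_final_count(40, 1, fractions))
```

The scale-up relies on the reporting profile being stable over time, which is the same property the abstract cites as motivation for its temporal model.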

8 citations

Journal Article•10.1111/rssc.12548•

A Bayesian non‐linear state space copula model for air pollution in Beijing

Alexander Kreuzer, Luciana Dalla Valle, Claudia Czado

26 Mar 2022-Applied statistics

TL;DR: The authors propose a non-linear non-Gaussian state space model where both the observation and the state equations are defined by copula specifications, and perform Bayesian inference using the Hamiltonian Monte Carlo method.

Abstract: Air pollution is a serious issue that currently affects many industrial cities in the world and can cause severe illness to the population. In particular, it has been proven that extremely high levels of airborne contaminants have dangerous short‐term effects on human health, in terms of increased hospital admissions for cardiovascular and respiratory diseases and increased mortality risk. For these reasons, an accurate estimation of airborne pollutant concentrations is crucial. In this paper, we propose a flexible novel approach to model hourly measurements of fine particulate matter and meteorological data collected in Beijing in 2014. We show that the standard state space model, based on Gaussian assumptions, does not correctly capture the time dynamics of the observations. Therefore, we propose a non‐linear non‐Gaussian state space model where both the observation and the state equations are defined by copula specifications, and we perform Bayesian inference using the Hamiltonian Monte Carlo method. The proposed copula state space approach is very flexible, since it allows us to separately model the marginal distributions and to accommodate a wide variety of dependence structures in the data dynamics. We show that the proposed approach allows us not only to accurately estimate particulate matter measurements, but also to capture unusually high levels of air pollution, which were not detected by measured effects.

7 citations

Journal Article•10.1111/rssc.12567•

Non‐separable spatio‐temporal models via transformed multivariate Gaussian Markov random fields

Marcos O. Prates, Douglas R. M. Azevedo, Ying C. MacNab, Michael R. Willig

23 May 2022-Applied statistics

TL;DR: In this article, a class of non-separable transformed multivariate Gaussian Markov random fields (TMGMRF) is introduced to capture spatial and temporal dynamics of Nenia tridens.

Abstract: Models that capture spatial and temporal dynamics are applicable in many scientific fields. Non‐separable spatio‐temporal models were introduced in the literature to capture these dynamics. However, these models are generally complicated in construction and interpretation. We introduce a class of non‐separable transformed multivariate Gaussian Markov random fields (TMGMRF) in which the dependence structure is flexible and facilitates simple interpretations concerning spatial, temporal and spatio‐temporal parameters. Moreover, TMGMRF models have the advantage of allowing specialists to define any desired marginal distribution in model construction without suffering from spatio‐temporal confounding. Consequently, the use of spatio‐temporal models under the TMGMRF framework leads to a new class of general models, such as spatio‐temporal Gamma random fields, that can be directly used to model Poisson intensity for space–time data. The proposed model was applied to identify important environmental characteristics that affect variation in the abundance of Nenia tridens, a dominant species of gastropod in a well‐studied tropical ecosystem, and to characterize its spatial and temporal trends, which are particularly critical during the Anthropocene, an epoch of time characterized by human‐induced environmental change associated with climate and land use.

6 citations

Journal Article•10.1111/rssc.12560•

Improving cardio‐mechanic inference by combining in vivo strain data with ex vivo volume–pressure data

Alan Lazarus, Hao Gao, Xiaoyu Luo, Dirk Husmeier

22 Apr 2022-Applied statistics

TL;DR: This work proposes and assess two alternative methodological frameworks for integrating ex vivo data via the Klotz curve into the inference framework, using both a non‐empirical and empirical prior distribution.

Abstract: Cardio‐mechanic models show substantial promise for improving personalised diagnosis and disease risk prediction. However, estimating the constitutive parameters from strains extracted from in vivo cardiac magnetic resonance scans can be challenging. The reason is that circumferential strains, which are comparatively easy to extract, are not sufficiently informative to uniquely estimate all parameters, while longitudinal and radial strains are difficult to extract at high precision. In the present study, we show how cardio‐mechanic parameter inference can be improved by incorporating prior knowledge from population‐wide ex vivo volume–pressure data. Our work is based on an empirical law known as the Klotz curve. We propose and assess two alternative methodological frameworks for integrating ex vivo data via the Klotz curve into the inference framework, using both a non‐empirical and empirical prior distribution.

6 citations

Journal Article•10.1111/rssc.12546•

Posterior summaries of grocery retail topic models: Evaluation, interpretability and credibility

Mariflor Vega, Ioanna Manolopoulou, Jason O'Sullivan, Mirco Musolesi

09 Apr 2022-Applied statistics

TL;DR: A clustering methodology that post‐processes posterior LDA draws to summarise topic distributions represented as recurrent topics is introduced, demonstrating that selecting recurrent topics improves not only predictive likelihood but also interpretability and credibility in grocery retail data.

Abstract: Understanding the shopping motivations behind market baskets has significant commercial value for the grocery retail industry. The analysis of shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while delivering interpretable outcomes. Latent Dirichlet allocation (LDA) allows the processing of grocery transactions and the discovery of customer behaviours. Interpretations of topic models typically exploit individual samples, overlooking the uncertainty of single topics. Moreover, training LDA multiple times shows topics with large uncertainty, that is, topics (dis)appear in some but not all posterior samples, concurring with various authors in the field. In response, we introduce a clustering methodology that post‐processes posterior LDA draws to summarise topic distributions represented as recurrent topics. Our approach identifies clusters of topics that belong to different samples and provides associated measures of uncertainty for each group. Our proposed methodology allows the identification of an unconstrained number of customer behaviours presented as recurrent topics. We also establish a more holistic framework for model evaluation, which assesses topic models based not only on their predictive likelihood but also on quality aspects such as coherence and distinctiveness of single topics and credibility of a set of topics. Using the outcomes of a tailored survey, we set thresholds that aid in interpreting quality aspects in grocery retail data. We demonstrate that selecting recurrent topics improves not only predictive likelihood but also interpretability and credibility. We illustrate our methods with an example from a large British supermarket chain.

5 citations

Journal Article•10.1111/rssc.12554•

A dynamic structural equation approach to estimate the short‐term effects of air pollution on human health

Dani Gamerman, Luigi Ippoliti, Pasquale Valentini

31 Mar 2022-Applied statistics

TL;DR: In this paper, a dimension reduction approach based on latent factors driven by the data is proposed to estimate the link between pollutants and health outcomes, instead of forcing a necessarily high-dimensional link at the observational level.

Abstract: Detailed knowledge on the effects of air pollutants on human health is a prerequisite for the development of effective policies to reduce the adverse impact of ambient air pollution. However, measuring the effect of exposure on health outcomes is an extremely difficult task as the health impact of air pollution is known to vary over space and over different exposure periods. In general, standard approaches aggregate the information over space or time to simplify the study but this strategy fails to recognize important regional differences and runs into the well‐known risk of confounding the effects. However, modelling directly with the original, disaggregated data requires a highly dimensional model with the curse of dimensionality making inferences unstable; in these cases, the models tend to retain many irrelevant components and most relevant effects tend to be attenuated. The situation clearly calls for an intermediate solution that does not blindly aggregate data while preserving important regional features. We propose a dimension‐reduction approach based on latent factors driven by the data. These factors naturally absorb the relevant features provided by the data and establish the link between pollutants and health outcomes, instead of forcing a necessarily high‐dimensional link at the observational level. The dynamic structural equation approach is particularly suited for this task. The latent factor approach also provides a simple solution to the spatial misalignment caused by using variables with different spatial resolutions and the state‐space representation of the model favours the application of impulse response analysis. Our approach is discussed through the analysis of the short‐term effects of air pollution on hospitalization data from Lombardia and Piemonte regions (Italy).

Journal Article•10.1111/rssc.12551•

Modelling clusters of corporate defaults: Regime‐switching models significantly reduce the contagion source

Geir Drage Berentsen, Jan Bulla, Antonello Maruotti, Bård Støve

03 Apr 2022-Applied statistics

TL;DR: In this article, the authors report robust evidence that the process of corporate defaults is time-dependent and can be modelled by extending an autoregressive count time series model class via the introduction of regime-switching.

Abstract: In this paper, we report robust evidence that the process of corporate defaults is time‐dependent and can be modelled by extending an autoregressive count time series model class via the introduction of regime‐switching. That is, some of the parameters of the model depend on the regime of an unobserved Markov chain, capturing the model changes during clusters observed for count time series in corporate defaults. Thus, the process of corporate defaults is more dynamic than previously believed. Moreover, the contagion effect—that current defaults affect the probability of other firms defaulting in the future—is reduced compared to models without regime‐switching, and is only present in one regime. A two‐regime model drives the counts of monthly corporate defaults in the United States. To estimate the model, we introduce a novel quasi‐maximum likelihood estimator by adapting the extended Hamilton–Gray algorithm for the Poisson autoregressive model.

Journal Article•10.1111/rssc.12579•

Spatiotemporal ETAS Model with a Renewal Main-Shock Arrival Process

26 Jul 2022-Applied statistics

TL;DR: The renewal ETAS (RETAS) model proposed in this article is similar in spirit to the renewal Hawkes (RHawkes) process but the conditional intensity process supports a spatial component, and it empowers the main-shock intensity to reset upon the arrival of main-shocks.

Abstract: We propose a spatiotemporal point process model that enhances the classical Epidemic-Type Aftershock Sequence (ETAS) model. This is achieved with the introduction of a renewal main-shock arrival process and we call this extension the renewal ETAS (RETAS) model. This modification is similar in spirit to the renewal Hawkes (RHawkes) process but the conditional intensity process supports a spatial component. It empowers the main-shock intensity to reset upon the arrival of main-shocks. This allows for heavier clustering of main-shocks than the classical spatiotemporal ETAS model. We introduce a likelihood evaluation algorithm for parameter estimation and provide a novel procedure to evaluate the fitted model's goodness-of-fit (GOF) based on a sequential application of the Rosenblatt transformation. A simulation algorithm for the RETAS model is outlined and used to validate the numerical performance of the likelihood evaluation algorithm and GOF test procedure. We illustrate the proposed model and methods on various earthquake catalogues around the world each with distinctly different seismic activity. These catalogues demonstrate the RETAS model's additional flexibility in comparison to the classical spatiotemporal ETAS model and emphasize the potential for superior modelling and forecasting of seismicity.

Journal Article•10.1111/rssc.12558•

Multifidelity Computer Model Emulation with High-Dimensional Output: An Application to Storm Surge

09 Apr 2022-Applied statistics

TL;DR: In this paper, a parallel partial autoregressive cokriging model is proposed to predict highly accurate storm surges in a computationally efficient way over a large spatial domain, which has the capability of predicting storm surges as accurately as a high-fidelity computer model given any storm characteristics.

Abstract: Hurricane-driven storm surge is one of the most deadly and costly natural disasters, making precise quantification of the surge hazard of great importance. Surge hazard quantification is often performed through physics-based computer models of storm surges. Such computer models can be implemented with a wide range of fidelity levels, with computational burdens varying by several orders of magnitude due to the nature of the system. The threat posed by surge makes greater fidelity highly desirable; however, such models and their high-volume output tend to come at great computational cost, which can make detailed study of coastal flood hazards prohibitive. These needs make the development of an emulator combining high-dimensional output from multiple complex computer models with different fidelity levels important. We propose a parallel partial autoregressive cokriging model to predict highly accurate storm surges in a computationally efficient way over a large spatial domain. This emulator has the capability of predicting storm surges as accurately as a high-fidelity computer model given any storm characteristics over a large spatial domain.

Journal Article•10.1111/rssc.12530•

Second-Order Semi-Parametric Inference for Multivariate Log Gaussian Cox Processes

01 Jan 2022-Applied statistics

TL;DR: In this article, the second-order conditional composite likelihood is used to infer the pair correlation and cross-pair correlation functions of a multivariate log Gaussian Cox process (LGCP) with a complex intensity function.

Abstract: This paper introduces a new approach to inferring the second-order properties of a multivariate log Gaussian Cox process (LGCP) with a complex intensity function. We assume a semi-parametric model for the multivariate intensity function containing an unspecified complex factor common to all types of points. Given this model, we construct a second-order conditional composite likelihood to infer the pair correlation and cross pair correlation functions of the LGCP. Crucially, this likelihood does not depend on the unspecified part of the intensity function. We also introduce a cross-validation method for model selection and an algorithm for regularized inference that can be used to obtain sparse models for cross pair correlation functions. The methodology is applied to simulated data as well as data examples from microscopy and criminology. This shows how the new approach outperforms existing alternatives where the intensity functions are estimated non-parametrically.

Journal Article•10.1111/rssc.12604•

Derivation of maternal dietary patterns accounting for regional heterogeneity

Briana J.K. Stephenson, Amy H. Herring, Andrew F. Olshan

18 Oct 2022-Applied statistics

TL;DR: In this paper, a flexible supervised clustering approach, introduced as Supervised Robust Profile Clustering, was used to identify outcome-dependent population-based patterns, while partitioning out subpopulation pattern differences.

Abstract: Latent class models are often used to characterise dietary patterns. Yet, when subtle variations exist across different sub‐populations, overall population patterns can be masked and affect statistical inference on health outcomes. We address this concern with a flexible supervised clustering approach, introduced as Supervised Robust Profile Clustering, that identifies outcome‐dependent population‐based patterns, while partitioning out subpopulation pattern differences. Using dietary data from the 1997–2011 National Birth Defects Prevention Study, we determine how maternal dietary profiles associate with orofacial clefts among offspring. Results indicate mothers who consume a higher proportion of fruits and vegetables compared to land meats lower the proportion of progeny with orofacial cleft defect.

Journal Article•10.1111/rssc.12564•

Reliability Analysis of Artificial Intelligence Systems Using Recurrent Events Data from Autonomous Vehicles

Nur Khofifah Zain (Wuhan University)

29 Apr 2022-Applied statistics

TL;DR: In this article, the authors use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modelling and analysing the recurrent events data from AV driving tests.

Abstract: Artificial intelligence (AI) systems have become increasingly common and the trend will continue. Examples of AI systems include autonomous vehicles (AV), computer vision, natural language processing and AI medical experts. To allow for safe and effective deployment of AI systems, the reliability of such systems needs to be assessed. Traditionally, reliability assessment is based on reliability test data and the subsequent statistical modelling and analysis. The availability of reliability data for AI systems, however, is limited because such data are typically sensitive and proprietary. The California Department of Motor Vehicles (DMV) oversees and regulates an AV testing program, in which many AV manufacturers are conducting AV road tests. Manufacturers participating in the program are required to report recurrent disengagement events to California DMV. This information is being made available to the public. In this paper, we use recurrent disengagement events as a representation of the reliability of the AI system in AV, and propose a statistical framework for modelling and analysing the recurrent events data from AV driving tests. We use traditional parametric models in software reliability and propose a new non-parametric model based on monotonic splines to describe the event process and to estimate the cumulative baseline intensity function of the event process. We develop inference procedures for selecting the best models, quantifying uncertainty and testing heterogeneity in the event process. We then analyse the recurrent events data from four AV manufacturers, and make inferences on the reliability of the AI systems in AV. We also describe how the proposed analysis can be applied to assess the reliability of other AI systems. This paper has online supplementary materials.

Journal Article•10.1111/rssc.12523•

Urnings: A New Method for Tracking Dynamically Changing Parameters in Paired Comparison Systems

01 Jan 2022-Applied statistics

TL;DR: In this article, a new rating system for tracking the development of parameters based on a stream of observations that can be viewed as paired comparisons is introduced, where a rating is represented by a probability (i.e. proportion of balls in the urn).

Abstract: We introduce a new rating system for tracking the development of parameters based on a stream of observations that can be viewed as paired comparisons. Rating systems are applied in competitive games, adaptive learning systems and platforms for product and service reviews. We model each observation as an outcome of a game of chance that depends on the parameters of interest (e.g. the outcome of a chess game depends on the abilities of the two players). Determining the probabilities of the different game outcomes is conceptualized as an urn problem, where a rating is represented by a probability (i.e. proportion of balls in the urn). This setup allows for evaluating the standard errors of the ratings and performing statistical inferences about the development of, and relations between, parameters. Theoretical properties of the system in terms of the invariant distributions of the ratings and their convergence are derived. The properties of the rating system are illustrated with simulated examples and its potential for answering research questions is illustrated using data from competitive chess, a movie review system, and an adaptive learning system for math.

Journal Article•10.1111/rssc.12594•

Leveraging network structure to improve pooled testing efficiency

[...]

Daniel K. Sewell

16 Sep 2022-Applied statistics

TL;DR: The results suggest that the approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.

Abstract: Screening is a powerful tool for infection control, allowing for infectious individuals, whether they be symptomatic or asymptomatic, to be identified and isolated. The resource burden of regular and comprehensive screening can often be prohibitive, however. One such measure to address this is pooled testing, whereby groups of individuals are each given a composite test; should a group receive a positive diagnostic test result, those comprising the group are then tested individually. Infectious disease is spread through a transmission network, and this paper shows how assigning individuals to pools based on this underlying network can improve the efficiency of the pooled testing strategy, thereby reducing the resource burden. We designed a simulated annealing algorithm to improve the pooled testing efficiency as measured by the ratio of the expected number of correct classifications to the expected number of tests performed. We then evaluated our approach using an agent‐based model designed to simulate the spread of SARS‐CoV‐2 in a school setting. Our results suggest that our approach can decrease the number of tests required to regularly screen the student body, and that these reductions are quite robust to assigning pools based on partially observed or noisy versions of the network.
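
The two-stage pooled testing scheme described in this abstract can be illustrated with a short calculation. The sketch below is not the paper's simulated annealing algorithm; it only compares the expected number of tests for one pool under two assumed extremes (independent infections versus perfectly clustered infections) and assumes an error-free test. The function names are hypothetical.

```python
def prob_pool_positive_independent(pool_size, prevalence):
    """P(at least one infected member) when infections are independent."""
    return 1.0 - (1.0 - prevalence) ** pool_size


def expected_tests_for_pool(pool_size, prob_pool_positive):
    """One composite test, plus one test per member if the pool is positive."""
    return 1.0 + pool_size * prob_pool_positive


if __name__ == "__main__":
    k, p = 5, 0.05
    independent = expected_tests_for_pool(k, prob_pool_positive_independent(k, p))
    # Assumed extreme case: members of a pool are infected together or not at all,
    # so the pool is positive with probability p.
    clustered = expected_tests_for_pool(k, p)
    print(f"independent members: {independent:.4f} expected tests")
    print(f"clustered members:   {clustered:.4f} expected tests")
```

Grouping individuals whose infection statuses are positively dependent lowers the chance that a pool tests positive, which is the intuition behind assigning pools along the transmission network.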

Journal Article•10.1111/rssc.12536•

Outcome-Guided Sparse K-Means for Disease Subtype Discovery via Integrating Phenotypic Data with High-Dimensional Transcriptomic Data

[...]

Wim Ceelen¹•Institutions (1)

University of Florida Health¹

01 Mar 2022-Applied statistics

TL;DR: GuidedSparseKmeans is an outcome-guided sparse K-means method that integrates phenotypic data with high-dimensional omics data to facilitate biologically meaningful clustering results.

Abstract: The discovery of disease subtypes is an essential step for developing precision medicine, and disease subtyping via omics data has become a popular approach. While promising, subtypes obtained from existing approaches are not necessarily associated with clinical outcomes. With the rich clinical data along with the omics data in modern epidemiology cohorts, it is urgent to develop an outcome-guided clustering algorithm to fully integrate the phenotypic data with the high-dimensional omics data. Hence, we extended a sparse K-means method to an outcome-guided sparse K-means (GuidedSparseKmeans) method. A unified objective function was proposed, comprising (i) weighted K-means to perform sample clustering; (ii) lasso regularization to perform gene selection from the high-dimensional omics data; and (iii) incorporation of a phenotypic variable from the clinical dataset to facilitate biologically meaningful clustering results. By iteratively optimizing the objective function, we simultaneously obtain phenotype-related sample clustering results and gene selection results. We demonstrated the superior performance of GuidedSparseKmeans by comparing it with existing clustering methods in simulations and in applications to high-dimensional transcriptomic data of breast cancer and Alzheimer's disease. Our algorithm has been implemented in an R package, which is publicly available on GitHub (https://github.com/LingsongMeng/GuidedSparseKmeans).

Journal Article•10.1111/rssc.12597•

Sequential one‐step estimator by sub‐sampling for customer churn analysis with massive data sets

[...]

Feifei Wang, Danyang Huang, Tiancheng Gao, Shuyuan Wu, Hansheng Wang

19 Sep 2022-Applied statistics

TL;DR: In this article, a sequential one-step (SOS) estimation method is proposed for large-scale customer churn analysis, in which data points need to be sampled only with uniform probabilities and the sampling step is conducted repeatedly.

Abstract: Customer churn is one of the most important concerns for large companies. Currently, massive data are often encountered in customer churn analysis, which bring new challenges for model computation. To cope with these concerns, sub‐sampling methods are often used to accomplish data analysis tasks of large scale. To cover more informative samples in one sampling round, classic sub‐sampling methods need to compute non‐uniform sampling probabilities for all data points. However, this method creates a huge computational burden for data sets of large scale and therefore, is not applicable in practice. In this study, we propose a sequential one‐step (SOS) estimation method based on repeated sub‐sampling data sets. In the SOS method, data points need to be sampled only with uniform probabilities, and the sampling step is conducted repeatedly. In each sampling step, a new estimate is computed via one‐step updating based on the newly sampled data points. This leads to a sequence of estimates, of which the final SOS estimate is their average. We theoretically show that both the bias and the standard error of the SOS estimator can decrease with increasing sub‐sampling sizes or sub‐sampling times. The finite sample SOS performances are assessed through simulations. Finally, we apply this SOS method to analyse a real large‐scale customer churn data set in a securities company. The results show that the SOS method has good interpretability and prediction power in this real application.
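
The repeated sub-sampling idea in this abstract can be sketched for a logistic churn model with an intercept and one covariate. This is a simplified illustration under stated assumptions, not the authors' implementation: the pilot estimate, the choice to start each one-step update from the previous estimate, and all function names (`newton_step`, `sos_estimate`, `simulate_churn`) are assumptions made here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def newton_step(beta, sample):
    """One Newton-Raphson step for a logistic model with intercept and one slope."""
    b0, b1 = beta
    g0 = g1 = 0.0            # score (gradient of the log-likelihood)
    h00 = h01 = h11 = 0.0    # observed information (negative Hessian)
    for x, y in sample:
        p = sigmoid(b0 + b1 * x)
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    # Solve the 2x2 system (information) * delta = score by Cramer's rule.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (b0 + d0, b1 + d1)


def sos_estimate(data, n_sub, n_rounds, rng, pilot_steps=5):
    """Average of one-step estimates computed on repeated uniform sub-samples."""
    # Assumed pilot: a few Newton steps on a single uniform sub-sample.
    pilot = rng.sample(data, n_sub)
    beta = (0.0, 0.0)
    for _ in range(pilot_steps):
        beta = newton_step(beta, pilot)
    estimates = []
    for _ in range(n_rounds):
        sample = rng.sample(data, n_sub)   # uniform probabilities only
        beta = newton_step(beta, sample)   # one-step update on the new sub-sample
        estimates.append(beta)
    k = len(estimates)
    return (sum(b[0] for b in estimates) / k, sum(b[1] for b in estimates) / k)


def simulate_churn(n, beta, rng):
    """Synthetic (covariate, churn indicator) pairs from a logistic model."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(beta[0] + beta[1] * x) else 0
        data.append((x, y))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    data = simulate_churn(20000, (-0.5, 1.0), rng)
    print(sos_estimate(data, n_sub=500, n_rounds=20, rng=rng))
```

Because every sub-sample is drawn uniformly, no sampling probabilities have to be computed over the full data set; accuracy is controlled through the sub-sample size and the number of rounds.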

Journal Article•10.1093/jrsssc/qlad009•

Missing data patterns in runners’ careers: do they matter?

[...]

M. Stival, Mauro Bernardi, Manuela Cattelan, Petros Dellaportas

25 Jun 2022-Applied statistics

TL;DR: In this paper, a latent class matrix-variate state space model is proposed for the annual seasonal best performances of male middle distance runners over a period of 14 years, within a modelling framework that accounts for both the fact that each runner has typically run in three distance events (800, 1,500, and 5,000 m) and the presence of periods of no running activities.

Abstract: Predicting the future performance of young runners is an important research issue in experimental sports science and performance analysis. We analyse a dataset with annual seasonal best performances of male middle distance runners for a period of 14 years and provide a modelling framework that accounts for both the fact that each runner has typically run in 3 distance events (800, 1,500, and 5,000 m) and the presence of periods of no running activities. We propose a latent class matrix-variate state space model and we empirically demonstrate that accounting for missing data patterns in runners’ careers improves the out of sample prediction of their performances over time. In particular, we demonstrate that for this analysis, the missing data patterns provide valuable information for the prediction of runners’ performance.

Journal Article•10.1111/rssc.12527•

A Discrete Kernel Stick-Breaking Model for Detecting Spatial Boundaries in Hydraulic Fracturing Wastewater Disposal Well Placement Across Ohio

[...]

01 Jan 2022-Applied statistics

TL;DR: In this paper, a new discrete areal data kernel function is introduced for use in the kernel stick-breaking process framework that is shown to yield improved detection of spatial boundaries, estimation of regression parameters and model fit through a simulation study and comparison with existing approaches.

Abstract: Detecting sharp differences, or boundaries, in areal data can uncover important biological, physical and/or social differences between spatial regions. We introduce a new discrete areal data kernel function for use in the kernel stick-breaking process framework that is shown to yield improved (i) detection of spatial boundaries, (ii) estimation of regression parameters and (iii) model fit through a simulation study and comparison with existing approaches. We use the model to analyse county-level hydraulic fracturing Class II injection well counts in Ohio, where interesting boundary patterns may exist due to the close connection between hydraulic fracturing and shale rock formations. Class II injection wells are used for disposing hydraulic fracturing liquid waste and may pose an environmental risk for surrounding communities. Counties located on the Devonian shale with increased poverty, less income equality, smaller proportion of the population that is white, and increased population density are found to contain more wells, with the relationship reversed for counties off the shale. Results suggest that the new method provides improved model fit and is robust to the exclusion of an important spatially varying covariate, while also detecting boundaries surrounding different shale rock formations. The method is implemented in the R package KSBound.

Journal Article•10.1111/rssc.12572•

Stopping time detection of wood panel compression: A functional time‐series approach

[...]

Han Lin Shang, Jiguo Cao, Peijun Sang

27 Apr 2022-Applied statistics

TL;DR: In this paper, an estimation procedure is proposed to determine the optimal stopping time for the glue curing of wood panels under compression in an automatic process environment, together with the estimation uncertainty associated with the estimated stopping time.

Abstract: We consider determining the optimal stopping time for the glue curing of wood panels in an automatic process environment. Using the near‐infrared spectroscopy technology to monitor the manufacturing process ensures substantial savings in energy and time. We collect a time‐series of curves from a near‐infrared spectrum probe consisting of 72 spectra and aim to detect an optimal stopping time. We propose an estimation procedure to determine the optimal stopping time of wood panel compression and the estimation uncertainty associated with the estimated stopping time. Our method first divides the entire data set into a training sample and a testing sample, then iteratively computes integrated squared forecast errors based on the testing sample. We then apply a structural break detection method with one breakpoint to determine an estimated optimal stopping time from a univariate time‐series of the integrated squared forecast errors. We also investigate the finite sample performance of the proposed method via a series of simulation studies.
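
The final step described in this abstract, locating a single breakpoint in a univariate series of forecast errors, can be sketched as a least-squares mean-shift search. This is one common way to detect a single structural break and is used here as an assumption; the abstract does not specify which detection method the authors apply, and the function name is hypothetical.

```python
def single_breakpoint(series):
    """Return the index k that best splits the series into two constant-mean
    segments series[:k] and series[k:], in the least-squares sense."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    best_k, best_sse = None, float("inf")
    for k in range(1, n):
        left, right = series[:k], series[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = sum((v - mean_left) ** 2 for v in left)
        sse += sum((v - mean_right) ** 2 for v in right)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k


if __name__ == "__main__":
    # Made-up forecast errors: large while curing is ongoing, small afterwards.
    errors = [5.1, 4.9, 5.0, 5.2, 4.8, 1.1, 0.9, 1.0, 1.2]
    print(single_breakpoint(errors))
```

The returned index marks the first observation of the second regime, which in this setting would correspond to the estimated stopping time.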

Journal Article•10.1111/rssc.12599•

Non-parametric Calibration of Multiple Related Radiocarbon Determinations and their Calendar Age Summarisation

[...]

17 Oct 2022-Applied statistics

TL;DR: In this article, a non-parametric Bayesian approach using a Dirichlet process mixture model, with slice sampling to address the multi-modality typical within radiocarbon (¹⁴C) calibration, is developed; it simultaneously calibrates a set of related ¹⁴C determinations and provides a predictive estimate for the underlying calendar age of a future sample.

Abstract: Due to fluctuations in past radiocarbon (¹⁴C) levels, calibration is required to convert ¹⁴C determinations X_i into calendar ages θ_i. In many studies, we wish to calibrate a set of related samples taken from the same site or context, which have calendar ages drawn from the same shared, but unknown, density f(θ). Calibration of X_1, …, X_n can be improved significantly by incorporating the knowledge that the samples are related. Furthermore, summary estimates of the underlying shared f(θ) can provide valuable information on changes in population size/activity over time. Most current approaches require a parametric specification for f(θ) which is often not appropriate. We develop a rigorous non-parametric Bayesian approach using a Dirichlet process mixture model, with slice sampling to address the multi-modality typical within ¹⁴C calibration. Our approach simultaneously calibrates the set of ¹⁴C determinations and provides a predictive estimate for the underlying calendar age of a future sample. We show, in a simulation study, the improvement in calendar age estimation when jointly calibrating related samples using our approach, compared with calibration of each ¹⁴C determination independently. We also illustrate the use of the predictive calendar age estimate to provide insight on activity levels over time using three real-life case studies.

Journal Article•10.1111/rssc.12550•

Generalizing Trial Evidence to Target Populations in Non-Nested Designs: Applications to AIDS Clinical Trials

[...]

Doris González-Fernández¹•Institutions (1)

Yale University¹

17 Mar 2022-Applied statistics

TL;DR: In this article, weighting, regression and doubly robust estimators are used to estimate the causal effects of HIV interventions in a specified population of people living with HIV in the USA.

Abstract: Comparative effectiveness evidence from randomized trials may not be directly generalizable to a target population of substantive interest when, as in most cases, trial participants are not randomly sampled from the target population. Motivated by the need to generalize evidence from two trials conducted in the AIDS Clinical Trials Group (ACTG), we consider weighting, regression and doubly robust estimators to estimate the causal effects of HIV interventions in a specified population of people living with HIV in the USA. We focus on a non-nested trial design and discuss strategies for both point and variance estimation of the target population average treatment effect. Specifically in the generalizability context, we demonstrate both analytically and empirically that estimating the known propensity score in trials does not increase the variance for each of the weighting, regression and doubly robust estimators. We apply these methods to generalize the average treatment effects from two ACTG trials to specified target populations and operationalize key practical considerations. Finally, we report on a simulation study that investigates the finite-sample operating characteristics of the generalizability estimators and their sandwich variance estimators.

Journal Article•10.1111/rssc.12571•

Assessing predictive discrimination performance of biomarkers in the presence of treatment‐induced dependent censoring

[...]

Cuihong Zhang, Jing Ning, Steven H. Belle, Robert H. Squires, Jianwen Cai, Ruosha Li

25 May 2022-Applied statistics

TL;DR: This article tackles the dependence between the failure event and the dependent censoring event using auxiliary information in multiple longitudinal risk factors, because existing methods for assessing the predictive performance of biomarkers assume independent censoring and are thus not applicable.

Abstract: In medical studies, some therapeutic decisions could lead to dependent censoring for the survival outcome of interest. This is exemplified by a study of paediatric acute liver failure, where death was subject to dependent censoring due to liver transplantation. Existing methods for assessing the predictive performance of biomarkers often impose the independent censoring assumption and are thus not applicable. In this work, we propose to tackle the dependence between the failure event and dependent censoring event using auxiliary information in multiple longitudinal risk factors. We propose estimators of sensitivity, specificity and area under curve, to discern the predictive power of biomarkers for the failure event by removing the disturbance of dependent censoring. Point estimation and inferential procedures were developed by adopting the joint modelling framework. The proposed methods performed satisfactorily in extensive simulation studies. We applied them to examine the predictive value of various biomarkers and risk scores for mortality in the motivating example.

Journal Article•10.1111/rssc.12565•

A nearest‐neighbour Gaussian process spatial factor model for censored, multi‐depth geochemical data

[...]

Tilman M. Davies, Sudipto Banerjee, Adam P. Martin, R.E. Turnbull

19 May 2022-Applied statistics

TL;DR: In this article, the authors investigate the relationships between local environmental variables and the geochemical composition of the Earth in a region spanning over 26,000 km² in the lower South Island of New Zealand.

Abstract: We investigate the relationships between local environmental variables and the geochemical composition of the Earth in a region spanning over 26,000 km² in the lower South Island of New Zealand. Part of the Southland–South Otago geochemical baseline survey (a pilot study pre‐empting roll‐out across the country), the data comprise the measurements of 59 chemical trace elements, each at two depth prescriptions, at several hundred spatial sites. We demonstrate construction of a hierarchical spatial factor model that captures inter‐depth dependency; handles imputation of left‐censored readings in a statistically principled manner; and exploits sparse approximations to Gaussian processes to deliver inference. The voluminous results provide a novel impression of the underlying processes and are presented graphically via simple web‐based applications. These both confirm existing knowledge and provide a basis from which new research hypotheses in geochemistry might be formed.

Journal Article•10.1111/rssc.12559•

Modelling the Extremes of Seasonal Viruses and Hospital Congestion: The Example of Flu in a Swiss Hospital

[...]

13 Apr 2022-Applied statistics

TL;DR: In this article, a discrete generalized Pareto distribution for the number of positive and negative cases is proposed to model the extremes of hospital visits from patients showing flu-like symptoms.

Abstract: Viruses causing flu or milder coronavirus colds are often referred to as ‘seasonal viruses’ as they tend to subside in warmer months. In other words, meteorological conditions tend to impact the activity of viruses, and this information can be exploited for the operational management of hospitals. In this study, we use 3 years of daily data from one of the biggest hospitals in Switzerland and focus on modelling the extremes of hospital visits from patients showing flu-like symptoms and the number of positive flu cases. We propose employing a discrete generalized Pareto distribution for the number of positive and negative cases. Our modelling framework allows for the parameters of these distributions to be linked to covariate effects, and for outlying observations to be dealt with via a robust estimation approach. Because meteorological conditions may vary over time, we use meteorological and not calendar variations to explain hospital charge extremes, and our empirical findings highlight their significance. We propose a measure of hospital congestion and a related tool to estimate the resulting CaRe (Charge-at-Risk-estimation) under different meteorological conditions. The relevant numerical computations can be easily carried out using the freely available GJRM R package. The empirical effectiveness of the proposed method is assessed through a simulation study.
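
For readers unfamiliar with the discrete generalized Pareto distribution mentioned in this abstract, the sketch below evaluates its probability mass function using the standard construction that discretizes a continuous generalized Pareto variable by taking its integer part. The parameter names `sigma` (scale) and `xi` (shape) are assumptions, the parameterization used in the GJRM package may differ, and no covariate effects or robust estimation are included.

```python
import math


def gpd_survival(y, sigma, xi):
    """P(X > y) for a continuous generalized Pareto variable X with
    scale sigma > 0 and shape xi (xi = 0 gives the exponential case)."""
    if y <= 0:
        return 1.0
    if xi == 0.0:
        return math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:          # beyond the upper endpoint when xi < 0
        return 0.0
    return base ** (-1.0 / xi)


def dgpd_pmf(k, sigma, xi):
    """P(Y = k) for Y = floor(X), k = 0, 1, 2, ..."""
    return gpd_survival(k, sigma, xi) - gpd_survival(k + 1, sigma, xi)


def dgpd_tail(k, sigma, xi):
    """P(Y >= k), the probability of a count of at least k."""
    return gpd_survival(k, sigma, xi)


if __name__ == "__main__":
    # Illustrative parameter values only, not estimates from the paper.
    for k in range(5):
        print(k, dgpd_pmf(k, sigma=2.0, xi=0.2))
```

A tail probability such as `dgpd_tail(k, sigma, xi)` is the kind of quantity a congestion measure could be built on, although the paper's CaRe definition is not reproduced here.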

Journal Article•10.1111/rssc.12533•

Ranking Tailoring Variables for Constructing Individualized Treatment Rules: An Application to Schizophrenia

[...]

Yuan Yuan Liu¹•Institutions (1)

Kaiser Permanente Washington Health Research Institute¹

01 Mar 2022-Applied statistics

TL;DR: In this paper, the authors address the challenge of identifying and ranking important variables for tailoring treatment decisions for patients with schizophrenia and apply their methodology to the Clinical Antipsychotics Trial of Intervention and Effectiveness (CATIE) study for schizophrenia.

Abstract: As with many chronic conditions, matching patients with schizophrenia to the best treatment option is difficult. Selecting antipsychotic medication is especially challenging because many of the medications can have burdensome side effects. Adjusting or tailoring medications based on patients’ characteristics could improve symptoms. However, it is often not known which patient characteristics are most helpful for informing treatment selection. In this paper, we address the challenge of identifying and ranking important variables for tailoring treatment decisions. We consider a value-search approach implemented through dynamic marginal structural models to estimate an optimal individualized treatment rule. We apply our methodology to the Clinical Antipsychotics Trial of Intervention and Effectiveness (CATIE) study for schizophrenia, to evaluate if some tailoring variables have greater potential than others for selecting treatments for patients with schizophrenia (Stroup et al., 2003, Schizophrenia Bulletin, 29, 15–31).
