TL;DR: The validity of inference from linear mixed models in stepped-wedge cluster randomized trials under misspecified random-effects structures can be maintained by using robust variance estimators.
Abstract: Linear mixed models are commonly used in analyzing stepped-wedge cluster randomized trials. A key consideration for analyzing a stepped-wedge cluster randomized trial is accounting for the potentially complex correlation structure, which can be achieved by specifying random-effects. The simplest random effects structure is random intercept but more complex structures such as random cluster-by-period, discrete-time decay, and more recently, the random intervention structure, have been proposed. Specifying appropriate random effects in practice can be challenging: assuming more complex correlation structures may be reasonable but they are vulnerable to computational challenges. To circumvent these challenges, robust variance estimators may be applied to linear mixed models to provide consistent estimators of standard errors of fixed effect parameters in the presence of random-effects misspecification. However, there has been no empirical investigation of robust variance estimators for stepped-wedge cluster randomized trials. In this article, we review six robust variance estimators (both standard and small-sample bias-corrected robust variance estimators) that are available for linear mixed models in R, and then describe a comprehensive simulation study to examine the performance of these robust variance estimators for stepped-wedge cluster randomized trials with a continuous outcome under different data generators. For each data generator, we investigate whether the use of a robust variance estimator with either the random intercept model or the random cluster-by-period model is sufficient to provide valid statistical inference for fixed effect parameters, when these working models are subject to random-effect misspecification. Our results indicate that the random intercept and random cluster-by-period models with robust variance estimators performed adequately. 
The CR3 robust variance estimator (approximate jackknife), coupled with a degrees-of-freedom correction equal to the number of clusters minus two, consistently gave the best coverage results, but could be slightly conservative when the number of clusters was below 16. We summarize the implications of our results for the linear mixed model analysis of stepped-wedge cluster randomized trials and offer some practical recommendations on the choice of the analytic model.
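To make the sandwich idea concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes an ordinary least squares working model (independence) rather than a linear mixed model, a toy stepped-wedge layout with simulated data, and the common textbook form of CR3 in which each cluster's residuals are multiplied by the inverse of (I − H_gg) with no extra finite-sample multiplier. The function name `cluster_robust_vcov` and all variable names are hypothetical.

```python
import numpy as np

def cluster_robust_vcov(X, y, cluster, kind="CR0"):
    """Return OLS coefficients and a cluster-robust (sandwich) covariance matrix.

    kind="CR0": plain sandwich estimator.
    kind="CR3": residuals in cluster g are adjusted by (I - H_gg)^{-1},
                an approximate leave-one-cluster-out jackknife (assumed form).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    bread = np.linalg.inv(X.T @ X)          # (X'X)^{-1}
    beta = bread @ X.T @ y                  # OLS point estimates
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        rows = cluster == g
        Xg, eg = X[rows], resid[rows]
        if kind == "CR3":
            Hgg = Xg @ bread @ Xg.T         # cluster block of the hat matrix
            eg = np.linalg.solve(np.eye(Xg.shape[0]) - Hgg, eg)
        score = Xg.T @ eg                   # cluster-level score contribution
        meat += np.outer(score, score)
    return beta, bread @ meat @ bread

# Toy stepped-wedge design (simulated, for illustration only):
# 8 clusters, 5 periods, clusters cross over to treatment at periods 1..4.
rng = np.random.default_rng(0)
n_clusters, n_periods = 8, 5
cluster = np.repeat(np.arange(n_clusters), n_periods)
period = np.tile(np.arange(n_periods), n_clusters)
step = 1 + cluster % 4
treat = (period >= step).astype(float)
u = rng.normal(0.0, 1.0, n_clusters)        # random cluster intercepts
y = 1.0 + 0.5 * treat + 0.1 * period + u[cluster] + rng.normal(0.0, 0.5, cluster.size)
X = np.column_stack([np.ones_like(treat), treat, period])

beta, V_cr0 = cluster_robust_vcov(X, y, cluster, "CR0")
_, V_cr3 = cluster_robust_vcov(X, y, cluster, "CR3")
se_cr0 = np.sqrt(np.diag(V_cr0))
se_cr3 = np.sqrt(np.diag(V_cr3))
```

In an analysis following the abstract's recommendation, the CR3 standard error of the treatment coefficient would be paired with a t reference distribution on the number of clusters minus two degrees of freedom; that step is left out of the sketch.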
TL;DR: Ecological studies often overlook the linearity assumption in generalized linear models (GLMs), with less than 15% of recent studies testing for linearity, highlighting the need for formal methods to address nonlinearity in ecological modeling and conservation decision-making.
Abstract: Generalized linear models (GLMs) are an integral tool in ecology. Like general linear models, GLMs assume linearity, which entails a linear relationship between independent and dependent variables. However, because this assumption acts on the link rather than the natural scale in GLMs, it is more easily overlooked. We reviewed recent ecological literature to quantify how often linearity is tested. We then used two case studies to confront the linearity assumption via two GLMs fit to empirical data. In the first case study we compared GLMs to generalized additive models (GAMs) fit to mammal relative abundance data. In the second case study we tested for linearity in occupancy models using passerine point‐count data. We reviewed 162 studies published in the last 5 years in five leading ecology journals and found less than 15% reported testing for linearity. These studies used transformations and GAMs more often than they reported a linearity test. In the first case study, GAMs strongly out‐performed GLMs as measured by AIC in modeling relative abundance, and GAMs helped uncover nonlinear responses of carnivore species to landscape development. In the second case study, 14% of species‐specific models failed a formal statistical test for linearity. We also found that differences between linear and nonlinear (i.e., those with a transformed independent variable) model predictions were similar for some species but not for others, with implications for inference and conservation decision‐making. Our review suggests that reporting of tests for linearity is rare in recent studies employing GLMs. Our case studies show how formally comparing models that allow for nonlinear relationships between the dependent and independent variables has the potential to impact inference, generate new hypotheses, and alter conservation implications. 
We conclude by suggesting that ecological studies report tests for linearity and use formal methods to address linearity assumption violations in GLMs.
TL;DR: Cerebrovascular reactivity, as revealed by a breath-holding task, increases across development in multiple brain networks.
Abstract: Functional magnetic resonance imaging (fMRI) has been widely used to understand the neurodevelopmental changes that occur in cognition and behavior across childhood. The blood-oxygen-level-dependent (BOLD) signal obtained from fMRI is understood to be comprised of both neuronal and vascular information. However, it is unclear whether the vascular response is altered across age in studies investigating development in children. Since the breath-hold (BH) task is commonly used to understand cerebrovascular reactivity (CVR) in fMRI studies, it can be used to account for developmental differences in vascular response. This study examines how the cerebrovascular response changes over age in a longitudinal children's BH data set from the Nathan Kline Institute (NKI) Rockland Sample (aged 6-18 years old at enrollment). A general linear model approach was applied to derive CVR from BH data. To model both the longitudinal and cross-sectional effects of age on BH response, we used mixed-effects modeling with the following terms: linear, quadratic, logarithmic, and quadratic-logarithmic, to find the best-fitting model. We observed increased BH BOLD signals in multiple networks across age, in which linear and logarithmic mixed-effects models provided the best fit with the lowest Akaike information criterion scores. This shows that the cerebrovascular response increases across development in a brain network-specific manner. Therefore, fMRI studies investigating the developmental period should account for cerebrovascular changes that occur with age.
TL;DR: This study highlights the importance of considering non-linear associations in meta-analysis, particularly when examining environmental pollutant exposure, such as PFAS, and birth outcomes, to avoid oversimplification and promote evidence-based policymaking.
Abstract: Meta-analysis is a powerful analytic method for summarizing effect estimates across studies. However, conventional meta-analysis often assumes a linear exposure-outcome relationship and does not account for variability over the exposure ranges. In this work, we first used simulation techniques to illustrate that the linear-based meta-analytical approach may result in oversimplistic effect estimation based on three plausible non-linear exposure-outcome curves (S-shape, inverted U-shape, and M-shape). We showed that subgroup meta-analysis that stratifies on exposure levels can investigate non-linearity and identify the consistency of effect magnitudes in these simulated examples. Next, we examined the heterogeneity of effect estimates across exposure ranges in two published linear-based meta-analyses of prenatal exposure to per- and polyfluoroalkyl substances (PFAS) on changes in mean birth weight or risk of preterm birth. The re-analysis found some varying effect sizes and potential heterogeneity when restricting to different PFAS exposure ranges, but findings were sensitive to the cut-off choices used to rank the exposure levels. Finally, we discussed methodological challenges and recommendations for detecting and interpreting potential non-linear associations in meta-analysis. Using meta-analysis without accounting for exposure range could contribute to literature inconsistency for exposure-induced health effects and impede evidence-based policymaking. Therefore, investigating result heterogeneity by exposure range is recommended.
TL;DR: Researchers introduce the clustermatch correlation coefficient (CCC), a not-only-linear correlation coefficient that efficiently detects both linear and nonlinear associations in genome-scale data, outperforming standard methods and revealing biologically meaningful patterns.
Abstract: Identifying meaningful patterns in data is crucial for understanding complex biological processes, particularly in transcriptomics, where genes with correlated expression often share functions or contribute to disease mechanisms. Traditional correlation coefficients, which primarily capture linear relationships, may overlook important nonlinear patterns. We introduce the clustermatch correlation coefficient (CCC), a not-only-linear coefficient that utilizes clustering to efficiently detect both linear and nonlinear associations. CCC outperforms standard methods by revealing biologically meaningful patterns that linear-only coefficients miss and is faster than state-of-the-art coefficients such as the maximal information coefficient. When applied to human gene expression data from genotype-tissue expression (GTEx), CCC identified robust linear relationships and nonlinear patterns, such as sex-specific differences, that are undetectable by standard methods. Highly ranked gene pairs were enriched for interactions in integrated networks built from protein-protein interactions, transcription factor regulation, and chemical and genetic perturbations, suggesting that CCC can detect functional relationships missed by linear-only approaches. CCC is a highly efficient, next-generation, not-only-linear correlation coefficient for genome-scale data. A record of this paper's transparent peer review process is included in the supplemental information.
TL;DR: The LR method is valid for estimating population accuracy of non-linear predictions based on conditional means and can detect inadequate models.
Abstract: To address the limitations of commonly used cross-validation methods, the linear regression (LR) method was proposed to estimate the population accuracy of predictions based on the implicit assumption that the fitted model is correct. This method also provides two statistics to determine the adequacy of the fitted model. The validity and behavior of the LR method have been established and studied for linear predictions but not for nonlinear predictions. The objectives of this study were to 1) provide a mathematical proof for the validity of the LR method when predictions are based on conditional means, regardless of whether the predictions are linear or non-linear, 2) investigate the ability of the LR method to detect whether the fitted model is adequate or inadequate, and 3) provide guidelines on how to appropriately partition the data into training and validation sets such that the LR method can identify an inadequate model.
TL;DR: A software package to compute Bayes factors for the linear mixed model is lacking. This paper proposes a method to compute Bayes factors for the linear mixed model by transforming it into a Bayesian linear regression. The method is implemented in the software IDUL.
Abstract: The Bayes factor has advantages over the p-value as a test statistic for association, particularly when comparing multiple alternative models. A software package to compute Bayes factors for the linear mixed model is lacking. We transformed the standard linear mixed model into a Bayesian linear regression, substituting the random effect by fixed effects with eigenvectors as covariates whose prior effect sizes are proportional to their corresponding eigenvalues. Using conjugate normal inverse gamma priors on regression parameters, Bayes factors can be computed in a closed form. We then showed that the transformed Bayesian linear regression produced identical estimates to those of the best linear unbiased prediction (BLUP), providing a new derivation of a known connection between BLUP and Bayesian estimates. Methods described in this note are implemented in the software IDUL as two new functionalities: computing Bayes factors and residuals for the linear mixed model. IDUL and its source code are freely available at https://github.com/haplotype/idul.
TL;DR: Model selection criteria based on Rényi’s pseudodistances are robust against non-identically distributed observations.
Abstract: Model selection criteria are rules used to select the best statistical model among a set of candidate models, striking a trade-off between goodness of fit and model complexity. Most popular model selection criteria measure the goodness of fit through the model log-likelihood function, yielding non-robust criteria. This paper presents a new family of robust model selection criteria for independent but not identically distributed observations (i.n.i.d.o.) based on the Rényi’s pseudodistance (RP). The RP-based model selection criterion is indexed with a tuning parameter α controlling the trade-off between efficiency and robustness. Some theoretical results about the RP criterion are derived and the theory is applied to the multiple linear regression model, obtaining explicit expressions of the model selection criterion. Moreover, restricted models are considered and explicit expressions under the multiple linear regression model with nested models are accordingly derived. Finally, a simulation study empirically illustrates the robustness advantage of the method.
TL;DR: Non-linearity and temporal variability are overlooked components of global population dynamics, revealing dramatic declines or recovery signals that are hidden by linear trends.
Abstract: Aim. Population dynamics are usually assessed through linear trend analysis, quantifying their general direction. However, linear trends may hide substantial variations in population dynamics that could reconcile apparent discrepancies when quantifying the extent of the biodiversity crisis. We seek to determine whether the use of non-linear methods and the quantification of temporal variability can add value to the linear approach by offering a more complete representation of global population changes. In addition, we seek to determine how these components are distributed among biogeographical regions and taxonomic groups. Location. Global. Methods. We analysed 6,437 population time series from 1,257 species from the Living Planet Database over the period 1950-2020. We modeled populations through the use of second order polynomials and classified trajectories according to their direction and acceleration. We modeled and classified these same populations using a more common linear trend analysis. We quantified temporal variability using three metrics, the coefficient of variation, the mean squared error and the consecutive disparity index. We then used chi-squared tests and linear mixed-effects models to test potential sources of heterogeneity in non-linear trajectories and temporal variability. Results. Non-linear models were a better fit for 44.8% of the analyzed time series, and temporal variability was higher among trajectories classified as linear. Linear models missed meaningful information by misclassifying recent declines or recovery signals. Marine populations were highly variable, and all taxonomic groups or IUCN categories exhibited variability in their degree of non-linearity and temporal variability. Main conclusions. Non-linearity and temporal variability reveal usually overlooked dramatic declines or recovery signals in global population dynamics. 
Thus, moving beyond linearity can help reduce the risk of misleading conclusions and better inform conservation decisions. In particular, populations usually classified as "stable" can hide informative non-linear and variable changes that should be integrated into more advanced global biodiversity assessments.
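The comparison of linear and second-order polynomial trends can be sketched as follows. This is an illustrative approximation, not the authors' pipeline: it assumes an ordinary least squares fit with `numpy.polyfit`, a least-squares AIC as the selection rule (the abstract does not state which criterion was used), and simulated data. The helper names `classify_trajectory` and `coefficient_of_variation` are hypothetical.

```python
import numpy as np

def aic_least_squares(y, fitted, n_coef):
    """Gaussian least-squares AIC, up to an additive constant."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + 2 * (n_coef + 1)   # +1 for the error variance

def classify_trajectory(t, y):
    """Fit linear and quadratic trends and report which one AIC prefers."""
    t = np.asarray(t, dtype=float)
    t = t - t.mean()                     # centre time so the slope refers to mid-series
    y = np.asarray(y, dtype=float)
    lin = np.polyfit(t, y, 1)            # [slope, intercept]
    quad = np.polyfit(t, y, 2)           # [curvature, slope, intercept]
    aic_lin = aic_least_squares(y, np.polyval(lin, t), 2)
    aic_quad = aic_least_squares(y, np.polyval(quad, t), 3)
    best = "quadratic" if aic_quad < aic_lin else "linear"
    slope = quad[1] if best == "quadratic" else lin[0]
    return {
        "best": best,
        "direction": float(np.sign(slope)),     # sign of the mid-series slope
        "curvature": float(quad[0]),            # > 0 convex, < 0 concave
    }

def coefficient_of_variation(y):
    """Sample standard deviation divided by the mean."""
    y = np.asarray(y, dtype=float)
    return float(np.std(y, ddof=1) / np.mean(y))

# Simulated U-shaped series (illustration only): decline followed by recovery.
rng = np.random.default_rng(1)
years = np.arange(20)
abundance = 50.0 + (years - 10.0) ** 2 + rng.normal(0.0, 1.0, years.size)
result = classify_trajectory(years, abundance)
cv = coefficient_of_variation(abundance)
```

A straight line fitted to this series would report an almost flat trend, while the quadratic fit recovers the decline-then-recovery shape, which is the kind of signal the abstract says linear trends can hide.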
TL;DR: A comprehensive intermediate-level introduction to the linear model with balanced proofs and R code for replicating simulations and case studies.
Abstract: I developed the lecture notes based on my ``Linear Model'' course at the University of California Berkeley over the past seven years. This book provides an intermediate-level introduction to the linear model. It balances rigorous proofs and heuristic arguments. This book provides R code to replicate all simulation studies and case studies.
Abstract: The scalar-on-function regression is quite useful for modelling mixed data in the context of scalar and functional variables. Under this class of regression, the paper aims at proposing a compelling alternative to model selection methods to address model selection uncertainty. The considered models characterize a scalar response using parametric effect of the scalar predictors and nonparametric effect of a functional predictor, and a model averaging estimation is developed based on Mallows-type criterion to assign weights for averaging. Further, the asymptotic optimality of the resulting estimator, in terms of achieving the smallest possible squared error loss, is established. Besides, simulation studies demonstrate its superiority to or comparability with some information criterion score-based model selection and averaging estimators. The proposed procedure is also applied to a mid-infrared spectra dataset for illustration.
TL;DR: FL-based linear regression model for predicting length of stay using Arduino Nano 33 BLE Sense, leveraging privacy-preserving distributed learning.
Abstract: Federated learning (FL) is a collaborative learning paradigm where multiple clients are used to build the model without sharing data and preserving privacy. An FL-based linear regression model is designed to predict the length of stay for patients at hospitals using the low-power Arduino Nano 33 BLE Sense microcontroller unit (MCU). FL uses a distributed learning technique that allows model building from decentralized data sources. The Arduino Nano 33 BLE Sense is a compact and energy-efficient MCU providing an ideal platform for implementing FL in resource-constrained environments. FL algorithms aggregate model parameters from multiple Arduino clients and collectively train and build a predictive model to estimate the length of stay at the hospital by patients. Experiments were conducted to understand the performance of FL on clients with data of equal and varying sizes and heterogeneous data from multiple sources. The performance of the algorithm is evaluated based on Mean Absolute Error (MAE), Percentage Decrease in Training error (PDTE), and Percentage Difference with Optimal Testing (PDOT) value. Experimental results show that the number of local epochs and FL rounds affects the convergence of clients to the optimal value. The experimental results demonstrate the applicability of FL on low-power MCUs, preserving privacy which is a core requirement for healthcare solutions.
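The aggregation step described above can be illustrated with a small federated-averaging sketch for linear regression. It is a desktop simulation under stated assumptions, not the Arduino implementation from the paper: clients run a few epochs of full-batch gradient descent on mean squared error, the server averages their parameters weighted by local sample size, and the data are synthetic. The function names `local_update` and `federated_average` are hypothetical.

```python
import numpy as np

def local_update(w, X, y, lr, epochs):
    """Run a few epochs of full-batch gradient descent on one client's data."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def federated_average(clients, n_features, rounds=50, local_epochs=5, lr=0.05):
    """Average client parameters each round, weighted by client sample size."""
    w = np.zeros(n_features)
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    for _ in range(rounds):
        local_weights = [local_update(w, X, y, lr, local_epochs) for X, y in clients]
        w = np.average(local_weights, axis=0, weights=sizes)
    return w

# Synthetic clients of unequal size; raw data never leaves a client,
# only the fitted parameter vectors are passed to the server.
rng = np.random.default_rng(2)
true_w = np.array([2.0, -1.0, 0.5])               # intercept and two slopes
clients = []
for n in (30, 60, 90):
    features = rng.normal(size=(n, 2))
    X = np.column_stack([np.ones(n), features])
    y = X @ true_w + rng.normal(0.0, 0.1, n)
    clients.append((X, y))

w_global = federated_average(clients, n_features=3)
```

The number of local epochs and the number of rounds are exposed as parameters because the abstract reports that both affect how closely clients converge to the optimal value.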
TL;DR: This study compares machine learning methods for predicting Arabidopsis thaliana traits, finding that some methods outperform linear models, with neural networks being the most accurate and robust for traits with high heritability.
Abstract: We present a comparison of machine learning methods for the prediction of four quantitative traits in Arabidopsis thaliana. High prediction accuracies were achieved on individuals grown under standardized laboratory conditions from the 1001 Arabidopsis Genomes Project. An existing body of evidence suggests that linear models may be impeded by their inability to make use of non-additive effects to explain phenotypic variation at the population level. The results presented here use a nested cross-validation approach to confirm that some machine learning methods have the ability to statistically outperform linear prediction models, with the optimal model dependent on availability of training data and genetic architecture of the trait in question. Linear models were competitive in their performance as per previous work, though the neural network class of predictors was observed to be the most accurate and robust for traits with high heritability. The extent to which non-linear models exploit interaction effects will require further investigation of the causal pathways that lie behind their predictions. Future work utilizing more traits and larger sample sizes, combined with an improved understanding of their respective genetic architectures, may lead to improvements in prediction accuracy.
TL;DR: A linear model estimates international freight transport modal split based on cost and time saving preferences, revealing significant mode-specific effects and varying elasticities across commodity groups, with cost and time elasticities ranging from 0 to -1.9 and -3.3, respectively.
Abstract: We present a new model for estimating the distribution of international freight transport over transport modes that is directly applicable to aggregated data, allows estimation and prediction even when some modes are infeasible on some routes, and requires little processing time and memory space. It builds on the assumption that demand for transport by a given mode is driven by trading firms’ and consumers’ preferences about saving transport costs and time. In contrast to conventional mode-choice models, it is linear and grounded in consumer demand theory. Applying the model to international freight transport as recorded in the latest upgrade of UN Comtrade reveals an average cost elasticity of transport demand of −0.32 and an average time elasticity of −0.18. In addition, we find significant independent mode-specific effects. Cost and time elasticities are highly dependent on the type of commodity transported. The cost elasticity ranges from zero to −1.9 and the time elasticity from zero to −3.3 across commodity groups defined at the four-digit level of the Harmonized System classification. These findings suggest that policy measures, exogenous shocks or other events that change the relative transport costs and transit times across modes can cause modal shifts—for some commodities more than for others—thereby mitigating the loss in welfare.
TL;DR: A novel multiple linear regression model predicts characteristic frequencies in human tissues using bioelectrical impedance analysis, demonstrating significant correlation (r=0.997, p<0.001) and minimal error (bias=0.32, RMSE=0.38) in detecting cardiovascular, cancer, and dengue fever ailments.
Abstract: This research introduces a novel mathematical methodology for identifying the distinctive frequency of human tissue. The model has been formulated using bioelectrical impedance analysis. The developed model can be utilized to detect a range of ailments, including those associated with the cardiovascular system, cancer, and dengue fever. A total of 3813 data points, including both males and females, were utilized. Data from a sample of both male and female individuals, including their age, height, bioelectrical impedance at frequencies ranging from 5 kHz to 1 MHz (for the Fc model), body mass index, and an impedance index of 2000, were utilized to create mathematical models. To validate the suggested models, data from a total of 1813 individuals (both male and female) were utilized. The statistical analysis of the proposed model (Fc) reveals a significant correlation (Pearson coefficient = 0.997, p < 0.001) between both male and female subjects, with a positive covariance. The model’s 95% limits of agreement, ranging from −1.28 to 1.98 L for both males and females, are sufficiently minimal. All errors fall within this limit. In addition, the suggested model has undergone validation in terms of various types of error analysis, such as bias and root mean square (RMSE). The bias and RMSE values, which are indicators of error, reach a maximum of 0.32 and 0.38 L (for both male and female), respectively. These values are within the predicted range and can be considered minimal.
Mohammed Abdelhadi Sellami, Oussama Seyf Elislam Hamzi, Belal Khaldi
24 Apr 2024
TL;DR: The UMS-Linear model improves the forecasting ability of MLP-based models by capturing intricate data dynamics through timestamps and multi-scale decomposition.
Abstract: Transformer models have achieved state-of-the-art performance in fields such as natural language processing and computer vision. However, they encounter difficulties when applied to time series forecasting due to the permutation invariance of their self-attention mechanism. This has encouraged adapting simpler models, including those based on CNNs, RNNs, and MLPs. While MLP-based models succeed at capturing the moving average and seasonal patterns in data, they struggle with trends and sudden changes. To address this issue, we propose a Univariate Multi-Scale Linear (UMS-Linear) model that leverages timestamps and multi-scale decomposition. UMS-Linear improves the forecasting ability of MLP-based models by capturing the intricate dynamics of data. Empirical results demonstrate that UMS-Linear surpasses existing methods across multiple benchmarks, indicating its potential to advance time series forecasting accuracy.
TL;DR: Confidence sets for a level set in linear regression estimate the set of covariate values for which the regression function exceeds a predefined level.
Abstract: Regression modeling is the workhorse of statistics and there is a vast literature on estimation of the regression function. It has been realized in recent years that in regression analysis the ultimate aim may be the estimation of a level set of the regression function, i.e., the set of covariate values for which the regression function exceeds a predefined level, instead of the estimation of the regression function itself. The published work on estimation of the level set has thus far focused mainly on nonparametric regression, especially on point estimation. In this article, the construction of confidence sets for the level set of linear regression is considered. In particular, $1-\alpha$ level upper, lower and two-sided confidence sets are constructed for the normal-error linear regression. It is shown that these confidence sets can be easily constructed from the corresponding $1-\alpha$ level simultaneous confidence bands. It is also pointed out that the construction method is readily applicable to other parametric regression models where the mean response depends on a linear predictor through a monotonic link function, which include generalized linear models, linear mixed models and generalized linear mixed models. Therefore, the method proposed in this article is widely applicable. Simulation studies with both linear and generalized linear models are conducted to assess the method and real examples are used to illustrate the method.
TL;DR: A novel imputation-free method for linear regression prediction on data sets with missing values, adaptive predictor-set linear model (aps-lm), is proposed. It directly handles missing values without imputation, thereby improving prediction accuracy and reducing bias.
Abstract: Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in health sciences. For parameter estimation, this shortcoming is usually resolved by complete‐case analysis or imputation. Both work‐arounds, however, are inadequate for prediction, since they either fail to predict on incomplete records or ignore missingness‐induced reduction in prediction accuracy and rely on (unrealistic) assumptions about the missing mechanism. Here, we derive the adaptive predictor‐set linear model (aps‐lm), capable of making predictions for incomplete data without the need for imputation. It is derived by using a predictor‐selection operation, the Moore–Penrose pseudoinverse, and the reduced QR decomposition. aps‐lm is an LR generalization that inherently handles missing values. It is applied to a reference data set, where complete predictors and outcome are available, and yields a set of privacy‐preserving parameters. In a second stage, these are shared for making predictions of the outcome on external data sets with missing entries for predictors without imputation. Moreover, aps‐lm computes prediction errors that account for the pattern of missing values even under extreme missingness. We benchmark aps‐lm in a simulation study. aps‐lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof‐of‐principle, we apply aps‐lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data with promising clinical applications.
Santosh M. Rajkumar, Cheng Sheng, Naira Hovakimyan, Debdipta Goswami
18 Sep 2024
TL;DR: This letter presents a Koopman-theoretic lifted LPV system for quadrotor control, analytically deriving observables for SE(3) dynamics, and formulating a linear model-predictive control scheme for tracking problems with potential for real-time implementation.
Abstract: This letter presents a Koopman-theoretic lifted linear parameter-varying (LPV) system with countably infinite dimensions to model the nonlinear dynamics of a quadrotor on SE(3) for facilitating control design. The LPV system evolves in time in the space of the observables, called the lifted space. A primary challenge in utilizing the Koopman-based linearization is identifying a set of observables that can adequately span the lifted space, with the majority of the current methods using data to learn these observables. In this study, we analytically derive the observables for the quadrotor dynamics on SE(3) to formulate the lifted LPV system. The lifted LPV system has a countably infinite dimension which is then truncated for practical control design. The truncation is analytically justified by showing vanishing residual property in a bounded trajectory regime. The LPV system is then approximated as a linear time-invariant (LTI) system with a set of virtual control inputs. The controllability of the lifted LTI system is translatable to the true quadrotor system on SE(3). A linear model-predictive control (LMPC) scheme is formulated and implemented in numerical simulations employing this LTI framework for various tracking problems, with attention given to the potential for real-time implementation.
TL;DR: Researchers employ dimensionality analysis and linear modeling to streamline linear free energy relationships of proteins, improving predictions of protein-water partition coefficients by leveraging multiple parameters and reducing complexity in traditional one-parameter models.
Abstract: Linear free energy relationships (LFERs) are pivotal in predicting protein-water partition coefficients, with traditional one-parameter (
Josafhat Salinas‐Ruíz, Osval Antonio Montesinos López, Gabriela Hernández Ramírez, Jose Crossa
10 Jun 2024
Abstract: Throughout this book, we have been using the acronym GLMMs to denote generalized linear mixed models. The common denominator among all these models is that they all contain a linear model (LM) part, which refers to the fixed effects component of the linear predictor Xβ. In a GLMM, the prefix "G" indicates that the distribution of observations may not be normal, and the first "M" indicates that the linear predictor includes mixed effects and thus contains random effects, which are expressed by the term "Zb." The fixed linear component of the predictor Xβ is important because the fixed effects describe the treatment design, which, in turn, is determined by the objectives or the initial research questions that the study wishes to answer. Therefore, if the researcher proposes using a reasonable model to analyze an experiment, then he/she must be able to express each objective as a question about a model parameter or as a linear combination of model parameters.
TL;DR: This study develops a linear regression model to predict the climate policy support index in the 2020 American National Election Study (ANES), utilizing a three-item index to gauge public support for climate policies.
Abstract: Regression model for climate policy support index in ANES 2020 (three-item index; linear regression).
TL;DR: Approximate confidence intervals for a population proportion, based on a linear model with non-normal errors, cover the true proportion with probability close to the specified nominal level.
Abstract: When the linear model errors are non-normal, one might be interested in making inference concerning a proportion. The goal of this article is to construct approximate confidence intervals for the proportion, based on the assumed linear model, whose coverage of the true proportion is close to the specified nominal level.
TL;DR: This paper proposes a functional partial linear regression model with quadratic regression for multivariate predictors, combining functional linear models with least squares estimation, and theoretically analyzes its asymptotic properties and convergence rates.
Abstract: In recent decades, functional semi-parametric models, such as the functional additive model, the functional partial linear model, and the functional index model, have been widely studied. However, some of these models suffer from a slow convergence rate for the non-parametric part (especially in non-parametric functional regression), and the assumption imposed on the linear part is too strict in practice. In this article, we propose a functional partial linear model for a scalar response in which the predictors include both a functional predictor and a multivariate predictor. We use a functional linear model to avoid the slow convergence rate of non-parametric functional estimation, and for the multivariate predictor we use a quadratic regression model to relax the strict linearity assumption. To estimate this model, we first expand the functional predictor and the functional regression parameter on the functional principal component basis and then estimate the functional coefficient by the least squares method. On the theoretical side, we study the asymptotic normality of the multivariate regression parameter and the convergence rate of the functional regression parameter for dense functional observations. Furthermore, we illustrate the performance of the proposed method through simulation studies and one real data analysis.
Abstract: Estimating response time (RT) to questions in exam or homework practices is becoming more important in online learning platforms to improve the learning experience and item allocation. To predict RT, we applied the Empirical Best Linear Unbiased Predictor (EBLUP) technique from the linear mixed-effects model to predict RT to each question. The proposed method provides more precise RT predictions in terms of mean absolute errors, correlation coefficient, and close-to-1 ratio counts compared with feature-based machine learning regression models and 95 quantile predictions.
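The shrinkage behind an EBLUP can be illustrated with the simplest mixed model, a random intercept per question. This is a sketch under stated assumptions and not the model from the abstract: response times follow y_ij = μ + u_j + e_ij, the variance components `var_u` and `var_e` are treated as already estimated (for example by REML in a mixed-model package), and the toy response times are invented for illustration. The function name `eblup_random_intercept` is hypothetical.

```python
import numpy as np

def eblup_random_intercept(groups, values, var_u, var_e):
    """Predict group-level means in a random-intercept model.

    var_u: (estimated) variance of the random intercepts.
    var_e: (estimated) residual variance.
    Returns group labels, the GLS estimate of the overall mean,
    the shrinkage factors, and the shrunken group predictions.
    """
    groups = np.asarray(groups)
    values = np.asarray(values, dtype=float)
    labels, inverse = np.unique(groups, return_inverse=True)
    n = np.bincount(inverse)                            # observations per group
    means = np.bincount(inverse, weights=values) / n    # raw group means
    precision = 1.0 / (var_u + var_e / n)               # inverse variance of each group mean
    mu = float(np.sum(precision * means) / np.sum(precision))
    shrink = var_u / (var_u + var_e / n)                # closer to 1 for larger groups
    predictions = mu + shrink * (means - mu)            # mean plus predicted random effect
    return labels, mu, shrink, predictions

# Invented response times (seconds) for two questions, illustration only.
question = np.array(["q1", "q1", "q2", "q2"])
rt = np.array([10.0, 12.0, 20.0, 22.0])
labels, mu, shrink, predicted_rt = eblup_random_intercept(question, rt, var_u=1.0, var_e=1.0)
```

Each prediction is pulled from the raw question mean toward the overall mean, and the pull is stronger for questions with few observations, which is what lets this kind of predictor stay stable for sparsely answered items.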