TL;DR: A rich variety of diagnostic tests for these situations have been developed in the econometrics community, a collection of which has been implemented in the packages lmtest and strucchange covering the problems mentioned above.
Abstract: The linear regression model is still one of the most popular tools for data analysis despite (or due to) its simple structure. Although it is appropriate in many situations, there are many pitfalls that might affect the quality of conclusions drawn from fitted models or might even lead to uninterpretable results. Some of these pitfalls that are considered especially important in applied econometrics are heteroskedasticity or serial correlation of the error terms, structural changes in the regression coefficients, nonlinearities, functional misspecification or omitted variables. Therefore, a rich variety of diagnostic tests for these situations have been developed in the econometrics community, a collection of which has been implemented in the packages lmtest and strucchange covering the problems mentioned above. These diagnostic tests are not only useful in econometrics but also in many other fields where linear regression is used, which we will demonstrate with an application from biostatistics. As Breiman (2001) argues, it is important to assess the goodness-of-fit of data models, in particular not only using omnibus tests but also tests designed for a certain direction of the alternative. These diagnostic checks do not have to be seen as pure significance procedures but also as an explorative tool to extract information about the structure of the data, especially in connection with residual plots or other diagnostic plots. As Brown, Durbin, and Evans (1975) argue for the recursive CUSUM test, these procedures can “be regarded as yardsticks for the interpretation of data rather than leading to hard and fast decisions.” Moreover, we will always be able to reject the null hypothesis provided we have enough data at hand. The question is not whether the model is wrong (it always is!) but whether the irregularities are serious.
The package strucchange implements a variety of procedures related to structural change of the regression coefficients and was already introduced in R News by Zeileis (2001) and described in more detail in Zeileis, Leisch, Hornik, and Kleiber (2002). Therefore, we will focus on the package lmtest in the following. Most of the tests and the datasets contained in the package are taken from the book of Krämer and Sonnberger (1986), which originally inspired us to write the package. Compared to the book, we implemented later versions of some tests and modern flexible interfaces for the procedures. Most of the tests are based on the OLS residuals of a linear model, which is specified by a formula argument. Instead of a formula, a fitted model of class "lm" can also be supplied, which should work if the data are either contained in the object or still present in the workspace; however, this is not encouraged. The full references for the tests can be found on the help pages of the respective functions. We present applications of the tests contained in lmtest to two different data sets: the first is a macroeconomic time series from the U.S. analysed by Stock and Watson (1996) and the second is data from a study on measurements of fetal mandible length discussed by Royston and Altman (1994).
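As an illustration of a residual-based diagnostic for serial correlation, the following minimal Python sketch computes the Durbin-Watson statistic from OLS residuals. It is not the lmtest interface (lmtest is an R package); the function names and the simulated data are assumptions made for this example only.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of an OLS fit of y on X (X must already contain an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def durbin_watson(resid):
    """Durbin-Watson statistic: close to 2 for uncorrelated residuals,
    well below 2 for positive first-order autocorrelation."""
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# Simulated data (hypothetical): same regressor, two different error processes.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Case 1: independent errors.
e_iid = rng.normal(size=n)
dw_iid = durbin_watson(ols_residuals(X, 1.0 + 2.0 * x + e_iid))

# Case 2: AR(1) errors with autoregressive coefficient 0.8.
shocks = rng.normal(size=n)
e_ar = np.zeros(n)
for t in range(1, n):
    e_ar[t] = 0.8 * e_ar[t - 1] + shocks[t]
dw_ar = durbin_watson(ols_residuals(X, 1.0 + 2.0 * x + e_ar))

print(f"DW with independent errors: {dw_iid:.2f}")
print(f"DW with AR(1) errors:       {dw_ar:.2f}")
```

In the spirit of the abstract, the statistic is best read as a yardstick alongside a residual plot rather than as a hard decision rule.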
TL;DR: BOLT-LMM is presented, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes.
Abstract: Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN^2) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.
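To make the O(MN)-per-iteration idea concrete, here is a hedged Python sketch of one generic way to solve a mixed-model linear system without ever forming the N x N covariance matrix: conjugate gradient iterations that only use products with the genotype matrix. This is an illustration of the cost argument, not the BOLT-LMM implementation; the variance components, matrix sizes and function names are assumptions.

```python
import numpy as np

def apply_V(X, v, sigma_g2, sigma_e2):
    """Compute (sigma_g2 * X X^T / M + sigma_e2 * I) v in O(MN) time,
    without building the N x N matrix."""
    M = X.shape[1]
    return sigma_g2 * (X @ (X.T @ v)) / M + sigma_e2 * v

def conjugate_gradient(X, y, sigma_g2, sigma_e2, tol=1e-10, max_iter=500):
    """Solve V v = y by conjugate gradient; each iteration costs one apply_V call."""
    v = np.zeros_like(y)
    r = y - apply_V(X, v, sigma_g2, sigma_e2)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Vp = apply_V(X, p, sigma_g2, sigma_e2)
        alpha = rs / (p @ Vp)
        v += alpha * p
        r -= alpha * Vp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

# Hypothetical toy sizes: N samples, M markers (standardized random "genotypes").
rng = np.random.default_rng(0)
N, M = 200, 1000
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

v_cg = conjugate_gradient(X, y, sigma_g2=0.5, sigma_e2=0.5)

# Reference solution that forms the full N x N matrix (the expensive route).
V = 0.5 * (X @ X.T) / M + 0.5 * np.eye(N)
v_direct = np.linalg.solve(V, y)
print("max abs difference:", np.max(np.abs(v_cg - v_direct)))
```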
TL;DR: In this article, the authors introduce the knockoff filter, a new variable selection procedure for controlling the false discovery rate (FDR) in the statistical linear model whenever there are at least as many observations as variables.
Abstract: In many fields of science, we observe a response variable together with a large number of potential explanatory variables, and would like to be able to discover which variables are truly associated with the response. At the same time, we need to know that the false discovery rate (FDR)—the expected fraction of false discoveries among all discoveries—is not too high, in order to assure the scientist that most of the discoveries are indeed true and replicable. This paper introduces the knockoff filter, a new variable selection procedure controlling the FDR in the statistical linear model whenever there are at least as many observations as variables. This method achieves exact FDR control in finite sample settings no matter the design or covariates, the number of variables in the model, or the amplitudes of the unknown regression coefficients, and does not require any knowledge of the noise level. As the name suggests, the method operates by manufacturing knockoff variables that are cheap—their construction does not require any new data—and are designed to mimic the correlation structure found within the existing variables, in a way that allows for accurate FDR control, beyond what is possible with permutation-based methods. The method of knockoffs is very general and flexible, and can work with a broad class of test statistics. We test the method in combination with statistics from the Lasso for sparse regression, and obtain empirical results showing that the resulting method has far more power than existing selection rules when the proportion of null variables is high.
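The selection step of the knockoff filter can be sketched in a few lines of Python. Given one statistic W_j per variable (large positive values suggest a true signal, and the signs of null statistics are symmetric), the data-dependent threshold is the smallest t whose estimated false discovery proportion is at most q. The statistics below are made-up numbers for illustration, and constructing the knockoff variables themselves is not shown.

```python
def knockoff_threshold(W, q, offset=1):
    """Smallest t among the nonzero |W_j| such that
    (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    offset=1 gives the more conservative 'knockoff+' variant; offset=0 the basic one."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        neg = sum(1 for w in W if w <= -t)
        pos = sum(1 for w in W if w >= t)
        if (offset + neg) / max(1, pos) <= q:
            return t
    return float("inf")  # no threshold qualifies: select nothing

def knockoff_select(W, q, offset=1):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= t]

# Hypothetical statistics for 10 variables.
W = [5.0, 4.2, -0.3, 3.1, 0.4, -0.5, 2.8, 0.2, 3.6, 2.9]
print(knockoff_threshold(W, q=0.25))
print(knockoff_select(W, q=0.25))
```

With a stricter target such as q = 0.1, no threshold qualifies for these ten statistics and the selection is empty, which shows how the offset makes the procedure conservative when few variables are available.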
TL;DR: A novel hybrid model for predicting stock returns is proposed; it combines two linear models with a non-linear model and outperforms the recurrent neural network.
Abstract: A novel hybrid model is proposed for prediction of stock returns. The proposed model is a hybrid of two linear models and a non-linear model. An optimization model is introduced which generates weights for the proposed model. The proposed model is able to capture non-linear patterns of stock data very well. In this paper, we propose a robust and novel hybrid model for prediction of stock returns. The proposed model is constituted of two linear models, an autoregressive moving average model and an exponential smoothing model, and a non-linear model, a recurrent neural network. Training data for the recurrent neural network is generated by a new regression model. The recurrent neural network produces satisfactory predictions as compared to the linear models. With the goal of further improving the accuracy of predictions, the proposed hybrid prediction model merges the predictions obtained from these three prediction-based models. An optimization model is introduced which generates optimal weights for the proposed model; the model is solved using genetic algorithms. The results confirm the accuracy of the prediction performance of the recurrent neural network. As expected, an outstanding prediction performance has been obtained from the proposed hybrid prediction model, as it outperforms the recurrent neural network. The proposed model is expected to be a promising approach in the field of prediction-based models where data is non-linear and its patterns are difficult to capture with traditional models.
TL;DR: The first step (choose candidate models) is emphasized by providing an extensive library of nonlinear functions (77 equations with the associated parameter meanings) and examples of typical applications in agriculture to clarify some of the difficulties and confusion with the task of using nonlinear models.
Abstract: Nonlinear regression models are important tools because many crop and soil processes are better represented by nonlinear than linear models. Fitting nonlinear models is not a single-step procedure but an involved process that requires careful examination of each individual step. Depending on the objective and the application domain, different priorities are set when fitting nonlinear models; these include obtaining acceptable parameter estimates and a good model fit while meeting standard assumptions of statistical models. We propose steps in fitting nonlinear models as described by a flow diagram and discuss each step separately providing examples and updates on procedures used. The following steps are considered: (i) choose candidate models, (ii) set starting values, (iii) fit models, (iv) check convergence and parameter estimates, (v) find the "best" model among competing models, (vi) check model assumptions (residual analysis), and (vii) calculate statistical descriptors and confidence intervals. The associated feedback mechanisms are also addressed (i.e., model variance homogeneity). In particular, we emphasize the first step (choose candidate models) by providing an extensive library of nonlinear functions (77 equations with the associated parameter meanings) and examples of typical applications in agriculture. We hope that this contribution will clarify some of the difficulties and confusion with the task of using nonlinear models.
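Steps (ii) to (iv) of the workflow above (set starting values, fit, check convergence and parameter estimates) can be sketched in Python for a single candidate model. The exponential decay model, the log-linear starting values, the plain Gauss-Newton iteration and the simulated data are all assumptions chosen for brevity; they are not taken from the article's library of 77 functions.

```python
import numpy as np

def model(x, a, b):
    """Candidate model: exponential decay y = a * exp(-b * x)."""
    return a * np.exp(-b * x)

def starting_values(x, y):
    """Step (ii): linearize as log(y) = log(a) - b*x and fit a straight line (needs y > 0)."""
    slope, intercept = np.polyfit(x, np.log(y), 1)
    return np.exp(intercept), -slope

def gauss_newton(x, y, a, b, tol=1e-8, max_iter=50):
    """Step (iii): Gauss-Newton iterations; returns estimates, a convergence flag
    and the number of iterations used, so that step (iv) can inspect them."""
    for it in range(max_iter):
        r = y - model(x, a, b)
        # Jacobian of the model with respect to (a, b).
        J = np.column_stack([np.exp(-b * x), -a * x * np.exp(-b * x)])
        step, *_ = np.linalg.lstsq(J, r, rcond=None)
        a, b = a + step[0], b + step[1]
        if np.linalg.norm(step) < tol * (1.0 + np.hypot(a, b)):
            return a, b, True, it + 1
    return a, b, False, max_iter

# Simulated data (hypothetical): true a = 10, b = 0.7, small additive noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = model(x, 10.0, 0.7) + rng.normal(scale=0.02, size=x.size)

a0, b0 = starting_values(x, y)
a_hat, b_hat, converged, n_iter = gauss_newton(x, y, a0, b0)
print(f"start: a={a0:.3f}, b={b0:.3f}")
print(f"fit:   a={a_hat:.3f}, b={b_hat:.3f}, converged={converged}, iterations={n_iter}")
```

Poor starting values are a common reason for a failed fit, which is why the convergence flag is returned rather than assumed.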
TL;DR: An A-to-Z exposition of the power prior and its applications to date is given, including its theoretical properties, variations in its formulation, statistical contexts for which it has been used, applications, and its advantages over other informative priors.
Abstract: The power prior has been widely used in many applications covering a large number of disciplines. The power prior is intended to be an informative prior constructed from historical data. It has been used in clinical trials, genetics, health care, psychology, environmental health, engineering, economics, and business. It has also been applied for a wide variety of models and settings, both in the experimental design and analysis contexts. In this review article, we give an A-to-Z exposition of the power prior and its applications to date. We review its theoretical properties, variations in its formulation, statistical contexts for which it has been used, applications, and its advantages over other informative priors. We review models for which it has been used, including generalized linear models, survival models, and random effects models. Statistical areas where the power prior has been used include model selection, experimental design, hierarchical modeling, and conjugate priors. Frequentist properties of power priors in posterior inference are established, and a simulation study is conducted to further examine the empirical performance of the posterior estimates with power priors. Real data analyses are given illustrating the power prior as well as the use of the power prior in the Bayesian design of clinical trials.
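A small worked case may help fix the idea: the power prior raises the historical-data likelihood to a power a0 in [0, 1] before combining it with an initial prior. For binomial data with a Beta(a, b) initial prior, the posterior stays in the Beta family. The Python sketch below assumes this conjugate setting with a fixed a0, and the counts are invented for illustration.

```python
def power_prior_posterior(y, n, y0, n0, a0, a=1.0, b=1.0):
    """Beta posterior parameters for a binomial proportion.
    Current data: y successes in n trials. Historical data: y0 successes in n0 trials.
    The historical likelihood is discounted by the fixed power a0 in [0, 1]."""
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie in [0, 1]")
    alpha = a + a0 * y0 + y
    beta = b + a0 * (n0 - y0) + (n - y)
    return alpha, beta

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Hypothetical counts: the historical study shows a higher response rate.
for a0 in (0.0, 0.5, 1.0):
    alpha, beta = power_prior_posterior(y=12, n=40, y0=30, n0=50, a0=a0)
    print(f"a0={a0}: Beta({alpha}, {beta}), posterior mean {beta_mean(alpha, beta):.3f}")
```

Setting a0 = 0 discards the historical data and a0 = 1 pools it fully, so a0 acts as a dial for how much historical information is borrowed.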
TL;DR: The results indicate that caution is needed when using marker-based regressions for inferences about population parameters such as the genomic heritability, and develops theory that leads to a precise definition of parameters arising in high dimensional genomic regressions.
Abstract: Whole-genome regression methods are being increasingly used for the analysis and prediction of complex traits and diseases. In human genetics, these methods are commonly used for inferences about genetic parameters, such as the amount of genetic variance among individuals or the proportion of phenotypic variance that can be explained by regression on molecular markers. This is so even though some of the assumptions commonly adopted for data analysis are at odds with important quantitative genetic concepts. In this article we develop theory that leads to a precise definition of parameters arising in high dimensional genomic regressions; we focus on the so-called genomic heritability: the proportion of variance of a trait that can be explained (in the population) by a linear regression on a set of markers. We propose a definition of this parameter that is framed within the classical quantitative genetics theory and show that the genomic heritability and the trait heritability parameters are equal only when all causal variants are typed. Further, we discuss how the genomic variance and genomic heritability, defined as quantitative genetic parameters, relate to parameters of statistical models commonly used for inferences, and indicate potential inferential problems that are assessed further using simulations. When a large proportion of the markers used in the analysis are in LE with QTL the likelihood function can be misspecified. This can induce a sizable finite-sample bias and, possibly, lack of consistency of likelihood (or Bayesian) estimates. This situation can be encountered if the individuals in the sample are distantly related and linkage disequilibrium spans over short regions. This bias does not negate the use of whole-genome regression models as predictive machines; however, our results indicate that caution is needed when using marker-based regressions for inferences about population parameters such as the genomic heritability.
TL;DR: In this article, the authors provide theoretical results regarding local optima of regularized M-estimators, allowing for nonconvexity in both loss and penalty functions under restricted strong convexity on the loss and suitable regularity conditions on the penalty.
Abstract: We provide novel theoretical results regarding local optima of regularized M-estimators, allowing for nonconvexity in both loss and penalty functions. Under restricted strong convexity on the loss and suitable regularity conditions on the penalty, we prove that any stationary point of the composite objective function will lie within statistical precision of the underlying parameter vector. Our theory covers many nonconvex objective functions of interest, including the corrected Lasso for errors-in-variables linear models; regression for generalized linear models with nonconvex penalties such as SCAD, MCP, and capped-ℓ1; and high-dimensional graphical model estimation. We quantify statistical accuracy by providing bounds on the ℓ1-, ℓ2-, and prediction error between stationary points and the population-level optimum. We also propose a simple modification of composite gradient descent that may be used to obtain a near-global optimum within statistical precision ε_stat in log(1/ε_stat) steps, which is the fastest possible rate of any first-order method. We provide simulation studies illustrating the sharpness of our theoretical results.
TL;DR: The main approaches to building cross-covariance models are reviewed, including the linear model of coregionalization, convolution methods, the multivariate Matérn and nonstationary and space-time extensions of these among others, and specialized constructions, including those designed for asymmetry, compact support and spherical domains, are covered.
Abstract: Continuously indexed datasets with multiple variables have become ubiquitous in the geophysical, ecological, environmental and climate sciences, and pose substantial analysis challenges to scientists and statisticians. For many years, scientists developed models that aimed at capturing the spatial behavior for an individual process; only within the last few decades has it become commonplace to model multiple processes jointly. The key difficulty is in specifying the cross-covariance function, that is, the function responsible for the relationship between distinct variables. Indeed, these cross-covariance functions must be chosen to be consistent with marginal covariance functions in such a way that the second-order structure always yields a nonnegative definite covariance matrix. We review the main approaches to building cross-covariance models, including the linear model of coregionalization, convolution methods, the multivariate Matern and nonstationary and space-time extensions of these among others. We additionally cover specialized constructions, including those designed for asymmetry, compact support and spherical domains, with a review of physics-constrained models. We illustrate select models on a bivariate regional climate model output example for temperature and pressure, along with a bivariate minimum and maximum temperature observational dataset; we compare models by likelihood value as well as via cross-validation co-kriging studies. The article closes with a discussion of unsolved problems.
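The nonnegative-definiteness requirement can be illustrated with the linear model of coregionalization, in which each observed variable is a linear combination of independent latent processes. The Python sketch below builds the joint covariance matrix of a bivariate process on a one-dimensional transect and checks its eigenvalues. The loading matrix, the exponential correlation functions and their ranges are assumptions chosen for the example.

```python
import numpy as np

def exp_corr(dist, range_):
    """Exponential correlation function (a Matérn with smoothness 1/2)."""
    return np.exp(-dist / range_)

def lmc_covariance(coords, A, ranges):
    """Joint covariance of p variables at n sites under a linear model of
    coregionalization. Column k of the p x r matrix A holds the loadings of
    latent process k, whose correlation range is ranges[k].
    The result is (n*p) x (n*p), ordered site by site."""
    dist = np.abs(coords[:, None] - coords[None, :])
    n, p = len(coords), A.shape[0]
    Sigma = np.zeros((n * p, n * p))
    for k, range_k in enumerate(ranges):
        T_k = np.outer(A[:, k], A[:, k])          # rank-one coregionalization matrix
        Sigma += np.kron(exp_corr(dist, range_k), T_k)
    return Sigma

# Hypothetical setup: 25 sites on a line, two variables, two latent processes.
coords = np.linspace(0.0, 10.0, 25)
A = np.array([[1.0, 0.0],
              [0.6, 0.8]])
Sigma = lmc_covariance(coords, A, ranges=[1.0, 3.0])

print("smallest eigenvalue:", np.linalg.eigvalsh(Sigma).min())
print("co-located cross-covariance:", Sigma[0, 1])  # equals (A A^T)[0, 1]
```

Each term is a Kronecker product of two nonnegative definite matrices, so validity holds by construction. That convenience is the main appeal of this family, at the price of limited flexibility in the cross-covariances.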
TL;DR: The various variable importance metrics for the linear model, particularly emphasizing variance decomposition metrics, are reviewed, with a focus on linear parametric models.
Abstract: Regression analysis is one of the most-used statistical methods. Often part of the research question is the identification of the most important regressors or an importance ranking of the regressors. Most regression models are not specifically suited for answering the variable importance question, so that many different proposals have been made. This article reviews in detail the various variable importance metrics for the linear model, particularly emphasizing variance decomposition metrics. All linear model metrics are illustrated by an example analysis. For nonlinear parametric models, several principles from linear models have been adapted, and machine-learning methods have their own set of variable importance methods. These are also briefly covered. Although there are many variable importance metrics, there is still no convincing theoretical basis for them, and they all have a heuristic touch. Nevertheless, some metrics are considered useful for a crude assessment in the absence of a good subject matter theory. WIREs Comput Stat 2015, 7:137-152. doi: 10.1002/wics.1346
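One widely used variance decomposition metric averages each regressor's sequential contribution to R² over all orderings of the regressors (often called LMG). The Python sketch below is a brute-force version that is only practical for a handful of regressors; the simulated data and function names are assumptions for illustration.

```python
import numpy as np
from itertools import permutations

def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on an intercept plus the selected columns of X."""
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def lmg(X, y):
    """Average the increase in R^2 when each regressor enters, over all orderings."""
    p = X.shape[1]
    shares = np.zeros(p)
    orders = list(permutations(range(p)))
    for order in orders:
        included, r2_prev = [], 0.0
        for j in order:
            included.append(j)
            r2_new = r_squared(X, y, included)
            shares[j] += r2_new - r2_prev
            r2_prev = r2_new
    return shares / len(orders)

# Simulated data (hypothetical): x0 and x1 are correlated, x2 is pure noise.
rng = np.random.default_rng(0)
n = 400
x0 = rng.normal(size=n)
x1 = 0.5 * x0 + np.sqrt(0.75) * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + 1.0 * x1 + rng.normal(size=n)

shares = lmg(X, y)
full_r2 = r_squared(X, y, [0, 1, 2])
print("LMG shares:", np.round(shares, 3))
print("sum of shares:", round(float(shares.sum()), 3), "full R^2:", round(float(full_r2), 3))
```

The shares are nonnegative and add up to the full-model R², which is the property that makes this kind of metric a decomposition.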
TL;DR: The authors have concluded that UTAUT2 model obtains a better explanation power than the rest of technology acceptance models (TAMs) in the sample of mobile internet users.
Abstract: Purpose – The purpose of this paper is to provide a complete and chronological view of the evolution of the main acceptance and use of technology models, from the 1970s to the present day. Design/methodology/approach – A comparison of partial least squares (linear model) and WarpPLS (non-linear model) has been run for each technology acceptance model: TRA, TAM0, TAM1, TAM2, TAM3, UTAUT, UTAUT2. The data set collects information on mobile internet users. Findings – The authors have concluded that the UTAUT2 model has greater explanatory power than the other technology acceptance models (TAMs) in the sample of mobile internet users. Furthermore, all models have greater explanatory power using non-linear relationships than with the traditional linear approach. Originality/value – The vast majority of research published to date with regard to the Theory of Reasoned Action (TRA), the Technology Acceptance Model (TAM), and the Unified Theory of Acceptance and Use of Technology (UTAUT) is based on struct...
TL;DR: An adaptive and nonlinear prognostic model is presented to estimate RUL using a system's history of the observed data to date and an application for estimating the RUL of lithium-ion batteries is used to demonstrate the superiority of the method.
Abstract: Remaining useful life (RUL) estimation via degradation modeling is considered as one of the most central components in prognostics and health management. Current RUL estimation studies mainly focus on linear stochastic models, and the results under nonlinear models are relatively limited in literature. Even in nonlinear degradation modeling, the estimated RUL is aimed at a population of systems of the same type or depend only on the current degradation observation. In this paper, an adaptive and nonlinear prognostic model is presented to estimate RUL using a system's history of the observed data to date. Specifically, a general nonlinear stochastic process with a time-dependent drift coefficient is first adopted to characterize the dynamics and nonlinearity of the degradation process. In order to render the RUL estimation depending on the degradation history to date, a state-space model is constructed, and Kalman filtering is applied to update one key parameter in the drifting function through treating this parameter as an unobserved state variable. To update the hidden state and other parameters in the state-space model simultaneously and recursively, the expectation maximization algorithm is used in conjunction with Kalman smoother to achieve this aim. The probability density function of the estimated RUL is derived with an explicit form, and some commonly used results under linear models turn out to be its special cases. Finally, the implementation of the presented approach is illustrated by numerical simulations, and an application for estimating the RUL of lithium-ion batteries is used to demonstrate the superiority of the method.
TL;DR: This article develops the IV approach for regression analysis in a survival context, primarily under an additive hazards model, for which it is established that analogous strategies can also be used under a proportional hazards model specification, provided the outcome is rare over the entire follow-up.
Abstract: Bias due to unobserved confounding can seldom be ruled out with certainty when estimating the causal effect of a nonrandomized treatment. The instrumental variable (IV) design offers, under certain assumptions, the opportunity to tame confounding bias, without directly observing all confounders. The IV approach is very well developed in the context of linear regression and also for certain generalized linear models with a nonlinear link function. However, IV methods are not as well developed for regression analysis with a censored survival outcome. In this article, we develop the IV approach for regression analysis in a survival context, primarily under an additive hazards model, for which we describe 2 simple methods for estimating causal effects. The first method is a straightforward 2-stage regression approach analogous to 2-stage least squares commonly used for IV analysis in linear regression. In this approach, the fitted value from a first-stage regression of the exposure on the IV is entered in place of the exposure in the second-stage hazard model to recover a valid estimate of the treatment effect of interest. The second method is a so-called control function approach, which entails adding to the additive hazards outcome model, the residual from a first-stage regression of the exposure on the IV. Formal conditions are given justifying each strategy, and the methods are illustrated in a novel application to a Mendelian randomization study to evaluate the effect of diabetes on mortality using data from the Health and Retirement Study. We also establish that analogous strategies can also be used under a proportional hazards model specification, provided the outcome is rare over the entire follow-up.
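The two strategies are easiest to see in the linear-regression setting that the abstract uses as its point of comparison. The Python sketch below simulates a confounded exposure and contrasts a naive regression with two-stage least squares and with the control function approach. It does not implement the additive hazards model of the article; the data-generating values are assumptions.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (hypothetical): z is the instrument, u an unobserved confounder.
rng = np.random.default_rng(1)
n = 20000
z = rng.normal(size=n)
u = rng.normal(size=n)
a = 0.8 * z + u + rng.normal(size=n)        # exposure
y = 1.5 * a + 2.0 * u + rng.normal(size=n)  # outcome, true effect of a is 1.5
ones = np.ones(n)

# Naive regression of the outcome on the exposure (confounded).
naive = ols(np.column_stack([ones, a]), y)[1]

# Stage 1: regress the exposure on the instrument and keep the fitted values.
Z = np.column_stack([ones, z])
a_hat = Z @ ols(Z, a)

# Stage 2: replace the exposure by its fitted value in the outcome model.
tsls = ols(np.column_stack([ones, a_hat]), y)[1]

# Control function: keep the exposure and add the first-stage residual.
resid = a - a_hat
cf = ols(np.column_stack([ones, a, resid]), y)[1]

print(f"naive: {naive:.3f}, two-stage: {tsls:.3f}, control function: {cf:.3f}")
```

In this linear setting the two-stage and control function estimates coincide numerically. The article gives formal conditions under which each strategy is justified in the survival context.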
TL;DR: In this article, the author compared generalized linear models (GLMs) and generalized linear mixed models (GLMMs) with the traditional approach of applying least-squares linear models (LMs) to transformed data and found that, after correcting for type I errors, the GLMs provided slightly better statistical power than LMs.
Abstract: Summary
The rise in the use of statistical models for non-Gaussian data, such as generalized linear models (GLMs) and generalized linear mixed models (GLMMs), is pushing aside the traditional approach of transforming data and applying least-squares linear models (LMs). Nonetheless, many least-squares statistical tests depend on the variance of the sum of residuals, which by the Central Limit Theorem converge to a Gaussian distribution for large sample sizes. Therefore, least-squares LMs will likely have good performance in assessing the statistical significance of regression coefficients.
Using simulations of count data, I compared GLM approaches for testing whether regression coefficients differ from zero with the traditional approach of applying LMs to transformed data. Simulations assumed that variation among sample populations was either (i) negative binomial or (ii) log-normal Poisson (i.e. log-normal variation among populations that were then sampled by a Poisson distribution). I used the simulated data to conduct tests of the hypotheses that regression coefficients differed from zero; I did not investigate statistical properties of the coefficient estimators, such as bias and precision.
For negative binomial simulations whose assumptions closely matched the GLMs, the GLMs were nonetheless prone to type I errors (false positives) especially when there was more than one predictor (independent) variable. After correcting for type I errors, however, the GLMs provided slightly better statistical power than LMs. For log-normal-Poisson simulations, both a GLMM and the LMs performed well, but under some simulated conditions the GLMs had high type I error rates, a deadly sin for statistical tests.
These results show that, while GLMs have slight advantages in power when they are properly specified, they can lead to badly wrong conclusions about the significance of regression coefficients if they are mis-specified. In contrast, transforming data and applying least-squares linear analyses provide robust statistical tests for significance over a wide range of conditions. Thus, the traditional approach of transforming data and applying LMs is still useful.
TL;DR: An overview of many regression algorithms is provided, and it is demonstrated that the function representations whose parameters they regress fall into two classes: a weighted sum of basis functions, or a mixture of linear models.
TL;DR: In this paper, a posterior variable selection summary is proposed, which distills a full posterior distribution over regression coefficients into a sequence of sparse linear predictors, and then selects a subset of variables for linear models.
Abstract: Selecting a subset of variables for linear models remains an active area of research. This article reviews many of the recent contributions to the Bayesian model selection and shrinkage prior literature. A posterior variable selection summary is proposed, which distills a full posterior distribution over regression coefficients into a sequence of sparse linear predictors.
TL;DR: In this paper, the geometric median of a collection of independent "weakly concentrated" estimators satisfies a much stronger deviation bound than each individual element in the collection, which is illustrated through several examples, including sparse linear regression and low-rank matrix recovery problems.
Abstract: In many real-world applications, collected data are contaminated by noise with heavy-tailed distribution and might contain outliers of large magnitude. In this situation, it is necessary to apply methods which produce reliable outcomes even if the input contains corrupted measurements. We describe a general method which allows one to obtain estimators with tight concentration around the true parameter of interest taking values in a Banach space. Suggested construction relies on the fact that the geometric median of a collection of independent “weakly concentrated” estimators satisfies a much stronger deviation bound than each individual element in the collection. Our approach is illustrated through several examples, including sparse linear regression and low-rank matrix recovery problems.
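The construction can be sketched in Python for the simplest case, estimating a mean vector: split the sample into blocks, compute one weakly concentrated estimator (the block mean) per block, and combine them with the geometric median. The Weiszfeld iteration used here to compute the median, the block count and the contaminated data are assumptions for illustration.

```python
import numpy as np

def geometric_median(points, tol=1e-9, max_iter=1000):
    """Weiszfeld iteration: the point minimizing the sum of Euclidean distances
    to the rows of `points`."""
    m = points.mean(axis=0)
    for _ in range(max_iter):
        d = np.maximum(np.linalg.norm(points - m, axis=1), 1e-12)  # avoid division by zero
        w = 1.0 / d
        m_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

def median_of_block_means(data, k):
    """Split the rows of `data` into k blocks, average each block,
    and return the geometric median of the block means."""
    means = np.array([block.mean(axis=0) for block in np.array_split(data, k)])
    return geometric_median(means)

# Simulated data (hypothetical): 900 points around (1, -2), 5 of them grossly corrupted.
rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])
data = true_mean + rng.normal(size=(900, 2))
data[::200] = 1e4  # rows 0, 200, 400, 600, 800 become large outliers

plain = data.mean(axis=0)
robust = median_of_block_means(data, k=30)
print("sample mean:          ", np.round(plain, 2))
print("median of block means:", np.round(robust, 2))
```

The five outliers spoil at most five of the thirty block means, and the geometric median tolerates a minority of bad inputs, whereas the plain sample mean is dragged far from the target.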
TL;DR: The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures.
Abstract: We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
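How a sparse group penalty removes both whole groups and individual coefficients within surviving groups can be seen from its proximal operator. The Python sketch below covers only the simplest case of non-overlapping groups with a penalty lam1 * ||b||_1 + lam2 * sum over groups of ||b_g||_2; it is not the article's algorithm, which handles arbitrary group structures, and the numbers are invented.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding: shrink towards zero by lam, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_sparse_group(v, groups, lam1, lam2):
    """Proximal operator of lam1*||.||_1 + lam2*sum_g ||._g||_2 for non-overlapping groups:
    soft-threshold each coefficient, then shrink each group as a block."""
    out = np.zeros_like(v, dtype=float)
    for idx in groups:
        s = soft_threshold(v[idx], lam1)
        norm = np.linalg.norm(s)
        if norm > lam2:                      # otherwise the whole group is set to zero
            out[idx] = (1.0 - lam2 / norm) * s
    return out

# Hypothetical coefficients: group 0 is strong, group 1 is weak.
v = np.array([3.0, -0.5, 2.5, 1.3, -1.2])
groups = [[0, 1, 2], [3, 4]]
result = prox_sparse_group(v, groups, lam1=1.0, lam2=1.0)
print(result)
```

In this example the weak group is removed entirely, while the strong group survives with its small middle coefficient set to zero.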
TL;DR: The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of both the linear and nonlinear models.
TL;DR: In the application, effects of age and BMI on the time‐specific change in probability of being active over a 24‐hour period are identified; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.
Abstract: This manuscript considers regression models for generalized, multilevel functional responses: functions are generalized in that they follow an exponential family distribution and multilevel in that they are clustered within groups or subjects. This data structure is increasingly common across scientific domains and is exemplified by our motivating example, in which binary curves indicating physical activity or inactivity are observed for nearly 600 subjects over 5 days. We use a generalized linear model to incorporate scalar covariates into the mean structure, and decompose subject-specific and subject-day-specific deviations using multilevel functional principal components analysis. Thus, functional fixed effects are estimated while accounting for within-function and within-subject correlations, and major directions of variability within and between subjects are identified. Fixed effect coefficient functions and principal component basis functions are estimated using penalized splines; model parameters are estimated in a Bayesian framework using Stan, a programming language that implements a Hamiltonian Monte Carlo sampler. Simulations designed to mimic the application have good estimation and inferential properties with reasonable computation times for moderate datasets, in both cross-sectional and multilevel scenarios; code is publicly available. In the application we identify effects of age and BMI on the time-specific change in probability of being active over a 24-hour period; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.
TL;DR: This book covers the EViews workfile and a broad range of time-series models, including continuous and discontinuous growth models, seemingly causal models, VAR and system estimation, instrumental variables models, ARCH models, nonlinear least squares models, and nonparametric estimation methods.
Abstract: Preface. 1 EViews workfile and descriptive data analysis. 1.1 What is the EViews workfile? 1.2 Basic options in EViews. 1.3 Creating a workfile. 1.4 Illustrative data analysis. 1.5 Special notes and comments. 1.6 Statistics as a sample space. 2 Continuous growth models. 2.1 Introduction. 2.2 Classical growth models. 2.3 Autoregressive growth models. 2.4. Residual tests. 2.5 Bounded autoregressive growth models. 2.6 Lagged variables or autoregressive growth models. 2.7 Polynomial growth model. 2.8 Growth models with exogenous variables. 2.9 A Taylor series approximation model. 2.10 Alternative univariate growth models. 2.11 Multivariate growth models. 2.12 Multivariate AR(p) GLM with trend. 2.13 Generalized multivariate models with trend. 2.14 Special notes and comments. 2.15 Alternative multivariate models with trend. 2.16 Generalized multivariate models with time-related effects. 3 Discontinuous growth models. 3.1 Introduction. 3.2 Piecewise growth models. 3.3 Piecewise S-shape growth models. 3.4 Two-piece polynomial bounded growth models. 3.5 Discontinuous translog linear AR(1) growth models. 3.6 Alternative discontinuous growth models. 3.7 Stability test. 3.8 Generalized discontinuous models with trend. 3.9 General two-piece models with time-related effects. 3.10 Multivariate models by states and time periods. 4 Seemingly causal models. 4.1 Introduction. 4.2 Statistical analysis based on a single time series. 4.3 Bivariate seemingly causal models. 4.4 Trivariate seemingly causal models. 4.5 System equations based on trivariate time series. 4.6 General system of equations. 4.7 Seemingly causal models with dummy variables. 4.8 General discontinuous seemingly causal models. 4.9 Additional selected seemingly causal models. 4.10 Final notes in developing models. 5 Special cases of regression models. 5.1 Introduction. 5.2 Specific cases of growth curve models. 5.3 Seemingly causal models. 5.4 Lagged variable models. 
5.5 Cases based on the US domestic price of copper. 5.6 Return rate models. 5.7 Cases based on the BASICS workfile. 6 VAR and system estimation methods. 6.1 Introduction. 6.2 The VAR models. 6.3 The vector error correction models. 6.4 Special notes and comments. 7 Instrumental variables models. 7.1 Introduction. 7.2 Should we apply instrumental models? 7.3 Residual analysis in developing instrumental models. 7.4 System equation with instrumental variables. 7.5 Selected cases based on the US-DPOC data. 7.6 Instrumental models with time-related effects. 7.7 Instrumental seemingly causal models. 7.8 Multivariate instrumental models based on the US-DPOC. 7.9 Further extension of the instrumental models. 8 ARCH models. 8.1 Introduction. 8.2 Options of ARCH models. 8.3 Simple ARCH models. 8.4 ARCH models with exogenous variables. 8.5 Alternative GARCH variance series. 9 Additional testing hypotheses. 9.1 Introduction. 9.2 The unit root tests. 9.3 The omitted variables tests. 9.4 Redundant variables test (RV-test). 9.5 Nonnested test (NN-test). 9.6 The Ramsey RESET test. 9.7 Illustrative examples based on the Demo.wf1. 10 Nonlinear least squares models. 10.1 Introduction. 10.2 Classical growth models. 10.3 Generalized Cobb-Douglas models. 10.4 Generalized CES models. 10.5 Special notes and comments. 10.6 Other NLS models. 11 Nonparametric estimation methods. 11.1 What is the nonparametric data analysis. 11.2 Basic moving average estimates. 11.3 Measuring the best fit model. 11.4 Advanced moving average models. 11.5 Nonparametric regression based on a time series. 11.6 The local polynomial Kernel fit regression. 11.7 Nonparametric growth models. Appendix A: Models for a single time series. A.1 The simplest model. A.2 First-order autoregressive models. A.3 Second-order autoregressive model. A.4 First-order moving average model. A.5 Second-order moving average model. A.6 The simplest ARMA model. A.7 General ARMA model. Appendix B: Simple linear models. 
B.1 The simplest linear model. B.2 Linear model with basic assumptions. B.3 Maximum likelihood estimation method. B.4 First-order autoregressive linear model. B.5 AR(p) linear model. B.6 Alternative models. B.7 Lagged-variable model. B.8 Lagged-variable autoregressive models. B.9 Special notes and comments. Appendix C: General linear models. C.1 General linear model with i.i.d. Gaussian disturbances. C.2 AR(1) general linear model. C.3 AR(p) general linear model. C.4 General lagged-variable autoregressive model. C.5 General models with Gaussian errors. Appendix D: Multivariate general linear models. D.1 Multivariate general linear models. D.2 Moments of an endogenous multivariate. D.3 Vector autoregressive model. D.4 Vector moving average model. D.5 Vector autoregressive moving average model. D.6 Simple multivariate models with exogenous variables. D.7 General estimation methods. D.8 Maximum likelihood estimation for an MGLM. D.9 MGLM with autoregressive errors. References. Index.
TL;DR: The authors developed several models to examine possible predictors of the return of gold, which embrace six global factors (business cycle, nominal, interest rate, commodity, exchange rate and stock price) extracted from a recursive principal component analysis (PCA) and two uncertainty and stress indices (the Kansas City Fed's financial stress index and the U.S. economic policy uncertainty index).
TL;DR: Hyper-fit is a package for the R statistical language (github.com/asgr/hyper.fit), accompanied by a user-friendly web interface for online fitting, that fits a (D − 1)-dimensional plane with intrinsic scatter to D-dimensional data points with heteroscedastic, covariant errors.
Abstract: Astronomical data is often uncertain with errors that are heteroscedastic (different for each data point) and covariant between different dimensions. Assuming that a set of D-dimensional data points can be described by a (D − 1)-dimensional plane with intrinsic scatter, we derive the general likelihood function to be maximised to recover the best fitting model. Alongside the mathematical description, we also release the hyper-fit package for the R statistical language (github.com/asgr/hyper.fit) and a user-friendly web interface for online fitting (hyperfit.icrar.org). The hyper-fit package offers access to a large number of fitting routines, includes visualisation tools, and is fully documented in an extensive user manual. Most of the hyper-fit functionality is accessible via the web interface. In this paper, we include applications to toy examples and to real astronomical data from the literature: the mass-size, Tully–Fisher, Fundamental Plane, and mass-spin-morphology relations. In most cases, the hyper-fit solutions are in good agreement with published values, but uncover more information regarding the fitted model.
TL;DR: In this paper, the authors provide an introduction to the statistical analysis of neuroimaging data using the general linear model, which allows a flexible use of various models offering a wide range of statistical tests for the analysis of typical neuroimaging experiments.
Abstract: This article provides an introduction to the statistical analysis of neuroimaging data using the general linear model. The analysis allows a flexible use of various models offering a wide range of statistical tests for the analysis of typical neuroimaging experiments. A short introduction to the general linear model is provided using simple examples. We will focus on matrix formulations of the general linear model to clarify the nature of parameter estimation and ensuing statistical inference, using the extra sum-of-squares principle. The article concludes with a discussion of widely used statistical tests within the context of the general linear model.
TL;DR: “The statistician knows that in nature there never was a normal distribution, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”
Abstract: “The statistician knows...that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”
TL;DR: An adaptive method based on the S-system formalism, which is a sensible representation of nonlinear mass-action kinetics typically found in cellular dynamics, maintains the efficiency of linear regression and is combined with adaptive model selection to obtain efficient and parsimonious representations of cellular dynamics.
Abstract: The nonlinearity of dynamics in systems biology makes it hard to infer them from experimental data. Simple linear models are computationally efficient, but cannot incorporate these important nonlinearities. An adaptive method based on the S-system formalism, which is a sensible representation of nonlinear mass-action kinetics typically found in cellular dynamics, maintains the efficiency of linear regression. We combine this approach with adaptive model selection to obtain efficient and parsimonious representations of cellular dynamics. The approach is tested by inferring the dynamics of yeast glycolysis from simulated data. With little computing time, it produces dynamical models with high predictive power and with structural complexity adapted to the difficulty of the inference problem.
TL;DR: This work establishes a form of local statistical consistency for the penalized regression estimators under fairly mild conditions on the error distribution, and analysis of the local curvature of the loss function has useful consequences for optimization when the robust regression function and/or regularizer is nonconvex and the objective function possesses stationary points outside the local region.
Abstract: We study theoretical properties of regularized robust M-estimators, applicable when data are drawn from a sparse high-dimensional linear model and contaminated by heavy-tailed distributions and/or outliers in the additive errors and covariates. We first establish a form of local statistical consistency for the penalized regression estimators under fairly mild conditions on the error distribution: When the derivative of the loss function is bounded and satisfies a local restricted curvature condition, all stationary points within a constant radius of the true regression vector converge at the minimax rate enjoyed by the Lasso with sub-Gaussian errors. When an appropriate nonconvex regularizer is used in place of an l_1-penalty, we show that such stationary points are in fact unique and equal to the local oracle solution with the correct support---hence, results on asymptotic normality in the low-dimensional case carry over immediately to the high-dimensional setting. This has important implications for the efficiency of regularized nonconvex M-estimators when the errors are heavy-tailed. Our analysis of the local curvature of the loss function also has useful consequences for optimization when the robust regression function and/or regularizer is nonconvex and the objective function possesses stationary points outside the local region. We show that as long as a composite gradient descent algorithm is initialized within a constant radius of the true regression vector, successive iterates will converge at a linear rate to a stationary point within the local region. Furthermore, the global optimum of a convex regularized robust regression function may be used to obtain a suitable initialization. The result is a novel two-step procedure that uses a convex M-estimator to achieve consistency and a nonconvex M-estimator to increase efficiency.
TL;DR: This study illustrates that the linear mixed model is the preferred method to investigate risk factors associated with renal function trajectories in studies where patients may drop out during the study period because of initiation of renal replacement therapy.
Abstract: Background. The most commonly used methods to investigate risk factors associated with renal function trajectory over time include linear regression on individual glomerular filtration rate (GFR) slopes, linear mixed models and generalized estimating equations (GEEs). The objective of this study was to explain the principles of these three methods and to discuss their advantages and limitations, in particular when renal function trajectories are not completely observable due to dropout. Methods. We generated data from a hypothetical cohort of 200 patients with chronic kidney disease at inclusion and seven subsequent annual measurements of GFR. The data were generated such that both baseline level and slope of GFR over time were associated with baseline albuminuria status. In a second version of the dataset, we assumed that patients systematically dropped out after a GFR measurement of <15 mL/min/1.73 m². Each dataset was analysed with the three methods. Results. The estimated effects of baseline albuminuria status on GFR slope were similar among the three methods when no patient dropped out. When 32.7% dropped out, standard GEE provided biased estimates of the mean GFR slope in normo-, micro- and macroalbuminuric patients. Linear regression on individual slopes and linear mixed models provided slope estimates of the same magnitude, likely because most patients had at least three GFR measurements. However, the linear mixed model was the only method to provide effect estimates on both slope and baseline level of GFR unaffected by dropout. Conclusion. This study illustrates that the linear mixed model is the preferred method to investigate risk factors associated with renal function trajectories in studies where patients may drop out during the study period because of initiation of renal replacement therapy.