Top 765 papers published in the topic of Linear model in 2016

Showing papers on "Linear model" published in 2016

A heteroskedasticity-consistent covariance matrix estimator and a direct test

1 Jan 2016

TL;DR: This paper presents a covariance matrix estimator that is consistent even when the disturbances of a linear regression model are heteroskedastic, and that does not rely on a (possibly incorrect) specific formal model of the structure of the heteroskedasticity.

Abstract: This paper presents a parameter covariance matrix estimator which is consistent even when the disturbances of a linear regression model are heteroskedastic. This estimator does not depend on a formal model of the structure of the heteroskedasticity. By comparing the elements of the new estimator to those of the usual covariance estimator, one obtains a direct test for heteroskedasticity, since in the absence of heteroskedasticity, the two estimators will be approximately equal, but will generally diverge otherwise. The test has an appealing least squares interpretation. IT IS WELL KNOWN that the presence of heteroskedasticity in the disturbances of an otherwise properly specified linear model leads to consistent but inefficient parameter estimates and inconsistent covariance matrix estimates. As a result, faulty inferences will be drawn when testing statistical hypotheses in the presence of heteroskedasticity. If the investigator has a formal model of the process generating the differing variances, these difficulties are easily eliminated by performing an appropriate linear transformation on the data, based on this model. However, even when such a model is available, it may be incorrect. Often, several models are considered (e.g., Griliches [10]), but still without the certain knowledge that any of them is correct. In this situation one can test each of the alternative transformed models for remaining heteroskedasticity (using any of several available tests), and eliminate those which fail. But what is one to do if all fail the heteroskedasticity test? Although the investigator will have a fairly good idea of the parameter values of the linear model, there remains a considerable difficulty in assessing the precision of the parameter estimates and testing hypotheses due to the possible inconsistency of the usual covariance matrix estimator. 
In this paper I resolve this difficulty by presenting a covariance matrix estimator which is consistent in the presence of heteroskedasticity, but does not rely on a (possibly incorrect) specific formal model of the structure of the heteroskedasticity. Thus, even when heteroskedasticity cannot be completely eliminated, proper inferences can be drawn. Under appropriate conditions, a natural test for heteroskedasticity can be obtained by comparing the consistent estimator to the usual covariance matrix estimator; in the absence of heteroskedasticity, both estimators will be about the same; otherwise, they will generally diverge. The test shares the advantage of the covariance estimator, in that no formal structure on

14,921 citations
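
The comparison described in the abstract above can be illustrated with a minimal sketch. It computes the usual OLS covariance matrix and a heteroskedasticity-consistent "sandwich" covariance matrix (the form commonly labelled HC0 in later literature) on simulated data. The function name `ols_with_hc0`, the simulated data, and the use of NumPy are assumptions made for illustration; the formal test statistic from the paper is not reproduced here.

```python
import numpy as np


def ols_with_hc0(X, y):
    """Return OLS coefficients, the usual covariance, and the HC0 covariance.

    Hypothetical helper for illustration only.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Usual estimator: s^2 (X'X)^{-1}, valid under homoskedasticity.
    sigma2 = resid @ resid / (n - k)
    cov_usual = sigma2 * XtX_inv

    # Sandwich estimator: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}.
    meat = (X * resid[:, None] ** 2).T @ X
    cov_hc0 = XtX_inv @ meat @ XtX_inv
    return beta, cov_usual, cov_hc0


# Simulated data (an assumption): the error standard deviation grows with x.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)

beta, cov_usual, cov_hc0 = ols_with_hc0(X, y)
print("coefficients:       ", beta)
print("usual std. errors:  ", np.sqrt(np.diag(cov_usual)))
print("robust std. errors: ", np.sqrt(np.diag(cov_hc0)))
```

When the two sets of standard errors differ noticeably, that is the kind of divergence the abstract says the direct test is built on.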

Journal Article•10.18637/JSS.V069.I01•

Least-Squares Means: The R Package lsmeans

Russell V. Lenth

29 Jan 2016-Journal of Statistical Software

TL;DR: The lsmeans package (Lenth 2016) provides a simple way of obtaining least-squares means and contrasts thereof and supports many models fitted by R (R Core Team 2015) core packages that fit linear or mixed models.

Abstract: Least-squares means are predictions from a linear model, or averages thereof. They are useful in the analysis of experimental data for summarizing the effects of factors, and for testing linear contrasts among predictions. The lsmeans package (Lenth 2016) provides a simple way of obtaining least-squares means and contrasts thereof. It supports many models fitted by R (R Core Team 2015) core packages (as well as a few key contributed ones) that fit linear or mixed models, and provides a simple way of extending it to cover more model classes.

5,598 citations

Journal Article•10.1016/J.OMEGA.2015.12.001•

Best-worst multi-criteria decision-making method: Some properties and a linear model

Jafar Rezaei¹•Institutions (1)

Delft University of Technology¹

01 Oct 2016-Omega-international Journal of Management Science

TL;DR: The Best Worst Method uses a non-linear minmax model to identify the weights such that the maximum absolute difference between the weight ratios and their corresponding comparisons is minimized, which may result in multiple optimal solutions; this paper proposes interval analysis for that case and a linear model for BWM that yields a unique solution.

Abstract: The Best Worst Method (BWM) is a multi-criteria decision-making method that uses two vectors of pairwise comparisons to determine the weights of criteria. First, the best (e.g. most desirable, most important), and the worst (e.g. least desirable, least important) criteria are identified by the decision-maker, after which the best criterion is compared to the other criteria, and the other criteria to the worst criterion. A non-linear minmax model is then used to identify the weights such that the maximum absolute difference between the weight ratios and their corresponding comparisons is minimized. The minmax model may result in multiple optimal solutions. Although, in some cases, decision-makers prefer to have multiple optimal solutions, in other cases they prefer to have a unique solution. The aim of this paper is twofold: firstly, we propose using interval analysis for the case of multiple optimal solutions, in which we show how the criteria can be weighed and ranked. Secondly, we propose a linear model for BWM, which is based on the same philosophy, but yields a unique solution.

1,438 citations
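
The two optimization problems mentioned in the abstract above can be written out as follows. This is a sketch, not a quotation from the paper. The notation is assumed for illustration: $w_j$ is the weight of criterion $j$, $B$ and $W$ index the best and worst criteria, $a_{Bj}$ is the stated preference of the best criterion over criterion $j$, and $a_{jW}$ is the preference of criterion $j$ over the worst. The linear form shown is the commonly cited version and should be checked against the paper itself.

```latex
% Non-linear minmax model: minimize the largest deviation of the weight ratios
\[
\min_{w} \; \max_{j} \left\{ \left| \frac{w_B}{w_j} - a_{Bj} \right|,\;
                              \left| \frac{w_j}{w_W} - a_{jW} \right| \right\}
\quad \text{s.t.} \quad \sum_{j} w_j = 1, \qquad w_j \ge 0 \;\; \forall j .
\]

% Linear variant: the ratios are replaced by differences, giving a linear program
\[
\min_{w,\,\xi} \; \xi
\quad \text{s.t.} \quad
\left| w_B - a_{Bj}\, w_j \right| \le \xi, \qquad
\left| w_j - a_{jW}\, w_W \right| \le \xi \;\; \forall j, \qquad
\sum_{j} w_j = 1, \qquad w_j \ge 0 \;\; \forall j .
\]
```

Each absolute-value constraint in the second problem splits into two linear inequalities, which is what makes it a linear program.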

Book Chapter•10.1007/978-3-319-30671-1_4•

Deep Learning over Multi-field Categorical Data

Weinan Zhang¹, Tianming Du¹, Jun Wang¹•Institutions (1)

University College London¹

20 Mar 2016

TL;DR: In this article, two novel models using deep neural networks (DNNs) were proposed to automatically learn effective patterns from categorical feature interactions and make predictions of users' ad clicks.

Abstract: Predicting user responses, such as click-through rate and conversion rate, is critical in many web applications including web search, personalised recommendation, and online advertising. Unlike the continuous raw features usually found in the image and audio domains, the input features in web space are always multi-field and are mostly discrete and categorical, while their dependencies are little known. Major user response prediction models have to either limit themselves to linear models or require manually building up high-order combination features. The former loses the ability to explore feature interactions, while the latter results in heavy computation in the large feature space. To tackle the issue, we propose two novel models using deep neural networks (DNNs) to automatically learn effective patterns from categorical feature interactions and make predictions of users’ ad clicks. To make our DNNs work efficiently, we propose to leverage three feature transformation methods, i.e., factorisation machines (FMs), restricted Boltzmann machines (RBMs) and denoising auto-encoders (DAEs). This paper presents the structure of our models and their efficient training algorithms. The large-scale experiments with real-world data demonstrate that our methods work better than major state-of-the-art models.

536 citations

Journal Article•10.1038/S41467-017-00030-8•

Chaos as an Intermittently Forced Linear System

Steven L. Brunton¹, Bingni W. Brunton¹, Joshua L. Proctor, Eurika Kaiser¹, J. Nathan Kutz¹•Institutions (1)

University of Washington¹

18 Aug 2016-arXiv: Dynamical Systems

TL;DR: In this paper, a universal, data-driven decomposition of chaos as an intermittently forced linear system is presented, which combines Takens' delay embedding with modern Koopman operator theory and sparse regression to obtain linear representations of strongly nonlinear dynamics.

Abstract: Understanding the interplay of order and disorder in chaotic systems is a central challenge in modern quantitative science. We present a universal, data-driven decomposition of chaos as an intermittently forced linear system. This work combines Takens' delay embedding with modern Koopman operator theory and sparse regression to obtain linear representations of strongly nonlinear dynamics. The result is a decomposition of chaotic dynamics into a linear model in the leading delay coordinates with forcing by low energy delay coordinates; we call this the Hankel alternative view of Koopman (HAVOK) analysis. This analysis is applied to the canonical Lorenz system, as well as to real-world examples such as the Earth's magnetic field reversal, and data from electrocardiogram, electroencephalogram, and measles outbreaks. In each case, the forcing statistics are non-Gaussian, with long tails corresponding to rare events that trigger intermittent switching and bursting phenomena; this forcing is highly predictive, providing a clear signature that precedes these events. Moreover, the activity of the forcing signal demarcates large coherent regions of phase space where the dynamics are approximately linear from those that are strongly nonlinear.

414 citations
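
The delay-embedding step that the abstract above builds on can be sketched as follows: stack time-shifted copies of a scalar series into a Hankel matrix and take its singular value decomposition to obtain delay coordinates. This covers only that first step. The regression and forcing analysis are omitted, and the helper name `hankel_matrix`, the toy signal, and the number of delays are assumptions for illustration.

```python
import numpy as np


def hankel_matrix(x, n_delays):
    """Stack n_delays time-shifted copies of the 1-D series x as rows.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    n_cols = x.size - n_delays + 1
    return np.stack([x[i:i + n_cols] for i in range(n_delays)])


# Toy signal (an assumption): a sum of two sinusoids on a uniform time grid.
t = np.linspace(0.0, 20.0, 400)
x = np.sin(t) + 0.5 * np.sin(3.0 * t)

H = hankel_matrix(x, n_delays=10)

# Rows of Vt are the delay coordinates, ordered by singular value.
U, s, Vt = np.linalg.svd(H, full_matrices=False)
print("Hankel matrix shape:", H.shape)
print("leading singular values:", s[:5])
```

For this toy signal only the first few singular values are non-negligible, so a handful of delay coordinates carry essentially all of the energy.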

Journal Article•10.1038/SREP19444•

Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology

Shibo Wang¹, Jian-Ying Feng¹, Wen-Long Ren¹, Bo Huang¹, Ling Zhou¹, Yang-Jun Wen¹, Jin Zhang¹, Jim M. Dunwell², Shizhong Xu³, Yuan-Ming Zhang⁴, Yuan-Ming Zhang¹•Institutions (4)

Nanjing Agricultural University¹, University of Reading², University of California, Riverside³, Huazhong Agricultural University⁴

20 Jan 2016-Scientific Reports

TL;DR: The MRMLM is a multi-locus model including markers selected from the RMLM method with a less stringent selection criterion and is more powerful and accurate than the EMMA in QTN detection and QTN effect estimation.

Abstract: Genome-wide association studies (GWAS) have been widely used in genetic dissection of complex traits. However, common methods are all based on a fixed-SNP-effect mixed linear model (MLM) and single marker analysis, such as efficient mixed model analysis (EMMA). These methods require Bonferroni correction for multiple tests, which often is too conservative when the number of markers is extremely large. To address this concern, we proposed a random-SNP-effect MLM (RMLM) and a multi-locus RMLM (MRMLM) for GWAS. The RMLM simply treats the SNP-effect as random, but it allows a modified Bonferroni correction to be used to calculate the threshold p value for significance tests. The MRMLM is a multi-locus model including markers selected from the RMLM method with a less stringent selection criterion. Due to the multi-locus nature, no multiple test correction is needed. Simulation studies show that the MRMLM is more powerful in QTN detection and more accurate in QTN effect estimation than the RMLM, which in turn is more powerful and accurate than the EMMA. To demonstrate the new methods, we analyzed six flowering time related traits in Arabidopsis thaliana and detected more genes than previously reported using the EMMA. Therefore, the MRMLM provides an alternative for multi-locus GWAS.

410 citations

Journal Article•10.1016/J.EPSR.2015.09.001•

Pattern-based local linear regression models for short-term load forecasting

Grzegorz Dudek¹•Institutions (1)

Częstochowa University of Technology¹

01 Jan 2016-Electric Power Systems Research

TL;DR: In this article, the authors proposed univariate models for short-term load forecasting based on linear regression and patterns of daily cycles of load time series, where the patterns used as input and output variables simplify the forecasting problem by filtering out the trend and seasonal variations of periods longer than the daily one.

313 citations

Journal Article•10.1016/J.NEUROIMAGE.2016.05.068•

Faster permutation inference in brain imaging

Anderson M. Winkler¹, Gerard R. Ridgway², Gwenaëlle Douaud¹, Thomas E. Nichols³, Stephen M. Smith¹•Institutions (3)

University of Oxford¹, Wellcome Trust Centre for Neuroimaging², University of Warwick³

01 Nov 2016-NeuroImage

TL;DR: Properties of statistics used with the general linear model (GLM) and their distributions are exploited to obtain accelerations irrespective of generic software or hardware improvements; method (iv) was found to be the best as long as symmetric errors can be assumed.

305 citations

Journal Article•10.1016/J.JCP.2016.08.015•

A novel evolutionary algorithm applied to algebraic modifications of the RANS stress-strain relationship

Jack Weatheritt¹, Richard D. Sandberg¹•Institutions (1)

University of Melbourne¹

15 Nov 2016-Journal of Computational Physics

TL;DR: A novel expansion of Gene Expression Programming for the purpose of tensor modeling is described, giving the algorithm the freedom to produce a constraint-free model with its own functional form rather than a previously imposed one.

299 citations

Journal Article•10.3390/EN9020109•

Wind Speed Prediction Using a Univariate ARIMA Model and a Multivariate NARX Model

Erasmo Cadenas, Wilfrido Rivera, Rafael Campos-Amezcua, C.L. Heard

17 Feb 2016-Energies

TL;DR: In this article, two one-step-ahead wind speed forecasting models were compared, one using a linear autoregressive integrated moving average (ARIMA) and the other using a nonlinear autoregressive exogenous artificial neural network (NARX).

Abstract: Two one-step-ahead wind speed forecasting models were compared. A univariate model was developed using a linear autoregressive integrated moving average (ARIMA). This method’s performance is well studied for a large number of prediction problems. The other is a multivariate model developed using a nonlinear autoregressive exogenous artificial neural network (NARX). This uses the variables: barometric pressure, air temperature, wind direction and solar radiation or relative humidity, as well as delayed wind speed. Both models were developed from two databases from two sites: an hourly average measurements database from La Mata, Oaxaca, Mexico, and a ten minute average measurements database from Metepec, Hidalgo, Mexico. The main objective was to compare the impact of the various meteorological variables on the performance of the multivariate model of wind speed prediction with respect to the high performance univariate linear model. The NARX model gave better results with improvements on the ARIMA model of between 5.5% and 10.6% for the hourly database and of between 2.3% and 12.8% for the ten minute database for mean absolute error and mean squared error, respectively.

289 citations

Posted Content•

Modeling Missing Data in Clinical Time Series with RNNs

Zachary C. Lipton, David C. Kale, Randall C. Wetzel

13 Jun 2016-arXiv: Learning

TL;DR: A simple strategy to cope with missing data in sequential inputs is demonstrated, addressing the task of multilabel classification of diagnoses given clinical time series, and it is shown that for some diseases, what tests are run can be as predictive as the results themselves.

Abstract: We demonstrate a simple strategy to cope with missing data in sequential inputs, addressing the task of multilabel classification of diagnoses given clinical time series. Collected from the pediatric intensive care unit (PICU) at Children's Hospital Los Angeles, our data consists of multivariate time series of observations. The measurements are irregularly spaced, leading to missingness patterns in temporally discretized sequences. While these artifacts are typically handled by imputation, we achieve superior predictive performance by treating the artifacts as features. Unlike linear models, recurrent neural networks can realize this improvement using only simple binary indicators of missingness. For linear models, we show an alternative strategy to capture this signal. Training models on missingness patterns only, we show that for some diseases, what tests are run can be as predictive as the results themselves.
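
The idea of treating missingness as a feature, as described in the abstract above, can be sketched in a few lines: each input variable gets a companion binary indicator that is 1 where the value was missing. Filling the missing values with zero is an assumption made here for simplicity, and the helper name `add_missingness_indicators` is hypothetical. Neither the authors' preprocessing nor their models are reproduced.

```python
import numpy as np


def add_missingness_indicators(X):
    """Append one binary missingness indicator per variable.

    X has shape (time_steps, n_variables), with NaN marking missing values.
    Returns an array of shape (time_steps, 2 * n_variables).
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    # Zero-fill is an assumption; any imputation scheme could be used here.
    filled = np.where(mask, 0.0, X)
    return np.concatenate([filled, mask.astype(float)], axis=-1)


# Two variables observed over three time steps, with gaps.
series = np.array([
    [1.0, np.nan],
    [np.nan, 2.0],
    [3.0, 4.0],
])
features = add_missingness_indicators(series)
print(features)
```

The indicator columns on their own correspond to the "missingness patterns only" setting mentioned at the end of the abstract.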

Journal Article•10.1117/1.NPH.3.1.010401•

Commentary on the statistical properties of noise and its implication on general linear models in functional near-infrared spectroscopy

Theodore J. Huppert¹•Institutions (1)

University of Pittsburgh¹

01 Jan 2016-Neurophotonics

TL;DR: The application of linear regression models in fNIRS and the modifications needed to generalize these models in order to deal with structured (colored) noise due to systemic physiology and noise heteroscedasticity due to motion artifacts are discussed.

Abstract: Functional near-infrared spectroscopy (fNIRS) is a noninvasive neuroimaging technique that uses low levels of light to measure changes in cerebral blood oxygenation levels. In the majority of NIRS functional brain studies, analysis of this data is based on a statistical comparison of hemodynamic levels between a baseline and task or between multiple task conditions by means of a linear regression model: the so-called general linear model. Although these methods are similar to their implementation in other fields, particularly for functional magnetic resonance imaging, the specific application of these methods in fNIRS research differs in several key ways related to the sources of noise and artifacts unique to fNIRS. In this brief communication, we discuss the application of linear regression models in fNIRS and the modifications needed to generalize these models in order to deal with structured (colored) noise due to systemic physiology and noise heteroscedasticity due to motion artifacts. The objective of this work is to present an overview of these noise properties in the context of the linear model as it applies to fNIRS data. This work is aimed at explaining these mathematical issues to the general fNIRS experimental researcher but is not intended to be a complete mathematical treatment of these concepts.

Journal Article•10.3982/ECTA12560•

Identifying Latent Structures in Panel Data

Liangjun Su, Zhentao Shi¹, Peter C.B. Phillips•Institutions (1)

The Chinese University of Hong Kong¹

01 Nov 2016-Econometrica

TL;DR: In this article, a new variant of Lasso called classifier-Lasso is proposed to shrink individual coefficients to the unknown group-specific coefficients, which achieves simultaneous classification and consistent estimation in a single step.

Abstract: This paper provides a novel mechanism for identifying and estimating latent group structures in panel data using penalized techniques. We consider both linear and nonlinear models where the regression coefficients are heterogeneous across groups but homogeneous within a group and the group membership is unknown. Two approaches are considered—penalized profile likelihood (PPL) estimation for the general nonlinear models without endogenous regressors, and penalized GMM (PGMM) estimation for linear models with endogeneity. In both cases, we develop a new variant of Lasso called classifier-Lasso (C-Lasso) that serves to shrink individual coefficients to the unknown group-specific coefficients. C-Lasso achieves simultaneous classification and consistent estimation in a single step and the classification exhibits the desirable property of uniform consistency. For PPL estimation, C-Lasso also achieves the oracle property so that group-specific parameter estimators are asymptotically equivalent to infeasible estimators that use individual group identity information. For PGMM estimation, the oracle property of C-Lasso is preserved in some special cases. Simulations demonstrate good finite-sample performance of the approach in both classification and estimation. Empirical applications to both linear and nonlinear models are presented.

Proceedings Article•

Online ARIMA algorithms for time series prediction

Chenghao Liu¹, Steven C. H. Hoi¹, Peilin Zhao², Jianling Sun³•Institutions (3)

Singapore Management University¹, Institute for Infocomm Research Singapore², Zhejiang University³

12 Feb 2016

TL;DR: This paper proposes online learning algorithms for estimating ARIMA models under relaxed assumptions on the noise terms, which is suitable to a wider range of applications and enjoys high computational efficiency.

Abstract: Autoregressive integrated moving average (ARIMA) is one of the most popular linear models for time series forecasting due to its nice statistical properties and great flexibility. However, its parameters are estimated in a batch manner and its noise terms are often assumed to be strictly bounded, which restricts its applications and makes it inefficient for handling large-scale real data. In this paper, we propose online learning algorithms for estimating ARIMA models under relaxed assumptions on the noise terms, which is suitable to a wider range of applications and enjoys high computational efficiency. The idea of our ARIMA method is to reformulate the ARIMA model into a task of full information online optimization (without random noise terms). As a consequence, we can estimate the parameters online in an efficient and scalable way. Furthermore, we analyze regret bounds of the proposed algorithms, which guarantee that our online ARIMA model is provably as good as the best ARIMA model in hindsight. Finally, our encouraging experimental results further validate the effectiveness and robustness of our method.

Journal Article•10.1089/CMB.2015.0189•

Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters

Yifeng Li¹, Yifeng Li², Chih-Yu Chen¹, Wyeth W. Wasserman¹•Institutions (2)

University of British Columbia¹, National Research Council²

09 May 2016-Journal of Computational Biology

TL;DR: A deep feature selection (DFS) model is proposed that takes advantage of deep structures to model nonlinearity and conveniently selects a subset of features right at the input level for multiclass data.

Abstract: Sparse linear models approximate target variable(s) by a sparse linear combination of input variables. Since they are simple, fast, and able to select features, they are widely used in classification and regression. Essentially they are shallow feed-forward neural networks that have three limitations: (1) incompatibility to model nonlinearity of features, (2) inability to learn high-level features, and (3) unnatural extensions to select features in a multiclass case. Deep neural networks are models structured by multiple hidden layers with nonlinear activation functions. Compared with linear models, they have two distinctive strengths: the capability to (1) model complex systems with nonlinear structures and (2) learn high-level representation of features. Deep learning has been applied in many large and complex systems where deep models significantly outperform shallow ones. However, feature selection at the input level, which is very helpful to understand the nature of a complex system, is still not well studied. In genome research, the cis-regulatory elements in noncoding DNA sequences play a key role in the expression of genes. Since the activity of regulatory elements involves highly interactive factors, a deep tool is strongly needed to discover informative features. In order to address the above limitations of shallow and deep models for selecting features of a complex system, we propose a deep feature selection (DFS) model that (1) takes advantage of deep structures to model nonlinearity and (2) conveniently selects a subset of features right at the input level for multiclass data. Simulation experiments convince us that this model is able to correctly identify both linear and nonlinear features. We applied this model to the identification of active enhancers and promoters by integrating multiple sources of genomic information. Results show that our model outperforms elastic net in terms of size of discriminative feature subset and classification accuracy.

Journal Article•10.1016/J.IJFORECAST.2015.09.006•

Electric load forecasting with recency effect: A big data approach

Pu Wang¹, Bidong Liu², Tao Hong²•Institutions (2)

SAS Institute¹, University of North Carolina at Charlotte²

01 Jul 2016-International Journal of Forecasting

TL;DR: A comprehensive study to model the recency effect using a big data approach and two interesting findings are presented: 1) the naive models are not useful for benchmark purposes in load forecasting at aggregated level due to their lack of accuracy; and 2) slicing the data into 24 pieces to develop one model for each hour is not necessarily better than building one interaction regression model using all 24 hours together.

Proceedings Article•10.1109/GET.2016.7916627•

A comparative analysis on linear regression and support vector regression

S. Kavitha¹, Varuna S¹, Ramya R¹•Institutions (1)

Bannari Amman Institute of Technology, Sathy¹

1 Nov 2016

TL;DR: In this article, the authors compare linear regression and support vector regression models on a training data set in order to choose the correct model for better prediction and accuracy.

Abstract: In business, consumer interest, behavior, and product profits are the insights required to predict the future of business with the current data or historical data. These insights can be generated with statistical techniques for the purpose of forecasting. The statistical techniques can be evaluated for the predictive model based on the requirements of the data. Prediction and forecasting are done widely with time series data. Most applications, such as weather forecasting, finance and the stock market, combine historical data with the current streaming data for better accuracy. However, the time series data is analyzed with regression models. In this paper, linear regression and support vector regression models are compared using the training data set in order to use the correct model for better prediction and accuracy.

Journal Article•10.1289/EHP172•

A systematic comparison of linear regression-based statistical methods to assess exposome-health associations

Lydiane Agier¹, Lützen Portengen², Marc Chadeau-Hyam³, Xavier Basagaña⁴, Lise Giorgis-Allemand¹, Valérie Siroux¹, Oliver Robinson⁴, Jelle Vlaanderen², Juan R. González⁴, Mark J. Nieuwenhuijsen⁴, Paolo Vineis³, Martine Vrijheid⁴, Rémy Slama¹, Roel Vermeulen², Roel Vermeulen³•Institutions (4)

University of Grenoble¹, Utrecht University², Imperial College London³, Pompeu Fabra University⁴

24 May 2016-Environmental Health Perspectives

TL;DR: Although GUESS and DSA provided a marginally better balance between sensitivity and FDP, they did not outperform the other multivariate methods across all scenarios and properties examined, and computational complexity and flexibility should be considered when choosing between these methods.

Abstract: Background: The exposome constitutes a promising framework to improve understanding of the effects of environmental exposures on health by explicitly considering multiple testing and avoiding select...

Journal Article•10.1093/BIOMET/ASV062•

Partially functional linear regression in high dimensions

Dehan Kong¹, Kaijie Xue², Fang Yao², Hao Helen Zhang³•Institutions (3)

University of North Carolina at Chapel Hill¹, University of Toronto², University of Arizona³

01 Mar 2016-Biometrika

TL;DR: This work proposes a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types, and establishes the consistency and oracle properties of the proposed method under mild conditions.

Abstract: SUMMARY In modern experiments, functional and nonfunctional data are often encountered simultaneously when observations are sampled from random processes and high-dimensional scalar covariates. It is difficult to apply existing methods for model selection and estimation. We propose a new class of partially functional linear models to characterize the regression between a scalar response and covariates of both functional and scalar types. The new approach provides a unified and flexible framework that simultaneously takes into account multiple functional and ultrahigh-dimensional scalar predictors, enables us to identify important features, and offers improved interpretability of the estimators. The underlying processes of the functional predictors are considered to be infinite-dimensional, and one of our contributions is to characterize the effects of regularization on the resulting estimators. We establish the consistency and oracle properties of the proposed method under mild conditions, demonstrate its performance with simulation studies, and illustrate its application using air pollution data.

Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series

Zachary C. Lipton, David C. Kale, Randall C. Wetzel

10 Dec 2016

TL;DR: In this paper, the authors demonstrate a simple strategy to cope with missing data in sequential inputs, addressing the task of multilabel classification of diagnoses given clinical time series, and evaluate LSTMs, MLPs, and linear models trained on missingness patterns only.

Abstract: We demonstrate a simple strategy to cope with missing data in sequential inputs, addressing the task of multilabel classification of diagnoses given clinical time series. Collected from the intensive care unit (ICU) of a major urban medical center, our data consists of multivariate time series of observations. The data is irregularly sampled, leading to missingness patterns in re-sampled sequences. In this work, we show the remarkable ability of RNNs to make effective use of binary indicators to directly model missing data, improving AUC and F1 significantly. However, while RNNs can learn arbitrary functions of the missing data and observations, linear models can only learn substitution values. For linear models and MLPs, we show an alternative strategy to capture this signal. Additionally, we evaluate LSTMs, MLPs, and linear models trained on missingness patterns only, showing that for several diseases, what tests are run can be more predictive than the results themselves.

Journal Article•10.1111/2041-210X.12552•

Three points to consider when choosing a LM or GLM test for count data

David I. Warton¹, Mitchell B. Lyons¹, Jakub Stoklosa¹, Anthony R. Ives²•Institutions (2)

University of New South Wales¹, University of Wisconsin-Madison²

01 Aug 2016-Methods in Ecology and Evolution

TL;DR: When GLMs are thought to fit count data well, and when any necessary steps are taken to correct type I error rates, they should be used rather than LMs; tests based on models that better fit the data tend to have better power properties and in some instances have considerably higher power.

Abstract: Summary The two most common approaches for analysing count data are to use a generalized linear model (GLM), or transform data, and use a linear model (LM). The latter has recently been advocated to more reliably maintain control of type I error rates in tests for no association, while seemingly losing little in power. We make three points on this issue. Point 1 – Choice of statistical model should primarily be made on the grounds of data properties. Choice of testing procedure should be considered and addressed as a separate issue, after model choice. If models with the appropriate data properties nonetheless have statistical problems such as type I error control (i.e. type I error rate greatly exceeds the intended significance level), the best solution is to keep the model but fix the problems. Point 2 – When a test has problems with type I error control, it can usually be corrected, but this may require departure from software default approaches. In particular, resampling is a good solution for small samples that can be easy to implement. Point 3 – Tests based on models that better fit the data (e.g. a negative binomial for overdispersed count data) tend to have better power properties and in some instances have considerably higher power. We illustrate these issues for a 2 × 2 experiment with a count response. This seemingly simple problem becomes hard when the experimental design is unbalanced, and software default procedures using LMs or GLMs can have difficulties, although in both cases the issues can be fixed. We conclude that, when GLMs are thought to fit count data well, and when any necessary steps are taken to correct type I error rates, they should be used rather than LMs. Nonetheless, standard LM tests are often robust and can have good type I error control, so there is an argument for their use for counts when diagnostics are difficult and statistical models are complex, although at some risk of loss of power and interpretability.

Journal Article•10.1016/J.APENERGY.2016.07.052•

Forecasting the daily power output of a grid-connected photovoltaic system based on multivariate adaptive regression splines

Yanting Li¹, Yong He¹, Yan Su², Lianjie Shu²•Institutions (2)

Shanghai Jiao Tong University¹, University of Macau²

15 Oct 2016-Applied Energy

TL;DR: A fairly simple nonlinear regression model known as multivariate adaptive regression splines (MARS) is suggested as an alternative for forecasting solar power output; it maintains the simplicity of the classical multiple linear regression (MLR) model while being able to handle nonlinearity.

Journal Article•10.1214/16-AOAS928•

Regression analysis for microbiome compositional data

Pixu Shi, Anru Zhang, Hongzhe Li

03 Mar 2016-The Annals of Applied Statistics

TL;DR: In this article, a penalized estimation procedure is developed for estimating the regression coefficients and selecting variables under linear constraints. Debiased estimates provide valid confidence intervals and $p$-values for the regression coefficients. Applied to a gut microbiome data set, the proposed methods identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.

Abstract: One important problem in microbiome analysis is to identify the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa at different taxonomic levels. This paper considers regression analysis with such compositional data as covariates. In order to satisfy the subcompositional coherence of the results, linear models with a set of linear constraints on the regression coefficients are introduced. Such models allow regression analysis for subcompositions and include the log-contrast model for compositional covariates as a special case. A penalized estimation procedure for estimating the regression coefficients and for selecting variables under the linear constraints is developed. A method is also proposed to obtain debiased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This provides valid confidence intervals of the regression coefficients and can be used to obtain the $p$-values. Simulation results show the validity of the confidence intervals and smaller variances of the debiased estimates when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with the body mass index after adjusting for the total fat and caloric intakes.
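
To see why a linear constraint on the coefficients matters for compositional covariates, the sketch below evaluates the linear predictor of a log-contrast model with a zero-sum constraint. The coefficients and sample values are hypothetical, and the function name `log_contrast_predictor` is an assumption for this example; no penalization or debiasing is attempted here.

```python
import math

def log_contrast_predictor(composition, beta, intercept=0.0):
    """Linear predictor intercept + sum_j beta_j * log(x_j), with sum_j beta_j = 0."""
    if abs(sum(beta)) > 1e-9:
        raise ValueError("log-contrast coefficients must sum to zero")
    return intercept + sum(b * math.log(x) for b, x in zip(beta, composition))

beta = [0.8, -0.5, -0.3]           # hypothetical coefficients, summing to zero
proportions = [0.2, 0.3, 0.5]      # relative abundances of three taxa
raw_counts = [40.0, 60.0, 100.0]   # the same sample before normalisation

pred_from_proportions = log_contrast_predictor(proportions, beta)
pred_from_counts = log_contrast_predictor(raw_counts, beta)
```

Rescaling every component by a constant c adds log(c) times the sum of the coefficients to the predictor, so under the zero-sum constraint the two predictions agree up to rounding. The result then depends only on relative abundances, not on the total.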

Posted Content•

Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series

Zachary C. Lipton, David C. Kale, Randall C. Wetzel

13 Jun 2016-arXiv: Learning

TL;DR: This work shows the remarkable ability of RNNs to make effective use of binary indicators to directly model missing data, improving AUC and F1 significantly and evaluating LSTMs, MLPs, and linear models trained on missingness patterns only.

Abstract: We demonstrate a simple strategy to cope with missing data in sequential inputs, addressing the task of multilabel classification of diagnoses given clinical time series. Collected from the intensive care unit (ICU) of a major urban medical center, our data consists of multivariate time series of observations. The data is irregularly sampled, leading to missingness patterns in re-sampled sequences. In this work, we show the remarkable ability of RNNs to make effective use of binary indicators to directly model missing data, improving AUC and F1 significantly. However, while RNNs can learn arbitrary functions of the missing data and observations, linear models can only learn substitution values. For linear models and MLPs, we show an alternative strategy to capture this signal. Additionally, we evaluate LSTMs, MLPs, and linear models trained on missingness patterns only, showing that for several diseases, what tests are run can be more predictive than the results themselves.
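
The binary-indicator idea can be sketched in a few lines: each time step carries the observed value (or a fill value when nothing was measured) together with a flag marking missingness. The function name `add_missing_indicators`, the use of `None` for gaps and the zero fill value are assumptions for this illustration, not details taken from the paper.

```python
def add_missing_indicators(sequence, fill_value=0.0):
    """Turn a sequence with None gaps into (value, is_missing) feature pairs."""
    features = []
    for obs in sequence:
        if obs is None:
            # No measurement at this step: substitute a value and flag it.
            features.append((fill_value, 1))
        else:
            features.append((float(obs), 0))
    return features

# Hypothetical re-sampled temperature series with one unobserved step.
temperature = [98.6, None, 101.2]
features = add_missing_indicators(temperature)
```

A model fed these pairs can learn from the flag itself, which is how the pattern of which tests were run can become predictive independently of the measured values.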

Journal Article•10.1080/00401706.2016.1142900•

Online Updating of Statistical Inference in the Big Data Setting

Elizabeth D. Schifano¹, Jing Wu¹, Chun Wang¹, Jun Yan¹, Ming-Hui Chen¹•Institutions (1)

University of Connecticut¹

08 Jul 2016-Technometrics

TL;DR: In this article, the authors present statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storage/access to the historical data.

Abstract: We present statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storage/access to the historical data. In particular, we develop iterative estimating algorithms and statistical inferences for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Within the linear model setting, the proposed online-updating framework leads to predictive residual tests that can be used to assess the goodness of fit of the hypothesized model. We also propose a new online-updating estimator under the estimating equation setting. Theoretical properties of the goodness-of-fit tests and proposed estimators are examined in detail. In simulation studies and real data applications, our estimator compares favorably with competing approaches...
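
A minimal way to picture online updating for a linear model is to keep only cumulative sufficient statistics and discard each block of data after it has been absorbed. The sketch below does this for a one-covariate regression; it is a textbook simplification, not the authors' estimator, and the class name `OnlineSimpleRegression` is an assumption for this example.

```python
class OnlineSimpleRegression:
    """Least squares fit of y = b0 + b1 * x, updated block by block."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        # Only running sums are stored; the raw block can then be discarded.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        det = self.n * self.sxx - self.sx ** 2
        if det == 0:
            raise ValueError("design is rank deficient so far")
        slope = (self.n * self.sxy - self.sx * self.sy) / det
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

model = OnlineSimpleRegression()
model.update([0.0, 1.0, 2.0], [2.0, 5.0, 8.0])     # first block of the stream
model.update([3.0, 4.0, 5.0], [11.0, 14.0, 17.0])  # second block
intercept, slope = model.coefficients()
```

Storage stays constant no matter how many blocks arrive, and the coefficients match what a single fit on the pooled data would give. The rank check mirrors, in a very simple form, the concern about rank-deficient subsets mentioned in the abstract.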

Journal Article•10.1016/J.ENVRES.2015.12.016•

A land use regression model for ambient ultrafine particles in Montreal, Canada: A comparison of linear regression and a machine learning approach.

Scott Weichenthal¹, Scott Weichenthal², Keith Van Ryswyk², Alon Goldstein¹, Scott Bagg¹, Maryam Shekkarizfard¹, Marianne Hatzopoulou³•Institutions (3)

McGill University¹, Health Canada², University of Toronto³

01 Apr 2016-Environmental Research

TL;DR: A land use regression model for ultrafine particles (UFPs) in Montreal, Canada is developed using mobile monitoring data collected from 414 road segments during the summer and winter months between 2011 and 2012. The results suggest that the kernel-based regularized least squares (KRLS) approach may offer modest improvements in predictive performance compared to standard multivariable linear regression models used to estimate spatial variations in ambient UFPs.

Journal Article•10.1016/J.RIE.2016.01.001•

Should Instrumental Variables be Used as Matching Variables?

Jeffrey M. Wooldridge¹•Institutions (1)

Michigan State University¹

01 Jun 2016-Research in Economics

TL;DR: The author shows that adding covariates that satisfy instrumental variables assumptions increases the amount of inconsistency in a linear model, and that regression adjustment using the propensity score based on instrumental variables actually maximizes the inconsistency among regression-type estimators.

Journal Article•10.1093/IMAIAI/IAW015•

High-dimensional estimation with geometric constraints

Yaniv Plan¹, Roman Vershynin¹, Elena Yudovina¹•Institutions (1)

University of Michigan¹

14 Sep 2016-Information and Inference: A Journal of the IMA

TL;DR: A general model is proposed in which it is only assumed that each observation y_i may depend on a_i only through ⟨a_i, x⟩. This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one's ability to determine the signal, even when the non-linearity may be non-invertible.

Abstract: Consider measuring an n-dimensional vector x through the inner product with several measurement vectors, a_1, a_2, ..., a_m. It is common in both signal processing and statistics to assume the linear response model y_i = ⟨a_i, x⟩ + e_i, where e_i is a noise term. However, in practice the precise relationship between the signal x and the observations y_i may not follow the linear model, and in some cases it may not even be known. To address this challenge, in this paper we propose a general model where it is only assumed that each observation y_i may depend on a_i only through ⟨a_i, x⟩. We do not assume that the dependence is known. This is a form of the semiparametric single index model, and it includes the linear model as well as many forms of the generalized linear model as special cases. We further assume that the signal x has some structure, and we formulate this as a general assumption that x belongs to some known (but arbitrary) feasible set K. We carefully detail the benefit of using the signal structure to improve estimation. The theory is based on the mean width of K, a geometric parameter which can be used to understand its effective dimension in estimation problems. We determine a simple, efficient two-step procedure for estimating the signal based on this model -- a linear estimation followed by metric projection onto K. We give general conditions under which the estimator is minimax optimal up to a constant. This leads to the intriguing conclusion that in the high noise regime, an unknown non-linearity in the observations does not significantly reduce one's ability to determine the signal, even when the non-linearity may be non-invertible. Our results may be specialized to understand the effect of non-linearities in compressed sensing.
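
The two-step procedure (a linear estimate followed by projection onto K) can be tried on a toy problem. The sketch below makes several assumptions that are not stated in the abstract: the measurement vectors are standard Gaussian, the unknown non-linearity is the sign function, and K is the set of vectors with at most s nonzero entries. All function names are hypothetical. Since only the sign of the inner product is observed, the signal can be recovered at best up to scale, so the estimate is compared with the truth by direction.

```python
import math
import random

def linear_estimate(measurements, responses):
    """Step 1: average y_i * a_i over all observations."""
    m, n = len(measurements), len(measurements[0])
    return [sum(y * a[j] for a, y in zip(measurements, responses)) / m
            for j in range(n)]

def project_onto_sparse(vector, s):
    """Step 2: Euclidean projection onto K = {vectors with at most s nonzeros}."""
    order = sorted(range(len(vector)), key=lambda j: abs(vector[j]), reverse=True)
    keep = set(order[:s])
    return [v if j in keep else 0.0 for j, v in enumerate(vector)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)
n, m, s = 20, 2000, 3
signal = [1.0, -1.0, 1.0] + [0.0] * (n - s)
measurements = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
# Non-invertible non-linearity: only the sign of the inner product is observed.
responses = [1.0 if sum(a * x for a, x in zip(row, signal)) >= 0 else -1.0
             for row in measurements]
estimate = project_onto_sparse(linear_estimate(measurements, responses), s)
alignment = cosine(estimate, signal)
```

Neither step uses any knowledge of the sign function, which is the point of the single index formulation: the estimator is the same whatever the link between ⟨a_i, x⟩ and y_i turns out to be.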

Proceedings Article•

Linear dynamical neural population models through nonlinear embeddings

Yuanjun Gao¹, Evan Archer¹, Liam Paninski¹, John P. Cunningham¹•Institutions (1)

Columbia University¹

1 Jan 2016

TL;DR: This paper proposed fLDS, a general class of nonlinear generative models that permits the firing rate of each neuron to vary as an arbitrary smooth function of a latent, linear dynamical state, which allows the model to capture a richer set of neural variability than a purely linear model, but retains an easily visualizable low-dimensional latent space.

Abstract: A body of recent work in modeling neural activity focuses on recovering low-dimensional latent features that capture the statistical structure of large-scale neural populations. Most such approaches have focused on linear generative models, where inference is computationally tractable. Here, we propose fLDS, a general class of nonlinear generative models that permits the firing rate of each neuron to vary as an arbitrary smooth function of a latent, linear dynamical state. This extra flexibility allows the model to capture a richer set of neural variability than a purely linear model, but retains an easily visualizable low-dimensional latent space. To fit this class of non-conjugate models we propose a variational inference scheme, along with a novel approximate posterior capable of capturing rich temporal correlations across time. We show that our techniques permit inference in a wide class of generative models. We also show in application to two neural datasets that, compared to state-of-the-art neural population models, fLDS captures a much larger proportion of neural variability with a small number of latent dimensions, providing superior predictive performance and interpretability.

Bayes factors and marginal distributions in invariant situations

Luis R. Pericchi, Julia A. Varshavsky

1 Jan 2016

TL;DR: In this article, the marginal density of a "minimal" data set is typically available in closed form, regardless of the error distribution, and the conditions for the results to hold are explored in some detail for nonnormal linear models and various transformations thereof.

Abstract: In Bayesian analysis with a "minimal" data set and common noninformative priors, the (formal) marginal density of the data is surprisingly often independent of the error distribution. This results in great simplifications in certain model selection methodologies; for instance, the Intrinsic Bayes Factor for models with this property reduces simply to the Bayes factor with respect to the noninformative priors. The basic result holds for comparison of models which are invariant with respect to the same group structure. Indeed, the condition reduces to a condition on the distributions of the common maximal invariant. In these situations, the marginal density of a "minimal" data set is typically available in closed form, regardless of the error distribution. This provides very useful expressions for computation of Intrinsic Bayes Factors in more general settings. The conditions for the results to hold are explored in some detail for nonnormal linear models and various transformations thereof.
