TL;DR: The need for a new generation of robust estimators that can efficiently deal with cellwise outliers and at the same time show good performance under casewise outliers is highlighted.
Abstract: Multivariate location and scatter matrix estimation is a cornerstone in multivariate data analysis. We consider this problem when the data may contain independent cellwise and casewise outliers. Flat data sets with a large number of variables and a relatively small number of cases are commonplace in modern statistical applications. In these cases, global down-weighting of an entire case, as performed by traditional robust procedures, may lead to poor results. We highlight the need for a new generation of robust estimators that can efficiently deal with cellwise outliers and at the same time show good performance under casewise outliers.
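The weakness of casewise down-weighting on flat data can be illustrated with a short calculation. The sketch below assumes the simplest independent contamination scheme, in which every cell is contaminated independently with the same probability `eps`; the function name and the values of `eps` and `p` are illustrative assumptions, not taken from the paper.

```python
def contaminated_case_fraction(eps: float, p: int) -> float:
    """Probability that a case with p cells contains at least one contaminated
    cell, assuming each cell is contaminated independently with probability eps."""
    return 1.0 - (1.0 - eps) ** p


if __name__ == "__main__":
    # Illustrative values only: 5% contaminated cells, increasing dimension p.
    for p in (5, 20, 100):
        frac = contaminated_case_fraction(0.05, p)
        print(f"p = {p:3d}: fraction of affected cases = {frac:.3f}")
```

Under this assumption, 5% contaminated cells and 20 variables already leave roughly 64% of the cases with at least one outlying cell, so an estimator that down-weights whole cases would have to discount most of the data.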
TL;DR: In this article, some basic properties for negatively superadditive-dependent (NSD, in short) random variables are presented, such as the Rosenthal-type inequality and the Kolmogorov-type exponential inequality.
Abstract: In this paper, some basic properties for negatively superadditive-dependent (NSD, in short) random variables are presented, such as the Rosenthal-type inequality and the Kolmogorov-type exponential inequality. Using these properties, we further study the complete convergence for weighted sums of NSD random variables, which generalizes and improves some corresponding ones for independent random variables and negatively associated random variables. Some sufficient conditions to prove the complete convergence for weighted sums of NSD random variables are provided. As an application, the complete consistency of LS estimators in the EV regression model with NSD errors is investigated under mild conditions, which generalizes and improves the corresponding one for negatively associated random variables.
TL;DR: In this paper, the authors used the family of elliptical distributions to estimate the spatial dependence in referenced data and also used the local influence methodology to assess the sensitivity of the maximum likelihood estimators to small perturbations in the data and/or in the spatial linear model assumptions.
Abstract: In recent years, there has been a growing interest in statistical methods for the analysis of spatially referenced data. The spatial dependence structure modeling is an indispensable tool to estimate the parameters that define this structure. In this paper, we use the family of elliptical distributions to estimate the spatial dependence in referenced data. Thus we extend the Gaussian spatial linear model. Also we use the local influence methodology to assess the sensitivity of the maximum likelihood estimators to small perturbations in the data and/or in the spatial linear model assumptions. The methodology is illustrated with a real data set. The results allowed us to conclude that the presence of atypical values in the sample data has a strong influence, changing the spatial dependence structure. Also we have included a small simulation study.
TL;DR: In this paper, a semiparametric regression model is proposed for data set analysis in which the distribution of the response is strictly positive and asymmetric, and both median and skewness of the distribution are explicitly modeled.
Abstract: We motivate this paper by showing through Monte Carlo simulation that ignoring the skewness of the response variable distribution in non-linear regression models may introduce biases on the parameter estimates and/or on the estimation of the associated variability measures. Then, we propose a semiparametric regression model suitable for data set analysis in which the distribution of the response is strictly positive and asymmetric. In this setup, both median and skewness of the response variable distribution are explicitly modeled, the median using a parametric non-linear function and the skewness using a semiparametric function. The proposed model allows for the description of the response using the log-symmetric distribution, which is a generalization of the log-normal distribution and is flexible enough to consider bimodal distributions in special cases as well as distributions having heavier or lighter tails than those of the log-normal one. An iterative estimation process as well as some diagnostic methods are derived. Two data sets previously analyzed under parametric models are reanalyzed using the proposed methodology.
TL;DR: In this article, two new model selection criteria, namely the bootstrapped likelihood quasi-CV and its 632QCV variant, were proposed to circumvent the bias of the Akaike information criterion.
Abstract: This paper addresses the issue of model selection in the beta regression model focused on small samples. The Akaike information criterion (AIC) is a model selection criterion widely used in practical applications. The AIC is an estimator of the expected log-likelihood value, and measures the discrepancy between the true model and the estimated model. In small samples, the AIC is biased and tends to select overparameterized models. To circumvent that problem, we propose two new selection criteria, namely: the bootstrapped likelihood quasi-CV and its 632QCV variant. We use Monte Carlo simulation to compare the finite sample performances of the two proposed criteria to those of the AIC and its variations that use the bootstrapped log-likelihood in the class of varying dispersion beta regressions. The numerical evidence shows that the proposed model selection criteria perform well in small samples. We also present and discuss an empirical application.
TL;DR: In this article, the authors consider the problem of selecting a predictor from a finite collection of spatial predictors of a spatial random process defined on a subset of the Euclidean space.
Abstract: Remote sensing technology for the study of Earth and its environment has led to “Big Data” that, paradoxically, have global extent but may be spatially sparse. Furthermore, the variability in the measurement error and the latent process error may not fit conveniently into the Gaussian linear paradigm. In this paper, we consider the problem of selecting a predictor from a finite collection of spatial predictors of a spatial random process defined on \(D\), a subset of \(d\)-dimensional Euclidean space. Critically, we make no statistical distributional assumptions other than additive measurement error. In this nonparametric setting, one could use a criterion based on a validation dataset to select a spatial predictor for all of \(D\). Instead, we propose local criteria based on validation data to select a predictor at each spatial location in \(D\); the result is a hybrid combination of the spatial predictors, which we call a locally selected predictor (LSP). We consider selection from a collection of some of the classical and more recently proposed spatial predictors currently available. In a simulation study, the relative performances of various LSPs, as well as the performance of each of the individual spatial predictors in the collection, are assessed. “Big Data” are always challenging, and here we apply LSP to a very large global spatial dataset of atmospheric \(\mathrm {CO}_{2}\) measurements.
TL;DR: An existing estimation procedure for survival under a copula-based dependent truncation model is revisited but a different algorithm to solve the equation is proposed and its theoretical and practical usefulness are discussed.
Abstract: Traditional analysis with truncated survival data has been developed under the assumption that the lifetime variable of interest is statistically independent of the truncation variable. However, empirical evidence has shown that the truncation variable may depend on the lifetime of interest in many real-world examples. The lack of independence can lead to seriously biased analysis. In this article, we revisit an existing estimation procedure for survival under a copula-based dependent truncation model. Here, the same estimating equation is adopted but a different algorithm to solve the equation is proposed. We compare the new algorithm with the existing one and discuss its theoretical and practical usefulness. Real data examples are analyzed for illustration. We implemented the proposed algorithm in an R “depend.truncation” package, available from CRAN.
TL;DR: In this paper, the authors proposed to use the asymptotics of the Lambert function to improve the accuracy of the approximation to the Gumbel law in signal processing problems.
Abstract: In some applied problems of signal processing, the maximum of a sample of \(\chi ^2(m)\) random variables is computed and compared with a threshold to assess certain properties. It is well known that this maximum, conveniently normalized, converges in law to a Gumbel random variable; however, numerical and simulation studies show that the norming constants that are usually suggested are inaccurate for moderate or even large sample sizes. In this paper, we propose, for Gamma laws (in particular, for a \(\chi ^2(m)\) law) and other Weibull-like distributions, other norming constants computed with the asymptotics of the Lambert \(W\) function that significantly improve the accuracy of the approximation to the Gumbel law.
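The inaccuracy of the usual norming constants can be checked numerically without simulation. The sketch below is an illustration under stated assumptions: it uses the textbook constants for Gamma maxima, \(a_n = 2\) and \(b_n = 2[\ln n + (m/2 - 1)\ln \ln n - \ln \Gamma (m/2)]\), and it restricts \(m\) to even values so that the \(\chi ^2(m)\) distribution function has a closed form. It does not implement the Lambert \(W\) constants proposed in the paper, and all function names are hypothetical.

```python
import math


def chi2_cdf_even(x: float, m: int) -> float:
    """CDF of a chi-square law with an even number m of degrees of freedom."""
    if x <= 0:
        return 0.0
    half = x / 2.0
    partial = sum(half ** k / math.factorial(k) for k in range(m // 2))
    return 1.0 - math.exp(-half) * partial


def classical_constants(n: int, m: int):
    """Textbook norming constants for the maximum of n chi-square(m) variables."""
    alpha = m / 2.0
    a_n = 2.0
    b_n = 2.0 * (math.log(n) + (alpha - 1.0) * math.log(math.log(n))
                 - math.lgamma(alpha))
    return a_n, b_n


def gumbel_cdf(x: float) -> float:
    return math.exp(-math.exp(-x))


def max_abs_error(n: int, m: int, grid) -> float:
    """Largest gap on the grid between the exact law of the normalized
    maximum, F(a_n x + b_n)^n, and the Gumbel limit."""
    a_n, b_n = classical_constants(n, m)
    return max(abs(chi2_cdf_even(a_n * x + b_n, m) ** n - gumbel_cdf(x))
               for x in grid)


if __name__ == "__main__":
    grid = [i / 10 for i in range(-20, 51)]
    for m in (2, 4):
        for n in (10 ** 3, 10 ** 6):
            print(f"m = {m}, n = {n}: max error = {max_abs_error(n, m, grid):.4f}")
```

For \(m = 2\) the maximum is that of exponential variables and the classical constants are nearly exact. For \(m = 4\) the error on this grid remains above 0.05 even at \(n = 10^6\), which is the kind of slow convergence the abstract refers to.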
TL;DR: In this article, the authors introduce spatial long-range dependent models based on the fractional difference operators associated with the Gegenbauer polynomials and verify assumptions for consistency and asymptotic normality of minimum contrast estimators.
Abstract: The article introduces spatial long-range dependent models based on the fractional difference operators associated with the Gegenbauer polynomials. The results on consistency and asymptotic normality of a class of minimum contrast estimators of long-range dependence parameters of the models are obtained. A methodology to verify assumptions for consistency and asymptotic normality of minimum contrast estimators is developed. Numerical results are presented to confirm the theoretical findings.
TL;DR: In this article, a robust testing procedure is recommended under the framework of the generalized likelihood ratio (GLR) test by incorporating a Wilcoxon-type artificial likelihood function; under the null hypothesis, the test statistic is proved to be asymptotically normal and free of nuisance parameters and covariate designs.
Abstract: This paper is concerned with robust comparison of two regression curves. Most of the procedures in the literature are least-squares-based methods with local polynomial approximation to nonparametric regression. However, the efficiency of these methods is adversely affected by outlying observations and heavy-tailed distributions. To address this challenge, a robust testing procedure is recommended under the framework of the generalized likelihood ratio (GLR) test by incorporating a Wilcoxon-type artificial likelihood function. Under the null hypothesis, the proposed test statistic is proved to be asymptotically normal and free of nuisance parameters and covariate designs. Its asymptotic relative efficiency with respect to the least-squares-based GLR method is closely related to that of the signed-rank Wilcoxon test in comparison with the \(t\) test. We then consider a bootstrap approximation to determine \(p\) values of the test in finite sample situations. Its asymptotic validity is also presented. A simulation study is conducted to examine the performance of the proposed test and to compare it with its competitors in the literature.
TL;DR: A conditional to dropout model is introduced for the analysis of longitudinal data suffering attrition in random effects models framework, where the subject-specific random effects are assumed to be discrete and to follow a time-dependent latent process.
Abstract: We illustrate a class of conditional models for the analysis of longitudinal data suffering attrition in random effects models framework, where the subject-specific random effects are assumed to be discrete and to follow a time-dependent latent process. The latent process accounts for unobserved heterogeneity and correlation between individuals in a dynamic fashion, and for dependence between the observed process and the missing data mechanism. Of particular interest is the case where the missing mechanism is non-ignorable. To deal with the topic we introduce a conditional to dropout model. A shape change in the random effects distribution is considered by directly modeling the effect of the missing data process on the evolution of the latent structure. To estimate the resulting model, we rely on the conditional maximum likelihood approach and for this aim we outline an EM algorithm. The proposal is illustrated via simulations and then applied on a dataset concerning skin cancers. Comparisons with other well-established methods are provided as well.
TL;DR: In this paper, the conditional mean function is modeled as a partial linear model and the conditional distribution function of the model error is modeled as a single-index model, and a semi-parametric global weighted least-squares estimator coupled with an indicator function of the residuals is proposed.
Abstract: In this paper, we present a method for estimating the conditional distribution function of the model error. Given the covariates, the conditional mean function is modeled as a partial linear model, and the conditional distribution function of model error is modeled as a single-index model. To estimate the single-index parameter, we propose a semi-parametric global weighted least-squares estimator coupled with an indicator function of the residuals. We derive a residual-based kernel estimator to estimate the unknown conditional distribution function. Asymptotic distributions of the proposed estimators are derived, and the residual-based kernel process constructed by the estimator of the conditional distribution function is shown to converge to a Gaussian process. Simulation studies are conducted and a real dataset is analyzed to demonstrate the performance of the proposed estimators.
TL;DR: In this paper, the problem of testing multivariate normality when the data consist of a random sample of two-step monotone incomplete observations is considered, and a generalization of Mardia's statistic for measuring kurtosis is proposed.
Abstract: We consider the problem of testing multivariate normality when the data consist of a random sample of two-step monotone incomplete observations. We define for such data a generalization of Mardia’s statistic for measuring kurtosis, derive the asymptotic non-null distribution of the statistic under certain regularity conditions and against a broad class of alternatives, and provide an application to a well-known data set on cholesterol measurements.
TL;DR: In this article, a semi-parametric method for estimating the absorption probability and the distribution of the absorbing time of a growth-fragmentation model observed within a long time interval is presented.
Abstract: In the present paper, we focus on semi-parametric methods for estimating the absorption probability and the distribution of the absorbing time of a growth-fragmentation model observed within a long time interval. We establish that the absorption probability is the unique solution in an appropriate space of a Fredholm equation of the second kind whose parameters are unknown. We estimate this important characteristic of the underlying process by solving numerically the estimated Fredholm equation. Even if the study has been conducted for a particular model, our method is quite general.
TL;DR: In this article, it is shown that it is possible to obtain closed-form expressions for the exact distributions of the LRT statistics to test independence, equality of mean vectors and the equality of an expected value matrix to a given matrix.
Abstract: In this paper the authors show how it is possible to establish a common structure for the exact distribution of the main likelihood ratio test (LRT) statistics used in the complex multivariate normal setting. In contrast to what happens when dealing with real random variables, for complex random variables it is shown that it is possible to obtain closed-form expressions for the exact distributions of the LRT statistics to test independence, equality of mean vectors and the equality of an expected value matrix to a given matrix. For the LRT statistics to test sphericity and the equality of covariance matrices, cases where the exact distribution has a non-manageable expression, easy to implement and very accurate near-exact distributions are developed. Numerical studies show how these near-exact distributions outperform by far any other available approximations. As an example of application of the results obtained, the authors develop a near-exact approximation for the distribution of the LRT statistic to test the equality of several complex normal distributions.
TL;DR: In this article, the authors developed a methodology to deal with time series data on the unit interval modeled by a partially linear model with correlated disturbances from a Bayesian perspective, which can capture the dynamic evolution of curves using both non-stochastic explanatory variables and non-parametric components, allowing an accurate fit with a limited number of parameters.
Abstract: This paper is focused on developing a methodology to deal with time series data on the unit interval modeled by a partially linear model with correlated disturbances from a Bayesian perspective. In this context, the linear predictor of the beta regression model incorporates an unknown smooth function with time as an auxiliary covariate and a set of regressors. In addition, an autoregressive dependence structure is proposed for the errors of the model. This formulation can capture the dynamic evolution of curves using both non-stochastic explanatory variables and non-parametric components, allowing an accurate fit with a limited number of parameters. Diagnostic measures are derived from the case-deletion approach and an influence measure based on the Kullback–Leibler divergence is studied and thus, a new method to determine the optimal order of the autoregressive processes through an adaptive procedure using the conditional predictive ordinate statistic is presented. A simulation study is conducted to assess some properties of the Bayesian estimator. Finally, the proposed methodology is illustrated in two real-life applications.
TL;DR: In this article, a non-standard linear regression analysis is developed for the case where the dependent variable is left censored and some of the explanatory variables are measured with additive errors; the model is specified by assuming heavy-tailed distributions for the underlying probabilistic process.
Abstract: We develop a non-standard linear regression analysis by considering that the dependent variable is left censored and also that some of the explanatory variables are measured with additive errors. Our censored measurement error regression model is specified by assuming heavy-tailed distributions for the underlying probabilistic process. Specifically, we focus on assuming a multivariate \(t\) joint distribution for the error terms and the unobserved true covariates. For the model estimation, we consider the maximum likelihood methodology in which we include the estimation of the asymptotic variance of the maximum likelihood estimators. We also develop an EM algorithm to obtain the estimates. The performance of the newly developed methodology is evaluated through a simulation study as well as a case study analysis.
TL;DR: In this paper, a new concept of generalized elliptical location quantiles is proposed, inspired by nonlinear quantile regression; they may be less sensitive to outliers, be less rigid, employ more a priori information regarding the location of the distribution, and have higher potential for various regression generalizations than their common elliptical predecessor defined in the convex optimization framework.
Abstract: Inspired by nonlinear quantile regression, the article introduces, investigates, discusses, and illustrates a new concept of generalized elliptical location quantiles. They may require less stringent moment assumptions, be less sensitive to outliers, be less rigid, employ more a priori information regarding the location of the distribution, and have higher potential for various regression generalizations than their common elliptical predecessor defined in the convex optimization framework by means of standard linear quantile regression. Furthermore, they still include an equivalent of their predecessor as a special case and inherit most of its favorable features such as the probability interpretation, natural equivariance properties, and good behavior for elliptical and symmetric distributions, which is demonstrated both by theoretical results and data examples with convincing graphical output. On the other hand, the new elliptical quantiles need not always be uniquely defined and they require a somewhat different approach to their analysis and computation due to their intrinsically non-convex formulation.
TL;DR: In this article, a nonparametric adaptive strategy is proposed to estimate the relative density, a function recently used in two-sample problems and defined as the density of the variable \(F_0(X)\).
Abstract: This paper deals with the classical statistical problem of comparing the probability distributions of two real random variables \(X\) and \(X_0\), from a double independent sample. While most of the usual tools are based on the cumulative distribution functions \(F\) and \(F_0\) of the variables, we focus on the relative density, a function recently used in two-sample problems, and defined as the density of the variable \(F_0(X)\). We provide a nonparametric adaptive strategy to estimate the target function. We first define a collection of estimates using a projection on the trigonometric basis and a preliminary estimator of \(F_0\). An estimator is selected among this collection of projection estimates, with a criterion in the spirit of the Goldenshluger–Lepski methodology. We show the optimality of the procedure both in the oracle and the minimax sense: the convergence rate for the risk computed from an oracle inequality matches with the lower bound that we also derived. Finally, some simulations illustrate the method.
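A single projection estimate of the relative density can be sketched in a few lines. The code below is a minimal illustration, not the authors' procedure: it assumes the empirical distribution function of the \(X_0\) sample as the preliminary estimator of \(F_0\), fixes the number of frequencies `n_freq` by hand, and omits the Goldenshluger–Lepski selection step entirely. All function names are hypothetical.

```python
import bisect
import math
import random


def empirical_cdf(sample):
    """Return the empirical CDF of `sample` as a callable."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def relative_density_estimate(x_sample, x0_sample, n_freq):
    """Projection estimate of the density of F_0(X) on the trigonometric basis
    {1, sqrt(2) cos(2 pi j u), sqrt(2) sin(2 pi j u) : j = 1, ..., n_freq}."""
    f0_hat = empirical_cdf(x0_sample)          # assumed preliminary estimator
    u = [f0_hat(x) for x in x_sample]          # transformed sample in [0, 1]
    n = len(u)
    root2 = math.sqrt(2.0)
    coefs = []
    for j in range(1, n_freq + 1):
        a = sum(root2 * math.cos(2 * math.pi * j * ui) for ui in u) / n
        b = sum(root2 * math.sin(2 * math.pi * j * ui) for ui in u) / n
        coefs.append((j, a, b))

    def r_hat(t):
        value = 1.0  # the coefficient on the constant basis function equals 1
        for j, a, b in coefs:
            value += a * root2 * math.cos(2 * math.pi * j * t)
            value += b * root2 * math.sin(2 * math.pi * j * t)
        return value

    return r_hat


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    r_hat = relative_density_estimate(x, x0, n_freq=2)
    print([round(r_hat(t / 4), 2) for t in range(5)])
```

When the two samples share the same distribution, \(F_0(X)\) is uniform on \([0, 1]\) and the estimate should stay close to 1; departures from 1 indicate where the two distributions differ.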
TL;DR: A quantitative study of the robustness properties of the \(\ell _1\) and the Huber M-estimator on finite samples is presented and sharp error bounds are derived in terms of the leverage constants of a design matrix introduced here.
Abstract: A quantitative study of the robustness properties of the \(\ell _1\) and the Huber M-estimator on finite samples is presented. The focus is on the linear model involving a fixed design matrix and additive errors restricted to the dependent variables consisting of noise and sparse outliers. We derive sharp error bounds for the \(\ell _1\) estimator in terms of the leverage constants of a design matrix introduced here. A similar analysis is performed for Huber’s estimator using an equivalent problem formulation of independent interest. Our analysis considers outliers of arbitrary magnitude, and we recover breakdown point results as particular cases when outliers diverge. The practical implications of the theoretical analysis are discussed on two real datasets.
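The behaviour of the two estimators under sparse outliers can be seen in the simplest linear model, where the design matrix is a single column of ones and the \(\ell _1\) estimator reduces to the sample median. The sketch below is an illustration under stated assumptions: the tuning constant `delta = 1.345`, the unit noise scale, the toy data, and the iteratively reweighted averaging scheme are choices made here, not taken from the paper.

```python
import statistics


def huber_location(y, delta=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location (unit scale assumed), computed by
    iteratively reweighted averaging started from the median."""
    mu = statistics.median(y)
    for _ in range(max_iter):
        # Residuals within delta get weight 1; larger ones are down-weighted.
        weights = [1.0 if abs(v - mu) <= delta else delta / abs(v - mu)
                   for v in y]
        new_mu = sum(w * v for w, v in zip(weights, y)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


if __name__ == "__main__":
    # Eight clean observations around 0 plus two gross outliers (toy data).
    y = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, -0.05, 0.2, 50.0, 80.0]
    print("mean  :", statistics.fmean(y))
    print("l1    :", statistics.median(y))
    print("Huber :", huber_location(y))
```

On this toy sample the two outliers pull the mean above 10, while the median and the Huber estimate stay within the range of the clean observations. Making the two outliers larger leaves both robust estimates unchanged, since each outlier contributes only its sign (median) or a clipped residual (Huber) to the estimating equation.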
TL;DR: The rejoinder is organized in several sections, which address the main points raised by the discussants.
Abstract: We thank the discussants, the referees, and the associate editor for their stimulating discussions and helpful remarks. We thank the editor for giving us the opportunity to discuss our paper in this journal. The rejoinder is organized in several sections, which address the main points raised by the discussants.
TL;DR: In this paper, the authors propose two alternative formulas to identify causal effects in linear systems with an unobserved variable that acts as a direct confounder by exploiting the conditional independence structure of a directed acyclic graph.
Abstract: The instrumental variable (IV) formula has become widely used to address the issue of identification of a causal effect in linear systems with an unobserved variable that acts as direct confounder. We here propose two alternative formulations to achieve identification when the assumptions underlying the use of IV are violated. Parallel to the IV, the proposed formulas exploit the conditional independence structure of a directed acyclic graph and can be obtained via a series of univariate regressions, a feature that renders the results particularly attractive and easy to implement. By exploiting the notion of Markov equivalence, the derivations can also be applied to regression graphs, thereby enlarging the class of models to which the results are of use.
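For reference, the classical IV formula that the proposed formulas parallel can itself be written as a ratio of two univariate regression slopes. The sketch below illustrates only that classical formula, not the two alternative formulas of the paper; the simulated linear system, its coefficients, and all function names are assumptions made for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n, beta, seed=0):
    """Linear system Z -> X -> Y with an unobserved confounder U of X and Y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)                 # instrument
        ui = rng.gauss(0.0, 1.0)                 # unobserved confounder
        xi = zi + ui + rng.gauss(0.0, 1.0)
        yi = beta * xi + 3.0 * ui + rng.gauss(0.0, 1.0)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


def ols_slope(x, y):
    """Slope of the univariate regression of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """IV formula: slope of y on z divided by slope of x on z."""
    return ols_slope(z, y) / ols_slope(z, x)


if __name__ == "__main__":
    z, x, y = simulate(20000, beta=2.0)
    print("regression of Y on X:", round(ols_slope(x, y), 3))
    print("IV formula          :", round(iv_slope(z, x, y), 3))
```

In this simulated system the regression of Y on X is biased away from the causal coefficient by the confounder, whereas the ratio of the two regressions on the instrument recovers it up to sampling error.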
TL;DR: The authors are to be praised for a very original and elegant contribution to a difficult and important problem, and the goal is to develop estimators that are resistant to both types of outliers, and also fast enough for high-dimensional data.
Abstract: The authors are to be praised for a very original and elegant contribution to a difficult and important problem. I would like to point out that one of the fields where the need to consider cellwise outliers arose was computer vision. As far as I know, the efforts in this direction have mainly been concentrated on principal components. Here, the goal is to develop estimators that are resistant to both types of outliers, and also fast enough for high-dimensional data. Before the formalization by Alqallaf et al. (2009), an early proposal is due to De la Torre and Black (2001). Maronna and Yohai (2008) proposed a “perturbed MM estimator”. Candes et al. (2011) propose a quite different approach, which is very fast and is resistant to cellwise outliers, but not to casewise ones. Estimation of a scatter matrix presents difficulties of a different nature than those in principal components, and the authors have dealt with them very skillfully. I have just two comments for the authors: In Step I, instead of declaring the univariate outliers as missing, one could somehow impute them (eventually correcting the imputation after Step II). This would make it possible to apply in Step II a “standard” robust multivariate estimator, with considerable gains in computing speed. In Step II, a bisquare S estimator is employed. Rocke (1996) showed that the normal efficiency of these estimators tends to one with increasing dimension, but this gain is paid for by a decrease in robustness. It would be interesting if the procedure by
TL;DR: In this article, the authors present a methodology for constructing significance tests for composite alternative hypotheses that have no natural test statistic, and apply their methodology to construct exact tests for cross-tabulated data, and their motivating example is constructing a test for discovering Simpson's paradox.
Abstract: We present a methodology for constructing significance tests for “difficult” composite alternative hypotheses that have no natural test statistic. We apply our methodology to construct exact tests for cross-tabulated data, and our motivating example is constructing a test for discovering Simpson’s Paradox. Our tests are Bayesian extensions of the likelihood ratio test; they are optimal with respect to the prior distribution and are also closely related to Bayes factors and Bayesian FDR controlling testing procedures.
TL;DR: In this rejoinder, Bradley, Cressie, and Shi thank the discussants and amplify and clarify aspects of using local criteria to compare and select from a finite number of spatial predictors defined on a (possibly irregular) spatial lattice.
Abstract: First and foremost, we would like to thank the discussants: Stefano Castruccio and Marc G. Genton (CG), Alan E. Gelfand (GE), Finn Lindgren (LI), M.D. Ruiz-Medina (RM), and Martin P. Tingley and Ben A. Shaby (TS) for giving generously of their time to comment on our paper, “Comparing and Selecting Spatial Predictors Using Local Criteria” by Jonathan R. Bradley, Noel Cressie, and Tao Shi (BCS). Their comments and those of the editors and referees have given us this opportunity to amplify and clarify some aspects of using local criteria to compare and select spatial predictors. In BCS, we propose local criteria to compare and select from a finite number of spatial predictors defined on a (possibly irregular) spatial lattice D ≡ {u1, ...,uN } ⊂
TL;DR: In this article, a new concept of relative skewness among multivariate distributions is proposed, extending to the multivariate case a similar concept in the univariate case; it is defined by assuming the convexity of an appropriate transformation that maps one random vector onto the other.
Abstract: In this paper, we provide a new concept of relative skewness among multivariate distributions, extending to the multivariate case a similar concept in the univariate case. In this case, a random variable \(Y\) is said to be more right skewed than a random variable \(X\) if there exists an increasing convex transformation which maps \(X\) onto \(Y\). Given two random vectors \(\mathbf X\) and \(\mathbf Y\) and an appropriate transformation which maps \(\mathbf X\) onto \(\mathbf Y\), we define a new concept of relative skewness assuming the convexity of this transformation. Properties and applications of this concept are given.
TL;DR: The inclusion of reduced rank models as candidate predictors warrants further discussion, as the performance of these predictors has recently been questioned by Stein (2014), largely on the basis of the Kullback–Leibler divergence between the low rank approximation and the true data generating measure.
Abstract: Bradley et al. (2014) select between competing spatial predictors, including traditional stationary kriging and several non-stationary and reduced rank models, based on expected local squared prediction error. The inclusion of reduced rank models as candidate predictors warrants further discussion, as the performance of these predictors has recently been questioned by Stein (2014), largely on the basis of the Kullback–Leibler divergence between the low rank approximation and the true data generating measure. Applications in Stein (2014) further show that reduced rank methods can result in inefficient spatial interpolation as measured by mean squared prediction error. In their simulation study, Bradley et al. (2014) focus on evaluating the mean squared prediction errors of the Locally Selected Predictor as a number of parameters of the simulation, including the signal to noise ratio and the neighborhood used to select
TL;DR: A varying-coefficient nonparametric approach to the analysis of three-level clustered data models is developed, and it is shown that the resultant estimators are asymptotically normal, and exhibit considerably smaller asymptotic variability than the traditional local polynomial estimators that neglect the correlations within and among clusters.
Abstract: Most of the literature on clustered data models emphasizes two-level clustering, and within-cluster correlation. While multi-level clustered data models can arise in practice, analysis of multi-level clustered data models poses additional difficulties owing to the existence of error correlations both within and across the clusters. It is perhaps for this reason that existing approaches to multi-level clustered data models have been mostly parametric. The purpose of this paper is to develop a varying-coefficient nonparametric approach to the analysis of three-level clustered data models. Because the nonparametric functions are restricted only to some of the variables, this approach has the appeal of avoiding many of the curse of dimensionality problems commonly associated with other nonparametric methods. By applying an undersmoothing technique, taking into account the correlations within and across clusters, we develop an efficient two-stage local polynomial estimation procedure for the unknown coefficient functions. The large and finite sample properties of the resultant estimators are examined; in particular, we show that the resultant estimators are asymptotically normal, and exhibit considerably smaller asymptotic variability than the traditional local polynomial estimators that neglect the correlations within and among clusters. An application example is presented based on a data set extracted from the World Bank’s STARS database.