TL;DR: A robust functional principal component analysis is proposed to find the linear combinations of the original variables that contain most of the information, even if there are outliers, and to flag functional outliers.
Abstract: Functional principal component analysis is the preliminary step to represent the data in a lower dimensional space and to capture the main modes of variability of the data by means of a small number of components which are linear combinations of the original variables. The sensitivity of the variance and covariance functions to irregular observations makes this method vulnerable to outliers, so that it may fail to capture the variation of the regular observations. In this study, we propose a robust functional principal component analysis to find the linear combinations of the original variables that contain most of the information, even if there are outliers, and to flag functional outliers. We demonstrate the performance of the proposed method on an extensive simulation study and two datasets from chemometrics and environment.
TL;DR: It is investigated which transformations of compositional data lead to invariance of the resulting discriminant rules, and it is shown that robust parameter estimation requires not only an appropriate transformation but also affine equivariant estimators of location and covariance.
Abstract: Compositional data, i.e. data including only relative information, need to be transformed prior to applying the standard discriminant analysis methods that are designed for the Euclidean space. Here it is investigated for linear, quadratic, and Fisher discriminant analysis, which of the transformations lead to invariance of the resulting discriminant rules. Moreover, it is shown that for robust parameter estimation not only an appropriate transformation, but also affine equivariant estimators of location and covariance are needed. An example and simulated data demonstrate the effects of working in an inappropriate space for discriminant analysis.
TL;DR: The paper presents smooth estimation of densities utilizing penalized splines, to represent the unknown density by a convex mixture of basis densities, where the weights are estimated in a penalized form and it shows stable and powerful behaviour.
Abstract: The paper presents smooth estimation of densities utilizing penalized splines. The idea is to represent the unknown density by a convex mixture of basis densities, where the weights are estimated in a penalized form. The proposed method extends the work of Komarek and Lesaffre (Comput Stat Data Anal 52(7):3441–3458, 2008) and allows for general density estimation. Simulations show a convincing performance in comparison to existing density estimation routines. The idea is extended to allow the density to depend on some (factorial) covariate. Assuming a binary group indicator, for instance, we can test on equality of the densities in the groups. This provides a smooth alternative to the classical Kolmogorov-Smirnov test or an Analysis of Variance and it shows stable and powerful behaviour.
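The representation of a density as a convex mixture of basis densities can be sketched as follows. This is a minimal illustration only: the Gaussian basis densities, the knot locations, and the softmax parametrization of the weights are assumptions made here for the example, and the penalized estimation of the weights described in the abstract is not shown.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def softmax(betas):
    """Map unconstrained coefficients to weights that are positive and sum to one."""
    m = max(betas)  # subtract the maximum for numerical stability
    exps = [math.exp(b - m) for b in betas]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_density(x, betas, knots, sd):
    """Convex mixture of Gaussian basis densities centred at the knots."""
    weights = softmax(betas)
    return sum(w * normal_pdf(x, k, sd) for w, k in zip(weights, knots))

# Hypothetical knots and coefficients, chosen only for illustration.
knots = [-2.0, -1.0, 0.0, 1.0, 2.0]
betas = [0.0, 0.5, 1.0, 0.5, 0.0]
weights = softmax(betas)

# Riemann-sum check that the mixture integrates to (approximately) one.
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
area = sum(mixture_density(x, betas, knots, 1.0) for x in grid) * step
```

Because the weights are positive and sum to one, the mixture is a proper density for any value of the coefficients, which is what makes an unconstrained penalized fit of the coefficients possible.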
TL;DR: A boxplot version for a circular data set, called the circular boxplot is proposed, which performs better when both the value of κ and the sample size are larger and is developed in S-Plus.
Abstract: A boxplot is a simple and flexible graphical tool which has been widely used in exploratory data analysis. One of its main applications is to identify extreme values and outliers in a univariate data set. While the boxplot is useful for a real line data set, it is not suitable for a circular data set due to the fact that there is no natural ordering of circular observations. In this paper, we propose a boxplot version for a circular data set, called the circular boxplot. The problem of finding the appropriate circular boxplot criterion of the form ν × CIQR, where CIQR is the circular interquartile range and ν is the resistant constant, is investigated through a simulation study. As might be expected, we find that the choice of ν depends on the value of the concentration parameter κ. Another simulation study is done to investigate the performance of the circular boxplot in detecting a single outlier. Our results show that the circular boxplot performs better when both the value of κ and the sample size are larger. We develop a visual display for the circular boxplot in S-Plus and illustrate its application using two real circular data sets.
TL;DR: A naïve bootstrap estimation of the square trace correlation criterion of the effective dimension reduction (EDR) space is introduced to allow selection of an “optimal” number of slices and a 3D-graphical tool is proposed which can be useful to select the suitable couple (H, K).
Abstract: Sliced inverse regression (SIR) and related methods were introduced in order to reduce the dimensionality of regression problems. In a general semiparametric regression framework, these methods determine linear combinations of a set of explanatory variables X related to the response variable Y, without losing information on the conditional distribution of Y given X. They are based on a “slicing step” in the population and sample versions. They are sensitive to the choice of the number H of slices, and this is particularly true for SIR-II and SAVE methods. At the moment there are neither theoretical results nor practical techniques which allow the user to choose an appropriate number of slices. In this paper, we propose an approach based on the quality of the estimation of the effective dimension reduction (EDR) space: the square trace correlation between the true EDR space and its estimate can be used as goodness of estimation. We introduce a naive bootstrap estimation of the square trace correlation criterion to allow selection of an “optimal” number of slices. Moreover, this criterion can also simultaneously select the corresponding suitable dimension K (number of linear combinations of X). From a practical point of view, the choice of these two parameters H and K is essential. We propose a 3D-graphical tool, implemented in R, which can be useful to select the suitable couple (H, K). An R package named “edrGraphicalTools” has been developed. In this article, we focus on the SIR-I, SIR-II and SAVE methods. Moreover, the proposed criterion can be used to determine which method seems to be efficient to recover the EDR space, that is, the structure between Y and X. We indicate how the proposed criterion can be used in practice. A simulation study is performed to illustrate the behavior of this approach and the need for selecting properly the number H of slices and the dimension K. A short real-data example is also provided.
TL;DR: The present paper provides another solution based on the fact that the quantile regions can also be computed from a competing concept of projection regression quantiles, elaborated in Kong and Mizera and in Paindaveine and Šiman; its Matlab implementation is faster than the earlier one in various cases.
Abstract: In the multiple-output regression context, Hallin et al. (Ann Statist 38:635–669, 2010) introduced a powerful data-analytical tool based on regression quantile regions. However, the computation of these regions, that are obtained by considering in all directions an original concept of directional regression quantiles, is a very challenging problem. Paindaveine and Siman (Comput Stat Data Anal 2011b) described a first elegant solution relying on linear programming techniques. The present paper provides another solution based on the fact that the quantile regions can also be computed from a competing concept of projection regression quantiles, elaborated in Kong and Mizera (Quantile tomography: using quantiles with multivariate data 2008) and Paindaveine and Siman (J Multivar Anal 2011a). As a by-product, this alternative solution further provides various characteristics useful for statistical inference. We describe in detail the algorithm solving the parametric programming problem involved, and illustrate the resulting procedure on simulated data. We show through simulations that the Matlab implementation of the algorithm proposed in this paper is faster than that from Paindaveine and Siman (Comput Stat Data Anal 2011b) in various cases.
TL;DR: This paper develops a Bayesian approach to analyzing quantile regression models for censored dynamic panel data by employing a likelihood-based approach using the asymmetric Laplace error distribution and introducing lagged observed responses into the conditional quantile function.
Abstract: This paper develops a Bayesian approach to analyzing quantile regression models for censored dynamic panel data. We employ a likelihood-based approach using the asymmetric Laplace error distribution and introduce lagged observed responses into the conditional quantile function. We also deal with the initial conditions problem in dynamic panel data models by introducing correlated random effects into the model. For posterior inference, we propose a Gibbs sampling algorithm based on a location-scale mixture representation of the asymmetric Laplace distribution. It is shown that the mixture representation provides fully tractable conditional posterior densities and considerably simplifies existing estimation procedures for quantile regression models. In addition, we explain how the proposed Gibbs sampler can be utilized for the calculation of marginal likelihood and the modal estimation. Our approach is illustrated with real data on medical expenditures.
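The location-scale mixture representation of the asymmetric Laplace distribution mentioned above can be illustrated with a short simulation. The sketch assumes the standard form with unit scale, in which an error term is written as theta * z + tau * sqrt(z) * u with z exponential and u standard normal; the function name is hypothetical, and the Gibbs sampler itself is not reproduced here.

```python
import math
import random

def sample_asymmetric_laplace(p, rng):
    """Draw one asymmetric Laplace variate (unit scale) via the normal-exponential mixture."""
    theta = (1.0 - 2.0 * p) / (p * (1.0 - p))
    tau = math.sqrt(2.0 / (p * (1.0 - p)))
    z = rng.expovariate(1.0)   # exponential mixing variable with mean one
    u = rng.gauss(0.0, 1.0)    # standard normal component
    return theta * z + tau * math.sqrt(z) * u

rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.25
draws = [sample_asymmetric_laplace(p, rng) for _ in range(20000)]

# The p-th quantile of the error distribution is zero, so roughly a share p
# of the draws should fall at or below zero.
share_below_zero = sum(d <= 0.0 for d in draws) / len(draws)
```

Conditional on z the error is normal, which is why the conditional posterior densities in a sampler built on this representation take standard forms.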
TL;DR: A generic framework for component-based algorithms design that enhances understanding, testing and usability of decision tree algorithm parts and allows exchanging of solutions from various algorithms and fast design of new algorithms.
Abstract: We propose a generic decision tree framework that supports reusable components design. The proposed generic decision tree framework consists of several sub-problems which were recognized by analyzing well-known decision tree induction algorithms, namely ID3, C4.5, CART, CHAID, QUEST, GUIDE, CRUISE, and CTREE. We identified reusable components in these algorithms as well as in several of their partial improvements that can be used as solutions for sub-problems in the generic decision tree framework. The identified components can now be used outside the algorithm they originate from. Combining reusable components allows the replication of original algorithms, their modification but also the creation of new decision tree induction algorithms. Every original algorithm can outperform other algorithms under specific conditions but can also perform poorly when these conditions change. Reusable components allow exchanging of solutions from various algorithms and fast design of new algorithms. We offer a generic framework for component-based algorithms design that enhances understanding, testing and usability of decision tree algorithm parts.
TL;DR: This paper proposes the estimation approach of VaR using semiparametric support vector quantile regression (SSVQR) models which are functions of the one-step-ahead volatility forecast and the length of the holding period, and can be used regardless of the distribution.
Abstract: Value at Risk (VaR) has been used as an important tool to measure the market risk under normal market conditions. Usually the VaR of log returns is calculated by assuming a normal distribution. However, log returns are frequently found not to be normally distributed. This paper proposes an approach to estimating VaR using semiparametric support vector quantile regression (SSVQR) models which are functions of the one-step-ahead volatility forecast and the length of the holding period, and can be used regardless of the distribution. We find that the proposed models perform better overall than the variance-covariance and linear quantile regression approaches for return data on S&P 500, NIKEI 225 and KOSPI 200 indices.
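For orientation, the normal-distribution baseline that the abstract contrasts with SSVQR can be sketched as below. This is not the proposed method: the function name, the square-root-of-time scaling over the holding period (which assumes independent, identically distributed returns), and the toy return series are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_var(returns, alpha=0.01, holding_days=1):
    """Parametric VaR of log returns under a normal assumption, reported as a positive loss."""
    mu = mean(returns)
    sigma = stdev(returns)
    z = NormalDist().inv_cdf(alpha)  # lower-tail standard normal quantile (negative)
    h = holding_days
    # Mean scales with h, standard deviation with sqrt(h) under the i.i.d. assumption.
    return -(mu * h + z * sigma * math.sqrt(h))

# Hypothetical daily log returns, not data from the paper.
toy_returns = [0.01, -0.02, 0.005, -0.01, 0.015, -0.005, 0.0, 0.02, -0.015, 0.0]
var_1d = normal_var(toy_returns, alpha=0.01, holding_days=1)
var_10d = normal_var(toy_returns, alpha=0.01, holding_days=10)
```

When returns have heavier tails than the normal distribution, this baseline tends to understate tail risk, which is the motivation for distribution-free quantile regression approaches.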
TL;DR: Near-exact approximations to the distribution of the likelihood ratio test statistic, arising from a different method of approximating distributions, emerge as a desirable goal and are much closer to the exact distribution than the usual asymptotic distributions.
Abstract: The exact distribution of the likelihood ratio test statistic to test the equality of several variance-covariance matrices has a non-manageable form. On the other hand, the existing asymptotic approximations do not exhibit the necessary precision for many applications. For these reasons, the development of near-exact approximations to the distribution of this statistic, arising from a different method of approximating distributions, emerges as a desirable goal. These distributions, while manageable, are much closer to the exact distribution than the usual asymptotic distributions and, unlike the latter, are also asymptotic for an increasing number of variables and matrices involved. Computational modules to implement the near-exact distributions are made available on a website.
TL;DR: A new multivariate volatility model is proposed that combines the appealing properties of the stable Paretian distribution to model the heavy tails with the GARCH model to capture the volatility clustering.
Abstract: In this paper a new multivariate volatility model is proposed. It combines the appealing properties of the stable Paretian distribution to model the heavy tails with the GARCH model to capture the volatility clustering. Returns on assets are assumed to follow a sub-Gaussian distribution, which is a particular multivariate stable distribution. In this way the characteristic function of the fitted returns has a tractable expression and the density function can be recovered by numerical methods. A multivariate GARCH structure is then adopted to model the covariance matrix of the Gaussian vectors underlying the sub-Gaussian system. The model is applied to a bivariate series of daily US stock returns. Value-at-risk for long and short positions is computed and compared with the one obtained using the multivariate normal and the multivariate Student’s t distribution. Finally, exploiting the recent developments in the vast dimensional time-varying covariances modeling, possible feasible extensions of our model to higher dimensions are suggested and an illustrative example using the Dow Jones index components is presented.
TL;DR: This paper calculates response surface models for a large range of quantiles of the Leybourne test for the null hypothesis of a unit root against the alternative of (trend) stationarity and indicates that the critical values depend on the method used to select the number of lags.
Abstract: This paper calculates response surface models for a large range of quantiles of the Leybourne (Oxf Bull Econ Stat 57:559–571, 1995) test for the null hypothesis of a unit root against the alternative of (trend) stationarity. The response surface models allow the estimation of critical values for different combinations of number of observations, T, and lag order in the test regressions, p, where the latter can be either specified by the user or optimally selected using a data-dependent procedure. The results indicate that the critical values depend on the method used to select the number of lags. An Excel spreadsheet is available to calculate the p-value associated with a test statistic.
TL;DR: Asymptotic approximations to the distributions of the ability estimator and its transformations in item response theory are derived beyond the usual normal one when associated item parameters are given as in tailored testing.
Abstract: Asymptotic approximations to the distributions of the ability estimator and its transformations in item response theory are derived beyond the usual normal one when associated item parameters are given as in tailored testing. For the approximations, the asymptotic cumulants of the estimators up to the fourth order with the higher-order asymptotic variances are obtained under possible model misspecification. For testing and interval estimation of abilities, the asymptotic cumulants of the pivots studentized in four ways are derived. Numerical examples with simulations including those for confidence intervals for abilities are given using the three-parameter logistic model.
TL;DR: This paper presents an algorithm that embeds lasso in an iterative procedure that alternately computes weights and performs lasso-wise regression, and shows that the proposed method outperforms linear and local models for several kinds of scenarios.
Abstract: Locally weighted regression is a technique that predicts the response for new data items from their neighbors in the training data set, where closer data items are assigned higher weights in the prediction. However, the original method may suffer from overfitting and fail to select the relevant variables. In this paper we propose combining a regularization approach with locally weighted regression to achieve sparse models. Specifically, the lasso is a shrinkage and selection method for linear regression. We present an algorithm that embeds lasso in an iterative procedure that alternately computes weights and performs lasso-wise regression. The algorithm is tested on three synthetic scenarios and two real data sets. Results show that the proposed method outperforms linear and local models for several kinds of scenarios.
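The plain locally weighted regression that the abstract starts from can be sketched for a single predictor as below. The Gaussian kernel, the bandwidth, and the function name are assumptions for illustration; the lasso penalty and the iterative reweighting of the proposed algorithm are not included.

```python
import math

def locally_weighted_predict(x0, xs, ys, bandwidth):
    """Predict the response at x0 by a kernel-weighted straight-line fit."""
    # Closer training points receive higher weights (Gaussian kernel).
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    # Weighted least squares slope around the weighted means.
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return my + slope * (x0 - mx)

# Toy data lying exactly on the line y = 2x + 1.
xs = [float(i) for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
prediction = locally_weighted_predict(3.5, xs, ys, bandwidth=1.0)
```

With several predictors, each local fit becomes a weighted multiple regression, which is the point at which a shrinkage step such as the lasso can be applied to select variables.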
TL;DR: Goodness-of-fit tests are proposed for the case of independent observations coming from the same family of distributions but with different parameters, and based on suitable estimators of the parameters involved, the data are transformed to normality.
Abstract: Goodness-of-fit tests are proposed for the case of independent observations coming from the same family of distributions but with different parameters. The most popular related context is that of generalized linear models (GLMs) where the mean of the distribution varies with regressors. In the proposed procedures, and based on suitable estimators of the parameters involved, the data are transformed to normality. Then any test for normality for i.i.d. data may be applied. The method suggested is in full generality as it may be applied to arbitrary laws with continuous or discrete distribution functions, provided that an efficient method of estimation exists for the parameters. We investigate by Monte Carlo the relative performance of classical tests based on the empirical distribution function, in comparison to a corresponding test which instead of the empirical distribution function, utilizes the empirical characteristic function. Standard measures of goodness-of-fit often used in the context of GLM are also included in the comparison. The paper concludes with several real-data examples.
TL;DR: A genetic algorithm for identifying and estimating nonlinear nonstationary models for time series, where the series is generated from an autoregressive equation whose coefficients change both according to time and the delayed values of the series itself, switching between several regimes.
Abstract: Nonlinear nonstationary models for time series are considered, where the series is generated from an autoregressive equation whose coefficients change both according to time and the delayed values of the series itself, switching between several regimes. The transition from one regime to the next one may be discontinuous (self-exciting threshold model), smooth (smooth transition model) or continuous linear (piecewise linear threshold model). A genetic algorithm for identifying and estimating such models is proposed, and its behavior is evaluated through a simulation study and application to temperature data and a financial index.
TL;DR: This paper proposes a new step called the P-step to handle the linear or nonlinear equality constraint in addition to the conventional EM algorithm, and shows finally that there exists a Kuhn–Tucker vector at the limit point produced by the method.
Abstract: This paper proposes a new step called the P-step to handle the linear or nonlinear equality constraint in addition to the conventional EM algorithm. This new step is easy to implement, first because only the first derivatives of the object function and the constraint function are necessary, and secondly, because the P-step is carried out after the conventional EM algorithm. The estimate sequence produced by our method enjoys a monotonic increase in the observed likelihood function. We apply the P-step in addition to the conventional EM algorithm to the two illustrative examples. The first example has a linear constraint function. The second has a nonlinear constraint function. We show finally that there exists a Kuhn–Tucker vector at the limit point produced by our method.
TL;DR: A formal parameter estimation procedure for the two-parameter M-Wright distribution is proposed, and closed-form estimators are derived from the moments of the log-transformed M-Wright distributed random variable, and are shown to be asymptotically normal.
Abstract: A formal parameter estimation procedure for the two-parameter M-Wright distribution is proposed. This procedure is necessary to make the model useful for real-world applications. Note that its generalization of the Gaussian density makes the M-Wright distribution appealing to practitioners. Closed-form estimators are also derived from the moments of the log-transformed M-Wright distributed random variable, and are shown to be asymptotically normal. Tests using simulated data indicated favorable results for our estimation procedure.
TL;DR: This paper proposes goodness-of-fit methods for a concentrated von Mises-Fisher distribution and analyzes by simulation some questions concerning the application of these tests.
Abstract: The von Mises-Fisher distribution is widely used for modelling directional data. In this paper we propose goodness-of-fit methods for a concentrated von Mises-Fisher distribution and we analyse by simulation some questions concerning the application of these tests. We analyse the empirical power of the Kolmogorov-Smirnov test for several dimensions of the sphere, supposing as alternative hypothesis a mixture of two von Mises-Fisher distributions with known parameters. We also compare the empirical power of the Kolmogorov-Smirnov test with that of Rao’s score test for data on the sphere, supposing as alternative hypothesis a mixture of two Fisher distributions with unknown parameters replaced by their maximum likelihood estimates or a 5-parameter Fisher-Bingham distribution. Finally, we give an example with real spherical data.
TL;DR: This work addresses the problem of gradient estimation with respect to four characterizing parameters of the Meixner distribution and Lévy process with the help of the explicit marginal probability density function and proposes a substantially simple exact simulation method, based on acceptance-rejection sampling and the Esscher density transform.
Abstract: We address the problem of gradient estimation with respect to four characterizing parameters of the Meixner distribution and Lévy process. With the help of the explicit marginal probability density function, the likelihood ratio method is directly applicable, while unbiased estimators may contain infinite random series in their score function. We quantify the estimator bias arising when the infinite series is truncated to a finite number of terms. We further propose a substantially simple exact simulation method for the Meixner distribution, based on acceptance-rejection sampling and the Esscher density transform. Numerical results are presented in the context of financial Greeks to illustrate the effectiveness of our formulas along with bias estimates.
TL;DR: It is shown that for many classical testing examples, the Bayesian evidence derived by the new testing procedure is not contradictory to its frequentist counterpart any more and is usually coincident with the frequentist observed significance level.
Abstract: Testing point null hypotheses is a very common activity in various applied situations. However, the existing Bayesian testing procedure may give evidence which does not agree with the classical frequentist p-value in many point null testing situations. A typical example for this is the well known Lindley’s paradox (Lindley in Biometrika 44:187–192, 1957). In this paper we propose an alternative testing procedure in the Bayesian framework. It is shown that for many classical testing examples, the Bayesian evidence derived by our new testing procedure is not contradictory to its frequentist counterpart any more. In fact, the new Bayesian evidence under the noninformative prior is usually coincident with the frequentist observed significance level.
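The disagreement behind Lindley's paradox can be made concrete with a small calculation. The sketch below uses the conventional Bayesian test of a normal mean (a point mass on the null and a normal prior under the alternative), not the alternative procedure proposed in the paper; the prior variance, the equal prior odds, and the function names are assumptions for illustration.

```python
import math
from statistics import NormalDist

def two_sided_p_value(z):
    """Frequentist two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def posterior_prob_null(z, n, sigma=1.0, tau=1.0, prior_null=0.5):
    """Posterior probability of H0: mean = 0 against H1: mean ~ N(0, tau^2).

    z is the usual statistic sqrt(n) * xbar / sigma with known sigma.
    """
    ratio = n * tau ** 2 / sigma ** 2
    # Bayes factor in favour of H0, from the two marginal densities of xbar.
    bf01 = math.sqrt(1.0 + ratio) * math.exp(-0.5 * z * z * ratio / (1.0 + ratio))
    odds = bf01 * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)

p_value = two_sided_p_value(1.96)
post_small_n = posterior_prob_null(1.96, n=10)
post_large_n = posterior_prob_null(1.96, n=10000)
```

With the statistic fixed at 1.96, the p-value stays near 0.05 for every sample size, while under these assumptions the posterior probability of the null rises above 0.9 at n = 10000. This is the kind of conflict that the proposed procedure is intended to avoid.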
TL;DR: This paper explains the negative assessment of the simulated maximum likelihood algorithm by the coupling of the algorithm with “simple importance samplers”, samplers that are not explicitly parameter dependent, and illustrates the theoretical points by applying Laplace importance sampling in both variants to the classic salamander mating model.
Abstract: There exists an overall negative assessment of the performance of the simulated maximum likelihood algorithm in the statistics literature, founded on both theoretical and empirical results. At the same time, there also exist a number of highly successful applications. This paper explains the negative assessment by the coupling of the algorithm with “simple importance samplers”, samplers that are not explicitly parameter dependent. The successful applications in the literature are based on explicitly parameter dependent importance samplers. Simple importance samplers may efficiently simulate the likelihood function value, but fail to efficiently simulate the score function, which is the key to efficient simulated maximum likelihood. The theoretical points are illustrated by applying Laplace importance sampling in both variants to the classic salamander mating model.
TL;DR: A new boosting method for a kind of noisy data is developed, where the probability of mislabeling depends on the label of a case, and the mechanism is based on a simple idea and gives natural interpretation as a mislabel model.
Abstract: A new boosting method for a kind of noisy data is developed, where the probability of mislabeling depends on the label of a case. The mechanism of the model is based on a simple idea and gives natural interpretation as a mislabel model. The boosting algorithm is derived from an extension of the exponential loss function, which provides the AdaBoost algorithm. A connection between the proposed method and an asymmetric mislabel model is shown. It is also shown that the loss function proposed constructs a classifier which attains the minimum error rate for a true label. Numerical experiments illustrate how well the proposed method performs in comparison to existing methods.
TL;DR: This work gives the exact first four moments of the null distribution for the statistic using a computer algebra system and gives a normalizing transformation of the statistic based on the Wilson–Hilferty transformation.
Abstract: We propose a new omnibus test statistic for normality based on the Jarque–Bera test statistic. We give the exact first four moments of the null distribution for the statistic using a computer algebra system. Our proposed statistic is an improvement of Jarque–Bera test statistic. Then the cumulants of the standardized statistic satisfy the Cornish–Fisher assumption. We give a normalizing transformation of the statistic based on the Wilson–Hilferty transformation.
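For reference, the classical Jarque–Bera statistic and the Wilson–Hilferty cube-root transformation can be sketched as follows. This is the textbook version only: the improved statistic and its exact null moments from the paper are not reproduced, and applying the transformation with 2 degrees of freedom (the asymptotic chi-square reference for Jarque–Bera) is an assumption made for illustration.

```python
import math

def jarque_bera(sample):
    """Classical Jarque-Bera statistic n/6 * (S^2 + (K - 3)^2 / 4)."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

def wilson_hilferty(x, df):
    """Approximately standard normal score for a chi-square(df) value x."""
    c = 2.0 / (9.0 * df)
    # (x / df)^(1/3) is approximately normal with mean 1 - c and variance c.
    return ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)

# Small symmetric toy sample, so the skewness term vanishes.
toy_sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb = jarque_bera(toy_sample)
score = wilson_hilferty(jb, df=2)
```

The cube-root transformation pulls in the long right tail of a chi-square-like statistic, which is why it is a natural candidate for normalizing a statistic of this type.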
TL;DR: It is shown that through the use of compactly supported radial basis functions it becomes possible to recover the band structured matrix feature of univariate spline smoothing and thereby obtain a fast computational algorithm.
Abstract: In this paper, we develop a fast algorithm for a smoothing spline estimator in multivariate regression. To accomplish this, we employ general concepts associated with roughness penalty methods in conjunction with the theory of radial basis functions and reproducing kernel Hilbert spaces. It is shown that through the use of compactly supported radial basis functions it becomes possible to recover the band structured matrix feature of univariate spline smoothing and thereby obtain a fast computational algorithm. Given n data points in $\mathbb{R}^2$, the new algorithm has complexity $O(n^2)$ compared to $O(n^3)$, the order for the thin plate multivariate smoothing splines.
TL;DR: This work considers the task of Bayesian inference of the mean of normal observations when the available data have been discretized and when no prior knowledge about the mean and the variance exists and concludes that it is difficult to provide a widely accepted prior for the considered problem.
Abstract: We consider the task of Bayesian inference of the mean of normal observations when the available data have been discretized and when no prior knowledge about the mean and the variance exists. An application is presented which illustrates that the discretization of the data should not be ignored when their variability is of the order of the discretization step. We show that the standard (noninformative) prior for location-scale family distributions is no longer appropriate. We work out the reference prior of Berger and Bernardo, which leads to different and more reasonable results. However, for this prior the posterior also shows some non-desirable properties. We argue that this is due to the inherent difficulty of the considered problem, which also affects other methods of inference. We therefore complement our analysis by an empirical Bayes approach. While such proceeding overcomes the disadvantages of the standard and reference priors and appears to provide a reasonable inference, it may raise conceptual concerns. We conclude that it is difficult to provide a widely accepted prior for the considered problem.
TL;DR: This paper presents a flexible scan test statistic to detect disease clusters in data sets represented as a hierarchical tree; the MDL-based method reaches an automatic compromise between bias and variance and identifies clusters that are scattered on the tree with high accuracy.
Abstract: This paper presents a flexible scan test statistic to detect disease clusters in data sets represented as a hierarchical tree. The algorithm searches through the branches of the tree and it is able to aggregate leaves located in different branches. The test statistic combines two terms, the log-likelihood of the data and the amount of information necessary to computationally code each potential cluster. This second term penalizes the search algorithm, avoiding the detection of oddly shaped clusters, and it is based on the Minimum Description Length (MDL) principle. Our MDL method reaches an automatic compromise between bias and variance. We present simulated results showing its power performance as compared to the usual scan statistic and the high accuracy of the MDL in identifying clusters that are scattered on the tree. The MDL method is illustrated with a large database looking at the relationship between occupation and death from silicosis.
TL;DR: A hybrid bootstrap procedure for augmented Dickey-Fuller tests for the presence of a unit root that combines a time domain parametric autoregressive fit to the data and a nonparametric correction applied in the frequency domain to capture features that are possibly not represented by the parametric model.
Abstract: In this paper, we propose a hybrid bootstrap procedure for augmented Dickey-Fuller (ADF) tests for the presence of a unit root. This hybrid proposal combines a time domain parametric autoregressive fit to the data and a nonparametric correction applied in the frequency domain to capture features that are possibly not represented by the parametric model. It is known that considerable size and power problems can occur in small samples for unit root testing in the presence of an MA parameter using critical values of the asymptotic Dickey-Fuller distribution. The benefit of the sieve bootstrap in this situation has been investigated by Chang and Park (J Time Ser Anal 24:379–400, 2003). They showed asymptotic validity as well as substantial improvements for small sample sizes, but the actual sizes of their bootstrap tests were still quite far away from the nominal size. The finite sample performances of our procedure are extensively investigated through Monte Carlo simulations and compared to the sieve bootstrap approach. Regarding the size of the tests, our results show that the hybrid bootstrap remarkably outperforms the sieve bootstrap.
TL;DR: In this article, an explicit form of confidence intervals for the treatment effect in the random effects meta-analysis model obtained from the Harville-Jeske-Kenward-Roger approach is given.
Abstract: An explicit form of confidence intervals for the treatment effect in the random effects meta-analysis model obtained from the Harville–Jeske–Kenward–Roger approach is given. These restricted likelihood based intervals are compared to alternative procedures commonly used in collaborative studies when the number of participants is small and study-specific variances are heterogeneous. Monte Carlo simulation experiments show that the former intervals have quite conservative coverage probabilities and favor the latter intervals.