TL;DR: This work sets up a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state “θ1 > θ2 with confidence,” or “no claim with confidence.”
Abstract: In classical statistics, the significance of comparisons (e.g., θ1 − θ2) is calibrated using the Type 1 error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications. We set up a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state “θ1 > θ2 with confidence,” “θ2 > θ1 with confidence,” or “no claim with confidence.” We focus on the Type S (for sign) error, which occurs when you claim “θ1 > θ2 with confidence” when θ2 > θ1 (or vice versa). We compute the Type S error rates for classical and Bayesian confidence statements and find that classical Type S error rates can be extremely high (up to 50%). Bayesian confidence statements are conservative, in the sense that claims based on 95% posterior intervals have Type S error rates between 0 and 2.5%. For multiple comparison situations, the conclusions are similar.
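The Type S error rate described above can be explored by simulation. The sketch below is only an illustration under assumptions that are not taken from the abstract: the true difference θ = θ1 − θ2 is drawn from N(0, τ²), the estimate y from N(θ, σ²), and the Bayesian analyst knows τ. The function name `type_s_rates` and all parameter names are hypothetical.

```python
import random

def type_s_rates(tau, sigma=1.0, n_sims=200_000, seed=0):
    """Estimate Type S error rates (wrong sign, given that a claim is made).

    Assumed toy model: theta ~ N(0, tau^2), y | theta ~ N(theta, sigma^2).
    """
    rng = random.Random(seed)
    z = 1.959964  # two-sided 95% normal quantile
    shrink = tau**2 / (tau**2 + sigma**2)   # posterior mean is shrink * y
    post_sd = (shrink * sigma**2) ** 0.5    # posterior standard deviation
    counts = {"classical": [0, 0], "bayes": [0, 0]}  # [claims, sign errors]
    for _ in range(n_sims):
        theta = rng.gauss(0.0, tau)
        y = rng.gauss(theta, sigma)
        wrong = (y > 0) != (theta > 0)
        # Classical claim: the 95% confidence interval excludes zero.
        if abs(y) > z * sigma:
            counts["classical"][0] += 1
            counts["classical"][1] += wrong
        # Bayesian claim: the 95% posterior interval excludes zero.
        if abs(shrink * y) > z * post_sd:
            counts["bayes"][0] += 1
            counts["bayes"][1] += wrong
    return {k: (errs / claims if claims else float("nan"))
            for k, (claims, errs) in counts.items()}

if __name__ == "__main__":
    for tau in (0.5, 1.0, 2.0):
        print(tau, type_s_rates(tau))
```

In this toy setting, making τ small relative to σ pushes the classical rate upward, while the Bayesian rate should stay below 2.5%, consistent with the bounds quoted in the abstract.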
TL;DR: It is found that the Bayesian approach with a particular choice of diffuse inverse Wishart prior distribution for the (co)variance parameters performs at least as well—in terms of bias of estimates and actual coverage of nominal 95% intervals—as maximum likelihood methods in RSR models with medium sample sizes.
Abstract: We use simulation studies (a) to compare Bayesian and likelihood fitting methods, in terms of validity of conclusions, in two-level random-slopes regression (RSR) models, and (b) to compare several Bayesian estimation methods based on Markov chain Monte Carlo, in terms of computational efficiency, in random-effects logistic regression (RELR) models. We find (a) that the Bayesian approach with a particular choice of diffuse inverse Wishart prior distribution for the (co)variance parameters performs at least as well, in terms of bias of estimates and actual coverage of nominal 95% intervals, as maximum likelihood methods in RSR models with medium sample sizes (expressed in terms of the number J of level-2 units), but neither approach performs as well as might be hoped with small J; and (b) that an adaptive hybrid Metropolis-Gibbs sampling method we have developed for use in the multilevel modeling package MLwiN outperforms adaptive rejection Gibbs sampling in the RELR models we have considered, sometimes by a wide margin.
TL;DR: This paper proposes some extensions of the PCA with the aim of representing, in a space of reduced dimensions, images of hypercubes, pointing out differences and similarities according to their structural features.
Abstract: The present paper deals with the study of continuous interval data by means of suitable Principal Component Analyses (PCA). Statistical units described by interval data can be assumed as special cases of Symbolic Objects (SO) (Diday, 1987). In Symbolic Data Analysis (SDA), these data are represented as hypercubes. In the present paper, we propose some extensions of the PCA with the aim of representing, in a space of reduced dimensions, images of such hypercubes, pointing out differences and similarities according to their structural features.
TL;DR: In this paper, the authors restrict attention to the univariate smoothing setting with Gaussian noise and the truncated polynomial regression spline basis, and compare them through a simulation study.
Abstract: Regression spline smoothing involves modelling a regression function as a piecewise polynomial with a high number of pieces relative to the sample size. Because the number of possible models is so large, efficient strategies for choosing among them are required. In this paper we review approaches to this problem and compare them through a simulation study. For simplicity and conciseness we restrict attention to the univariate smoothing setting with Gaussian noise and the truncated polynomial regression spline basis.
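For readers unfamiliar with the truncated polynomial basis mentioned above, a minimal sketch follows. It builds one design-matrix row consisting of the polynomial terms 1, x, ..., x^p followed by one truncated term (x − κ)_+^p per knot. The function name, the knot locations, and the default degree are illustrative assumptions, not choices made in the paper.

```python
def truncated_power_basis(x, knots, degree=3):
    """Return one design-matrix row for the truncated power basis.

    Columns: 1, x, ..., x^degree, then (x - k)_+^degree for each knot k.
    Assumes degree >= 1.
    """
    poly = [x ** j for j in range(degree + 1)]
    trunc = [max(x - k, 0.0) ** degree for k in knots]
    return poly + trunc

if __name__ == "__main__":
    # Linear spline (degree 1) with hypothetical knots at 1.0 and 3.0.
    print(truncated_power_basis(2.0, [1.0, 3.0], degree=1))
```

Choosing among models then amounts to choosing which knots (columns) to keep, which is why the number of candidate models grows so quickly.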
TL;DR: In this article, it was shown that the class of limiting distributions of such random sums, as the number of terms converges to infinity, consists of multivariate asymmetric distributions that are natural generalizations of univariate Laplace laws.
Abstract: Consider a sum of independent and identically distributed random vectors with finite second moments, where the number of terms has a geometric distribution independent of the summands. We show that the class of limiting distributions of such random sums, as the number of terms converges to infinity, consists of multivariate asymmetric distributions that are natural generalizations of univariate Laplace laws. We call these limits multivariate asymmetric Laplace laws. We give an explicit form of their multidimensional densities and show representations that effectively facilitate computer simulation of variates from this class. We also discuss the relation to other formerly considered classes of distributions containing Laplace laws.
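As an illustration of how such variates might be simulated, the sketch below uses a normal variance-mean mixture, Y = mW + √W·X with W standard exponential and X ~ N(0, Σ). This representation is commonly associated with asymmetric Laplace laws, but it is an assumption here; the abstract does not state which representations the authors use. The function name and the restriction to two dimensions are purely for brevity.

```python
import math
import random

def sample_asymmetric_laplace_2d(m, cov, n, seed=0):
    """Draw n bivariate variates via Y = m*W + sqrt(W)*X (assumed representation).

    m   : (m1, m2) asymmetry/location vector
    cov : ((a, b), (b, c)) positive-definite covariance of the normal part
    """
    rng = random.Random(seed)
    (a, b), (_, c) = cov
    # Cholesky factor of the 2x2 covariance, written out by hand.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    draws = []
    for _ in range(n):
        w = rng.expovariate(1.0)            # standard exponential mixing variable
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = l11 * z1                       # correlated normal vector X
        x2 = l21 * z1 + l22 * z2
        s = math.sqrt(w)
        draws.append((m[0] * w + s * x1, m[1] * w + s * x2))
    return draws

if __name__ == "__main__":
    ys = sample_asymmetric_laplace_2d((1.0, -0.5), ((1.0, 0.3), (0.3, 2.0)), 5)
    for y in ys:
        print(y)
```

Under this representation the mean of Y equals m, which gives a quick sanity check on simulated output.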
TL;DR: This paper gives a description of posterior predictive checking (introduced by Rubin, 1984) for detecting departures between the data and the posited model and illustrates how the posterior predictive check can be used in practice.
Abstract: In this paper, we give a description of posterior predictive checking (introduced by Rubin, 1984) for detecting departures between the data and the posited model and illustrate how the posterior predictive check can be used in practice. We further discuss interpretability, frequency properties, and prior sensitivity of the posterior predictive p-value.
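A posterior predictive check can be sketched in a few lines. The example below assumes a deliberately simple model that is not from the paper: y_i ~ N(μ, 1) with a flat prior on μ, so that μ | y ~ N(ȳ, 1/n). The function name and the choice of test statistic are hypothetical.

```python
import random
import statistics

def posterior_predictive_pvalue(y, stat, n_draws=2000, seed=0):
    """Estimate P(stat(y_rep) >= stat(y) | y) by simulation.

    Assumed model: y_i ~ N(mu, 1), flat prior on mu.
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    t_obs = stat(y)
    exceed = 0
    for _ in range(n_draws):
        mu = rng.gauss(ybar, (1.0 / n) ** 0.5)          # draw from the posterior
        y_rep = [rng.gauss(mu, 1.0) for _ in range(n)]  # replicated data set
        exceed += stat(y_rep) >= t_obs
    return exceed / n_draws

if __name__ == "__main__":
    data_range = lambda v: max(v) - min(v)
    # The value 9.0 is far outside what a unit-variance model would produce.
    print(posterior_predictive_pvalue([0.1, -0.4, 0.3, 0.8, -0.2, 9.0], data_range))
```

A p-value near 0 or 1 signals that the observed statistic is unusual under the posited model; the range is used here because it is sensitive to a single outlying point.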
TL;DR: A recently developed method to transform the original input vectors into high-dimensional space, and then construct a linear regression function or hyperplane in that space by applying the kernel technique is reviewed.
Abstract: The paper first reviews a recently developed method called the Support Vector Machine. The main feature of the method is to transform the original input vectors into high-dimensional space, and then construct a linear regression function or hyperplane in that space. The transformation is usually done by applying the kernel technique. The paper then shows that the same kernel technique can be applied to classical algorithms such as Ridge Regression. In conclusion, we present a new transductive learning algorithm that also allows us to compute confidence levels.
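To make the kernel version of Ridge Regression concrete, the sketch below solves the standard dual problem (K + λI)α = y and predicts with f(x) = Σ α_i k(x_i, x). The Gaussian kernel, the one-dimensional inputs, the parameter values, and all function names are assumptions chosen for illustration; the paper's own formulation may differ.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def kernel_ridge_fit(xs, ys, lam=1e-3, gamma=1.0):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, ys)

def kernel_ridge_predict(xs, alpha, x_new, gamma=1.0):
    """Evaluate f(x_new) = sum_i alpha_i * k(x_i, x_new)."""
    return sum(a * rbf(xi, x_new, gamma) for a, xi in zip(alpha, xs))

if __name__ == "__main__":
    xs = [0.5 * i for i in range(7)]
    ys = [math.sin(x) for x in xs]
    alpha = kernel_ridge_fit(xs, ys)
    print(kernel_ridge_predict(xs, alpha, 1.25), math.sin(1.25))
```

The fit never forms the high-dimensional feature vectors explicitly; only kernel evaluations between pairs of inputs are needed, which is the point of the kernel technique.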
TL;DR: In this article, the authors develop an imputation method that combines a multivariate regression model with nearest neighbor hot decking, and apply it successfully to complex cases where the variable being imputed is of a ratio-scale type and contains a high number of unknown zero values.
Abstract: The paper develops an imputation method that combines a multivariate regression model with nearest neighbour hot decking. This method is successfully applied to complex cases where the variable being imputed is of a ratio-scale type and contains a high number of unknown zero values. The results obtained with the method are compared with two other techniques: (i) random hot decking and (ii) a two-step model-based method. The latter first uses logistic regression and then standard regression imputation. Our results do not point to a single conclusion. On average, regression-based nearest neighbour hot decking is the best, but the two-step model-based method also has some advantages. The paper cannot deal with other important questions, but we want to emphasise the importance of variance estimation: it leads to an additional variance component called imputation variance. The paper also discusses a diagnostic test for the quality of imputations; this test checks how many times the same donor is used in imputing missing values.
TL;DR: This work proposes an approximation algorithm that has a very solid theoretical background and is surprisingly accurate for an extremely large set of arguments/parameter values; it is also applied for a reliable approximation of the quantiles of the distribution for large values of noncentrality and degrees of freedom.
Abstract: The cumulative distribution function (cdf) of the noncentral χ² distribution with positive degrees of freedom ν > 0 and a noncentrality parameter δ² ≥ 0 is usually expressed as an infinite weighted sum of central χ² cdf’s. For the purpose of numerical evaluation this infinite sum is approximated by a finite sum. For large values of the noncentrality parameter, the sum converges slowly. Alternative approximation algorithms have been proposed instead in the literature. A comparison of these is given in Johnson & Kotz (1970). Most of the approximation algorithms have advantages for certain values of the arguments/parameters and perform poorly for other values. We propose an approximation algorithm that has a very solid theoretical background and is surprisingly accurate for an extremely large set of arguments/parameter values. It is also applied for a reliable approximation of the quantiles of the distribution for large values of noncentrality and degrees of freedom. Although asymptotic in spirit (with respect to the degrees of freedom ν), the algorithm gives quite accurate approximations even down to ν = 1.
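For reference, the infinite weighted sum mentioned above is usually written as a Poisson mixture of central χ² cdf’s. The notation F_k for the central cdf with k degrees of freedom is introduced here for illustration and is not taken from the paper:

$$
F(x;\,\nu,\delta^2) \;=\; \sum_{j=0}^{\infty} e^{-\delta^2/2}\,\frac{(\delta^2/2)^j}{j!}\;F_{\nu+2j}(x).
$$

The Poisson weights have mean δ²/2, so the number of terms needed for a given accuracy grows with the noncentrality parameter, which is the slow convergence the abstract refers to.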
TL;DR: In this article, a nonlinear Gauss-Seidel type algorithm is proposed for computing the maximum posterior estimates of the random effects in a generalized linear mixed model (GLMM), and the algorithm is shown to converge in virtually all typical GLMM situations.
Abstract: A nonlinear Gauss-Seidel type algorithm is proposed for computing the maximum posterior estimates of the random effects in a generalized linear mixed model. We show that the algorithm converges in virtually all typical situations of generalized linear mixed models. A numerical example shows the superiority of the proposed algorithm over the standard Newton-Raphson procedure when the number of random effects is large.
TL;DR: The aim of this paper is the adaptation of the classical Bayesian discrimination rule to Symbolic Objects, by estimating the a priori probabilities and by kernel density estimation.
Abstract: Current technological progress in hardware, databases and object-oriented languages implies the manipulation, storage and representation of objects with more and more complex data. The notion of Symbolic Objects is introduced on the basis of Diday’s work, and most recent classification methods need to be adapted to this notion. The aim of this paper is the adaptation of the classical Bayesian discrimination rule to Symbolic Objects. This is performed by estimating the a priori probabilities and by kernel density estimation.
TL;DR: In this article, the augmented Gibbs sampler (a special case of MCMC), illustrated using the random intercept model, is used to give an illustration of the power of Markov Chain Monte Carlo (MCMC).
Abstract: Many papers (including most of the papers in this issue of Computational Statistics) deal with Markov Chain Monte Carlo (MCMC) methods. This paper will give an introduction to the augmented Gibbs sampler (a special case of MCMC), illustrated using the random intercept model. A ‘nonstandard’ application of the augmented Gibbs sampler will be discussed to give an illustration of the power of MCMC methods. Furthermore, it will be illustrated that the posterior sample resulting from an application of MCMC can be used for more than determination of convergence and the computation of simple estimators like the a posteriori expectation and standard deviation. Posterior samples give access to many other inferential possibilities. Using a simulation study, the frequency properties of some of these possibilities will be evaluated.
TL;DR: In this paper, the authors proposed the use of B-splines instead of truncated power basis functions for flexible modeling of high-dimensional data, which allows to generate models competitive with those of the original MARS.
Abstract: We investigate one of the possible ways for improving Friedman’s Multivariate Adaptive Regression Splines (MARS) algorithm designed for flexible modelling of high-dimensional data. In our version of MARS called BMARS we use B-splines instead of truncated power basis functions. The fact that B-splines have compact support allows us to introduce the notion of a “scale” of a basis function. The algorithm starts building up models by using large-scale basis functions and switches over to a smaller scale after the fitting ability of the large scale splines has been exhausted. The process is repeated until the prespecified number of basis functions has been produced. In addition, we discuss a parallelisation of BMARS as well as an application of the algorithm to processing of a large commercial data set. The results demonstrate the computational efficiency of our algorithm and its ability to generate models competitive with those of the original MARS.
TL;DR: A method to obtain simple descriptions of strata, based on a generalised recursive tree-building algorithm for populations partitioned into strata, is presented and extended to individuals described by probabilistic symbolic objects.
Abstract: Based on a generalised recursive tree-building algorithm for populations partitioned into strata, a method to obtain simple descriptions of strata is presented. Strata with a common rule are also obtained. Common predictors and a criterion variable describe the population in all strata or classes of individuals. The algorithm takes the strata structure into account during tree building and combines, in each step, maximisation of an information content measure for the criterion variable in a new binary partition of the population with selection of decisional nodes, based on the quality of prediction for subsets of strata. Each decisional tree node is composed of a set of strata and a rule for individuals in these strata that will jointly explain the criterion variable. Symbolic data analysis fits the method. The input of the algorithm is composed of classes of individuals. The algorithm is extended to individuals described by probabilistic symbolic objects. As output, symbolic objects describe the tree, decisional nodes and strata.
TL;DR: In this paper, a Monte-Carlo method is presented for finding the solution of an estimating equation that can be expressed as the expected value of a "full data" estimating equation, in which the expected value is with respect to the distribution of the missing data given the observed data.
Abstract: We present a new Monte-Carlo method for finding the solution of an estimating equation that can be expressed as the expected value of a ‘full data’ estimating equation in which the expected value is with respect to the distribution of the missing data given the observed data. Equations such as these arise whenever the E-M algorithm can be used. The algorithm alternates between two steps: an S-step, in which the missing data are simulated, either from the conditional distribution described above or from a more convenient importance sampling distribution, and a U-step, in which parameters are updated using a closed-form expression that does not require a numerical maximization. We present two numerical examples to illustrate the method. Theoretical results are obtained establishing consistency and asymptotic normality of the approximate solution obtained by our method.
TL;DR: A futurist vision for data analysts is painted and some of the tools and techniques likely to be used, with respect to data mining, visualization and quantization methods, are discussed.
Abstract: This paper attempts to paint a futurist vision for data analysts and in doing so discusses some of the tools and techniques likely to be used. A major premise of this vision is that mathematical statistics like classical mechanics is essentially a completed discipline. Moreover, that changes in the nature, modes of collection, and scale of data imply new tools and techniques are inevitable. Complexity of algorithms and data structures imply an increased focus on algorithmic efficiency and, to some extent, more automated procedures. Suggestions for advancement in theory are made with respect to data mining, visualization and quantization methods. Suggestions are also made on likely architectures for digital text and data libraries, for modes of accessing distributed databases, and for the implications on collaboration.
TL;DR: This paper extends deterministic models for Boolean regression within a Bayesian framework to include a proper account of the uncertainty in the model estimates and various possibilities for model checking (using posterior predictive checks).
Abstract: This paper extends deterministic models for Boolean regression within a Bayesian framework. For a given binary criterion variable Y and a set of k binary predictor variables X1,…, Xk, a Boolean regression model is a conjunctive (or disjunctive) logical combination consisting of a subset S of the X variables, which predicts Y. Formally, Boolean regression models include a specification of a k-dimensional binary indicator vector (θ1,…,θk) with θj = 1 iff Xj ∈ S. In a probabilistic extension, a parameter π is added which represents the probability of the predicted value $\hat{y}_i$ and the observed value $y_i$ differing (for any observation i). Within a Bayesian framework, a posterior distribution of the parameters (θ1,…, θk, π) is looked for. The advantages of such a Bayesian approach include a proper account of the uncertainty in the model estimates and various possibilities for model checking (using posterior predictive checks). We illustrate this method with an example using real data.
TL;DR: The grand tour method implemented in a dynamic graphics environment and endowed with dynamically changing concentration ellipses and count plots is recalled and a sort of classification of the found outliers is carried out by performing cluster analysis based on angular similarities of the suspected outliers.
Abstract: The paper presents our experience with identifying and verifying outlying data points. Firstly, we recall the grand tour method implemented in a dynamic graphics environment and endowed with dynamically changing concentration ellipses and count plots, as proposed by Bartkowiak & Szustalewicz (1997). The method makes it possible to select and identify some data points as suspected outliers. Next, we propose to carry out a sort of classification of the found outliers by performing cluster analysis based on angular similarities of the suspected outliers. The procedure returns bundles of data vectors similar with respect to their outlyingness. The considerations are illustrated with the Milk container data, analyzed formerly by, among others, Atkinson (1994) and Muruzabal and Munoz (1997).
TL;DR: A characterization for exchangeability of a stable random vector, in terms of its spectral measure, is given, and FORTRAN subroutines to simulate a desirable exchangeable stable random vector and to create an exchangeable partition are written.
Abstract: This work concerns the simulation of an exchangeable stable random vector. A characterization for exchangeability of a stable random vector, in terms of its spectral measure, is given. The Modarres and Nolan’s simulating method on stable random vectors is modified to the exchangeable case. FORTRAN subroutines to simulate a desirable exchangeable stable random vector and to create an exchangeable partition are written.
TL;DR: In this paper, the weighted average density derivative (WAD) estimator is used to estimate regression parameters up to scale under the assumption of a single-index model and the small sample performance of ratio estimators is studied.
Abstract: Under the assumption of a single-index model, the weighted average density derivative (WAD) estimator estimates regression parameters up to scale. The small sample performance of ratio estimators is studied. For spherical errors in a latent variable specification the WAD estimator, in terms of bias and mean square error (MSE), demonstrates performance similar to the logit maximum likelihood estimator. Under heteroskedastic errors the WAD estimator performs better. In an empirical application concerning choices of freight transports we find that the WAD estimator evidences improved performance in one of the two sectors studied compared with standard parametric models.
TL;DR: Some of the latest results of research in the area of metadata are summarized, and the modelling of metainformation using templates and object oriented models within the context of a metadata database is compared to the traditional way of using simple verbal footnotes.
Abstract: This paper aims at summarising some of the latest results of research in the area of metadata. The modelling of metainformation using templates and object oriented models within the context of a metadata database is initially compared to the traditional way of using simple verbal footnotes. The possibility of further automating the procedures inside National Statistical Offices using metadata guided statistical processing is discussed and certain related aspects of man-machine interfaces are considered. Finally, selected topics concerning the quality of metadata and their integration inside large statistical information systems are examined.
TL;DR: Theoretical properties of wavelet based random densities subject to some standard constraints: smoothness, symmetry, unimodality, and skewness are described.
Abstract: In this paper we describe the theoretical properties of wavelet-based random densities and present algorithms for their generation. We exhibit random densities subject to some standard constraints: smoothness, symmetry, unimodality, and skewness. We also provide three relevant applications of wavelet-based random densities.
TL;DR: A transformation model is sketched which allows the use of metadata in production activities and some conceptual issues for the architecture of metadata driven statistical processing systems are discussed.
Abstract: Based on the analysis of the statistical production process inside and outside national statistical offices, requirements for metadata structures accompanying statistical data are outlined. Furthermore, a transformation model is sketched which allows the use of metadata in production activities, and some conceptual issues for the architecture of metadata driven statistical processing systems are discussed.
TL;DR: In this paper, the authors compare the performance of series-type estimators with the results obtained by two of the most popular nonparametric regression estimation methods: kernel estimation and least-squares cubic splines.
Abstract: This paper is concerned with the practical performances of series-type estimators of a regression function. For different choices of orthonormal bases (Legendre polynomials, trigonometric functions, wavelets) we compare, by simulation arguments, the performances of series-type estimators with the results obtained by two of the most popular nonparametric regression estimation methods: kernel estimation and least-squares cubic splines. It will be shown that orthonormal series estimators are competitive in relation to these former nonparametric procedures. No agreement has emerged on the best method, the results being highly dependent on the nature of the estimated regression function.
TL;DR: In this paper, the authors argue that the change from paper questionnaires and paper publications to e-questionnaires and electronic publications makes a Business Process Redesign of National Statistical Institutes (NSI) inevitable.
Abstract: In a world facing so many and such rapid technological developments, the statistical processes of National Statistical Institutes (NSI’s) cannot remain unchanged. This paper argues that the change from paper questionnaires and paper publications to electronic questionnaires and electronic publications makes a Business Process Redesign of NSI’s inevitable. The traditional stove-pipe approach, where each individual survey had its own questionnaire and publications, must be replaced by a new approach focused on external sources (for data collection) and external customers (for dissemination). As a result, the internal processes will be integrated, and the focus will be on corporate databases instead of separate departmental databases. Some aspects of the new approach will be illustrated by examples of developments at Statistics Netherlands (SN).
TL;DR: In this paper, a number of commonly used estimates of the inverse autocorrelation function can be modified to deal with outlier contaminated data, and the robust analogues of the orthogonal and interpolation based techniques appear to be new, and provide an alternative to the robust autoregressive approach.
Abstract: We show how a number of commonly used estimates of the inverse autocorrelation function can be modified to deal with outlier contaminated data. The robust analogues of the orthogonal and interpolation based techniques appear to be new, and provide an alternative to the robust autoregressive approach. We examine the performance of these techniques in a large scale numerical experiment. This shows significant improvements in performance in outlier contaminated data when robust techniques are used. While there was no uniformly best robust technique, our experiments support the use of the autoregressive approach to avoid catastrophic reductions in performance, and robust interpolation for short series corrupted by few outliers.
TL;DR: In this article, new expressions are obtained for the mean and mean squared error of the smoothed bootstrap mean integrated squared error estimator in Gaussian kernel estimation of normal mixture densities. The expressions are exact and complement known asymptotic results.
Abstract: New expressions are obtained for the mean and mean squared error of the smoothed bootstrap mean integrated squared error estimator in Gaussian kernel estimation of normal mixture densities. The use of such densities is in the same spirit as Marron and Wand (1992) and provides the same benefits. The resulting expressions are easily computable and describe the exact behavior of the estimator, thus complementing known asymptotic results for it. They reveal important information for small samples, not indicated by asymptotics. In particular, while asymptotics call for oversmoothing the estimator, undersmoothing may actually be more appropriate for small samples.