TL;DR: In this paper, the authors proposed a method called the "gap statistic" for estimating the number of clusters (groups) in a set of data, which uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution.
Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
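The comparison described in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes K-means (a small NumPy version written here) as the clustering algorithm, a reference null distribution that is uniform over the bounding box of the data, and the within-cluster sum of squares as the dispersion measure. The function names `within_dispersion` and `gap_statistic`, and the standard-error factor sqrt(1 + 1/B), are details not given in the abstract.

```python
import numpy as np

def _seed_centers(X, k, rng):
    """k-means++ style seeding: later centers are drawn far from earlier ones."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def within_dispersion(X, k, rng, n_init=5, n_iter=100):
    """Smallest within-cluster sum of squares W_k over n_init K-means runs."""
    best = np.inf
    for _ in range(n_init):
        centers = _seed_centers(X, k, rng)
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(updated, centers):
                break
            centers = updated
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, k_max=6, n_ref=10, seed=0):
    """Return Gap(k) and its simulation standard error for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(within_dispersion(X, k, rng))
        # Assumed null: data drawn uniformly over the bounding box of X.
        ref_log_w = np.array([
            np.log(within_dispersion(rng.uniform(lo, hi, size=X.shape), k, rng))
            for _ in range(n_ref)
        ])
        gaps.append(ref_log_w.mean() - log_w)
        errs.append(ref_log_w.std() * np.sqrt(1 + 1 / n_ref))
    return np.array(gaps), np.array(errs)
```

Calling `gap_statistic(X)` on an `(n, d)` array returns one gap value per candidate number of clusters. A larger gap means the observed dispersion falls further below what the reference distribution would produce. Turning these values into a single estimate needs a selection rule, which the abstract does not spell out.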
TL;DR: In this paper, a simple test of Granger (1969) non-causality for heterogeneous panel data models is proposed, based on the individual Wald statistics of Granger non-causality averaged across the cross-section units.
TL;DR: In this paper, the authors present a test of independence that can be applied to the estimated residuals of any time series model that can be transformed into a model driven by independent and identically distributed errors.
Abstract: This paper presents a test of independence that can be applied to the estimated residuals of any time series model that can be transformed into a model driven by independent and identically distributed errors. The first order asymptotic distribution of the test statistic is independent of estimation error provided that the parameters of the model under test can be estimated √n-consistently. Because of this, our method can be used as a model selection tool and as a specification test. Widely used software written by Dechert and LeBaron can be used to implement the test. Also, this software is fast enough that the null distribution of our test statistic can be estimated with bootstrap methods. Our method can be viewed as a nonlinear analog of the Box-Pierce Q statistic used in ARIMA analysis.
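TL;DR: This work covers statistical computations for organizing data, confidence intervals, correlation, analysis of variance, multivariate analyses, and nonparametric tests, with appendix tables of critical values for procedures such as the Mann-Whitney Test, the Newman-Keuls' and Tukey Multiple-Comparison Tests, and the Matched-Pairs, Signed-Ranks Test.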
Abstract: 1. Organizing Data and Some Simple Computations. 2. Confidence Intervals. 3. Correlation and Related Topics. 4. Analysis of Variance. 5. Supplemental Computations for Analysis of Variance. 6. Multivariate Analyses. 7. Nonparametric Tests, Miscellaneous Tests of Significance, and Indexes of Relationships. Appendices. Normal-Curve Areas. Critical Values of "Student's" t Statistic. Critical Values for Sandler's A Statistic. Values of the Chi-Square Statistic. Probabilities of the F Distribution. Fisher's z Transformation for Pearson's r Correlation Coefficient. Critical Values of Pearson's r Correlation Coefficient for Five Alpha Significance Levels. Critical Values of the U Statistic of the Mann-Whitney Test. Critical Values for Hartley's Maximum F Ratio Significance Test for Homogeneity of Variances. Significant Studentized Ranges for Duncan's New Multiple-Range Test. Significant Studentized Ranges for the Newman-Keuls' and Tukey Multiple-Comparison Tests. Dunnett's Test: Comparison of Treatment Means with a Control. Critical Values of Wilcoxon's t Statistic for the Matched-Pairs, Signed-Ranks Test. Coefficients for Orthogonal Polynomials. Cumulative Probability Distribution for r', the Total Number of Runs Up or Down. Sample Size and Power.
TL;DR: This work proposes two statistical tests to determine if two samples are from different distributions, and applies this approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where the test performs strongly.
Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m^2) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
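The test statistic in this abstract, the distance between the two sample means after mapping into an RKHS, can be illustrated with a short Python sketch. It rests on several assumptions that are not in the abstract: a Gaussian kernel with a hand-picked bandwidth `sigma`, the biased (plug-in) estimate of the squared distance, and hypothetical function names. The p-value comes from a permutation test, which is a stand-in and is neither of the two tests the authors propose (those use a large deviation bound and the asymptotic distribution).

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Squared RKHS distance between the empirical mean embeddings of X and Y.

    Builds three kernel matrices, so the cost is quadratic in the sample size.
    """
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

def permutation_pvalue(X, Y, sigma=1.0, n_perm=200, seed=0):
    """Permutation p-value (a substitute for the paper's two thresholds)."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(X, Y, sigma)
    Z = np.vstack([X, Y])
    m = len(X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        # Recompute the statistic after shuffling the sample labels.
        if mmd2_biased(Z[perm[:m]], Z[perm[m:]], sigma) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Both functions take samples as `(m, d)` arrays. A small p-value from `permutation_pvalue(X, Y)` suggests the two samples come from different distributions. The bandwidth strongly affects power, so in practice it would be chosen from the data and not left at the default.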