TL;DR: This paper proposes a clustering procedure with adaptive density peak detection, in which the local density is estimated by nonparametric multivariate kernel estimation, and develops an automatic cluster centroid selection method that maximizes an average silhouette index.
Abstract: Common limitations of clustering methods include slow algorithm convergence, instability with respect to the pre-specification of a number of intrinsic parameters, and lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm for cluster centers based on their local densities. However, the selection of the key intrinsic parameters in the algorithm was not systematically investigated. It is relatively difficult to estimate the "optimal" parameters since the original definition of the local density in the algorithm is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, where the local density is estimated through nonparametric multivariate kernel estimation. The model parameter can then be calculated from equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method through maximizing an average silhouette index. The advantage and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method runs in a single step without any iteration, and thus is fast and has great potential for big data analysis. A user-friendly R package, ADPclust, is developed for public use.
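The density peak idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the ADPclust implementation: the function name `density_peaks`, the fixed bandwidth `h`, and the fixed `n_clusters` are assumptions made here, whereas the paper derives the bandwidth from theory and picks centroids by maximizing an average silhouette index. The sketch scores each point by its Gaussian kernel density times its distance to the nearest denser point, takes the top scores as centers, and lets every other point inherit the label of its nearest denser neighbor.

```python
import math
import random

def density_peaks(points, h, n_clusters):
    """Toy density peak clustering (assumes distinct density values)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: unnormalized Gaussian kernel average with bandwidth h.
    dens = [sum(math.exp(-0.5 * (d / h) ** 2) for d in row) / n for row in dist]
    order = sorted(range(n), key=lambda i: -dens[i])  # densest first
    delta = [0.0] * n   # distance to the nearest point of higher density
    parent = [-1] * n   # index of that nearest denser point
    delta[order[0]] = max(dist[order[0]])  # convention for the global peak
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda t: dist[i][t])
        delta[i], parent[i] = dist[i][j], j
    # Centers: largest density * delta (a simplification, see lead-in).
    centers = sorted(range(n), key=lambda i: -dens[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c_id, c in enumerate(centers):
        labels[c] = c_id
    # Single pass in decreasing density: each parent is labeled before its child.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return centers, labels

rng = random.Random(0)
points = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(60)]
          + [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(60)])
centers, labels = density_peaks(points, h=0.5, n_clusters=2)
```

Label assignment needs no iteration because points are visited in decreasing density, which matches the single-step property the abstract emphasizes.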
TL;DR: A Java-based computer application called KD X is presented to facilitate the visualization of data and the use of numerical tools from frequency distribution statistics.
Abstract: The application of frequency distribution statistics to data provides objective means to assess the nature of the data distribution and the viability of numerical models that are used to visualize and interpret data. Two commonly used tools are the kernel density estimation and the reduced chi-squared statistic used in combination with a weighted mean. Due to the wide applicability of these tools, we present a Java-based computer application called KD X to facilitate the visualization of data and the utilization of these numerical tools.
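As a rough illustration of the second tool mentioned above, the sketch below computes an inverse-variance weighted mean and the reduced chi-squared statistic about that mean. The function name and the sample values are hypothetical, and this is not code from KD X; it only shows the standard textbook formulas.

```python
import math

def weighted_mean_and_reduced_chi2(values, sigmas):
    """Inverse-variance weighted mean, its standard error, and reduced chi-squared."""
    if len(values) != len(sigmas) or len(values) < 2:
        raise ValueError("need at least two values with matching uncertainties")
    weights = [1.0 / (s * s) for s in sigmas]
    total_w = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_w
    mean_sigma = math.sqrt(1.0 / total_w)
    # Chi-squared about the weighted mean, divided by n - 1 degrees of freedom.
    chi2 = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    reduced = chi2 / (len(values) - 1)
    return mean, mean_sigma, reduced

# Hypothetical measurements with equal 1-sigma uncertainties.
mean, mean_sigma, reduced_chi2 = weighted_mean_and_reduced_chi2(
    [10.0, 12.0, 11.0], [1.0, 1.0, 1.0]
)
```

A reduced chi-squared near 1 suggests the scatter is consistent with the stated uncertainties; values well above 1 suggest excess scatter.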
TL;DR: This work introduces a class of unbiased estimators for kernel density implemented through locality-sensitive hashing, and gives general theorems bounding the variance of such estimators.
Abstract: Given a set of points $P \subset \mathbb{R}^d$ and a kernel $k$, the Kernel Density Estimate at a point $x \in \mathbb{R}^d$ is defined as $\mathrm{KDE}_{P}(x)=\frac{1}{|P|}\sum_{y\in P} k(x,y)$. We study the problem of designing a data structure that, given a data set $P$ and a kernel function, returns approximations to the kernel density of a query point in sublinear time. We introduce a class of unbiased estimators for kernel density implemented through locality-sensitive hashing, and give general theorems bounding the variance of such estimators. These estimators give rise to efficient data structures for estimating the kernel density in high dimensions for a variety of commonly used kernels. Our work is the first to provide data structures with theoretical guarantees that improve upon simple random sampling in high dimensions.
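To make the definition above concrete, the following sketch evaluates the exact kernel density and the simple random sampling baseline that the abstract uses as its point of comparison. It does not implement the locality-sensitive hashing estimators; the Gaussian kernel, the bandwidth, and all function names are assumptions chosen for illustration.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Unnormalized Gaussian kernel k(x, y) with values in (0, 1]."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def kde_exact(points, x, kernel=gaussian_kernel):
    """KDE_P(x) = (1/|P|) * sum over y in P of k(x, y); linear time per query."""
    return sum(kernel(x, y) for y in points) / len(points)

def kde_sampled(points, x, m, rng, kernel=gaussian_kernel):
    """Unbiased estimate from m points drawn uniformly with replacement."""
    sample = [points[rng.randrange(len(points))] for _ in range(m)]
    return sum(kernel(x, y) for y in sample) / m

rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
query = (0.0, 0.0)
exact = kde_exact(points, query)
approx = kde_sampled(points, query, 200, rng)
```

Uniform sampling is unbiased, but its relative error grows when the true density at the query is small, which is the regime where variance-reduced estimators are of interest.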
TL;DR: In this paper, a kernel density estimation (KDE) method is proposed to estimate the probability density function (PDF) of wind speed, without making any assumption on the form of the underlying wind speed distribution, and capable of uncovering the statistical information hidden in the historical data.
Abstract: An accurate probability distribution model of wind speed is critical to the assessment of the reliability contribution of wind energy to power systems. Most current models are built using parametric density estimation (PDE) methods, which usually assume that the wind speed follows a certain known distribution (e.g. the Weibull distribution or the Normal distribution) and estimate the parameters of the models from historical data. This paper presents a kernel density estimation (KDE) method, which is a nonparametric way to estimate the probability density function (PDF) of wind speed. The method is a data-driven approach that makes no assumption on the form of the underlying wind speed distribution and is capable of uncovering the statistical information hidden in the historical data. The proposed method is compared with three parametric models using wind data from six sites. The results indicate that the KDE outperforms the PDE in terms of accuracy and flexibility in describing the long-term wind speed distributions for all sites. A sensitivity analysis with respect to kernel functions is presented, and the Gaussian kernel function is shown to be the best one. Case studies on a standard IEEE reliability test system (IEEE-RTS) have verified the applicability and effectiveness of the proposed model in evaluating the reliability performance of wind farms.
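A minimal sketch of a one-dimensional Gaussian KDE of the kind described above is given below. The data are synthetic (drawn from a Weibull distribution purely as a stand-in for measured wind speeds), and the normal-reference bandwidth rule is an assumption made here; the paper's own bandwidth choice is not stated in the abstract.

```python
import math
import random
import statistics

def normal_reference_bandwidth(data):
    """Rule-of-thumb bandwidth h = 1.06 * sd * n^(-1/5) (assumed choice)."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-0.2)

def gaussian_kde(data, bandwidth):
    """Return a function evaluating the Gaussian kernel density estimate."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return pdf

rng = random.Random(1)
# Synthetic stand-in for wind speeds in m/s: Weibull with scale 8, shape 2.
speeds = [rng.weibullvariate(8.0, 2.0) for _ in range(500)]
h = normal_reference_bandwidth(speeds)
pdf = gaussian_kde(speeds, h)

# Riemann-sum check that the estimate integrates to about 1.
grid = [i * 0.1 for i in range(-100, 500)]
area = sum(pdf(x) for x in grid) * 0.1
```

Note that a Gaussian kernel places a small amount of probability mass below zero, which is a known boundary effect for nonnegative data such as wind speed.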
TL;DR: In this paper, a new method is presented in order to statistically characterize arbitrary particle shapes using an optimal choice of shape functions identified on a set of 1000 digitized railway ballast particles obtained through 3D Scan.
Abstract: Based on proper orthogonal decomposition (POD), a new method is presented in order to statistically characterize arbitrary particle shapes using an optimal choice of shape functions identified on a set of 1000 digitized railway ballast particles obtained through 3D Scan. The coefficients of the POD expansion enable a description of ballast grains with varying levels of accuracy. On exploiting the knowledge of their statistical distribution we are able, implementing an appropriate multivariate kernel density estimation method, to generate irregular particles with similar morphological features. The description and generation methods are validated by comparing statistical distributions of basic characteristics: surface area, volume, average radius, elongation, flatness, and aspect ratio. Using suitable geometric descriptors defining local curvatures, we identify which surface points might be regarded as forming faces. This shows that the proposed particle generation method is well suited for irregularly shaped granular materials, as a first geometric definition step, before numerical simulations of their collective mechanical properties are carried out by a Discrete Element code dealing with polyhedral shapes. We illustrate this process with the simple case of the assembling of a granular pack from a loose configuration, by one-dimensional compression, using different levels of accuracy in the representation of grain shape.
TL;DR: This work provides Markov chain Monte Carlo algorithms for computing the bandwidth matrix for multivariate kernel density estimation by optimizing the likelihood cross-validation criterion, and shows that the resulting bandwidths are superior to all existing methods.
Abstract: Paper not available. Full text of working paper suppressed by author. We provide Markov chain Monte Carlo (MCMC) algorithms for computing the bandwidth matrix for multivariate kernel density estimation. Our approach is based on treating the elements of the bandwidth matrix as parameters to be estimated, which we do by optimizing the likelihood cross-validation criterion. Numerical results show that the resulting bandwidths are superior to all existing methods; for dimensions greater than two, our algorithm is the first practical method for estimating the optimal bandwidth matrix. Moreover, the MCMC algorithm for bandwidth selection for multivariate data has no increased difficulty as the dimension of data increases.
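The likelihood cross-validation criterion mentioned above can be illustrated in one dimension with a scalar bandwidth. The sketch below maximizes the leave-one-out log-likelihood by a plain grid search, which is a deliberate substitution for the paper's MCMC sampler over a full bandwidth matrix; the function names and the candidate grid are assumptions.

```python
import math
import random

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = len(data)
    const = 1.0 / ((n - 1) * h * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(data):
        # Density at xi estimated from all points except xi itself.
        s = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                for j, xj in enumerate(data) if j != i)
        total += math.log(max(const * s, 1e-300))  # guard against log(0)
    return total

def select_bandwidth(data, candidates):
    """Pick the candidate bandwidth with the highest criterion value."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))

rng = random.Random(2)
data = [rng.gauss(0.0, 1.0) for _ in range(150)]
candidates = [0.05 * k for k in range(1, 31)]  # 0.05, 0.10, ..., 1.50
best_h = select_bandwidth(data, candidates)
```

Leaving each point out of its own density estimate is what prevents the criterion from being maximized by an arbitrarily small bandwidth.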
TL;DR: This paper presents the Locally Gaussian Density Estimator (LGDE), which introduces a similar idea to the problem of density estimation, and it is shown that the LGDE converges at a speed that does not depend on the dimension.
Abstract: It is well known that the Curse of Dimensionality causes the standard Kernel Density Estimator to break down quickly as the number of variables increases. In non-parametric regression, this effect is relieved in various ways, for example by assuming additivity or some other simplifying structure on the interaction between variables. This paper presents the Locally Gaussian Density Estimator (LGDE), which introduces a similar idea to the problem of density estimation. The LGDE is a new method for the non-parametric estimation of multivariate probability density functions. It is based on preliminary transformations of the marginal observation vectors towards standard normality, and a simplified local likelihood fit of the resulting distribution with standard normal marginals. The LGDE is introduced, and asymptotic theory is derived. In particular, it is shown that the LGDE converges at a speed that does not depend on the dimension. Examples using real and simulated data confirm that the new estimator performs very well on finite sample sizes.
TL;DR: In this paper, a new multivariate kernel density estimation (KDE) method was introduced to infer Hutchinsonian hypervolumes in the modelling of ecological niches, and the authors argued that their method matches or outperforms several methods for estimating hypervolume geometries and for conducting species distribution modelling.
Abstract: Blonder et al. (2014, Global Ecology and Biogeography, 23, 595–609) introduced a new multivariate kernel density estimation (KDE) method to infer Hutchinsonian hypervolumes in the modelling of ecological niches. The authors argued that their KDE method matches or outperforms several methods for estimating hypervolume geometries and for conducting species distribution modelling. Further clarification, however, is appropriate with respect to the assumptions and limitations of KDE as a method for species distribution modelling. Using virtual species and controlled environmental scenarios, we show that KDE both under- and overestimates niche volumes depending on the dimensionality of the dataset and the number of occurrence records considered. We suggest that KDE may be a viable approach when dealing with large sample sizes, limited sampling bias and only a few environmental dimensions.
TL;DR: In this article, a more general solution is presented where the above mentioned limitation is relaxed and the presented solution can be easily adopted also for the task of efficient computation of integrated density derivative functionals involving an arbitrary derivative order.
TL;DR: This work proposes multivariate non-parametric kernel density estimation that reverses the rounding process by using a Bayesian measurement error model, applied to the Berlin register of residents for deriving density estimates of ethnic minorities and aged people.
Abstract: Modern systems of official statistics require the timely estimation of area-specific densities of subpopulations. Ideally estimates should be based on precise geocoded information, which is not available because of confidentiality constraints. One approach for ensuring confidentiality is by rounding the geoco-ordinates. We propose multivariate non-parametric kernel density estimation that reverses the rounding process by using a measurement error model. The methodology is applied to the Berlin register of residents for deriving density estimates of ethnic minorities and aged people. Estimates are used for identifying areas with a need for new advisory centres for migrants and infrastructure for older people.
TL;DR: This paper introduces a simple technique for improving the performance of using a KDE to classify points by their density (density classification), and applies threshold-based pruning to spatial index traversal to achieve asymptotic speedups over naïve KDE, while maintaining accuracy guarantees.
Abstract: Density estimation forms a critical component of many analytics tasks including outlier detection, visualization, and statistical testing. These tasks often seek to classify data into high and low-density regions of a probability distribution. Kernel Density Estimation (KDE) is a powerful technique for computing these densities, offering excellent statistical accuracy but quadratic total runtime. In this paper, we introduce a simple technique for improving the performance of using a KDE to classify points by their density (density classification). Our technique, thresholded kernel density classification (tKDC), applies threshold-based pruning to spatial index traversal to achieve asymptotic speedups over naive KDE, while maintaining accuracy guarantees. Instead of exactly computing each point's exact density for use in classification, tKDC iteratively computes density bounds and short-circuits density computation as soon as bounds are either higher or lower than the target classification threshold. On a wide range of dataset sizes and dimensions, tKDC demonstrates empirical speedups of up to 1000x over alternatives.
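The short-circuiting idea described above can be sketched without a spatial index. This simplified version, which is not the tKDC algorithm itself, scans the data once and keeps a lower and an upper bound on the density, using only the fact that each kernel value lies in [0, 1]; it stops as soon as the threshold falls outside the bounds. The function names and the unnormalized Gaussian kernel are assumptions.

```python
import math
import random

def kernel(x, y, h):
    """Unnormalized Gaussian kernel with values in (0, 1]."""
    return math.exp(-0.5 * ((x - y) / h) ** 2)

def classify_density(x, data, h, threshold):
    """Return ('high' or 'low', number of points examined)."""
    n = len(data)
    partial = 0.0
    for m, y in enumerate(data, start=1):
        partial += kernel(x, y, h)
        lower = partial / n              # unseen points contribute at least 0
        upper = (partial + (n - m)) / n  # unseen points contribute at most 1
        if lower > threshold:
            return "high", m
        if upper < threshold:
            return "low", m
    return ("high" if partial / n > threshold else "low"), n

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
label_center, seen_center = classify_density(0.0, data, h=0.5, threshold=0.05)
label_far, seen_far = classify_density(6.0, data, h=0.5, threshold=0.05)
```

With these crude bounds the "low" decision arrives late, because every unseen point is assumed to contribute the maximum; tightening the upper bound per region is what a spatial index makes possible.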
TL;DR: In this article, the Fast Sum Updating approach is extended to the general multivariate case for general input data and a rectilinear evaluation grid, to a wider class of kernels including the triangular, cosine and Silverman kernels, and to a combination with a fast approximate k-nearest-neighbors bandwidth for multivariate datasets.
Abstract: Kernel density estimation and kernel regression are powerful but computationally expensive techniques: a direct evaluation of kernel density estimates at $M$ evaluation points given $N$ input sample points requires a quadratic $\mathcal{O}(MN)$ operations, which is prohibitive for large scale problems. For this reason, approximate methods such as binning with Fast Fourier Transform or the Fast Gauss Transform have been proposed to speed up kernel density estimation. Among these fast methods, the Fast Sum Updating approach is an attractive alternative, as it is an exact method and its speed is independent of the input sample and the bandwidth. Unfortunately, this method, based on data sorting, has for the most part been limited to the univariate case. In this paper, we revisit the fast sum updating approach and extend it in several ways. Our main contribution is to extend it to the general multivariate case for general input data and rectilinear evaluation grid. Other contributions include its extension to a wider class of kernels, including the triangular, cosine and Silverman kernels, its combination with parsimonious additive multivariate kernels, and its combination with a fast approximate k-nearest-neighbors bandwidth for multivariate datasets. Our numerical tests of multivariate regression and density estimation confirm the speed, accuracy and stability of the method. We hope this paper will renew interest for the fast sum updating approach and help solve large-scale practical density estimation and regression problems.
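The reason sorting enables exact and fast evaluation can be seen in the univariate case with the Epanechnikov kernel, whose polynomial form lets the kernel sum be expanded into sums of 1, x and x² over a window. The sketch below uses prefix sums and binary search, a variant in the spirit of fast sum updating rather than the paper's algorithm; the function names are assumptions.

```python
import bisect
import random
from itertools import accumulate

def epanechnikov_kde_sorted(data, queries, h):
    """Exact Epanechnikov KDE using sorted data and prefix sums."""
    xs = sorted(data)
    n = len(xs)
    s1 = [0.0] + list(accumulate(xs))                 # prefix sums of x
    s2 = [0.0] + list(accumulate(x * x for x in xs))  # prefix sums of x^2
    out = []
    for q in queries:
        lo = bisect.bisect_left(xs, q - h)
        hi = bisect.bisect_right(xs, q + h)
        cnt = hi - lo
        sum_x = s1[hi] - s1[lo]
        sum_x2 = s2[hi] - s2[lo]
        # Sum over the window of 1 - ((q - x) / h)^2, expanded in powers of x.
        total = cnt - (cnt * q * q - 2.0 * q * sum_x + sum_x2) / (h * h)
        out.append(0.75 * total / (n * h))
    return out

def epanechnikov_kde_direct(data, queries, h):
    """Reference implementation with one kernel evaluation per pair."""
    n = len(data)
    out = []
    for q in queries:
        total = 0.0
        for x in data:
            u = (q - x) / h
            if abs(u) <= 1.0:
                total += 0.75 * (1.0 - u * u)
        out.append(total / (n * h))
    return out

rng = random.Random(4)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
queries = [-3.0 + 0.1 * i for i in range(61)]
fast = epanechnikov_kde_sorted(data, queries, h=0.5)
slow = epanechnikov_kde_direct(data, queries, h=0.5)
max_diff = max(abs(a - b) for a, b in zip(fast, slow))
```

After the one-time sort, each query costs two binary searches regardless of the bandwidth, compared with a full pass over the data in the direct version.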
TL;DR: This paper investigates the problem of optimizing complex multivariate performance measures to learn classifiers for pattern classification problems and proposes to construct an optimal kernel by weighted linear combination of some candidate kernels.
Abstract: In this paper, we investigate the problem of optimizing complex multivariate performance measures to learn classifiers for pattern classification problems. For the first time, multi-kernel learning is considered to construct a classifier that optimizes a given nonlinear and non-smooth multivariate classifier performance measure. We estimate and optimize the upper bound of the given multivariate performance measure, instead of optimizing it directly. Moreover, to solve the problem of kernel function selection and kernel parameter tuning, we propose to construct an optimal kernel by a weighted linear combination of some candidate kernels. The learning of the classifier parameter and the kernel weight are unified in a single objective function that minimizes the upper bound of the given multivariate performance measure. The objective function is optimized with regard to the classifier parameter and the kernel weight alternately in an iterative algorithm. The developed algorithm is evaluated on two different pattern classification methods with regard to various multivariate performance measure optimization problems. The experimental results show that the proposed algorithm outperforms the competing methods.
TL;DR: The asymptotic normality of considered wavelet-based estimators, under easily verifiable conditions, is characterized, by means of the martingale approach.
Abstract: In the present paper, we are mainly concerned with the non parametric estimation of the density as well as the regression function by using orthonormal wavelet bases. We provide the strong uniform consistency properties with rates of these estimators, over compact subsets of , under a general ergodic condition on the underlying processes. We characterize the asymptotic normality of considered wavelet-based estimators, under easily verifiable conditions. The asymptotic properties of these estimators are obtained, by means of the martingale approach.
TL;DR: This article shows that decomposing the graphical model into a hierarchical structure reduces estimating a multivariate density function to the estimation of low-dimensional/conditional probabilities.
Abstract: Probabilistic graphical models, such as Bayesian networks, have recently gained attention in process monitoring and fault diagnosis. Their application, however, is limited to discrete or continuous Gaussian distributed variables, which results from the difficulty in efficiently estimating multivariate density functions. This article shows that decomposing the graphical model into a hierarchical structure reduces estimating a multivariate density function to the estimation of low-dimensional/conditional probabilities. These conditional density functions can be effectively estimated from data using a nonparametric kernel method and the low-dimensional densities can be estimated using a kernel density estimation (KDE). On the basis of the estimated densities, anomalous process behavior can be detected and diagnosed by examining which probability is lower than its corresponding confidence limit. Applications to simulated examples and an industrial blast furnace iron-making process show that the proposed metho...
TL;DR: Simulation tests show how ELMs can be successfully employed in the density estimation framework, as a possible alternative to other standard methods.
Abstract: In this paper, we discuss how the extreme learning machine (ELM) framework can be effectively employed in the unsupervised context of multivariate density estimation. In particular, two algorithms are introduced, one for the estimation of the cumulative distribution function underlying the observed data, and one for the estimation of the probability density function. The algorithms rely on the concept of $F$-discrepancy, which is closely related to the Kolmogorov–Smirnov criterion for goodness of fit. Both methods retain the key feature of the ELM of providing the solution through random assignment of the hidden feature map and a very light computational burden. A theoretical analysis is provided, discussing convergence under proper hypotheses on the chosen activation functions. Simulation tests show how ELMs can be successfully employed in the density estimation framework, as a possible alternative to other standard methods.
TL;DR: In this paper, a varying asymmetric kernel estimation of the density f for nonnegative data is proposed, regardless of whether f(0) = 0 or f(0) > 0.
Abstract: This paper considers a varying asymmetric kernel estimation of the density f for nonnegative data. Regardless of whether f(0) = 0 or f(0) > 0, it is important to give a good varying shape/scale parameter for the inverse gamma (IGam) kernel, due to the problem of f(0) = 0 in some existing literature. After reformulating the IGam kernel density estimator, asymptotic properties like mean integrated squared error, mean integrated absolute error, strong consistency, and asymptotic normality are investigated in detail, under some conditions on the target density f. Simulation studies are conducted to compare the proposed IGam kernel density estimators with the existing gamma kernel density estimators.
TL;DR: In this article, a new structural model for in-sample density forecasting is proposed, where the density is a product of one-dimensional functions with one function sitting on the scale of a transformed space of observations.
Abstract: In this paper we consider a new structural model for in-sample density forecasting. In-sample density forecasting is to estimate a structured density on a region where data are observed and then re-use the estimated structured density on some region where data are not observed. Our structural assumption is that the density is a product of one-dimensional functions with one function sitting on the scale of a transformed space of observations. The transformation involves another unknown one-dimensional function, so that our model is formulated via a known smooth function of three underlying unknown one-dimensional functions. We present an innovative way of estimating the one-dimensional functions and show that all the estimators of the three components achieve the optimal one-dimensional rate of convergence. We illustrate how one can use our approach by analyzing a real dataset, and also verify the tractable finite sample performance of the method via a simulation study.
TL;DR: A kernelized non-parametric classifier based on feature ranking in an anisotropic Gaussian kernel (KNR-AGK) is presented; it focuses on the selection of different bandwidths in kernel density estimation and performs better than a classifier based on Gaussian kernel density estimation.
TL;DR: A new class of p th-order kernels corresponding to new moments on the circle is introduced and two methods for constructing higher-order kernel density estimators are proposed and derived.
TL;DR: This work derives a variable kernel bandwidth estimator by minimizing the leave-one-out entropy objective function and shows that this estimator is capable of performing estimation in high-dimensional feature spaces with great success.
Abstract: Estimating the joint probability density function of a dataset is a central task in many machine learning applications. In this work we address the fundamental problem of kernel bandwidth estimation for variable kernel density estimation in high-dimensional feature spaces. We derive a variable kernel bandwidth estimator by minimizing the leave-one-out entropy objective function and show that this estimator is capable of performing estimation in high-dimensional feature spaces with great success. We compare the performance of this estimator to state-of-the-art maximum-likelihood estimators on a number of representative high-dimensional machine learning tasks and show that the newly introduced minimum leave-one-out entropy estimator performs optimally on a number of the high-dimensional datasets considered.
TL;DR: A class of non-parametric density estimators under Bayesian settings obtained by adaptively partitioning the sample space can adapt to the unknown smoothness of the true density function, thus achieving the optimal convergence rate without artificial conditions on the density.
Abstract: We study a class of non-parametric density estimators under Bayesian settings. The estimators are obtained by adaptively partitioning the sample space. Under a suitable prior, we analyze the concentration rate of the posterior distribution, and demonstrate that the rate does not directly depend on the dimension of the problem in several special cases. Another advantage of this class of Bayesian density estimators is that it can adapt to the unknown smoothness of the true density function, thus achieving the optimal convergence rate without artificial conditions on the density. We also validate the theoretical results on a variety of simulated data sets.
TL;DR: In this paper, the background material related to the nonparametric density estimation is described, and a short overview of the fundamental concepts related to histograms is presented, followed by a description of a smart extension of certain well-known histograms aimed at avoiding some of their drawbacks.
Abstract: This chapter describes the background material related to nonparametric density estimation. Techniques such as histograms (together with their extension, known as ASH, see Sect. 2.3), Parzen windows and k-nearest neighbors are at the core of the applications of nonparametric density estimation. For that reason, we decided to include a chapter describing these for the sake of completeness and to allow less experienced readers to develop their intuitions about nonparametric estimation. Most of the material is presented taking into account only the univariate case; extending the results to cover more than one variable, however, is often a straightforward task. The chapter is organized as follows: Sect. 2.2 presents a short overview of the fundamental concepts related to histograms. Section 2.3 is devoted to a description of a smart extension of certain well-known histograms aimed at avoiding some of their drawbacks. Section 2.4 presents basic concepts related to nonparametric density estimation. Section 2.5 is devoted to the Parzen windows, while Sect. 2.6 covers the k-nearest neighbors approach.
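The three univariate techniques named above can be compared side by side in a short sketch. These are textbook forms written for illustration, not code from the chapter; the function names, the rectangular Parzen window, and the parameter values are assumptions.

```python
import math
import random

def histogram_density(data, x, origin, width):
    """Histogram estimate: share of points in the bin containing x, over bin width."""
    k = math.floor((x - origin) / width)
    count = sum(1 for v in data if math.floor((v - origin) / width) == k)
    return count / (len(data) * width)

def parzen_density(data, x, h):
    """Parzen estimate with a rectangular window of half-width h centered at x."""
    count = sum(1 for v in data if abs(x - v) <= h)
    return count / (len(data) * 2.0 * h)

def knn_density(data, x, k):
    """k-nearest-neighbor estimate: k points over the interval that contains them."""
    r_k = sorted(abs(x - v) for v in data)[k - 1]
    if r_k == 0.0:
        return math.inf  # k or more points coincide with x
    return k / (len(data) * 2.0 * r_k)

rng = random.Random(5)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
est_hist = histogram_density(data, 0.0, origin=-0.25, width=0.5)
est_parzen = parzen_density(data, 0.0, h=0.25)
est_knn = knn_density(data, 0.0, k=200)
true_value = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal density at 0
```

The histogram fixes the bins in advance, the Parzen window fixes the width but centers it on the query, and the k-nearest-neighbor estimate fixes the count and lets the width adapt.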
TL;DR: In this article, an extension of the strong stability analysis in risk models using nonparametric kernel density estimation for the claim amounts is presented. Different kernel estimators are proposed for the density of claim amounts in the real model, and a simulation study is performed to numerically compare the approximation errors obtained using the different proposed kernel densities.
Abstract: This paper presents an extension of the strong stability analysis in risk models using nonparametric kernel density estimation for the claim amounts. First, we detail the application of the strong stability method in risk models realized by V. Kalashnikov in 2000. In particular, we investigate the conditions and the approximation error of the real model, in which the probability distribution of the claim amounts is not known, by the classical risk model with exponentially distributed claim sizes. Using the nonparametric approach, we propose different kernel estimators for the density of claim amounts in the real model. A simulation study is performed to numerically compare between the approximation errors (stability bounds) obtained using the different proposed kernel densities.
TL;DR: A multivariate distribution fitting methodology is introduced which, combined with multivariate random data sampling, provides a global yield estimation approach; compared with the simple failure counts method, the estimation variance of the proposed method is two times smaller.
Abstract: The standard multivariate metrics for semiconductor product yield estimation and prediction in production processes usually assume that the parameters contributing to the yield are all normally distributed. However, the data met in production processes is not always multivariate normal. A variety of methods has been developed for multivariate non-normal data, but these usually rely on no statistical information, address only a specific type of multivariate distributions, or become very time consuming from the point of view of the computational cost. Moreover, the sample size of the multivariate data is often insufficient, as only a limited number of measurements are affordable. This results in inaccurate product yield estimation and high variance of the estimates. In this paper, a multivariate distribution fitting methodology is introduced, which, combined with multivariate random data sampling provides a global yield estimation approach. Compared with the simple failure counts method the estimation variance of the proposed method is two times smaller.
TL;DR: In this paper, the authors consider the GARCH-type model S = σ²Z where σ² and Z are independent random variables, and they construct adaptive and non-adaptive wavelet estimators for the derivative of the density and obtain sharp upper bounds on their mean integrated squared errors.
Abstract: We consider the GARCH-type model S = σ²Z where σ² and Z are independent random variables. We assume that the density of σ² is unknown with support [0, 1] but differentiable whereas the density fS of S is bounded. We will also assume that the probability density function of the random variable Z is known and has the same distribution as the ν-fold product of independent random variables uniformly distributed on the interval [0, 1]. We want to estimate the derivative of the density of σ² from n independent and identically distributed observations of S. We will construct adaptive and non-adaptive wavelet estimators for the derivative of the density and obtain sharp upper bounds on their mean integrated squared errors.
TL;DR: In this paper, the results of a comparison of the most common optimization methods for the nonparametric Rosenblatt–Parzen estimate of the probability density are presented.
Abstract: The results of a comparison of the most common optimization methods for the nonparametric Rosenblatt–Parzen estimate of the probability density are presented. To select the optimal values of the blur coefficients (bandwidths) of the kernel functions, two criteria are used: the minimum of the standard deviation of the nonparametric estimate of the probability density and the maximum of the likelihood function.
TL;DR: By considering histogram data as a numerical sequence, a simple approach for PDF estimation is presented, and it is shown that the proposed approach is as accurate as kernel-based estimators, widely adopted in the statistical literature.
Abstract: In this paper, we discuss nonparametric estimation of the probability density function (PDF) of a univariate random variable. This problem has been the subject of a vast amount of scientific literature in many domains, while statisticians are mainly interested in the analysis of the properties of proposed estimators, and engineers treat the histogram as a ready-to-use tool for a data set analysis. By considering histogram data as a numerical sequence, a simple approach for PDF estimation is presented in this paper. It is based on basic notions related to the reconstruction of a continuous-time signal from a sequence of samples. When estimating continuous PDFs, it is shown that the proposed approach is as accurate as kernel-based estimators, widely adopted in the statistical literature. Conversely, it can provide better accuracy when the PDF to be estimated exhibits a discontinuous behavior. The main statistical properties of the proposed estimators are derived and then verified by simulations related to the common cases of normal and uniform density functions. The obtained results are also used to derive optimal, i.e., minimum integral of the mean square error, estimators.
TL;DR: A generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters is derived and an adaptive data-driven strategy for choosing the kernel bandwidth is analyzed.
Abstract: We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of a non-parametric nature.