TL;DR: A new R package ks for multivariate kernel smoothing is introduced, containing functionality for kernel density estimation and kernel discriminant analysis and implementing a wide range of data-driven diagonal and unconstrained bandwidth selectors.
Abstract: Kernel smoothing is one of the most widely used non-parametric data smoothing techniques. We introduce a new R package ks for multivariate kernel smoothing. Currently it contains functionality for kernel density estimation and kernel discriminant analysis. It is a comprehensive package for bandwidth matrix selection, implementing a wide range of data-driven diagonal and unconstrained bandwidth selectors.
TL;DR: A novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed, modifying a nonparametric density estimate with a variable kernel to yield a robust local density estimation.
Abstract: Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First we modify a nonparametric density estimate with a variable kernel to yield a robust local density estimation. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Our experiments performed on several simulated data sets have demonstrated that the proposed approach can outperform two widely used outlier detection algorithms (LOF and LOCI).
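The comparison of a point's local density with that of its neighbors can be sketched as follows. This is a minimal one-dimensional illustration, not the paper's estimator: the Gaussian kernel, the use of each sample's k-nearest-neighbor distance as its bandwidth, and the score (mean neighbor density divided by own density) are all assumptions made for the example.

```python
import math

def knn_info(points, k):
    """For each point: (distance to its k-th nearest neighbour, indices of its k nearest neighbours)."""
    info = []
    for i, p in enumerate(points):
        others = sorted((abs(p - q), j) for j, q in enumerate(points) if j != i)
        info.append((others[k - 1][0], [j for _, j in others[:k]]))
    return info

def local_densities(points, k):
    """Variable-bandwidth Gaussian KDE: the kernel on sample j uses that sample's k-NN distance."""
    info = knn_info(points, k)
    n = len(points)
    dens = []
    for x in points:
        total = 0.0
        for xj, (hj, _) in zip(points, info):
            h = max(hj, 1e-12)  # guard against zero bandwidth for duplicated samples
            total += math.exp(-0.5 * ((x - xj) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
        dens.append(total / n)
    return dens, info

def outlier_scores(points, k=3):
    """Score = mean density of the k nearest neighbours / own density (assumed convention).
    Values well above 1 mean the point sits in a much sparser region than its neighbours."""
    dens, info = local_densities(points, k)
    scores = []
    for i, (_, nbrs) in enumerate(info):
        nbr_mean = sum(dens[j] for j in nbrs) / len(nbrs)
        scores.append(nbr_mean / dens[i])
    return scores

# Hypothetical data: a tight cluster plus one isolated value.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0]
scores = outlier_scores(data, k=3)
most_outlying = scores.index(max(scores))
```

In this toy data set the isolated value receives by far the largest score, while points inside the cluster score close to 1.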
TL;DR: This method is shown to outperform a current state-of-the-art incremental one-class learning algorithm (Incremental SVDD) on a variety of datasets, while requiring only an upper limit on model complexity to be specified.
Abstract: An incremental one-class learning algorithm is proposed for the purpose of outlier detection. Outliers are identified by estimating - and thresholding - the probability distribution of the training data. In the early stages of training a non-parametric estimate of the training data distribution is obtained using kernel density estimation. Once the number of training examples reaches the maximum computationally feasible limit for kernel density estimation, we treat the kernel density estimate as a maximally-complex Gaussian mixture model, and keep the model complexity constant by merging a pair of components for each new kernel added. This method is shown to outperform a current state-of-the-art incremental one-class learning algorithm (Incremental SVDD [5]) on a variety of datasets, while requiring only an upper limit on model complexity to be specified.
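The idea of holding model complexity constant by merging a pair of components for each new kernel can be sketched in one dimension. The abstract does not say how the pair is chosen or how it is merged, so two assumptions are made here: the two components with the closest means are merged, and the merge matches the first two moments of the pair. The function names are hypothetical.

```python
import math

def merge_components(a, b):
    """Merge two weighted 1-D Gaussians (weight, mean, variance) by moment matching."""
    wa, ma, va = a
    wb, mb, vb = b
    w = wa + wb
    m = (wa * ma + wb * mb) / w
    # Second moment of the pair minus the squared merged mean.
    v = (wa * (va + ma * ma) + wb * (vb + mb * mb)) / w - m * m
    return (w, m, v)

def add_sample(components, x, bandwidth, max_components):
    """Add a kernel centred at x; if the budget is exceeded, merge the two closest means."""
    components = components + [(1.0, x, bandwidth ** 2)]
    if len(components) > max_components:
        components.sort(key=lambda c: c[1])
        i = min(range(len(components) - 1),
                key=lambda j: components[j + 1][1] - components[j][1])
        components[i:i + 2] = [merge_components(components[i], components[i + 1])]
    return components

def density(components, x):
    """Evaluate the normalised mixture density at x."""
    total_w = sum(w for w, _, _ in components)
    return sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
               for w, m, v in components) / total_w

# Hypothetical stream of training values with a budget of four components.
stream = [0.0, 0.2, 0.1, 3.0, 3.1, 0.15, 2.9]
model = []
for value in stream:
    model = add_sample(model, value, bandwidth=0.3, max_components=4)

total_weight = sum(w for w, _, _ in model)
mixture_mean = sum(w * m for w, m, _ in model) / total_weight
```

Moment matching keeps the total weight and the overall mean of the mixture unchanged, so the compressed model still summarizes all seven samples with only four components. A point would then be flagged as an outlier when `density(model, x)` falls below a chosen threshold.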
TL;DR: Using a modification of a recently developed nonparametric regression framework called rodeo, a method to greedily select bandwidths in a kernel density estimate is proposed to achieve near optimal minimax rates of convergence, and thus avoids the curse of dimensionality.
Abstract: We consider the problem of estimating the joint density of a d-dimensional random vector $X = (X_1, X_2, \ldots, X_d)$ when d is large. We assume that the density is a product of a parametric component and a nonparametric component which depends on an unknown subset of the variables. Using a modification of a recently developed nonparametric regression framework called rodeo (regularization of derivative expectation operator), we propose a method to greedily select bandwidths in a kernel density estimate. It is shown empirically that the density rodeo works well even for very high dimensional problems. When the unknown density function satisfies a suitably defined sparsity condition, and the parametric baseline density is smooth, the approach is shown to achieve near optimal minimax rates of convergence, and thus avoids the curse of dimensionality.
TL;DR: In this paper, a combination of a combined parametric-nonparametric method and I 2-optimality was used to estimate the probability of misclassification of heavy-tailed distributions.
Abstract: Preface
1 Definitions and rough detection of tail heaviness: 1.1 Definitions and basic properties of classes of heavy-tailed distributions; 1.2 Tail index estimation; 1.2.1 Estimators of a positive-valued tail index; 1.2.2 The choice of k in Hill's estimator; 1.2.3 Estimators of a real-valued tail index; 1.2.4 On-line estimation of the tail index; 1.3 Detection of tail heaviness and dependence; 1.3.1 Rough tests of tail heaviness; 1.3.2 Analysis of Web traffic and TCP flow data; 1.3.3 Dependence detection from univariate data; 1.3.4 Dependence detection from bivariate data; 1.3.5 Bivariate analysis of TCP flow data; 1.4 Notes and comments; 1.5 Exercises
2 Classical methods of probability density estimation: 2.1 Principles of density estimation; 2.2 Methods of density estimation; 2.2.1 Kernel estimators; 2.2.2 Projection estimators; 2.2.3 Spline estimators; 2.2.4 Smoothing methods; 2.2.5 Illustrative examples; 2.3 Kernel estimation from dependent data; 2.3.1 Statement of the problem; 2.3.2 Numerical calculation of the bandwidth; 2.3.3 Data-driven selection of the bandwidth; 2.4 Applications; 2.4.1 Finance: evaluation of market risk; 2.4.2 Telecommunications; 2.4.3 Population analysis; 2.5 Exercises
3 Heavy-tailed density estimation: 3.1 Problems of the estimation of heavy-tailed densities; 3.2 Combined parametric-nonparametric method; 3.2.1 Nonparametric estimation of the density by structural risk minimization; 3.2.2 Illustrative examples; 3.2.3 Web data analysis by a combined parametric-nonparametric method; 3.3 Barron's estimator and I 2-optimality; 3.4 Kernel estimators with variable bandwidth; 3.5 Retransformed nonparametric estimators; 3.6 Exercises
4 Transformations and heavy-tailed density estimation: 4.1 Problems of data transformations; 4.2 Estimates based on a fixed transformation; 4.3 Estimates based on an adaptive transformation; 4.3.1 Estimation algorithm; 4.3.2 Analysis of the algorithm; 4.3.3 Further remarks; 4.4 Estimating the accuracy of retransformed estimates; 4.5 Boundary kernels; 4.6 Accuracy of a nonvariable bandwidth kernel estimator; 4.7 The D method for a nonvariable bandwidth kernel estimator; 4.8 The D method for a variable bandwidth kernel estimator; 4.8.1 Method and results; 4.8.2 Application to Web traffic characteristics; 4.9 The I 2 method for the projection estimator; 4.10 Exercises
5 Classification and retransformed density estimates: 5.1 Classification and quality of density estimation; 5.2 Convergence of the estimated probability of misclassification; 5.3 Simulation study; 5.4 Application of the classification technique to Web data analysis; 5.4.1 Intelligent browser; 5.4.2 Web data analysis by traffic classification; 5.4.3 Web prefetching; 5.5 Exercises
6 Estimation of high quantiles: 6.1 Introduction; 6.2 Estimators of high quantiles; 6.3 Distribution of high quantile estimates; 6.4 Simulation study; 6.4.1 Comparison of high quantile estimates in terms of relative bias and mean squared error; 6.4.2 Comparison of high quantile estimates in terms of confidence intervals; 6.5 Application to Web traffic data; 6.6 Exercises
7 Nonparametric estimation of the hazard rate function: 7.1 Definition of the hazard rate function; 7.2 Statistical regularization method; 7.3 Numerical solution of ill-posed problems; 7.4 Estimation of the hazard rate function of heavy-tailed distributions; 7.5 Hazard rate estimation for compactly supported distributions; 7.5.1 Estimation of the hazard rate from the simplest equations; 7.5.2 Estimation of the hazard rate from a special kernel equation; 7.6 Estimation of the ratio of hazard rates; 7.6.1 Failure time detection; 7.6.2 Hormesis detection; 7.7 Hazard rate estimation in teletraffic theory; 7.7.1 Teletraffic processes at the packet level; 7.7.2 Estimation of the intensity of a nonhomogeneous Poisson process; 7.8 Semi-Markov modeling in teletraffic engineering; 7.8.1 The Gilbert-Elliott model; 7.8.2 Estimation of a retrial process; 7.9 Exercises
8 Nonparametric estimation of the renewal function: 8.1 Traffic modeling by recurrent marked point processes; 8.2 Introduction to renewal function estimation; 8.3 Histogram-type estimator of the renewal function; 8.4 Convergence of the histogram-type estimator; 8.5 Selection of k by a bootstrap method; 8.6 Selection of k by a plot; 8.7 Simulation study; 8.8 Application to the inter-arrival times of TCP connections; 8.9 Conclusions and discussion; 8.10 Exercises
Appendices: A Proofs of Chapter 2; B Proofs of Chapter 4; C Proofs of Chapter 5; D Proofs of Chapter 6; E Proofs of Chapter 7; F Proofs of Chapter 8
List of Main Symbols and Abbreviations; References; Index
TL;DR: In this article, the authors proposed two weighted kernel density estimators (WKDEs) for use with stratified random sampling to obtain unbiased estimates of space use in home-range and habitat-use studies.
Abstract: Ecologists and wildlife biologists have long recognized the importance of random sampling but have largely used haphazard (i.e., nonrandom) designs for collecting location data for home-range and habitat-use studies. Using simulated movement paths, I illustrate the importance of random sampling in obtaining unbiased estimates of space use in home-range and habitat-use studies. Stratified random sampling will typically be more time efficient and easier to implement than simple random sampling. Therefore, I propose 2 weighted kernel density estimators (WKDEs) for use with stratified designs. Simulations indicate that these weighted estimators perform considerably better than traditional kernel density estimators when observations are sampled nonuniformly in time. Lastly, I illustrate the use of WKDEs to analyze data for a female northern white-tailed deer (Odocoileus virginianus) collected using Global Positioning Systems with seasonally varying intensity levels. By correcting for nonuniform sampling intensities, these estimators may provide a more accurate description of space use over the fixed study period.
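A weighted kernel density estimate that corrects for nonuniform sampling intensity can be sketched as below. The abstract does not give the form of the two proposed estimators, so this is a generic weighted KDE under an assumed scheme: each observation is weighted by the inverse of the sampling intensity of its stratum. The one-dimensional locations, stratum labels, and intensities are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def weighted_kde(x, samples, weights, bandwidth):
    """Weighted Gaussian KDE at x; weights are normalised so the estimate integrates to one."""
    total = sum(weights)
    return sum(w * gaussian_kernel((x - s) / bandwidth)
               for s, w in zip(samples, weights)) / (total * bandwidth)

# Hypothetical 1-D locations: "day" fixes are collected four times as often as "night" fixes,
# although the animal spends equal time at both sites.
locations = [-0.3, -0.1, 0.1, 0.3, 10.0]
strata = ["day", "day", "day", "day", "night"]
intensity = {"day": 4.0, "night": 1.0}

# Assumed weighting: inverse of the stratum's sampling intensity.
weights = [1.0 / intensity[s] for s in strata]
uniform = [1.0] * len(locations)

h = 0.5
weighted_at_night_site = weighted_kde(10.0, locations, weights, h)
unweighted_at_night_site = weighted_kde(10.0, locations, uniform, h)

# Riemann-sum check that the weighted estimate is still a density.
grid = [-5.0 + 0.01 * i for i in range(2001)]
total_mass = sum(weighted_kde(x, locations, weights, h) for x in grid) * 0.01
```

With equal weights the night site receives one fifth of the estimated mass because it was sampled less often. With inverse-intensity weights it receives half, which matches the time actually spent there in this toy setup.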
TL;DR: In this article, a nonparametric test for checking parametric hypotheses about a multivariate density f of independent identically distributed random vectors Z1, Z2, which are observed under additional noise with density ψ is proposed.
TL;DR: Kernel density estimation proves capable of providing more flexible modelling of the emission probabilities and, unlike Gaussian mixtures, does not suffer from being highly parametric and of difficult initialisation.
Abstract: In this paper, we present a modified hidden Markov model with emission probabilities modelled by kernel density estimation and its use for activity recognition in videos. In the proposed approach, kernel density estimation of the emission probabilities is operated simultaneously with that of all the other model parameters by an adapted Baum-Welch algorithm. This allows us to retain maximum-likelihood estimation while overcoming the known limitations of mixture of Gaussians in modelling certain probability distributions. Experiments on activity recognition have been performed on ground-truthed data from the CAVIAR video surveillance database and reported in the paper. The error on the training and validation sets with kernel density estimation remains around 14-16% while for the conventional Gaussian mixture approach varies between 15 and 24%, strongly depending on the initial values chosen for the parameters. Overall, kernel density estimation proves capable of providing more flexible modelling of the emission probabilities and, unlike Gaussian mixtures, does not suffer from being highly parametric and of difficult initialisation.
TL;DR: In this article, a new approach to shape-preserving estimation of cumulative distribution functions and probability density functions using the wavelet methodology for multivariate dependent data is introduced, which preserves shape constraints such as monotonicity, positivity and integration to one.
Abstract: We introduce a new approach to shape-preserving estimation of cumulative distribution functions and probability density functions using the wavelet methodology for multivariate dependent data. Our estimators preserve shape constraints such as monotonicity, positivity and integration to one, and allow for low spatial regularity of the underlying functions. We discuss conditional quantile estimation for financial time series data as an application. Our methodology can be implemented with B-splines. We show by means of Monte Carlo simulations that it performs well in finite samples and for a data-driven choice of the resolution level.
TL;DR: Kernel estimation provides tuneable smooth pictures of probability density functions and event intensity functions that permit examination of broad features and fine structure, are readily produced with modest computational effort, and are essentially free of artefacts arising from binning.
Abstract: Kernel estimation, based on the convolution of a probability density function with a set of magnitudes or event dates, provides tuneable smooth pictures of probability density functions and event intensity functions. Such pictures are in several respects superior to those provided by histograms, box plots, cumulative distributions or raw plots. They permit examination of broad features and fine structure, are readily produced with modest computational effort and are essentially free of artefacts arising from binning. Examples are given using data on cirque lengths, limestone pavements, glacier areas and dated flood deposits. The technique deserves widespread use in geomorphology and allied sciences.
TL;DR: In this paper, a characterization is given of the kernels for which the parametric mean integrated squared error (MISE) rate $n^{-1}$ may be obtained, where $n$ is the sample size.
Abstract: In the context of kernel density estimation, we give a characterization of the kernels for which the parametric mean integrated squared error (MISE) rate $n^{-1}$ may be obtained, where $n$ is the sample size. Also, for the cases where this rate is attainable, we give an asymptotic bandwidth choice that makes the kernel estimator consistent in mean integrated squared error at that rate and a numerical example showing the superior performance of the superkernel estimator when the bandwidth is properly chosen. †Research supported by Spanish Ministerio de Ciencia y Tecnologia project MTM2005-06348.
TL;DR: Kotlyarova and Zinde-Walsh provide asymptotic results on kernel estimation of a continuous density for an arbitrary bandwidth/kernel pair and derive the limit joint distribution of kernel density estimators corresponding to different bandwidths and kernel functions.
Abstract: Results on non-parametric kernel estimators of density differ according to the assumed degree of density smoothness. A kernel/bandwidth pair that was optimal for a twice differentiable function may not be suitable when the density is piecewise linear. If there is uncertainty about the degree of smoothness, an inappropriate choice may lead to under- or oversmoothing. To examine various possible outcomes we provide asymptotic results on kernel estimation of a continuous density for an arbitrary bandwidth/kernel pair and derive the limit joint distribution of kernel density estimators corresponding to different bandwidths and kernel functions. Using these results, we propose a combined estimator constructed as an optimal linear combination of several estimators with different bandwidth/kernel pairs. Its theoretical properties [Kotlyarova, Y. and Zinde-Walsh, V., 2006, Non- and semi-parametric estimation in models with unknown smoothness. Economics Letters, 93, 379–386] are such that it automatically attains ...
TL;DR: A new type of reweighted kernel density estimator is proposed in which the weights are defined by a cubic spline on the logit scale, and the free parameters of this spline are optimized with respect to a leave-one-out performance criterion.
TL;DR: The proposed steganalysis scheme can successfully attack steganographic tools like Jsteg and JP Hide and Seek as well and can distinguish between the quantized-cover and the QIM-stego with low false alarm rates.
Abstract: This paper presents a novel steganalysis technique to attack quantization index modulation (QIM) steganography. Our method is based on the observation that QIM embedding disturbs neighborhood correlation in the transform domain. We estimate the probability density function (pdf) of this statistical change in a systematic manner using a kernel density estimate (KDE) method. The estimated parametric density model is then used for stego message detection. The impact of the choice of kernels on the estimated density is investigated experimentally. Simulation results evaluated on a large dataset of 6000 quantized images indicate that the proposed method is reliable. The impact of the choice of message embedding parameters on the accuracy of the steganalysis detection is also evaluated. Simulation results show that the proposed method can distinguish between the quantized-cover and the QIM-stego with low false alarm rates (i.e. Pfn≤0.03 and Pfp≤0.19). We demonstrate that the proposed steganalysis scheme can successfully attack steganographic tools like Jsteg and JP Hide and Seek as well.
TL;DR: In this paper, the existence of conditional least squares and conditional likelihood estimators is proved and their consistency and their asymptotic normality are established, and kernel estimators of the noise's density and its derivatives are defined and shown to be uniformly consistent.
Abstract: Parameter estimation in a class of heteroscedastic time series models is investigated. The existence of conditional least-squares and conditional likelihood estimators is proved. Their consistency and their asymptotic normality are established. Kernel estimators of the noise's density and its derivatives are defined and shown to be uniformly consistent. A simulation experiment conducted shows that the estimators perform well for large sample size.
TL;DR: In this paper, a p-order multivariate kernel density model based on kernel density theory has been developed for synthetic generation of multivariate variables, which is more flexible than conventional parametric models used in stochastic hydrology.
TL;DR: A method to automatically determine the widths of Gaussians by considering the information available locally at a data point has been proposed.
Abstract: In this paper we propose a density based clustering approach. A kernel based density estimation technique is used to estimate the density of the given data set using a Gaussian kernel. Generally, a fixed width parameter is used for all the Gaussians in such methods. Here, a method to automatically determine the widths of Gaussians by considering the information available locally at a data point has been proposed. Cluster boundary information is subsequently extracted from the estimated density of the data. The performance of the proposed method is demonstrated on several data sets. Studies comparing the performance of the proposed method with that of DBSCAN and SVC are also presented.
TL;DR: This work builds on prior results on probability estimation, specializes them to uniform distributions to obtain sampling rules for support size estimation, and considers text classification.
Abstract: We build on prior results on probability estimation obtained in [1]. We specialize the results to uniform distributions in order to obtain sampling rules for support size estimation. We consider text classification, and show that the estimators developed for probability estimation can improve current state-of-the-art techniques.
TL;DR: Mixture models are proposed, as an alternative to kernel density estimation, to reduce model complexity and computational effort when monitoring non-normally distributed data with principal component analysis (PCA).
Abstract: The issue of monitoring non-normally distributed data with principal component analysis (PCA) is addressed through the application of density estimation for evaluating the quality of the principal component scores. Although kernel density estimation has been previously cited as a method for monitoring such data, mixture models are proposed here in order to reduce model complexity and computational effort. Furthermore, several adaptation strategies for the density estimators are developed and suggestions are provided on their use. A rapid thermal anneal case study demonstrates how the estimators outperform the traditional Hotelling's T2 statistic due to the presence of a first wafer effect.
TL;DR: A supervised density-based clustering named SCDE is introduced and discussed in detail, which forms clusters by associating data points with supervised density attractors which represent maxima and minima of a supervised density function.
Abstract: The basic idea of traditional density estimation is to model the overall point density analytically as the sum of influence functions of data points. However, traditional density estimation techniques only consider the location of a point. Supervised density estimation techniques, on the other hand, additionally consider a variable of interest that is associated with a point. Density in supervised density estimation is measured as the product of an influence function with the variable of interest. Based on this novel idea, a supervised density-based clustering named SCDE is introduced and discussed in detail. The SCDE algorithm forms clusters by associating data points with supervised density attractors which represent maxima and minima of a supervised density function.
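The supervised density function described above, a sum over data points of an influence function multiplied by the variable of interest, can be sketched in one dimension. This is not the SCDE algorithm itself, which associates data points with attractors. The Gaussian influence function, the grid search for extrema, and the data are assumptions chosen for illustration.

```python
import math

def supervised_density(x, points, values, sigma):
    """Sum of Gaussian influence functions, each scaled by its point's variable of interest."""
    return sum(z * math.exp(-((x - p) ** 2) / (2.0 * sigma ** 2))
               for p, z in zip(points, values))

def grid_extrema(points, values, sigma, lo, hi, steps):
    """Locate local maxima and minima of the supervised density on a regular grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [supervised_density(x, points, values, sigma) for x in xs]
    maxima, minima = [], []
    for i in range(1, steps):
        if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]:
            maxima.append(xs[i])
        if ys[i] < ys[i - 1] and ys[i] < ys[i + 1]:
            minima.append(xs[i])
    return maxima, minima

# Hypothetical data: positive values of the variable of interest near x = 1,
# negative values near x = 4.
points = [0.8, 1.0, 1.2, 3.8, 4.0, 4.2]
values = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

maxima, minima = grid_extrema(points, values, sigma=0.5, lo=0.0, hi=5.0, steps=500)
```

Because the variable of interest can be negative, the function has minima as well as maxima, and both kinds of extrema act as candidate attractors: here one maximum near the positive group and one minimum near the negative group.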
TL;DR: In this article, a greedy forward selection procedure using updates of the orthogonal decomposition in an order-recursive manner was proposed for sparse kernel density estimates by regression of the empirical cumulative density function.
Abstract: A novel training algorithm for sparse kernel density estimates by regression of the empirical cumulative density function (ECDF) is presented. It is shown how an overdetermined linear least-squares problem may be solved by a greedy forward selection procedure using updates of the orthogonal decomposition in an order-recursive manner. We also present a method for improving the accuracy of the estimated models which uses output-sensitive computation of the ECDF. Experiments show the superior performance of our proposed method compared to state-of-the-art density estimation methods such as Parzen windows, Gaussian Mixture Models, and ε-Support Vector Density models [1].
TL;DR: This study considers the nonparametric estimation of a regression function when the response variable is the waiting time between two consecutive events of a stationary renewal process, and where this variable is not completely observed.
Abstract: This study considers the nonparametric estimation of a regression function when the response variable is the waiting time between two consecutive events of a stationary renewal process, and where this variable is not completely observed. In these circumstances, our data are the recurrence times from the occurrence of the last event up to a pre-established time, along with the corresponding values of a certain set of covariates. Estimation of the error density function and some of its characteristics are also considered. For the proposed estimators, we first analyze their asymptotic behavior and, thereafter, carry out a simulation study to highlight their behavior in finite samples. Finally, we apply this methodology to an illustrative example with biomedical data.
TL;DR: For semi-recursive kernel-type estimates of functions depending on multivariate density functionals and their derivatives, convergence with probability one is proved and the main parts of the asymptotic mean square errors are found.
TL;DR: Bianchi et al. considered the problem of density and drift estimation from the observation of a trajectory of an $R^{d}$-valued homogeneous diffusion process with a unique invariant density.
TL;DR: The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima.
Abstract: A method is proposed for semiparametric estimation where parametric and non-parametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
TL;DR: For a homogeneous and uniformly ergodic Markov chain with transition kernel \(P(x, A) = \int_{A} f(y|x)\hbox{d}y, x \in E \subset R^{d}\), sufficient conditions for strong consistency are obtained for estimates based on kernel density estimators.
Abstract: For a homogeneous and uniformly ergodic Markov chain, with transition kernel \(P(x, A) = \int_{A} f(y|x)\hbox{d}y, x \in E \subset R^{d}\), we analyse some reliability measures and failure rates associated with the transition probabilities. Sufficient conditions for strong consistency are obtained for estimates based on kernel density estimators.
TL;DR: In this paper, the authors considered the problem of testing for a change of the marginal density of a strictly stationary sequence, which is either associated or negatively associated, and established a functional central limit theorem for the kernel density estimator under appropriate conditions.
Abstract: In this paper, we consider the problem of testing for a change of the marginal density of a strictly stationary sequence $\{X_n, n \ge 1\}$, which is either associated or negatively associated. The test statistic is constructed based on the sequential kernel estimate of the density function. We first establish a functional central limit theorem for the kernel density estimator under appropriate conditions. Then, we show that the limiting distribution of the test statistic is a functional of independent Brownian bridges.
TL;DR: Kernel-based non-parametric density estimation methods are considered and formulae for variable kernel density estimation using generalized, elliptic Gaussian kernels are derived.
Abstract: Estimation of probability density functions based on available data is an important problem arising in various fields, such as telecommunications, machine learning, data mining, pattern recognition and computer vision. In this paper, we consider kernel-based non-parametric density estimation methods and derive formulae for variable kernel density estimation using generalized, elliptic Gaussian kernels. The proposed technique is verified on simulated data.
TL;DR: In this paper, the authors considered a multivariate density model where they estimate the excess mass of the unknown probability density at a given level from the i.i.d. observed random variables.
Abstract: We consider a multivariate density model where we estimate the excess mass of the unknown probability density $f$ at a given level $\nu>0$ from $n$ i.i.d. observed random variables. This problem has several applications such as multimodality testing, density contour clustering, anomaly detection, classification and so on. For the first time in the literature we estimate the excess mass as an integrated functional of the unknown density $f$. We suggest an estimator and evaluate its rate of convergence, when $f$ belongs to general Besov smoothness classes, for several risk measures. Particular care is devoted to the implementation and numerical study of the procedure. It appears that our procedure improves on the plug-in estimator of the excess mass.
TL;DR: Experimental results show that with the novel kernel density estimator the proposed predictor has been able to outperform the state-of-art predictors currently available and prediction accuracy will continue to increase in the future as the size of the protein structure database keeps growing.
Abstract: Though prediction of protein secondary structures has been an active research issue in bioinformatics for quite a few years and many approaches have been proposed, a new challenge emerges as the sizes of contemporary protein structure databases such as the Protein Data Bank (PDB) continue to grow exponentially. The new challenge concerns how to effectively exploit the huge amount of structural information deposited in large protein structure databases and deliver ever-improving accuracy as the sizes of the databases continue to grow. This new challenge is addressed in this article by resorting to a kernel density estimation based approach. The kernel density estimator proposed in this article is distinctive in that the pointwise MSE (mean square error) of its basic form converges at $O(n^{-2/3})$ regardless of the dimension of the vector space, where $n$ is the number of instances in the training dataset. In addition, just like many conventional kernel density estimators, it features average $O(n \log n)$ time complexity for generating the approximation function. The experimental results show that with the novel kernel density estimator the proposed predictor has been able to outperform the state-of-art predictors currently available. Experimental results further reveal that prediction accuracy delivered by the proposed predictor will continue to increase in the future as the size of the protein structure database keeps growing.