TL;DR: This work provides a first in-depth analysis comparing batch-incremental and instance-incremental methods, including how they adapt to concept drift, and an extensive empirical study comparing several versions of each approach.
Abstract: Many real world problems involve the challenging context of data streams, where classifiers must be incremental: able to learn from a theoretically-infinite stream of examples using limited time and memory, while being able to predict at any point. Two approaches dominate the literature: batch-incremental methods that gather examples in batches to train models; and instance-incremental methods that learn from each example as it arrives. Typically, papers in the literature choose one of these approaches, but provide insufficient evidence or references to justify their choice. We provide a first in-depth analysis comparing both approaches, including how they adapt to concept drift, and an extensive empirical study to compare several different versions of each approach. Our results reveal the respective advantages and disadvantages of the methods, which we discuss in detail.
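The difference between the two families can be sketched with a toy learner. The code below is only an illustration, not any method evaluated in the paper: the one-dimensional nearest-centroid model, the class names and the `batch_size` parameter are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


class InstanceIncrementalCentroid:
    """Toy nearest-centroid classifier updated with every single example."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def learn_one(self, x, y):
        # The model changes immediately; nothing is buffered.
        self.sums[y] += x
        self.counts[y] += 1

    def predict_one(self, x):
        if not self.counts:
            return None
        return min(self.counts,
                   key=lambda c: abs(x - self.sums[c] / self.counts[c]))


class BatchIncrementalCentroid:
    """Toy nearest-centroid classifier rebuilt from each full batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.centroids = {}

    def learn_one(self, x, y):
        # Examples are buffered; the model is replaced only when the batch is full.
        self.buffer.append((x, y))
        if len(self.buffer) == self.batch_size:
            sums, counts = defaultdict(float), defaultdict(int)
            for xi, yi in self.buffer:
                sums[yi] += xi
                counts[yi] += 1
            self.centroids = {c: sums[c] / counts[c] for c in counts}
            self.buffer = []

    def predict_one(self, x):
        if not self.centroids:
            return None
        return min(self.centroids, key=lambda c: abs(x - self.centroids[c]))
```

In this sketch the instance-incremental learner can predict after its first example, while the batch-incremental learner returns `None` until its first batch completes and afterwards forgets everything older than the last batch, which is one simple way such a learner can follow drift.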
TL;DR: An improved particle swarm optimization (IPSO) algorithm using the opposite sign test (OST) is proposed, which increases population diversity in the PSO mechanism and avoids trapping in local optima by improving the jump ability of flying particles.
Abstract: Searching for an optimal feature subset in a high-dimensional feature space is an NP-complete problem; hence, traditional optimization algorithms are inefficient when solving large-scale feature selection problems. Therefore, meta-heuristic algorithms have been extensively adopted to solve the feature selection problem efficiently. This study proposes an improved particle swarm optimization (IPSO) algorithm using the opposite sign test (OST). The test increases population diversity in the PSO mechanism and avoids trapping in local optima by improving the jump ability of flying particles. Data sets collected from the UCI machine learning databases are used to evaluate the effectiveness of the proposed approach. Classification accuracy is employed as the criterion to evaluate classifier performance. Results show that the proposed approach outperforms both genetic algorithms and sequential search algorithms.
TL;DR: Three state-of-the-art non-parametric regression methods are compared; RF is shown to be the most promising of the three off the shelf, although PPR can give more accurate results with extra pre-processing work, namely on example selection and domain values definition.
Abstract: Long-term travel time prediction (TTP) can be an important planning tool for both freight transport and public transport companies. In both cases it is expected that the use of long-term TTP can improve the quality of the planned services by reducing the error between the actual and the planned travel times. However, for reasons that we set out in this paper, long-term TTP is rarely mentioned in the scientific literature. In this paper we discuss the relevance of this study and compare three state-of-the-art non-parametric regression methods: Projection Pursuit Regression (PPR), Support Vector Machines (SVM) and Random Forests (RF). For each of these methods we study the best combination of input parameters. We also study the impact of different methods for the pre-processing tasks (feature selection, example selection and domain values definition) on the accuracy of those algorithms. We use bus travel time data from a bus dispatch system. From an off-the-shelf point of view, our experiments show that RF is the most promising approach of the three we have tested. However, it is possible to obtain more accurate results using PPR, but with extra pre-processing work, namely on example selection and domain values definition.
TL;DR: Here, some of the standard indices used in the literature are compared using more realistic databases that include outliers or noisy dimensions, which is more like a real problem-solving approach.
Abstract: Quality indices in clustering are used not only to assess the quality of the partitions but also to determine the number of clusters in the final result. When these indices are evaluated in a case study, real data conditions or different clustering algorithms are seldom taken into account. Here, some of the standard indices used in the literature are compared using more realistic databases that include outliers or noisy dimensions, which is more like a real problem-solving approach. Besides, three different clustering methods are used in an attempt to identify different behaviours. Also, the performance of the quality index-clustering algorithm tandem is compared to random grouping, with the aim of running an additional check. The indices are ranked, and index-based conclusions are drawn for all the scenarios.
TL;DR: This paper addresses the problem of monitoring the evolution of clusters over time and proposes the MEC framework, which traces evolution through the detection and categorization of cluster transitions, such as births, deaths and merges, and enables their visualization through bipartite graphs.
Abstract: The study of evolution has become an important research issue, especially in the last decade, due to our ability to collect and store highly detailed, time-stamped data. The need to describe and understand the behavior of a given phenomenon over time led to the emergence of new frameworks and methods focused on the temporal evolution of data and models. In this paper we address the problem of monitoring the evolution of clusters over time and propose the MEC framework. MEC traces evolution through the detection and categorization of cluster transitions, such as births, deaths and merges, and enables their visualization through bipartite graphs. It includes a taxonomy of transitions, a tracking method based on the computation of conditional probabilities, and a transition detection algorithm. We use MEC with two main goals: to determine the general evolution trends and to detect abnormal behavior or rare events. To demonstrate the applicability of our framework we present real world economic and financial case studies, using datasets extracted from the Banco de Portugal Central Balance-Sheet Database and The Data Page of New York University, Leonard N. Stern School of Business. The results allow us to draw interesting conclusions about the evolution of activity sectors and European companies.
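Tracking clusters through conditional probabilities can be illustrated with a small sketch. The abstract does not give MEC's actual definitions, so everything here is an assumption: the probability is taken as the share of a cluster's members at one time point that reappear in a cluster at the next, and the function names and the `threshold` value are hypothetical.

```python
def transition_probabilities(clusters_before, clusters_after):
    """Return P(after | before) as the fraction of shared members.

    Both arguments map a cluster name to a set of object identifiers.
    """
    probs = {}
    for name_b, members_b in clusters_before.items():
        if not members_b:
            continue  # an empty cluster has no defined conditional probability
        for name_a, members_a in clusters_after.items():
            shared = len(members_b & members_a)
            probs[(name_b, name_a)] = shared / len(members_b)
    return probs


def bipartite_edges(probs, threshold=0.5):
    """Keep only transitions at or above the (assumed) threshold as graph edges."""
    return sorted((b, a, p) for (b, a), p in probs.items() if p >= threshold)
```

Under these assumptions, a cluster at the earlier time point with no outgoing edge could be read as a death, and a later cluster with several incoming edges as a merge.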
TL;DR: This work proposes a novel method for measuring the similarity of graph nodes, with a refined concept of node similarity that involves matching their neighbors and that has some additional desirable properties which, to the authors' knowledge, the existing methods lack.
Abstract: The problem of measuring similarity of graph nodes is important in a range of practical problems. There are a number of proposed measures, usually based on iterative calculation of similarity and the principle that two nodes are as similar as their neighbors are. In our work, we propose one novel method of that sort, with a refined concept of similarity of nodes that involves matching their neighbors. We prove convergence of the proposed method and show that it has some additional desirable properties that, to our knowledge, the existing methods lack. In addition, we construct a measure of similarity of whole graphs based on the similarities of nodes. We illustrate the proposed method on several specific problems and empirically compare it to other methods.
TL;DR: This work proposes the extension of existing approaches to deal with the problem of recurring concepts by reusing previously learned decision models in situations where concepts reappear and addresses the challenge of retrieving the most appropriate concept for a particular context.
Abstract: The problem of recurring concepts in data stream classification is a special case of concept drift where concepts may reappear. Although several existing methods are able to learn in the presence of concept drift, few consider contextual information when tracking recurring concepts. Nevertheless, in many real-world scenarios context information is available and can be exploited to improve existing approaches in the detection or even anticipation of recurring concepts. In this work, we propose the extension of existing approaches to deal with the problem of recurring concepts by reusing previously learned decision models in situations where concepts reappear. The different underlying concepts are identified using an existing drift detection method, based on the error-rate of the learning process. A method to associate context information and learned decision models is proposed to improve the adaptation to recurring concepts. The method also addresses the challenge of retrieving the most appropriate concept for a particular context. Finally, to deal with situations of memory scarcity, an intelligent strategy to discard models is proposed. The experiments conducted so far, using synthetic and real datasets, show promising results and make it possible to analyze the trade-off between the accuracy gains and the learned models storage cost.
TL;DR: A new graph theoretic approach for automatically identifying and evaluating subgoals and a method for providing some useful prior knowledge for corresponding policy of developed skills based on two graph centrality measures, namely node connection graph stability and co-betweenness centrality are proposed.
Abstract: Mechanisms on automatic discovery of macro actions or skills in reinforcement learning methods are mainly focused on subgoal discovery methods. Among the proposed algorithms, those based on graph centrality measures demonstrate a high performance gain. In this paper, we propose a new graph theoretic approach for automatically identifying and evaluating subgoals. Moreover, we propose a method for providing some useful prior knowledge for corresponding policy of developed skills based on two graph centrality measures, namely node connection graph stability and co-betweenness centrality. Investigating some benchmark problems, we show that the proposed approach improves the learning performance of the agent significantly.
TL;DR: An incremental update mechanism is proposed that avoids recalculating DFT coefficients when new readings arrive and thus minimizes processing time, which makes the proposed clustering technique suitable for sensor network environments where computing and power capabilities are limited.
Abstract: Data streams and their applications appear in several fields such as physics, finance, medicine, environmental science, etc. As sensor technology improves, sensor data rates continue to increase. Consequently, analyzing data streams becomes ever more challenging. Fast online response is a must for applications that involve multiple data streams, especially when the number of data streams is large. This paper proposes an efficient clustering technique called the Multi-way Grid-based join algorithm (MG-join) to find clusters in multiple data streams. The proposed algorithm uses a Discrete Fourier Transformation (DFT) to reduce the dimensionality of the streams. Each stream is represented by a point in a multi-dimensional grid in the frequency domain. The MG-join algorithm finds the different clusters in multiple data streams in the frequency domain. Moreover, this paper proposes an incremental update mechanism to avoid the recalculation of DFT coefficients when new readings arrive and thus minimizes the processing time. Experiments on synthetic data streams show that the proposed clustering technique is much faster than traditional clustering techniques and yet its accuracy is as good as that of the traditional clustering techniques. This makes the proposed technique suitable for sensor network environments where computing and power capabilities are limited.
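An incremental DFT update over a sliding window can be sketched as follows. The abstract does not state the paper's own update rule, so this uses the standard sliding-DFT identity as a stand-in: for a window of length N, dropping the oldest reading and appending a new one gives X'_k = (X_k - x_old + x_new) * exp(2*pi*i*k/N). The function names are hypothetical.

```python
import cmath


def dft_coefficients(window, num_coeffs):
    """Directly compute the first num_coeffs DFT coefficients of a window."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window))
        for k in range(num_coeffs)
    ]


def slide_dft(coeffs, oldest, newest, window_len):
    """Update the coefficients after the window slides by one reading.

    Cost is proportional to the number of coefficients kept,
    rather than to the window length times that number.
    """
    return [
        (c - oldest + newest) * cmath.exp(2j * cmath.pi * k / window_len)
        for k, c in enumerate(coeffs)
    ]
```

Keeping only the first few coefficients per stream is what yields the low-dimensional point in the frequency domain; how many to keep is a tuning choice that the abstract does not specify.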
TL;DR: This work proposes an effective constraint-clustering approach handling a large set of constraints which are described by a generic constraint-based language and shows how each constraint is encoded in SAT and solved by taking benefit from several features of SAT solvers.
Abstract: Constrained clustering - finding clusters that satisfy user-specified constraints - aims at providing more relevant clusters by adding constraints enforcing required properties. Leveraging the recent progress in declarative and constraint-based pattern mining, we propose an effective constraint-clustering approach handling a large set of constraints which are described by a generic constraint-based language. Starting from an initial solution, queries can easily be refined in order to focus on more interesting clustering solutions. We show how each constraint (and query) is encoded in SAT and solved by taking benefit from several features of SAT solvers. Experiments performed using MiniSat on several datasets from the UCI repository show the feasibility and the advantages of our approach.
TL;DR: The resulting system is a hybrid customer prediction system (HCPS) based on a Multiple Forward Stepwise Logistic Regression (MFSLR) model, which leads to more accurate prediction than the other methods tried, namely feedforward neural networks, radial basis networks and regression trees.
Abstract: In today's world, customer purchasing behavior prediction is one of the most important aspects of customer attraction. Good prediction can help to develop marketing strategies more accurately and to spend resources more effectively. When designing a customer prediction system (CPS), two issues are key, namely, feature selection and the prediction method to be used. Furthermore, it seems necessary to design CPSs with both high computational speed and good prediction abilities. The purpose of this paper is to develop such a system by using a hybrid approach. The resulting system is a hybrid CPS (HCPS) and is based on a Multiple Forward Stepwise Logistic Regression (MFSLR) model. The MFSLR model combines a forward stepwise regression (FSR) technique, which rapidly selects an optimal subset of features, with a multiple logistic regression (MLR) technique. In practice, the new MFSLR model provides very good prediction results. Since customer identification is one of the principal concerns in the insurance industry, an insurance company dataset has been used. The obtained results show that the FSR selects around 55% of the initially available features, in this way considerably reducing computational costs. In addition, the results show that the MLR method leads to more accurate prediction than some other methods we tried, namely, feedforward neural networks, radial basis networks and regression trees.
TL;DR: A framework is proposed for selecting the discriminatory features from protein sequences prior to classification by integrating the filter and wrapper approaches; it achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.
Abstract: Pre-processing plays a vital role in classification tasks, particularly when complex features are involved, and this demands a highly intelligent method. In bioinformatics, where datasets are categorised as having complex features, the need for pre-processing is unavoidable. In this paper, we propose a framework for selecting the discriminatory features from protein sequences prior to classification by integrating the filter and wrapper approaches. Several state-of-the-art multivariate filters were explored in the first phase to remove the unwanted features that contributed to noise, while particle swarm optimisation (PSO) with a support vector machine (SVM) was adopted in the wrapper phase to produce the best feature subset. Several PSO variants were investigated in the wrapper phase to find the most suitable variant for the problem domain. The results of both phases were analysed based on classification accuracy, number of selected features, modelling time and area under the curve on the main dataset and on five benchmark machine learning datasets of similar complexity. The proposed framework achieved higher classification accuracy than both the filter phase alone and the full feature set, despite using fewer features.
TL;DR: A new fast heuristic for building decision trees from large training sets is presented, which overcomes some of the restrictions of the state of the art algorithms, using all the instances of the training set without storing all of them in main memory.
Abstract: Decision trees are commonly used in supervised classification. Currently, supervised classification problems with large training sets are very common; however, many supervised classifiers cannot handle this amount of data. There are some decision tree induction algorithms that are capable of processing large training sets; however, almost all of them have memory restrictions because they need to keep the whole training set, or a large part of it, in main memory. Moreover, algorithms that do not have memory restrictions have to choose a subset of the training set, needing extra time for this selection, or they require the user to specify values for parameters that can be very difficult to determine. In this paper, we present a new fast heuristic for building decision trees from large training sets, which overcomes some of the restrictions of the state-of-the-art algorithms, using all the instances of the training set without storing all of them in main memory. Experimental results show that our algorithm is faster than the most recent algorithms for building decision trees from large training sets.
TL;DR: A methodology to extract the main words in a static web site is proposed; one of the key elements in this methodology is to determine which pages in a web site attract the users' attention when they are browsing the site.
Abstract: The construction of a web site is a great challenge that integrates different elements such as the hyperlink structure, colors, pictures, movies and textual content. Of these, the right textual content can be the key to attracting users to visit the site. In fact, many users reach a web site through a web search engine such as Google or Yahoo!, and continue exploring the site if it contains the information that they are looking for. In this paper, a methodology to extract the main words in a static web site is proposed. One of the key elements in this methodology is to determine which pages in a web site attract the users' attention when they are browsing the site. The main words are called web site keywords, and by using them in the site's textual content, significant improvements, from the point of view of the user, can be achieved. Web users' browsing behaviour can be classified into two categories: amateur and experienced. An amateur is a user with little or no experience in using web-based systems. Their browsing behaviour is normally erratic and it can take them a considerable amount of time to find what they are looking for. An experienced user has more experience with web-based systems; their behaviour is more controlled and purpose driven, and thus it takes them less time to determine whether the site contains worthwhile information. What is important regarding experienced web users is that there is a correlation between the amount of time spent on a web page during a session and the extent to which they are interested in the page content. Using this characteristic, a feature vector is created from the time spent on each page during a user's session. These vectors are the input for two clustering algorithms, SOFM and K-means, which enable the extraction of significant patterns about users with similar or identical browsing behaviour and content preferences. These patterns then form the basis for identifying the web site keywords. In order to validate the proposed methodology, web data originating from a complex static web site belonging to a Chilean bank was used. From the clusters found, a set of web site keywords was identified and their utility was tested on a group of real users, thus illustrating the effectiveness of the proposed methodology.
TL;DR: This work proposes to enhance multi-label classifiers with features constructed from local patterns that explicitly represent such interdependencies, and experimentally shows that using such constructed features can improve the classification performance of decompositive multi-label learning techniques.
Abstract: The straightforward approach to multi-label classification is based on decomposition, which essentially treats all labels independently and ignores interactions between labels. We propose to enhance multi-label classifiers with features constructed from local patterns representing explicitly such interdependencies. An Exceptional Model Mining instance is employed to find local patterns representing parts of the data where the conditional dependence relations between the labels are exceptional. We construct binary features from these patterns that can be interpreted as partial solutions to local complexities in the data. These features are then used as input for multi-label classifiers. We experimentally show that using such constructed features can improve the classification performance of decompositive multi-label learning techniques.
TL;DR: A critical study of concise representations of frequent patterns with respect to several aspects and comparative criteria which proves the importance of considering closed sets and minimal generators.
Abstract: Recent years have witnessed explosive progress in networking, storage, and processing technologies, resulting in an unprecedented amount of digitalization of data. Hence, there has been a considerable need for tools or techniques to delve into and efficiently discover valuable, non-obvious information from large databases. In this situation, data mining is an important research field which offers efficient solutions for such an extraction. Much research in data mining from large databases has focused on the discovery of frequent patterns, which are then used to identify relationships between sets of items in a database, through, for example, association rule derivation. In practice, however, the number of frequently occurring patterns is very large, hampering their effective exploitation by the end-users. In this situation, many works have been interested in defining manageably-sized sets of patterns, called concise representations, from which redundant patterns can be regenerated. In this paper, we concentrate on exact concise representations of frequent patterns. Thus, we describe their close relation with important concepts like the framework of e-adequate representation and the minimum description length principle. Based on the mathematical settings of Formal Concept Analysis, we also show the complementarity between minimal generators and closed itemsets. Then, we focus on the key role played by these patterns for solving several problems associated with various pattern classes. In this respect, we classify concise representations of frequent itemsets according to their common characteristics. Then, we analyze a representative of each class and show its close link with minimal generators. Finally, we carry out a critical study of concise representations with respect to several aspects and comparative criteria, which proves the importance of considering closed sets and minimal generators.
TL;DR: CAR-NF introduces a new strategy for computing CARs, using Netconf as the measure of interest, which allows the CAR search space to be pruned so as to build specific rules with high Netconf.
Abstract: In this paper, an accurate classifier based on Class Association Rules (CARs), called CAR-NF, is proposed. CAR-NF introduces a new strategy for computing CARs, using Netconf as the measure of interest, which allows the CAR search space to be pruned so as to build specific rules with high Netconf. Moreover, we propose and prove a proposition that supports the use of a Netconf threshold value equal to 0.5 for mining the CARs. Additionally, a new way of ordering the set of CARs based on their rule sizes and Netconf values is introduced in CAR-NF. The ordering strategy, together with the "Best K rules" satisfaction mechanism, allows CAR-NF to have better accuracy than the CBA, CMAR, CPAR, TFPC and HARMONY classifiers, the best classifiers based on CARs reported in the literature.
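The abstract names the Netconf measure without defining it. The sketch below uses the definition commonly cited in the association-rule literature, netconf(X -> Y) = (supp(X and Y) - supp(X) * supp(Y)) / (supp(X) * (1 - supp(X))), which ranges over [-1, 1]; treating this as the exact form used by CAR-NF is an assumption, and the function name is hypothetical.

```python
def netconf(supp_x, supp_y, supp_xy):
    """Netconf of a rule X -> Y from relative supports in [0, 1].

    supp_x  : fraction of transactions containing the antecedent X
    supp_y  : fraction containing the consequent Y (here, the class)
    supp_xy : fraction containing both X and Y
    """
    if not 0.0 < supp_x < 1.0:
        # The denominator supp_x * (1 - supp_x) would be zero.
        raise ValueError("supp_x must lie strictly between 0 and 1")
    return (supp_xy - supp_x * supp_y) / (supp_x * (1.0 - supp_x))
```

With this definition a value of 0 means X and Y are statistically independent, positive values indicate positive dependence and negative values indicate negative dependence, so a threshold filters out rules whose antecedent is only weakly tied to the class.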
TL;DR: In this paper, the problem of modeling prior information of a data miner about the data, with the purpose of quantifying subjective interestingness of patterns, has been addressed, using information theory.
Abstract: In this paper, we are concerned with the problem of modelling prior information of a data miner about the data, with the purpose of quantifying subjective interestingness of patterns. Recent results have achieved this for the specific case of prior expectations on the row and column marginals, based on the Maximum Entropy principle [2,9]. In the current paper, we extend these ideas to make them applicable to more general prior information, such as knowledge of frequencies of itemsets, a cluster structure in the data, or the presence of dense areas in the database. As in [2,9], we show how information theory can be used to quantify subjective interestingness against this model, in particular the subjective interestingness of tile patterns [3]. Our method presents an efficient, flexible, and rigorous alternative to the randomization approach presented in [5]. We demonstrate our method by searching for interesting patterns in real-life data with respect to various realistic types of prior information.
TL;DR: Results from the computational experiments indicate that MCRSU algorithm is more effective than MSMU in minimizing the non-sensitive itemsets affected as well as maintaining data quality in the sanitized database.
Abstract: Privacy preserving data mining is a vibrant area in data mining. The sharing of data between the organizations is found to be beneficial for business growth. However, privacy policies and threats prevent the data owners from sharing the data for mining. The current data sanitization approaches focus on hiding either frequent itemsets or utility itemsets separately. This paper proposes to study the problem of hiding the sensitive utility and frequent itemsets. To resolve this problem, two effective data sanitization algorithms MSMU and MCRSU are presented to hide the sensitive utility and frequent itemsets in the modified database. While hiding the sensitive itemsets, the algorithms sanitize the database with minimum impact on the non-sensitive itemsets. To accomplish this, MSMU is devised to identify the victim items with minimum support and maximum utility whereas MCRSU uses conflict ratio. Results from the computational experiments on the synthetic and real datasets indicate that MCRSU algorithm is more effective than MSMU in minimizing the non-sensitive itemsets affected as well as maintaining data quality in the sanitized database.
TL;DR: A new thresholding strategy, called MCut, is proposed; it automatically estimates a value for the threshold, needs neither training nor parametrization, and is easy to implement.
Abstract: Multi-label classification is a frequent task in machine learning, notably in text categorization. When binary classifiers are not suited, an alternative consists in using a multiclass classifier that provides a score per category for each document, and then applying a thresholding strategy to select the set of categories which must be assigned to the document. The common thresholding strategies, such as the RCut, PCut and SCut methods, need a training step to determine the value of the threshold. To overcome this limitation, we propose a new strategy, called MCut, which automatically estimates a value for the threshold. This method does not have to be trained and does not need any parametrization. Experiments performed on two textual corpora, the XML Mining 2009 and RCV1 collections, show that the MCut strategy's results are on par with the state of the art, while MCut is easy to implement and parameter free.
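A training-free threshold of this kind can be sketched as follows. The abstract does not say how MCut estimates the threshold; the rule assumed here is to sort a document's category scores, find the largest gap between consecutive scores, and cut at the middle of that gap. Both function names are hypothetical.

```python
def mcut_threshold(scores):
    """Return the midpoint of the largest gap between consecutive sorted scores."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        raise ValueError("at least two scores are needed to find a gap")
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return (ordered[widest] + ordered[widest + 1]) / 2


def select_labels(scores_by_label):
    """Assign every category whose score lies above the estimated threshold."""
    threshold = mcut_threshold(scores_by_label.values())
    return {label for label, score in scores_by_label.items() if score > threshold}
```

Because the threshold is computed per document from that document's scores alone, nothing has to be fitted on a training set, which matches the parameter-free property claimed in the abstract.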
TL;DR: This method improves CBR in two ways: first, the way the knowledge stored in the case-base is used is universal and does not depend on the problem itself; second, the method stores probabilistic descriptions of previous solutions, which makes the stored knowledge more flexible.
Abstract: Case-Based Reasoning (CBR) solves problems by using past problem-solving experiences. How to apply these experiences depends on the type of the problem. The method presented in this paper tries to overcome this difficulty in CBR for optimization problems, using the Bayesian Optimization Algorithm (BOA). BOA evolves a population of candidate solutions by constructing Bayesian networks and sampling them. After solving problems with BOA, Bayesian networks describing the features of the solutions are obtained. In our method, these Bayesian networks are stored in a case-base. To solve a new problem, the Bayesian networks of problems similar to the new one are retrieved and combined. This compound Bayesian network is used for generating the initial population and constructing the probabilistic models of BOA when solving the new problem. Our method improves CBR in two ways: first, the way the knowledge stored in the case-base is used is universal and does not depend on the problem itself; second, the method stores probabilistic descriptions of previous solutions, which makes the stored knowledge more flexible. Experimental results showed that, in addition to these advantages, our method improved solution quality.
TL;DR: The main characteristics of some of the most important sequential pattern mining algorithms are presented and a comparative performance study among these algorithms is shown.
Abstract: From the beginning of sequential pattern mining to the present, this field has received important attention within the data mining area, because it has a wide application in several significant computational problems. Many algorithms have been created and several techniques have been used with the objective of improving the discovery of the frequent sequence set. In this paper we present the main characteristics of some of the most important sequential pattern mining algorithms. Also, we show a comparative performance study among these algorithms.
TL;DR: This work proposes a unifying approach, named GeT_Move, using a frequent closed itemset-based spatio-temporal pattern-mining algorithm to mine and manage different spatio-temporal patterns.
Abstract: Recent improvements in positioning technology have led to massive amounts of moving object data. A crucial task is to find the moving objects that travel together. Usually, they are called spatio-temporal patterns. Due to the emergence of many different kinds of spatio-temporal patterns in recent years, different approaches have been proposed to extract them. However, each approach only focuses on mining a specific kind of pattern. Mining and managing patterns is therefore painstaking, because of the large number of algorithms involved, and also time consuming. Additionally, we have to execute these algorithms again whenever new data are added to the existing database. To address these issues, we first redefine spatio-temporal patterns in the itemset context. Secondly, we propose a unifying approach, named GeT_Move, using a frequent closed itemset-based spatio-temporal pattern-mining algorithm to mine and manage different spatio-temporal patterns. GeT_Move is implemented in two versions, GeT_Move and Incremental GeT_Move. Experiments are performed on real and synthetic datasets and the results show that our approaches are very effective and outperform existing algorithms in terms of efficiency.
TL;DR: The results confirmed that combining SPIDER with an ensemble improved performance in terms of the G-mean measure, in comparison to a single classifier with SPIDER, for all tested types of classifiers and both SPIDER pre-processing options (weak and strong amplification).
Abstract: In this paper we present IIvotes, a new framework for constructing an ensemble of classifiers from imbalanced data. IIvotes incorporates the SPIDER method for selective data pre-processing into the adaptive Ivotes ensemble. This integration aims to improve the balance between sensitivity and specificity, as evaluated by the G-mean measure for the minority class, in comparison with single classifiers also combined with SPIDER. Using SPIDER to pre-process specific learning samples inside the ensemble improves the sensitivity of the derived component classifiers. At the same time, the controlling mechanism of IIvotes ensures that overall accuracy, and thus specificity, is kept at a reasonable level. The newly proposed IIvotes ensemble was thoroughly evaluated in a series of experiments where we tested it with symbolic (decision trees and rules) and non-symbolic (Naive Bayes) component classifiers. The results confirmed that combining SPIDER with an ensemble improved performance in terms of the G-mean measure, in comparison to a single classifier with SPIDER, for all tested types of classifiers and both SPIDER pre-processing options (weak and strong amplification). These advantages were especially evident for decision trees and rules, where differences between single and ensemble classifiers with SPIDER were more significant for both pre-processing options than for Naive Bayes. Moreover, the results demonstrated the advantages of using a special abstaining classification strategy inside IIvotes rule ensembles, where component rule-based classifiers may refrain from predicting a class when in doubt. Abstaining rule ensembles performed much better with regard to G-mean than their non-abstaining variants.
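The G-mean measure referred to above is the geometric mean of sensitivity and specificity. The following is a minimal sketch of that standard definition computed from confusion-matrix counts; the function name `g_mean` and its argument names are illustrative choices and do not come from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity.

    tp, fn: minority-class examples predicted correctly / incorrectly.
    tn, fp: majority-class examples predicted correctly / incorrectly.
    """
    # Sensitivity: fraction of minority-class examples that were recognized.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of majority-class examples that were recognized.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# A classifier that always predicts the majority class has high accuracy on
# imbalanced data but a G-mean of zero, because its sensitivity is zero.
always_majority = g_mean(tp=0, fn=10, tn=100, fp=0)
balanced = g_mean(tp=8, fn=2, tn=90, fp=10)
```

Because the two rates are multiplied, a classifier cannot score well by sacrificing the minority class, which is why the measure suits imbalanced data.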
TL;DR: Over-fitting in model selection is empirically demonstrated to pose a substantial pitfall in the application of kernel learning methods and Gaussian process classifiers and evaluation of machine learning methods can easily be significantly biased unless the evaluation protocol properly accounts for this type of over-fitting.
Abstract: Over-fitting is a ubiquitous problem in machine learning, and a variety of techniques to avoid over-fitting the training sample have proven highly effective, including early stopping, regularization, and ensemble methods. However, while over-fitting in training is widely appreciated and its avoidance now a standard element of best practice, over-fitting can also occur in model selection. This form of over-fitting can significantly degrade generalization performance, but has thus far received little attention. For example the kernel and regularization parameters of a support vector machine are often tuned by optimizing a cross-validation based model selection criterion. However the cross-validation estimate of generalization performance will inevitably have a finite variance, such that its minimizer depends on the particular sample on which it is evaluated, and this will generally differ from the minimizer of the true generalization error. Therefore if the cross-validation error is aggressively minimized, generalization performance may be substantially degraded. In general, the smaller the amount of data available, the higher the variance of the model selection criterion, and hence the more likely over-fitting in model selection will be a significant problem. Similarly, the more hyper-parameters to be tuned in model selection, the more easily the variance of the model selection criterion can be exploited, which again increases the likelihood of over-fitting in model selection.
Over-fitting in model selection is empirically demonstrated to pose a substantial pitfall in the application of kernel learning methods and Gaussian process classifiers. Furthermore, evaluation of machine learning methods can easily be significantly biased unless the evaluation protocol properly accounts for this type of over-fitting. Fortunately the common solutions to avoiding over-fitting in training also appear to be effective in avoiding over-fitting in model selection. Three examples are presented based on regularization of the model selection criterion, early stopping in model selection and minimizing the number of hyper-parameters to be tuned during model selection.
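The variance argument above can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not taken from the paper: every hyper-parameter candidate has the same true error, and each cross-validation estimate is that true error plus independent Gaussian noise. The function name and all parameter values are hypothetical.

```python
import random


def mean_selected_estimate(n_candidates, true_error=0.30, noise_sd=0.05,
                           n_trials=200, seed=0):
    """Average cross-validation estimate of the candidate chosen by minimization.

    Assumption (illustrative only): all candidates share the same true error,
    so any apparent gain from selection is pure over-fitting of the criterion.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # One noisy estimate per hyper-parameter candidate.
        estimates = [rng.gauss(true_error, noise_sd) for _ in range(n_candidates)]
        # Model selection picks the candidate with the lowest estimate.
        total += min(estimates)
    return total / n_trials


few = mean_selected_estimate(n_candidates=1)
many = mean_selected_estimate(n_candidates=50)
```

Under these assumptions the selected estimate falls further below the true error as the number of candidates grows, even though no candidate is actually better. Reporting the minimized criterion as a performance estimate is therefore optimistically biased.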
TL;DR: An evaluation measure and an explicit discriminative dimensionality reduction mapping using the Fisher information are proposed based on recent work for the unsupervised case.
Abstract: Discriminative dimensionality reduction aims at a low dimensional, usually nonlinear representation of given data such that information as specified by auxiliary discriminative labeling is presented as accurately as possible. This paper centers around two open problems connected to this question: (i) how to evaluate discriminative dimensionality reduction quantitatively? (ii) how to arrive at explicit nonlinear discriminative dimensionality reduction mappings? Based on recent work for the unsupervised case, we propose an evaluation measure and an explicit discriminative dimensionality reduction mapping using the Fisher information.
TL;DR: This paper shows that MW-PAM, particularly when initialized with the Build algorithm (also using the Minkowski metric), is superior to other medoid-based algorithms in terms of both accuracy and identification of irrelevant features.
Abstract: In this paper we introduce the Minkowski weighted partition around medoids algorithm (MW-PAM). This extends the popular partition around medoids algorithm (PAM) by automatically assigning K weights to each feature in a dataset, where K is the number of clusters. Our approach utilizes the within-cluster variance of features to calculate the weights and uses the Minkowski metric.
We show through many experiments that MW-PAM, particularly when initialized with the Build algorithm (also using the Minkowski metric), is superior to other medoid-based algorithms in terms of both accuracy and identification of irrelevant features.
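The abstract does not state the weight formula, so the following is only a sketch. It assumes the weights follow the form used in Minkowski weighted k-means: for exponent p > 1 and cluster k, the dispersion of feature v is D_kv = sum over cluster members of |x_v - m_kv|^p, and w_kv is proportional to D_kv^(-1/(p-1)), normalized to sum to one. The function names and the small constant `eps` are illustrative and not from the paper.

```python
def feature_weights(cluster, medoid, p=2.0, eps=1e-9):
    """Per-cluster feature weights from Minkowski dispersions (requires p > 1).

    Assumed formula: w_v proportional to D_v ** (-1 / (p - 1)), normalized so
    the weights of one cluster sum to 1.
    """
    n_features = len(medoid)
    # D_v: within-cluster dispersion of feature v around the medoid.
    # eps avoids division by zero for features that are constant in the cluster.
    dispersion = [
        sum(abs(x[v] - medoid[v]) ** p for x in cluster) + eps
        for v in range(n_features)
    ]
    inverse = [d ** (-1.0 / (p - 1.0)) for d in dispersion]
    total = sum(inverse)
    return [a / total for a in inverse]


def weighted_minkowski(x, medoid, weights, p=2.0):
    """Weighted Minkowski dissimilarity (p-th power, no root taken)."""
    return sum((w ** p) * abs(a - b) ** p for a, b, w in zip(x, medoid, weights))


# Feature 0 is tight around the medoid, feature 1 is widely scattered,
# so feature 0 receives almost all of the weight.
cluster = [(1.0, 0.0), (1.1, 5.0), (0.9, -5.0)]
medoid = (1.0, 0.0)
weights = feature_weights(cluster, medoid)
```

Features with low within-cluster dispersion receive high weight, so a feature that scatters widely inside every cluster contributes little to the dissimilarity, which matches the abstract's claim about identifying irrelevant features.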
TL;DR: This paper proposes a novel methodology to produce online predictions regarding the spatial distribution of passenger demand throughout taxi stand networks by assembling two well-known time series short-term forecast models: the time-varying Poisson models and ARIMA models.
Abstract: In recent years, both companies and researchers have been exploring intelligent data analysis to increase the profitability of the taxi industry. Intelligent systems for online taxi dispatching and time-saving route finding have been built to do so. In this paper, we propose a novel methodology to produce online predictions regarding the spatial distribution of passenger demand throughout taxi stand networks. We have done so by assembling two well-known time series short-term forecast models: the time-varying Poisson models and ARIMA models. Our tests were performed using data gathered over a period of 6 months and collected from 63 taxi stands within the city of Porto, Portugal. Our results demonstrate that this model is a major contribution to driver mobility intelligence: 78% of the 253745 demanded taxi services were correctly forecasted within a 30-minute horizon.
TL;DR: This work presents a framework for characterizing spike (and spike-train) synchrony in parallel neuronal spike trains that is based on identifying spikes with what the authors call influence maps: real-valued functions describing an influence region around the corresponding spike times within which possibly graded synchrony with other spikes is defined.
Abstract: We present a framework for characterizing spike (and spike-train) synchrony in parallel neuronal spike trains that is based on identifying spikes with what we call influence maps: real-valued functions describing an influence region around the corresponding spike times within which possibly graded synchrony with other spikes is defined. We formalize two models of synchrony in this framework: the bin-based model (the almost exclusively applied model in the literature) and a novel, alternative model based on a continuous, graded notion of synchrony, aimed at overcoming the drawbacks of the bin-based model. We study the task of identifying frequent (and synchronous) neuronal patterns from parallel spike trains in our framework, formalized as an instance of what we call the fuzzy frequent pattern mining problem (a generalization of standard frequent pattern mining) and briefly evaluate our synchrony models on this task.
TL;DR: It is proven that EBRPSO outperforms the existing discretizers in terms of classification accuracy as well as reduction of the decision rules.
Abstract: Conventional cut selection in Boolean reasoning (BR) based discretization often produces under-optimistic prime cuts. This is due to the linearity of traditional heuristics in tackling high-dimensional space problems. We propose a flexible yet compact and holistic solution by incorporating Particle Swarm Optimization (PSO) into the existing framework. The first challenge is to downsize the search space such that the probability of finding the global optimum is increased. The second task is to reconstruct the present fitness function so as to improve the classification performance of the induction algorithm, which in this case is C4.5. By injecting a filtration phase prior to the cut selection and introducing a tertiary term to the fitness function, the proposed extended BR with PSO (EBRPSO) discretizer is developed. Based on the evaluation using four real-world datasets, namely Heart, Breast, Iris and Wine, it is proven that EBRPSO outperforms the existing discretizers in terms of classification accuracy as well as reduction of the decision rules.