TL;DR: This paper introduces FastDTW, an approximation of DTW that has a linear time and space complexity and shows a large improvement in accuracy over existing methods.
Abstract: Dynamic Time Warping (DTW) has a quadratic time and space complexity that limits its use to small time series. In this paper we introduce FastDTW, an approximation of DTW that has a linear time and space complexity. FastDTW uses a multilevel approach that recursively projects a solution from a coarser resolution and refines the projected solution. We prove the linear time and space complexity of FastDTW both theoretically and empirically. We also analyze the accuracy of FastDTW by comparing it to two other types of existing approximate DTW algorithms: constraints (such as Sakoe-Chiba Bands) and abstraction. Our results show a large improvement in accuracy over existing methods.
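For orientation, the quadratic-cost baseline that the abstract refers to can be sketched as follows. This is the classic dynamic-programming DTW, not FastDTW itself; the function name `dtw_distance`, the absolute-difference point cost, and the optional `radius` band (in the spirit of a Sakoe-Chiba constraint) are illustrative assumptions rather than details taken from the paper.

```python
import math


def dtw_distance(x, y, radius=None):
    """Classic DTW cost between two numeric sequences.

    Fills an (n+1) x (m+1) table, hence quadratic time and space.
    If `radius` is given, cells with |i - j| > radius are skipped,
    which mimics a fixed-width band constraint (hypothetical parameter).
    """
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if radius is not None and abs(i - j) > radius:
                continue  # outside the band: leave the cell unreachable
            d = abs(x[i - 1] - y[j - 1])  # assumed point-wise cost
            cost[i][j] = d + min(
                cost[i - 1][j],      # step in x only
                cost[i][j - 1],      # step in y only
                cost[i - 1][j - 1],  # step in both
            )
    return cost[n][m]
```

With a band that is too narrow for sequences of different lengths, the final cell is unreachable and the function returns infinity, which illustrates why hard constraints can hurt accuracy.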
TL;DR: This paper provides an overview of the different representative clustering methods, several clustering validation indices, and approaches to automatically determine the number of clusters.
Abstract: Data clustering is the process of identifying natural groupings or clusters within multidimensional data based on some similarity measure. Clustering is a fundamental process in many different disciplines. Hence, researchers from different fields are actively working on the clustering problem. This paper provides an overview of the different representative clustering methods. In addition, several clustering validation indices are shown. Furthermore, approaches to automatically determine the number of clusters are presented. Finally, the application of different heuristic approaches to the clustering problem is also investigated.
TL;DR: A new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost, is introduced, and it is proved that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm.
Abstract: The Denclue algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points are assigned to clusters by hill climbing, i.e. points going to the same local maximum are put into the same cluster. A disadvantage of Denclue 1.0 is that the hill climbing used may make unnecessarily small steps in the beginning and never converges exactly to the maximum; it just comes close.
We introduce a new hill climbing procedure for Gaussian kernels, which adjusts the step size automatically at no extra cost. We prove that the procedure converges exactly towards a local maximum by reducing it to a special case of the expectation maximization algorithm. We show experimentally that the new procedure needs far fewer iterations and can be accelerated by sampling-based methods while sacrificing only a small amount of accuracy.
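A minimal sketch of hill climbing on a Gaussian kernel density estimate without an explicit step size is given below. It uses the well-known fixed-point update in which the next position is the kernel-weighted mean of the data; treating this as representative of the procedure described in the abstract is an assumption, and the names `gaussian_hill_climb`, `bandwidth`, `tol`, and `max_iter` are hypothetical.

```python
import math


def gaussian_hill_climb(start, points, bandwidth=1.0, tol=1e-9, max_iter=1000):
    """Climb towards a local maximum of a Gaussian kernel density estimate.

    Each iteration replaces x by the kernel-weighted mean of the data
    points, so no step-size parameter is needed (assumed update rule).
    """
    x = list(start)
    for _ in range(max_iter):
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * bandwidth ** 2))
            for p in points
        ]
        total = sum(weights)
        if total == 0.0:
            break  # x is too far from every point for the kernel to register
        new_x = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(len(x))
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break  # position has stopped moving
    return x
```

Points whose climbs end at (numerically) the same position would then be placed in the same cluster, as the abstract describes.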
TL;DR: This paper presents graph-based approaches to uncovering anomalies in domains where the anomalies consist of unexpected entity/relationship alterations that closely resemble non-anomalous behavior, and validate all three approaches using synthetic data.
Abstract: An important area of data mining is anomaly detection, particularly for fraud. However, little work has been done in terms of detecting anomalies in data that is represented as a graph. In this paper we present graph-based approaches to uncovering anomalies in domains where the anomalies consist of unexpected entity/relationship alterations that closely resemble non-anomalous behavior. We have developed three algorithms for the purpose of detecting anomalies in all three types of possible graph changes: label modifications, vertex/edge insertions and vertex/edge deletions. Each of our algorithms focuses on one of these anomalous types, using the minimum description length principle to first discover the normative pattern. Once the common pattern is known, each algorithm then uses a different approach to discover particular anomalous types. In this paper, we validate all three approaches using synthetic data, verifying that each of the algorithms, on graphs and anomalies of varying sizes, is able to detect the anomalies with very high detection rates and minimal false positives. We then further validate the algorithms using real-world cargo data and actual fraud scenarios injected into the data set, with 100% accuracy and no false positives. Each of these algorithms demonstrates the usefulness of examining a graph-based representation of data for the purposes of detecting fraud.
TL;DR: Var-Part is introduced, which is computationally less complex and approximates PCA partitioning assuming a diagonal covariance matrix; it leads K-means to yield sum-squared-error values close to the optimum values obtained by several random-start runs, often at faster convergence rates.
Abstract: The performance of K-means and Gaussian mixture model (GMM) clustering depends on the initial guess of partitions. Typically, clustering algorithms are initialized by random starts. In our search for a deterministic method, we found two promising approaches: principal component analysis (PCA) partitioning and Var-Part (Variance Partitioning). K-means clustering tries to minimize the sum-squared-error criterion. The eigenvector with the largest eigenvalue is the component which contributes most to the sum-squared-error. Hence, a good candidate direction to project a cluster for splitting is the direction of the cluster's largest eigenvector, the basis for PCA partitioning. Similarly, GMM clustering maximizes the likelihood; minimizing the determinant of the covariance matrices of each cluster helps to increase the likelihood. The largest eigenvector contributes most to the determinant and is thus a good candidate direction for splitting. However, PCA is computationally expensive. We thus introduce Var-Part, which is computationally less complex (with complexity equal to one K-means iteration) and approximates PCA partitioning assuming a diagonal covariance matrix. Experiments reveal that Var-Part has similar performance to PCA partitioning, sometimes better, and leads K-means (and GMM) to yield sum-squared-error (and maximum-likelihood) values close to the optimum values obtained by several random-start runs, often at faster convergence rates.
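One way to read the diagonal-covariance approximation is sketched below: instead of computing a principal eigenvector, split a cluster along the coordinate axis with the largest variance. The choice of which cluster to split (largest sum-squared-error), the split point (the mean along that axis), and the names `sse` and `var_part` are assumptions made for illustration; the abstract does not spell them out.

```python
def sse(cluster):
    """Sum of squared distances of the points to their cluster mean."""
    dims = len(cluster[0])
    means = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
    return sum((p[d] - means[d]) ** 2 for p in cluster for d in range(dims))


def var_part(points, k):
    """Deterministically partition `points` into up to k initial clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumed rule: split the cluster that contributes most to the SSE.
        idx = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters[idx]
        dims = len(target[0])
        means = [sum(p[d] for p in target) / len(target) for d in range(dims)]
        spread = [sum((p[d] - means[d]) ** 2 for p in target) for d in range(dims)]
        # Axis of largest variance stands in for the principal eigenvector.
        axis = max(range(dims), key=lambda d: spread[d])
        left = [p for p in target if p[axis] <= means[axis]]
        right = [p for p in target if p[axis] > means[axis]]
        if not right:
            break  # all remaining points coincide; nothing left to split
        clusters[idx:idx + 1] = [left, right]
    return clusters
```

The means of the returned clusters could then serve as initial centers for K-means in place of random starts.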
TL;DR: In this paper, a boosting-like method is proposed to train a classifier ensemble from data streams that naturally adapts to concept drift by continuously re-weighting the ensemble members based on their performance on the most recent examples.
Abstract: In many real-world classification tasks, data arrives over time and the target concept to be learned from the data stream may change over time. Boosting methods are well-suited for learning from data streams, but do not address this concept drift problem. This paper proposes a boosting-like method to train a classifier ensemble from data streams that naturally adapts to concept drift. Moreover, it allows the drift to be quantified in terms of its base learners. As in regular boosting, examples are re-weighted to induce a diverse ensemble of base models. In order to handle drift, the proposed method continuously re-weights the ensemble members based on their performance on the most recent examples only. The proposed strategy adapts quickly to different kinds of concept drift. The algorithm is empirically shown to outperform learning algorithms that ignore concept drift. It performs no worse than advanced adaptive time window and example selection strategies that store all the data and are thus not suited for mining massive streams. The proposed algorithm has low computational costs.
TL;DR: The results of the experiments show that the proposed approach has a comparable performance to that of random forests, with the added advantage of being applicable to any base-level algorithm without the need to randomize the latter.
Abstract: Random forests are one of the best performing methods for constructing ensembles. They derive their strength from two aspects: using random subsamples of the training data (as in bagging) and randomizing the algorithm for learning base-level classifiers (decision trees). The base-level algorithm randomly selects a subset of the features at each step of tree construction and chooses the best among these. We propose to use a combination of concepts used in bagging and random subspaces to achieve a similar effect. The latter randomly selects a subset of the features at the start and uses a deterministic version of the base-level algorithm (and is thus somewhat similar to the randomized version of the algorithm). The results of our experiments show that the proposed approach has a comparable performance to that of random forests, with the added advantage of being applicable to any base-level algorithm without the need to randomize the latter.
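The combination described above can be sketched as follows: each ensemble member gets a bootstrap sample of the rows (bagging) and a feature subset fixed once at the start (random subspaces), and is then trained with an unmodified deterministic learner. The nearest-centroid base learner, majority voting, and all names and defaults (`fit_ensemble`, `n_models`, `n_features`, `seed`) are illustrative assumptions, not the setup used in the paper.

```python
import random
from collections import Counter


def train_centroid(rows, labels, features):
    """Deterministic base learner: per-class mean over the chosen features."""
    sums, counts = {}, Counter(labels)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features, row):
    """Return the class whose centroid is closest in the chosen subspace."""
    def dist(centroid):
        return sum((row[f] - centroid[k]) ** 2 for k, f in enumerate(features))
    return min(model, key=lambda label: dist(model[label]))


def fit_ensemble(rows, labels, n_models=25, n_features=2, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # subspace chosen up front
        model = train_centroid([rows[i] for i in idx], [labels[i] for i in idx], feats)
        ensemble.append((model, feats))
    return ensemble


def predict(ensemble, row):
    """Majority vote over the ensemble members."""
    votes = Counter(predict_centroid(m, f, row) for m, f in ensemble)
    return votes.most_common(1)[0][0]
```

Because the randomness lives entirely in the sampling of rows and features, the base learner can be swapped for any other deterministic algorithm without changing `fit_ensemble`.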
TL;DR: This paper considers the problem of classification on data streams and develops an instance-based learning algorithm for that purpose and suggests that this algorithm has a number of desirable properties that are not, at least not as a whole, shared by currently existing alternatives.
Abstract: The processing of data streams in general and the mining of such streams in particular have recently attracted considerable attention in various research fields. A key problem in stream mining is to extend existing machine learning and data mining methods so as to meet the increased requirements imposed by the data stream scenario, including the ability to analyze incoming data in an online, incremental manner, to observe tight time and memory constraints, and to appropriately respond to changes of the data characteristics and underlying distributions, amongst others. This paper considers the problem of classification on data streams and develops an instance-based learning algorithm for that purpose. The experimental studies presented in the paper suggest that this algorithm has a number of desirable properties that are not, at least not as a whole, shared by currently existing alternatives. Notably, our method is very flexible and thus able to adapt to an evolving environment quickly, a point of utmost importance in the data stream context. At the same time, the algorithm is relatively robust and thus applicable to streams with different characteristics.
TL;DR: In this article, a simple probabilistic framework for transaction data is presented, which can be used to simulate transaction data when no associations are present, and two new interest measures, hyper-lift and hyper-confidence, are developed to filter or order mined association rules.
Abstract: Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. We start this paper with presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left hand side of rules and that lift performs poorly to filter random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
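The two established measures examined in the abstract, confidence and lift, have standard definitions in terms of itemset support, sketched below. The function names and the toy transactions are hypothetical. Hyper-lift and hyper-confidence are not implemented here because the abstract does not define them.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Ratio of the observed joint support to that expected under independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Toy data (made up for illustration).
baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}]
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
rule_lift = lift(baskets, {"milk"}, {"bread"})
```

A lift near 1 suggests the two sides of the rule co-occur about as often as independent items would; both functions raise `ZeroDivisionError` if the relevant support is zero.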
TL;DR: It is found that the performance of these systems was remarkably similar and that the extended systems have significant weaknesses which are not apparent for the simpler Naive Bayes learner, SpamBayes.
Abstract: We describe an in-depth analysis of spam-filtering performance of a simple Naive Bayes learner and two extended variants. A set of seven mailboxes comprising about 65,000 mails from seven different users, as well as a representative snapshot of 25,000 mails which were received over 18 weeks by a single user, were used for evaluation. Our main motivation was to test whether two extended variants of Naive Bayes learning, SA-Train and CRM114, were superior to simple Naive Bayes learning, represented by SpamBayes. Surprisingly, we found that the performance of these systems was remarkably similar and that the extended systems have significant weaknesses which are not apparent for the simpler Naive Bayes learner. The simpler Naive Bayes learner, SpamBayes, also offers the most stable performance in that it deteriorates least over time. Overall, SpamBayes should be preferred over the more complex variants.
TL;DR: This paper proposes an active learning system that is sensitive to significant changes and robust to noisy changes, and can quickly adapt to concept-drift.
Abstract: Mining time-changing data streams is of great interest. The fundamental problems are how to effectively identify the significant changes and organize new training data to adjust the outdated model. In this paper, we propose an active learning system to address these issues. Without needing to know any true labels of the new data, we devise an active approach to detecting the possible changes. Whenever suspected changes are indicated, it exploits a light-weight uncertainty sampling algorithm to choose the most informative instances to label. With these labeled instances, it further tests whether the suspected changes are real. If the changes indeed cause significant performance deterioration of the current model, it evolves the old model. Thus, our method is sensitive to significant changes and robust to noisy changes, and can quickly adapt to concept drift. Experimental results from both synthetic and real-world data confirm the advantages of our system.
TL;DR: Relational variants of neural topographic maps including the self-organizing map and neural gas are introduced, which allow clustering and visualization of data given as pairwise similarities or dissimilarities with continuous prototype updates, providing a way to transfer batch optimization to relational data.
Abstract: We introduce relational variants of neural topographic maps including the self-organizing map and neural gas, which allow clustering and visualization of data given as pairwise similarities or dissimilarities with continuous prototype updates. It is assumed that the (dis-)similarity matrix originates from Euclidean distances; however, the underlying embedding of points is unknown. Batch optimization schemes for topographic map formations are formulated in terms of the given (dis-)similarities and convergence is guaranteed, thus providing a way to transfer batch optimization to relational data.
TL;DR: An extensive experimental evaluation shows that ${AP}_{stream}$, the proposed algorithm, yields a good approximation of the exact global result considering both the set of patterns found and their supports.
Abstract: Many critical applications, like intrusion detection or stock market analysis, require a nearly immediate result based on a continuous and infinite stream of data. In most cases finding an exact solution is not compatible with limited availability of resources and real time constraints, but an approximation of the exact result is enough for most purposes.
This paper introduces a new algorithm for approximate mining of frequent itemsets from streams of transactions using a limited amount of memory. The proposed algorithm is based on the computation of frequent itemsets in recent data and an effective method for inferring the global support of previously infrequent itemsets. Both upper and lower bounds on the support of each pattern found are returned along with the interpolated support. An extensive experimental evaluation shows that ${AP}_{stream}$, the proposed algorithm, yields a good approximation of the exact global result considering both the set of patterns found and their supports.
TL;DR: A technique to discover a pattern from a given sequence is presented followed by a general novel method to classify the sequence, which considers mainly the dependencies among the neighbouring elements of a sequence.
Abstract: Sequence classification is a significant problem that arises in many different real-world applications. The purpose of a sequence classifier is to assign a class label to a given sequence. Obtaining the pattern that characterizes the sequence is also usually very useful. In this paper, a technique to discover a pattern from a given sequence is presented, followed by a general novel method to classify the sequence. This method considers mainly the dependencies among the neighbouring elements of a sequence. In order to evaluate this method, a UNIX command environment is presented, but the method is general enough to be applied to other environments.
TL;DR: This paper provides an innovative methodology for Data Fusion based on an incremental imputation algorithm in tree-based models that works for a mixed data structure including both numerical and categorical variables.
Abstract: Data Fusion and Data Grafting are concerned with combining files and information coming from different sources. The problem is not to extract data from a single database, but to merge information collected from different sample surveys. The typical data fusion situation is formed of two data samples, the former made up of a complete data matrix X relative to a first survey, and the latter, Y, which contains a certain number of missing variables. The aim is to complete the matrix Y starting from the knowledge acquired from X. Thus, the goal is the definition of the correlation structure which joins the two data matrices to be merged. In this paper, we provide an innovative methodology for Data Fusion based on an incremental imputation algorithm in tree-based models. In addition, we consider robust tree validation by boosting iterations. A relevant advantage of the proposed method is that it works for a mixed data structure including both numerical and categorical variables. As benchmarking methods we consider explicit methods such as standard trees and multiple regression as well as an implicit method based on principal component analysis. An extensive simulation study shows that the proposed method is more accurate than the other methods.
TL;DR: Five discretization methods are compared on a medical time series dataset; Persist, a temporal discretization method, yields the longest time intervals and the lowest error rate.
Abstract: Discretization is widely used in data mining as a preprocessing step; discretization usually leads to improved performance. In time series analysis the data is commonly divided into time windows. Measurements are extracted from the time window into a vectorial representation and static mining methods are applied, which avoids an explicit analysis along time. Abstracting time series into meaningful time interval series makes it possible to mine the data explicitly along time. Transforming time series into time intervals can be done through discretization and concatenation of adjacent time points with equal values. We compare in this study five discretization methods on a medical time series dataset. Persist, a temporal discretization method, yields the longest time intervals and the lowest error rate.
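The transformation from time points to time intervals described above can be sketched in two steps: map each value to a discrete state, then concatenate adjacent time points that share a state. Equal-width binning is used here only as a simple stand-in discretizer; it is an assumption, not one of the methods the study is known to compare, and the names `equal_width_discretize` and `to_intervals` are hypothetical.

```python
def equal_width_discretize(values, bins):
    """Map each value to a state index in 0..bins-1 using equal-width cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), bins - 1) for v in values]


def to_intervals(states):
    """Concatenate adjacent equal states into (state, start, end) intervals."""
    intervals = []
    for t, s in enumerate(states):
        if intervals and intervals[-1][0] == s:
            state, start, _ = intervals[-1]
            intervals[-1] = (state, start, t)  # extend the open interval
        else:
            intervals.append((s, t, t))        # start a new interval
    return intervals


series = [1, 1, 2, 9, 9, 10, 5]
states = equal_width_discretize(series, bins=3)
intervals = to_intervals(states)
```

Longer intervals mean fewer, more persistent states, which is the property the abstract reports for Persist.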
TL;DR: This paper introduces a novel approach to providing recommendations using collaborative filtering when user ratings arrive over an incoming data stream, by dynamically building a decision tree for every item as data arrive.
Abstract: Collaborative filtering is one of the most popular recommendation algorithms. Most collaborative filtering algorithms work with static data. This paper introduces a novel approach to providing recommendations using collaborative filtering when user ratings arrive over an incoming data stream. In this case a large number of data records can arrive rapidly, making it impossible to save all of them for later analysis. Moreover, user interests may change over time. By dynamically building a decision tree for every item as data arrive, the incoming data stream is used effectively, with a trade-off between keeping up with changes in users' interests and accuracy. By adding a simple step using an item taxonomy hierarchy, it is also possible to further improve the predicted ratings made by each decision tree and generate recommendations in real time. Empirical studies with the dynamically built decision trees show that our algorithm works effectively and improves the overall prediction accuracy.
TL;DR: This paper presents an approach to automate the annotation of results obtained from ubiquitous data stream clustering to facilitate interpreting and use of the results to enable real-time, mobile decision making.
Abstract: Ubiquitous Data Mining is the process of analysing data emanating from distributed and heterogeneous sources in the form of a continuous stream with mobile and/or embedded devices. Unsupervised learning is clearly beneficial for initial understanding of data streams, and consequently various clustering algorithms have been developed and applied in UDM systems for the purpose of mining data streams. However, unsupervised data mining techniques require human intervention for further understanding and analysis of the clustering results. This becomes an issue as UDM applications aim to support mobile and highly dynamic users/applications and there is a need for real-time decision making and interpretation of results. In this paper we present an approach to automate the annotation of results obtained from ubiquitous data stream clustering to facilitate interpreting and use of the results to enable real-time, mobile decision making.
TL;DR: This paper introduces an unbiased estimator of the conditional mutual information, based on Monte Carlo estimation, and test the performance of the proposed model in a real life context, related to higher education management, where regression problems with discrete and continuous variables are common.
Abstract: In this paper we explore the use of Tree Augmented Naive Bayes (TAN) in regression problems where some of the independent variables are continuous and some others are discrete. The proposed solution is based on the approximation of the joint distribution by a Mixture of Truncated Exponentials (MTE). The construction of the TAN structure requires the use of the conditional mutual information, which cannot be analytically obtained for MTEs. In order to solve this problem, we introduce an unbiased estimator of the conditional mutual information, based on Monte Carlo estimation. We test the performance of the proposed model in a real life context, related to higher education management, where regression problems with discrete and continuous variables are common.
TL;DR: This paper presents a methodology to describe the finite mixture of multivariate Bernoulli distributions with a compact and understandable description, and extracts the maximal frequent itemsets from the cluster-specific data sets.
Abstract: Finite mixture models can be used in estimating complex, unknown probability distributions and also in clustering data. The parameters of the models form a complex representation and are not suitable for interpretation purposes as such. In this paper, we present a methodology to describe the finite mixture of multivariate Bernoulli distributions with a compact and understandable description. First, we cluster the data with the mixture model and subsequently extract the maximal frequent itemsets from the cluster-specific data sets. The mixture model is used to model the data set globally and the frequent itemsets model the marginal distributions of the partitioned data locally. We present the results in understandable terms that reflect the domain properties of the data. In our application of analyzing DNA copy number amplifications, the descriptions of amplification patterns are represented in nomenclature used in literature to report amplification patterns and generally used by domain experts in biology and medicine.
TL;DR: A novel QSD (Quick Simple Decomposition) algorithm, using a simple decomposition principle derived from the minimal heap tree, is proposed; it can discover the frequent itemsets quickly in one database scan and can be applied to on-line incremental mining applications without any modification.
Abstract: The generation of frequent itemsets is an essential and time-consuming step in mining association rules. Most studies adopt the Apriori-based approach, which expends great effort in generating candidate itemsets and needs multiple database accesses. Recent studies indicate that the FP-tree approach avoids the generation of candidate itemsets and scans the transaction database only twice, but it works with a more complicated data structure. Besides, the structure of the FP-tree needs to be adjusted when it is applied to incremental mining applications: the position of an item must be moved upward or downward in the FP-tree when a new transaction increases or decreases the accumulated count of the item. This adjustment is the bottleneck of the FP-tree in incremental mining applications. Therefore, algorithms for efficient mining of frequent patterns are in urgent demand.
This paper aims to improve both time and space efficiency in mining frequent itemsets and in incremental mining applications. We propose a novel QSD (Quick Simple Decomposition) algorithm, using a simple decomposition principle derived from the minimal heap tree, which can discover the frequent itemsets quickly in one database scan. Meanwhile, the QSD algorithm does not need to rescan the database or reconstruct its data structure when the database is updated or the minimum support is varied. It can be applied to on-line incremental mining applications without any modification.
Comprehensive experiments have been conducted to assess the performance of the proposed algorithm. The experimental results show that the QSD algorithm outperforms previous algorithms.
TL;DR: An in-depth empirical comparison and analysis of popular sequence learning methods in terms of the quality of information produced, for several synthetic and real-world datasets, under controlled settings of noise finds that both frequency-based and statistics-based approaches suffer from common statistical biases based on the length of sequences considered.
Abstract: Unsupervised sequence learning is important to many applications. A learner is presented with unlabeled sequential data, and must discover sequential patterns that characterize the data. Popular approaches to such learning include (and often combine) frequency-based approaches and statistical analysis. However, the quality of results is often far from satisfactory. Though most previous investigations seek to address method-specific limitations, we instead focus on general (method-neutral) limitations in current approaches. This paper takes two key steps towards addressing such general quality-reducing flaws. First, we carry out an in-depth empirical comparison and analysis of popular sequence learning methods in terms of the quality of information produced, for several synthetic and real-world datasets, under controlled settings of noise. We find that both frequency-based and statistics-based approaches (i) suffer from common statistical biases based on the length of the sequences considered; (ii) are unable to correctly generalize the patterns discovered, thus flooding the results with multiple instances (with slight variations) of the same pattern. We additionally show empirically that the relative quality of different approaches changes based on the noise present in the data: Statistical approaches do better at high levels of noise, while frequency-based approaches do better at low levels of noise. As our second contribution, we develop methods for countering these common deficiencies. We show how to normalize rankings of candidate patterns such that the relative ranking of different-length patterns can be compared. We additionally show the use of clustering, based on sequence similarity, to group together instances of the same general pattern, and choose the most general pattern that covers all of these. The results show significant improvements in the quality of results in all methods, and across all noise settings.
TL;DR: An efficient method using a multi-objective genetic algorithm (MOGAMOD) to discover optimal motifs in sequential data is proposed; it exhibits good performance over the other methods in terms of runtime, the number of shaded samples and multiple motifs.
Abstract: We propose an efficient method using a multi-objective genetic algorithm (MOGAMOD) to discover optimal motifs in sequential data. The main advantage of our approach is that a large number of tradeoff (i.e., nondominated) motifs can be obtained by a single run with respect to conflicting objectives: similarity, motif length and support maximization. To the best of our knowledge, this is the first effort in this direction. MOGAMOD can be applied to any data set with a sequential character. Furthermore, it allows any choice of similarity measures for finding motifs. By analyzing the obtained optimal motifs, the decision maker can understand the tradeoff between the objectives. We compare MOGAMOD with three well-known motif discovery methods: AlignACE, MEME and Weeder. Experimental results on a real data set extracted from the TRANSFAC database demonstrate that the proposed method exhibits good performance over the other methods in terms of runtime, the number of shaded samples and multiple motifs.
TL;DR: It is shown that by applying the method for finding optimal, abstaining classifiers based on the ROC analysis, one can significantly reduce the rates of false positives and the false negatives as well as overall misclassification cost, making this method particularly viable for this application domain.
Abstract: Intrusion Detection Systems have been observed to trigger an abundance of false positives, that is, alerts not reporting security problems. Assuming that in real installations most of the alerts are reviewed by human security analysts in a timely manner, it is possible to use supervised machine learning techniques for automated alert classification to classify alerts into true and false positives. This paper explores the requirements for such an alert classification system and shows that, being a difficult and challenging machine learning problem, it is particularly suited for the application of abstaining classifiers, i.e., classifiers that can refrain from classification in some cases.
We show that by applying our method for finding optimal, abstaining classifiers based on the ROC analysis, one can significantly reduce the rates of false positives and the false negatives as well as overall misclassification cost, making this method particularly viable for this application domain. Finally, we validate our method on one real-world proprietary dataset and one synthetic, publicly available dataset.
TL;DR: The role played by an instrumental variable in stratifying either the variables or the objects is considered, leading to a tree-based methodology for conditional classification.
Abstract: The framework of this paper is supervised learning using classification trees. Two types of variables play a role in the definition of the classification rule, namely a response variable and a set of predictors. The tree classifier is built by recursively partitioning the prediction space so as to provide internally homogeneous groups of objects with respect to the response classes. In the following, we consider the role played by an instrumental variable in stratifying either the variables or the objects. This leads to a tree-based methodology for conditional classification. Two special cases will be discussed for growing multiple discriminant trees and partial predictability trees. These approaches use discriminant analysis and predictability measures, respectively. Empirical evidence of their usefulness will be shown in real case studies.
TL;DR: A new statistical approach is introduced which biases the initial support for sequential patterns; it has the advantage of maximizing either the precision or the recall, as chosen by the user, while limiting the degradation of the other criterion.
Abstract: Recently, the knowledge extraction community has taken a closer look at new models where data arrive over time as a fast and continuous flow, i.e., data streams. As only a part of the stream can be stored, mining data streams for sequential patterns and updating previously found frequent patterns need to cope with uncertainty. In this paper, we introduce a new statistical approach which biases the initial support for sequential patterns. This approach has the advantage of maximizing either the precision or the recall, as chosen by the user, while limiting the degradation of the other criterion. Moreover, these statistical supports help build statistical borders, which are the relevant sets of frequent patterns to use in an incremental mining process. From the statistical standpoint, theoretical results show that the technique is not far from the optimum. Experiments performed on sequential patterns demonstrate the interest of this approach and the potential of such techniques.
TL;DR: The paper presents new developments in an extension of Codd's relational model of data by equipping domains of attribute values with a similarity relation and adding ranks to rows of a database table and comments on related approaches presented in the literature.
Abstract: The paper presents new developments in an extension of Codd's relational model of data. The extension consists in equipping domains of attribute values with a similarity relation and adding ranks to rows of a database table. This way, the concept of a table over domains (i.e., a relation over a relation scheme) in Codd's classical model extends to the concept of a ranked table over domains with similarities. When all similarities are ordinary identity relations and all ranks are set to 1, our extension becomes Codd's ordinary model. The main contribution of our paper is twofold. First, we present an outline of a relational algebra for our extension. Second, we deal with implementation issues of our extension. We also comment on related approaches presented in the literature.
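A ranked table over domains with similarities can be pictured with a small Python sketch of a similarity-based selection. The abstract does not say how row ranks and similarity degrees are combined, so taking their minimum is an assumption made here for illustration, as are the function names, the price similarity, and the sample rows.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
RankedTable = List[Tuple[Row, float]]  # each row carries a rank in [0, 1]


def similarity_select(table: RankedTable, attribute: str, value: object,
                      sim: Callable[[object, object], float]) -> RankedTable:
    """Select rows whose attribute is similar to value.

    The new rank is min(old rank, similarity degree); this combination is an
    assumption of the sketch. Rows whose new rank is 0 are dropped.
    """
    result = []
    for row, rank in table:
        new_rank = min(rank, sim(row[attribute], value))
        if new_rank > 0.0:
            result.append((row, new_rank))
    return result


def identity(a: object, b: object) -> float:
    """Ordinary equality expressed as a similarity: 1 if equal, else 0."""
    return 1.0 if a == b else 0.0


def price_similarity(a: object, b: object) -> float:
    """Hypothetical similarity: decreases linearly, reaching 0 at a gap of 100."""
    return max(0.0, 1.0 - abs(float(a) - float(b)) / 100.0)


houses: RankedTable = [
    ({"id": 1, "price": 300}, 1.0),
    ({"id": 2, "price": 275}, 1.0),
    ({"id": 3, "price": 250}, 1.0),
    ({"id": 4, "price": 100}, 1.0),
]

similar = similarity_select(houses, "price", 300, price_similarity)
exact = similarity_select(houses, "price", 300, identity)
print([(row["id"], rank) for row, rank in similar])
print([(row["id"], rank) for row, rank in exact])
```

With the identity similarity and all ranks equal to 1, the selection returns only exact matches with rank 1, mirroring the abstract's remark that the extension then becomes the ordinary model.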
TL;DR: It is shown that SVM solutions with small C are high performers; however, most training documents are then bounded support vectors sharing the same weight C, so SVM reduces to a nearest mean classifier, which raises an interesting question about the merits of SVM in sparse bag-of-words feature spaces.
Abstract: We are concerned with the problem of learning classification rules in text categorization, where many authors have presented Support Vector Machines (SVM) as the leading classification method. A number of studies, however, have repeatedly pointed out that in some situations SVM is outperformed by simpler methods such as naive Bayes or the nearest-neighbor rule. In this paper, we aim to develop a better understanding of SVM behaviour in typical text categorization problems represented by sparse bag-of-words feature spaces. We study in detail the performance and the number of support vectors when varying the training set size, the number of features and, unlike existing studies, also the SVM free parameter C, which is the upper bound on the Lagrange multipliers in the SVM dual. We show that SVM solutions with small C are high performers. However, most training documents are then bounded support vectors sharing the same weight C. Thus, SVM reduces to a nearest mean classifier; this raises an interesting question about the merits of SVM in sparse bag-of-words feature spaces. Additionally, SVM suffers from performance deterioration for particular combinations of training set size and number of features.
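The link between bounded support vectors and a nearest mean classifier can be checked numerically with a short Python sketch. It assumes a linear SVM whose weight vector is w = sum_i alpha_i * y_i * x_i and considers the limiting case in which every training point (not merely most of them) has alpha_i = C, with equally sized classes. The toy data, the value of C, and the function names are illustrative. The bias term and the actual SVM training are not modelled.

```python
from typing import List, Sequence


def weight_vector(xs: Sequence[Sequence[float]], ys: Sequence[int],
                  alphas: Sequence[float]) -> List[float]:
    """Linear SVM weights from dual variables: w = sum_i alpha_i * y_i * x_i."""
    dim = len(xs[0])
    w = [0.0] * dim
    for x, y, alpha in zip(xs, ys, alphas):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w


def class_mean(xs: Sequence[Sequence[float]], ys: Sequence[int],
               label: int) -> List[float]:
    """Component-wise mean of the points carrying the given label."""
    members = [x for x, y in zip(xs, ys) if y == label]
    return [sum(column) / len(members) for column in zip(*members)]


# Toy data (illustrative): two points per class, labels in {+1, -1}.
xs = [[2.0, 0.0], [4.0, 2.0], [0.0, 2.0], [0.0, 4.0]]
ys = [1, 1, -1, -1]
C = 0.5  # assumed upper bound on the dual variables

# Limiting case: every point is a bounded support vector with alpha_i = C.
w = weight_vector(xs, ys, [C] * len(xs))

mean_pos = class_mean(xs, ys, 1)
mean_neg = class_mean(xs, ys, -1)
mean_diff = [p - n for p, n in zip(mean_pos, mean_neg)]

n_per_class = 2
scaled_diff = [C * n_per_class * d for d in mean_diff]
print(w, scaled_diff)
```

Under these assumptions w equals C times the class size times the difference of the class means, so the separating direction is the same one a nearest mean classifier would use.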