TL;DR: This article summarizes experience and recommendations for computing data set preprocessing and transformation, the most time-consuming task in data mining projects, inside a database system, and identifies advantages and disadvantages from a practical standpoint based on data mining users' feedback.
Abstract: In general, there is a significant amount of data mining analysis performed outside a database system, which creates many data management issues. This article presents a summary of our experience and recommendations to compute data set preprocessing and transformation inside a database system (i.e., data cleaning, record selection, summarization, denormalization, variable creation, coding), which is the most time-consuming task in data mining projects. This aspect is largely ignored in the literature. We present practical issues, common solutions and lessons learned when preparing and transforming data sets with the SQL language, based on experience from real-life projects. We then provide specific guidelines to translate programs written in a traditional programming language into SQL statements. Based on successful real-life projects, we present time performance comparisons between SQL code running inside the database system and external data mining programs. We highlight which steps in data mining projects become faster when processed by the database system. More importantly, we identify advantages and disadvantages from a practical standpoint based on data mining users' feedback.
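The preprocessing steps named in this abstract (record selection, denormalization, summarization, cleaning, coding) can be illustrated with a single SQL statement. The sketch below is only an illustration and is not taken from the article: the `customers` and `transactions` tables, their columns, and both function names are hypothetical, and SQLite (through Python's standard `sqlite3` module) stands in for whatever database system a project actually uses.

```python
import sqlite3

def demo_connection():
    """Create a tiny in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT, active INTEGER);
        CREATE TABLE transactions (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'north', 1), (2, 'south', 1), (3, 'north', 0);
        INSERT INTO transactions VALUES (1, 10.0), (1, 5.5), (3, 99.0);
    """)
    return conn

def build_customer_dataset(conn):
    """Build one row per active customer, with all work done by the database."""
    query = """
        SELECT c.customer_id,
               COUNT(t.amount)              AS n_purchases,  -- summarization
               COALESCE(SUM(t.amount), 0.0) AS total_spent,  -- cleaning: NULL becomes 0
               CASE WHEN c.region = 'north' THEN 1 ELSE 0 END AS is_north  -- coding
        FROM customers AS c
        LEFT JOIN transactions AS t
               ON t.customer_id = c.customer_id              -- denormalization
        WHERE c.active = 1                                   -- record selection
        GROUP BY c.customer_id, c.region
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()
```

Calling `build_customer_dataset(demo_connection())` returns one tuple per active customer, so the analysis data set is produced without exporting the raw tables.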
TL;DR: This article focuses on the discovery of new connections between domains (so-called bisociations), supporting the creative discovery process in a novel way. It motivates this approach, shows the difference to classical data analysis, and concludes by briefly illustrating some types of domain-crossing connections.
Abstract: Data analysis generally focusses on finding patterns within a reasonably well connected domain of interest. In this article we focus on the discovery of new connections between domains (so called bisociations), supporting the creative discovery process in a novel way. We motivate this approach, show the difference to classical data analysis and conclude by briefly illustrating some types of domain-crossing connections along with illustrative examples.
TL;DR: Computational sustainability is a new interdisciplinary research field with the overall goal of developing computational models, methods, and tools to help manage the balance between environmental, economic, and societal needs for sustainable development.
Abstract: Computational sustainability [1] is a new interdisciplinary research field with the overall goal of developing computational models, methods, and tools to help manage the balance between environmental, economic, and societal needs for sustainable development. The notion of sustainable development -- development that meets the needs of the present without compromising the ability of future generations to meet their needs -- was introduced in Our Common Future, the seminal report of the United Nations World Commission on Environment and Development, published in 1987. In this talk I will provide an overview of computational sustainability, with examples ranging from wildlife conservation and biodiversity, to poverty mitigation, to large-scale deployment and management of renewable energy sources. I will highlight overarching computational challenges at the intersection of constraint reasoning, optimization, data mining, and dynamical systems. Finally I will discuss the need for a new approach that views computational sustainability problems as "natural" phenomena, amenable to a scientific methodology, in which principled experimentation, to explore problem parameter spaces and hidden problem structure, plays as prominent a role as formal analysis.
TL;DR: An empirical investigation uncovers several underlying causes of the discrepancy in the results reported in the literature, which allows us to better interpret the current body of research, and inform recommendations for future use of the data set.
Abstract: The KDD Cup '99 data set has been widely used to evaluate intrusion detection prototypes, most based on machine learning techniques, for nearly a decade. The data set served well in the KDD Cup '99 competition to demonstrate that machine learning can be useful in intrusion detection systems. However, there are discrepancies in the findings reported in the literature. Further, some researchers have published criticisms of the data (and the DARPA data from which the KDD Cup '99 data has been derived), questioning the validity of results obtained with this data. Despite the criticisms, researchers continue to use the data due to a lack of better publicly available alternatives. Hence, it is important to identify the value of the data set and the findings from the extensive body of research based on it, which has largely been ignored by the existing critiques. This paper reports on an empirical investigation, demonstrating the impact of several methodological differences in the publicly available subsets, which uncovers several underlying causes of the discrepancy in the results reported in the literature. These findings allow us to better interpret the current body of research, and inform recommendations for future use of the data set.
TL;DR: This work empirically demonstrates that the emotions triggered by viewing abstract art images can be predicted with reasonable accuracy by machine, using a variety of low-level image descriptors such as color, shape, and texture.
Abstract: In this work, we study people's emotions evoked by viewing abstract art images based on traditional low-level image features within a binary classification framework. Abstract art is used here instead of artistic or photographic images because those contain contextual information that influences the emotional assessment in a highly individual manner. Whether an image of a cat or a mountain elicits a negative or positive response is subjective. After discussing challenges concerning image emotional semantics research, we empirically demonstrate that the emotions triggered by viewing abstract art images can be predicted with reasonable accuracy by machine using a variety of low-level image descriptors such as color, shape, and texture. The abstract art dataset that we created for this work has been made downloadable to the public.
TL;DR: This work argues the value of unsupervised metalearning and discusses the attendant necessity of suitable similarity, or distance, functions, and uses COD to produce a clustering of 21 learning algorithms.
Abstract: We argue the value of unsupervised metalearning and discuss the attendant necessity of suitable similarity, or distance, functions. We leverage the notion of diversity among learners used in ensemble learning to design a distance function for the clustering of learning algorithms. We revisit the most popular measures of diversity and show that only one of them, Classifier Output Difference (COD), is a metric. We then use COD to produce a clustering of 21 learning algorithms, and show how this clustering differs from a clustering based on accuracy, and how it can be used to highlight interesting, sometimes unexpected, similarities among learning algorithms.
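To make the distance function concrete, here is a minimal sketch. It assumes COD is the fraction of examples on which two classifiers output different labels, which is how the measure is usually defined; the abstract itself does not spell out the formula. The function names and the classifier names in the usage note are hypothetical.

```python
def classifier_output_difference(preds_a, preds_b):
    """Fraction of examples on which two classifiers predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("need at least one prediction")
    disagreements = sum(1 for a, b in zip(preds_a, preds_b) if a != b)
    return disagreements / len(preds_a)

def distance_matrix(predictions):
    """Pairwise COD for a mapping of classifier name -> list of predicted labels."""
    names = sorted(predictions)
    return {
        (p, q): classifier_output_difference(predictions[p], predictions[q])
        for p in names
        for q in names
    }
```

With predictions `{"tree": [0, 1, 1, 0], "knn": [0, 1, 0, 0], "nb": [1, 1, 0, 0]}`, the resulting matrix is symmetric, zero on the diagonal, and satisfies the triangle inequality, as expected of a metric. Such a matrix could then be passed to any clustering routine that accepts precomputed distances.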
TL;DR: This work followed the most commonly used approach and induced a binary classifier for each class in text categorization and noticed that performance had been impaired by two factors.
Abstract: An interesting issue in machine learning is induction in multi-label domains where each example can be labeled with two or more classes at the same time. In a work focusing on text categorization, we followed the most commonly used approach and induced a binary classifier for each class. Analyzing the results, we noticed that performance had been impaired by two factors. First, in text domains, each class is characterized by a different set of attributes; an appropriate attribute-selection technique thus has to be applied separately to each of them. Second, the individual classes often have to be induced from imbalanced training sets, a circumstance we addressed here by majority-class undersampling. The paper provides details of the induction system and reports the results of systematic experimentation.
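The two ingredients described above, one binary problem per class and majority-class undersampling, can be sketched as follows. This is an illustrative sketch rather than the paper's induction system: it uses plain random undersampling down to the minority-class size, and the function names and the fixed `seed` default are assumptions.

```python
import random

def one_vs_rest_labels(label_sets, target_class):
    """Turn multi-label annotations into binary labels for one target class."""
    return [1 if target_class in labels else 0 for labels in label_sets]

def undersample_majority(examples, labels, seed=0):
    """Randomly drop majority-class examples until both classes have equal size."""
    rng = random.Random(seed)
    positives = [x for x, y in zip(examples, labels) if y == 1]
    negatives = [x for x, y in zip(examples, labels) if y == 0]
    if len(positives) <= len(negatives):
        negatives = rng.sample(negatives, len(positives))
    else:
        positives = rng.sample(positives, len(negatives))
    balanced = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    rng.shuffle(balanced)  # avoid presenting all positives first
    return balanced
```

Each class would get its own call to `one_vs_rest_labels` followed by `undersample_majority`, so that attribute selection and training can then be run separately on each balanced binary set.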
TL;DR: This paper describes various data sources and discusses the principles and techniques of data mining as applied to software engineering data. It surveys the mining approaches that have been used and categorizes them according to the corresponding parts of the development process and the task they assist.
Abstract: The increased availability of data created as part of the software development process allows us to apply novel analysis techniques on the data and use the results to guide the process's optimization. In this paper we describe various data sources and discuss the principles and techniques of data mining as applied to software engineering data. Data that can be mined is generated by most parts of the development process: requirements elicitation, development analysis, testing, debugging, and maintenance. Based on this classification we survey the mining approaches that have been used and categorize them according to the corresponding parts of the development process and the task they assist. Thus the survey provides researchers with a concise overview of data mining techniques applied to software engineering data, and aids practitioners in the selection of appropriate data mining techniques for their work.
TL;DR: An extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments, in order to overcome the computational limits of the original serial algorithm is proposed.
Abstract: The amount of data produced by ubiquitous computing applications is quickly growing, due to the pervasive presence of small devices endowed with sensing, computing and communication capabilities. Heterogeneity and strong interdependence, which characterize 'ubiquitous data', require a (multi-)relational approach to their analysis. However, relational data mining algorithms do not scale well and very large data sets are hardly processable. In this paper we propose an extension of a relational algorithm for multi-level frequent pattern discovery, which resorts to data sampling and distributed computation in Grid environments, in order to overcome the computational limits of the original serial algorithm. The set of patterns discovered by the new algorithm approximates the set of exact solutions found by the serial algorithm. The quality of approximation depends on three parameters: the proportion of data in each sample, the minimum support thresholds and the number of samples in which a pattern has to be frequent in order to be considered globally frequent. Considering that the first two parameters are hardly controllable, we focus our investigation on the third one. Theoretically derived conclusions are also experimentally confirmed. Moreover, an additional application in the context of event log mining proves the viability of the proposed approach to relational frequent pattern mining from very large data sets.
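The third parameter, the number of samples in which a pattern has to be frequent, amounts to a voting rule over the per-sample results. The sketch below shows only that aggregation step under simplifying assumptions: patterns are represented as hashable values, the per-sample mining is assumed to have been done elsewhere, and `globally_frequent` is a hypothetical name, not the paper's.

```python
from collections import Counter

def globally_frequent(per_sample_patterns, min_samples):
    """Return patterns that are frequent in at least `min_samples` samples.

    `per_sample_patterns` holds one collection per sample, each containing the
    patterns found frequent in that sample.
    """
    votes = Counter()
    for patterns in per_sample_patterns:
        votes.update(set(patterns))  # each sample votes at most once per pattern
    return {pattern for pattern, n in votes.items() if n >= min_samples}
```

Raising `min_samples` makes the approximation stricter (fewer false positives, more missed patterns), while lowering it has the opposite effect.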
TL;DR: Empirical analysis shows how and when this PSO-CFC approach outperforms local fuzzy clustering in the domain of ubiquitous knowledge discovery.
Abstract: The goal of this article is to introduce a collaborative clustering approach to the domain of ubiquitous knowledge discovery. This clustering approach is suitable in peer-to-peer networks where different data sites want to cluster their local data as if they had consolidated their data sets, but are prevented from doing so by privacy restrictions. Two variants exist, i.e. one for data sites with the same observations but different features and one for data sites with the same features but different observations. The technique contains two parts, i.e. a collaborative fuzzy clustering technique and a particle swarm optimization to optimize the collaboration between data sites. Empirical analysis shows how and when this PSO-CFC approach outperforms local fuzzy clustering.
TL;DR: Experimental results on synthetically generated and real-life data streams show the superiority of the proposed method, by multiple orders of magnitude in terms of runtime and memory usage, over other pane-based sliding window algorithms.
Abstract: Mining frequent patterns over data streams is an interesting problem due to its wide application area. Researchers in this field have been facing two key challenges, namely reducing runtime and memory usage. In this study, a novel method for efficient mining of frequent patterns over data streams is proposed. The method is based on the sliding window model, which divides the window into a number of panes. This method provides a new sliding window mechanism by utilizing a set of simple short lists. Each list stores related information about an item in the sliding window. The proposed mechanism dynamically adapts itself to concept change. This method is empirically evaluated against recently proposed pane-based sliding window algorithms. Experimental results on synthetically generated and real-life data streams show the superiority of the proposed method, by multiple orders of magnitude in terms of runtime and memory usage, over other pane-based sliding window algorithms.
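For readers unfamiliar with the pane-based sliding window model, the following is a generic sketch of the idea: the window is a fixed number of panes, and the oldest pane is dropped as a whole when a new one starts. It is not the list-based mechanism proposed in the paper, it tracks single items rather than itemsets, and the class and parameter names are hypothetical.

```python
from collections import Counter, deque

class PaneWindow:
    """Sliding window split into fixed-size panes, with item counts per window."""

    def __init__(self, pane_size, panes_per_window):
        self.pane_size = pane_size
        self.max_panes = panes_per_window
        self.panes = deque()     # one Counter of item counts per pane
        self.counts = Counter()  # item counts over the whole window
        self._filled = 0         # transactions stored in the newest pane

    def _start_pane(self):
        if len(self.panes) == self.max_panes:
            expired = self.panes.popleft()
            self.counts -= expired  # in-place subtract; keeps only positive counts
        self.panes.append(Counter())
        self._filled = 0

    def add(self, transaction):
        """Insert one transaction (an iterable of items) into the window."""
        if not self.panes or self._filled == self.pane_size:
            self._start_pane()
        pane = self.panes[-1]
        for item in set(transaction):
            pane[item] += 1
            self.counts[item] += 1
        self._filled += 1

    def frequent_items(self, min_count):
        """Items occurring in at least `min_count` transactions of the window."""
        return {item for item, n in self.counts.items() if n >= min_count}
```

Because expiry happens one pane at a time, the window never has to revisit individual old transactions; only the per-pane summaries are kept.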
TL;DR: This study proposed two new heuristics in decision tree algorithm design, namely removal of insignificant attributes in induction process at each tree node, and usage of combined strategy for generating possible splits for decision trees, utilizing several ways of splitting together, which experimentally showed benefits.
Abstract: Typical data mining algorithms follow a so-called "black-box" paradigm, where the logic is hidden from users so as not to overburden them. We show that "white-box" algorithms constructed with a reusable-component design can have significant benefits for researchers, and end users as well. We developed a component-based algorithm design platform, and used it for "white-box" algorithm construction. The proposed platform can also be used for testing algorithm parts (reusable components), and their single or joint influence on algorithm performance. The platform is easily extensible with new components and algorithms, and allows testing of partial contributions of an introduced component. We propose two new heuristics in decision tree algorithm design, namely removal of insignificant attributes in the induction process at each tree node, and usage of a combined strategy for generating possible splits for decision trees, utilizing several ways of splitting together, which experimentally showed benefits. Using the proposed platform we tested 80 component-based decision tree algorithms on 15 benchmark datasets and present the results of reusable components' influence on performance, and statistical significance of the differences found. Our study suggests that for a specific dataset we should search for the optimal component interplay instead of looking for the optimal among predefined algorithms.
TL;DR: This paper proposes a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user in order to obtain semantic information about user preferences.
Abstract: Today, anomaly detection is a highly valuable application in the analysis of current huge datasets. Insurance companies, banks and many manufacturing industries need systems to help humans detect anomalies in their daily information. In general, anomalies are a very small fraction of the data; therefore, their detection is not an easy task. Usually the real sources of an anomaly are given by specific values expressed on selective dimensions of datasets. Furthermore, many anomalies are not really interesting for humans, because the interestingness of anomalies is judged subjectively by the human user. In this paper we propose a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user in order to obtain semantic information about user preferences. Our approach is based on 3 main steps. First, a Bayes network identifies an initial set of candidate anomalies. Afterwards, a subspace clustering technique identifies relevant subsets of dimensions. Finally, a probabilistic active learning scheme, based on properties of the Dirichlet distribution, uses the feedback from an expert user to efficiently search for relevant anomalies. Our results, using synthetic and real datasets, indicate that, under noisy data and anomalies presenting regular patterns, our approach correctly identifies relevant anomalies.
TL;DR: This work addresses the problem of estimating the parameters, from observed data in a complex social network, for an information diffusion model that takes time-delay into account, based on the popular independent cascade (IC) model, and proposes an iterative method to search for the parameters (time-delay and diffusion) that maximize this likelihood.
Abstract: We address the problem of estimating the parameters, from observed data in a complex social network, for an information diffusion model that takes time-delay into account, based on the popular independent cascade (IC) model. For this purpose we formulate the likelihood to obtain the observed data, which is a set of time-sequence data of infected (active) nodes, and propose an iterative method to search for the parameters (time-delay and diffusion) that maximize this likelihood. We first show by using a synthetic network that the proposed method outperforms the similar existing method. Next, we apply this method to problems of both (1) predicting the influence of nodes for the considered information diffusion model and (2) ranking the influential nodes. Using three large social networks, we demonstrate the effectiveness of the proposed method.
TL;DR: According to the experimental results, the proposed methodology improved both the precision and the accuracy of forecasting the global CO2 concentration by 28% and 91%, respectively.
Abstract: The global CO2 concentration is considered to be one of the most important causes of global warming that must be closely monitored, accurately forecasted, and controlled as well as possible. To accurately forecast the global CO2 concentration, a hybrid fuzzy linear regression (FLR) and back propagation network (BPN) approach is proposed in this study. In this proposed approach, multiple experts construct their own FLR equations from various viewpoints to forecast future global CO2 concentrations. Each FLR equation can be converted into two equivalent nonlinear programming problems to be solved. To combine these fuzzy forecasts, a two-step aggregation mechanism is applied. At the first step, fuzzy intersection is applied to combine the fuzzy global CO2 concentration forecasts into a polygon-shaped fuzzy number, in order to improve the precision. After that, a BPN is constructed to defuzzify the polygon-shaped fuzzy number and to generate a representative/crisp value, so as to enhance the accuracy. Some historical data on global CO2 concentrations were used to evaluate the effectiveness of the proposed methodology. According to the experimental results, the proposed methodology improved both the precision and the accuracy of forecasting the global CO2 concentration by 28% and 91%, respectively.
TL;DR: The Robust Data Quality Analysis is introduced, which exploits formal methods to support Data Quality Improvement Processes and has proved successful, by giving insights on the data quality levels and by providing suggestions on how to ameliorate the overall data quality process.
Abstract: The paper introduces the Robust Data Quality Analysis which exploits formal methods to support Data Quality Improvement Processes. The proposed methodology can be applied to data sources containing sequences of events that can be modelled by Finite State Systems. Consistency rules (derived from domain business rules) can be expressed by formal methods and can be automatically verified on data, both before and after the execution of cleansing activities. The assessment results can provide useful information to improve the data quality processes. The paper outlines the preliminary results of the methodology applied to a real case scenario: the cleansing of a very low quality database, containing the work careers of the inhabitants of an Italian province. The methodology has proved successful, by giving insights on the data quality levels and by providing suggestions on how to ameliorate the overall data quality process.
TL;DR: This paper proposes a new semi-supervised approach for handling concept-drifting data streams containing both labeled and unlabeled instances that is so general that it can be applied to different classification models.
Abstract: Recently, several approaches have been proposed to deal with the increasingly challenging task of mining concept-drifting data streams. However, most are based on supervised classification algorithms assuming that true labels are immediately and entirely available in the data streams. Unfortunately, such an assumption is often violated in real-world applications given that it is expensive or because it takes a long time to obtain all true labels. To deal with this problem, we propose in this paper a new semi-supervised approach for handling concept-drifting data streams containing both labeled and unlabeled instances. First, contrary to existing approaches, we monitor three possible kinds of drift: feature, conditional or dual drift. Drift detection is based on a hypothesis test comparing Kullback-Leibler divergence between old and recent data, whose distribution under the null hypothesis of coming from the same distribution is approximated via a bootstrap method. Then, if any drift occurs, a new classifier is learned from the recent data using the EM algorithm; otherwise, the current classifier is left unchanged. Our approach is so general that it can be applied to different classification models. Experimental studies, using the naive Bayes classifier and logistic regression, on both synthetic and real-world data sets demonstrate that our approach performs well.
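The drift test described above (a Kullback-Leibler divergence between old and recent data, with its null distribution approximated by bootstrap) can be sketched for the simplest case. The code below is a sketch under strong simplifying assumptions, not the paper's procedure: it handles a single categorical variable, uses additive smoothing to avoid zero probabilities, resamples from the pooled data to approximate the null, and every function and parameter name is hypothetical.

```python
import math
import random
from collections import Counter

def kl_divergence(sample_p, sample_q, smoothing=1.0):
    """KL(P || Q) between smoothed empirical distributions of two samples."""
    categories = set(sample_p) | set(sample_q)
    count_p, count_q = Counter(sample_p), Counter(sample_q)
    total_p = len(sample_p) + smoothing * len(categories)
    total_q = len(sample_q) + smoothing * len(categories)
    kl = 0.0
    for c in categories:
        p = (count_p[c] + smoothing) / total_p
        q = (count_q[c] + smoothing) / total_q
        kl += p * math.log(p / q)
    return kl

def drift_p_value(old, recent, n_bootstrap=500, seed=0):
    """Bootstrap p-value for the null hypothesis that both samples share one distribution."""
    rng = random.Random(seed)
    observed = kl_divergence(recent, old)
    pooled = list(old) + list(recent)
    exceed = 0
    for _ in range(n_bootstrap):
        # Under the null, both samples are draws from the same pooled distribution.
        fake_old = rng.choices(pooled, k=len(old))
        fake_recent = rng.choices(pooled, k=len(recent))
        if kl_divergence(fake_recent, fake_old) >= observed:
            exceed += 1
    return (exceed + 1) / (n_bootstrap + 1)
```

A small p-value (for example below 0.05) would signal drift and trigger relearning the classifier from the recent data; otherwise the current classifier would be kept.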
TL;DR: An alternative technique for clustering noisy categorical data using Variable Precision Rough Set model is proposed and the results show that the technique provides better performance in selecting the clustering attribute.
Abstract: Clustering a set of objects into homogeneous classes is a fundamental operation in data mining. Several cluster analysis techniques have been developed to group objects having similar characteristics. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. An algorithm termed MMR using classical rough set theory was proposed to deal with problems in clustering categorical data. However, the MMR algorithm fails to handle noisy data, which is an integral part of databases. In this paper, an alternative technique for clustering noisy categorical data using the Variable Precision Rough Set model is proposed. The results show that the technique provides better performance in selecting the clustering attribute.
TL;DR: A new attribute selection strategy, based on a lazy learning approach, is proposed; it postpones the identification of relevant attributes until an instance is submitted for classification, which in most cases improves the accuracy of classification.
Abstract: Attribute selection is a data preprocessing step which aims at identifying relevant attributes for the target machine learning task, namely classification in this paper. In this paper, we propose a new attribute selection strategy, based on a lazy learning approach, which postpones the identification of relevant attributes until an instance is submitted for classification. Our strategy relies on the hypothesis that taking into account the attribute values of an instance to be classified may contribute to identifying the best attributes for the correct classification of that particular instance. Experimental results using the k-NN and Naive Bayes classifiers, over 40 different data sets from the UCI Machine Learning Repository and five large data sets from the NIPS 2003 feature selection challenge, show the effectiveness of delaying attribute selection to classification time. The proposed lazy technique in most cases improves the accuracy of classification, when compared with the analogous attribute selection approach performed as a data preprocessing step. We also propose a metric to estimate when a specific data set can benefit from the lazy attribute selection approach.
TL;DR: A general approach to extend unsupervised prototype-based techniques to dissimilarities is reviewed, and a new supervised prototype-based classification technique for dissimilarity data is proposed.
Abstract: Unlike many black-box algorithms in machine learning, prototype-based models offer an intuitive interface to given data sets, since prototypes can directly be inspected by experts in the field. Most techniques rely on Euclidean vectors such that their suitability for complex scenarios is limited. Recently, several unsupervised approaches have successfully been extended to general, possibly non-Euclidean data characterized by pairwise dissimilarities. In this paper, we briefly review a general approach to extend unsupervised prototype-based techniques to dissimilarities, and we transfer this approach to supervised prototype-based classification for general dissimilarity data. In particular, a new supervised prototype-based classification technique for dissimilarity data is proposed.
TL;DR: This paper presents a method to identify leukemia from bone marrow cell images using a combined machine vision and data mining strategy and shows how the combination of descriptive features and eigenvalues helps to improve classification accuracy.
Abstract: The morphological analysis of medical images to support medical diagnosis is an important research area. This is the case of leukemia identification from bone marrow smears, in which cell morphology is studied in order to classify the disease into its main family and subtype, so that a proper treatment can be indicated to the patient. In this paper we present a method to identify leukemia from bone marrow cell images using a combined machine vision and data mining strategy. Our process starts with a segmentation method to obtain leukemia cells and extract from them descriptive characteristics (geometrical, texture, statistical) and eigenvalues. We use these attributes to feed machine learning algorithms that learn to classify acute leukemia families and subtypes according to the FAB system. We show how the combination of descriptive features and eigenvalues helps to improve classification accuracy. Our method achieved accuracy above 95.5% to distinguish between the acute myeloblastic and lymphoblastic leukemia families and accuracy of 90% (and above) among five leukemia subtypes (after the acute leukemia families classification).
TL;DR: This paper proposes a type of web application: a virtual newspaper with automatically generated news stories that describe the meaning of quantitative sensor data that can facilitate the use of sensor data by general users and, therefore, can increase the utility of sensor network infrastructures.
Abstract: An important competence of human data analysts is to interpret and explain the meaning of the results of data analysis to end-users. However, existing automatic solutions for intelligent data analysis provide limited help to interpret and communicate information to non-expert users. In this paper we present a general approach to generating explanatory descriptions about the meaning of quantitative sensor data. We propose a type of web application: a virtual newspaper with automatically generated news stories that describe the meaning of sensor data. This solution integrates a variety of techniques from intelligent data analysis into a web-based multimedia presentation system. We validated our approach in a real world problem and demonstrate its generality using data sets from several domains. Our experience shows that this solution can facilitate the use of sensor data by general users and, therefore, can increase the utility of sensor network infrastructures.
TL;DR: Two new hierarchical multilabel classification methods based on the well-known local approach for hierarchical classification are proposed, which presented promising results in experiments performed with bioinformatics datasets.
Abstract: In most classification problems, a classifier assigns a single class to each instance and the classes form a flat non-hierarchical structure, without superclasses or subclasses. In hierarchical multilabel classification problems, the classes are hierarchically structured, with superclasses and subclasses, and instances can be simultaneously assigned to two or more classes at the same hierarchical level. This article proposes two new hierarchical multilabel classification methods based on the well-known local approach for hierarchical classification. The methods are compared with two global methods and one well-known local binary classification method from the literature. The proposed methods presented promising results in experiments performed with bioinformatics datasets.
TL;DR: This work proposes new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time, based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory.
Abstract: We propose new algorithms for adaptively mining closed rooted trees, both labeled and unlabeled, from data streams that change over time. Closed patterns are powerful representatives of frequent patterns, since they eliminate redundant information. Our approach is based on an advantageous representation of trees and a low-complexity notion of relaxed closed trees, as well as ideas from Galois Lattice Theory. More precisely, we present three closed tree mining algorithms in sequence: an incremental one, IncTreeMiner, a sliding-window based one, WinTreeMiner, and finally one that mines closed trees adaptively from data streams, AdaTreeMiner. By adaptive we mean here that it presents at all times the closed trees that are frequent in the current state of the data stream. To the best of our knowledge this is the first work on mining closed frequent trees in streaming data varying with time. We give a first experimental evaluation of the proposed algorithms.
TL;DR: A feature selection algorithm using sequential floating forward search based on inference correlation is presented and experiments confirm the effectiveness of the feature selection approach when compared to extant feature selection methods.
Abstract: Feature selection is a critical preprocessing step in machine learning. It contributes to cost-effective model building and improvement of model prediction performance. Generally, a feature selection algorithm requires a dependency measure and a search strategy. Extant dependency measures are mostly based on pair-wise correlation analysis, which cannot detect feature interaction. To overcome this problem, we developed a unified dependency criterion called inference correlation. The inference correlation between a set of predictor variables and a response variable can be efficiently calculated. The variables could be discrete, continuous, or mixed. Therefore, inference correlation can be applied to select features for both classification and regression problems. A feature selection algorithm using sequential floating forward search based on inference correlation is presented. Experiments with the algorithm on synthetic datasets and real-world problems confirm the effectiveness of the feature selection approach when compared to extant feature selection methods.
TL;DR: Two databases are combined: voting advice application data and the results of the parliamentary elections in 2011, which allows us to model the values of Finnish citizens and the members of the parliament.
Abstract: The main goal of this paper is to model the values of Finnish citizens and the members of the parliament. To achieve this goal, two databases are combined: voting advice application data and the results of the parliamentary elections in 2011. First, the data is converted to a high-dimensional space. Then, it is projected onto two principal components. The projection allows us to visualize the main differences between the parties. The value grids are produced with a kernel density estimation method without explicitly using the questions of the voting advice application. However, we find meaningful interpretations for the axes in the visualizations with the analyzed data. Subsequently, all candidate value grids are weighted by the results of the parliamentary elections. The result can be interpreted as a distribution grid for Finnish voters' values.
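The pipeline described in this abstract (projection onto two principal components, kernel density estimation on a grid, weighting by election results) can be sketched roughly as follows. This is an illustrative reading only: the abstract does not give the kernel, bandwidth or encoding of the answers, so the Gaussian kernel, the bandwidth and grid_size parameters, and the function names pca_2d and weighted_density_grid are all assumptions.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X (candidates x answers) onto the first two
    principal components, computed via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


def weighted_density_grid(points, weights, grid_size=50, bandwidth=0.5):
    """Gaussian kernel density estimate on a regular 2-D grid.

    Each point (one candidate) contributes a kernel scaled by its weight
    (assumed here to be the candidate's vote count). The returned grid
    is normalized to sum to 1.
    """
    lo = points.min(axis=0) - 3 * bandwidth
    hi = points.max(axis=0) + 3 * bandwidth
    xs = np.linspace(lo[0], hi[0], grid_size)
    ys = np.linspace(lo[1], hi[1], grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
    # Squared distance from every grid cell to every candidate.
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2 * bandwidth ** 2))
    density = kernel @ weights
    density /= density.sum()
    return xs, ys, density.reshape(grid_size, grid_size)
```

With unit weights the grid describes where candidates sit in the projected value space; with vote counts as weights it approximates where voters' support concentrates, which is the interpretation the abstract gives to the final grid.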
TL;DR: This paper proposes a representation technique based upon graph theory that provides a new viewpoint to understand the writing process and is aimed at representing the data provided by ScriptLog although the concepts can be applied in other contexts.
Abstract: There are currently several systems to collect online writing data in keystroke logging. Each of these systems provides reliable and very precise data. Unfortunately, due to the large amount of data recorded, it is almost impossible to analyze except for very limited recordings. In this paper, we propose a representation technique based upon graph theory that provides a new viewpoint to understand the writing process. The current application is aimed at representing the data provided by ScriptLog although the concepts can be applied in other contexts.
TL;DR: This work generalizes the assumption that certain neighboring proteins tend to have "collaborative", but not necessarily the same, functions, and proposes a few methods that work under this new assumption.
Abstract: The cellular metabolism of a living organism is among the most complex systems that man is currently trying to understand. Part of it is described by so-called protein-protein interaction (PPI) networks, and much effort is spent on analyzing these networks. In particular, there has been much interest in predicting certain properties of nodes in the network (in this case, proteins) from the other information in the network. In this paper, we are concerned with predicting a protein's functions. Many approaches to this problem exist. Among the approaches that predict a protein's functions purely from its environment in the network, many are based on the assumption that neighboring proteins tend to have the same functions. In this work we generalize this assumption: we assume that certain neighboring proteins tend to have "collaborative", but not necessarily the same, functions. We propose a few methods that work under this new assumption. These methods yield better results than those previously considered, with improvements in F-measure ranging from 3% to 17%. This shows that the commonly made assumption of homophily in the network (or "guilt by association"), while useful, is not necessarily the best one can make. The assumption of collaborativeness is a useful generalization of it; it is operational (one can easily define methods that rely on it) and can lead to better results.
TL;DR: This work proposes a methodology based on learning vector quantization (LVQ) to develop a credit rating model that is applied to a French database of private companies over a period of several years and is capable of creating robust and stable classes to rank companies.
Abstract: Credit rating is involved in many financial applications to estimate the creditworthiness of corporations or individuals. In addition to building accurate credit rating models, the stability of models is of significant importance to economic performance. In this work we propose a methodology based on learning vector quantization (LVQ) to develop a credit rating model. This model is applied to a French database of private companies over a period of several years. LVQ is trained and calibrated in a supervised way using data from 2006 and then applied to the remaining years. We analyze the one-year transition matrix and show that the model is capable of creating robust and stable classes to rank companies.
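For readers unfamiliar with learning vector quantization, the basic LVQ1 update rule is sketched below: prototypes are fitted on one labeled sample and then used unchanged to classify later samples, mirroring the train-on-one-year, apply-to-the-rest setup. The abstract does not state which LVQ variant, how many prototypes, or which learning-rate schedule the authors used, so one prototype per class, the linear decay, and the names train_lvq1 and predict_lvq are assumptions of this sketch.

```python
import numpy as np


def train_lvq1(X, y, n_epochs=30, lr=0.1, seed=0):
    """Fit one prototype per class with the LVQ1 rule.

    X must be a float array of shape (n_samples, n_features).
    The nearest prototype moves toward a sample when their labels
    agree and away from it when they differ.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    # Initialize each prototype at its class mean (an assumption).
    protos = np.array([X[y == c].mean(axis=0) for c in classes])
    proto_labels = classes.copy()
    for epoch in range(n_epochs):
        alpha = lr * (1 - epoch / n_epochs)  # linearly decaying step
        for i in rng.permutation(len(X)):
            j = np.argmin(((protos - X[i]) ** 2).sum(axis=1))
            sign = 1.0 if proto_labels[j] == y[i] else -1.0
            protos[j] += sign * alpha * (X[i] - protos[j])
    return protos, proto_labels


def predict_lvq(protos, proto_labels, X):
    """Assign each row of X the label of its nearest prototype."""
    d2 = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return proto_labels[d2.argmin(axis=1)]
```

Calling predict_lvq on two consecutive years and cross-tabulating the assigned classes per company would yield a one-year transition matrix of the kind the abstract analyzes.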
TL;DR: A first formalization for the detection of potentially interesting, domain-crossing relations in large, heterogeneous information repositories based purely on structural properties of a relational knowledge description is proposed.
Abstract: The discovery of surprising relations in large, heterogeneous information repositories is gaining increasing importance in real world data analysis. If these repositories come from diverse origins, forming different domains, domain bridging associations between otherwise weakly connected domains can provide insights into the data that can otherwise not be accomplished. In this paper, we propose a first formalization for the detection of such potentially interesting, domain-crossing relations based purely on structural properties of a relational knowledge description.