TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
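To illustrate why assessment metrics matter under class skew, the following minimal sketch computes accuracy alongside precision, recall, F1, and G-mean from confusion-matrix counts. The function name, the choice of metrics, and the example counts are assumptions made for illustration; the abstract does not list which metrics the review covers.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute common metrics from confusion counts (positive = minority class)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical data set: 10 positives, 990 negatives.
# A trivial classifier that predicts "negative" for every example:
m = imbalance_metrics(tp=0, fn=10, fp=0, tn=990)
print(m)
```

In this hypothetical case accuracy is high even though no minority example is found, while recall, F1, and G-mean all drop to zero.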
TL;DR: This paper proposes three novel tree structures to efficiently perform incremental and interactive HUP mining that can capture the incremental data without any restructuring operation, and shows that these tree structures are very efficient and scalable.
Abstract: High utility pattern (HUP) mining has recently become one of the most important research issues in data mining due to its ability to consider the nonbinary frequency values of items in transactions and different profit values for every item. On the other hand, incremental and interactive data mining provide the ability to use previous data structures and mining results in order to reduce unnecessary calculations when a database is updated, or when the minimum threshold is changed. In this paper, we propose three novel tree structures to efficiently perform incremental and interactive HUP mining. The first tree structure, Incremental HUP Lexicographic Tree (IHUPL-Tree), is arranged according to an item's lexicographic order. It can capture the incremental data without any restructuring operation. The second tree structure is the IHUP transaction frequency tree (IHUPTF-Tree), which obtains a compact size by arranging items according to their transaction frequency (descending order). To reduce the mining time, the third tree, IHUP-transaction-weighted utilization tree (IHUPTWU-Tree), is designed based on the TWU value of items in descending order. Extensive performance analyses show that our tree structures are very efficient and scalable for incremental and interactive HUP mining.
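As a rough illustration of the TWU value used to order items, the sketch below assumes the usual definitions: the utility of a transaction is the sum of quantity times unit profit over its items, and the TWU of an item is the sum of the utilities of all transactions containing it. The data, the function name, and the tie-breaking rule are hypothetical, and the tree construction itself is not shown.

```python
from collections import defaultdict


def transaction_weighted_utilization(transactions, profit):
    """Return {item: TWU}, where TWU sums the utility of every transaction holding the item."""
    twu = defaultdict(int)
    for tx in transactions:
        # Transaction utility: quantity * unit profit, summed over the items.
        tu = sum(qty * profit[item] for item, qty in tx.items())
        for item in tx:
            twu[item] += tu
    return dict(twu)


# Hypothetical unit profits and transactions ({item: purchased quantity}).
profit = {"a": 2, "b": 10, "c": 1}
transactions = [{"a": 1, "b": 2}, {"a": 3, "c": 5}, {"b": 1}]

twu = transaction_weighted_utilization(transactions, profit)
# Descending TWU order (ties broken lexicographically, an assumption).
twu_order = sorted(twu, key=lambda item: (-twu[item], item))
print(twu, twu_order)
```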
TL;DR: This work provides a review of significant contributions in the literature on multiway models, algorithms as well as their applications in diverse disciplines including chemometrics, neuroscience, social network analysis, text mining and computer vision.
Abstract: Two-way arrays or matrices are often not enough to represent all the information in the data and standard two-way analysis techniques commonly applied on matrices may fail to find the underlying structures in multi-modal datasets. Multiway data analysis has recently become popular as an exploratory analysis tool in discovering the structures in higher-order datasets, where data have more than two modes. We provide a review of significant contributions in the literature on multiway models, algorithms as well as their applications in diverse disciplines including chemometrics, neuroscience, social network analysis, text mining and computer vision.
TL;DR: A new dimensionality reduction algorithm is developed, termed discriminative locality alignment (DLA), by imposing discriminative information in the part optimization stage, and thorough empirical studies demonstrate the effectiveness of DLA compared with representative dimensionality reduction algorithms.
Abstract: Spectral analysis-based dimensionality reduction algorithms are important and have been popularly applied in data mining and computer vision applications. To date many algorithms have been developed, e.g., principal component analysis, locally linear embedding, Laplacian eigenmaps, and local tangent space alignment. All of these algorithms have been designed intuitively and pragmatically, i.e., on the basis of the experience and knowledge of experts for their own purposes. Therefore, it will be more informative to provide a systematic framework for understanding the common properties and intrinsic difference in different algorithms. In this paper, we propose such a framework, named "patch alignment," which consists of two stages: part optimization and whole alignment. The framework reveals that (1) algorithms are intrinsically different in the patch optimization stage and (2) all algorithms share an almost identical whole alignment stage. As an application of this framework, we develop a new dimensionality reduction algorithm, termed discriminative locality alignment (DLA), by imposing discriminative information in the part optimization stage. DLA can (1) attack the distribution nonlinearity of measurements; (2) preserve the discriminative ability; and (3) avoid the small-sample-size problem. Thorough empirical studies demonstrate the effectiveness of DLA compared with representative dimensionality reduction algorithms.
TL;DR: Granular structure of concept lattices with application in knowledge reduction in formal concept analysis is examined in this paper and knowledge hidden in such a context is unraveled in the form of compact implication rules.
Abstract: Granular computing and knowledge reduction are two basic issues in knowledge representation and data mining. Granular structure of concept lattices with application in knowledge reduction in formal concept analysis is examined in this paper. Information granules and their properties in a formal context are first discussed. Concepts of a granular consistent set and a granular reduct in the formal context are then introduced. Discernibility matrices and Boolean functions are, respectively, employed to determine granular consistent sets and calculate granular reducts in formal contexts. Methods of knowledge reduction in a consistent formal decision context are also explored. Finally, knowledge hidden in such a context is unraveled in the form of compact implication rules.
TL;DR: This paper summarizes the existing improved algorithms and proposes a novel Bayes model: hidden naive Bayes (HNB), which significantly outperforms NB, SBC, NBTree, TAN, and AODE in terms of CLL and AUC.
Abstract: Because learning an optimal Bayesian network classifier is an NP-hard problem, learning-improved naive Bayes has attracted much attention from researchers. In this paper, we summarize the existing improved algorithms and propose a novel Bayes model: hidden naive Bayes (HNB). In HNB, a hidden parent is created for each attribute which combines the influences from all other attributes. We experimentally test HNB in terms of classification accuracy, using the 36 UCI data sets selected by Weka, and compare it to naive Bayes (NB), selective Bayesian classifiers (SBC), naive Bayes tree (NBTree), tree-augmented naive Bayes (TAN), and averaged one-dependence estimators (AODE). The experimental results show that HNB significantly outperforms NB, SBC, NBTree, TAN, and AODE. In many data mining applications, an accurate class probability estimation and ranking are also desirable. We study the class probability estimation and ranking performance, measured by conditional log likelihood (CLL) and the area under the ROC curve (AUC), respectively, of naive Bayes and its improved models, such as SBC, NBTree, TAN, and AODE, and then compare HNB to them in terms of CLL and AUC. Our experiments show that HNB also significantly outperforms all of them.
TL;DR: The goal is to enable the groups and their facilitators to see relevant aspects of the group's operation and provide feedback if these are more likely to be associated with positive or negative outcomes and indicate where the problems are.
Abstract: Group work is widespread in education. The growing use of online tools supporting group work generates huge amounts of data. We aim to exploit this data to support mirroring: presenting useful high-level views of information about the group, together with desired patterns characterizing the behavior of strong groups. The goal is to enable the groups and their facilitators to see relevant aspects of the group's operation and provide feedback if these are more likely to be associated with positive or negative outcomes and indicate where the problems are. We explore how useful mirror information can be extracted via a theory-driven approach and a range of clustering and sequential pattern mining. The context is a senior software development project where students use the collaboration tool TRAC. We extract patterns distinguishing the better from the weaker groups and get insights in the success factors. The results point to the importance of leadership and group interaction, and give promising indications if they are occurring. Patterns indicating good individual practices were also identified. We found that some key measures can be mined from early data. The results are promising for advising groups at the start and early identification of effective and poor practices, in time for remediation.
TL;DR: A new active learning-based approach (ALBA) to extract comprehensible rules from opaque SVM models by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary.
Abstract: Support vector machines (SVMs) are currently state-of-the-art for the classification task and, generally speaking, exhibit good predictive performance due to their ability to model nonlinearities. However, their strength is also their main weakness, as the generated nonlinear models are typically regarded as incomprehensible black-box models. In this paper, we propose a new active learning-based approach (ALBA) to extract comprehensible rules from opaque SVM models. Through rule extraction, some insight is provided into the logic of the SVM model. ALBA extracts rules from the trained SVM model by explicitly making use of key concepts of the SVM: the support vectors, and the observation that these are typically close to the decision boundary. Active learning implies the focus on apparent problem areas, which for rule induction techniques are the regions close to the SVM decision boundary where most of the noise is found. By generating extra data close to these support vectors that are provided with a class label by the trained SVM model, rule induction techniques are better able to discover suitable discrimination rules. This performance increase, in terms of both predictive accuracy and comprehensibility, is confirmed in our experiments where we apply ALBA on several publicly available data sets.
TL;DR: Comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.
Abstract: In this paper, a new density-based clustering framework is proposed by adopting the assumption that the cluster centers in data space can be regarded as target objects in image space. First, the level set evolution is adopted to find an approximation of cluster centers by using a new initial boundary formation scheme. Accordingly, three types of initial boundaries are defined so that each of them can evolve to approach the cluster centers in different ways. To avoid the long iteration time of level set evolution in data space, an efficient termination criterion is presented to stop the evolution process in the circumstance that no more cluster centers can be found. Then, a new effective density representation called level set density (LSD) is constructed from the evolution results. Finally, the valley seeking clustering is used to group data points into corresponding clusters based on the LSD. The experiments on some synthetic and real data sets have demonstrated the efficiency and effectiveness of the proposed clustering framework. The comparisons with DBSCAN method, OPTICS method, and valley seeking clustering method further show that the proposed framework can successfully avoid the overfitting phenomenon and solve the confusion problem of cluster boundary points and outliers.
TL;DR: This paper proposes a relation-based page rank algorithm to be used in conjunction with semantic Web search engines that simply relies on information that could be extracted from user queries and on annotated resources.
Abstract: With the tremendous growth of information available to end users through the Web, search engines play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common for result sets to be burdened with useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may make it possible to overcome this limitation. Several search engines have been proposed that increase information retrieval accuracy by exploiting a key content of semantic Web resources, that is, relations. However, in order to rank results, most of the existing solutions need to work on the whole annotated knowledge base. In this paper, we propose a relation-based page rank algorithm to be used in conjunction with semantic Web search engines that simply relies on information that could be extracted from user queries and on annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition.
TL;DR: This paper extends the definitions of k-anonymity to multiple relations and shows that previously proposed methodologies either fail to protect privacy or overly reduce the utility of the data in a multiple relation setting.
Abstract: k-anonymity protects privacy by ensuring that data cannot be linked to a single individual. In a k-anonymous data set, any identifying information occurs in at least k tuples. Much research has been done to modify a single-table data set to satisfy anonymity constraints. This paper extends the definitions of k-anonymity to multiple relations and shows that previously proposed methodologies either fail to protect privacy or overly reduce the utility of the data in a multiple relation setting. We also propose two new clustering algorithms to achieve multirelational anonymity. Experiments show the effectiveness of the approach in terms of utility and efficiency.
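The single-table property described above (any identifying combination occurs in at least k tuples) can be checked with a short sketch. This covers only the single-relation case, not the multirelational extension the paper proposes; the column names and rows are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical, already-generalized table.
rows = [
    {"zip": "130**", "age": "20-29", "disease": "flu"},
    {"zip": "130**", "age": "20-29", "disease": "cold"},
    {"zip": "148**", "age": "30-39", "disease": "flu"},
    {"zip": "148**", "age": "30-39", "disease": "asthma"},
]

two_anonymous = is_k_anonymous(rows, ["zip", "age"], k=2)
three_anonymous = is_k_anonymous(rows, ["zip", "age"], k=3)
print(two_anonymous, three_anonymous)
```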
TL;DR: This paper presents a fast minimum spanning tree-inspired clustering algorithm, which, by using an efficient implementation of the cut and the cycle property of the minimum spanning trees, can have much better performance than O(N^2).
Abstract: Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have been widely used in practice. However, in such clustering algorithms, the search for nearest neighbor in the construction of minimum spanning trees is the main source of computation and the standard solutions take O(N^2) time. In this paper, we present a fast minimum spanning tree-inspired clustering algorithm, which, by using an efficient implementation of the cut and the cycle property of the minimum spanning trees, can have much better performance than O(N^2).
TL;DR: An optimization framework based on reconstruction error analysis, which can yield a global optimum for nonlinear dimensionality reduction (NLDR), is developed and extended to embed out-of-sample data via spline interpolation.
Abstract: This paper presents a new algorithm for nonlinear dimensionality reduction (NLDR). Our algorithm is developed under the conceptual framework of compatible mapping. Each such mapping is a compound of a tangent space projection and a group of splines. Tangent space projection is estimated at each data point on the manifold, through which the data point itself and its neighbors are represented in tangent space with local coordinates. Splines are then constructed to guarantee that each of the local coordinates can be mapped to its own single global coordinate with respect to the underlying manifold. Thus, the compatibility between local alignments is ensured. In this setting, we develop an optimization framework based on reconstruction error analysis, which can yield a global optimum. The proposed algorithm is also extended to embed out-of-sample data via spline interpolation. Experiments on toy data sets and real-world data sets illustrate the validity of our method.
TL;DR: Investigation of the effect of other types of values, which express the distribution of a word in the document, shows that the distributional features are useful for text categorization, especially when documents are long and the writing style is casual.
Abstract: Text categorization is the task of assigning predefined categories to natural language text. With the widely used 'bag of words' representation, previous research usually assigns a word values that indicate whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they have not fully expressed the abundant information contained in the document. This paper explores the effect of other types of values, which express the distribution of a word in the document. These novel values assigned to a word are called distributional features, which include the compactness of the appearances of the word and the position of the first appearance of the word. The proposed distributional features are exploited by a tf-idf-style equation and different features are combined using ensemble learning techniques. Experiments show that the distributional features are useful for text categorization. In contrast to using the traditional term frequency values solely, including the distributional features requires only a little additional cost, while the categorization performance can be significantly improved. Further analysis shows that the distributional features are especially useful when documents are long and the writing style is casual.
TL;DR: Three information-theoretic measures for capturing the amount of information that is lost during the anonymization process are proposed; the problem of achieving k-anonymity with minimal loss of information is proved to be NP-hard, and polynomial approximations for the optimal solution are studied.
Abstract: The technique of k-anonymization allows the releasing of databases that contain personal information while ensuring some degree of individual privacy. Anonymization is usually performed by generalizing database entries. We formally study the concept of generalization, and propose three information-theoretic measures for capturing the amount of information that is lost during the anonymization process. The proposed measures are more general and more accurate than those that were proposed by Meyerson and Williams and Aggarwal et al. We study the problem of achieving k-anonymity with minimal loss of information. We prove that it is NP-hard and study polynomial approximations for the optimal solution. Our first algorithm gives an approximation guarantee of O(ln k) for two of our measures as well as for the previously studied measures. This improves the best-known O(k)-approximation. While the previous approximation algorithms relied on the graph representation framework, our algorithm relies on a novel hypergraph representation that enables the improvement in the approximation ratio from O(k) to O(ln k). As the running time of the algorithm is O(n^{2k}), we also show how to adapt the previous algorithm in order to obtain an O(k)-approximation algorithm that is polynomial in both n and k.
TL;DR: A new method called DBE (dark block extraction) for automatically estimating the number of clusters in unlabeled data sets, which is based on an existing algorithm for visual assessment of cluster tendency (VAT) of a data set, using several common image and signal processing techniques.
Abstract: Clustering is a popular tool for exploratory data analysis. One of the major problems in cluster analysis is the determination of the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper we investigate a new method called DBE (dark block extraction) for automatically estimating the number of clusters in unlabeled data sets, which is based on an existing algorithm for visual assessment of cluster tendency (VAT) of a data set, using several common image and signal processing techniques. Basic steps include: 1) generating a VAT image of an input dissimilarity matrix; 2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering; 3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal; 4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters. Our new DBE method is nearly "automatic", depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.
TL;DR: A prototype application called Open Smart Classroom is built on a software infrastructure based on the multiagent system architecture using Web Service technology in Smart Space, and an experiment connecting two classrooms deployed in different countries demonstrates the influence of these new features on the educational effect.
Abstract: Real-time interactive virtual classroom with teleeducation experience is an important approach in distance learning. However, most current systems fail to meet new challenges in extensibility and scalability, which mainly lie with three issues. First, an open system architecture is required to better support the integration of increasing human-computer interfaces and personal mobile devices in the classroom. Second, the learning system should facilitate opening its interfaces, which will help easy deployment that copes with different circumstances and allows other learning systems to talk to each other. Third, problems emerge on binding existing systems of classrooms together in different places or even different countries such as tackling systems intercommunication and distant intercultural learning in different languages. To address these issues, we build a prototype application called Open Smart Classroom built on our software infrastructure based on the multiagent system architecture using Web Service technology in Smart Space. Besides the evaluation of the extensibility and scalability of the system, an experiment connecting two Open Smart Classrooms deployed in different countries is also undertaken, which demonstrates the influence of these new features on the educational effect. Interesting and optimistic results obtained show a significant research prospect for developing future distant learning systems.
TL;DR: An analysis of what software engineering ontology is, what it consists of, and what it is used for in the form of usage example scenarios is given.
Abstract: This paper aims to present an ontology model of software engineering to represent its knowledge. The fundamental knowledge relating to software engineering is well described in the textbook entitled Software Engineering by Sommerville that is now in its eighth edition (2004) and the white paper, Software Engineering Body of Knowledge (SWEBOK), by the IEEE (203) upon which software engineering ontology is based. This paper gives an analysis of what software engineering ontology is, what it consists of, and what it is used for in the form of usage example scenarios. The usage scenarios presented in this paper highlight the characteristics of the software engineering ontology. The software engineering ontology assists in defining information for the exchange of semantic project information and is used as a communication framework. Its users are software engineers sharing domain knowledge as well as instance knowledge of software engineering.
TL;DR: ANGEL is a new anonymization technique that is as effective as generalization in privacy protection but retains significantly more information in the microdata, and it is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.).
Abstract: Generalization is a well-known method for privacy preserving data publication. Despite its vast popularity, it has several drawbacks such as heavy information loss, difficulty of supporting marginal publication, and so on. To overcome these drawbacks, we develop ANGEL, a new anonymization technique that is as effective as generalization in privacy protection, but is able to retain significantly more information in the microdata. ANGEL is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.), with its superiority (in correlation preservation) especially obvious when tight privacy control must be enforced. We show that ANGEL lends itself elegantly to the hard problem of marginal publication. In particular, unlike generalization that can release only restricted marginals, our technique can be easily used to publish any marginals with strong privacy guarantees.
TL;DR: A new rule-based framework to identify and address issues of sharing in virtual university environments through role-based access control (RBAC) management is built and compared with other related work.
Abstract: A global education system, as a key area in future IT, has fostered developers to provide various learning systems with low cost. While a variety of e-learning advantages has been recognized for a long time and many advances in e-learning systems have been implemented, the needs for effective information sharing in a secure manner have to date been largely ignored, especially for virtual university collaborative environments. Information sharing of virtual universities usually occurs in broad, highly dynamic network-based environments, and formally accessing the resources in a secure manner poses a difficult and vital challenge. This paper aims to build a new rule-based framework to identify and address issues of sharing in virtual university environments through role-based access control (RBAC) management. The framework includes a role-based group delegation granting model, group delegation revocation model, authorization granting, and authorization revocation. We analyze various revocations and the impact of revocations on role hierarchies. The implementation with XML-based tools demonstrates the feasibility of the framework and authorization methods. Finally, the current proposal is compared with other related work.
TL;DR: This paper introduces algorithms that, as parametric plans are populated, are able to frequently bypass the optimizer but still execute optimal or near-optimal plans.
Abstract: Commercial applications usually rely on pre-compiled parameterized procedures to interact with a database. Unfortunately, executing a procedure with a set of parameters different from those used at compilation time may be arbitrarily sub-optimal. Parametric query optimization (PQO) attempts to solve this problem by exhaustively determining the optimal plans at each point of the parameter space at compile time. However, PQO is likely not cost-effective if the query is executed infrequently or if it is executed with values only within a subset of the parameter space. In this paper we propose instead to progressively explore the parameter space and build a parametric plan during several executions of the same query. We introduce algorithms that, as parametric plans are populated, are able to frequently bypass the optimizer but still execute optimal or near-optimal plans.
TL;DR: This paper considers a feature selection method for multimodally distributed data, and presents a large margin feature weighting method for k-nearest neighbor (kNN) classifiers, which aims at separating different classes by large local margins and pulling closer together points from the same class.
Abstract: The problem of feature selection is a difficult combinatorial task in machine learning and of high practical relevance. In this paper, we consider a feature selection method for multimodally distributed data, and present a large margin feature weighting method for k-nearest neighbor (kNN) classifiers. The method learns the feature weighting factors by minimizing a cost function, which aims at separating different classes by large local margins and pulling closer together points from the same class, while using as few features as possible. The resulting optimization problem can be efficiently solved by linear programming. Finally, the proposed approach is assessed through a series of experiments with UCI and microarray data sets, as well as a more specific and challenging task, namely, radar high-resolution range profiles (HRRP) automatic target recognition (ATR). The experimental results demonstrate the effectiveness of the proposed algorithms.
TL;DR: It is revealed that personalized Web search does not work equally well under various situations and represents a significant improvement over generic Web search for some queries, while it has little effect and even harms query performance under some situations.
Abstract: Although personalized search has been under way for many years and many personalization algorithms have been investigated, it is still unclear whether personalization is consistently effective on different queries for different users and under different search contexts. In this paper, we study this problem and provide some findings. We present a large-scale evaluation framework for personalized search based on query logs and then evaluate five personalized search algorithms (including two click-based ones and three topical-interest-based ones) using 12-day query logs of Windows Live Search. By analyzing the results, we reveal that personalized Web search does not work equally well under various situations. It represents a significant improvement over generic Web search for some queries, while it has little effect and even harms query performance under some situations. We propose click entropy as a simple measurement on whether a query should be personalized. We further propose several features to automatically predict when a query will benefit from a specific personalization algorithm. Experimental results show that using a personalization algorithm for queries selected by our prediction model is better than using it simply for all queries.
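A minimal sketch of click entropy follows, assuming it is the Shannon entropy of the distribution of pages clicked for a given query; the abstract names the measure but does not give its formula, so this reading, the function name, and the toy click logs are all assumptions.

```python
import math
from collections import Counter


def click_entropy(clicked_pages):
    """Shannon entropy (in bits) of the clicked-page distribution for one query."""
    counts = Counter(clicked_pages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical click logs: every user clicks the same page for one query,
# while clicks are spread evenly over four pages for another.
navigational = click_entropy(["site-home"] * 4)
ambiguous = click_entropy(["page-1", "page-2", "page-3", "page-4"])
print(navigational, ambiguous)
```

Under this reading, a query whose clicks concentrate on one page has entropy near zero and offers little room for personalization, while a query with widely spread clicks has high entropy.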
TL;DR: This work proposes a robust partitional distance-based projected clustering algorithm capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space.
Abstract: Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the k-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
TL;DR: Armada is the first delay-bounded general range query scheme on constant-degree DHTs, and can return the results for any range query within 2logN hops in a P2P system with N peers.
Abstract: With the increasing popularity of the peer-to-peer (P2P) computing paradigm, many general range query schemes for distributed hash table (DHT)-based P2P systems have been proposed. Although those schemes can provide range query capability without modifying the underlying DHTs, their query delay depends on both the scale of the system and the size of the query space or the specific query, and thus they cannot guarantee that query results are returned within a bounded delay. In this paper, we propose Armada, an efficient range query processing scheme to support delay-bounded single-attribute and multiple-attribute range queries. It is the first delay-bounded general range query scheme on constant-degree DHTs, and can return the results for any range query within 2logN hops in a P2P system with N peers. Results of analysis and simulations show that the average delay in Armada is less than logN, and the average message cost of single-attribute range queries is about logN+2n-2 (n is the number of peers that intersect with the query). These results are very close to the lower bounds on delay and message cost of range queries over constant-degree DHTs.
TL;DR: This paper proposes a feedback-based distributed skyline (FDS) algorithm that supports arbitrary horizontal partitioning and aims to minimize network bandwidth, measured in the number of tuples transmitted over the network.
Abstract: We consider skyline computation when the underlying data set is horizontally partitioned onto geographically distant servers that are connected to the Internet. The existing solutions are not suitable for our problem, because they have at least one of the following drawbacks: (1) applicable only to distributed systems adopting vertical partitioning or restricted horizontal partitioning, (2) effective only when each server has limited computing and communication abilities, and (3) optimized only for skyline search in subspaces but inefficient in the full space. This paper proposes an algorithm, called feedback-based distributed skyline (FDS), to support arbitrary horizontal partitioning. FDS aims at minimizing the network bandwidth, measured in the number of tuples transmitted over the network. The core of FDS is a novel feedback-driven mechanism, where the coordinator iteratively transmits certain feedback to each participant. Participants can leverage such information to prune a large amount of local data, which otherwise would need to be sent to the coordinator. Extensive experimentation confirms that FDS significantly outperforms alternative approaches in both effectiveness and progressiveness.
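A much-simplified sketch of feedback-driven pruning is given below. It is not the FDS algorithm itself: the abstract only says the coordinator sends "certain feedback", so using a set of skyline tuples as feedback is an assumption, as are the convention that smaller values are better, the two-server setup, and all function names.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (smaller values are assumed to be better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(tuples):
    """Naive O(n^2) skyline: keep tuples not dominated by any other tuple."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]

def prune_with_feedback(local_tuples, feedback):
    """Participant side: drop local tuples dominated by a feedback tuple,
    so they never have to be sent to the coordinator."""
    return [t for t in local_tuples
            if not any(dominates(f, t) for f in feedback)]

# Two hypothetical servers holding horizontal partitions of (price, distance).
server_a = [(1, 9), (3, 3), (6, 6)]
server_b = [(2, 8), (4, 4), (9, 1), (7, 7)]

# Step 1: the coordinator uses server A's local skyline as feedback.
feedback = skyline(server_a)
# Step 2: server B prunes its local skyline with the feedback before sending.
to_send = prune_with_feedback(skyline(server_b), feedback)
# Step 3: the coordinator merges what it received.
global_skyline = skyline(feedback + to_send)
print(to_send, global_skyline)
```

In this toy run, server B transmits two tuples instead of its three local skyline tuples, and the merged result matches the skyline computed over the union of both partitions.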
TL;DR: A probabilistic approach assigns relevance weights to discrete features, which are treated as random variables modeled by finite discrete mixtures; the approach is also applied successfully to text clustering.
Abstract: In this paper, we consider the problem of unsupervised discrete feature selection/weighting. Indeed, discrete data are an important component in many data mining, machine learning, image processing, and computer vision applications. However, much of the published work on unsupervised feature selection has concentrated on continuous data. We propose a probabilistic approach that assigns relevance weights to discrete features that are considered as random variables modeled by finite discrete mixtures. The choice of finite mixture models is justified by their flexibility, which has led to their widespread application in different domains. For the learning of the model, we consider both Bayesian and information-theoretic approaches through stochastic complexity. Experimental results are presented to illustrate the feasibility and merits of our approach on a difficult problem: clustering and recognizing visual concepts in different image data. The proposed approach is also successfully applied to text clustering.
TL;DR: Wang et al. propose Clustering with Local and Global Regularization (CLGR), a method that aims to minimize a cost function that properly trades off the local and global costs.
Abstract: Clustering is an old research topic in data mining and machine learning. Most of the traditional clustering methods can be categorized as local or global ones. In this paper, a novel clustering method that can explore both the local and global information in the data set is proposed. The method, Clustering with Local and Global Regularization (CLGR), aims to minimize a cost function that properly trades off the local and global costs. We show that such an optimization problem can be solved by the eigenvalue decomposition of a sparse symmetric matrix, which can be done efficiently using iterative methods. Finally, the experimental results on several data sets are presented to show the effectiveness of our method.
TL;DR: This work proposes a new index structure and query processing technique to improve retrieval effectiveness and efficiency and considers strategies to minimize the effects of users' inaccurate relevance feedback.
Abstract: Target search in content-based image retrieval (CBIR) systems refers to finding a specific (target) image such as a particular registered logo or a specific historical photograph. Existing techniques, designed around query refinement based on relevance feedback, suffer from slow convergence and do not guarantee that intended targets are found. To address these limitations, we propose several efficient query point movement methods. We prove that our approach is able to reach any given target image with fewer iterations in the worst and average cases. We propose a new index structure and query processing technique to improve retrieval effectiveness and efficiency. We also consider strategies to minimize the effects of users' inaccurate relevance feedback. Extensive experiments in simulated and realistic environments show that our approach significantly reduces the number of required iterations and improves overall retrieval performance. The experimental results also confirm that our approach can always retrieve intended targets even with poor selection of initial query points.
TL;DR: This paper designs a tunable clustering algorithm for establishing form structure based on multiple "similar" queries, which includes a mechanism for extending forms to support future "similar" queries.
Abstract: One of the simplest ways to query a database is through a form where a user can fill in relevant information and obtain desired results by submitting the form. Designing good forms is a nontrivial manual task, and the designer needs a sound understanding of both the data organization and the querying needs. Furthermore, form design usually has conflicting goals: each form should be simple and easy to understand, while collectively, the interface must support as many queries as possible. In this paper, we present a framework for generating forms in an automatic and principled way, given a database and a sample query workload. We design a tunable clustering algorithm for establishing form structure based on multiple "similar" queries, which includes a mechanism for extending forms to support future "similar" queries. The algorithm is adaptive and can incrementally adjust forms to reflect the most current querying trends. We have implemented our form generation system on a real database and evaluated it on a comprehensive set of query loads and database schemas. We observe that our system generates a modest number of forms for large and diverse query loads even after placing a strict bound on form complexity.