Top 81 papers presented at Data and Knowledge Engineering in 2009

Journal Article•10.1016/J.DATAK.2009.08.005•

Knowledge discovery from imbalanced and noisy data

Jason Van Hulse¹, Taghi M. Khoshgoftaar¹•Institutions (1)

1 Dec 2009

TL;DR: Noise significantly impacts all of the learners considered in this work, and the class in which the noise is located is a particularly important factor. Simple sampling techniques such as random undersampling are generally the most effective.

Abstract: Class imbalance and labeling errors present significant challenges to data mining and knowledge discovery applications. Some previous work has discussed these important topics; however, the relationship between these two issues has not received enough attention. Further, much of the previous work in this domain is fragmented and contradictory, leading to serious questions regarding the reliability and validity of the empirical conclusions. In response to these issues, we present a comprehensive suite of experiments carefully designed to provide conclusive, reliable, and significant results on the problem of learning from noisy and imbalanced data. Noise is shown to significantly impact all of the learners considered in this work, and a particularly important factor is the class in which the noise is located (which, as discussed throughout this work, has very important implications for noise handling). The impacts of noise, however, vary dramatically depending on the learning algorithm, and simple algorithms such as naive Bayes and nearest neighbor learners are often more robust than more complex learners such as support vector machines or random forests. Sampling techniques, which are often used to alleviate the adverse impacts of imbalanced data, are shown to improve the performance of learners built from noisy and imbalanced data. In particular, simple sampling techniques such as random undersampling are generally the most effective.

227 citations
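
As an illustration of the random undersampling mentioned in the abstract above, here is a minimal sketch. It is not the authors' experimental code: the function name `random_undersample`, the fixed seed, and the choice to shrink every class to the size of the smallest one (a 1:1 balance) are assumptions made for this example.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Randomly drop examples from larger classes until every class
    has as many examples as the smallest class (assumed 1:1 target)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    sampled_rows, sampled_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement; the smallest class is kept whole.
        for row in rng.sample(members, target):
            sampled_rows.append(row)
            sampled_labels.append(label)
    return sampled_rows, sampled_labels

# Toy data: 90 majority examples and 10 minority examples.
rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```

Every minority example survives, while most majority examples are discarded; that loss of data is the price of the technique's simplicity.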

Journal Article•10.1016/J.DATAK.2009.02.002•

Reusing ontologies on the Semantic Web: A feasibility study

Elena Simperl¹•Institutions (1)

University of Innsbruck¹

1 Oct 2009

TL;DR: The need for a context- and task-sensitive treatment of ontologies is emphasized, both from an engineering and a usage perspective, and the typical phases of reuse processes which could profit considerably from such an approach are identified.

Abstract: Technologies for the efficient and effective reuse of ontological knowledge are one of the key success factors for the Semantic Web. Putting aside matters of cost or quality, being reusable is an intrinsic property of ontologies, originally conceived of as a means to enable and enhance the interoperability between computing applications. This article gives an account, based on empirical evidence and real-world findings, of the methodologies, methods and tools currently used to perform ontology-reuse processes. We study the most prominent case studies on ontology reuse, published in the knowledge-/ontology-engineering literature from the early nineties. This overview is complemented by two self-conducted case studies in the areas of eHealth and eRecruitment in which we developed Semantic Web ontologies for different scopes and purposes by resorting to existing ontological knowledge on the Web. Based on the analysis of the case studies, we are able to identify a series of research and development challenges which should be addressed to ensure reuse becomes a feasible alternative to other ontology-engineering strategies such as development from scratch. In particular, we emphasize the need for a context- and task-sensitive treatment of ontologies, both from an engineering and a usage perspective, and identify the typical phases of reuse processes which could profit considerably from such an approach. Further on, we argue for the need for ontology-reuse methodologies which optimally exploit human and computational intelligence to effectively operationalize reuse processes.

224 citations

Journal Article•10.1016/J.DATAK.2009.04.001•

Semantics preserving SPARQL-to-SQL translation

Artem Chebotko¹, Shiyong Lu², Farshad Fotouhi²•Institutions (2)

University of Texas–Pan American¹, Wayne State University²

1 Oct 2009

TL;DR: The experimental study showed that the proposed generic translation can serve as a good alternative to existing schema dependent translations in terms of efficient query evaluation and/or ensured query result correctness.

Abstract: Most existing RDF stores, which serve as metadata repositories on the Semantic Web, use an RDBMS as a backend to manage RDF data. This motivates us to study the problem of translating SPARQL queries into equivalent SQL queries, which further can be optimized and evaluated by the relational query engine and their results can be returned as SPARQL query solutions. The main contributions of our research are: (i) We formalize a relational algebra based semantics of SPARQL, which bridges the gap between SPARQL and SQL query languages, and prove that our semantics is equivalent to the mapping-based semantics of SPARQL; (ii) Based on this semantics, we propose the first provably semantics preserving SPARQL-to-SQL translation for SPARQL triple patterns, basic graph patterns, optional graph patterns, alternative graph patterns, and value constraints; (iii) Our translation algorithm is generic and can be directly applied to existing RDBMS-based RDF stores; and (iv) We outline a number of simplifications for the SPARQL-to-SQL translation to generate simpler and more efficient SQL queries and extend our defined semantics and translation to support the bag semantics of a SPARQL query solution. The experimental study showed that our proposed generic translation can serve as a good alternative to existing schema dependent translations in terms of efficient query evaluation and/or ensured query result correctness.

172 citations
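
To give a feel for what translating a SPARQL basic graph pattern into SQL involves, here is a minimal sketch. It is not the paper's translation algorithm and carries none of its correctness proofs. It assumes a single hypothetical table `triples(s, p, o)`, represents triple patterns as Python tuples whose variables start with `?`, requires at least one variable, and uses `SELECT DISTINCT` (set semantics). Optional and alternative graph patterns and value constraints are out of scope.

```python
import sqlite3

def bgp_to_sql(patterns):
    """Translate (s, p, o) triple patterns into one SQL query that
    self-joins the assumed triples(s, p, o) table once per pattern."""
    columns = ("s", "p", "o")
    first_seen = {}            # variable -> column that first binds it
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for col, term in zip(columns, pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: becomes a join condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # Constant term: becomes a parameterized filter.
                conditions.append(f"{ref} = ?")
                params.append(term)
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    where = " AND ".join(conditions) if conditions else "1 = 1"
    return f"SELECT DISTINCT {select} FROM {tables} WHERE {where}", params

# Toy in-memory store with made-up identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:acme", "ex:name", "Acme"),
])

# Roughly: SELECT ?x ?name WHERE { ?x rdf:type ex:Person . ?x ex:name ?name }
sql, params = bgp_to_sql([("?x", "rdf:type", "ex:Person"),
                          ("?x", "ex:name", "?name")])
rows = sorted(conn.execute(sql, params).fetchall())
print(sql)
print(rows)
```

The shared variable `?x` turns into an equality between two aliases of the same table, which is the basic reason a relational engine can evaluate such patterns as joins.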

Journal Article•10.1016/J.DATAK.2009.04.002•

Improving the performance of focused web crawlers

Sotiris Batsakis¹, Euripides G. M. Petrakis¹, Evangelos E. Milios²•Institutions (2)

Technical University of Crete¹, Dalhousie University²

1 Oct 2009

TL;DR: This work addresses issues related to the design and implementation of focused crawlers and proposes several variants of state-of-the-art crawlers capable of learning not only the content of relevant pages but also paths leading to relevant pages.

Abstract: This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.

147 citations

Journal Article•10.1016/J.DATAK.2009.01.004•

Explaining instance classifications with interactions of subsets of feature values

Erik Štrumbelj¹, Igor Kononenko¹, M. Robnik Šikonja¹•Institutions (1)

University of Ljubljana¹

1 Oct 2009

TL;DR: A novel method for explaining the decisions of an arbitrary classifier, independent of the type of classifier. It works at the instance level, decomposing the model's prediction for an instance into the contributions of the attributes' values.

Abstract: In this paper, we present a novel method for explaining the decisions of an arbitrary classifier, independent of the type of classifier. The method works at the instance level, decomposing the model's prediction for an instance into the contributions of the attributes' values. We use several artificial data sets and several different types of models to show that the generated explanations reflect the decision-making properties of the explained model and approach the concepts behind the data set as the prediction quality of the model increases. The usefulness of the method is justified by a successful application on a real-world breast cancer recurrence prediction problem.

141 citations

Journal Article•10.1016/J.DATAK.2009.02.009•

On managing business processes variants

Ruopeng Lu, Shazia Sadiq¹, Guido Governatori²•Institutions (2)

University of Queensland¹, NICTA²

1 Jul 2009

TL;DR: An approach for managing business processes that is conducive to dynamic change and the need for flexibility in execution is presented, based on the notion of process constraints, which provides a technique for effective utilization of the adaptations manifested in process variants.

Abstract: Variance in business process execution can be the result of several situations, such as disconnection between documented models and business operations, workarounds in spite of process execution engines, dynamic change and exception handling, flexible and ad-hoc requirements, and collaborative and/or knowledge intensive work. It is imperative that effective support for managing process variances be extended to organizations mature in their BPM (business process management) uptake so that they can ensure organization wide consistency, promote reuse and capitalize on their BPM investments. This paper presents an approach for managing business processes that is conducive to dynamic change and the need for flexibility in execution. The approach is based on the notion of process constraints. It further provides a technique for effective utilization of the adaptations manifested in process variants. In particular, we will present a facility for discovery of preferred variants through effective search and retrieval based on the notion of process similarity, where multiple aspects of the process variants are compared according to specific query requirements. The advantage of this approach is the ability to provide a quantitative measure for the similarity between process variants, which further facilitates various BPM activities such as process reuse, analysis and discovery.

139 citations

Journal Article•10.1016/J.DATAK.2009.07.010•

A survey on summarizability issues in multidimensional modeling

Jose-Norberto Mazón¹, Jens Lechtenbörger², Juan Trujillo¹•Institutions (2)

University of Alicante¹, University of Münster²

1 Dec 2009

TL;DR: This survey sheds light on the weak and strong points of current approaches for modeling complex multidimensional structures that reflect real-world situations in a conceptual multidimensional model.

Abstract: The development of a data warehouse (DW) system is based on a conceptual multidimensional model, which provides a high level of abstraction in accurately and expressively describing real-world situations. Once this model is designed, the corresponding logical representation must be obtained as the basis of the implementation of the DW according to one specific technology. However, even though a good conceptual multidimensional model is designed underneath a DW, there is a semantic gap between this model and its logical representation. In particular, this gap complicates an adequate treatment of summarizability issues, which in turn may lead to erroneous results of data analysis tools. Research addressing this topic has produced only partial solutions, and individual terminology used by different parties hinders further progress. Consequently, based on a unifying vocabulary, this survey sheds light on (i) the weak and strong points of current approaches for modeling complex multidimensional structures that reflect real-world situations in a conceptual multidimensional model and (ii) existing mechanisms to avoid summarizability problems when conceptual multidimensional models are being implemented.

127 citations

Journal Article•10.1016/J.DATAK.2008.09.004•

Improved model management with aggregated business process models

Hajo A. Reijers¹, RS Ronny Mans¹, R. A. van der Toorn•Institutions (1)

Eindhoven University of Technology¹

1 Feb 2009

TL;DR: This paper proposes an extension of Event-driven Process Chains, called the aggregate EPC (aEPC), which can be used to describe a set of similar processes with a single model. By doing so, the number of process models that must be managed can be decreased.

Abstract: Contemporary organizations invest much effort in creating models of their business processes. This raises the issue of how to deal with large sets of process models that become available over time. This paper proposes an extension of Event-driven Process Chains, called the aggregate EPC (aEPC), which can be used to describe a set of similar processes with a single model. By doing so, the number of process models that must be managed can be decreased. At the same time, the process logic for each specific element of the set over which aggregation takes place can still be distinguished. The presented approach is supported as an add-on to the ARIS modeling tool box. To show the feasibility and effectiveness of the approach, we discuss its practical application in the context of a large financial organization.

114 citations

Journal Article•10.1016/J.DATAK.2008.11.001•

Frequent items in streaming data: An experimental evaluation of the state-of-the-art

Nishad Manerikar¹, Themis Palpanas¹•Institutions (1)

University of Trento¹

1 Apr 2009

TL;DR: This paper presents the results of the first extensive comparative experimental study of the most prominent algorithms in the literature, comprehensively tested using a common test framework on several real and synthetic datasets.

Abstract: The problem of detecting frequent items in streaming data is relevant to many different applications across many domains. Several algorithms, diverse in nature, have been proposed in the literature for the solution of the above problem. In this paper, we review these algorithms, and we present the results of the first extensive comparative experimental study of the most prominent algorithms in the literature. The algorithms were comprehensively tested using a common test framework on several real and synthetic datasets. Their performance with respect to the different parameters (i.e., parameters intrinsic to the algorithms, and data related parameters) was studied. We report the results, and insights gained through these experiments.

105 citations
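
For readers unfamiliar with the frequent-items problem, the following sketch shows one well-known counter-based algorithm, Misra-Gries (often called "Frequent"). The abstract above does not name the algorithms the study compares, so this is offered only as a representative example of the problem, not as the paper's method. The function name and the toy stream are made up for illustration.

```python
def misra_gries(stream, k):
    """Single pass with at most k - 1 counters. Every item that occurs
    more than n / k times in a stream of length n is guaranteed to be
    among the returned keys; the stored counts are underestimates."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream of length 12; with k = 3 the threshold n / k is 4.
stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
candidates = misra_gries(stream, 3)
print(candidates)
```

The returned keys are candidates only: items below the threshold may also appear, so an exact answer needs a second pass over the data to verify counts.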

Journal Article•10.1016/J.DATAK.2009.04.003•

Interacting services: From specification to execution

Gero Decker¹, Oliver Kopp², Frank Leymann², Mathias Weske¹•Institutions (2)

Hasso Plattner Institute¹, University of Stuttgart²

1 Oct 2009

TL;DR: The requirements framework provides the basis for introducing the language BPEL4Chor, which extends the industry standard WS-BPEL with choreography-specific concepts, and integration with executable service orchestrations is discussed.

Abstract: Interacting services play a key role to realize business process integration among different business partners by means of electronic message exchange. In order to provide seamless integration of these services, the messages exchanged as well as their dependencies must be well-defined. Service choreographies are a means to describe the allowed conversations. This article presents a requirements framework for service choreography languages, along which existing choreography languages are assessed. The requirements framework provides the basis for introducing the language BPEL4Chor, which extends the industry standard WS-BPEL with choreography-specific concepts. A validation is provided and integration with executable service orchestrations is discussed.

99 citations

Journal Article•10.1016/J.DATAK.2008.08.006•

Incremental clustering of dynamic data streams using connectivity based representative points

Sebastian Lühr¹, Mihai Lazarescu¹•Institutions (1)

Curtin University¹

1 Jan 2009

TL;DR: An incremental graph-based clustering algorithm whose design was motivated by a need to extract and retain meaningful information from data streams produced by applications such as large scale surveillance, network packet inspection and financial transaction monitoring is presented.

Abstract: We present an incremental graph-based clustering algorithm whose design was motivated by a need to extract and retain meaningful information from data streams produced by applications such as large scale surveillance, network packet inspection and financial transaction monitoring. To this end, the method we propose utilises representative points to both incrementally cluster new data and to selectively retain important cluster information within a knowledge repository. The repository can then be subsequently used to assist in the processing of new data, the archival of critical features for off-line analysis, and in the identification of recurrent patterns.

Journal Article•10.1016/J.DATAK.2008.12.001•

Privacy-preserving data publishing for cluster analysis

Benjamin C. M. Fung¹, Ke Wang², Lingyu Wang¹, Patrick C. K. Hung³•Institutions (3)

Concordia University¹, Simon Fraser University², University of Ontario Institute of Technology³

1 Jun 2009

TL;DR: A practical data publishing framework for generating a masked version of data that preserves both individual privacy and information usefulness for cluster analysis and presents a framework to evaluate the cluster quality on the masked data.

Abstract: Releasing person-specific data could potentially reveal sensitive information about individuals. k-anonymization is a promising privacy protection mechanism in data publishing. Although substantial research has been conducted on k-anonymization and its extensions in recent years, only a few prior works have considered releasing data for some specific purpose of data analysis. This paper presents a practical data publishing framework for generating a masked version of data that preserves both individual privacy and information usefulness for cluster analysis. Experiments on real-life data suggest that by focusing on preserving cluster structure in the masking process, the cluster quality is significantly better than the cluster quality of the masked data without such focus. The major challenge of masking data for cluster analysis is the lack of class labels that could be used to guide the masking process. Our approach converts the problem into the counterpart problem for classification analysis, wherein class labels encode the cluster structure in the data, and presents a framework to evaluate the cluster quality on the masked data.
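
The k-anonymity property that the abstract above builds on can be checked in a few lines. This sketch only tests whether an already-masked table is k-anonymous; it does not implement the paper's masking framework or its cluster-quality evaluation. The records, the attribute names, and the choice of `age` and `zip` as quasi-identifiers are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs
    in the table is shared by at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical masked table: ages generalized to ranges, ZIP codes truncated.
records = [
    {"age": "30-39", "zip": "476**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "476**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "479**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "479**", "diagnosis": "diabetes"},
]
print(is_k_anonymous(records, ["age", "zip"], 2))
print(is_k_anonymous(records, ["age", "zip"], 3))
```

Each quasi-identifier group here has exactly two records, so the table passes for k = 2 and fails for k = 3.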

Journal Article•10.1016/J.DATAK.2009.07.012•

Cerno: Light-weight tool support for semantic annotation of textual documents

Nadzeya Kiyavitskaya¹, Nicola Zeni¹, James R. Cordy², Luisa Mich¹, John Mylopoulos³•Institutions (3)

University of Trento¹, Queen's University², University of Toronto³

1 Dec 2009

TL;DR: Cerno is a framework for semi-automatic semantic annotation of textual documents according to a domain-specific semantic model, based on light-weight techniques and tools intended for legacy code analysis and markup.

Abstract: Enrichment of text documents with semantic metadata reflecting their meaning facilitates document organization, indexing and retrieval. However, most web data remain unstructured because of the difficulty and the cost of manually annotating text. In this work, we present Cerno, a framework for semi-automatic semantic annotation of textual documents according to a domain-specific semantic model. The proposed framework is founded on light-weight techniques and tools intended for legacy code analysis and markup. To illustrate the feasibility of our proposal, we report experimental results of its application to two different domains. These results suggest that light-weight semi-automatic techniques for semantic annotation are feasible, require limited human effort for adaptation to a new domain, and demonstrate markup quality comparable with state-of-the-art methods.

Journal Article•10.1016/J.DATAK.2008.08.008•

An active learning framework for semi-supervised document clustering with language modeling

Ruizhang Huang¹, Wai Lam²•Institutions (2)

Hong Kong Polytechnic University¹, The Chinese University of Hong Kong²

1 Jan 2009

TL;DR: A gain-directed document pair selection method that measures how much the authors can learn by revealing judgments of selected document pairs is designed and uses the estimation of term co-occurrence probabilities as a clue for finding informative document pairs.

Abstract: This paper investigates a framework that actively selects informative document pairs for obtaining user feedback for semi-supervised document clustering. A gain-directed document pair selection method that measures how much we can learn by revealing judgments of selected document pairs is designed. We use the estimation of term co-occurrence probabilities as a clue for finding informative document pairs. Term co-occurrence probabilities are considered in the semi-supervised document clustering process to capture term-to-term dependence relationships. In the semi-supervised document clustering, each cluster is represented by a language model. We have conducted extensive experiments on several real-world corpora. The results demonstrate that our proposed framework is effective.

Journal Article•10.1016/J.DATAK.2009.01.001•

Improving XML schema matching performance using Prüfer sequences

Alsayed Algergawy¹, Eike Schallehn¹, Gunter Saake¹•Institutions (1)

Otto-von-Guericke University Magdeburg¹

1 Aug 2009

TL;DR: This paper develops and implements the XPruM system, which consists mainly of two parts, schema preparation and schema matching, and introduces the concept of compatible nodes to identify semantic correspondences across complex elements first; the matching process is then refined to identify correspondences among simple elements inside each pair of compatible nodes.

Abstract: Schema matching is a critical step for discovering semantic correspondences among elements in many data-shared applications. Most existing schema matching algorithms produce scores between schema elements, resulting in the discovery of only simple matches. Such results only partially solve the problem. Identifying and discovering complex matches is considered one of the biggest obstacles to completely solving the schema matching problem. Another obstacle is the scalability of matching algorithms to large numbers of schemas and to large-scale schemas. To tackle these challenges, in this paper, we propose a new XML schema matching framework based on the use of Prüfer encoding. In particular, we develop and implement the XPruM system, which consists mainly of two parts: schema preparation and schema matching. First, we parse XML schemas and represent them internally as schema trees. Prüfer sequences are constructed for each schema tree and employed to construct a sequence representation of schemas. We capture schema tree semantic information in Label Prüfer Sequences (LPS) and schema tree structural information in Number Prüfer Sequences (NPS). Then, we develop a new structural matching algorithm exploiting both LPS and NPS. To cope with complex matching discovery, we introduce the concept of compatible nodes to identify semantic correspondences across complex elements first, then the matching process is refined to identify correspondences among simple elements inside each pair of compatible nodes. Our experimental results demonstrate the performance benefits of the XPruM system.
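
As background for the Prüfer encoding mentioned above, the sketch below computes the classic Prüfer sequence of a labeled tree: repeatedly remove the leaf with the smallest label and record its neighbor until two nodes remain. This is the textbook construction only. The abstract does not spell out how XPruM builds its Label and Number Prüfer Sequences, so that variant is not reproduced here, and the function name and example tree are assumptions.

```python
import heapq

def prufer_sequence(edges):
    """Classic Prüfer code of a labeled tree given as (u, v) edges.
    Labels must be mutually comparable (e.g. integers)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Min-heap of current leaves, so the smallest label is removed first.
    leaves = [node for node, adj in neighbors.items() if len(adj) == 1]
    heapq.heapify(leaves)
    sequence = []
    for _ in range(len(neighbors) - 2):
        leaf = heapq.heappop(leaves)
        (parent,) = neighbors[leaf]      # a leaf has exactly one neighbor
        sequence.append(parent)
        neighbors[parent].remove(leaf)
        if len(neighbors[parent]) == 1:  # parent has become a leaf
            heapq.heappush(leaves, parent)
    return sequence

# Six-node tree: nodes 1, 2, 3 hang off 4; 4 connects to 5; 5 connects to 6.
tree = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
code = prufer_sequence(tree)
print(code)
```

A tree with n nodes yields a sequence of length n - 2, and the tree can be rebuilt from that sequence, which is what makes the encoding attractive as a compact, comparable representation of tree structure.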

Journal Article•10.1016/J.DATAK.2009.02.013•

Process instantiation

Gero Decker¹, Jan Mendling²•Institutions (2)

Hasso Plattner Institute¹, Humboldt University of Berlin²

1 Sep 2009

TL;DR: The CASU framework is introduced and provides the basis for the design of new correctness criteria as well as for the formalization of Event-driven Process Chains and extension of the Business Process Modeling Notation (BPMN).

Abstract: Although several process modeling languages allow one to specify processes with multiple start elements, the precise semantics of such models are often unclear, both from a pragmatic and from a theoretical point of view. This paper addresses the lack of research on this problem and introduces the CASU framework (from Creation, Activation, Subscription, Unsubscription). The contribution of this framework is a systematic description of design alternatives for the specification of instantiation semantics of process modeling languages. We classify six prominent languages by the help of this framework. We validate the relevance of the CASU framework through empirical investigations involving a large set of process models from practice. Our work provides the basis for the design of new correctness criteria as well as for the formalization of Event-driven Process Chains (EPCs) and extension of the Business Process Modeling Notation (BPMN). It complements research such as the workflow patterns.

Journal Article•10.1016/J.DATAK.2008.11.002•

The AMTEx approach in the medical document indexing and retrieval application

Angelos Hliaoutakis¹, Kaliope Zervanou¹, Euripides G. M. Petrakis¹•Institutions (1)

Technical University of Crete¹

1 Mar 2009

TL;DR: Experimental results demonstrate that AMTEx performs better in indexing in 20-50% of the processing time compared to MMTx, while for the retrieval task, AMTEx performs better in the full text (PMC) corpus.

Abstract: AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the US National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of NLM, with a well-established method for extraction of terminology, the C/NC-value method. The performance evaluation of two AMTEx configurations is measured against the current state-of-the-art, the MetaMap Transfer (MMTx) method in four experiments, using two types of corpora: a subset of MEDLINE (PMC) full document corpus and a subset of MEDLINE (OHSUMED) abstracts, for each of the indexing and retrieval tasks, respectively. The experimental results demonstrate that AMTEx performs better in indexing in 20-50% of the processing time compared to MMTx, while for the retrieval task, AMTEx performs better in the full text (PMC) corpus.

Journal Article•10.1016/J.DATAK.2009.07.002•

Supporting content-based image retrieval and computer-aided diagnosis systems with association rule-based techniques

Marcela Xavier Ribeiro¹, Pedro Henrique Bugatti¹, Caetano Traina¹, Paulo Mazzoncini de Azevedo Marques¹, Natália Abdala Rosa¹, Agma J. M. Traina¹•Institutions (1)

University of São Paulo¹

1 Dec 2009

TL;DR: The results indicate that association rules can be successfully applied to improve CBIR and CAD systems, empowering the arsenal of techniques to support medical image analysis in medical systems.

Abstract: In this work, we take advantage of association rule mining to support two types of medical systems: the Content-based Image Retrieval (CBIR) systems and the Computer-Aided Diagnosis (CAD) systems. For content-based retrieval, association rules are employed to reduce the dimensionality of the feature vectors that represent the images and to improve the precision of the similarity queries. We refer to the association rule-based method to improve CBIR systems proposed here as Feature selection through Association Rules (FAR). To improve CAD systems, we propose the Image Diagnosis Enhancement through Association rules (IDEA) method. Association rules are employed to suggest a second opinion to the radiologist or a preliminary diagnosis of a new image. A second opinion obtained automatically can either accelerate the process of diagnosing or strengthen a hypothesis, increasing the probability that a prescribed treatment will be successful. Two new algorithms are proposed to support the IDEA method: one to pre-process low-level features and one to propose a preliminary diagnosis based on association rules. We performed several experiments to validate the proposed methods. The results indicate that association rules can be successfully applied to improve CBIR and CAD systems, empowering the arsenal of techniques to support medical image analysis in medical systems.
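
The two standard measures behind any association rule, support and confidence, can be sketched as follows. This is generic association rule arithmetic, not the FAR or IDEA algorithms, and the feature and diagnosis labels in the toy transactions are entirely made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).
    Assumes the antecedent occurs in at least one transaction."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions: discretized image features plus a label.
transactions = [
    {"texture=coarse", "contrast=high", "dx=mass"},
    {"texture=coarse", "contrast=high", "dx=mass"},
    {"texture=coarse", "contrast=low", "dx=normal"},
    {"texture=fine", "contrast=low", "dx=normal"},
]
strong = confidence(transactions, {"texture=coarse", "contrast=high"}, {"dx=mass"})
weak = confidence(transactions, {"texture=coarse"}, {"dx=mass"})
print(strong, weak)
```

In this toy data the two-feature rule always holds while the single-feature rule holds in two of three cases, which illustrates how rule strength can indicate which feature combinations are worth keeping.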

Journal Article•10.1016/J.DATAK.2008.08.009•

An unsupervised method for joint information extraction and feature mining across different Web sites

Tak-Lam Wong¹, Wai Lam¹•Institutions (1)

The Chinese University of Hong Kong¹

1 Jan 2009

TL;DR: An unsupervised learning framework which can jointly extract information and conduct feature mining from a set of Web pages across different sites, based on an undirected graphical model which can model the interdependence between the text fragments within the same Web page, as well as text fragments in different Web pages.

Abstract: We develop an unsupervised learning framework which can jointly extract information and conduct feature mining from a set of Web pages across different sites. One characteristic of our model is that it allows tight interactions between the tasks of information extraction and feature mining. Decisions for both tasks can be made in a coherent manner, leading to solutions which satisfy both tasks and eliminate potential conflicts at the same time. Our approach is based on an undirected graphical model which can model the interdependence between the text fragments within the same Web page, as well as text fragments in different Web pages. Web pages across different sites are considered simultaneously and hence information from different sources can be effectively leveraged. An approximate learning algorithm is developed to conduct inference over the graphical model to tackle the information extraction and feature mining tasks. We demonstrate the efficacy of our framework by applying it to two applications, namely, important product feature mining from vendor sites, and hot item feature mining from auction sites. Extensive experiments on real-world data have been conducted to demonstrate the effectiveness of our framework.

Journal Article•10.1016/J.DATAK.2009.06.010•

Discovering hybrid temporal patterns from sequences consisting of point- and interval-based events

Shin-Yi Wu¹, Yen-Liang Chen²•Institutions (2)

Industrial Technology Research Institute¹, National Central University²

1 Nov 2009

TL;DR: This study introduces a hybrid temporal pattern mining problem, develops an algorithm to discover hybrid temporal patterns from hybrid event sequences, and compares the efficiency and predictive power of this algorithm with traditional algorithms designed exclusively for mining point-based or interval-based patterns.

Abstract: Previous sequential pattern mining studies have dealt with either point-based event sequences or interval-based event sequences. In some applications, however, event sequences may contain both point-based and interval-based events. These sequences are called hybrid event sequences. Since the relationships among both kinds of events are more diverse, the patterns discovered from these events are more informative. In this study we introduce a hybrid temporal pattern mining problem and develop an algorithm to discover hybrid temporal patterns from hybrid event sequences. We carry out an experiment using both synthetic and real stock price data to compare our algorithm with the traditional algorithms designed exclusively for mining point-based patterns or interval-based patterns. The experimental results indicate that the efficiency of our algorithm is satisfactory. In addition, the experiment shows that the predictive power of hybrid temporal patterns is higher than that of point-based or interval-based patterns.
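
To make the notion of a hybrid event sequence concrete, the following Python sketch stores point-based and interval-based events in one structure and classifies how a point event relates to an interval event. The `Event` class, the relation names, and the sample labels are illustrative assumptions; the abstract does not specify the paper's own representation or relation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; a point event has start == end."""
    label: str
    start: float
    end: float

    @property
    def is_point(self) -> bool:
        return self.start == self.end


def point_interval_relation(point: Event, interval: Event) -> str:
    """Classify where a point event falls relative to an interval event."""
    t = point.start
    if t < interval.start:
        return "before"
    if t == interval.start:
        return "starts"
    if t < interval.end:
        return "during"
    if t == interval.end:
        return "finishes"
    return "after"


# A small hybrid event sequence mixing both kinds of events (made-up labels).
rising = Event("price_rising", 1, 5)
sequence = [Event("earnings_report", 3, 3), rising, Event("dividend", 5, 5)]

relations = [
    (e.label, point_interval_relation(e, rising))
    for e in sequence
    if e.is_point
]
```

A pattern miner would count how often such relation tuples recur across many sequences; that counting step is not shown here.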

Journal Article•10.1016/J.DATAK.2009.06.003•

Accurate and large-scale privacy-preserving data mining using the election paradigm

Emmanouil Magkos¹, Manolis Maragoudakis², Vassilis Chrissikopoulos¹, Stefanos Gritzalis²•Institutions (2)

Ionian University¹, University of the Aegean²

1 Nov 2009

TL;DR: This paper argues in favor of using some well-known cryptographic primitives, borrowed from the literature on Internet elections, based on the classical homomorphic election model, and particularly on an extension for supporting multi-candidate elections in a fully distributed setting.

Abstract: With the proliferation of the Web and ICT there have been concerns about the handling and use of sensitive information by data mining systems. Recent research has focused on distributed environments where the participants in the system may also be mutually mistrustful. In this paper we discuss the design and security requirements for large-scale privacy-preserving data mining (PPDM) systems in a fully distributed setting, where each client possesses its own records of private data. To this end we argue in favor of using some well-known cryptographic primitives, borrowed from the literature on Internet elections. More specifically, our framework is based on the classical homomorphic election model, and particularly on an extension for supporting multi-candidate elections. We also review a recent scheme [Z. Yang, S. Zhong, R.N. Wright, Privacy-preserving classification of customer data without loss of accuracy, in: SDM 2005, SIAM International Conference on Data Mining, 2005], which was the first scheme that used the homomorphic encryption primitive for PPDM in the fully distributed setting. Finally, we show how our approach can be used as a building block to obtain Random Forests classification with enhanced prediction performance.
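
The idea of a homomorphic multi-candidate tally can be sketched in Python as follows. This is a toy illustration, not the paper's protocol: it assumes Paillier encryption as the additively homomorphic scheme (the abstract does not name one), uses tiny hard-coded primes that offer no security, and encodes a vote for candidate `j` as `base ** j`, where `base` must exceed the number of voters. All function names are hypothetical.

```python
import math
import random


def keygen(p: int, q: int):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt m as (n+1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n


def tally(n: int, priv, ciphertexts, num_candidates: int, base: int):
    """Multiply ciphertexts (which adds the plaintexts), decrypt once,
    then read the per-candidate counts as digits in the given base."""
    product = 1
    for c in ciphertexts:
        product = (product * c) % (n * n)
    total = decrypt(n, priv, product)
    counts = []
    for _ in range(num_candidates):
        counts.append(total % base)
        total //= base
    return counts


# Five clients vote among three candidates; base 10 supports up to 9 voters.
n, priv = keygen(1009, 1013)
votes = [0, 2, 1, 2, 2]
ciphertexts = [encrypt(n, 10 ** v) for v in votes]
counts = tally(n, priv, ciphertexts, num_candidates=3, base=10)
```

Only the aggregate is ever decrypted, so individual votes stay hidden from whoever computes the product. In a data mining setting the "candidates" could stand for attribute or class values whose frequencies a miner needs, which is an interpretation of the abstract rather than a detail it states.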

Journal Article•10.1016/J.DATAK.2009.02.004•

Modeling and analysis of security trade-offs - A goal oriented approach

Golnaz Elahi¹, Eric Yu¹•Institutions (1)

University of Toronto¹

1 Jul 2009

TL;DR: An extension to the i* Framework is proposed for security trade-off analysis, taking advantage of its multi-agent and goal orientation; the method was applied to several case studies used to exemplify existing approaches.

Abstract: In designing software systems, security is typically only one design objective among many. It may compete with other objectives such as functionality, usability, and performance. Too often, security mechanisms such as firewalls, access control, or encryption are adopted without explicit recognition of competing design objectives and their origins in stakeholders' interests. Recently, there has been increasing acknowledgement that security is ultimately about trade-offs. One can only aim for "good enough" security, given the competing demands from many parties. This paper investigates the criteria for a conceptual modeling technique for making security trade-offs. We examine how conceptual modeling can provide explicit and systematic support for modeling and analyzing security trade-offs. We examine several existing approaches for dealing with trade-offs, and security trade-offs in particular. From analyzing the limitations of existing methods, we propose an extension to the i* Framework for security trade-off analysis, taking advantage of its multi-agent and goal orientation. The method was applied to several case studies used to exemplify existing approaches. The resulting models developed using different approaches are compared.

Journal Article•10.1016/J.DATAK.2009.07.013•

Tailor-made data management for embedded systems: A case study on Berkeley DB

Marko Rosenmüller¹, Sven Apel², Thomas Leich, Gunter Saake¹•Institutions (2)

Otto-von-Guericke University Magdeburg¹, University of Passau²

1 Dec 2009

TL;DR: An approach for decomposing data management software for embedded systems using feature-oriented programming is presented, and a software product line that allows generating tailor-made data management systems is demonstrated.

Abstract: Applications in the domain of embedded systems are diverse and store an increasing amount of data. In order to satisfy the varying requirements of these applications, data management functionality is needed that can be tailored to the applications' needs. Furthermore, the resource restrictions of embedded systems imply a need for data management that is customized to the hardware platform. In this paper, we present an approach for decomposing data management software for embedded systems using feature-oriented programming. The result of such a decomposition is a software product line that allows us to generate tailor-made data management systems. While existing approaches for tailoring software have significant drawbacks regarding customizability and performance, a feature-oriented approach overcomes these limitations, as we will demonstrate. In a non-trivial case study on Berkeley DB, we evaluate our approach and compare it to other approaches for tailoring DBMS.

Journal Article•10.1016/J.DATAK.2009.02.012•

Deciding service composition and substitutability using extended operating guidelines

Christian Stahl¹, Karsten Wolf²•Institutions (2)

Humboldt University of Berlin¹, University of Rostock²

1 Sep 2009

TL;DR: This paper presents an extension of the concept of an operating guideline to characterize all correctly interacting partners of a service P, which can be used for answering at least the following two questions: does a given service R interact correctly with P, and can P be substituted by a given service P'?

Abstract: We study the correct interaction between services using the following notion for correctness: there is no deadlock in the interaction of the services, and a given set of activities is not dead, that is, each activity in this set is executed in at least one run. The second condition has not been studied before. An operating guideline of a service P is an operational characterization of all deadlock-free interacting partners of P. In this paper, we present an extension of the concept of an operating guideline to characterize all correctly interacting partners of a service P. This extension can be used for answering at least the following two questions. First, given a service R, does R interact correctly with P? Second, given a service P', can P be substituted by P', that is, is every correctly interacting partner of P a correctly interacting partner of P', too?

Journal Article•10.1016/J.DATAK.2008.10.001•

Establishing relationships among patterns in stock market data

Dietmar H. Dorr¹, Anne Denton¹•Institutions (1)

North Dakota State University¹

1 Mar 2009

TL;DR: This work introduces an algorithm for capturing the relationships among similar, contiguous subsequences, identifies patterns based on the similarity among sequences, captures the sequence-subsequence relationships among patterns in the form of a directed acyclic graph (DAG), and determines pattern conglomerates that allow the application of additional meta-analyses and mining algorithms.

Abstract: Similarities among subsequences are typically regarded as categorical features of sequential data. We introduce an algorithm for capturing the relationships among similar, contiguous subsequences. Two time series are considered to be similar during a time interval if every contiguous subsequence of a predefined length satisfies the given similarity criterion. Our algorithm identifies patterns based on the similarity among sequences, captures the sequence-subsequence relationships among patterns in the form of a directed acyclic graph (DAG), and determines pattern conglomerates that allow the application of additional meta-analyses and mining algorithms. For example, our pattern conglomerates can be used to analyze time information that is lost in categorical representations. We apply our algorithm to stock market data as well as several other time series data sets and show the richness of our pattern conglomerates through qualitative and quantitative evaluations. An exemplary meta-analysis determines timing patterns representing relations between time series intervals and demonstrates the merit of pattern relationships as an extension of time series pattern mining.
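
The interval-similarity definition in this abstract (every contiguous subsequence of a predefined length must satisfy the similarity criterion) can be sketched in Python as follows. The criterion used here, a maximum pointwise difference after subtracting each window's mean, is an assumption chosen for illustration, as are the function names; the abstract leaves the criterion open.

```python
def window_ok(a, b, start, w, eps):
    """Assumed similarity criterion: after mean-centering both windows,
    no pair of aligned values differs by more than eps."""
    wa, wb = a[start:start + w], b[start:start + w]
    mean_a, mean_b = sum(wa) / w, sum(wb) / w
    return all(abs((x - mean_a) - (y - mean_b)) <= eps for x, y in zip(wa, wb))


def similar_intervals(a, b, w, eps):
    """Return maximal half-open intervals [s, e) in which every window
    of length w satisfies the criterion."""
    n = min(len(a), len(b))
    intervals = []
    run_start = None
    for i in range(n - w + 1):
        if window_ok(a, b, i, w, eps):
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The last passing window started at i - 1 and covers w points.
            intervals.append((run_start, i - 1 + w))
            run_start = None
    if run_start is not None:
        intervals.append((run_start, n))
    return intervals


# Two made-up series that move together except around a spike at index 4.
a = [1, 2, 3, 4, 10, 3, 2, 1]
b = [11, 12, 13, 14, 15, 13, 12, 11]
intervals = similar_intervals(a, b, w=3, eps=0.5)
```

Each returned interval is a candidate pattern occurrence. Building the DAG of sequence-subsequence relationships described in the abstract would be a further step on top of these intervals.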

Journal Article•10.1016/J.DATAK.2009.04.005•

Mining closed patterns in multi-sequence time-series databases

Anthony J. T. Lee¹, Huei-Wen Wu¹, Tzu-Yu Lee¹, Ying-Ho Liu¹, Kuo-Tay Chen¹•Institutions (1)

National Taiwan University¹

1 Oct 2009

TL;DR: This paper proposes an efficient algorithm, called CMP-Miner, to mine closed patterns in a time-series database where each record in the database, also called a transaction, contains multiple time-series sequences.

Abstract: In this paper, we propose an efficient algorithm, called CMP-Miner, to mine closed patterns in a time-series database where each record in the database, also called a transaction, contains multiple time-series sequences. Our proposed algorithm consists of three phases. First, we transform each time-series sequence in a transaction into a symbolic sequence. Second, we scan the transformed database to find frequent patterns of length one. Third, for each frequent pattern found in the second phase, we recursively enumerate frequent patterns by a frequent pattern tree in a depth-first search manner. During the process of enumeration, we apply several efficient pruning strategies to remove frequent but non-closed patterns. Thus, the CMP-Miner algorithm can efficiently mine the closed patterns from a time-series database. The experimental results show that our proposed algorithm outperforms the modified Apriori and BIDE algorithms.
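
The first two phases and the notion of a closed pattern can be illustrated with a brute-force Python sketch. It is not CMP-Miner: it assumes one series per transaction rather than several, an up/down/steady symbolization that the abstract does not specify, contiguous patterns only, and exhaustive enumeration in place of the frequent pattern tree and pruning strategies. All names are hypothetical.

```python
from collections import defaultdict


def symbolize(series, tol=0.0):
    """Assumed phase 1: map each consecutive change to U(p), D(own) or S(teady)."""
    symbols = []
    for prev, cur in zip(series, series[1:]):
        if cur - prev > tol:
            symbols.append("U")
        elif prev - cur > tol:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def frequent_patterns(sequences, min_support):
    """Count, for every contiguous pattern, how many sequences contain it."""
    containing = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                containing[seq[i:j]].add(idx)
    return {p: len(ids) for p, ids in containing.items() if len(ids) >= min_support}


def closed_only(frequent):
    """Keep patterns that have no proper super-pattern with the same support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(q != p and p in q and frequent[q] == s for q in frequent)
    }


database = [[1, 2, 3, 2], [5, 6, 7, 6], [3, 4, 3, 4]]
sequences = [symbolize(s) for s in database]
frequent = frequent_patterns(sequences, min_support=2)
closed = closed_only(frequent)
```

The closed set is smaller than the frequent set yet loses no support information: here "U" and "D" are dropped because "UD" occurs in exactly the same transactions.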

Journal Article•10.1016/J.DATAK.2009.05.001•

Sweeping the disjunctive search space towards mining new exact concise representations of frequent itemsets

Tarek Hamrouni¹, S. Ben Yahia¹, E. Mephu Nguifo²•Institutions (2)

Tunis University¹, Blaise Pascal University²

1 Oct 2009

TL;DR: A new exact concise representation of frequent itemsets is introduced, based on an exploration of the disjunctive search space, that drastically reduces the number of handled itemsets within the targeted representation.

Abstract: Concise (or condensed) representations of frequent patterns follow the minimum description length (MDL) principle, by providing the shortest description of the whole set of frequent patterns. In this work, we introduce a new exact concise representation of frequent itemsets. This representation is based on an exploration of the disjunctive search space. The disjunctive itemsets convey information about the complementary occurrence of items in a dataset. A novel closure operator is then devised to suit the characteristics of the explored search space. The proposed operator aims at mapping many disjunctive itemsets to a unique one, called a disjunctive closed itemset. Hence, it drastically reduces the number of handled itemsets within the targeted representation. Interestingly, the proposed representation offers direct access to the disjunctive and negative supports of frequent itemsets while ensuring the derivation of their exact conjunctive supports. We conclude from the experimental results reported and discussed here that our representation is effective and sound in comparison with other concise representations.
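
The relationship between disjunctive, negative, and conjunctive supports mentioned in this abstract can be checked with a short Python sketch. It takes the disjunctive support of an itemset to be the number of transactions containing at least one of its items, the negative support to be the number containing none, and recovers the conjunctive support by inclusion-exclusion. The function names and the toy dataset are illustrative, and the paper's closure operator is not modeled.

```python
from itertools import combinations


def disjunctive_support(transactions, itemset):
    """Number of transactions containing at least one item of the itemset."""
    return sum(1 for t in transactions if t & itemset)


def negative_support(transactions, itemset):
    """Number of transactions containing none of the items."""
    return len(transactions) - disjunctive_support(transactions, itemset)


def conjunctive_from_disjunctive(transactions, itemset):
    """Inclusion-exclusion: sum over non-empty subsets J of the itemset of
    (-1)^(|J| - 1) times the disjunctive support of J."""
    items = sorted(itemset)
    total = 0
    for k in range(1, len(items) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(items, k):
            total += sign * disjunctive_support(transactions, frozenset(subset))
    return total


transactions = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("b"),
    frozenset("d"),
]
ab = frozenset("ab")
disj_ab = disjunctive_support(transactions, ab)
neg_ab = negative_support(transactions, ab)
conj_ab = conjunctive_from_disjunctive(transactions, ab)
```

For `{a, b}` the identity reduces to supp(a) + supp(b) - supp(a or b), which is why storing disjunctive supports is enough to recover exact conjunctive ones.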

Journal Article•10.1016/J.DATAK.2009.04.008•

Mining globally distributed frequent subgraphs in a single labeled graph

Xing Jiang¹, Hui Xiong², Chen Wang³, Ah-Hwee Tan¹•Institutions (3)

Nanyang Technological University¹, Rutgers University², IBM³

1 Oct 2009

TL;DR: A new measure, termed G-Measure, is proposed to find globally distributed frequent subgraphs, called G-Patterns, in a single labeled graph, and a G-Miner algorithm is developed for finding G-Patterns.

Abstract: Recent years have seen increasing efforts in graph mining, and many algorithms have been developed for this purpose. However, most of the existing algorithms are designed for discovering frequent subgraphs in a set of labeled graphs only. Also, the few algorithms that find frequent subgraphs in a single labeled graph typically identify subgraphs appearing regionally in the input graph. In contrast, for real-world applications, it is commonly required that the identified frequent subgraphs in a single labeled graph should also be globally distributed. This paper thus fills this crucial void by proposing a new measure, termed G-Measure, to find globally distributed frequent subgraphs, called G-Patterns, in a single labeled graph. Specifically, we first show that the G-Patterns, selected by G-Measure, tend to be globally distributed in the input graph. Then, we show that G-Measure has the downward closure property, which guarantees the G-Measure value of a G-Pattern is not less than those of its supersets. Consequently, a G-Miner algorithm is developed for finding G-Patterns. Experimental results on four synthetic and seven real-world data sets and comparison with the existing algorithms demonstrate the efficacy of the G-Measure and the G-Miner for finding G-Patterns. Finally, an application of the G-Patterns is given.

Journal Article•10.1016/J.DATAK.2008.08.003•

Efficient algorithms for incremental maintenance of closed sequential patterns in large databases

Lei Chang¹, Tengjiao Wang¹, Dongqing Yang¹, Hua Luan², Shiwei Tang¹•Institutions (2)

Peking University¹, Renmin University of China²

1 Jan 2009

TL;DR: Two efficient algorithms are developed to maintain closed sequential patterns in a dynamic sequence database environment by making full use of the properties of CSTree to find nodes whose states are obsolete and avoid unnecessary node extension and closure checking operations to accelerate the incremental update process.

Abstract: Recent studies show that mining compact frequent patterns (such as closed patterns and compressed patterns) can alleviate the interpretability and efficiency problems encountered by traditional frequent pattern mining methods. Compact frequent patterns keep exact or approximate supports of a complete set of frequent patterns, and the number of them is often orders of magnitude smaller. Several efficient algorithms have been proposed to mine compact sequential patterns. However, sequence databases are not always static. Sequences (or items) are often added to and deleted from databases. A slight change made to a database may lead to a change in the compact patterns. Mining from scratch is very time-consuming and thus infeasible. In this paper, we explore how to efficiently maintain closed sequential patterns in a dynamic sequence database environment. A compact structure, CSTree, is designed to keep closed sequential patterns, and its properties are carefully studied. Two efficient algorithms, IMCS_A and IMCS_D, are developed to maintain the CSTree upon incremental update. The algorithms make full use of the properties of CSTree to find nodes whose states are obsolete and to avoid unnecessary node extension and closure checking operations, which accelerates the incremental update process. A thorough experimental study on various real and synthetic datasets shows that the proposed algorithms outperform the state-of-the-art algorithms (PrefixSpan, CloSpan, BIDE, and a recently proposed incremental mining algorithm, IncSpan) by about a factor of 4 to more than an order of magnitude.

Journal Article•10.1016/J.DATAK.2009.01.002•

Mining non-derivable frequent itemsets over data stream

Haifeng Li¹, Hong Chen¹•Institutions (1)

Renmin University of China¹

1 May 2009

TL;DR: An optimized algorithm named NDFIoDS is proposed to generate non-derivable frequent itemsets over a sliding window on a data stream, and results show that this method is effective and more efficient than previous approaches.

Abstract: Non-derivable frequent itemsets are one of several condensed representations of frequent itemsets, which store all of the information contained in frequent itemsets using less space, thus being more suitable for stream mining. This paper considers a problem that to the best of our knowledge has not been addressed, namely, how to mine non-derivable frequent itemsets in an incremental fashion. We design a compact data structure named NDFIT to efficiently maintain a dynamically selected set of itemsets. In NDFIT, the nodes are divided into four categories based on their properties to reduce redundant computation. Consequently, an optimized algorithm named NDFIoDS is proposed to generate non-derivable frequent itemsets over a sliding window on the stream. Our experimental results show that this method is effective and more efficient than previous approaches.
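
What makes an itemset derivable can be sketched in Python with the standard inclusion-exclusion deduction rules from the non-derivable itemset literature: the supports of an itemset's proper subsets yield lower and upper bounds on its own support, and the itemset is derivable when the two bounds coincide. This brute-force version recomputes supports over a small sliding window and is not NDFIT or NDFIoDS; the window size, function names, and sample stream are assumptions.

```python
from collections import deque
from itertools import combinations


def support(transactions, itemset):
    return sum(1 for t in transactions if itemset <= t)


def subsets(items):
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def support_bounds(transactions, itemset):
    """Bounds on supp(I) from its proper subsets. For each X within I, sum
    (-1)^(|I - J| + 1) * supp(J) over all J with X <= J < I; the sum is an
    upper bound when |I - X| is odd and a lower bound when it is even."""
    itemset = frozenset(itemset)
    lower, upper = 0, len(transactions)
    for x in subsets(itemset):
        rest = itemset - x
        bound = 0
        for extra in subsets(rest):
            if extra == rest:
                continue  # J must be a proper subset of the itemset
            j = x | extra
            sign = 1 if (len(itemset) - len(j)) % 2 == 1 else -1
            bound += sign * support(transactions, j)
        if len(rest) % 2 == 1:
            upper = min(upper, bound)
        else:
            lower = max(lower, bound)
    return lower, upper


def is_derivable(transactions, itemset):
    lower, upper = support_bounds(transactions, itemset)
    return lower == upper


# Keep only the 4 most recent transactions of a made-up stream.
window = deque(maxlen=4)
for t in ["bc", "ab", "ab", "ac", "a"]:
    window.append(frozenset(t))

bounds_ab = support_bounds(window, frozenset("ab"))
bounds_bc = support_bounds(window, frozenset("bc"))
```

In this window `{a, b}` is derivable, so a condensed representation need not store it, whereas `{b, c}` has differing bounds and must be counted explicitly.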
