Top 66 papers presented at Data and Knowledge Engineering in 2011

Journal Article•10.1016/J.DATAK.2010.09.002•

Reinforcement learning based resource allocation in business process management

Zhengxing Huang¹, W.M.P. van der Aalst², Xudong Lu¹, Huilong Duan¹•Institutions (2)

Zhejiang University¹, Eindhoven University of Technology²

1 Jan 2011

TL;DR: A mechanism in which the resource allocation optimization problem is modeled as Markov decision processes and solved using reinforcement learning, and the proposed mechanism observes its environment to learn appropriate policies which optimize resource allocation in business process execution.

Abstract: Efficient resource allocation is a complex and dynamic task in business process management. Although a wide variety of mechanisms are emerging to support resource allocation in business process execution, these approaches do not consider performance optimization. This paper introduces a mechanism in which the resource allocation optimization problem is modeled as Markov decision processes and solved using reinforcement learning. The proposed mechanism observes its environment to learn appropriate policies which optimize resource allocation in business process execution. The experimental results indicate that the proposed approach outperforms well known heuristic or hand-coded strategies, and may improve the current state of business process management.

163 citations

Journal Article•10.1016/J.DATAK.2011.01.002•

SyMSS: A syntax-based measure for short-text semantic similarity

Jesús Oliva¹, José Ignacio Serrano¹, María Dolores del Castillo¹, Ángel Iglesias¹•Institutions (1)

Spanish National Research Council¹

1 Apr 2011

TL;DR: The results show that SyMSS outperforms state-of-the-art methods in terms of rank correlation with human intuition, thus proving the importance of syntactic information in sentence semantic similarity computation.

Abstract: Sentence and short-text semantic similarity measures are becoming an important part of many natural language processing tasks, such as text summarization and conversational agents. This paper presents SyMSS, a new method for computing short-text and sentence semantic similarity. The method is based on the notion that the meaning of a sentence is made up of not only the meanings of its individual words, but also the structural way the words are combined. Thus, SyMSS captures and combines syntactic and semantic information to compute the semantic similarity of two sentences. Semantic information is obtained from a lexical database. Syntactic information is obtained through a deep parsing process that finds the phrases in each sentence. With this information, the proposed method measures the semantic similarity between concepts that play the same syntactic role. Psychological plausibility is added to the method by using previous findings about how humans weight different syntactic roles when computing semantic similarity. The results show that SyMSS outperforms state-of-the-art methods in terms of rank correlation with human intuition, thus proving the importance of syntactic information in sentence semantic similarity computation.

153 citations

Journal Article•10.1016/J.DATAK.2011.01.005•

Editorial: Mining business process variants: Challenges, scenarios, algorithms

Chen Li¹, Manfred Reichert², Andreas Wombacher¹•Institutions (2)

University of Twente¹, University of Ulm²

1 May 2011

TL;DR: This paper addresses two scenarios for learning from process model adaptations and for discovering a reference model out of which the variants can be configured with minimum efforts, and suggests two algorithms that are applicable in both scenarios, but have their pros and cons.

Abstract: During the last years a new generation of process-aware information systems has emerged, which enables process model configurations at buildtime as well as process instance changes during runtime. Respective model adaptations result in a large number of model variants that are derived from the same process model, but slightly differ in structure. Generally, such model variants are expensive to configure and maintain. In this paper we address two scenarios for learning from process model adaptations and for discovering a reference model out of which the variants can be configured with minimum efforts. The first one is characterized by a reference process model and a collection of related process variants. The goal is to improve the original reference process model such that it fits better to the variant models. The second scenario comprises a collection of process variants, while the original reference model is unknown; i.e., the goal is to "merge" these variants into a new reference process model. We suggest two algorithms that are applicable in both scenarios, but have their pros and cons. We provide a systematic comparison of the two algorithms and further contrast them with conventional process mining techniques. Comparison results indicate good performance of our algorithms and also show that specific techniques are needed for learning from process configurations and adaptations. Finally, we provide results from a case study in automotive industry in which we successfully applied our algorithms.

136 citations

Journal Article•10.1016/J.DATAK.2011.07.002•

Editorial: Mining usage scenarios in business processes: Outlier-aware discovery and run-time prediction

Francesco Folino¹, Gianluigi Greco², Antonella Guzzo², Luigi Pontieri¹•Institutions (2)

Indian Council of Agricultural Research¹, University of Calabria²

1 Dec 2011

TL;DR: Two relevant problems that arise in the context of applying clustering methods are addressed and various mining algorithms are implemented and integrated into a system prototype, which has been thoroughly validated over two real-life application scenarios.

Abstract: A prominent goal of process mining is to build automatically a model explaining all the episodes recorded in the log of some transactional system. Whenever the process to be mined is complex and highly-flexible, however, equipping all the traces with just one model might lead to mixing different usage scenarios, thereby resulting in a spaghetti-like process description. This is, in fact, often circumvented by preliminarily applying clustering methods on the process log in order to identify all its hidden variants. In this paper, two relevant problems that arise in the context of applying such methods are addressed, which have received little attention so far: (i) making the clustering aware of outlier traces, and (ii) finding predictive models for clustering results. The first issue impacts on the effectiveness of clustering algorithms, which can indeed be led to confuse real process variants with exceptional behavior or malfunctions. The second issue instead concerns the opportunity of predicting the behavioral class of future process instances, by taking advantage of context-dependent "non-structural" data (e.g., activity executors, parameter values). The paper formalizes and analyzes these two issues and illustrates various mining algorithms to face them. All the algorithms have been implemented and integrated into a system prototype, which has been thoroughly validated over two real-life application scenarios.

77 citations

Journal Article•10.1016/J.DATAK.2011.03.006•

Editorial: Using OWL and SWRL to represent and reason with situation-based access control policies

Dizza Beimel¹, Mor Peleg²•Institutions (2)

Ruppin Academic Center¹, University of Haifa²

1 Jun 2011

TL;DR: This paper presents the SitBAC knowledge framework, a formal healthcare-oriented, context-based access-control framework that makes it possible to represent and implement SitBAC as a knowledge model along with an associated inference method, using OWL and SWRL.

Abstract: Access control is a central problem in confidentiality management, in particular in the healthcare domain, where many stakeholders require access to patients' health records. Situation-Based Access Control (SitBAC) is a conceptual model that allows for modeling healthcare scenarios of data-access requests; thus it can be used to formulate data-access policies, where health organizations can specify their regulations involving access to patients' data according to the context of the request. The model's central concept is the Situation, a formal representation of a patient's data-access scenario. In this paper, we present the SitBAC knowledge framework, a formal healthcare-oriented, context-based access-control framework that makes it possible to represent and implement SitBAC as a knowledge model along with an associated inference method, using OWL and SWRL. Within the SitBAC knowledge framework, scenarios of data access are represented as formal Web Ontology Language (OWL)-based Situation classes, formulating data-access rule classes. A set of data-access rule classes makes up the organization's data-access policy. An incoming data-access request, represented as an individual of an OWL-based Situation class, is evaluated by the inference method against the data-access policy to produce an 'approved/denied' response. The method uses a Description Logics (DL)-reasoner and a Semantic Web Rule Language (SWRL) engine during the inference process. The DL reasoner is used for knowledge classification and for real-time realization of the incoming data-access request as a member of an existing Situation class to infer the appropriate response. The SWRL engine is used to infer new knowledge regarding the incoming data-access requests, which are required for the realization process. We evaluated the ability of the SitBAC knowledge framework to provide correct responses by representing and reasoning with real-life healthcare scenarios.

73 citations

Proceedings Article•10.1109/ICDKE.2011.6053920•

A generalization of blocking and windowing algorithms for duplicate detection

Uwe Draisbach¹, Felix Naumann¹•Institutions (1)

Hasso Plattner Institute¹

20 Oct 2011

TL;DR: This work presents a new algorithm called Sorted Blocks in several variants, which generalizes both blocking and windowing for duplicate detection, and shows that the new algorithm needs fewer comparisons to find the same number of duplicates.

Abstract: Duplicate detection is the process of finding multiple records in a dataset that represent the same real-world entity. Due to the enormous costs of an exhaustive comparison, typical algorithms select only promising record pairs for comparison. Two competing approaches are blocking and windowing. Blocking methods partition records into disjoint subsets, while windowing methods, in particular the Sorted Neighborhood Method, slide a window over the sorted records and compare records only within the window. We present a new algorithm called Sorted Blocks in several variants, which generalizes both approaches. To evaluate Sorted Blocks, we have conducted extensive experiments with different datasets. These show that our new algorithm needs fewer comparisons to find the same number of duplicates.
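
To make the contrast between the two candidate-selection strategies concrete, the following is a minimal Python sketch of plain blocking and of the Sorted Neighborhood Method as the abstract describes them. It does not implement Sorted Blocks itself; the function names, the toy records, and the choice of keys are assumptions made for illustration.

```python
from itertools import combinations


def blocking_pairs(records, key):
    """Candidate pairs from disjoint blocks: only records with the same key value are paired."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(key(rec), []).append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))  # members are in ascending index order
    return pairs


def sorted_neighborhood_pairs(records, key, window):
    """Candidate pairs from a window of `window` records slid over the key-sorted list."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical toy data: surnames with spelling variants.
records = ["smith", "smyth", "smithe", "jones", "johns", "jonas"]
block = blocking_pairs(records, key=lambda r: r[0])                 # block on first letter
windowed = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)
```

In this toy run, blocking never compares records from different blocks, whereas the sliding window also pairs the two records that happen to be adjacent across the "j"/"s" boundary in sort order.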

62 citations

Journal Article•10.1016/J.DATAK.2010.08.004•

Scaling up top-K cosine similarity search

Shiwei Zhu¹, Junjie Wu¹, Hui Xiong², Guoping Xia¹•Institutions (2)

Beihang University¹, Rutgers University²

1 Jan 2011

TL;DR: A theoretical analysis and experimental results show that TOP-MATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs, and that it is particularly scalable for large-scale data sets with a large number of items.

Abstract: Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy for developing a TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to more I/O costs. Therefore, we develop a max-first traversal strategy and propose a TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA has the advantages of saving the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiencies of both TOP-DATA and TOP-MATA algorithms. Also, we show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items.
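
For orientation, the problem statement (top-K pairs by cosine similarity, with no user-supplied threshold) can be sketched as an exhaustive baseline in Python. This is not TOP-DATA or TOP-MATA: it applies no upper-bound pruning and no traversal strategy, and the item vectors and function names are assumptions made for illustration.

```python
import heapq
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_cosine_pairs(vectors, k):
    """Score every pair exhaustively and keep the k highest-scoring (score, a, b) tuples."""
    scored = (
        (cosine(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    )
    return heapq.nlargest(k, scored)


# Hypothetical items described by binary occurrence vectors.
items = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 1, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 1, 0, 0],
}
top = top_k_cosine_pairs(items, k=2)
```

The baseline evaluates all n(n-1)/2 pairs, which is the cost that pruning-based algorithms such as those in the abstract aim to avoid.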

55 citations

Journal Article•10.1016/J.DATAK.2011.02.003•

Reliable representations for association rules

Yue Xu¹, Yuefeng Li¹, Gavin Shaw¹•Institutions (1)

Queensland University of Technology¹

1 Jun 2011

TL;DR: It is proved that the redundancy elimination, based on the proposed Reliable basis, does not reduce the strength of belief in the extracted rules, and the results indicate that non-redundant association rules alone are sufficient to solve real problems, without needing the entire rule set.

Abstract: Association rule mining has contributed to many advances in the area of knowledge discovery. However, the quality of the discovered association rules is a big concern and has drawn more and more attention recently. One problem with the quality of the discovered association rules is the huge size of the extracted rule set. Often for a dataset, a huge number of rules can be extracted, but many of them can be redundant to other rules and thus useless in practice. Mining non-redundant rules is a promising approach to solve this problem. In this paper, we first propose a definition for redundancy, then propose a concise representation, called a Reliable basis, for representing non-redundant association rules. The Reliable basis contains a set of non-redundant rules which are derived using frequent closed itemsets and their generators instead of using frequent itemsets that are usually used by traditional association rule mining approaches. An important contribution of this paper is that we propose to use the certainty factor as the criterion to measure the strength of the discovered association rules. Using this criterion, we can ensure the elimination of as many redundant rules as possible without reducing the inference capacity of the remaining extracted non-redundant rules. We prove that the redundancy elimination, based on the proposed Reliable basis, does not reduce the strength of belief in the extracted rules. We also prove that all association rules, their supports and confidences, can be retrieved from the Reliable basis without accessing the dataset. Therefore the Reliable basis is a lossless representation of association rules. Experimental results show that the proposed Reliable basis can significantly reduce the number of extracted rules. We also conduct experiments on the application of association rules to the area of product recommendation. 
The experimental results show that the non-redundant association rules extracted using the proposed method retain the same inference capacity as the entire rule set. This result indicates that the non-redundant rules alone are sufficient to solve real problems, without needing the entire rule set.
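
As a rough illustration of the certainty factor as a rule-strength criterion, the Python sketch below uses the commonly cited definition for a rule A → B, which compares the rule's confidence with the support of B. Whether the paper uses exactly this formulation is an assumption, and the transactions and function name are invented for the example.

```python
def certainty_factor(transactions, antecedent, consequent):
    """Certainty factor of the rule antecedent -> consequent over a list of item sets.

    Assumed (commonly used) definition, with conf = conf(A -> B) and s = supp(B):
      conf > s:  (conf - s) / (1 - s)
      conf < s:  (conf - s) / s
      conf == s: 0
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(a <= t for t in transactions)
    if count_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    conf = sum((a | c) <= t for t in transactions) / count_a
    supp_c = sum(c <= t for t in transactions) / n
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
cf_butter_bread = certainty_factor(baskets, {"butter"}, {"bread"})
cf_bread_milk = certainty_factor(baskets, {"bread"}, {"milk"})
```

Under this definition a positive value means the antecedent raises belief in the consequent relative to its baseline frequency, and a negative value means it lowers it, which confidence alone does not reveal.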

55 citations

Journal Article•10.1016/J.DATAK.2011.01.001•

Formal modelling of organisational goals based on performance indicators

Viara Popova¹, Alexei Sharpanskykh²•Institutions (2)

University of Tartu¹, VU University Amsterdam²

1 Apr 2011

TL;DR: This paper proposes a formal framework for modelling goals based on performance indicators and defines mechanisms for establishing goal satisfaction, which enable evaluation of organisational performance.

Abstract: Every organisation exists or is created for the achievement of one or more goals. To ensure continued success, the organisation should monitor its performance with respect to the formulated goals. In practice the performance of an organisation is often evaluated by estimating its performance indicators. In most existing approaches for organisation modelling the relation between performance indicators and goals remains implicit. This paper proposes a formal framework for modelling goals based on performance indicators and defines mechanisms for establishing goal satisfaction, which enable evaluation of organisational performance. Methodological and analysis issues related to goals are also discussed in the paper. The described framework is a part of a general framework for organisation modelling and analysis.

55 citations

Journal Article•10.1016/J.DATAK.2011.07.001•

Information based data anonymization for classification utility

Jiuyong Li¹, Jixue Liu¹, Muzammil M. Baig¹, Raymond Chi-Wing Wong²•Institutions (2)

University of South Australia¹, Hong Kong University of Science and Technology²

1 Dec 2011

TL;DR: This paper argues that data generalization in anonymization should be determined by the classification capability of data rather than the privacy requirement, and proposes two k-anonymity algorithms to produce anonymized tables for building accurate classification models.

Abstract: Anonymization is a practical approach to protect privacy in data. The major objective of privacy preserving data publishing is to protect private information in data whereas data is still useful for some intended applications, such as building classification models. In this paper, we argue that data generalization in anonymization should be determined by the classification capability of data rather than the privacy requirement. We make use of mutual information for measuring classification capability for generalization, and propose two k-anonymity algorithms to produce anonymized tables for building accurate classification models. The algorithms generalize attributes to maximize the classification capability, and then suppress values by a privacy requirement k (IACk) or distributional constraints (IACc). Experimental results show that algorithm IACk supports more accurate classification models and is faster than a benchmark utility-aware data anonymization algorithm.
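
The idea of letting classification capability guide generalization can be illustrated with a small Python sketch that computes the mutual information between an attribute and the class label at different generalization levels. This is not the IACk or IACc algorithm; the ages, labels, and generalization levels are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(values, labels):
    """Empirical mutual information I(V; L) in bits between two aligned sequences."""
    n = len(values)
    joint = Counter(zip(values, labels))
    count_v = Counter(values)
    count_l = Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (count_v[v] * count_l[l]))
        for (v, l), c in joint.items()
    )


# Hypothetical records: an age attribute and a binary class label.
ages = [23, 27, 34, 38, 45, 49]
labels = ["no", "no", "yes", "yes", "no", "no"]

by_decade = [f"{age // 10 * 10}s" for age in ages]           # 20s, 30s, 40s
coarse = ["<40" if age < 40 else ">=40" for age in ages]     # two wide ranges

mi_raw = mutual_information(ages, labels)
mi_decade = mutual_information(by_decade, labels)
mi_coarse = mutual_information(coarse, labels)
```

In this toy case, generalizing ages to decades puts two records in every group while keeping all of the information about the label, whereas the coarser two-range generalization loses most of it.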

51 citations

Journal Article•10.1016/J.DATAK.2011.03.007•

Privacy-aware collection of aggregate spatial data

Hairuo Xie¹,², Lars Kulik¹,², Egemen Tanin¹,²•Institutions (2)

NICTA¹, University of Melbourne²

1 Jun 2011

TL;DR: This work modulates data collection based on Gaussian distribution to achieve an excellent balance between privacy and accuracy and proposes Gaussian negative surveys that collect negative data.

Abstract: Privacy concerns can be a major barrier to collecting aggregate data from the public. Recent research proposes negative surveys that collect negative data, which is complementary to the true data. This opens a new direction for privacy-aware data collection. However, the existing approach cannot avoid certain errors when applied to many spatial data collection tasks. The errors can make the data unusable in many real scenarios. We propose Gaussian negative surveys. We modulate data collection based on Gaussian distribution. The collected data can be used to compute accurate spatial distribution of participants and can be used to accurately answer range aggregate queries. Our approach avoids the errors that can occur with the existing approach. Our experiments show that we achieve an excellent balance between privacy and accuracy.
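
The basic negative-survey idea that the abstract builds on can be sketched in Python as follows. The sketch assumes the simplest scheme, in which each participant reports one category chosen uniformly from those that are not their own, so the expected report count for category i is (N - t_i) / (c - 1) and the true count can be estimated as N - (c - 1) * r_i. It does not implement the Gaussian variant proposed in the paper; the counts, seed, and function names are assumptions.

```python
import random


def run_negative_survey(true_categories, num_categories, rng):
    """Each participant reports one category, chosen uniformly, that is NOT their own."""
    reports = [0] * num_categories
    for own in true_categories:
        choice = rng.randrange(num_categories - 1)
        if choice >= own:
            choice += 1  # skip over the participant's true category
        reports[choice] += 1
    return reports


def reconstruct(reports):
    """Estimate true counts from negative reports: t_i ~ N - (c - 1) * r_i."""
    n = sum(reports)
    c = len(reports)
    return [n - (c - 1) * r for r in reports]


# Hypothetical population spread over four categories (e.g. four spatial regions).
true_counts = [5000, 3000, 1500, 500]
population = [cat for cat, count in enumerate(true_counts) for _ in range(count)]

reports = run_negative_survey(population, len(true_counts), random.Random(0))
estimates = reconstruct(reports)
```

The estimates always sum to the number of participants, but an individual estimate is noisy and can even fall below zero for sparsely populated categories when the sample is small.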

Journal Article•10.1016/J.DATAK.2011.06.004•

Interaction mining and skill-dependent recommendations for multi-objective team composition

Christoph Dorn¹, Florian Skopik¹, Daniel Schall¹, Schahram Dustdar¹•Institutions (1)

Vienna University of Technology¹

1 Oct 2011

TL;DR: This paper provides two heuristics based on Genetic Algorithms and Simulated Annealing for discovering efficient team configurations that yield the best trade-off between skill coverage and team connectivity and evaluates the approach based on multiple configurations of a simulated collaboration network that features close resemblance to real world expert networks.

Abstract: Web-based collaboration and virtual environments supported by various Web 2.0 concepts enable the application of numerous monitoring, mining and analysis tools to study human interactions and team formation processes. The composition of an effective team requires a balance between adequate skill fulfillment and sufficient team connectivity. The underlying interaction structure reflects social behavior and relations of individuals and determines to a large degree how well people can be expected to collaborate. In this paper we address an extended team formation problem that does not only require direct interactions to determine team connectivity but additionally uses implicit recommendations of collaboration partners to support even sparsely connected networks. We provide two heuristics based on Genetic Algorithms and Simulated Annealing for discovering efficient team configurations that yield the best trade-off between skill coverage and team connectivity. Our self-adjusting mechanism aims to discover the best combination of direct interactions and recommendations when deriving connectivity. We evaluate our approach based on multiple configurations of a simulated collaboration network that features close resemblance to real world expert networks. We demonstrate that our algorithm successfully identifies efficient team configurations even when removing up to 40% of experts from various social network configurations.

Journal Article•10.1016/J.DATAK.2011.02.005•

Editorial: ANEMONE: An environment for modular ontology development

Tuğba Özacar¹, Övünç Öztürk¹, Murat Osman Ünalır¹•Institutions (1)

Ege University¹

1 Jun 2011

TL;DR: This article presents a methodology for modular ontology development that differs from previous methodologies in the way that it defines concrete development steps, to facilitate use by both naive and expert ontology developers.

Abstract: Many real-world ontologies contain thousands of terms and are developed by multiple participants. The use of monolithic ontologies can cause problems that affect various stages of the ontology life cycle. Thus, there is an urgent need for tools and methodologies that facilitate modular ontology design. The benefits of a modular approach include division of labor, scalability, partial reuse, and broadened participation. This article presents a methodology for modular ontology development. The main idea is to facilitate an interoperable hierarchical network of ontology modules. Modules are designed as a combination of more abstract modules in higher levels of the hierarchy. This methodology differs from previous methodologies in the way that it defines concrete development steps, to facilitate use by both naive and expert ontology developers. This methodology is also supported by ontology design patterns and a prototypical ontology development tool.

Journal Article•10.1016/J.DATAK.2010.11.003•

A novel keyword search paradigm in relational databases: Object summaries

Georgios John Fakas¹•Institutions (1)

Manchester Metropolitan University¹

1 Feb 2011

TL;DR: The proposed paradigm introduces the concept of Affinity in order to automatically generate OSs in relational databases, and investigates and quantifies the Affinity of relations and their attributes in order to decide which tuples and attributes to include in the OS.

Abstract: This paper introduces a novel keyword search paradigm in relational databases, where the result of a search is an Object Summary (OS). An OS summarizes all data held about a particular Data Subject (DS) in a database. More precisely, it is a tree with a tuple containing the keyword(s) as a root and neighboring tuples as children. In contrast to traditional relational keyword search, an OS comprises a more complete and therefore semantically meaningful set of information about the enquired DS. The proposed paradigm introduces the concept of Affinity in order to automatically generate OSs. More precisely, it investigates and quantifies the Affinity of relations (i.e. Affinity) and their attributes (i.e. Attribute Affinity) in order to decide which tuples and attributes to include in the OS. Experimental evaluation on the TPC-H and Northwind databases verifies the searching quality of the proposed paradigm on both large and small databases; precision, recall, f-score, CPU and space measures are presented.

Journal Article•10.1016/J.DATAK.2010.11.002•

Privacy-preserving publishing microdata with full functional dependencies

Hui Wang¹, Ruilin Liu¹•Institutions (1)

Stevens Institute of Technology¹

1 Mar 2011

TL;DR: This paper formalizes the FFD-based privacy attack, defines the privacy model (d, ℓ)-inference to combat it, and designs robust algorithms that can efficiently anonymize the microdata with low information loss when unsafe FFDs are present.

Abstract: Data publishing has generated much concern on individual privacy. Recent work has shown that different background knowledge can bring various threats to the privacy of published data. In this paper, we study the privacy threat from the full functional dependency (FFD) that is used as part of adversary knowledge. We show that the cross-attribute correlations by FFDs (e.g., Phone->Zipcode) can bring potential vulnerability. Unfortunately, none of the existing anonymization principles (e.g., k-anonymity, ℓ-diversity, etc.) can effectively prevent against an FFD-based privacy attack. We formalize the FFD-based privacy attack and define the privacy model, (d, ℓ)-inference, to combat the FD-based attack. We distinguish the safe FFDs that will not jeopardize privacy from the unsafe ones. We design robust algorithms that can efficiently anonymize the microdata with low information loss when the unsafe FFDs are present. The efficiency and effectiveness of our approach are demonstrated by the empirical study.
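
To make the notion of a full functional dependency concrete, the Python sketch below checks whether a dependency such as Phone->Zipcode holds in a table and whether it is full, meaning no proper non-empty subset of the left-hand side already determines the right-hand side. It illustrates the dependency check only, not the attack or the anonymization algorithms; the table contents and function names are assumptions.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """True if every combination of lhs values maps to a single combination of rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def is_full(rows, lhs, rhs):
    """True if lhs -> rhs holds and no proper non-empty subset of lhs determines rhs."""
    if not holds(rows, lhs, rhs):
        return False
    return not any(
        holds(rows, subset, rhs)
        for size in range(1, len(lhs))
        for subset in combinations(lhs, size)
    )


# Hypothetical microdata table.
rows = [
    {"Phone": "555-0101", "Zipcode": "07030", "Disease": "flu"},
    {"Phone": "555-0101", "Zipcode": "07030", "Disease": "asthma"},
    {"Phone": "555-0202", "Zipcode": "07030", "Disease": "flu"},
    {"Phone": "555-0303", "Zipcode": "07310", "Disease": "flu"},
]

phone_to_zip = is_full(rows, ["Phone"], ["Zipcode"])
zip_to_phone = holds(rows, ["Zipcode"], ["Phone"])
padded = is_full(rows, ["Phone", "Disease"], ["Zipcode"])
```

Here Phone->Zipcode is a full dependency, the reverse direction does not hold, and adding Disease to the left-hand side yields a dependency that holds but is not full.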

Journal Article•10.1016/J.DATAK.2011.07.006•

An approach to test-driven development of conceptual schemas

Albert Tort¹, Antoni Olivé¹, Maria-Ribera Sancho¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Dec 2011

TL;DR: This paper presents the Test-Driven Conceptual Modeling (TDCM) method, which is an application of TDD for conceptual modeling, and it is shown how to develop a conceptual schema using it.

Abstract: Test-Driven Development (TDD) is an extreme programming development method in which a software system is developed in short iterations. In this paper we present the Test-Driven Conceptual Modeling (TDCM) method, which is an application of TDD for conceptual modeling, and we show how to develop a conceptual schema using it. In TDCM, a system's conceptual schema is incrementally obtained by performing three kinds of tasks: (1) Write a test the system should pass; (2) Change the schema to pass the test; and (3) Refactor the schema to improve its qualities. We also describe an integration approach of TDCM into a broad set of software development methodologies, including the Unified Process development methodology, the MDD-based approaches, the storytest-driven agile methods and the goal and scenario-oriented requirements engineering methods. We deal with schemas written in UML/OCL, but the TDCM method could be adapted to the development of schemas in other languages.

Journal Article•10.1016/J.DATAK.2011.01.007•

Symbolic abstraction and deadlock-freeness verification of inter-enterprise processes

Kais Klai¹, Samir Tata², Jörg Desel³•Institutions (3)

University of Paris¹, Institut Mines-Télécom², Rolf C. Hagen Group³

1 May 2011

TL;DR: This paper investigates how the symbolic observation graph technique can be adapted and employed for process composition.

Abstract: The design of complex inter-enterprise business processes (IEBP) is generally performed in a modular way. Each process is designed separately and then the whole IEBP is obtained by composition. Even if such a modular approach is intuitive and facilitates the design problem, it poses the problem that correct behavior of each business process of the IEBP taken alone does not guarantee a correct behavior of the composed IEBP (i.e. properties are not preserved by composition). Proving correctness of the (unknown) composed process is strongly related to the model checking problem of a system model. Among others, the symbolic observation graph based approach has proven to be very helpful for efficient model checking in general. Since it is heavily based on abstraction techniques and thus hides detailed information about system components that are not relevant for the correctness decision, it is promising to transfer this concept to the problem raised in this paper: How can the symbolic observation graph technique be adapted and employed for process composition? Answering this question is the aim of this paper.

Journal Article•10.1016/J.DATAK.2010.08.001•

Indexing and querying XML using extended Dewey labeling scheme

Jiaheng Lu¹, Xiaofeng Meng¹, Tok Wang Ling¹•Institutions (1)

Renmin University of China¹

1 Jan 2011

TL;DR: A novel labeling scheme is introduced, called extended Dewey, which effectively extends the existing Dewey labeling scheme to combine the types and identifiers of elements in a label, and to avoid the scan of labels for internal query nodes to accelerate query processing (in I/O cost).

Abstract: Finding all the occurrences of a tree pattern in an XML database is a core operation for efficient evaluation of XML queries. The Dewey labeling scheme is commonly used to label an XML document to facilitate XML query processing by recording information on the path of an element. In order to improve the efficiency of XML tree pattern matching, we introduce a novel labeling scheme, called extended Dewey, which effectively extends the existing Dewey labeling scheme to combine the types and identifiers of elements in a label, and to avoid the scan of labels for internal query nodes to accelerate query processing (in I/O cost). Based on extended Dewey, we propose a series of holistic XML tree pattern matching algorithms. We first present TJFast to answer an XML twig pattern query. To efficiently answer a generalized XML tree pattern, we then propose GTJFast, an optimization that exploits the non-output nodes. In addition, we propose TJFastTL and GTJFastTL based on the tag+level data partition scheme to further reduce I/O costs by level pruning. Finally, we report our comprehensive experimental results to show that our set of XML tree pattern matching algorithms are superior to existing approaches in terms of the number of elements scanned, the size of intermediate results and query performance.
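
As background for the labeling scheme, the Python sketch below assigns plain Dewey labels to an XML document and uses the prefix property to test ancestor-descendant relationships. It shows only the original Dewey scheme that the paper extends, not extended Dewey or the TJFast family; the sample document and function names are assumptions.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Return (label, tag) pairs in document order; a label is the path of child positions."""
    labels = []

    def visit(node, label):
        labels.append((label, node.tag))
        for pos, child in enumerate(node, start=1):
            visit(child, label + (pos,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """In Dewey labeling, a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[: len(a)] == a


# Hypothetical document.
doc = ET.fromstring("<bib><book><title/><author/></book><book><title/></book></bib>")
labels = dewey_labels(doc)
```

Because a label encodes the whole path from the root, structural relationships can be decided from two labels alone, without touching the document; plain Dewey labels do not, however, record the element names along that path.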

Journal Article•10.1016/J.DATAK.2010.12.002•

Document clustering using synthetic cluster prototypes

Argyris Kalogeratos¹, Aristidis Likas¹•Institutions (1)

University of Ioannina¹

1 Mar 2011

TL;DR: The MedoidKNN synthetic prototype, which favors the representation of the dominant class in a cluster, is introduced and incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp).

Abstract: The use of centroids as prototypes for clustering text documents with the k-means family of methods is not always the best choice for representing text clusters due to the high dimensionality, sparsity, and low quality of text data. Especially when we seek clusters with a small number of objects, the use of centroids may lead to poor solutions close to bad initial conditions. To overcome this problem, we propose the idea of a synthetic cluster prototype, which is computed by first selecting a subset of cluster objects (instances), then computing the representative of these objects and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype that favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach, especially for small datasets and clusters overlapping in many dimensions, and its superior performance against traditional and subspace clustering methods.
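
The three steps named in this abstract (select a subset of cluster objects, compute their representative, keep important features) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' MedoidKNN definition: the medoid is taken as the document with the highest total cosine similarity to the cluster, the subset is the medoid plus its nearest neighbors, the representative is their mean, and "important" features are simply the largest weights. The names `synthetic_prototype`, `n_neighbors` and `n_features` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def synthetic_prototype(cluster: List[Vector], n_neighbors: int,
                        n_features: int) -> Vector:
    docs = [normalize(d) for d in cluster]

    # Step 1 (assumed): pick the medoid and its nearest neighbors in the cluster.
    medoid = max(docs, key=lambda d: sum(cosine(d, o) for o in docs))
    ranked = sorted(docs, key=lambda d: cosine(medoid, d), reverse=True)
    subset = ranked[:n_neighbors + 1]  # the medoid itself ranks first, up to ties

    # Step 2 (assumed): the representative is the mean of the selected subset.
    dim = len(medoid)
    mean = [sum(d[i] for d in subset) / len(subset) for i in range(dim)]

    # Step 3 (assumed): keep only the n_features largest weights, zero the rest.
    keep = set(sorted(range(dim), key=lambda i: mean[i], reverse=True)[:n_features])
    return normalize([mean[i] if i in keep else 0.0 for i in range(dim)])
```

In a spherical k-means loop, such a prototype would stand in for the normalized centroid when documents are reassigned to clusters by cosine similarity.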

Journal Article•10.1016/J.DATAK.2011.05.001•

Towards a reference service model for the Web of Services

Nikolaos Loutas¹, Vassilios Peristeras¹, Konstantinos Tarabanis²•Institutions (2)

National University of Ireland, Galway¹, University of Macedonia²

1 Sep 2011

TL;DR: A reference service model (RSM) that closes the gap between two seemingly contradictory service annotation paradigms: traditional semantic service frameworks and the emerging social annotation of services.

Abstract: This paper introduces a reference service model (RSM) that closes the gap between two seemingly contradictory service annotation paradigms: traditional semantic service frameworks and the emerging social annotation of services. RSM aims to (i) facilitate the semantic interlinking between services annotated using different semantic models and (ii) accommodate the bottom-up social annotation of services. RSM was developed following the design science research methodology. To develop RSM, existing semantic service models and SOA service models were reviewed in the light of the six service contracts, examining whether, and with which elements, each model supports each contract. The identified elements were then fed to a multiphase abstraction exercise. RSM comprises the following concepts: Service, Service Input, Service Output, Service Context and Service Logic, Service Provider, Service Client and Service Feedback. The paper also maps the concepts of RSM to those of existing semantic service models and positions RSM with respect to related SOA service models. Finally, an implementation of RSM in OWL and two pilot developments that highlight different aspects of RSM are discussed.

Journal Article•10.1016/J.DATAK.2011.06.005•

An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers

David Rebollo-Monedero¹, Jordi Forné¹, Miguel Soriano•Institutions (1)

Polytechnic University of Catalonia¹

1 Oct 2011

TL;DR: The most promising aspect of the proposed algorithm is its capability to maintain the same k-anonymity constraint, while outperforming MDAV by a significant reduction in data distortion, in all the cases considered.

Abstract: We present a multidisciplinary solution to the problems of anonymous microaggregation and clustering, illustrated with two applications, namely privacy protection in databases, and private retrieval of location-based information. Our solution is perturbative, is based on the same privacy criterion used in microdata k-anonymization, and provides anonymity through a substantial modification of the Lloyd algorithm, a celebrated quantization design algorithm, endowed with numerical optimization techniques. Our algorithm is particularly suited to the important problem of k-anonymous microaggregation of databases, with a small integer k representing the number of individual respondents indistinguishable from each other in the published database. Our algorithm also exhibits excellent performance in the problem of clustering or macroaggregation, where k may take on arbitrarily large values. We illustrate its applicability in this second, somewhat less common case, by means of an example of location-based services. Specifically, location-aware devices entrust a third party with accurate location information. This party then uses our algorithm to create distortion-optimized, size-constrained clusters, where k nearby devices share a common centroid location, which may be regarded as a distorted version of the original one. The centroid location is sent back to the devices, which use it when contacting untrusted location-based information providers, in lieu of the exact home location, to enforce k-anonymity. We compare the performance of our novel algorithm to the state-of-the-art microaggregation algorithm MDAV, on both synthetic and standardized real data, which encompass the cases of small and large values of k. The most promising aspect of our proposed algorithm is its capability to maintain the same k-anonymity constraint, while outperforming MDAV by a significant reduction in data distortion, in all the cases considered.
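
To make the k-anonymity constraint and the notion of distortion concrete, here is a deliberately naive sketch for one-dimensional data. It is neither the paper's Lloyd-based algorithm nor MDAV: it sorts the values, cuts them into consecutive groups of at least k records, and publishes each group's centroid. Measuring distortion as mean squared error is an assumption of this sketch, as are the function names.

```python
from typing import List


def microaggregate(values: List[float], k: int) -> List[float]:
    """Replace each value by the centroid of a group of at least k records.

    Naive baseline (assumed for illustration): sort, then cut into
    consecutive groups of size k; the last group absorbs any remainder.
    """
    if k < 1 or len(values) < k:
        raise ValueError("need at least k records and k >= 1")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    published = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        members = order[start:end]
        centroid = sum(values[i] for i in members) / len(members)
        for i in members:
            published[i] = centroid  # all members share one published value
    return published


def distortion(original: List[float], published: List[float]) -> float:
    """Mean squared error between original and published values."""
    return sum((a - b) ** 2 for a, b in zip(original, published)) / len(original)
```

Every published value is shared by at least k records, which is the k-anonymity constraint; a better grouping is one that satisfies the same constraint with lower distortion.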

Journal Article•10.1016/J.DATAK.2010.12.001•

Automated browsing in AJAX websites

Paula Montoto¹, Alberto Pan¹, Juan Raposo¹, Fernando Bellas¹, Javier Lopez¹•Institutions (1)

University of A Coruña¹

1 Mar 2011

TL;DR: A new method for recording navigation sequences able to scale to a wider range of events, an algorithm to identify in a change-resilient manner the target element of a user action, and a novel method to detect when the effects caused by a user action have finished.

Abstract: Web automation applications are widely used for different purposes such as B2B integration, automated testing of web applications or technology and business watch. One crucial requirement for web automation applications is to easily generate and reproduce navigation sequences. This problem is especially complicated in the case of the new breed of AJAX-based websites. Although some recent tools have also addressed the problem, they show limitations either in usability or in their ability to deal with complex websites. In this paper, we propose a set of new techniques to build an automatic web navigation system able to deal with these complexities. Our main contributions are: a new method for recording navigation sequences able to scale to a wider range of events, an algorithm to identify in a change-resilient manner the target element of a user action, and a novel method to detect when the effects caused by a user action (including the effects of scripting code and AJAX requests) have finished. In addition, we have tested our approach with a large number of real web sources and have compared it with other relevant web automation tools, obtaining very good results.

Journal Article•10.1016/J.DATAK.2011.05.003•

Sequence partitioning for process mining with unlabeled event logs

Michal Walicki¹, Diogo R. Ferreira²•Institutions (2)

University of Bergen¹, Technical University of Lisbon²

1 Oct 2011

TL;DR: This work describes an approach to perform complete search over the search space, as a matter of finding the minimal set of patterns contained in a sequence, where patterns can be interleaved but do not have repeating symbols.

Abstract: Finding the case id in unlabeled event logs is arguably one of the hardest challenges in process mining research. While this problem has been addressed with greedy approaches, these usually converge to sub-optimal solutions. In this work, we describe an approach to perform complete search over the search space. We formulate the problem as a matter of finding the minimal set of patterns contained in a sequence, where patterns can be interleaved but do not have repeating symbols. This represents a new problem that has not been previously addressed in the literature, with NP-hard variants and conjectured NP-completeness. We solve it in a stepwise manner, by generating and verifying a list of candidate solutions. The techniques, introduced to address various subtasks, can be applied independently for solving more specific problems. The approach has been implemented and applied in a case study with real data from a business process supported in a software application.

Journal Article•10.1016/J.DATAK.2011.03.004•

Combining objects with rules to represent aggregation knowledge in data warehouse and OLAP systems

Nicolas Prat¹, Isabelle Comyn-Wattiau¹, Jacky Akoka²•Institutions (2)

ESSEC Business School¹, Conservatoire national des arts et métiers²

1 Aug 2011

TL;DR: In this article, the authors propose to represent aggregation knowledge with objects (UML class diagrams) and rules in the Production Rule Representation language (PRR) to enable early modeling of user requirements in a data warehouse project.

Abstract: Data warehouses are based on multidimensional modeling. Using On-Line Analytical Processing (OLAP) tools, decision makers navigate through and analyze multidimensional data. Typically, users need to analyze data at different aggregation levels (using roll-up and drill-down functions). Therefore, aggregation knowledge should be adequately represented in conceptual multidimensional models, and mapped in subsequent logical and physical models. However, current conceptual multidimensional models poorly represent aggregation knowledge, which (1) has a complex structure and dynamics and (2) is highly contextual. In order to account for the characteristics of this knowledge, we propose to represent it with objects (UML class diagrams) and rules in the Production Rule Representation language (PRR). Static aggregation knowledge is represented in the class diagrams, while rules represent the dynamics (i.e. how aggregation may be performed depending on context). We present the class diagrams, and a typology and examples of associated rules. We argue that this representation of aggregation knowledge enables an early modeling of user requirements in a data warehouse project. A prototype has been developed based on the Java Expert System Shell (Jess).

Journal Article•10.1016/J.DATAK.2010.11.005•

A method of workflow scheduling based on colored Petri nets

Zhijiao Xiao¹, Zhong Ming¹•Institutions (1)

Shenzhen University¹

1 Feb 2011

TL;DR: Experimental results show that the proposed method of workflow scheduling, called phased method, can deal with the uncertainties and the dynamic circumstances very well and a satisfactory balance can be achieved between static global optimization and dynamic local optimization.

Abstract: Effective methods of workflow scheduling can improve the performance of workflow systems. Based on the study of existing scheduling methods, a method of workflow scheduling, called phased method, is proposed. This method is based on colored Petri nets. Activities of workflows are divided into several groups to be scheduled in different phases using this method. Details of the method are discussed. Experimental results show that the proposed method can deal with the uncertainties and the dynamic circumstances very well and a satisfactory balance can be achieved between static global optimization and dynamic local optimization.

Journal Article•10.1016/J.DATAK.2011.01.003•

Generating operation specifications from UML class diagrams: A model transformation approach

Manoli Albert¹, Jordi Cabot², Cristina Gómez³, Vicente Pelechano¹•Institutions (3)

Polytechnic University of Valencia¹, French Institute for Research in Computer Science and Automation², Polytechnic University of Catalonia³

1 Apr 2011

TL;DR: This paper aims to simplify this task by providing a method that automatically generates a set of basic operations that complement the static aspects of the CS and suffice to perform all typical life-cycle create/update/delete changes on the population of the elements of the CS.

Abstract: One of the more tedious and complex tasks during the specification of conceptual schemas (CSs) is modeling the operations that define the system behavior. This paper aims to simplify this task by providing a method that automatically generates a set of basic operations that complement the static aspects of the CS and suffice to perform all typical life-cycle create/update/delete changes on the population of the elements of the CS. Our method guarantees that the generated operations are executable, i.e. their executions produce a consistent state with respect to the most typical structural constraints that can be defined in CSs (e.g. multiplicity constraints). In particular, our method takes as input a CS expressed as a Unified Modeling Language (UML) class diagram (optionally defined using a profile to enrich the specification of associations) and generates an extended version of the CS that includes all necessary operations to start operating the system. If desired, these basic operations can be later used as building blocks for creating more complex ones. We show the formalization and implementation of our method by means of model-to-model transformations. Our approach is particularly relevant in the context of Model Driven Development approaches.

Journal Article•10.1016/J.DATAK.2010.09.001•

Extending ℓ-diversity to generalize sensitive data

Hongwei Tian¹, Weining Zhang¹•Institutions (1)

University of Texas at San Antonio¹

1 Jan 2011

TL;DR: An efficient heuristic algorithm that uses a novel order of quasi-identifier (QI) values to achieve (τ, ℓ)-diversity is presented, and preliminary experimental results indicate that the algorithm not only provides a stronger privacy protection but also results in better utility of anonymous data.

Abstract: Generalization is an important technique for protecting privacy in data dissemination. In the framework of generalization, ℓ-diversity is a strong notion of privacy. However, since existing ℓ-diversity measures are defined in terms of the most specific (rather than general) sensitive attribute (SA) values, algorithms based on these measures can have narrow eligible ranges for data that has a heavily skewed distribution of SA values and produce anonymous data that has a low utility. In this paper, we propose a new ℓ-diversity measure called the functional (τ, ℓ)-diversity, which extends ℓ-diversity by using a simple function to constrain frequencies of base SA values that are induced by general SA values. As a result, algorithms based on (τ, ℓ)-diversity may generalize SA values, and are thus much less constrained by skewed SA distributions. We show that (τ, ℓ)-diversity is more flexible and elaborate than existing ℓ-diversity measures. We present an efficient heuristic algorithm that uses a novel order of quasi-identifier (QI) values to achieve (τ, ℓ)-diversity. We compare our algorithm with two state-of-the-art algorithms that are based on existing ℓ-diversity measures. Our preliminary experimental results indicate that our algorithm not only provides a stronger privacy protection but also results in better utility of anonymous data.

Journal Article•10.1016/J.DATAK.2011.07.007•

An approximate duplicate elimination in RFID data streams

Chun-Hee Lee¹, Chin-Wan Chung²•Institutions (2)

Samsung¹, KAIST²

1 Dec 2011

TL;DR: Experimental results show that the approaches can effectively remove duplicates in RFID data streams in one pass with a small amount of memory.

Abstract: The RFID technology has been applied to a wide range of areas since it does not require contact in detecting RFID tags. However, due to the multiple readings in many cases in detecting an RFID tag and the deployment of multiple readers, RFID data contains many duplicates. Since RFID data is generated in a streaming fashion, it is difficult to remove duplicates in one pass with limited memory. We propose one pass approximate methods based on Bloom Filters using a small amount of memory. We first devise Time Bloom Filters as a simple extension to Bloom Filters. We then propose Time Interval Bloom Filters to reduce errors. Time Interval Bloom Filters need more space than Time Bloom Filters. We propose a method to reduce space for Time Interval Bloom Filters. Since Time Bloom Filters and Time Interval Bloom Filters are based on Bloom Filters, they do not produce false negative errors. Experimental results show that our approaches can effectively remove duplicates in RFID data streams in one pass with a small amount of memory.
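
The idea of extending a Bloom filter with time information can be sketched roughly as below. The abstract does not give the data structure's details, so everything here is an assumption made for illustration: each cell stores the timestamp of the last reading hashed to it instead of a single bit, a reading counts as a duplicate when all of its cells were written within a time window, and cells are always refreshed with the newest timestamp. The class name `TimeBloomFilter`, its parameters and the SHA-256 based hashing are hypothetical choices, not the paper's.

```python
import hashlib
from typing import List, Optional


class TimeBloomFilter:
    """Bloom-filter-like structure whose cells hold timestamps (sketch)."""

    def __init__(self, n_cells: int, n_hashes: int, window: float) -> None:
        self.cells: List[Optional[float]] = [None] * n_cells  # None = never written
        self.n_hashes = n_hashes
        self.window = window  # readings closer than this are duplicates

    def _positions(self, tag_id: str) -> List[int]:
        # Derive n_hashes cell indexes by salting one cryptographic hash.
        positions = []
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{tag_id}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def is_duplicate(self, tag_id: str, timestamp: float) -> bool:
        """Check one reading, then record it. Timestamps must be non-decreasing."""
        positions = self._positions(tag_id)
        duplicate = all(
            self.cells[p] is not None and timestamp - self.cells[p] <= self.window
            for p in positions
        )
        for p in positions:
            self.cells[p] = timestamp  # one pass: each reading is seen once
        return duplicate
```

With non-decreasing timestamps, a repeated reading inside the window is always flagged, because its cells hold a time at least as recent as the earlier reading. Distinct tags that collide on every cell can still be flagged wrongly, which is the approximate part of the method.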

Journal Article•10.1016/J.DATAK.2011.03.009•

Extracting hot spots of topics from time-stamped documents

Wei Chen, Parvathi Chundi¹•Institutions (1)

University of Nebraska Omaha¹

1 Jul 2011

TL;DR: The experiments show that the proposed EHE algorithm significantly outperforms the naive one, and the extracted hot spots of given topics are meaningful.

Abstract: Identifying time periods with a burst of activities related to a topic has been an important problem in analyzing time-stamped documents. In this paper, we propose an approach to extract a hot spot of a given topic in a time-stamped document set. Topics can be basic, containing a simple list of keywords, or complex. Logical relationships such as and, or, and not are used to build complex topics from basic topics. A concept of presence measure of a topic based on fuzzy set theory is introduced to compute the amount of information related to the topic in the document set. Each interval in the time period of the document set is associated with a numeric value which we call the discrepancy score. A high discrepancy score indicates that the documents in the time interval are more focused on the topic than those outside of the time interval. A hot spot of a given topic is defined as a time interval with the highest discrepancy score. We first describe a naive implementation for extracting hot spots. We then construct an algorithm called EHE (Efficient Hot Spot Extraction) using several efficient strategies to improve performance. We also introduce the notion of a topic DAG to facilitate an efficient computation of presence measures of complex topics. The proposed approach is illustrated by several experiments on a subset of the TDT-Pilot Corpus and DBLP conference data set. The experiments show that the proposed EHE algorithm significantly outperforms the naive one, and the extracted hot spots of given topics are meaningful.
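
The pipeline described in this abstract (presence measures for basic and complex topics, a discrepancy score per interval, a naive search for the best interval) can be sketched as follows. The abstract gives no formulas, so all of them are assumptions here: basic presence is the fraction of document words that are topic keywords, `and`/`or`/`not` use the standard fuzzy min/max/complement operators, and the discrepancy score of an interval is the mean presence inside it minus the mean presence outside it. This mirrors only the naive approach, not EHE or the topic DAG, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A topic maps a document (a sequence of words) to a presence value in [0, 1].
Topic = Callable[[Sequence[str]], float]


def basic(keywords: Sequence[str]) -> Topic:
    """Basic topic: fraction of document words that are keywords (assumed)."""
    kw = set(keywords)

    def presence(doc: Sequence[str]) -> float:
        return sum(1 for w in doc if w in kw) / len(doc) if doc else 0.0

    return presence


def topic_and(a: Topic, b: Topic) -> Topic:
    return lambda doc: min(a(doc), b(doc))  # fuzzy intersection


def topic_or(a: Topic, b: Topic) -> Topic:
    return lambda doc: max(a(doc), b(doc))  # fuzzy union


def topic_not(a: Topic) -> Topic:
    return lambda doc: 1.0 - a(doc)  # fuzzy complement


def hot_spot(presence_per_slot: List[float]) -> Tuple[int, int]:
    """Naive scan over all intervals; returns inclusive (start, end) slots."""
    n = len(presence_per_slot)
    total = sum(presence_per_slot)
    best, best_score = (0, n - 1), float("-inf")
    for start in range(n):
        inside = 0.0
        for end in range(start, n):
            inside += presence_per_slot[end]
            length = end - start + 1
            if length == n:
                continue  # the whole period has no outside to compare with
            score = inside / length - (total - inside) / (n - length)
            if score > best_score:
                best, best_score = (start, end), score
    return best
```

The scan examines a quadratic number of intervals, which is the kind of cost an efficient extraction algorithm would aim to reduce.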

Journal Article•10.1016/J.DATAK.2011.01.006•

Hierarchical aggregation of Service Level Agreements

Irfan Ul Haq¹, Altaf Ahmad Huqqani¹, Erich Schikuta¹•Institutions (1)

University of Vienna¹

1 May 2011

TL;DR: The notion of SLA Choreography is formalized and an aggregation model based on SLA-Views is defined to enable the automation of hierarchical aggregation of Service Level Agreements to comply with the WS-Agreement standard.

Abstract: IT-based Service Economy requires Service Markets to flourish for the trade of services. A market does not represent a simple buyer-seller relationship, rather it is the culmination point of a complex chain of stake-holders with a hierarchical integration of value along each point in the chain. To enable a Service Economy, Service Markets must be practically realized, which in turn requires an enabling infrastructure to support service value chains and service choreographies resulting from service composition scenarios. In such scenarios, services compose together hierarchically in a producer-consumer manner to form service supply-chains of added value. Service Level Agreements (SLAs) are defined at various levels in this hierarchy to ensure the expected quality of service for different stakeholders. Automation of service composition directly implies the aggregation of their corresponding SLAs. In this paper we elaborate on the requirements of hierarchical aggregation of SLAs corresponding to service choreographies leading to business models such as Business Value Networks. During the hierarchical aggregation of SLAs, certain SLA information pertaining to different stakeholders is meant to be restricted and can be only partially revealed to a subset of their business partners. We introduce the concept of SLA-Views to protect such privacy concerns. We then formalize the notion of SLA Choreography and define an aggregation model based on SLA-Views to enable the automation of hierarchical aggregation of Service Level Agreements. The aggregation model has been designed to comply with the WS-Agreement standard.
