Top 70 papers presented at Data and Knowledge Engineering in 2012

Journal Article•10.1016/J.DATAK.2012.02.005•

From humor recognition to irony detection: The figurative language of social media

Antonio Reyes¹, Paolo Rosso¹, Davide Buscaldi²•Institutions (2)

Polytechnic University of Valencia¹, Paul Sabatier University²

1 Apr 2012

TL;DR: The research described in this paper is focused on analyzing two playful domains of language: humor and irony, in order to identify key values components for their automatic processing in social media, such as "tweets".

Abstract: The research described in this paper is focused on analyzing two playful domains of language: humor and irony, in order to identify key values components for their automatic processing. In particular, we are focused on describing a model for recognizing these phenomena in social media, such as "tweets". Our experiments are centered on five data sets retrieved from Twitter taking advantage of user-generated tags, such as "#humor" and "#irony". The model, which is based on textual features, is assessed on two dimensions: representativeness and relevance. The results, apart from providing some valuable insights into the creative and figurative usages of language, are positive regarding humor, and encouraging regarding irony.

444 citations

Journal Article•10.1016/J.DATAK.2011.09.007•

AD-LRU: An efficient buffer replacement algorithm for flash-based databases

Peiquan Jin¹,², Yi Ou¹,², Theo Härder¹,², Zhi Li¹,²•Institutions (2)

University of Science and Technology of China¹, Kaiserslautern University of Technology²

1 Feb 2012

TL;DR: A new approach to buffer management for flash-based databases, called AD-LRU (Adaptive Double LRU), which focuses on improving the overall runtime efficiency by reducing the number of write/erase operations and by retaining a high buffer hit ratio.

Abstract: Flash memory has characteristics of out-of-place update and asymmetric I/O latencies for read, write, and erase operations. Thus, the buffering policy for flash-based databases has to consider those properties to improve the overall performance. This article introduces a new approach to buffer management for flash-based databases, called AD-LRU (Adaptive Double LRU), which focuses on improving the overall runtime efficiency by reducing the number of write/erase operations and by retaining a high buffer hit ratio. We conduct trace-driven experiments both in a simulation environment and in a real DBMS, using a real OLTP trace and four kinds of synthetic traces: random, read-most, write-most, and Zipf. We make detailed comparisons between our algorithm and the best-known competitor methods. The experimental results show that AD-LRU is superior to its competitors in most cases.

92 citations
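
The AD-LRU abstract above does not spell out the algorithm, so the following is only a minimal sketch of the general idea it mentions: keep the hit ratio high with LRU ordering while avoiding flash writes by evicting clean pages first. The class name, the two-queue layout (a cold queue for pages seen once, a hot queue for re-referenced pages), and the `flash_writes` counter are assumptions made for illustration, not the authors' design.

```python
from collections import OrderedDict


class TwoQueueCleanFirstBuffer:
    """Toy buffer pool (hypothetical, not the published AD-LRU algorithm)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cold = OrderedDict()  # page_id -> dirty flag, least recently used first
        self.hot = OrderedDict()   # pages referenced at least twice
        self.flash_writes = 0      # dirty evictions, each costing one flash write

    def access(self, page_id, write=False):
        """Reference a page; return True on a buffer hit, False on a miss."""
        if page_id in self.hot:
            dirty = self.hot.pop(page_id) or write
            self.hot[page_id] = dirty  # move to the most recently used end
            return True
        if page_id in self.cold:
            dirty = self.cold.pop(page_id) or write
            self.hot[page_id] = dirty  # promote on the second reference
            return True
        if len(self.cold) + len(self.hot) >= self.capacity:
            self._evict()
        self.cold[page_id] = write
        return False

    def _evict(self):
        """Evict from the cold queue if possible, preferring a clean page."""
        queue = self.cold if self.cold else self.hot
        # Least recently used clean page, if any: dropping it needs no write.
        victim = next((p for p, dirty in queue.items() if not dirty), None)
        if victim is None:
            victim = next(iter(queue))  # fall back to the LRU dirty page
            self.flash_writes += 1
        del queue[victim]
```

With a capacity of two, writing page A, reading page B, and then reading page C evicts the clean page B rather than the older dirty page A, so no flash write is counted.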

Journal Article•10.1016/J.DATAK.2012.08.001•

DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets

Mina Alibeigi, Sattar Hashemi, Ali Hamzeh

1 Nov 2012

TL;DR: The theoretical analysis and experimental observations reveal that the Density Based Feature Selection approach is the method of choice by offering a simple yet effective feature ranking method based on well-known statistical evaluation measures.

Abstract: Nowadays, imbalanced data sets are pervasive in real world human practices, and hence, become a very interesting research area within machine learning communities. Imbalanced data sets introduce a significant reduction in performance of standard classifiers when they are invoked to learn the concepts underlying the data. The problem becomes even more severe when imbalanced data sets are involved with high dimensions. This paper presents a novel feature ranking approach based on the probability density estimation to cope with these issues. The idea behind our approach, named Density Based Feature Selection (DBFS), is that features' distributions over classes can bring significant benefits to feature selection algorithms. In other words, to explore the contribution of each attribute and assign it an appropriate rank, DBFS takes into account features' corresponding distributions over all classes along with their correlations. To show the effectiveness of the presented approach, well-known feature ranking methods are implemented and compared with our approach across a variety of small sample size and high dimensional data sets from microarray, mass spectrometry and text mining domains. Our theoretical analysis and experimental observations reveal that our approach is the method of choice by offering a simple yet effective feature ranking method based on well-known statistical evaluation measures.

77 citations

Journal Article•10.1016/J.DATAK.2012.07.003•

Inferring the semantic properties of sentences by mining syntactic parse trees

Boris Galitsky¹, Josep Lluis de la Rosa, Gábor Dobrocsi•Institutions (1)

eBay¹

1 Nov 2012

TL;DR: It is concluded that implicit indications of semantic classes can be extracted from syntactic structures by using a syntactic parse tree-based similarity measure instead of the bag-of-words and keyword frequency approaches.

Abstract: We extend the mechanism of logical generalization toward syntactic parse trees and attempt to detect semantic signals unobservable in the level of keywords. Generalization from a syntactic parse tree as a measure of syntactic similarity is defined by the obtained set of maximum common sub-trees and is performed at the level of paragraphs, sentences, phrases and individual words. We analyze the semantic features of this similarity measure and compare it with the semantics of traditional anti-unification of terms. Nearest-Neighbor machine learning is then applied to relate the sentence to a semantic class. By using a syntactic parse tree-based similarity measure instead of the bag-of-words and keyword frequency approaches, we expect to detect a subtle difference between semantic classes that is otherwise unobservable. The proposed approach is evaluated in three distinct domains in which a lack of semantic information makes the classification of sentences rather difficult. We conclude that implicit indications of semantic classes can be extracted from syntactic structures.

66 citations

Journal Article•10.1016/J.DATAK.2011.07.004•

Measures and mechanisms for process monitoring in evolving business networks

Marco Comuzzi¹, J. Vonk¹, Paul Grefen¹•Institutions (1)

Eindhoven University of Technology¹

1 Jan 2012

TL;DR: A framework to solve the problem of preserving the monitorability of processes in an evolving business network is proposed and a set of metrics are defined that can be used for supporting decisions regarding the potential evolution of a business network.

Abstract: The literature on monitoring of cross-organizational processes, executed within business networks, considers monitoring only in the network formation phase, since network establishment determines what can be monitored during process execution. In particular, the impact of evolution in such networks on monitoring is not considered. When a business network evolves, e.g. contracts are introduced, updated, or dropped, or actors join or leave the network, the monitoring requirements of the network actors change as well. As a result, the monitorability of processes in the network may be disrupted. This paper proposes a framework to solve the problem of preserving the monitorability of processes in an evolving business network. We first propose a formal model of business networks, contracts, and monitoring requirements. Then, we model network evolution and the mechanisms to preserve the monitorability of the processes in the network for different types of evolution. In particular, the preservation of monitorability requires the actors in the network to take appropriate actions in case of dependencies between already established contracts, and update their monitoring infrastructure to satisfy the new monitoring requirements introduced by evolution. We also define a set of metrics that can be used for supporting decisions regarding the potential evolution of a business network. A case study in healthcare and the discussion of a prototype implementation show the applicability of our framework in real-world scenarios.

37 citations

Journal Article•10.1016/J.DATAK.2012.03.003•

Fully homomorphic encryption based two-party association rule mining

Mohammed Kaosar¹, Russell Paulet¹, Xun Yi¹•Institutions (1)

Victoria University, Australia¹

1 Jun 2012

TL;DR: This paper proposes a secure comparison technique using fully homomorphic encryption scheme that provides a similar level of security to the Yao based solution, but promotes greater efficiency due to the reuse of resources.

Abstract: Association rule mining (ARM) is one of the popular data mining methods that discover interesting correlations amongst a large collection of data, which appears incomprehensible. This is known to be a trivial task when the data is owned by one party. But when multiple data sites collectively engage in ARM, privacy concerns are introduced. Due to this concern, privacy preserving data mining algorithms have been developed to attain the desired result, while maintaining privacy. In the case of two party privacy preserving ARM for horizontally partitioned databases, both parties are required to compare their itemset counts securely. This problem is comparable to the famous millionaire problem of Yao. However, in this paper, we propose a secure comparison technique using fully homomorphic encryption scheme that provides a similar level of security to the Yao based solution, but promotes greater efficiency due to the reuse of resources.

34 citations

Journal Article•10.1016/J.DATAK.2011.07.009•

Mining frequent patterns from univariate uncertain data

Ying-Ho Liu¹•Institutions (1)

National Dong Hwa University¹

1 Jan 2012

TL;DR: The experimental results demonstrate that the U2P-Miner algorithm outperforms three widely used algorithms, namely, the modified Apriori, modified H-mine, and modified depth-first backtracking algorithms.

Abstract: In this paper, we propose a new algorithm called U2P-Miner for mining frequent U2 patterns from univariate uncertain data, where each attribute in a transaction is associated with a quantitative interval and a probability density function. The algorithm is implemented in two phases. First, we construct a U2P-tree that compresses the information in the target database. Then, we use the U2P-tree to discover frequent U2 patterns. Potential frequent U2 patterns are derived by combining base intervals and verified by traversing the U2P-tree. We also develop two techniques to speed up the mining process. Since the proposed method is based on a tree-traversing strategy, it is both efficient and scalable. Our experimental results demonstrate that the U2P-Miner algorithm outperforms three widely used algorithms, namely, the modified Apriori, modified H-mine, and modified depth-first backtracking algorithms.

33 citations

Journal Article•10.1016/J.DATAK.2011.10.005•

A semantically enhanced service repository for user-centric service discovery and management

Jian Yu¹, Quan Z. Sheng², Jun Han¹, Yanbo Wu², Chengfei Liu¹•Institutions (2)

Swinburne University of Technology¹, University of Adelaide²

1 Feb 2012

TL;DR: The design and development of a service repository is discussed, which is at the very core of the OPUCE platform, which consists of a fully functioned XML registry supporting facet-based access to service descriptions, and a novel semantic service browser that supports prosumers who are not technically experienced to explore and discover services in an intuitive and visualized manner.

Abstract: User centricity represents a new trend in the currently flourishing service oriented computing era. By upgrading end-users to prosumers (producer+consumer) and involving them in the process of service creation, both service consumers and service providers can benefit from a cheaper, faster, and better service provisioning. The EU-IST research project OPUCE (Open Platform for User-Centric Service Creation and Execution) aims at building a unique service environment by integrating recent advances in networking, communication and information technology where personalized services can be dynamically created and managed by prosumers. This paper particularly discusses the design and development of a service repository, which is at the very core of the OPUCE platform. The repository consists of two main components: a fully functioned XML registry supporting facet-based access to service descriptions, and a novel semantic service browser that supports prosumers who are not technically experienced to explore and discover services in an intuitive and visualized manner. We demonstrate the benefits of our design by conducting usability and performance studies.

30 citations

Journal Article•10.1016/J.DATAK.2011.09.002•

When conceptual model meets grammar: A dual approach to XML data modeling

Martin Necasky¹, Irena Mlynkova¹, Jakub Klímek¹, Jakub Maly¹•Institutions (1)

Charles University in Prague¹

1 Feb 2012

TL;DR: This paper introduces a novel approach to conceptual modeling for XML schemas, and proves correctness of the introduced translation algorithms between platform-specific and XML schema levels and proves the expressive power of the conceptual schemas.

Abstract: In this paper we introduce a novel approach to conceptual modeling for XML schemas. Compared to other approaches, it allows for modeling of a whole family of XML schemas related to a particular application domain. It is integrated in a well-established way of software-engineering, namely Model-Driven Development (MDD). It allows software-engineers to naturally model their application domain using a conceptual schema at the platform-independent level of the MDD hierarchy. From there they can design the desired XML schemas in the form of conceptual schemas at the platform-specific level of the MDD hierarchy. Schemas at the platform-specific level are then automatically translated to particular XML schemas. Besides this forward-engineering direction, a reverse-engineering direction integrating existing XML schemas into the MDD hierarchy is supported as well. We provide several theoretical results which ensure correctness of the introduced approach. We exploit regular tree grammars to formalize XML schemas. We formalize the bindings between the schemas at the two MDD levels and between schemas at the platform-specific level and XML schemas. We prove that conceptual schemas specify the target XML schemas unambiguously. We also prove the expressive power of the conceptual schemas. And, finally, we prove correctness of the introduced translation algorithms between platform-specific and XML schema levels.

30 citations

Journal Article•10.1016/J.DATAK.2011.11.005•

Discovering better navigation sequences for the session construction problem

Murat Ali Bayir¹, Ismail Hakki Toroslu², Murat Demirbas¹, Ahmet Cosar²•Institutions (2)

University at Buffalo¹, Middle East Technical University²

1 Mar 2012

TL;DR: A novel page view based session model and session construction method to address the Web Usage Mining (WUM) problem and shows that Smart-SRA produces more accurate user sessions than the session construction methods found in the literature.

Abstract: In this paper, we propose a novel page view based session model and session construction method to address the Web Usage Mining (WUM) problem. Unlike the simple session models, where sessions are sequences of web pages requested from the server (or served from a browser/proxy cache) and viewed in the browser (which may not guarantee a direct relationship between subsequent web pages in the session), we define a more realistic session model in which a session is a set of paths traversed in the web graph that corresponds to a user navigation performed by following links on web pages. We define the session construction process from raw server logs as a new graph problem and present a novel algorithm, Smart-SRA (Smart Session Reconstruction Algorithm), to solve this problem efficiently. An experimental evaluation based on data collected from real web access scenarios showed that Smart-SRA produces more accurate user sessions than the session construction methods found in the literature.

26 citations
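
The session model described above requires consecutive pages in a session to be connected by a link in the web graph. The sketch below illustrates only that constraint with a simple greedy split; it is not Smart-SRA, and the function name, the `links` dictionary, and the `max_gap` time threshold are assumptions introduced for this example.

```python
def link_consistent_sessions(requests, links, max_gap):
    """Split one user's requests into sessions that follow links.

    requests: list of (timestamp, page) tuples sorted by timestamp.
    links:    dict mapping a page to the set of pages it links to.
    max_gap:  largest allowed time difference between consecutive pages.
    """
    sessions = []
    current = []
    for timestamp, page in requests:
        if current:
            prev_time, prev_page = current[-1]
            linked = page in links.get(prev_page, set())
            # Start a new session when the link or the time constraint fails.
            if not linked or timestamp - prev_time > max_gap:
                sessions.append([p for _, p in current])
                current = []
        current.append((timestamp, page))
    if current:
        sessions.append([p for _, p in current])
    return sessions
```

For a log in which the user requests home, news, story, shop, and cart in that order, and the story page has no link to the shop page, the function returns two sessions: one ending at story and one starting at shop.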

Journal Article•10.1016/J.DATAK.2011.08.002•

Non-redundant web services composition based on a two-phase algorithm

Joonho Kwon¹, Daewook Lee²•Institutions (2)

Pusan National University¹, Sogang University²

1 Jan 2012

TL;DR: Results of experiments involving data sets with different characteristics show the performance benefits of the NRC techniques in comparison to state-of-the-art composition approaches.

Abstract: Recently, there has been growing interest in developing web services composition search systems. Current solutions have the drawback of including redundant web services in the results. In this paper, we propose a non-redundant web services composition search system called NRC, which is based on a two-phase algorithm. In the NRC system, the Link Index is built over web services according to their connectivity. In the forward phase, the candidate compositions are efficiently found by searching the Link Index. In the backward phase, the candidate compositions are decomposed into several non-redundant web services compositions by using the concept of tokens. Results of experiments involving data sets with different characteristics show the performance benefits of the NRC techniques in comparison to state-of-the-art composition approaches.

Journal Article•10.1016/J.DATAK.2011.10.001•

A safe-exit approach for efficient network-based moving range queries

Duncan Yung¹, Man Lung Yiu¹, Eric Lo¹•Institutions (1)

Hong Kong Polytechnic University¹

1 Feb 2012

TL;DR: This paper formulates a network-based concept called safe exits that guarantees the query result of the client remains unchanged before the client reaches any exit, and develops an efficient algorithm for computing safe exits for a client on-demand.

Abstract: Query processing on road networks has been extensively studied in recent years. However, the processing of moving queries on road networks has received little attention. This paper studies the efficient processing of moving range queries on road networks. We formulate a network-based concept called safe exits that guarantee the query result of the client remains unchanged before the client reaches any exit. This significantly reduces the communication overhead between moving clients and the server. We then develop an efficient algorithm for computing safe exits for a client on-demand. We evaluate the proposed techniques using real road network data. Experimental results show that our algorithm constructs safe exits efficiently and they effectively reduce the communication cost.

Journal Article•10.1109/TKDE.2011.74•

Adding Temporal Constraints to XML Schema

Faiz Currim¹, Sabah Currim¹, Curtis E. Dyreson², Richard T. Snodgrass¹, Stephen W. Thomas³, Rui Zhang¹•Institutions (3)

University of Arizona¹, Utah State University², Queen's University³

1 Aug 2012

TL;DR: This paper describes how to interpret the various integrity constraints defined in XML Schema as sequenced constraints, applicable at each point in time, and adds new variants that apply across time, so-called nonsequenced constraints.

Abstract: If past versions of XML documents are retained, what of the various integrity constraints defined in XML Schema on those documents? This paper describes how to interpret such constraints as sequenced constraints, applicable at each point in time. We also consider how to add new variants that apply across time, so-called nonsequenced constraints. Our approach supports temporal documents that vary over both valid and transaction time, whose schema can vary over transaction time. We do this by replacing the schema with a (possibly time-varying) temporal schema and replacing the document with a temporal document, both of which are upward compatible with conventional XML and with conventional tools like XMLLINT, which we have extended to support the temporal constraints introduced here.

Journal Article•10.1016/J.DATAK.2012.04.002•

Repairing inconsistent dimensions in data warehouses

Monica Caniupan¹, Loreto Bravo², Carlos A. Hurtado³•Institutions (3)

University of the Bío Bío¹, University of Concepción², Adolfo Ibáñez University³

1 Sep 2012

TL;DR: The notion of minimal repair of a dimension is introduced: a new dimension that is consistent with respect to the set of integrity constraints, which is obtained by applying a minimal number of updates to the original dimension.

Abstract: A dimension in a data warehouse (DW) is a set of elements connected by a hierarchical relationship. The elements are used to view summaries of data at different levels of abstraction. In order to support an efficient processing of such summaries, a dimension is usually required to satisfy different classes of integrity constraints. In scenarios where the constraints properly capture the semantics of the DW data, but they are not satisfied by the dimension, the problem of repairing (correcting) the dimension arises. In this paper, we study the problem of repairing a dimension in the context of two main classes of integrity constraints: strictness and covering constraints. We introduce the notion of minimal repair of a dimension: a new dimension that is consistent with respect to the set of integrity constraints, which is obtained by applying a minimal number of updates to the original dimension. We study the complexity of obtaining minimal repairs, and show how they can be characterized using Datalog programs with weak constraints under the stable model semantics.
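
As a rough illustration of what a strictness constraint asks of a dimension, the sketch below flags elements that roll up to more than one element of the same ancestor category. It only detects violations and does not compute the minimal repairs studied in the paper; the data layout (`category_of`, `parents`) and the function names are assumptions made for this example.

```python
from collections import defaultdict


def ancestors(element, parents):
    """Return every element reachable from `element` via roll-up edges."""
    seen = set()
    stack = list(parents.get(element, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def strictness_violations(category_of, parents):
    """List (element, category, reached) triples that break strictness.

    category_of: dict mapping each element to its category name.
    parents:     dict mapping an element to its direct parent elements.
    """
    violations = []
    for element in category_of:
        by_category = defaultdict(set)
        for anc in ancestors(element, parents):
            by_category[category_of[anc]].add(anc)
        for category, reached in by_category.items():
            if len(reached) > 1:  # rolls up to two elements of one category
                violations.append((element, category, sorted(reached)))
    return violations
```

A store that rolls up to both Leeds and York in the City category would be reported, even if both cities roll up to the same region.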

Journal Article•10.1016/J.DATAK.2012.03.002•

Editorial: Large scale instance selection by means of federal instance selection

Aida de Haro-García¹, Nicolás García-Pedrajas¹, Juan Antonio Romero del Castillo¹•Institutions (1)

University of Córdoba (Spain)¹

1 May 2012

TL;DR: This paper presents a methodology for scaling up instance selection algorithms by means of a parallel procedure that performs instance selection on small subsets of the original dataset using a voting scheme.

Abstract: Instance selection is becoming more and more relevant due to the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, many scaling problems are found when the number of instances is hundreds of thousands or millions. Most of the widely used instance selection algorithms are of complexity at least O(n^2), n being the number of instances. When we face very large problems, the scalability becomes an issue, and most of the algorithms are not applicable. This paper presents a methodology for scaling up instance selection algorithms by means of a parallel procedure that performs instance selection on small subsets of the original dataset. The results obtained with the application of instance selection to small subsets are combined using a voting scheme. The method achieves a very good performance in terms of testing error and storage reduction, while the execution time of the process is decreased very significantly. The parallel algorithm also removes any kind of constraint imposed by memory size, as the whole dataset does not need to be stored in memory. The usefulness of our method is shown by an extensive comparison using 35 datasets of medium and large sizes from the UCI Machine Learning Repository. Additionally, our method is applied to eight very large datasets with very good results and fast execution time.
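
The divide-and-vote idea in this abstract can be sketched as follows: partition the data into small disjoint subsets, run a base instance selection algorithm on each subset, repeat over several random partitions, and keep the instances that collect enough votes. The base selector here (keep an instance when its nearest neighbour in the subset has the same label), the parameter names, and the threshold rule are assumptions chosen for brevity, not the configuration used in the paper. Because each subset has at most `subset_size` instances, the quadratic cost of the base selector applies per subset rather than to the whole dataset.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour_select(subset, data):
    """Hypothetical base selector: keep index i if its nearest
    neighbour inside the subset carries the same label."""
    kept = []
    for i in subset:
        others = [j for j in subset if j != i]
        if not others:
            kept.append(i)
            continue
        nearest = min(others, key=lambda j: sq_dist(data[i][0], data[j][0]))
        if data[nearest][1] == data[i][1]:
            kept.append(i)
    return kept


def voted_selection(data, subset_size, rounds, threshold, seed=0):
    """data: list of (features, label). Returns the indices kept by voting."""
    rng = random.Random(seed)
    votes = Counter()
    indices = list(range(len(data)))
    for _ in range(rounds):
        rng.shuffle(indices)  # a fresh random partition in every round
        for start in range(0, len(indices), subset_size):
            subset = indices[start:start + subset_size]
            votes.update(nearest_neighbour_select(subset, data))
    return sorted(i for i in range(len(data)) if votes[i] >= threshold)
```

In a parallel setting each subset could be handed to a separate worker, since the subsets of one round are disjoint and only the vote counts need to be merged.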

Book Chapter•10.1007/978-3-642-34679-8_6•

An Agile Knowledge Discovery in Databases Software Process

Givanildo Santana do Nascimento¹,², Adicinéia Aparecida de Oliveira¹•Institutions (2)

Universidade Federal de Sergipe¹, Petrobras²

21 Nov 2012

TL;DR: The AgileKDD, an agile and disciplined software process for developing systems capable of discovering the knowledge hidden in databases, which was built on top of the Open Unified Process is introduced.

Abstract: In a knowledge society, transforming data into information and knowledge to support the decision-making process is a crucial success factor for all organizations. In this sense, the mission of Software Engineering is to build systems able to process large volumes of data, transform them into relevant knowledge and deliver them to customers, so they can make the right decisions at the right time. However, companies still fail in determining the process model used in their Knowledge Discovery in Databases projects. This article introduces the AgileKDD, an agile and disciplined software process for developing systems capable of discovering the knowledge hidden in databases, which was built on top of the Open Unified Process. A case study shows that AgileKDD can increase the success factor of projects whose goal is to develop Knowledge Discovery in Databases applications.

Book Chapter•10.1007/978-3-642-34679-8_13•

Wikipedia Category Graph and New Intrinsic Information Content Metric for Word Semantic Relatedness Measuring

Mohamed Ali Hadj Taieb, Mohamed Ben Aouicha, Mohamed Tmar, Abdelmajid Ben Hamadou

21 Nov 2012

TL;DR: A new intrinsic information content metric is used with the Wikipedia category graph to measure the semantic relatedness between words; when tested on a common benchmark of similarity ratings, the proposed approach shows a good correlation value compared to other computational models.

Abstract: Computing semantic relatedness is a key component of information retrieval tasks and natural language processing applications. Wikipedia provides a knowledge base for computing word relatedness with more coverage than WordNet. In this paper we use a new intrinsic information content (IC) metric with the Wikipedia category graph (WCG) to measure the semantic relatedness between words. Indeed, we have developed a performed algorithm to extract the categories assigned to a given word from the WCG. Moreover, this extraction strategy is coupled with a new intrinsic information content metric based on the subgraph composed of hypernyms of a given concept. Also, we have developed a process to quantify the information content subgraph. When tested on a common benchmark of similarity ratings, the proposed approach shows a good correlation value compared to other computational models.

Book Chapter•10.1007/978-3-642-34679-8_19•

Certificate-Based Key-Insulated Signature

Haiting Du¹, Jiguo Li¹, Yichen Zhang¹, Li Tao¹, Yuexin Zhang²•Institutions (2)

Hohai University¹, Fujian Normal University²

21 Nov 2012

TL;DR: A certificate-based key-insulated signature scheme is presented, which is proven to be existentially unforgeable against adaptive chosen message attacks in the random oracle model.

Abstract: To reduce the influence of key exposure, we introduce key insulation into certificate-based cryptography and formalize the notion and security model of the certificate-based key-insulated signature scheme. We then present a certificate-based key-insulated signature scheme, which is proven to be existentially unforgeable against adaptive chosen message attacks in the random oracle model.

Journal Article•10.1016/J.DATAK.2011.07.008•

Adaptive optimization for multiple continuous queries

Hong Kyu Park¹, Won Suk Lee¹•Institutions (1)

Yonsei University¹

1 Jan 2012

TL;DR: A new multiple query optimization approach, Adaptive Sharing-based Extended Greedy Optimization Approach (A-SEGO), that traces multiple promising partial plans simultaneously and presents a novel method for sharing the results of common sub-expressions in a set of queries cost-effectively.

Abstract: Because it operates under a strict time constraint, query processing for data streams should be continuous and rapid. To guarantee this constraint, most previous research optimizes the evaluation order of multiple join operations in a set of continuous queries using a greedy optimization strategy, so that the order is re-optimized dynamically at run-time in response to the time-varying characteristics of data streams. However, this method often results in a sub-optimal plan because the greedy strategy traces only the first promising plan. This paper proposes a new multiple query optimization approach, Adaptive Sharing-based Extended Greedy Optimization Approach (A-SEGO), that traces multiple promising partial plans simultaneously. A-SEGO presents a novel method for sharing the results of common sub-expressions in a set of queries cost-effectively. The number of partial plans can be flexibly controlled according to the query processing workload. In addition, to avoid invoking the optimization process too frequently, optimization is performed only when the current execution plan is no longer relatively efficient. A series of experiments are comparatively analyzed to evaluate the performance of the proposed method in various stream environments.

Book Chapter•10.1007/978-3-642-34679-8_11•

Determining pattern similarity in a medical recommender system

Maytiyanin Komkhao¹, Jie Lu², Lichen Zhang³•Institutions (3)

Rolf C. Hagen Group¹, University of Technology, Sydney², Guangdong University of Technology³

21 Nov 2012

TL;DR: For collaborative filtering an incremental algorithm, called W-InCF, is used working with the Mahalanobis distance and fuzzy membership, and fuzzy sets are employed to cope with possible confusion of decision making on overlapping clusters.

Abstract: As recommender systems have proven their effectiveness in other areas, the aim is to transfer this approach to medicine. Particularly, the diagnoses of physicians made in rural hospitals of developing countries, in remote areas or in situations of uncertainty are to be complemented by machine recommendations drawing on large bases of expert knowledge in order to reduce the risk to patients. Recommendation is mainly based on finding known patterns similar to a case under consideration. To search for such patterns in rather large databases, a weighted similarity distance is employed, which is specially derived for medical knowledge. For collaborative filtering an incremental algorithm, called W-InCF, is used working with the Mahalanobis distance and fuzzy membership. W-InCF consists of a learning phase, in which a cluster model of patients’ medical history is constructed incrementally, and a prediction phase, in which the medical pattern of each patient considered is compared with the model to determine the most similar cluster. Fuzzy sets are employed to cope with possible confusion of decision making on overlapping clusters. The degrees of membership in these fuzzy sets are expressed by a weighted Mahalanobis radial basis function, and the weights are derived from risk factors identified by experts. The algorithm is validated using data on cephalopelvic disproportion.

Journal Article•10.1016/J.DATAK.2012.03.001•

PISA: A framework for multiagent classification using argumentation

Maya Wardeh¹, Frans Coenen¹, Trevor J. M. Bench-Capon¹•Institutions (1)

University of Liverpool¹

1 May 2012

TL;DR: Experiments indicate that the operation of PISA is comparable with other classification approaches and that, when operating in groups or in the presence of noise, PISA outperforms such comparable approaches.

Abstract: This paper describes an approach to multi-agent classification using an argumentation from experience paradigm whereby individual agents argue for a given example to be classified with a particular label according to their local data. Arguments are expressed in the form of classification rules which are generated dynamically. As such, each local database can be conceptualised as an experience repository, and the individual classification rules generated from this repository as describing generalisations drawn from this experience. The argumentation process and the supporting data structures are fully described. The process has been implemented in the PISA (Pooling Information from Several Agents) multi-agent framework, which is also fully described. Experiments indicate that the operation of PISA is comparable with other classification approaches and that, when operating in groups or in the presence of noise, PISA outperforms such comparable approaches.

Journal Article•10.1016/J.DATAK.2011.09.006•

Formalization and reasoning about spatial semantic integrity constraints

Loreto Bravo¹, M. Andrea Rodríguez¹•Institutions (1)

University of Concepción¹

1 Feb 2012

TL;DR: It is shown that satisfiability is not tractable in general, and conditions under which it is tractable are provided; algorithms that check whether a set of constraints is satisfiable are given for tractable cases, and conditions under which approximation algorithms can be used are found for intractable cases.

Abstract: A formalization of spatial semantic integrity constraints is fundamental to assess the data quality of spatial databases. This paper presents a formalization of spatial semantic integrity constraints that provides a uniform specification of constraints used in practice. The formalization extends traditional notions of functional and inclusion dependencies to consider spatial attributes. This makes it possible to impose topological relations between spatial attributes and to impose constraints on numerical attributes that depend on spatial attributes. We also study one of the classical problems of integrity constraints: the satisfiability problem, which consists in checking the existence of a non-empty database that satisfies a given set of constraints. In the context of spatial databases, this problem raises the qualitative reasoning problems of topological consistency and realizability of spatial constraints. We show that satisfiability is not tractable in general and provide some conditions under which it is. For tractable cases, we also give algorithms that check whether a set of constraints is satisfiable. For intractable cases, we find conditions under which approximation algorithms can be used.

Journal Article•10.1016/J.DATAK.2011.11.003•

A change detection system for unordered XML data using a relational model

Sathya Sundaram¹, Sanjay Kumar Madria¹•Institutions (1)

Missouri University of Science and Technology¹

1 Feb 2012

TL;DR: An efficient algorithm is proposed (XRel_Change_SQL) for detecting unordered changes between two XML data files stored in XRel as the underlying relational data model, using Structured Query Language (SQL).

Abstract: The dramatic increase in the evolution of XML data available on the Internet requires a change detection system to keep track of important changes occurring during their lifetime. In this paper, we introduce a novel approach for detecting changes between two versions of unordered XML data stored in a traditional relational database using approaches like XRel. Most of the existing work in the area of XML change detection focuses on detecting changes between two versions of XML data by constructing their Document Object Model (DOM) trees and then comparing these two tree structures based on Longest Common Subsequence (LCS) using minimum edit distances. The basic tree comparison approach is not efficient in handling large XML files because (1) an equivalent XML DOM tree will be twice as large as the original document and (2) the entire trees of both versions have to be memory resident during the comparison process. These two issues are constrained by the available main memory. In addition, existing approaches fail to detect changes among versions of XML data stored in relational databases, as reverse mapping is not lossless. We propose an efficient algorithm (XRel_Change_SQL) for detecting unordered changes between two XML data files stored in XRel as the underlying relational data model, using Structured Query Language (SQL). We compare the efficiency and quality of our change detection algorithm with existing XML change detection tools like X-Diff, DeltaXML and XANDY. We provide an experimental evaluation of the results obtained from the benchmark datasets as well as some synthetic datasets to show that our approach is highly scalable and yields much better efficiency and delta quality than the aforementioned approaches and tools.
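
The general idea of detecting unordered changes with SQL instead of in-memory tree comparison can be sketched with Python's built-in `sqlite3` module. This is an illustrative simplification, not XRel_Change_SQL and not the actual XRel schema: the single `node(version, path, value)` table, the sample rows, and the `diff` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified storage: one row per leaf, keyed by its root-to-leaf path.
conn.execute("CREATE TABLE node (version INTEGER, path TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, "/catalog/book/title", "Databases"),
        (1, "/catalog/book/price", "30"),
        (1, "/catalog/book/title", "XML"),
        # Version 2 lists the titles in a different order and changes the price.
        (2, "/catalog/book/title", "XML"),
        (2, "/catalog/book/title", "Databases"),
        (2, "/catalog/book/price", "35"),
    ],
)

def diff(conn, present_in, missing_from):
    """Return (path, value) pairs found in one version but not in the other."""
    sql = """
        SELECT path, value FROM node WHERE version = ?
        EXCEPT
        SELECT path, value FROM node WHERE version = ?
    """
    return sorted(conn.execute(sql, (present_in, missing_from)).fetchall())

deleted = diff(conn, 1, 2)   # in version 1 only
inserted = diff(conn, 2, 1)  # in version 2 only
```

Because `EXCEPT` compares sets of rows, the reordered titles produce no delta, while the price change shows up as one deletion and one insertion. Set semantics also collapse duplicate rows, so a realistic scheme would need node identifiers to tell repeated siblings apart.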

Journal Article•10.1016/J.DATAK.2011.09.003•

S3: Processing tree-pattern XML queries with all logical operators

Sayyed Kamyar Izadi¹, Mostafa S. Haghjoo¹, Theo Härder²•Institutions (2)

Iran University of Science and Technology¹, Kaiserslautern University of Technology²

1 Feb 2012

TL;DR: A new structure, called Evaluation Tree, is used to execute QTPs, and the method is extended to support QTPs having logical operators OR, XOR, and NOT while preventing redundant I/O and QTP matching.

Abstract: XML is a tree-based data representation format which combines data and structure. Therefore, XML queries not only contain predicates to filter data but also refer to relationships between the document elements searched. The elements in an XML query are connected to each other using a tree-pattern structure, called Query Tree Pattern (QTP). Finding the elements of a document that satisfy the given QTP is the main task during query execution. To optimize this processing, we presented two methods in [13]. Instead of directly executing the QTP against the document, our methods first evaluate a guidance structure, called QueryGuide. Using the extracted information, called match pattern, we provided focused document access and minimized the required I/O. However, we only supported the logical operator AND (called AND-QTPs). In this paper, we use a new structure, called Evaluation Tree, to execute QTPs. We also extend our method to support QTPs having logical operators OR, XOR, and NOT. Parsing QTPs into several AND-QTPs is typically assumed to be inefficient. To process QTPs having logical operators OR and NOT, we nevertheless parse them, but we use an efficient method to prevent redundant I/O and QTP matching. This is done by optimizing the selection of match patterns which were derived from the QueryGuide during QTP parsing. As a result, QTP execution is no longer inefficient.

Journal Article•10.1016/J.DATAK.2012.02.001•

Probabilistic Voronoi diagrams for probabilistic moving nearest neighbor queries

Mohammed Eunus Ali¹, Egemen Tanin², Rui Zhang², Ramamohanarao Kotagiri²•Institutions (2)

Bangladesh University of Engineering and Technology¹, University of Melbourne²

1 May 2012

TL;DR: The probabilistic Voronoi diagram (PVD) is proposed for processing moving nearest neighbor queries on uncertain data, namely the probabilistic moving nearest neighbor (PMNN) queries.

Abstract: A large spectrum of applications, such as location-based services and environmental monitoring, demand efficient query processing on uncertain databases. In this paper, we propose the probabilistic Voronoi diagram (PVD) for processing moving nearest neighbor queries on uncertain data, namely the probabilistic moving nearest neighbor (PMNN) queries. A PMNN query continuously finds the most probable nearest neighbor of a moving query point. To process PMNN queries efficiently, we provide two techniques: a pre-computation approach and an incremental approach. In the pre-computation approach, we develop an algorithm to efficiently evaluate PMNN queries based on the pre-computed PVD for the entire data set. In the incremental approach, we propose an incremental probabilistic safe region based technique that does not require pre-computing the whole PVD to answer the PMNN query. In this incremental approach, we exploit the knowledge of a known region to compute the lower bound of the probability of an object being the nearest neighbor. Experimental results show that our approaches outperform a sampling-based approach by orders of magnitude in terms of I/O, query processing time, and communication overheads.
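
To make the query semantics concrete, the sketch below estimates, by Monte Carlo sampling, which uncertain object is most probably the nearest neighbor at several positions of a moving query point. It is in the spirit of the sampling-based baseline the abstract mentions, not the PVD or the safe-region technique. The uncertainty model (each object uniformly distributed in a disk), the object data, and the function names are assumptions for illustration.

```python
import math
import random

def sample_in_disk(cx, cy, radius, rng):
    """Draw a point uniformly from the disk centered at (cx, cy)."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return cx + r * math.cos(theta), cy + r * math.sin(theta)

def nn_probabilities(query, objects, n_samples=2000, seed=0):
    """Estimate, for each uncertain object, the probability of being the
    nearest neighbor of `query` (fraction of sampled worlds it wins)."""
    rng = random.Random(seed)
    wins = {name: 0 for name in objects}
    for _ in range(n_samples):
        best_name, best_dist = None, math.inf
        for name, (cx, cy, radius) in objects.items():
            x, y = sample_in_disk(cx, cy, radius, rng)
            d = math.hypot(x - query[0], y - query[1])
            if d < best_dist:
                best_name, best_dist = name, d
        wins[best_name] += 1
    return {name: count / n_samples for name, count in wins.items()}

# Hypothetical uncertain objects: (center_x, center_y, uncertainty_radius).
objects = {"A": (0.0, 0.0, 1.0), "B": (10.0, 0.0, 1.0)}
# Positions of the moving query point along its trajectory.
path = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]

answers = []
for q in path:
    probs = nn_probabilities(q, objects)
    answers.append(max(probs, key=probs.get))
```

Re-running the sampler at every position is exactly the kind of repeated work that a precomputed diagram or a safe region is meant to avoid: as long as the query point stays inside a region where the most probable nearest neighbor cannot change, no new evaluation is needed.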

Journal Article•10.1016/J.DATAK.2012.09.001•

A model driven approach for the development of metadata editors, applicability to the annotation of geographic information resources

Javier Nogueras-Iso¹, Miguel Ángel Latre¹, Rubén Béjar¹, Pedro R. Muro-Medrano¹, F. Javier Zarazaga-Soria¹•Institutions (1)

University of Zaragoza¹

1 Nov 2012

TL;DR: This work proposes a model driven approach for the development of metadata editors, more focused on the generic treatment of metadata models than on the development of specific edition forms for a reduced set of metadata standards.

Abstract: Metadata are a key element for the development of information infrastructures because they facilitate the semantic description of contents and services. However, the diversity and heterogeneity of metadata standards have become a barrier for the generation of these metadata. Many metadata editors are no longer useful because they do not support the latest version of metadata standards or the new profiles that have arisen in the market. Thus, this work proposes a model driven approach for the development of metadata editors, more focused on the generic treatment of metadata models than on the development of specific edition forms for a reduced set of metadata standards. This approach has been tested in the context of Spatial Data Infrastructures for the development of an Open Source tool called CatMDEdit. Additionally, the approach could also be applied to improve the efficiency of any metadata editor using a metamodeling development strategy.

Journal Article•10.1016/J.DATAK.2011.10.002•

Knowledge hiding from tree and graph databases

Osman Abul¹, Harun Gökçe¹•Institutions (1)

TOBB University of Economics and Technology¹

1 Feb 2012

TL;DR: This work addresses the knowledge hiding problem in the context of tree and graph databases, for which efficient frequent pattern mining algorithms already exist, and develops appropriate sanitization techniques to protect the privacy of the sensitive patterns.

Abstract: Sensitive knowledge hiding is the problem of removing sensitive knowledge from databases before publishing. The problem has been extensively studied in the context of relational databases to hide frequent itemsets and association rules. Recently, sequential pattern hiding from sequential (both sequence and spatio-temporal) databases has been investigated [1]. With ever-increasing and versatile application demands, new forms of knowledge and databases should be addressed as well. In this work, we address the knowledge hiding problem in the context of tree and graph databases. For these databases, efficient frequent pattern mining algorithms have already been developed in the literature. Since some of the discovered patterns may be considered sensitive, we develop appropriate sanitization techniques to protect the privacy of the sensitive patterns.

Journal Article•10.1016/J.DATAK.2011.09.008•

Editorial: Narrative-based taxonomy distillation for effective indexing of text collections

Mario Cataldi, K. Selçuk Candan¹, Maria Luisa Sapino•Institutions (1)

Arizona State University¹

1 Feb 2012

TL;DR: This paper proposes A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts and provides user studies that show that the proposed algorithm is able to adapt the taxonomy into a new compact and understandable structure.

Abstract: Taxonomies embody formalized knowledge and define aggregations between concepts/categories in a given domain, facilitating the organization of the data and making the contents easily accessible to the users. Since taxonomies have significant roles in data annotation, search and navigation, they are often carefully engineered. However, especially in domains such as news, where content dynamically evolves, they do not necessarily reflect the content knowledge. Thus, in this paper, we ask and answer, in the positive, the following question: ''is it possible to efficiently and effectively adapt a given taxonomy to a usage context defined by a corpus of documents?'' In particular, we recognize that the primary role of a taxonomy is to describe or narrate the natural relationships between concepts in a given document corpus. Therefore, a corpus-aware adaptation of a taxonomy should essentially distill the structure of the existing taxonomy by appropriately segmenting and, if needed, summarizing this narrative relative to the content of the corpus. Based on this key observation, we propose A Narrative Interpretation of Taxonomies for their Adaptation (ANITA) for re-structuring existing taxonomies to varying application contexts, and we evaluate the proposed scheme using different text collections. Finally, we provide user studies that show that the proposed algorithm is able to adapt the taxonomy into a new compact and understandable structure.

Journal Article•10.1016/J.DATAK.2012.04.001•

Editorial: Occupation inference through detection and classification of biographical activities

Elena Filatova¹, John M. Prager²•Institutions (2)

Fordham University¹, IBM²

1 Jun 2012

TL;DR: This paper uses the obtained occupation-related activities as features for a multi-class SVM classifier to identify the occupation of a previously unseen individual, and shows that the activities automatically obtained from text can be used as features not only for a classification task but for a clustering task as well.

Abstract: Dealing with biographical information (e.g., biography generation, answering biography-related questions, etc.) requires the identification of important activities in the life of the individual in question. While there are activities that can be used in any biography (e.g., person was born on a particular date, person lived in a particular location, etc.), many activities used in biographies tend to be occupation-related, while others are person-specific. Hence, occupation gives important clues as to which activities should be included in the biography. In this paper, we present a methodology for identifying a three-level hierarchy of biographical activities: those activities that apply to the general population, those activities that are occupation-related, and those activities that are person-specific. We use the obtained occupation-related activities as features for a multi-class SVM classifier to identify the occupation of a previously unseen individual. We also show that the activities automatically obtained from text can be used as features not only for a classification task but for a clustering task as well. We show that, given the correct number of clusters, people belonging to the same occupation are clustered together. At the same time, clustering people into a smaller number of classes allows the grouping of practitioners of occupations that share a considerable number of occupation-related activities. Thus, by analyzing descriptions of people belonging to various occupations, we can build a hierarchy of occupations.

Book Chapter•10.1007/978-3-642-34679-8_15•

Retrieving Information from Microblog Using Pattern Mining and Relevance Feedback

Cher Han Lau¹, Xiaohui Tao², Dian Tjondronegoro¹, Yuefeng Li¹•Institutions (2)

Queensland University of Technology¹, University of Southern Queensland²

21 Nov 2012

TL;DR: This paper presents an innovative framework to address the issue of performing IR in microblogs, and shows that the proposed approach significantly outperforms the term-based methods Okapi BM25 and TF-IDF as well as pattern-based methods, using precision, recall and F measures.

Abstract: Retrieving information from Twitter is always challenging due to its large volume, inconsistent writing and noise. Most existing information retrieval (IR) and text mining methods focus on term-based approaches, but suffer from the problems of term variation such as polysemy and synonymy. These problems worsen when such methods are applied to Twitter due to the length limit. Over the years, people have held the hypothesis that pattern-based methods should perform better than term-based methods as they provide more context, but limited studies have been conducted to support this hypothesis, especially on Twitter. This paper presents an innovative framework to address the issue of performing IR in microblogs. The proposed framework discovers patterns in tweets as higher-level features to assign weights to low-level features (i.e. terms) based on their distributions in the higher-level features. We present experimental results based on the TREC11 microblog dataset and show that our proposed approach significantly outperforms the term-based methods Okapi BM25 and TF-IDF as well as pattern-based methods, using precision, recall and F measures.
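
The idea of using discovered patterns to weight individual terms can be sketched as follows. This is an assumption-heavy illustration, not the framework from the paper: patterns are restricted to frequent term pairs, a term's weight is taken as each pattern's support divided evenly among its terms, and the sample tweets and function names are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(docs, min_support):
    """Mine term pairs that co-occur in at least `min_support` documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))
        counts.update(combinations(terms, 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

def term_weights(patterns):
    """Distribute each pattern's support evenly over its terms (assumed scheme)."""
    weights = Counter()
    for pattern, support in patterns.items():
        for term in pattern:
            weights[term] += support / len(pattern)
    return dict(weights)

def score(doc, weights):
    """Score a document by summing the weights of its distinct terms."""
    return sum(weights.get(t, 0.0) for t in set(doc.split()))

# Hypothetical relevance-feedback tweets for one topic.
relevant = ["flash flood warning", "flood warning issued", "flood warning river"]
patterns = frequent_pairs(relevant, min_support=2)
weights = term_weights(patterns)

on_topic = score("flood warning tonight", weights)
off_topic = score("river cruise", weights)
```

Here only the pair ("flood", "warning") is frequent, so a term such as "river", which appears in a relevant tweet but in no frequent pattern, receives no weight. That is one way context from patterns can filter out incidental terms.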
