Top 41 papers presented at Data and Knowledge Engineering in 2015

Journal Article•10.1016/J.DATAK.2015.07.010•

The Baquara2 knowledge-based framework for semantic enrichment and analysis of movement data

Renato Fileto¹, Cleto May¹, Chiara Renso², Nikos Pelekis³, Douglas Klein¹, Yannis Theodoridis³•Institutions (3)

Universidade Federal de Santa Catarina¹, Istituto di Scienza e Tecnologie dell'Informazione², University of Piraeus³

1 Jul 2015

TL;DR: The Baquara2 framework provides an ontological model for structuring and abstracting movement data in a multilevel hierarchy of progressively detailed movement segments that generalize concepts such as trajectories, stops, and moves and enables queries for movement analyses based on application and domain specific knowledge.

Abstract: The analysis of movements frequently requires more than just spatio-temporal data. Thus, despite recent progresses in trajectory handling, there is still a gap between movement data and formal semantics. This gap hinders movement analyses benefiting from available knowledge, with well-defined and widely agreed semantics. This article describes the Baquara2 framework to help narrow this gap by exploiting knowledge bases to semantically enrich and analyze movement data. It provides an ontological model for structuring and abstracting movement data in a multilevel hierarchy of progressively detailed movement segments that generalize concepts such as trajectories, stops, and moves. Baquara2 also includes a general customizable process to annotate movement data with concepts and objects described in ontologies and Linked Open Data (LOD) collections. The resulting semantic annotations enable queries for movement analyses based on application and domain specific knowledge. The proposed framework has been used in experiments to semantically enrich movement data collected from social media with geo-referenced LOD. The obtained results enable powerful queries that illustrate Baquara2 capabilities.

73 citations

Journal Article•10.1016/J.DATAK.2015.07.007•

Modelling and reasoning about security requirements in socio-technical systems

Elda Paja¹, Fabiano Dalpiaz², Paolo Giorgini¹•Institutions (2)

University of Trento¹, Utrecht University²

1 Jul 2015

TL;DR: This paper proposes the STS approach for modelling and reasoning about security requirements, and applies it to a case study about e-Government, and reports on promising scalability results of the implementation.

Abstract: Modern software systems operate within the context of larger socio-technical systems, wherein they interact, by exchanging data and outsourcing tasks, with other technical components, humans, and organisations. When interacting, these components (actors) operate autonomously; as such, they may disclose confidential information without being authorised, wreck the integrity of private data, rely on untrusted third parties, etc. Thus, the design of a secure software system shall begin with a thorough analysis of its socio-technical context, thereby considering not only technical attacks, but also social and organisational ones. In this paper, we propose the STS approach for modelling and reasoning about security requirements. In STS, security requirements are specified, via the STS-ml requirements modelling language, as contracts that constrain the interactions among the actors in the socio-technical system. The requirements models of STS-ml have a formal semantics which enables automated reasoning for detecting possible conflicts among security requirements as well as conflicts between security requirements and actors' business policies. We apply STS to a case study about e-Government, and report on promising scalability results of our implementation.

69 citations

Journal Article•10.1016/J.DATAK.2015.06.009•

An incremental approach to attribute reduction from dynamic incomplete decision systems in rough set theory

Wenhao Shu¹, Wenbin Qian•Institutions (1)

East China Jiaotong University¹

1 Nov 2015

TL;DR: Compared with other attribute reduction algorithms, the proposed algorithms can effectively reduce the time required for reduct computations without losing the classification performance.

Abstract: Attribute reduction is an important preprocessing step in data mining and knowledge discovery. The effective computation of an attribute reduct has a direct bearing on the efficiency of knowledge acquisition and various related tasks. In real-world applications, some attribute values for an object may be incomplete and an object set may vary dynamically in the knowledge representation systems, also called decision systems in rough set theory. There are relatively few studies on attribute reduction in such systems. This paper mainly focuses on this issue. For the immigration and emigration of a single object in the incomplete decision system, an incremental attribute reduction algorithm is developed to compute a new attribute reduct, rather than to obtain the dynamic system as a new one that has to be computed from scratch. In particular, for the immigration and emigration of multiple objects in the system, another incremental reduction algorithm guarantees that a new attribute reduct can be computed on the fly, which avoids some re-computations. Compared with other attribute reduction algorithms, the proposed algorithms can effectively reduce the time required for reduct computations without losing the classification performance. Experiments on different real-life data sets are conducted to test and demonstrate the efficiency and effectiveness of the proposed algorithms.

61 citations

Journal Article•10.1016/J.DATAK.2015.02.001•

Efficient mining of platoon patterns in trajectory databases

Yuxuan Li¹, James Bailey¹, Lars Kulik¹•Institutions (1)

University of Melbourne¹

1 Nov 2015

TL;DR: This work proposes a novel algorithm to efficiently retrieve platoon patterns in large trajectory databases, using several pruning techniques, and demonstrates that the algorithm is able to achieve several orders of magnitude improvement in running time, compared to an existing method for retrieving moving object clusters.

Abstract: The widespread use of localization technologies produces increasing quantities of trajectory data. An important task in the analysis of trajectory data is the discovery of moving object clusters, i.e., moving objects that travel together for a period of time. Algorithms for the discovery of moving object clusters operate by applying constraints on the consecutiveness of timestamps. However, existing approaches either use a very strict timestamp constraint, which may result in the loss of interesting patterns, or a very relaxed timestamp constraint, which risks discovering noisy patterns. To address this challenge, we introduce a new type of moving object pattern called the platoon pattern. We propose a novel algorithm to efficiently retrieve platoon patterns in large trajectory databases, using several pruning techniques. Our experiments on both real data and synthetic data evaluate the effectiveness and efficiency of our approach and demonstrate that our algorithm is able to achieve several orders of magnitude improvement in running time, compared to an existing method for retrieving moving object clusters.

60 citations

Journal Article•10.1016/J.DATAK.2014.11.004•

Design of computationally efficient density-based clustering algorithms

Satyasai Jagannath Nanda¹,², Ganapati Panda¹,²•Institutions (2)

Malaviya National Institute of Technology, Jaipur¹, Indian Institute of Technology Bhubaneswar²

1 Jan 2015

TL;DR: A new strategy to reduce the computational complexity associated with the DBSCAN is proposed by efficiently implementing new merging criteria at the initial stage of evolution of clusters by considering correlation coefficient as similarity measure.

Abstract: The basic DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm uses minimum number of input parameters, very effective to cluster large spatial databases but involves more computational complexity. The present paper proposes a new strategy to reduce the computational complexity associated with the DBSCAN by efficiently implementing new merging criteria at the initial stage of evolution of clusters. Further new density based clustering (DBC) algorithms are proposed considering correlation coefficient as similarity measure. These algorithms though computationally not efficient, found to be effective when there is high similarity between patterns of dataset. The computations associated with DBC based on correlation algorithms are reduced with new cluster merging criteria. Test on several synthetic and real datasets demonstrates that these computationally efficient algorithms are comparable in accuracy to the traditional one. An interesting application of the proposed algorithm has been demonstrated to identify the regional hazard regions present in the seismic catalog of Japan.

43 citations

Journal Article•10.1016/J.DATAK.2015.06.004•

Ontological anti-patterns

Tiago Prince Sales¹, Giancarlo Guizzardi¹•Institutions (1)

Universidade Federal do Espírito Santo¹

1 Sep 2015

TL;DR: A computational tool is presented that is able to automatically identify these anti-patterns in user's models, guide users in assessing their consequences, and generate corrections to these models by the automatic inclusion of OCL constraints implementing the proposed refactoring plans.

Abstract: The construction of large-scale reference conceptual models is a complex engineering activity. To develop high-quality models, a modeler must have the support of expressive engineering tools such as theoretically well-founded modeling languages and methodologies, patterns and anti-patterns and automated supporting environments. This paper proposes a set of Ontological Anti-Patterns for Ontology-Driven Conceptual Modeling. These anti-patterns capture error-prone modeling decisions that can result in the creation of models that fail to exclude unintended model instances (representing unintended state of affairs) or forbid intended ones (representing intended states of affairs). The anti-patterns presented here have been empirically elicited through an approach of conceptual models validation via visual simulation. The paper also presents a series of refactoring plans for rectifying the models in which these anti-patterns occur. In addition, we present here a computational tool that is able to: automatically identify these anti-patterns in user's models, guide users in assessing their consequences, and generate corrections to these models by the automatic inclusion of OCL constraints implementing the proposed refactoring plans. Finally, the paper also presents an empirical study for assessing the harmfulness of each of the uncovered anti-patterns (i.e., the likelihood that its occurrence in a model entails unintended consequences) as well as the effectiveness of the proposed refactoring plans.

40 citations

Journal Article•10.1016/J.DATAK.2015.06.007•

Adoption of OSS components

Lidia López¹, Dolors Costal¹, Claudia P. Ayala¹, Xavier Franch¹, Maria Carmela Annosi², Ruediger Glott, Kirsten Haaland•Institutions (2)

Polytechnic University of Catalonia¹, Ericsson²

1 Sep 2015

TL;DR: This paper proposes to model OSS adoption strategies using a goal-oriented notation, in which different actors state their objectives and dependencies on each other, and introduces the notion of model coverage, which allows to measure the degree of concordance among every strategy with the model of the organization by comparing the respective models.

Abstract: Open Source Software (OSS) has become a strategic asset for a number of reasons, such as short time-to-market software delivery, reduced development and maintenance costs, and its customization capabilities. Therefore, organizations are increasingly becoming OSS adopters, either as a result of a strategic decision or because it is almost unavoidable nowadays, given the fact that most commercial software also relies at some extent in OSS infrastructure. The way in which organizations adopt OSS affects and shapes their businesses. Therefore, knowing the impact of different OSS adoption strategies in the context of an organization may help improving the processes undertaken inside this organization and ultimately pave the road to strategic moves. In this paper, we propose to model OSS adoption strategies using a goal-oriented notation, in which different actors state their objectives and dependencies on each other. These models describe the consequences of adopting one such strategy or another: which are the strategic and operational goals that are supported, which are the resources that emerge, etc. The models rely on an OSS ontology, built upon a systematic literature review, which comprises the activities and resources that characterize these strategies. Different OSS adoption strategy models arrange these ontology elements in diverse ways. In order to assess which is the OSS adoption strategy that better fits the organization needs, the notion of model coverage is introduced, which allows to measure the degree of concordance among every strategy with the model of the organization by comparing the respective models. The approach is illustrated with an example of application in a big telecommunications company.

27 citations

Journal Article•10.1016/J.DATAK.2015.06.012•

Hiding outliers into crowd

Hui Wang¹, Ruilin Liu¹•Institutions (1)

Stevens Institute of Technology¹

1 Nov 2015

TL;DR: This paper defines the distinguishability-based attack by which the adversary can identify outliers and reveal their private information from an anonymized dataset, and designs efficient algorithms to anonymize the dataset to achieve plain ℓ-diversity.

Abstract: In recent years, many organizations publish their data in non-aggregated format for research purpose. However, publishing non-aggregated data raises serious concerns in data privacy. One of the concerns is that when outliers exist in the dataset, they are easier to be distinguished from the crowd and their privacy is prone to be compromised. In this paper, we study the problem of privacy-preserving publishing datasets that contain outliers. We define the distinguishability-based attack by which the adversary can identify outliers and reveal their private information from an anonymized dataset. We show that the existing syntactic privacy models (e.g., k-anonymity and ℓ-diversity) cannot defend against the distinguishability-based attack. We define the plain ℓ-diversity to provide privacy guarantee to outliers against the distinguishability-based attack, and design efficient algorithms to anonymize the dataset to achieve plain ℓ-diversity with low information loss. We extend our anonymization approach to deal with continuous release of a series of datasets that contain outliers. Our experiments demonstrate the efficiency and effectiveness of our approaches.

23 citations

Journal Article•10.1016/J.DATAK.2015.06.008•

A hybrid possibilistic approach for Arabic full morphological disambiguation

Ibrahim Bounhas¹, Raja Ayed², Bilel Elayeb²,³, Narjès Bellamine Ben Saoud²,⁴•Institutions (4)

Carthage College¹, Manouba University², Emirates College of Technology³, Tunis El Manar University⁴

1 Nov 2015

TL;DR: This paper investigates new approaches to disambiguate the morphological features of non-vocalized Arabic texts, combining statistical classification and linguistic rules, and presents an approach dealing with unknown (Out-of-Vocabulary) words.

Abstract: Morphological ambiguity is an important phenomenon affecting several tasks in Arabic text analysis, indexing and mining. Nevertheless, it has not been well studied in related works. We investigate, in this paper, new approaches to disambiguate the morphological features of non-vocalized Arabic texts, combining statistical classification and linguistic rules. Indeed, we perform unsupervised training from unlabelled vocalized Arabic corpora. Thus, the training and testing sets contain imperfect instances (i.e. having ambiguous attributes and/or classes). To handle imperfect data, we compare two approaches: i) a possibilistic approach allowing to handle imperfection in a direct manner; and, ii) a data transformation-based approach permitting to convert an imperfect dataset to a perfect one, thus allowing to exploit classical classifiers. We also present an approach dealing with unknown (Out-of-Vocabulary) words. The experiments focus mainly on classical texts, which were not sufficiently studied in related works. We show that the possibilistic approach performs better than the transformation-based one. Besides, we report encouraging results as far as i) the role of linguistic rules in enhancing the disambiguation rates; and, ii) the accuracy of our approach for full morphological disambiguation of unknown words.

21 citations

Journal Article•10.1016/J.DATAK.2015.06.002•

Cardinality constraints on qualitatively uncertain data

Neil Hall¹, Henning Koehler², Sebastian Link¹, Henri Prade³, Xiaofang Zhou⁴•Institutions (4)

University of Auckland¹, Massey University², University of Toulouse³, Soochow University (Suzhou)⁴

1 Sep 2015

TL;DR: This work describes the associated implication problem axiomatically and algorithmically in linear input time, and shows how to visualize any given set of cardinality constraints in the form of an Armstrong sketch.

Abstract: Modern applications require advanced techniques and tools to process large volumes of uncertain data. For that purpose we introduce cardinality constraints as a principled tool to control the occurrences of uncertain data. Uncertainty is modeled qualitatively by assigning to each object a degree of possibility by which the object occurs in an uncertain instance. Cardinality constraints are assigned a degree of certainty that stipulates on which objects they hold. Our framework empowers users to model uncertainty in an intuitive way, without the requirement to put a precise value on it. Our class of cardinality constraints enjoys a natural possible world semantics, which is exploited to establish several tools to reason about them. We characterize the associated implication problem axiomatically and algorithmically in linear input time. Furthermore, we show how to visualize any given set of our cardinality constraints in the form of an Armstrong sketch. Even though the problem of finding an Armstrong sketch is precisely exponential, our algorithm computes a sketch with conservative use of time and space. Data engineers may therefore compute Armstrong sketches that they can jointly inspect with domain experts in order to consolidate the set of cardinality constraints meaningful for a given application domain. Cardinality constraints on qualitatively uncertain data are introduced. The constraints help control the number of occurrences of uncertain data. This ability has applications in integrity enforcement and query processing. Axiomatic and algorithmic solutions are given for their implication problem. Armstrong sketches finitely represent possibly infinite Armstrong samples.

21 citations

Journal Article•10.1016/J.DATAK.2015.07.008•

Improving business process intelligence by observing object state transitions

Nico Herzberg¹, Andreas Meyer¹, Mathias Weske¹•Institutions (1)

Hasso Plattner Institute¹

1 Jul 2015

TL;DR: This paper uses object state transitions as additional monitoring information, so-called object state transition events, based on the enablement and termination of activities and provides the basis for process monitoring and analysis in terms of a large event log.

Abstract: During the execution of business processes several events happen that are recorded in the company's information systems. These events deliver insights into process executions so that process monitoring and analysis can be performed resulting, for instance, in prediction of upcoming process steps or the analysis of the run time of single steps. While event capturing is trivial when a process engine with integrated logging capabilities is used, manual process execution environments do not provide automatic logging of events, so that typically external devices, like bar code scanners, have to be used. As experience shows, these manual steps are error-prone and induce additional work. Therefore, we use object state transitions as additional monitoring information, so-called object state transition events. Based on these object state transition events, we reason about the enablement and termination of activities and provide the basis for process monitoring and analysis in terms of a large event log. In this paper, we present the concept to utilize information from these object state transition events for capturing process progress. Furthermore, we discuss a methodology to create the required design time artifacts that then are used for monitoring at run time. In a proof-of-concept implementation, we show how the design time and run time side work and prove applicability of the introduced concept of object state transition events.

Journal Article•10.1016/J.DATAK.2015.05.005•

A novel methodology for retrieving infographics utilizing structure and message content

Zhuo Li¹, Sandra Carberry¹, Hui Fang¹, Kathleen F. McCoy¹, Kelly Peterson¹, Matthew Stagitis¹•Institutions (1)

University of Delaware¹

1 Nov 2015

TL;DR: A novel methodology for retrieving infographics from a digital library that takes into account a graphic's structural and message content is presented, and it significantly outperforms a baseline method that treats queries and graphics as bags of words.

Abstract: Information graphics (infographics) in popular media are highly structured knowledge representations that are generally designed to convey an intended message. This paper presents a novel methodology for retrieving infographics from a digital library that takes into account a graphic's structural and message content. The retrieval methodology can be summarized thus: 1) hypothesize requisite structural and message content from a natural language query, 2) measure the relevance of each candidate infographic to the requisite structural and message content hypothesized from the user query, and 3) integrate these relevance measurements via a linear combination model in order to produce a ranked list of infographics in response to the user query. The methodology has been implemented and evaluated, and it significantly outperforms a baseline method that treats queries and graphics as bags of words.

Journal Article•10.1016/J.DATAK.2015.07.003•

Ontology-based mappings

Giansalvatore Mecca, Guillem Rull¹, Donatello Santoro, Ernest Teniente²•Institutions (2)

University of Barcelona¹, Polytechnic University of Catalonia²

1 Jul 2015

TL;DR: A translation algorithm is developed that automatically rewrites a mapping from the source schema to the target ontology into an equivalent mapping from the source to the target databases.

Abstract: Data translation consists of the task of moving data from a source database to a target database. This task is usually performed by developing mappings, i.e. executable transformations from the source to the target schema. However, a richer description of the target database semantics may be available in the form of an ontology. This is typically defined as a set of views over the base tables that provides a unified conceptual view of the underlying data. We investigate how the mapping process changes when such a rich conceptualization of the target database is available. We develop a translation algorithm that automatically rewrites a mapping from the source schema to the target ontology into an equivalent mapping from the source to the target databases. Then, we show how to handle this problem when an ontology is available also for the source. Differently from previous approaches, the language we use in view definitions has the full power of non-recursive Datalog with negation. In the paper, we study the implications of adopting such an expressive language. Experiments are conducted to illustrate the trade-off between expressibility of the view language and efficiency of the chase engine used to perform the data exchange.

Journal Article•10.1016/J.DATAK.2015.06.006•

Computing repairs for constraint violations in UML/OCL conceptual schemas

Xavier Oriol¹, Ernest Teniente¹, Albert Tort•Institutions (1)

Polytechnic University of Catalonia¹

1 Sep 2015

TL;DR: This work follows here an alternative approach aimed at automatically computing the repairs of an update, i.e., the minimum additional changes that, when applied together with the requested update, bring the information base to a new state where all constraints are satisfied.

Abstract: Updating the contents of an information base may violate some of the constraints defined over the schema. The classical way to deal with this problem has been to reject the requested update when its application would lead to some constraint violation. We follow here an alternative approach aimed at automatically computing the repairs of an update, i.e., the minimum additional changes that, when applied together with the requested update, bring the information base to a new state where all constraints are satisfied. Our approach is independent of the language used to define the schema and the constraints, since it is based on a logic formalization of both, although we apply it to UML and OCL because they are widely used in the conceptual modeling community. Our method can be used for maintaining the consistency of an information base after the application of some update, and also for dealing with the problem of fixing up non-executable operations. The fragment of OCL that we use to define the constraints has the same expressiveness as relational algebra and we also identify a subset of it which provides some nice properties in the repair-computation process. Experiments are conducted to analyze the efficiency of our approach.

Journal Article•10.1016/J.DATAK.2015.04.003•

Towards accurate predictors of word quality for Machine Translation

Ngoc Quang Luong, Laurent Besacier, Benjamin Lecouteux

1 Mar 2015

TL;DR: A method that combines multiple "weak" classifiers to constitute a strong "composite" classifier by taking advantage of their complementarity allows us to achieve a significant improvement in terms of F-score, for both fr-en and en-es systems.

Abstract: This paper proposes some ideas to build effective estimators, which predict the quality of words in a Machine Translation (MT) output. We propose a number of novel features of various types (system-based, lexical, syntactic and semantic) and then integrate them into the conventional (previously used) feature set, for our baseline classifier training. The classifiers are built over two different bilingual corpora: French-English (fr-en) and English-Spanish (en-es). After the experiments with all features, we deploy a "Feature Selection" strategy to filter the best performing ones. Then, a method that combines multiple "weak" classifiers to constitute a strong "composite" classifier by taking advantage of their complementarity allows us to achieve a significant improvement in terms of F-score, for both fr-en and en-es systems. Finally, we exploit word confidence scores for improving the quality estimation system at sentence level.

Journal Article•10.1016/J.DATAK.2015.05.004•

Extraction and clustering of arguing expressions in contentious text

Amine Trabelsi¹, Osmar R. Zaïane¹•Institutions (1)

University of Alberta¹

1 Nov 2015

TL;DR: A Joint Topic Viewpoint (JTV) probabilistic model to analyze the underlying divergent arguing expressions that may be present in a collection of contentious documents and empirically demonstrates a better clustering of arguing expressions over state-of-the-art and baseline methods.

Abstract: This work proposes an unsupervised method intended to enhance the quality of opinion mining in contentious text. It presents a Joint Topic Viewpoint (JTV) probabilistic model to analyze the underlying divergent arguing expressions that may be present in a collection of contentious documents. The conceived JTV has the potential of automatically carrying the tasks of extracting associated terms denoting an arguing expression, according to the hidden topics it discusses and the embedded viewpoint it voices. Furthermore, JTV's structure enables the unsupervised grouping of obtained arguing expressions according to their viewpoints, using a proposed constrained clustering algorithm which is an adapted version of the constrained k-means clustering (COP-KMEANS). Experiments are conducted on three types of contentious documents (polls, online debates and editorials), through six different contentious data sets. Quantitative evaluations of the topic modeling output, as well as the constrained clustering results show the effectiveness of the proposed method to fit the data and generate distinctive patterns of arguing expressions. Moreover, it empirically demonstrates a better clustering of arguing expressions over state-of-the-art and baseline methods. The qualitative analysis highlights the coherence of clustered arguing expressions of the same viewpoint and the divergence of opposing ones.

Journal Article•10.1016/J.DATAK.2015.09.003•

Hiding multiple solutions in a hard 3-SAT formula

Ran Liu¹, Wenjian Luo¹, Lihua Yue¹•Institutions (1)

University of Science and Technology of China¹

1 Nov 2015

TL;DR: The objective of this paper is to propose algorithms which could cancel the attraction to the multiple predefined solutions simultaneously, and the core element of these proposed algorithms is misguiding the SAT solvers with local search strategy to the reverse direction of the centre solution of the multiple predefined solutions.

Abstract: Hiding solutions in 3-SAT formulas can be used in privacy protection and data security. Although the typical q-hidden algorithm could cancel the attraction to the unique predefined solution, and generate deceptive 3-SAT formulas with unique predefined solution, few works have mentioned that with multiple predefined solutions. Therefore, the objective of this paper is to propose algorithms which could cancel the attraction to the multiple predefined solutions simultaneously. The core element of these proposed algorithms is misguiding the SAT solvers with local search strategy to the reverse direction of the centre solution of the multiple predefined solutions, so that the attraction to the multiple predefined solutions can be cancelled simultaneously. Experimental results verify the behaviour of the two classical SAT solvers: the SAT solvers with local search strategy (such as WalkSAT) and that with DPLL strategy (such as zChaff). And a real-world application is introduced based on the proposed algorithm.

Journal Article•10.1016/J.DATAK.2015.04.006•

Fast updated frequent-itemset lattice for transaction deletion

Bay Vo¹, Tuong Le², Tzung-Pei Hong³, Bac Le•Institutions (3)

Ho Chi Minh City University of Technology¹, Ton Duc Thang University², National Sun Yat-sen University³

1 Mar 2015

TL;DR: This paper proposes an approach for maintaining FILs for transaction deletion without rescanning the original database if the number of eliminated transactions is smaller than the threshold determined based on the pre-large and diffset concepts.

Abstract: The frequent-itemset lattice (FIL) is an effective structure for mining association rules. However, building an FIL for a modified database requires a lot of time and memory. Currently, there is no approach for updating an FIL with deleted transactions. Therefore, this paper proposes an approach for maintaining FILs for transaction deletion without rescanning the original database if the number of eliminated transactions is smaller than the threshold determined based on the pre-large and diffset concepts. A diffset-based approach is first used to build an FIL quickly. Then, two proposed approaches (tidset-based and diffset-based) are used for updating the FIL with transaction deletion. Experiments show that the diffset-based approach outperforms the tidset-based and the batch-mode approaches.
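The difference between tidsets and diffsets can be illustrated on a toy database. A tidset lists the transactions containing an item; the diffset of the pair {x, y} relative to x lists the transactions that contain x but not y, so the support of the pair is the support of x minus the diffset size. This is a minimal sketch of the two representations only, not the lattice maintenance procedure of the paper; the data and function names are assumptions.

```python
# Toy transaction database: transaction id -> set of items.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "b"},
    4: {"b", "c"},
}


def tidset(db, item):
    """Ids of the transactions that contain `item`."""
    return {tid for tid, items in db.items() if item in items}


def support_via_tidsets(db, x, y):
    """Support of {x, y} as the size of the tidset intersection."""
    return len(tidset(db, x) & tidset(db, y))


def support_via_diffset(db, x, y):
    """Support of {x, y} as support(x) minus the transactions with x but not y."""
    diffset = tidset(db, x) - tidset(db, y)
    return len(tidset(db, x)) - len(diffset)


# Deleting transaction 3 lowers the support of {a, b} under both representations.
after_deletion = {tid: items for tid, items in transactions.items() if tid != 3}
```

Diffsets tend to be much smaller than tidsets on dense data, which is the usual motivation for preferring them.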

Journal Article•10.1016/J.DATAK.2014.11.003•

Stepwise structural verification of cyclic workflow models with acyclic decomposition and reduction of loops

Yongsun Choi¹, Pauline Kongsuwan¹, Cheol Min Joo², J. Leon Zhao³•Institutions (3)

Inje University¹, Dongseo University², City University of Hong Kong³

1 Jan 2015

TL;DR: A novel structural verification approach for cyclic workflow models by means of acyclic decomposition and reduction of loops is introduced and its execution result shows that, while providing diagnostic information, the proposed approach can handle workflow models with arbitrary cycles effectively.

Abstract: The existence of cycles (or loops) is one of the main factors that make the analysis of workflow models difficult. Several approaches to structural verification exist in the literature, but how to verify cyclic workflow models efficiently in a comprehensible form remains an open research question. Thus, a novel structural verification approach for cyclic workflow models by means of acyclic decomposition and reduction of loops is introduced in this paper with the following contributions. First, acyclic decomposition of natural loops, further enhanced by reduction of nested loops, enables existing verification techniques, normally dealing with acyclic models, to handle workflow models with natural loops. Second, instantiation of an irreducible loop into natural loops, together with reduction of concurrent loop entries, enables the proposed approach to handle workflow models with irreducible loops. Last, diagnostic information, provided by the proposed approach, helps stakeholders correct and improve their workflow models. Two examples are provided to show that the proposed approach is systematic and practical. In addition, a prototype of the proposed approach is developed. Its execution result shows that, while providing diagnostic information, the proposed approach can handle workflow models with arbitrary cycles effectively.
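A common first step when reasoning about loops in a workflow graph is to locate back edges with a depth-first search: an edge (u, v) is a back edge when v is still on the DFS stack while u is explored. The sketch below shows only that step, under the assumption that the workflow is given as an adjacency dictionary. Identifying natural versus irreducible loops additionally requires dominator analysis, and none of this is the decomposition procedure of the paper.

```python
def back_edges(graph, start):
    """Return the edges (u, v) whose target v is on the DFS stack when u is explored."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the stack, finished
    color = {node: WHITE for node in graph}
    found = []

    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY:
                found.append((u, v))  # v is an ancestor of u in the DFS tree
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    visit(start)
    return found


# Hypothetical workflow: a review task can send the case back to task A.
workflow = {
    "start": ["A"],
    "A": ["B"],
    "B": ["review"],
    "review": ["A", "end"],
    "end": [],
}
loops = back_edges(workflow, "start")
```

Removing the reported edges leaves an acyclic graph, which is the kind of input that acyclic verification techniques expect.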

Journal Article•10.1016/J.DATAK.2015.04.004•

A user-centered approach for integrating social data into groups of interest

Xuan-Truong Vu¹, Marie-Hélène Abel¹, Pierre Morizet-Mahoudeaux¹•Institutions (1)

University of Technology of Compiègne¹

1 Mar 2015

TL;DR: A new user-centered approach for integrating social data into groups of interest that makes it possible for a group to tap into its members' social data scattered over different social network sites and extract from these data the information relevant to the group's topics of interest.

Abstract: Social network sites with large-scale public networks like Facebook, Twitter or LinkedIn have become a very important part of our daily life. Users are increasingly connected to these services for publishing and sharing information and content with others. Social network sites have therefore become a powerful source of content of interest, part of which may fall into the scope of interests of a given group. So far, no efficient solution has been proposed for a group of interest to tap into social data, especially when they are protected by and scattered across different social network sites. We have therefore proposed a user-centered approach for integrating social data into groups of interest. This approach makes it possible to aggregate social data of the group's members and extract from these data the information relevant to the group's topics of interest. Moreover, it follows a user-centered design allowing each member to personalize his/her sharing settings and interests within their respective groups. We describe in this paper the conceptual and technical components of the proposed approach. To illustrate the approach further, a web-based prototype is also presented. A preliminary test using this prototype was carried out and showed encouraging results. The paper describes a new user-centered approach for integrating social data into groups of interest. The approach makes it possible for a group to tap into its members' social data scattered over different social network sites. The contents relevant to the group's collectively defined topics of interest are automatically extracted from these data. Each member is free to personalize his/her collaborative experience within the group. The paper also presents a working Web-based prototype supporting Facebook, Twitter and LinkedIn.

Journal Article•10.1016/J.DATAK.2015.09.002•

Temporal expression extraction with extensive feature type selection and a posteriori label adjustment

Michele Filannino¹, Goran Nenadic•Institutions (1)

University of Manchester¹

1 Nov 2015

TL;DR: It is shown that the use of WordNet-based features in the identification task negatively affects the overall performance, and that there is no statistically significant difference in the results based on gazetteers, shallow parsing and propositional noun phrases labels on top of the morpho-lexical features.

Abstract: The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. It makes it possible to filter information and infer temporal flows of events. This paper presents ManTIME, a general domain temporal expression identification and normalisation system, and systematically explores the impact of different features and training corpora on the performance. The identification phase combines the use of conditional random fields along with a post-processing pipeline, whereas the normalisation phase is carried out using NorMA, an open-source rule-based temporal normaliser. We investigate the performance variation with respect to different feature types. Specifically, we show that the use of WordNet-based features in the identification task negatively affects the overall performance, and that there is no statistically significant difference in the results based on gazetteers, shallow parsing and propositional noun phrases labels on top of the morpho-lexical features. We also show that the use of silver data (alone or in addition to the human-annotated ones) does not improve the performance. We evaluate six combinations of training data and post-processing pipeline with respect to the TempEval-3 benchmark test set. The best run achieved 0.95 (precision), 0.85 (recall) and 0.90 (Fβ=1) in the identification phase. Normalisation accuracies are 0.86 (for type attribute) and 0.77 (for value attribute). The proposed approach ranked 3rd in the TempEval-3 challenge (task A) as the best performing machine learning-based system among 21 participants.
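The reported identification scores are consistent with the standard F-measure, which for β = 1 is the harmonic mean of precision and recall. The short check below uses the general F-beta formula; the function name `f_beta` is an assumption, and the inputs are the rounded figures quoted in the abstract.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded identification scores quoted in the abstract.
identification_f1 = f_beta(precision=0.95, recall=0.85)
print(round(identification_f1, 2))  # agrees with the reported 0.90 after rounding
```

Because the inputs are themselves rounded, the result can only be expected to match the published score to two decimal places.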

Journal Article•10.1016/J.DATAK.2015.07.006•

Empirical evidence for the usefulness of Armstrong tables in the acquisition of semantically meaningful SQL constraints

Van Lam Le¹, Sebastian Link², Flavio Ferrarotti•Institutions (2)

Victoria University of Wellington¹, University of Auckland²

1 Jul 2015

TL;DR: Using new empirical measures, extensive experiments confirm that users of Armstrong tables are likely to recognize domain semantics they would overlook otherwise and complement existing schema design methodologies in producing quality schemata that process data efficiently.

Abstract: SQL schema designs result from methodologies such as UML, Entity-Relationship models, description logics, or relational normalization. Independently of the methodology, sample data is promoted by academia and industry to consolidate the schema designs produced. SQL constraints are an abstract standard-compliant encoding of the designers' perception about the semantics of an application domain. Armstrong tables can visualize SQL constraints concisely, in the sense that they satisfy all constraints perceived meaningful and violate all constraints perceived meaningless. Using new empirical measures we investigate how Armstrong tables help design teams recognize domain semantics. Extensive experiments confirm that users of Armstrong tables are likely to recognize domain semantics they would overlook otherwise. Armstrong tables therefore complement existing schema design methodologies in producing quality schemata that process data efficiently.
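The idea of a sample table that satisfies the constraints perceived meaningful and violates the others can be made concrete with a functional dependency check. The sketch below tests whether a dependency holds in a small table; it ignores NULL markers and the other SQL constraint classes, and it does not construct Armstrong tables. The table, column names and function name are assumptions for illustration.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Two rows agreeing on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data shown to a design team.
sample = [
    {"employee": "Ann", "department": "Sales", "manager": "Bob"},
    {"employee": "Ann", "department": "IT", "manager": "Eve"},
    {"employee": "Dan", "department": "Sales", "manager": "Bob"},
]

dept_determines_manager = holds(sample, ["department"], ["manager"])
employee_determines_dept = holds(sample, ["employee"], ["department"])
```

Here the first dependency holds and the second is visibly violated by the two rows for Ann, which is the kind of evidence a sample table gives a reviewer at a glance.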

Journal Article•10.1016/J.DATAK.2015.06.005•

A conceptual modeling framework for network analytics

Qing Wang¹•Institutions (1)

Australian National University¹

1 Sep 2015

TL;DR: This paper discusses how the semantics of network analysis queries can be modeled at the conceptual level, and explores three possible application areas of using this analytical framework for network analysis applications: governing semantic integrity, improving analysis efficiency, and supporting network dynamics.

Abstract: In this paper we propose a conceptual modeling framework for network analysis applications. Within this framework, a data model called the Network Analytics ER model (NAER) is developed, which enables us to manage and analyze network data in a unified way. In particular, not only data requirements but also query requirements can be captured by the conceptual description of network analysis applications. This unified view provides us with a flexible platform to build a number of topology schemas upon the underlying core schema for supporting network analysis queries. We also discuss how the semantics of network analysis queries can be modeled at the conceptual level, and explore three possible application areas of using our analytical framework for network analysis applications: (1) governing semantic integrity, (2) improving analysis efficiency, and (3) supporting network dynamics. We believe that conceptual modeling can play an important role in managing and analyzing network data, and contribute to the development of network analytics.

Journal Article•10.1016/J.DATAK.2015.06.011•

An approach to website schema.org design

Albert Tort¹, Antoni Olivé¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Sep 2015

TL;DR: This paper describes an approach to the design of a website schema.org by using a human-computer task-oriented dialogue, whose purpose is to arrive at that design and proposes a dialogue generator that is domain independent but that can be adapted to specific domains.

Abstract: Schema.org offers to web developers the opportunity to enrich a website's content with microdata and schema.org. For large websites, implementing microdata can take a lot of time. In general, it is necessary to perform two main activities, for which we lack methods and tools. The first consists in designing what we call the website schema.org, which is the fragment of schema.org that is relevant to the website. The second consists in adding the corresponding microdata tags to the web pages. In this paper, we describe an approach to the design of a website schema.org. The approach consists in using a human-computer task-oriented dialogue, whose purpose is to arrive at that design. We describe a dialogue generator that is domain independent but that can be adapted to specific domains. We propose a set of six evaluation criteria that we use to evaluate our approach and that could be used in future approaches.

Journal Article•10.1016/J.DATAK.2015.06.003•

Exploiting semantics for XML keyword search

Thuy Ngoc Le¹, Zhifeng Bao², Tok Wang Ling¹•Institutions (2)

National University of Singapore¹, RMIT University²

1 Sep 2015

TL;DR: This paper proposes a new semantics, called CR (Common Relative), for XML keyword search, which can return answers independent of schema designs; it also discovers properties of common relatives and proposes an efficient algorithm.

Abstract: XML keyword search has attracted a lot of interest, with typical search based on the lowest common ancestor (LCA). However, in this paper, we show several problems of the LCA-based approaches, including meaningless answers, incomplete answers, duplicated answers, missing answers, and schema-dependent answers. To handle these problems, we exploit the semantics of object, object identifier, relationship, and attribute (referred to as the ORA-semantics). Based on the ORA-semantics, we introduce new ways of labeling and matching. More importantly, we propose a new semantics, called CR (Common Relative), for XML keyword search, which can return answers independent of schema designs. To find answers based on the CR semantics, we discover properties of common relatives and propose efficient algorithms. Experimental results show the seriousness of the problems of the LCA-based approaches. They also show that the CR semantics possesses the properties of completeness, soundness and independence, while the response time of our approach is faster than that of the LCA-based approaches thanks to our techniques.
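For context, the LCA baseline that the abstract criticises is often computed over Dewey labels, where each node is labelled by the path of child positions from the root and the lowest common ancestor of several nodes is their longest common label prefix. The sketch below shows that baseline only, not the CR semantics; the labels and function name are assumptions.

```python
def lca(labels):
    """Lowest common ancestor of nodes given as Dewey labels (tuples of ints)."""
    prefix = list(labels[0])
    for label in labels[1:]:
        n = 0
        # Extend the match while both labels agree position by position.
        while n < min(len(prefix), len(label)) and prefix[n] == label[n]:
            n += 1
        prefix = prefix[:n]
    return tuple(prefix)


# Hypothetical keyword matches: two nodes under the same subtree (1, 2),
# and two nodes that only share the document root (1,).
same_subtree = lca([(1, 2, 1), (1, 2, 3, 1)])
root_only = lca([(1, 1, 2), (1, 2, 1)])
```

The second case illustrates one of the weaknesses noted in the abstract: when matches only meet at the root, the LCA answer is the whole document and carries little meaning.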

Journal Article•10.1016/J.DATAK.2015.04.002•

Discovery of pathways in protein-protein interaction networks using a genetic algorithm

Hoai Anh Nguyen¹, Cong Long Vu¹, Minh Phuong Tu², Thu Lam Bui¹•Institutions (2)

Le Quy Don Technical University¹, Posts and Telecommunications Institute of Technology²

1 Mar 2015

TL;DR: A method for orienting protein-protein interaction networks (PPIs) and discovering pathways is proposed, and a genetic algorithm is designed to solve the problem, taking into account the problem's characteristics.

Abstract: Biological pathways have played an important role in understanding cell activities and evolution. In order to find these pathways, it is necessary to orient protein-protein interactions, which are usually given in the form of undirected networks or graphs. Previous findings indicate that orienting protein interactions can improve the process of pathway discovery. However, assigning orientation for protein interactions is a combinatorial optimization problem which has been proved to be NP-hard, making it critical to develop efficient algorithms. This paper proposes a method for orienting protein-protein interaction networks (PPIs) and discovering pathways. For our proposal, the mathematical model of the problem is given and then a genetic algorithm is designed to find the solution for the problem, taking into account the problem's characteristics. We conducted multiple runs on the data of yeast PPI networks to test the best option for the problem. The obtained results were compared with a well-known algorithm (ROLS), which was shown to be the best in dealing with this problem, in terms of the run time, fitness function values, and especially the ratio of matching gold standard pathways. The results show the good performance of our approach in addressing this problem.

Journal Article•10.1016/J.DATAK.2015.06.010•

Approximate and selective reasoning on knowledge graphs

André Freitas¹, João Carlos Pereira da Silva², Edward Curry³, Paul Buitelaar⁴•Institutions (4)

University of Passau¹, Federal University of Rio de Janeiro², National University of Ireland³, University of South Africa⁴

1 Nov 2015

TL;DR: A selective graph navigation mechanism based on a distributional relational semantic model, which can be applied to querying and reasoning over heterogeneous knowledge bases (KBs), is evaluated using ConceptNet as a commonsense KB and achieves high selectivity, high selectivity scalability and high accuracy in the selection of meaningful navigational paths.

Abstract: Tasks such as question answering and semantic search are dependent on the ability to query and reason over large-scale commonsense knowledge bases (KBs). However, dealing with commonsense data demands coping with problems such as the increase in schema complexity, semantic inconsistency, incompleteness and scalability. This paper proposes a selective graph navigation mechanism based on a distributional relational semantic model which can be applied to querying and reasoning over heterogeneous KBs. The approach can be used for approximative reasoning, querying and associational knowledge discovery. In this paper we focus on commonsense reasoning as the main motivational scenario for the approach. The approach focuses on addressing the following problems: (i) providing a semantic selection mechanism for facts which are relevant and meaningful in a specific reasoning and querying context and (ii) coping with information incompleteness in large KBs. The approach is evaluated using ConceptNet as a commonsense KB, and achieved high selectivity, high selectivity scalability and high accuracy in the selection of meaningful navigational paths. Distributional semantics is also used as a principled mechanism to cope with information incompleteness.

Journal Article•10.1016/J.DATAK.2015.04.005•

Towards richer rule languages with polynomial data complexity for the Semantic Web

Linh Anh Nguyen¹, Thi-Bich-Loc Nguyen², Andrzej Szałas³•Institutions (3)

University of Warsaw¹, University of the Sciences², Linköping University³

1 Mar 2015

TL;DR: A Horn description logic called Horn-DL is introduced, which is strictly and essentially richer than Horn-RegI, Horn-SHIQ and Horn-SROIQ, and allows a form of the concept constructor "universal restriction" to appear at the left hand side of terminological inclusion axioms.

Abstract: We introduce a Horn description logic called Horn-DL, which is strictly and essentially richer than Horn-RegI, Horn-SHIQ and Horn-SROIQ, while still having PTime data complexity. In comparison with Horn-SROIQ, Horn-DL additionally allows the universal role and assertions of the form irreflexive(s), ¬s(a, b), a ≠ b. More importantly, in contrast to all the well-known Horn fragments EL, DL-Lite, DLP, Horn-SHIQ, and Horn-SROIQ of description logics, Horn-DL allows a form of the concept constructor "universal restriction" to appear at the left hand side of terminological inclusion axioms. Namely, a universal restriction can be used in such places in conjunction with the corresponding existential restriction. We develop the first algorithm with PTime data complexity for checking satisfiability of Horn-DL knowledge bases.

Journal Article•10.1016/J.DATAK.2015.07.005•

Improving conceptual data models through iterative development

Tilmann Zäschke¹, Stefania Leone, Tobias Gmünder¹, Moira C. Norrie¹•Institutions (1)

ETH Zurich¹

1 Jul 2015

TL;DR: The concept of evolvability as a model quality characteristic is introduced, and the paper discusses why the quality of conceptual models can generally benefit from profiling and how performance measurements convey semantic information.

Abstract: Agile methods promote iterative development with short cycles, where user feedback from the previous iteration is used to refactor and improve the current version. To facilitate agile development of information systems, this paper offers three contributions. First, we introduce the concept of evolvability as a model quality characteristic. Evolvability refers to the expected implications of future model refactorings, both in terms of complexity of the required database evolution algorithm and in terms of the expected volume of data to evolve. Second, we propose extending the agile development cycle by using database profiling information to suggest adaptations to the conceptual model to improve performance. For every software release, the database profiler identifies and analyses navigational access patterns, and proposes model optimisations based on data characteristics, access patterns and a cost-benefit model. Based on an experimental evaluation of the profiler we discuss why the quality of conceptual models can generally benefit from profiling and how performance measurements convey semantic information. Third, we discuss the flow of semantic information when developing and using information systems. Beyond these contributions, we also make a case for using object databases in agile development environments. However, most of the presented concepts are also applicable to other database paradigms.

Journal Article•10.1016/J.DATAK.2015.01.001•

Efficient repair of dimension hierarchies under inconsistent reclassification

Monica Caniupan¹, Alejandro A. Vaisman², Raúl Arredondo•Institutions (2)

University of the Bío Bío¹, Instituto Tecnológico de Buenos Aires²

1 Jan 2015

TL;DR: It is shown that, although in the general case finding an r-repair is NP-complete, for real-world hierarchy schemas, computing such repairs can be done in polynomial time.

Abstract: On-Line Analytical Processing (OLAP) dimensions are usually modeled as a set of elements connected by a hierarchical relationship. To ensure summarizability, a dimension is required to be strict, that is, every element of the dimension must have a unique ancestor in each of its ancestor categories. In practice, elements in a dimension are often reclassified, meaning that their rollups are changed. After this operation the dimension may become non-strict. To fix this problem, we propose to compute a set of minimal r-repairs for the new non-strict dimension. Each minimal r-repair is a strict dimension that keeps the result of the reclassification, and is obtained by performing a minimum number of insertions and deletions to the dimension graph. We show that, although in the general case finding an r-repair is NP-complete, for real-world hierarchy schemas, computing such repairs can be done in polynomial time. Further, we propose efficient heuristic-based algorithms for computing r-repairs, and discuss their computational complexity. We also perform experiments over synthetic and real-world dimensions to show the plausibility of our approach.
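The strictness condition can be checked directly on a dimension graph: an element violates strictness when it reaches two different ancestors belonging to the same category. The sketch below performs only that check, on an assumed representation (a `parents` dictionary for rollup edges and a `category` dictionary for element categories); it does not compute r-repairs.

```python
def ancestors(element, parents):
    """All elements reachable from `element` by following rollup edges."""
    seen = set()
    stack = list(parents.get(element, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def is_strict(parents, category):
    """True if every element has at most one ancestor in each category."""
    for element in category:
        per_category = {}
        for anc in ancestors(element, parents):
            # A second, different ancestor in the same category breaks strictness.
            if per_category.setdefault(category[anc], anc) != anc:
                return False
    return True


# Hypothetical Store -> City -> Country dimension.
category = {"s1": "Store", "Lyon": "City", "Paris": "City", "France": "Country"}
parents = {"s1": ["Lyon"], "Lyon": ["France"], "Paris": ["France"]}
strict_before = is_strict(parents, category)

# A careless reclassification adds a second City rollup for store s1.
reclassified = {**parents, "s1": ["Lyon", "Paris"]}
strict_after = is_strict(reclassified, category)
```

In the example, store s1 ends up with two City ancestors after the reclassification, so aggregates at the City level would count it twice; a repair would have to remove or redirect one of the rollup edges.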
