Top 52 papers presented at Data and Knowledge Engineering in 2017

Journal Article•10.1016/J.DATAK.2017.01.001•

Big data technologies and Management

Veda C. Storey¹, Il-Yeol Song²•Institutions (2)

J. Mack Robinson College of Business¹, Drexel University²

1 Mar 2017

TL;DR: The five Vs of big data (volume, velocity, variety, veracity, and value) are reviewed, along with new technologies, including NoSQL databases, that have emerged to accommodate the needs of big data initiatives.

Abstract: The era of big data has resulted in the development and application of technologies and methods aimed at effectively using massive amounts of data to support decision-making and knowledge discovery activities. In this paper, the five Vs of big data (volume, velocity, variety, veracity, and value) are reviewed, along with new technologies, including NoSQL databases, that have emerged to accommodate the needs of big data initiatives. The role of conceptual modeling for big data is then analyzed, and suggestions are made for effective conceptual modeling efforts with respect to big data.

264 citations

Journal Article•10.1016/J.DATAK.2017.06.001•

Learning multiple layers of knowledge representation for aspect based sentiment analysis

Duc-Hong Pham¹, Duc-Hong Pham², Anh-Cuong Le³•Institutions (3)

Electric Power University¹, University of Engineering and Technology, Lahore², Ton Duc Thang University³

15 Jun 2017

TL;DR: A novel multi-layer architecture for representing customer reviews that outperforms the well-known methods in previous studies on aspect-based sentiment analysis and generates the aspect ratings as well as aspect weights.

Abstract: Sentiment analysis is the task of automatically discovering the sentiments expressed about a product (or service, social event, etc.) in customers' textual comments (i.e., reviews) crawled from various social media resources. Recently, demand has been rising for aspect-based sentiment analysis, in which we need to determine the sentiment ratings and importance degrees of product aspects. In this paper we propose a novel multi-layer architecture for representing customer reviews. We observe that the overall sentiment for a product is composed of the sentiments of its aspects, and in turn each aspect has its sentiments expressed in related sentences, which are themselves compositions of their words. This observation motivates us to design a multiple-layer architecture of knowledge representation for representing the different sentiment levels of an input text. This representation is then integrated into a neural network to form a model for predicting overall product ratings. We use representation learning techniques, including word embeddings and compositional vector models, and apply a back-propagation algorithm based on gradient descent to learn the model. The model consequently generates aspect ratings as well as aspect weights (i.e., aspect importance degrees). Our experiment is conducted on a data set of reviews from the hotel domain, and the results show that our model outperforms the well-known methods from previous studies.

143 citations

Journal Article•10.1016/J.DATAK.2017.11.003•

A Guidelines framework for understandable BPMN models

Flavio Corradini¹, Alessio Ferrari², Fabrizio Fornari¹, Stefania Gnesi², Andrea Polini¹, Barbara Re¹, Giorgio Oronzo Spagnolo²•Institutions (2)

University of Camerino¹, Istituto di Scienza e Tecnologie dell'Informazione²

28 Nov 2017

TL;DR: A set of fifty guidelines that can help modelers improve the understandability of their models is provided, focused on the Business Process Modelling Notation 2.0 standard published by the Object Management Group.

Abstract: Business process modeling allows abstracting and reasoning on how work is structured within complex organizations. Business process models represent blueprints that can serve different purposes for a variety of stakeholders. For example, business analysts can use these models to better understand how the organization works; employees playing a role in the process can use them to learn the tasks that they are supposed to perform; software analysts/developers can refer to the models to understand the system-as-is before designing the system-to-be. Given the variety of stakeholders that need to interpret these models, and considering the pivotal function that models play within organizations, understandability becomes a fundamental quality that modelers need to take into particular account. In this paper we provide a set of fifty guidelines that can help modelers improve the understandability of their models. The work focuses on the Business Process Modelling Notation 2.0 standard published by the Object Management Group, which has acquired a clear predominance among the modeling notations for business processes. The guidelines were derived by means of a thorough literature review, which identified around one hundred guidelines, and through successive activities of synthesis and homogenization. In addition, we implemented a freely available open source tool, named BEBoP (understandaBility vErifier for Business Process models), to check the adherence of a model to the guidelines. Finally, guideline violations were checked with BEBoP on a dataset of 11,294 models available in a publicly accessible repository. Our tests show that, although the majority of the guidelines are respected by the models, some guidelines that the literature recognizes as fundamental are frequently violated.

93 citations

Journal Article•10.1016/J.DATAK.2017.03.009•

An adaptable fine-grained sentiment analysis for summarization of multiple short online reviews

Reinald Kim Amplayo¹, Min Song¹•Institutions (1)

Yonsei University¹

1 Jul 2017

TL;DR: Results show that the sentiment classifier outperforms baseline models and industry-standard classifiers while the aspect extractor outperforms other topic models in terms of aspect diversity and aspect extracting power.

Abstract: In this study, we present a novel method for generating summaries of multiple online reviews using a fine-grained sentiment extraction model for short texts, which is adaptable to different domains and languages. Adaptability of a model is defined as its ability to be easily modified and used on different domains and languages. This is important because of the diversity of domains and languages available. The fine-grained sentiment extraction model is divided into two methods: sentiment classification and aspect extraction. The sentiment classifier is built using a three-level classification approach, while the aspect extractor is built using the extended biterm topic model (eBTM), an extension of the LDA topic model for short texts. Overall, results show that the sentiment classifier outperforms baseline models and industry-standard classifiers, while the aspect extractor outperforms other topic models in terms of aspect diversity and aspect extracting power. In addition, using the Naver movies dataset, we show that online review summarization can be effectively constructed using the proposed methods by comparing the results of our method with the results of a movie awards ceremony.

64 citations

Journal Article•10.1016/J.DATAK.2017.03.002•

Multi-level ontology-based conceptual modeling

Victorio Albani de Carvalho¹, Victorio Albani de Carvalho², João Paulo A. Almeida¹, Claudenir M. Fonseca¹, Giancarlo Guizzardi³, Giancarlo Guizzardi¹•Institutions (3)

Universidade Federal do Espírito Santo¹, International Foundation for Electoral Systems², Free University of Bozen-Bolzano³

1 May 2017

TL;DR: The UFO-MLT combination serves as a foundation for conceptual models that can benefit from the ontological distinctions of UFO as well as MLT's basic concepts and patterns for multi-level modeling.

Abstract: Since the late 1980s, there has been a growing interest in the use of foundational ontologies to provide a sound theoretical basis for the discipline of conceptual modeling. This has led to the development of ontology-based conceptual modeling techniques whose modeling primitives reflect the conceptual categories defined in a foundational ontology. The ontology-based conceptual modeling language OntoUML, for example, incorporates the distinctions underlying the taxonomy of types in the Unified Foundational Ontology (UFO) (e.g., kinds, phases, roles, and mixins). This approach has so far focused on supporting types whose instances are individuals in the subject domain, with no provision for types of types (or categories of categories). In this paper we address this limitation by extending the Unified Foundational Ontology with the MLT multi-level theory. The UFO-MLT combination serves as a foundation for conceptual models that can benefit from the ontological distinctions of UFO as well as MLT's basic concepts and patterns for multi-level modeling. We discuss the impact of the extended foundation on multi-level conceptual modeling.

50 citations

Journal Article•10.1016/J.DATAK.2017.07.005•

A secure kNN query processing algorithm using homomorphic encryption on outsourced database

Hyeong-Il Kim¹, Hyeong-Jin Kim², Jae-Woo Chang²•Institutions (2)

Agency for Defense Development¹, Chonbuk National University²

22 Jul 2017

TL;DR: This paper proposes a new secure k-nearest neighbor query processing algorithm that guarantees the confidentiality of both encrypted data and users’ query records, and devises an encrypted index search scheme that performs data filtering without revealing data access patterns.

Abstract: With the adoption of cloud computing, database outsourcing has emerged as a new platform. Due to the serious privacy concerns associated with cloud computing, databases must be encrypted before being outsourced to the cloud. Therefore, various k-nearest neighbor (kNN) query processing techniques have been proposed for encrypted databases. However, existing schemes are either insecure or inefficient. In this paper, we propose a new secure kNN query processing algorithm. Our algorithm guarantees the confidentiality of both encrypted data and users’ query records. To achieve a high level of query processing efficiency, we also devise an encrypted index search scheme that performs data filtering without revealing data access patterns. A performance analysis shows that the proposed scheme outperforms the existing scheme in terms of query processing costs while preserving data privacy.

43 citations

Journal Article•10.1016/J.DATAK.2016.12.004•

Specification and derivation of key performance indicators for business analytics

Alejandro Maté¹, Juan Trujillo¹, John Mylopoulos²•Institutions (2)

University of Alicante¹, University of Trento²

1 Mar 2017

TL;DR: An approach is proposed that provides decision makers with an integrated view of strategic business objectives and conceptual data warehouse KPIs; it links strategic business models to the data for monitoring and assessing them and enables the user to analyze data subspaces from a strategic point of view.

Abstract: Key Performance Indicators (KPI) measure the performance of an enterprise relative to its objectives thereby enabling corrective action where there are deviations. In current practice, KPIs are manually integrated within dashboards and scorecards used by decision makers. This practice entails various shortcomings. First, KPIs are not related to their business objectives and strategy. Consequently, decision makers often obtain a scattered view of the business status and business concerns. Second, while KPIs are defined by decision makers, their implementation is performed by IT specialists. This often results in discrepancies that are difficult to identify. In this paper, we propose an approach that provides decision makers with an integrated view of strategic business objectives and conceptual data warehouse KPIs. The main benefit of our proposal is that it links strategic business models to the data for monitoring and assessing them. In our proposal, KPIs are defined using a modeling language where decision makers specify KPIs using business terminology, but can also perform quick modifications and even navigate data while maintaining a strategic view. This enables monitoring and what-if analysis, thereby helping analysts to compare expectations with reported results. Highlights: Novel approach for conceptualizing and specifying Key Performance Indicators. Transforms strategic models into analytic tools to aid in decision making. Enables the user to analyze data subspaces from a strategic point of view. Based on the Semantics for Business Vocabulary and Rules specification. Implemented to support the whole process from definition to data extraction.

42 citations

Journal Article•10.1016/J.DATAK.2017.11.002•

An Information-Theoretic Filter Approach for Value Weighted Classification Learning in Naive Bayes

Chang-Hwan Lee¹•Institutions (1)

Dongguk University¹

1 Nov 2017

TL;DR: A value weighting method is proposed that assigns a weight to each value of a feature rather than to each feature; in a comparison with traditional methods on a number of datasets, it could improve the performance of naive Bayes significantly.

Abstract: Assigning weights to features has been an important topic in some classification learning algorithms. In this paper, we propose a new paradigm for assigning weights in classification learning, called the value weighting method. While current weighting methods assign a weight to each feature, we assign a different weight to each value of each feature. The performance of naive Bayes learning with the value weighting method is compared with that of some other traditional methods on a number of datasets. The experimental results show that the value weighting method could improve the performance of naive Bayes significantly.

37 citations

Journal Article•10.1016/J.DATAK.2017.03.008•

Ontology-based context modeling in service-oriented computing: A systematic mapping

Oscar Cabrera¹, Xavier Franch¹, Jordi Marco¹•Institutions (1)

Polytechnic University of Catalonia¹

1 Jul 2017

TL;DR: A sweeping view of the anatomy of context models may help avoid the postulation of new proposals not aligned with current research.

Abstract: Context: Service-oriented computing and context-aware computing are two consolidated paradigms that are changing the way software services are provided and consumed. Whilst service-oriented computing is based on service-oriented architectures for providing flexible software services, context-aware computing articulates different phases of a context life cycle for changing the behavior of such services. The synergy between both paradigms provides the context for this study. Objective: This study analyzes the current state of the art of context models, specifically: (1) which proposals exist and how they are related; (2) what their structural characteristics are; (3) what context information is most addressed; and (4) what their most consolidated definitions are. Given their dominance in the field, the study focuses on ontology-based approaches. Method: We conducted a systematic mapping by establishing a review protocol that integrates automatic and manual searches from different sources. We applied a rigorous method to elicit the keywords from the research questions and the selection criteria to retrieve the papers to evaluate. Results: Overall, 138 primary studies were selected to answer our research questions. These proposals were studied in depth by analyzing: (1) distribution over time and their relationships; (2) size, correlated with the number of classes and levels of the context model, and coverage of the definitions provided as an indicator of quality; (3) most addressed context information; (4) most consolidated definitions of context information. Conclusions: The contribution of this survey is to make available a unified and consolidated body of knowledge on context for service-oriented computing that could be instantiated and used as a starting point in a variety of use cases. This sweeping view of the anatomy of context models may help avoid the postulation of new proposals not aligned with current research.

32 citations

Journal Article•10.1016/J.DATAK.2017.08.003•

A Fine‐Grained Distribution Approach for ETL Processes in Big Data Environments

Mahfoud Bala, Omar Boussaid¹, Zaia Alimazighi•Institutions (1)

University of Lyon¹

1 Sep 2017

TL;DR: A new fine-grained parallelization/distribution approach for populating the Data Warehouse (DW) is proposed; employing 25 to 38 parallel tasks enables it to speed up the ETL process by up to 33%, with the improvement rate being linear.

Abstract: Among the so-called “4Vs” (volume, velocity, variety, and veracity) that characterize the complexity of Big Data, this paper focuses on the issue of “Volume” in order to ensure good performance for Extracting-Transforming-Loading (ETL) processes. In this study, we propose a new fine-grained parallelization/distribution approach for populating the Data Warehouse (DW). Unlike prior approaches that distribute the ETL only at a coarse-grained level of processing, our approach provides different ways of parallelization/distribution at the process, functionality, and elementary function levels. In our approach, an ETL process is described in terms of its core functionalities, which can run on a cluster of computers according to the MapReduce (MR) paradigm. The novel approach thereby allows the distribution of the ETL process at three levels: the “process” level for coarse-grained distribution, and the “functionality” and “elementary functions” levels for fine-grained distribution. Our performance analysis reveals that employing 25 to 38 parallel tasks enables the novel approach to speed up the ETL process by up to 33%, with the improvement rate being linear.

26 citations

Journal Article•10.1016/J.DATAK.2017.08.004•

Frequent patterns in ETL workflows: An empirical approach

Vasileios Theodorou¹, Alberto Abelló¹, Maik Thiele², Wolfgang Lehner²•Institutions (2)

Polytechnic University of Catalonia¹, Dresden University of Technology²

5 Sep 2017

TL;DR: This work logically models ETL workflows using labeled graphs and employs graph algorithms to identify candidate patterns and to recognize them on different workflows, providing a stepping stone for the automatic translation of ETL logical models to their conceptual representation and for generating fine-grained cost models at the granularity level of patterns.

Abstract: The complexity of Business Intelligence activities has driven the proposal of several approaches for the effective modeling of Extract-Transform-Load (ETL) processes, based on the conceptual abstraction of their operations. Apart from fostering automation and maintainability, such modeling also provides the building blocks to identify and represent frequently recurring patterns. Despite some existing work on classifying ETL components and functionality archetypes, the issue of systematically mining such patterns and their connection to quality attributes such as performance has not yet been addressed. In this work, we propose a methodology for the identification of ETL structural patterns. We logically model the ETL workflows using labeled graphs and employ graph algorithms to identify candidate patterns and to recognize them on different workflows. We showcase our approach through a use case that is applied to implemented ETL processes from the TPC-DI specification, and we present the mined ETL patterns. By decomposing ETL processes into identified patterns, our approach provides a stepping stone for the automatic translation of ETL logical models to their conceptual representation and for generating fine-grained cost models at the granularity level of patterns.

Journal Article•10.1016/J.DATAK.2017.09.001•

SummTriver: A new trivergent model to evaluate summaries automatically without human references

Luis Adrián Cabrera-Diego¹, Luis Adrián Cabrera-Diego², Juan-Manuel Torres-Moreno³, Juan-Manuel Torres-Moreno²•Institutions (3)

Edge Hill University¹, University of Avignon², École Polytechnique de Montréal³

1 Sep 2017

TL;DR: This paper presents SummTriver, an automatic evaluation method that tries to be more correlated to manual evaluation by using multiple divergences, and the results are promising, especially for summarization campaigns.

Abstract: The automatic evaluation of summaries is a hard task that remains open. The assessment aims to measure simultaneously the informativeness and readability of summaries. The scientific community has tackled this problem with partial solutions, in terms of informativeness, using ROUGE. However, to use this method, it is necessary to have multiple summaries made by humans (the references). Methods without human references have been implemented, but they are still far from being highly correlated with manual evaluations. In this paper we present SummTriver, an automatic evaluation method that tries to be more correlated with manual evaluation by using multiple divergences. The results are promising, especially for summarization campaigns. Besides this, we also present an analysis, at the micro level, of how correlated the manual and automatic summary evaluation methods are when a large quantity of observations is used.

Journal Article•10.1016/J.DATAK.2017.07.008•

Social emotion classification based on noise-aware training

Xin Li¹, Yanghui Rao¹, Haoran Xie², Xuebo Liu¹, Tak-Lam Wong², Fu Lee Wang³•Institutions (3)

Sun Yat-sen University¹, University of Hong Kong², Caritas Institute of Higher Education³

21 Jul 2017

TL;DR: This work proposes a new architecture named PCNN, which utilizes two cascading convolutional layers to model the word-phrase relation and the phrase-sentence relation, and presents a Bayesian-based model named WMCM to learn document-level semantic features.

Abstract: Social emotion classification has drawn the attention of many natural language processing researchers in recent years, since analyzing user-generated emotional documents on the Web is quite useful in recommending products, gathering public opinions, and predicting election results. However, the documents that evoke prominent social emotions are usually mixed with noisy instances, and it is also challenging to capture the textual meaning of short messages. In this work, we focus on reducing the impact of noisy instances and learning a better representation of sentences. For the former, we introduce an “emotional concentration” indicator, which is derived from emotional ratings, to weight documents. For the latter, we propose a new architecture named PCNN, which utilizes two cascading convolutional layers to model the word-phrase relation and the phrase-sentence relation. This model regards continuous tokens as phrases, based on the assumption that neighboring words are very likely to have internal relations, and semantic feature vectors are generated based on the phrase representation. We also present a Bayesian-based model named WMCM to learn document-level semantic features. Both PCNN and WMCM classify social emotions by capturing semantic regularities in language. Experiments on two real-world datasets indicate that the quality of learned semantic vectors and the performance of social emotion classification can be improved by our models.

Journal Article•10.1016/J.DATAK.2017.08.001•

Graph based knowledge discovery using MapReduce and SUBDUE algorithm

Sirisha Velampalli¹, Murthy V. Jonnalagedda¹•Institutions (1)

University College of Engineering¹

1 Sep 2017

TL;DR: This work aims to show how skills data from resumes can be modelled as a variant of the graph data structure called a conceptual graph using the MapReduce programming model; the proposed approach is able to extract common skill-sets.

Abstract: Knowledge discovery is the process of extracting useful and hidden information. Extracting knowledge from data represented in the form of graphs is an emerging area. Graphs are used to model and solve many real-world problems. In this work, we aim to show how skills data from resumes can be modelled as a variant of the graph data structure called a conceptual graph using the MapReduce programming model. Resumes are taken as the data source because they contain the skill-sets of candidates. Initial storage and pre-processing are done in a big data framework using the Hadoop Distributed File System (HDFS) and MapReduce. SUBstructure Discovery Using Examples (SUBDUE), a popular graph mining algorithm, is used for retrieving common skill-sets. The results obtained from a real-world dataset of resumes clearly demonstrate the potential of graph mining algorithms in skill-set analytics. The proposed approach is able to extract common skill-sets. Common skill-set extraction is useful for course curriculum designers as well as job seekers.

Journal Article•10.1016/J.DATAK.2017.09.002•

QETL: An approach to on-demand ETL from non-owned data sources

Lorenzo Baldacci¹, Matteo Golfarelli¹, Simone Graziani¹, Stefano Rizzi¹•Institutions (1)

University of Bologna¹

1 Nov 2017

TL;DR: The Query-Extract-Transform-Load (QETL) paradigm is proposed to feed a multidimensional cube on demand; experimental tests show that QETL effectively reuses data to cut extraction costs, leading to significant performance improvements.

Abstract: In traditional OLAP systems, the ETL process loads all available data in the data warehouse before users start querying them. In some cases, this may be either inconvenient (because data are supplied from a provider for a fee) or unfeasible (because of their size); on the other hand, directly launching each analysis query on source data would not enable data reuse, leading to poor performance and high costs. The alternative investigated in this paper is that of fetching and storing data on-demand, i.e., as they are needed during the analysis process. In this direction we propose the Query-Extract-Transform-Load (QETL) paradigm to feed a multidimensional cube; the idea is to fetch facts from the source data provider, load them into the cube only when they are needed to answer some OLAP query, and drop them when some free space is needed to load other facts. Remarkably, QETL includes an optimization step to cheaply extract the required data based on the specific features of the data provider. The experimental tests, made on a real case study in the genomics area, show that QETL effectively reuses data to cut extraction costs, thus leading to significant performance improvements.
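
The fetch, reuse, and drop cycle described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `OnDemandFactCache`, the `fetch` callback, the chunk keys, and the least-recently-used eviction rule are all assumptions made for illustration, since the abstract does not say how QETL chooses which facts to drop or how it optimizes extraction.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class OnDemandFactCache:
    """Hypothetical sketch: load fact chunks only when a query needs them."""

    def __init__(self, fetch: Callable[[Hashable], List[dict]], capacity: int):
        self._fetch = fetch            # stand-in for the (possibly paid) data provider
        self._capacity = capacity      # max number of chunks kept in the "cube"
        self._chunks: "OrderedDict[Hashable, List[dict]]" = OrderedDict()
        self.fetch_count = 0           # how many times the provider was contacted

    def get(self, key: Hashable) -> List[dict]:
        if key in self._chunks:
            # Reuse: the facts are already loaded, so no extraction cost is paid.
            self._chunks.move_to_end(key)
            return self._chunks[key]
        # Miss: extract the facts from the provider and load them.
        facts = self._fetch(key)
        self.fetch_count += 1
        self._chunks[key] = facts
        if len(self._chunks) > self._capacity:
            # Drop the least recently used chunk to free space (assumed policy).
            self._chunks.popitem(last=False)
        return facts


def fetch_from_provider(region: str) -> List[dict]:
    """Hypothetical provider call returning facts for one region."""
    return [{"region": region, "measure": len(region)}]


cache = OnDemandFactCache(fetch_from_provider, capacity=2)
for region in ["chr1", "chr2", "chr1", "chr3", "chr1"]:
    cache.get(region)
print("provider calls:", cache.fetch_count)
```

Repeated queries over the same region are answered from the loaded facts, so the number of provider calls stays below the number of queries.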

Journal Article•10.1016/J.DATAK.2017.07.003•

Automatically classifying source code using tree-based approaches

Anh Viet Phan¹, Phuong Ngoc Chau, Minh-Le Nguyen, Lam Thu Bui¹•Institutions (1)

Le Quy Don Technical University¹

27 Jul 2017

TL;DR: This paper proposes two models that combine a tree-based convolutional neural network with k-Nearest Neighbors and with support vector machines, exploiting both the structural and semantic information in programs' abstract syntax trees (ASTs), instead of software metrics, to solve software engineering problems.

Abstract: Analyzing source code to solve software engineering problems such as fault prediction and cost and effort estimation has long received attention from researchers as well as companies. The traditional approaches are based on machine learning and on software metrics obtained by computing standard measures of software projects. However, these methods face many challenges because software metrics are not enough to capture the complexity of programs. To overcome these limitations, this paper aims to solve software engineering problems by exploring the information in programs' abstract syntax trees (ASTs) instead of software metrics. We propose two models that combine a tree-based convolutional neural network (TBCNN) with k-Nearest Neighbors (kNN) and with support vector machines (SVMs) to exploit both the structural and semantic information of ASTs. In addition, to deal with the high-dimensional data of ASTs, we present several tree pruning techniques, which not only reduce the complexity of the data but also enhance the performance of classifiers in terms of computational time and accuracy. We survey many machine learning algorithms on different types of program representations, including software metrics, sequences, and tree structures. The approaches are evaluated by classifying 52,000 programs written in C into 104 target labels. The experiments show that the tree-based classifiers achieve much higher performance than the metrics-based or sequence-based ones, and that the two proposed models, TBCNN + SVM and TBCNN + kNN, rank as the first and second best classifiers. Pruning redundant AST branches leads not only to a substantial reduction in execution time but also to an increase in accuracy.

Journal Article•10.1016/J.DATAK.2017.06.006•

A natural language interface to a graph-based bibliographic information retrieval system

Yongjun Zhu¹, Erjia Yan¹, Il-Yeol Song¹•Institutions (1)

Drexel University¹

1 Sep 2017

TL;DR: This paper proposes a novel customized natural language processing framework that integrates a few original algorithms/heuristics for interpreting and analyzing bibliographic queries and shows that the proposed framework and natural language interface provide a practical solution for building real-world bibliographic information retrieval systems.

Abstract: With the ever-increasing volume of scientific literature, there is a need for a natural language interface to bibliographic information retrieval systems to retrieve relevant information effectively. In this paper, we propose one such interface, NLI-GIBIR, which allows users to search for a variety of bibliographic data through natural language. NLI-GIBIR makes use of a novel framework applicable to graph-based bibliographic information retrieval systems in general. This framework incorporates algorithms/heuristics for interpreting and analyzing natural language bibliographic queries via a series of text- and linguistic-based techniques, including tokenization, named entity recognition, and syntactic analysis. We find that our framework, as implemented in NLI-GIBIR, can effectively represent and address complex bibliographic information needs. Thus, the contributions of this paper are as follows: First, to our knowledge, it is the first attempt to propose a natural language interface for graph-based bibliographic information retrieval. Second, we propose a novel customized natural language processing framework that integrates a few original algorithms/heuristics for interpreting and analyzing bibliographic queries. Third, we show that the proposed framework and natural language interface provide a practical solution for building real-world bibliographic information retrieval systems. Our experimental results show that the presented system can correctly answer 39 out of 40 example natural language queries with varying lengths and complexities.

Journal Article•10.1016/J.DATAK.2017.10.002•

Multi-View Fuzzy Information Fusion in Collaborative Filtering Recommender Systems: Application to the Urban Resilience Domain

Iván Palomares¹, Fiona Browne², Peadar Davis²•Institutions (2)

University of Bristol¹, Ulster University²

24 Oct 2017

TL;DR: A hybrid framework which combines a collaborative filtering recommendation system with fuzzy decision-making approaches (based on the use of aggregation functions) to improve the accuracy of domain-specific recommendations is proposed.

Abstract: Recommender systems play an increasingly important role in on-line web services for the personalization and recommendation of content to individual users. The quantity and quality of user-based information have grown, presenting the opportunity to further tailor recommendations to users based on feature view integration. In this work, we propose a hybrid framework which combines a collaborative filtering recommendation system with fuzzy decision-making approaches (based on the use of aggregation functions) to improve the accuracy of domain-specific recommendations. We extend the classical, neighborhood-based collaborative filtering process by conflating preference information with user-profile data in the recommendation process. This is performed using intelligent information fusion techniques whereby Ordered Weighted Averaging (OWA) operators and uninorm aggregation functions are implemented in the fusion of multiple views of pairwise similarity degrees between users. To address the difficulty of generating sensible recommendations for cold users, we incorporate a novel weighting scheme based on fuzzy set modeling within the uninorm-based aggregation of similarity views. We finally outline the application of the proposed approach through an empirical study based in the Urban Resilience domain, along with an example in movie recommendation.
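
As a rough illustration of the OWA operator mentioned in the abstract, the sketch below fuses several similarity views into one degree. It shows only the textbook OWA definition (weights attach to rank positions, not to particular views); the function name `owa`, the example values, and the weight vector are assumptions for illustration, and the paper's uninorm aggregation and fuzzy weighting scheme are not reproduced here.

```python
def owa(values, weights):
    """Ordered Weighted Averaging: weights apply to ranks, not to sources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)  # largest value gets the first weight
    return sum(w * v for w, v in zip(weights, ordered))


# Hypothetical similarity degrees between two users, one per view
views = [0.5, 1.0, 0.0]
fused = owa(views, [0.5, 0.25, 0.25])
```

With the weight vector `[1, 0, 0]` the operator reduces to the maximum of the views, and with `[0, 0, 1]` to the minimum, which is why OWA is often described as spanning the range between "or-like" and "and-like" aggregation.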

Journal Article•10.1016/J.DATAK.2016.12.002•

Improving the efficiency of NSGA-II based ontology aligning technology

Xingsi Xue¹, Xingsi Xue², Yuping Wang²•Institutions (2)

Fujian University of Technology¹, Xidian University²

1 Mar 2017

TL;DR: The experimental results show that, compared with using NSGA-II alone, the Dynamic Alignment Candidates Selection Strategy and the Metamodel greatly reduce the time and main memory consumption of the tuning process while still ensuring the correctness and completeness of the alignments.

Abstract: There is evidence from the Ontology Alignment Evaluation Initiative (OAEI) that ontology matchers do not necessarily find the same correct correspondences. Therefore, several competing matchers are usually applied to the same pair of entities in order to increase evidence towards a potential match or mismatch. How to select the proper matchers' alignments and efficiently tune them has become one of the challenges in the ontology matching domain. To this end, we propose using the Dynamic Alignment Candidates Selection Strategy and a Metamodel to raise the efficiency of using NSGA-II to optimize the ontology alignment, by prescreening, respectively, the less promising aligning results to be combined and the individuals to be evaluated in NSGA-II. The experimental results show that, compared with using NSGA-II alone, the Dynamic Alignment Candidates Selection Strategy and the Metamodel greatly reduce the time and main memory consumption of the tuning process while still ensuring the correctness and completeness of the alignments. Moreover, our proposal is also more efficient than the state-of-the-art ontology aligning systems.
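
NSGA-II ranks candidate solutions by Pareto dominance over several objectives. The sketch below shows only that dominance test and the extraction of the first non-dominated front, assuming two objectives to be maximized; the objective pairs are made-up values, and the paper's candidate selection strategy and metamodel are not represented.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective 1, objective 2) scores of five candidate alignments
candidates = [(0.9, 0.5), (0.7, 0.7), (0.6, 0.6), (0.5, 0.9), (0.9, 0.4)]
front = pareto_front(candidates)
```

This quadratic scan is the simplest way to obtain the first front; NSGA-II proper uses a faster non-dominated sort and adds crowding distance to keep the front diverse.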

Journal Article•10.1016/J.DATAK.2017.03.010•

Ensuring the canonicity of process models

Henrik Leopold¹, Fabian Pittke, Jan Mendling•Institutions (1)

VU University Amsterdam¹

1 Sep 2017

TL;DR: The notion of canonicity is introduced to prevent the mixing of natural language and modeling language in process models and is used to define automated techniques for detecting and refactoring activities that do not comply with it.

Abstract: Process models play an important role in specifying requirements of business-related software. However, the usefulness of process models is highly dependent on their quality. Recognizing this, researchers have proposed various techniques for the automated quality assurance of process models. A considerable shortcoming of these techniques is the assumption that each activity label consistently refers to a single stream of action. If, however, activities textually describe control-flow-related aspects such as decisions or conditions, the analysis results of these tools are distorted. Due to the ambiguity that is associated with this misuse of natural language, humans also struggle to draw valid conclusions from such inconsistently specified activities. In this paper, we therefore introduce the notion of canonicity to prevent the mixing of natural language and modeling language. We identify and formalize non-canonical patterns, which we then use to define automated techniques for detecting and refactoring activities that do not comply with it. We evaluated these techniques with the help of four process model collections from industry, which confirmed the applicability and accuracy of these techniques.

Journal Article•10.1016/J.DATAK.2016.12.003•

Producing relevant interests from social networks by mining users' tagging behaviour

Manel Mezghani, André Péninou¹, Corinne Amel Zayani, Ikram Amous, Florence Sèdes¹•Institutions (1)

University of Toulouse¹

1 Mar 2017

TL;DR: The originality of the approach is based on the proposal of a new technique of interests' detection by analysing the accuracy of the tagging behaviour of a user in order to figure out the tags which really reflect the content of the resources.

Abstract: Social media provide an environment of information exchange. They principally rely on their users to create content, to annotate others' content and to form on-line relationships. User activities in this environment reflect the user's opinions, interests, etc. We focus on analysing this social environment to detect user interests, which are the key elements for improving adaptation. This choice is motivated by the lack of information in the user profile and the inefficiency of the information issued from methods that analyse classic user behaviour (e.g. navigation, time spent on a web page, etc.). So, when coping with an incomplete user profile, the user's social network can be an important data source for detecting user interests. The originality of our approach is based on the proposal of a new technique of interest detection that analyses the accuracy of a user's tagging behaviour in order to figure out the tags which really reflect the content of the resources. So, these tags are somehow comprehensible and can avoid the tag ambiguity usually associated with these social annotations. The approach combines the tag, user and resource in a way that guarantees relevant interest detection. The proposed approach has been tested and evaluated on the Delicious social database. For the evaluation, we compare the result issued from our approach using the tagging behaviour of the neighbours (the egocentric network and the communities) with the information already known for the user (the user's profile). A comparative evaluation with the classical tag-based method of interest detection shows that the proposed approach is better.

Journal Article•10.1016/J.DATAK.2017.06.005•

Ontology-based modeling and querying of trajectory data

Marwa Manaa¹, Jalel Akaichi¹•Institutions (1)

Tunis University¹

1 Sep 2017

TL;DR: An ontology-based trajectory pivot model is presented that covers common structures encountered in trajectories associated with links to application and geographic modules that is intended to reduce structural heterogeneity among sources and to specify the semantics of concepts in an unambiguous way.

Abstract: With the evolution of location-sensing devices and associated technologies, mobility-data-driven scientific discovery approaches have become an important paradigm for advanced computing performed in various central areas, e.g., the Internet of Things and social networks. Under this paradigm, trajectory data is considered a core resource revealing details of instantaneous behaviors of mobile entities. This creates the need to model and understand such behaviors, and has given rise to different modeling approaches using either conceptual modeling or ontologies. Modeling and querying of trajectory data are still challenging because of their structural and semantic heterogeneities, and due to the complexity of establishing choices about the domain's consensual knowledge. Ontologies are promising solutions for the above two problems, since they are intended to reduce structural heterogeneity among sources and to specify the semantics of concepts in an unambiguous way. In this paper, we propose a framework for semantics-oriented modeling and querying of trajectory data. We present an ontology-based trajectory pivot model that covers common structures encountered in trajectories associated with links to application and geographic modules. We validate our proposal through a case study dealing with human movement activity.

Journal Article•10.1016/J.DATAK.2017.09.003•

The Merkurion approach for similarity searching optimization in Database Management Systems

Marcos V. N. Bedo, Daniel S. Kaster, Agma J. M. Traina, Caetano Traina

23 Sep 2017

TL;DR: This article addresses a novel strategy that extends the query optimizer of any DBMS, so that it can also perform both logical and physical query plan optimizations in searches that include similarity predicates.

Abstract: Modern Database Management Systems (DBMSs) retrieve songs that resemble those in a music dataset, identify plagiarism in a set of documents, or provide past cases to physicians by taking into account the characteristics of a query exam. All such tasks require the comparison of data by similarity, which can be expressed in terms of distance-based queries in metric spaces. Traditional query processing relies mostly on histograms for describing the data distribution space and choosing a data retrieval path that quickly leads to the answer, discarding comparisons of most unwanted data. However, DBMSs still lack adequate support for selectivity estimation of query operators for data types embedded in metric spaces. This article addresses a novel strategy that extends the query optimizer of a DBMS, so that it can also perform both logical and physical query plan optimizations in searches that include similarity predicates. The proposal, named Merkurion, updates the concept of Data Distribution Space and captures data distributions according to the distances between the elements within a dataset. Moreover, it employs concise representations of such distributions, called synopses, for the definition of rules that enable similarity searching optimization. An extensive evaluation of Merkurion in real-world datasets has proven its effectiveness and broad applicability to many data domains.
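
To make the idea of capturing a data distribution through distances more concrete, the sketch below estimates the selectivity of a range query from a sorted list of pairwise distances in a sample. It is a generic distance-distribution estimate under the assumption that the distribution seen from a query element resembles the overall one; the names `distance_synopsis` and `estimate_selectivity` are hypothetical and this is not Merkurion's actual synopsis structure.

```python
from bisect import bisect_right
from itertools import combinations


def distance_synopsis(sample, dist):
    """Sorted pairwise distances within a sample of the dataset."""
    return sorted(dist(a, b) for a, b in combinations(sample, 2))


def estimate_selectivity(synopsis, radius):
    """Estimated fraction of elements lying within `radius` of a query element."""
    if not synopsis:
        return 0.0
    return bisect_right(synopsis, radius) / len(synopsis)


# Toy one-dimensional sample with absolute difference as the metric
sample = [0.0, 1.0, 2.0, 4.0]
synopsis = distance_synopsis(sample, lambda a, b: abs(a - b))
```

An optimizer could compare such an estimate across predicates to decide which one to evaluate first; a production synopsis would be a compact histogram, not the full list of distances.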

Journal Article•10.1016/J.DATAK.2017.12.002•

Annotation paths for matching XML-Schemas

Julius Köpke¹•Institutions (1)

Alpen-Adria-Universität Klagenfurt¹

1 Dec 2017

TL;DR: This work provides a comprehensive evaluation of the annotation method and the proposed matching algorithms using real-world schemas and reference ontologies and demonstrates the feasibility of generating executable mappings using a state of the art mapping system.

Abstract: Annotation paths are a technique for the semantic annotation of XML-Schemas. The design rationale was to develop an embedded annotation method on top of SAWSDL which is fully declarative, easily applicable and still provides the proper expressiveness for high-quality logic-based schema matching. Annotation paths capture significantly more semantics than plain model references, the declarative annotation method of the W3C standard SAWSDL. While the concept of annotation paths was introduced in earlier works, we provide a new formalization of their structure and based thereon define their semantics and introduce matching methods to derive simple and complex value correspondences. Such correspondences can be used for the generation of executable schema mappings using state of the art mapping tools. We provide a comprehensive evaluation of our annotation method and the proposed matching algorithms using real-world schemas and reference ontologies and demonstrate the feasibility of generating executable mappings using a state of the art mapping system. Our evaluations show that our annotation-based matcher achieves outstanding matching quality (avg. f-measure between 0.98 and 1.0).

Journal Article•10.1016/J.DATAK.2017.02.001•

Constructing target-aware results for keyword search on knowledge graphs

Yi Shan¹, Mingda Li², Yi Chen²•Institutions (2)

Electronic Arts¹, New Jersey Institute of Technology²

1 Jul 2017

TL;DR: This paper uses information theory and develops a general probability model to infer search targets by analyzing return specifiers, modifiers, relatedness relationships, and query keywords' information gain, and proposes two important properties for a target-aware result: atomicity and intactness.

Abstract: Existing work on processing keyword searches on graph data focuses on the efficiency of result generation. However, being oblivious to user search intention, a query result may contain multiple instances of the user's search target, and multiple query results may contain information for the same instance of the user's search target. With the misalignment between query results and search targets, a ranking function is unable to effectively rank the instances of search targets. In this paper we propose the concept of target-aware query results driven by inferred user search intention. We leverage information theory and develop a general probability model to infer search targets by analyzing return specifiers, modifiers, relatedness relationships, and query keywords' information gain. Then we propose two important properties for a target-aware result: atomicity and intactness. We develop techniques to efficiently generate target-aware results. Extensive experimental evaluation shows the effectiveness and efficiency of our approach.

Journal Article•10.1016/J.DATAK.2017.03.007•

Mining task post-conditions: Automating the acquisition of process semantics

Metta Santiputri¹, Aditya Ghose¹, Hoa Khanh Dam¹•Institutions (1)

Information Technology University¹

1 May 2017

TL;DR: This paper presents a data-driven approach to mining and validating semantic annotations (and specifically context-independent semantic annotations) and presents an empirical evaluation, which suggests that the approach provides generally reliable results.

Abstract: Semantic annotation of business process models in business process designs has been addressed in a large and growing body of work, but these annotations can be difficult and expensive to acquire. This paper presents a data-driven approach to mining and validating these annotations (and specifically context-independent semantic annotations). We leverage event objects in process execution histories which describe both activity execution events (typically represented as process events) and state update events (represented as object state transition events). We present an empirical evaluation, which suggests that the approach provides generally reliable results.

Journal Article•10.1016/J.DATAK.2017.08.002•

Thematic ranking of object summaries for keyword search

Georgios John Fakas¹, Yilun Cai, Zhi Cai², Nikos Mamoulis³•Institutions (3)

Uppsala University¹, Beijing University of Technology², University of Ioannina³

1 Oct 2017

TL;DR: This paper argues that the effective thematic ranking of OSs should gracefully combine IR-style properties, authoritative ranking and affinity, proposes an algorithm that computes the join efficiently by taking advantage of appropriate count statistics, and compares it with baseline approaches.

Abstract: An Object Summary (OS) is a tree structure of tuples that summarizes the context of a particular Data Subject (DS) tuple. The OS has been used as a model of keyword search in relational databases, where, given a set of keywords, the objective is to identify the DS tuples relevant to the keywords and their corresponding OSs. However, a query may return a large number of OSs, which raises the issue of effectively and efficiently ranking them in order to present only the most important ones to the user. In this paper, we propose a model that ranks OSs containing a set of identifying keywords (e.g., Chen) according to their relevance to a set of thematic keywords (e.g., Mining). We argue that the effective thematic ranking of OSs should gracefully combine IR-style properties, authoritative ranking and affinity. Our ranking problem is modeled and solved as a top-k group-by join; we propose an algorithm that computes the join efficiently, taking advantage of appropriate count statistics, and compare it with baseline approaches. An experimental evaluation on the DBLP and TPC-H databases verifies the effectiveness and efficiency of our proposal.
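
The shape of a top-k group-by query can be sketched with a naive baseline: aggregate a score per group, then keep the k best groups. The code below assumes that each row carries a group identifier (standing in for an OS) and a relevance score, and that scores are summed; these are illustrative assumptions, and the paper's efficient join algorithm and count statistics are not shown.

```python
from collections import defaultdict
from heapq import nlargest


def top_k_groups(rows, k):
    """Sum scores per group and return the k highest-scoring (group, total) pairs."""
    totals = defaultdict(float)
    for group, score in rows:
        totals[group] += score
    return nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical (object summary id, thematic score of one tuple) rows
rows = [("os1", 0.5), ("os2", 0.25), ("os1", 0.25), ("os3", 1.0), ("os2", 0.125)]
best = top_k_groups(rows, 2)
```

This baseline touches every row before ranking; the point of a dedicated top-k algorithm is to stop early once the remaining rows can no longer change the top k.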

Journal Article•10.1016/J.DATAK.2017.03.003•

Planning runtime software adaptation through pragmatic goal model

Felipe Pontes Guimaraes¹, Genaína Nunes Rodrigues², Raian Ali³, Daniel Macedo Batista¹•Institutions (3)

University of São Paulo¹, University of Brasília², Bournemouth University³

1 May 2017

TL;DR: This paper argues the case for pragmatic requirements and extends the CGM with additional constructs to capture them and allow their analysis, and develops an automated analysis which aids the planning and scheduling of task execution to meet pragmatic goals.

Abstract: Adaptivity is a capability that enables a system to choose amongst various alternatives to satisfy or maintain the satisfaction of certain requirements. The criteria of requirements satisfaction could be pragmatic and context-dependent. Contextual Goal Models (CGM) capture the power of context in banning or allowing certain alternatives to reach requirements (goals) and also in deciding the quality of those alternatives with regard to certain quality measures (softgoals). The CGM is used to depict facets of the decision-making strategy and rationale of an adaptive system at the preliminary level of requirements. In this paper we argue the case for pragmatic requirements and extend the CGM with additional constructs to capture them and allow their analysis. We also develop an automated analysis which aids the planning and scheduling of task execution to meet pragmatic goals. Moreover, we evaluate our modelling and analysis regarding correctness and performance. This evaluation showed the applicability of the approach and its usefulness in aiding sensible decisions. It also showed its capability to do so in a time short enough to suit run-time adaptation decision making.

Journal Article•10.1016/J.DATAK.2017.06.002•

A time-dependent model with speed windows for share-a-ride problems: A case study for Tokyo transportation

Phan-Thuan Do¹, Nguyen-Viet-Dung Nghiem¹, Ngoc-Quang Nguyen¹, Quang Dung Pham¹•Institutions (1)

Hanoi University of Science and Technology¹

15 Jun 2017

TL;DR: A new fully time-dependent model of a public transportation system in the urban context that allows sharing a taxi between one passenger and parcels, taking speed windows into consideration, is introduced and presented as a mathematical formulation.

Abstract: This paper introduces a new fully time-dependent model of a public transportation system in the urban context that allows sharing a taxi between one passenger and parcels, taking speed windows into consideration. The model contains many real-life case features and is presented as a mathematical formulation. We study both static and dynamic scenarios in comparison to traditional strategies, i.e., the direct delivery model. Moreover, we classify speed windows by different zones and congestion levels during a day in the urban context. Different speed windows induce a dynamic graph model for road networks and make the problem much more difficult to solve. Because the model is complex, the preprocessing steps on the data as well as on the dynamic graphs are very important. We use a greedy algorithm to build an initial solution and then use local search techniques to improve the solution quality. The experimental data set was recorded by the Tokyo-Musen Taxi company. The data set includes more than 20000 requests per day, more than 4500 used taxis per day and more than 130000 crossing points on the Tokyo map. Experimental results are analyzed on various factors such as the total benefit, the accumulated traveling time during the day, the number of used taxis and the number of shared requests.

Journal Article•10.1016/J.DATAK.2017.10.001•

Location disclosure risks of releasing trajectory distances

Emre Kaplan¹, Mehmet Emre Gursoy², Mehmet Ercan Nergiz, Yucel Saygin¹•Institutions (2)

Sabancı University¹, Georgia Institute of Technology²

16 Oct 2017

TL;DR: This work devises an attack that yields, with high confidence, the locations which a private trajectory has visited, given a set of known trajectories and their distances to the private, unknown trajectory.

Abstract: Location tracking devices enable trajectories to be collected for new services and applications such as vehicle tracking and fleet management. While trajectory data is a lucrative source for data analytics, it also contains sensitive and commercially critical information. This has led to the development of systems that enable privacy-preserving computation over trajectory databases, but many such systems in fact (directly or indirectly) allow an adversary to compute the distance (or similarity) between two trajectories. We show that the use of such systems raises privacy concerns when the adversary has a set of known trajectories. Specifically, given a set of known trajectories and their distances to a private, unknown trajectory, we devise an attack that yields the locations which the private trajectory has visited, with high confidence. The attack can be used to disclose both positive results (i.e., the victim has visited a certain location) and negative results (i.e., the victim has not visited a certain location). Experiments on real and synthetic datasets demonstrate the accuracy of our attack.
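
Why released distances leak location can be seen in a deliberately simplified setting: a single secret point on a small grid, with Euclidean distances to a few known points. The sketch below keeps only the candidate locations consistent with every released distance. The grid, the points, and the function name `consistent_locations` are assumptions for illustration; the paper's attack works on whole trajectories and is not reproduced here.

```python
from math import dist, isclose


def consistent_locations(candidates, known_points, released_distances, tol=1e-9):
    """Candidates whose distance to every known point matches the released value."""
    return [
        c
        for c in candidates
        if all(
            isclose(dist(c, k), d, abs_tol=tol)
            for k, d in zip(known_points, released_distances)
        )
    ]


# Toy setting: the adversary knows three points and their distances to a secret point
grid = [(x, y) for x in range(5) for y in range(5)]
secret = (3, 1)
known = [(0, 0), (4, 0), (0, 4)]
released = [dist(secret, k) for k in known]
remaining = consistent_locations(grid, known, released)
```

With one known point two grid cells remain possible; with all three, only the secret cell survives, which mirrors how each additional known trajectory narrows the adversary's uncertainty.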
