Top 58 papers presented at Data and Knowledge Engineering in 2018

Journal Article•10.1016/J.DATAK.2018.08.003•

Leveraging social media news to predict stock index movement using RNN-boost

Weiling Chen¹, Chai Kiat Yeo¹, Chiew Tong Lau¹, Bu-Sung Lee¹•Institutions (1)

1 Nov 2018

TL;DR: This paper carefully selects official accounts from China’s largest online social network, Sina Weibo, and analyzes the news content crawled from these accounts by extracting sentiment features and Latent Dirichlet allocation (LDA) features, which are fed into a novel hybrid model called RNN-boost to predict stock volatility in the Chinese stock market.

Abstract: News from traditional media has been used to facilitate the prediction of stock movement for a long time. However, in recent times, online social networks (OSN) have played an increasingly significant role as a platform for information sharing. News content posted on these OSN provides very useful insight into public moods. In this paper, we carefully select official accounts from China’s largest online social network, Sina Weibo, and analyze the news content crawled from these accounts by extracting sentiment features and Latent Dirichlet allocation (LDA) features. We then input these features together with technical indicators into a novel hybrid model called RNN-boost to predict the stock volatility in the Chinese stock market. The Shanghai-Shenzhen 300 Stock Index (HS300) is the use case for this research. Experimental results show that our model outperforms other prevalent methods and can achieve a good prediction performance.

103 citations

Journal Article•10.1016/J.DATAK.2018.04.006•

Business-driven data analytics: A conceptual modeling framework

Soroosh Nalchigar¹, Eric Yu¹•Institutions (1)

University of Toronto¹

1 Sep 2018

TL;DR: A modeling framework for requirements analysis and design of data analytics systems that consists of three complementary modeling views (business view, analytics design view, and data preparation view); three case studies suggest that the framework provides an adequate set of concepts to support the design and implementation of analytics solutions.

Abstract: The effective development of advanced data analytics solutions requires tackling challenges such as eliciting analytical requirements, designing the machine learning solution, and ensuring the alignment between analytics initiatives and business strategies, among others. The use of conceptual modeling methods and techniques is seen to be of considerable value in overcoming such challenges. This paper proposes a modeling framework (including a set of metamodels and a set of design catalogues) for requirements analysis and design of data analytics systems. It consists of three complementary modeling views: business view, analytics design view, and data preparation view. These views are linked together to connect enterprise strategies to analytics algorithms and to data preparation activities. The framework includes a set of design catalogues that codify and represent an organized body of business analytics design knowledge. As the first attempt to validate the framework, three real-world data analytics case studies are used to illustrate the expressiveness and usability of the framework. Findings suggest that the framework provides an adequate set of concepts to support the design and implementation of analytics solutions.

61 citations

Journal Article•10.1016/J.DATAK.2018.04.001•

Sentiment analysis: an automatic contextual analysis and ensemble clustering approach and comparison

Murtadha Talib AL-Sharuee¹, Fei Liu¹, Mahardhika Pratama²•Institutions (2)

La Trobe University¹, Nanyang Technological University²

1 May 2018

TL;DR: This article describes a completely automatic and unsupervised approach to sentiment analysis which can overcome the domain-dependency and the labelling cost problems and shows that the proposed nonrandom initialization of k-means yields a significant improvement compared to other algorithms.

Abstract: Product reviews are one of the most important resources to determine public sentiment. The existing literature on review sentiment analysis mostly utilizes supervised models, which usually suffer from domain-dependency and require expensive manual labelling effort to provide training data. This article addresses these issues by describing a completely automatic and unsupervised approach to sentiment analysis. The method consists of two phases, which are contextual analysis and unsupervised ensemble learning. In the implementation of both phases, a sentiment lexicon, SentiWordNet, is deployed. Using effective contextual procedures and modifying the base learning component (the k-means algorithm) results in developing a successful approach to sentiment analysis which can overcome the domain-dependency and the labelling cost problems. The results show that the proposed nonrandom initialization of k-means yields a significant improvement compared to other algorithms. In terms of accuracy and performance, the proposed method is effective compared to supervised and unsupervised approaches. We also introduce new sentiment analysis problems relating to Australian airlines and home builders which could be potential benchmark problems in the sentiment analysis field. Our experiments on datasets from different domains show that contextual analysis and the ensemble phases improve the clustering performance in terms of accuracy, stability and generalizability.

53 citations

Journal Article•10.1016/J.DATAK.2018.02.001•

Assessing the quality of domain ontologies: Metrics and an automated ranking system

Melinda McDaniel¹, Veda C. Storey², Vijayan Sugumaran³•Institutions (3)

Georgia Institute of Technology¹, Georgia State University², University of Rochester³

1 May 2018

TL;DR: Existing metrics for domain ontology evaluation are analyzed and extended to derive a Layered Ontology Metrics Suite based on semiotic theory, which is used to assess candidate domain ontologies' quality and suitability based upon a suite of metrics.

Abstract: The ability of a user to select an appropriate, high-quality domain ontology from a set of available options would be most useful in knowledge engineering and other intelligent applications. This capability, however, requires good quality assessment metrics as well as automated support when there is a large number of ontologies from which to make a selection. This research analyzes existing metrics for domain ontology evaluation and extends them to derive a Layered Ontology Metrics Suite based on semiotic theory. The metrics are implemented in a Domain Ontology Ranking System (DoORS) prototype, the purpose of which is to search an ontology library for specific terms to retrieve candidate domain ontologies and then assess their quality and suitability based upon the suite of metrics. The prototype system is compared to existing approaches to automated ontology quality ranking to illustrate the usefulness of the research.

39 citations

Journal Article•10.1016/J.DATAK.2018.02.002•

ON-SMMILE: Ontology Network-based Student Model for MultIple Learning Environments

Hector Yago¹, Julia Clemente¹, Daniel Rodriguez¹, Pedro Fernandez-de-Cordoba¹•Institutions (1)

University of Alcalá¹

1 May 2018

TL;DR: The aim of this work is to design and build methodologically, through ontological engineering, the ON-SMMILE model, to be used as support for future work closely linked to the supervision of students' learning as a competence-based recommender system.

Abstract: Currently, many educational researchers focus on the extraction of information about the learning progress to properly assist students. We present ON-SMMILE, a student-centered and flexible student model which is represented as an ontology network combining information related to (i) students and their knowledge state, (ii) assessments that rely on rubrics and different types of objectives, (iii) units of learning and (iv) information resources previously employed as support for the student model in an intelligent virtual environment for training/instruction and here extended. The aim of this work is to design and build methodologically, through ontological engineering, the ON-SMMILE model, to be used as support for future work closely linked to the supervision of students' learning as a competence-based recommender system. For this purpose, our model is designed as a set of ontological resources that have been extended, standardized, interrelated and adapted to be used in multiple learning environments. In this paper, we also analyze the available approaches based on instructional design which can be added to the ontology network to build the proposed model. As a case study, a chemical experiment in a virtual environment and its instantiation are described in terms of ON-SMMILE.

36 citations

Journal Article•10.1016/J.DATAK.2018.03.001•

A novel Multiple Attribute Decision Making approach based on interval data using U2P-Miner algorithm

Hêriş Golpîra¹•Institutions (1)

Islamic Azad University¹

1 May 2018

TL;DR: The effectiveness of the model is finally demonstrated through a numerical example while the broad comparative and sensitivity analysis further proves its validity and superiority.

Abstract: This paper aims to introduce a technique for order of preference using pattern mining based on the Decision Makers' (DMs') level of risk aversion. Although the model is essentially defined for the problem of supplier selection, it can be used to deal with almost any similar decision making problem. This novel Multiple Attribute Decision Making (MADM) model takes advantage of the U2P-Miner algorithm, the interval data weighting method, and the Linear Assignment Method (LAM). The key idea behind the method is to consider the attribute with more frequent patterns as the common attribute and to assign a smaller weight to it. Since the model handles interval data as input, it can be guaranteed that the model uses the detailed information and, therefore, the resulting weight factors are more realistic. The DMs' risk aversion level is also addressed in the model, which is necessary in real-life situations. Accordingly, the proposed decision making process depends directly on the DMs' attitude toward risk. It gives the DM the opportunity to make a decision in two ways: 1) based on the specified risk aversion level, 2) based on an integrated approach using LAM. The linearity of the LAM, by itself, enhances the scalability of the model. Moreover, the necessity of providing pairwise comparison judgments is completely eliminated in the model and, therefore, the reliability of the decision making is enhanced. The effectiveness of the model is finally demonstrated through a numerical example, while the broad comparative and sensitivity analysis further proves its validity and superiority.

35 citations

Journal Article•10.1016/J.DATAK.2018.02.004•

Using big data and network analysis to understand Wikipedia article quality

Jun Liu¹, Sudha Ram²•Institutions (2)

Dakota State University¹, University of Arizona²

1 May 2018

TL;DR: It is found that internal bonding interacts positively with external bridging resulting in a multiplier effect on article quality, which has implications for developing automated techniques for quality assessment of Wikipedia and also provides insights into improving quality of these articles.

Abstract: The research reported in this paper focuses on the question of why Wikipedia articles are different in quality. Since these articles are developed in an open and social environment, our work investigates if the social capital of contributors plays a role in determining the quality of the articles. We focus on three major types of social capital with respect to teams of contributors working on Wikipedia articles: internal bonding, external bridging and functional diversity. Through a social network analysis of these articles based on a dataset extracted from its edit history, our research finds that all three types of social capital have a significant impact on their quality. In addition, we found that internal bonding interacts positively with external bridging resulting in a multiplier effect on article quality. The findings of our research have implications for developing automated techniques for quality assessment of Wikipedia and also provide insights into improving quality of these articles.

34 citations

Journal Article•10.1016/J.DATAK.2018.07.007•

Full-fledged semantic indexing and querying model designed for seamless integration in legacy RDBMS

Joe Tekli¹, Richard Chbeir², Agma J. M. Traina³, Caetano Traina³, Kokou Yetongnon, Carlos Arturo Raymundo Ibañez⁴, Marc Al Assad¹, Christian Kallas¹•Institutions (4)

Lebanese American University¹, University of Pau and Pays de l'Adour², University of São Paulo³, Universidad Peruana de Ciencias Aplicadas⁴

1 Sep 2018

TL;DR: This paper addresses the problem of semantic-aware querying and provides a general framework for modeling and processing semantic-based keyword queries in textual databases, i.e., considering the lexical and semantic similarities/disparities when matching user query and data index terms.

Abstract: In the past decade, there has been an increasing need for semantic-aware data search and indexing in textual (structured and NoSQL) databases, as full-text search systems became available to non-experts. Such users have no knowledge about the data being searched and often formulate query keywords which are different from those used by the authors in indexing relevant documents, thus producing noisy and sometimes irrelevant results. In this paper, we address the problem of semantic-aware querying and provide a general framework for modeling and processing semantic-based keyword queries in textual databases, i.e., considering the lexical and semantic similarities/disparities when matching user query and data index terms. To do so, we design and construct a semantic-aware inverted index structure called SemIndex, extending the standard inverted index by constructing a tightly coupled inverted index graph that combines two main resources: a semantic network and a standard inverted index on a collection of textual data. We then provide a general keyword query model with specially tailored query processing algorithms built on top of SemIndex, in order to produce semantic-aware results, allowing the user to choose the results' semantic coverage and expressiveness based on her needs. To investigate the practicality and effectiveness of SemIndex, we discuss its physical design within a standard commercial RDBMS, which allows its graph structure to be created, stored, and queried, thus enabling the system to easily scale up and handle large volumes of data. We have conducted a battery of experiments to test the performance of SemIndex, evaluating its construction time, storage size, query processing time, and result quality, in comparison with a legacy inverted index. Results highlight both the effectiveness and scalability of our approach.
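
The idea of coupling an inverted index with a semantic network can be illustrated with a toy sketch. This is not the SemIndex structure or its query algorithms; the function names, the `related` map (standing in for a semantic network), and the `hops` parameter (standing in for the user-chosen semantic coverage) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def semantic_search(query, index, related, hops=1):
    """Expand a single-keyword query through `related` for up to `hops`
    steps, then look up every expanded term in the inverted index."""
    terms = {query.lower()}
    frontier = set(terms)
    for _ in range(hops):
        # Follow semantic links from the newest terms only.
        frontier = {n for t in frontier for n in related.get(t, ())} - terms
        terms |= frontier
    results = set()
    for term in terms:
        results |= index.get(term, set())
    return results

# Hypothetical data: three tiny documents and one synonym link.
docs = {1: "fast car", 2: "quick automobile", 3: "slow train"}
related = {"car": ["automobile"], "automobile": ["car"]}
index = build_inverted_index(docs)
print(semantic_search("car", index, related, hops=0))  # purely lexical match
print(semantic_search("car", index, related, hops=1))  # one semantic hop
```

With `hops=0` the lookup behaves like a plain inverted index; raising `hops` widens the semantic coverage at the cost of more index lookups.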

32 citations

Journal Article•10.1016/J.DATAK.2017.12.003•

A spreading activation-based label propagation algorithm for overlapping community detection in dynamic social networks

Mohammad Taghi Sattari¹, Kamran Zamanifar¹•Institutions (1)

University of Isfahan¹

1 Jan 2018

TL;DR: Experimental results on both real and synthetic networks show that all variations of the proposed method detect communities more accurately compared to the benchmark methods while they are slower than these methods.

Abstract: Community detection in temporal social networks is an increasingly challenging subject in network analysis. The Label Propagation Algorithm (LPA) is a simple and fast approach for community detection in dynamic networks. However, it tends to generate monster communities which decrease the accuracy of community detection, especially in dynamic social networks. In this paper, we propose a modified LPA, called Spreading Activation Label Propagation Algorithm in order to solve the problem. This method assigns a property, called activation value, to each label, where pairs (label name, activation value) are propagated by spreading activation process and the LPA. Furthermore, this algorithm uses two weighting algorithms, where each of them corresponds to one variation of the proposed method. Here, the variations of the proposed method and other available methods on real and synthetic networks are implemented. Experimental results on both real and synthetic networks show that all variations of the proposed method detect communities more accurately compared to the benchmark methods while they are slower than these methods.

27 citations

Journal Article•10.1016/J.DATAK.2018.05.002•

Privacy-preserving collaborative fuzzy clustering

Lingjuan Lyu¹, James C. Bezdek¹, Yee Wei Law², Xuanli He¹, Marimuthu Palaniswami¹•Institutions (2)

University of Melbourne¹, University of South Australia²

1 Jul 2018

TL;DR: Results using seven cluster validity indices, root mean squared error (RMSE) and accuracy ratio show that clustering results based on two-stage-perturbed data are comparable to the clustering results based on raw data, confirming the utility of the privacy-preserving scheme when used with either FCM or HCM.

Abstract: The proliferation of Internet of Things devices has contributed to the emergence of participatory sensing (PS), where multiple individuals collect and report their data to a third-party data mining cloud service for analysis. The need for the participants to collaborate with each other for this analysis gives rise to the concept of collaborative learning. However, the possibility of the cloud service being semi-honest poses a key challenge: preserving the participants' privacy. In this paper, we address this challenge with a two-stage scheme called RG+RP: in the first stage, each participant perturbs his/her data by passing the data through a nonlinear function called repeated Gompertz (RG); in the second stage, he/she then projects his/her perturbed data to a lower dimension in an (almost) distance-preserving manner, using a specific random projection (RP) matrix. The nonlinear RG function is designed to mitigate maximum a posteriori (MAP) estimation attacks, while random projection resists independent component analysis (ICA) attacks and ensures clustering accuracy. The proposed two-stage randomisation scheme is assessed in terms of its recovery resistance to MAP estimation attacks. Preliminary theoretical analysis as well as experimental results on synthetic and real-world datasets indicate that RG+RP has better recovery resistance to MAP estimation attacks than most state-of-the-art techniques. For clustering, fuzzy c-means (FCM) is used. Results using seven cluster validity indices, root mean squared error (RMSE) and accuracy ratio show that clustering results based on two-stage-perturbed data are comparable to the clustering results based on raw data — this confirms the utility of our privacy-preserving scheme when used with either FCM or HCM.
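
The second stage described above, an (almost) distance-preserving projection to a lower dimension, can be sketched with a generic Gaussian random projection. This is only an assumption-laden illustration: the abstract does not specify the paper's "specific" RP matrix or the parameters of the repeated Gompertz function, so the first stage is omitted and the names `random_projection_matrix`, `project`, and `euclidean` are hypothetical.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries.
    Scaling by 1/sqrt(k) keeps squared distances unbiased in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Project a d-dimensional vector x down to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical usage: two 300-dimensional records projected to 100 dimensions.
rng = random.Random(1)
x = [rng.random() for _ in range(300)]
y = [rng.random() for _ in range(300)]
R = random_projection_matrix(d=300, k=100)
ratio = euclidean(project(R, x), project(R, y)) / euclidean(x, y)
print(f"projected/original distance ratio: {ratio:.3f}")  # expected to be near 1
```

Because pairwise distances are roughly preserved, a distance-based clustering method such as fuzzy c-means can still be run on the projected data.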

27 citations

Journal Article•10.1016/J.DATAK.2018.02.003•

The landscape of smart aging: Topics, applications, and agenda

Il-Yeol Song¹, Min Song², Tatsawan Timakum², Su Ryeon Ryu², Hanju Lee²•Institutions (2)

Drexel University¹, Yonsei University²

1 May 2018

TL;DR: The results of the comprehensive literature review indicate that the discussions on smart aging in the scientific publications are by and large classified into the following three directions: Technologies, Aging Medical Care, and Behavior and Social.

Abstract: Smart aging is an emerging research topic that has a profound impact on society and the well-being of the aging population. To the best of our knowledge, there has been no systematic analysis of what research has been conducted on smart aging. Thus, there is no discussion of major issues and future directions of smart aging. In this paper, we provide an overview of smart aging in three ways: 1) to synthesize the components of smart aging based on the comprehensive literature review, 2) to examine the range of topics extracted from 3760 web pages and 3) to analyze the research activities on smart aging by conducting a content analysis of 4500 web pages of the NIH funded organizations' websites related to smart aging. The results of the comprehensive literature review indicate that the discussions on smart aging in the scientific publications are by and large classified into the following three directions: Technologies, Aging Medical Care, and Behavior and Social. In addition, the major topics from search engine datasets, which echo more general discussions from various parties, are related to entertainment programs and social media, along with medical science and innovation technologies, whereas the research activities of NIH funded organizations focused on cross-disciplinary research in Behavioral and Social science, and Medical Care.

Journal Article•10.1016/J.DATAK.2018.07.004•

Uncertain data classification with additive kernel support vector machine

Zongxia Xie¹,²,³, Yong Xu², Qinghua Hu³•Institutions (3)

Chinese Academy of Sciences¹, Harbin Institute of Technology², Tianjin University³

1 Sep 2018

TL;DR: This work introduces an efficient algorithm to compute the kernel functions, and solve the additive kernel SVMs, and shows the efficiency of additive-kernel SVMs in uncertain data classification.

Abstract: In this work, a classification learning algorithm is designed within the framework of support vector machines through modeling uncertain data with additive kernels, which are introduced to calculate the similarity between uncertain samples characterized by probability density functions (PDFs). The PDFs are used as features of the uncertain samples, where the value of a feature is not a single value, but a set of values that represent the probability distribution of the noise. This is different from the existing methods which represent an uncertain sample by a set of new samples around it, but use the farthest or nearest value in the distribution to construct the optimal hyperplane. With the properties of kernel functions, we can easily extend additive kernels to compute the similarity between samples described with multiple uncertain features. Furthermore, we introduce an efficient algorithm to compute the kernel functions, and solve the additive kernel SVMs. The experimental results show the efficiency of additive-kernel SVMs in uncertain data classification.

Journal Article•10.1016/J.DATAK.2018.04.008•

A probabilistic evaluation procedure for process model matching techniques

Elena Kuss¹, Henrik Leopold², Han van der Aa², Heiner Stuckenschmidt¹, Hajo A. Reijers²•Institutions (2)

University of Mannheim¹, VU University Amsterdam²

1 Sep 2018

TL;DR: A novel evaluation procedure is proposed that builds on the assessments of multiple annotators to define the notion of a non-binary gold standard and allows for more detailed insights into the performance of matching systems than a traditional evaluation based on a binary gold standard.

Abstract: Process model matching refers to the automatic identification of corresponding activities between two process models. It represents the basis for many advanced process model analysis techniques such as the identification of similar process parts or process model search. A central problem is how to evaluate the performance of process model matching techniques. Current evaluation methods require a binary gold standard that clearly defines which correspondences are correct. The problem is that often not even humans can agree on a set of correct correspondences. Hence, evaluating the performance of matching techniques based on a binary gold standard does not take the true complexity of the matching problem into account and does not fairly assess the capabilities of a matching technique. In this paper, we propose a novel evaluation procedure for process model matching techniques. In particular, we build on the assessments of multiple annotators to define the notion of a non-binary gold standard. In this way, we avoid the problem of agreeing on a single set of correct correspondences. Based on this non-binary gold standard, we introduce probabilistic versions of precision, recall, and F-measure as well as a distance-based performance measure. We use a dataset from the Process Model Matching Contest 2015 and a total of 16 matching systems to assess and compare the insights that can be obtained by using our evaluation procedure. We find that our probabilistic evaluation procedure allows us to gain more detailed insights into the performance of matching systems than a traditional evaluation based on a binary gold standard.
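
To make the idea of a non-binary gold standard concrete, the sketch below scores a matcher against per-correspondence support values (the fraction of annotators who accepted each correspondence). The formulas are one plausible formulation chosen for illustration, not necessarily the definitions used in the paper; the function name and the data are hypothetical.

```python
from typing import Dict, Set, Tuple

Correspondence = Tuple[str, str]

def probabilistic_scores(
    support: Dict[Correspondence, float],
    matched: Set[Correspondence],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) weighted by annotator support.

    Assumed weighting: a returned correspondence with support s counts as
    s of a true positive and (1 - s) of a false positive; a missed one
    counts as s of a false negative.
    """
    tp = sum(support.get(c, 0.0) for c in matched)
    fp = sum(1.0 - support.get(c, 0.0) for c in matched)
    fn = sum(s for c, s in support.items() if c not in matched)
    precision = tp / (tp + fp) if matched else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure

# Hypothetical example: support values from four annotators.
support = {("A", "X"): 1.0, ("B", "Y"): 0.5, ("C", "Z"): 0.25}
matched = {("A", "X"), ("B", "Y"), ("D", "W")}
print(probabilistic_scores(support, matched))
```

Under this weighting, a matcher is penalized less for returning a correspondence that annotators disagreed on than for returning one that nobody accepted.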

Journal Article•10.1016/J.DATAK.2018.05.003•

Hierarchical partitioning of the output space in multi-label data

Yannis Papanikolaou¹, Grigorios Tsoumakas¹, Ioannis Katakis²•Institutions (2)

Aristotle University of Thessaloniki¹, University of Nicosia²

1 Jul 2018

TL;DR: Hierarchy Of Multi-label classifiERs (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and then applying a given base MLC to the resulting sub-problems.

Abstract: Hierarchy Of Multi-label classifiERs (HOMER) is a multi-label learning algorithm that breaks the initial learning task into several easier sub-tasks by first constructing a hierarchy of labels from a given label set and secondly applying a given base multi-label classifier (MLC) to the resulting sub-problems. The primary goal is to effectively address class imbalance and scalability issues that often arise in real-world multi-label classification problems. In this work, we present the general setup for a HOMER model and a simple extension of the algorithm that is suited for MLCs that output rankings. Furthermore, we provide a detailed analysis of the properties of the algorithm, both from an aspect of effectiveness and computational complexity. A secondary contribution involves the presentation of a balanced variant of the k-means algorithm, which serves in the first step of the label hierarchy construction. We conduct extensive experiments on six real-world data sets, studying empirically HOMER's parameters and providing examples of instantiations of the algorithm with different clustering approaches and MLCs. The empirical results demonstrate a significant improvement over the given base MLC.
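
The label-hierarchy step can be sketched as a recursive, balanced split of the label set. This is a simplified illustration, not the paper's method: a round-robin split stands in for the balanced k-means clustering, the base classifiers are omitted, and the function names are hypothetical. The `meta_targets` helper shows the training signal commonly associated with such a hierarchy, where an instance is positive for a child node if it carries any label in that child's subset (an assumption here, since the abstract does not spell it out).

```python
from typing import Dict, List, Set

def build_label_hierarchy(labels: List[str], k: int = 2) -> Dict:
    """Recursively split `labels` into up to k near-equal groups.
    Round-robin assignment is a placeholder for balanced clustering."""
    if k < 2:
        raise ValueError("k must be at least 2")
    node = {"labels": set(labels), "children": []}
    if len(labels) > 1:
        groups = [labels[i::k] for i in range(k)]
        node["children"] = [build_label_hierarchy(g, k) for g in groups if g]
    return node

def meta_targets(node: Dict, instance_labels: Set[str]) -> List[bool]:
    """For each child, report whether the instance has any label in it."""
    return [bool(child["labels"] & instance_labels) for child in node["children"]]

# Hypothetical usage with four labels and a binary hierarchy.
tree = build_label_hierarchy(["a", "b", "c", "d"], k=2)
print([sorted(child["labels"]) for child in tree["children"]])
print(meta_targets(tree, {"c"}))
```

Each internal node then only has to decide among at most k children, which is what makes the sub-problems smaller than the original task.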

Journal Article•10.1016/J.DATAK.2018.09.002•

Experimental identification of hard data sets for classification and feature selection methods with insights on method selection

Cuiju Luan¹, Guozhu Dong²•Institutions (2)

Shanghai Maritime University¹, Wright State University²

1 Nov 2018

TL;DR: In this paper, the authors report an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods and rank methods separately for hard data sets and for easy data sets.

Abstract: The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-Measure value over 0.8; it is called easy otherwise. A total of 15 out of the 129 data sets were found to be hard in that sense. This paper also compares the performance of different methods, and it produces rankings of classification methods, separately on the hard data sets and on the easy data sets. This paper is the first to rank methods separately for hard data sets and for easy data sets. It turns out that the classifier rankings resulting from our experiments are somewhat different from those in the literature and hence they offer new insights on method selection. It should be noted that the Random Forest method remains the best in all groups of experiments.
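
The hard/easy rule stated in the abstract is simple enough to write down directly. The sketch below assumes each method combination is summarized as an `(auc, f_measure)` pair; the function name and the sample scores are hypothetical.

```python
from typing import Iterable, Tuple

def is_hard(results: Iterable[Tuple[float, float]], threshold: float = 0.8) -> bool:
    """A data set is hard if no method combination exceeds the threshold
    on AUC and none exceeds it on F-measure; otherwise it is easy."""
    return all(auc <= threshold and f1 <= threshold for auc, f1 in results)

# Hypothetical scores for three method combinations on two data sets.
print(is_hard([(0.71, 0.65), (0.79, 0.80), (0.60, 0.58)]))  # no score over 0.8
print(is_hard([(0.71, 0.65), (0.83, 0.70), (0.60, 0.58)]))  # one AUC over 0.8
```

Note that a single combination exceeding 0.8 on either measure is enough to make a data set easy under this definition.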

Journal Article•10.1016/J.DATAK.2018.07.005•

Improved suffix blocking for record linkage and entity resolution

Amin Allam¹, Spiros Skiadopoulos², Panos Kalnis¹•Institutions (2)

King Abdullah University of Science and Technology¹, University of Peloponnese²

1 Sep 2018

TL;DR: The non-incremental variation of record linkage is considered and a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method is presented.

Abstract: Record linkage is the problem that identifies the different records that represent the same real-world object. Entity resolution is the problem that ensures that a real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after the insertion of a new record in the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually utilized to address these problems in order to avoid comparing all record pairs. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we consider the non-incremental variation of record linkage and present a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that are able to handle the tested dataset in a few seconds (while a current state-of-the-art technique requires more than eight hours). Our second technique proposes a novel method that keeps a history of the deleted records and the merging process. Thus, we are able to discover alternative matches for the inserted record that are not possible for existing methods and improve the accuracy of the algorithm. We have implemented and extensively experimentally evaluated all our methods. We offer two implementations of our proposals. The first one is memory-based and offers the best efficiency while the second one is disk-based and scales seamlessly to very large datasets.
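
For readers unfamiliar with suffix blocking, the sketch below shows the basic technique only: records whose blocking keys share a sufficiently long suffix land in the same block, and only records within a block are compared. It does not implement the improved or incremental methods of the paper; the function names, the `min_len` and `max_block_size` parameters, and the sample records are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def suffixes(key, min_len):
    """All suffixes of `key` with at least `min_len` characters
    (the whole key if it is no longer than `min_len`)."""
    key = key.lower()
    if len(key) <= min_len:
        return [key]
    return [key[i:] for i in range(len(key) - min_len + 1)]

def suffix_blocks(records, min_len=4, max_block_size=10):
    """Group record ids by shared suffix; drop singleton and oversized blocks."""
    blocks = defaultdict(set)
    for rec_id, key in records.items():
        for suffix in suffixes(key, min_len):
            blocks[suffix].add(rec_id)
    return {s: ids for s, ids in blocks.items() if 2 <= len(ids) <= max_block_size}

def candidate_pairs(blocks):
    """Record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records keyed by id, using a name as the blocking key.
records = {1: "Catherine", 2: "Katherine", 3: "John"}
blocks = suffix_blocks(records, min_len=6)
print(sorted(blocks))
print(candidate_pairs(blocks))
```

Only the pairs returned by `candidate_pairs` would be passed to a detailed similarity comparison, which is how blocking avoids comparing all record pairs.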

Journal Article•10.1016/J.DATAK.2018.09.003•

INSiGHT: A system to detect violent extremist radicalization trajectories in dynamic graphs

Benjamin W. K. Hung¹, Anura P. Jayasumana¹, Vidarshana W. Bandara²•Institutions (2)

Colorado State University¹, CA Technologies²

1 Nov 2018

TL;DR: The overall INSiGHT architecture is presented and is aimed at assisting law enforcement and intelligence agencies in monitoring and screening for those individuals whose behaviors indicate a significant risk for violence, and allow for the better prioritization of limited investigative resources.

Abstract: The number and lethality of violent extremist plots motivated by the Salafi-jihadist ideology have been growing for nearly the last decade in many parts of the world including both the U.S. and Western Europe. While detecting the radicalization of violent extremists is a key component in preventing future terrorist attacks, it remains a significant challenge to law enforcement due to the issues of both scale and dynamics. We propose the development of a radicalization trend detection system as a risk assessment assistance technology that relies on data mined from public data and government databases for individuals who exhibit risk indicators for extremist violence, and enables law enforcement to monitor those individuals at the scope and scale that is lawful, and accounts for the dynamic indicative behaviors of the individuals and their associates rigorously and automatically. We frame our approach to monitoring the radicalization pattern of behaviors as a unique dynamic graph pattern matching problem, and develop a technology called INSiGHT (Investigative Search for Graph-Trajectories) to help identify individuals or small groups with conforming subgraphs to a radicalization query pattern, and follow the match trajectories over time. This paper presents the overall INSiGHT architecture and is aimed at assisting law enforcement and intelligence agencies in monitoring and screening for those individuals whose behaviors indicate a significant risk for violence, and allow for the better prioritization of limited investigative resources. We demonstrated the performance of INSiGHT on a variety of datasets, including small synthetic radicalization-specific datasets and a real behavioral dataset of time-stamped radicalization indicators of recent U.S. violent extremists.

Journal Article•10.1016/J.DATAK.2018.06.001•

Automatic query reformulations for feature location in a model-based family of software products

Francisca Pérez, Jaime Font¹, Lorena Arcega¹, Carlos Cetina•Institutions (1)

University of Oslo¹

1 Jul 2018

TL;DR: The results show that reformulated queries do not improve the performance in models, which could lead towards a new direction in the creation or reconsideration of these techniques to be applied in models.

Abstract: No maintenance activity can be completed without Feature Location (FL), which is finding the set of software artifacts that realize a particular functionality. Despite the importance of FL, the vast majority of work has been focused on retrieving code, whereas other software artifacts such as the models have been neglected. Furthermore, locating a piece of information from a query in a large repository is a challenging task as it requires knowledge of the vocabulary used in the software artifacts. This can be alleviated by automatically reformulating the query (adding or removing terms). In this paper, we test four existing query reformulation techniques, which perform the best for FL in code but have never been used for FL in models. Specifically, we test these techniques in two industrial domains: a model-based family of firmwares for induction hobs, and a model-based family of PLC software to control trains. We compare the results provided by our FL approach using the query and the reformulated queries by means of statistical analysis. Our results show that reformulated queries do not improve the performance in models, which could lead towards a new direction in the creation or reconsideration of these techniques to be applied in models.

Journal Article•10.1016/J.DATAK.2018.06.003•

Assessing data analysis performance in research contexts: An experiment on accuracy, efficiency, productivity and researchers' satisfaction

Patricia Martin-Rodilla¹, Jose Ignacio Panach², Cesar Gonzalez-Perez¹, Oscar Pastor³•Institutions (3)

Spanish National Research Council¹, University of Valencia², Polytechnic University of Valencia³

1 Jul 2018

TL;DR: Some clear benefits of the cognitive inclusion in the software designed for research contexts data analysis are found, with statistically significant differences in terms of accuracy, productivity and researcher's satisfaction in support of this explicit inclusion, although some efficiency weaknesses are detected.

Abstract: This paper has the support of Generalitat Valenciana through project IDEO (PROMETEOII/2014/039) and Spanish Ministry of Science and Innovation through project DataME (ref: TIN2016-80811-P).

Journal Article•10.1016/J.DATAK.2018.01.003•

A branch and bound strategy for Fast Trajectory Similarity Measuring

Andre Salvaro Furtado¹,², Laércio Lima Pilla¹, Vania Bogorny¹•Institutions (2)

Universidade Federal de Santa Catarina¹, Santa Catarina Federal Institute of Education, Science and Technology²

1 May 2018

TL;DR: This article presents a new strategy which takes into account the distance properties in Euclidean spaces to reduce the number of pair-wise point comparisons required to determine all the matching points of two trajectories.

Abstract: The increasing use of GPS-enabled devices has allowed the collection of huge volumes of movement data in the form of trajectories. An important research problem in trajectory data analysis is similarity measurement. For most applications, a trajectory-to-trajectory comparison is needed, and therefore the scalability of trajectory similarity measures directly impacts the viability of these techniques. Most similarity measures adopt a dynamic programming implementation, which has a quadratic time complexity in all cases, computing the pair-wise distance for all trajectory points, thus limiting the scalability of these measures. In this article we present a new strategy which takes into account the distance properties in Euclidean spaces to reduce the number of pair-wise point comparisons required to determine all the matching points of two trajectories. An extensive experimental evaluation over real GPS trajectory datasets demonstrates a pruning power of over 85% in the number of distance computations required to determine the matchings, and a significant execution time speed-up of up to one order of magnitude over the dynamic programming approach.
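
To make the quadratic dynamic programming baseline concrete, the sketch below computes the longest common subsequence (LCSS) of two 2D trajectories, one well-known similarity measure of this kind. It is only an illustration of the baseline the abstract refers to, not the branch and bound strategy of the paper; the choice of LCSS, the function name `lcss` and the matching threshold `eps` are assumptions.

```python
from math import hypot


def lcss(a, b, eps):
    """Length of the longest common subsequence of trajectories a and b.

    Two points match when their Euclidean distance is at most `eps`.
    Every cell of the table needs one distance computation, so the cost
    is len(a) * len(b) distance computations in all cases.
    """
    m = len(b)
    prev = [0] * (m + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            dist = hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            if dist <= eps:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[m]


# Illustrative trajectories: the first and last points match within eps.
t1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
t2 = [(0.0, 0.1), (5.0, 5.0), (2.0, 2.1)]
matches = lcss(t1, t2, eps=0.5)
```

A pruning strategy aims to skip most of the distance computations in the inner loop while still finding the same matching points.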

Journal Article•10.1016/J.DATAK.2018.07.003•

ICGT: A novel incremental clustering approach based on GMM tree

Yuchai Wan¹,², Xiabi Liu¹, Yi Wu¹, Lunhao Guo¹, Qiming Chen¹, Wang Murong¹•Institutions (2)

Beijing Institute of Technology¹, Beijing Technology and Business University²

1 Sep 2018

TL;DR: This paper proposes a novel incremental clustering approach utilizing Gaussian Mixture Model (GMM), termed ICGT (Incremental Construction of GMM Tree), which creates and dynamically adjusts a GMM tree consistent with the sequentially presented data.

Abstract: Streaming data presents new challenges to data mining algorithms. To conduct data clustering on streaming data, this paper proposes a novel incremental clustering approach utilizing Gaussian Mixture Model (GMM), termed ICGT (Incremental Construction of GMM Tree). ICGT creates and dynamically adjusts a GMM tree consistent with the sequentially presented data. Each leaf node in the tree corresponds to a dense Gaussian distribution and each non-leaf node to a GMM. To update the GMM tree for insertion of the newly arrived data points, we introduce the definitions of node connectivity and connected subsets, and present the tree update algorithm. We further develop a clustering evaluation criterion and search strategy to determine the final partition of the data set based on the constructed GMM tree. We evaluated the proposed approach on synthetic and real-world data sets and compared ICGT with other incremental and static clustering methods. The experimental results confirm that our approach is effective and promising.

Journal Article•10.1016/J.DATAK.2018.01.002•

Sampling strategies for extracting information from large data sets

Alexandru Boicea¹, Ciprian-Octavian Truica¹, Florin Radulescu¹, Elena-Cristina Buşe¹•Institutions (1)

Politehnica University of Bucharest¹

1 May 2018

TL;DR: Comparisons between sampling algorithms are presented in order to determine which one performs better when taking into account set operations such as intersect, union and difference, and on execution times.

Abstract: Getting information from large volumes of data is very expensive in terms of resources like CPU and memory, as well as computation time. The analysis of a small data set extracted from the original set is preferred. From this small set, called a sample, approximate results can be obtained. The errors are acceptable given the reduced cost necessary for processing the data. Using sampling algorithms with small errors saves execution time and resources. This paper presents comparisons between sampling algorithms in order to determine which one performs better when taking into account set operations such as intersect, union and difference. The comparison focuses on the errors introduced by each algorithm for different sample sizes and on execution times.
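
The trade-off described above can be sketched with a small experiment: draw a uniform sample, evaluate a set operation on the sample only, scale the result, and measure the relative error against the exact answer. This is a hedged illustration under stated assumptions, not the experimental setup of the paper: simple random sampling stands in for the compared algorithms, and the data, the helper names `uniform_sample` and `relative_error`, and the 10% sampling fraction are invented for the example.

```python
import random


def uniform_sample(ids, fraction, seed=0):
    """Draw a simple random sample containing `fraction` of the given ids."""
    rng = random.Random(seed)
    k = max(1, round(len(ids) * fraction))
    return set(rng.sample(sorted(ids), k))


def relative_error(exact, estimate):
    """Relative error of an estimate; defined as 0 when the exact value is 0."""
    return abs(exact - estimate) / exact if exact else 0.0


# Hypothetical data: A holds the even ids and B the multiples of three.
universe = set(range(10_000))
a = {x for x in universe if x % 2 == 0}
b = {x for x in universe if x % 3 == 0}

fraction = 0.1
s = uniform_sample(universe, fraction, seed=42)

# Intersect on the sample only, then scale by the sampling fraction.
exact = len(a & b)
estimate = len((a & s) & (b & s)) / fraction
err = relative_error(exact, estimate)
```

Repeating the experiment for several sample sizes and for union and difference gives the kind of error and execution-time comparison the abstract describes.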

Journal Article•10.1016/J.DATAK.2018.05.001•

Less is more: A rule-based syntactic simplification module for improved text-to-pictograph translation

Leen Sevens¹, Vincent Vandeghinste¹, Ineke Schuurman¹, Frank Van Eynde¹•Institutions (1)

Katholieke Universiteit Leuven¹

1 Sep 2018

TL;DR: This work developed a rule-based simplification system for Dutch Text-to-Pictograph translation by using recursion and applying the simplification operations in a logical way, so that only one syntactic parse is needed per message.

Abstract: In order to enable or facilitate online communication for people with an intellectual disability, the Text-to-Pictograph translation system automatically translates Dutch written text into a series of Sclera or Beta pictographs. The baseline system presents the reader with a more or less verbatim pictograph-per-word translation. As a result, long and complex input sentences lead to long and complex pictograph translations, leaving the end users confused and distracted. To overcome these problems, we developed a rule-based simplification system for Dutch Text-to-Pictograph translation. By using recursion and applying the simplification operations in a logical way, only one syntactic parse is needed per message. Promising results are obtained.

Journal Article•10.1016/J.DATAK.2018.07.006•

Knowledge-rich image gist understanding beyond literal meaning

Lydia Weiland¹, Ioana Hulpus¹, Simone Paolo Ponzetto¹, Wolfgang Effelsberg¹, Laura Dietz²•Institutions (2)

University of Mannheim¹, University of New Hampshire²

1 Sep 2018

TL;DR: In this paper, the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or news articles is investigated, and a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge is proposed.

Abstract: We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or news articles. To this end, we propose a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that have previously been shown to be highly effective for text understanding. Our method identifies the connotation of objects beyond their denotation: where most approaches to image understanding focus on the denotation of objects, i.e., their literal meaning, our work addresses the identification of connotations, i.e., iconic meanings of objects, to understand the message of images. We view image understanding as the task of representing an image-caption pair on the basis of a wide-coverage vocabulary of concepts such as the one provided by Wikipedia, and cast gist detection as a concept-ranking problem with image-caption pairs as queries. Our proposed algorithm brings together aspects of entity linking and clustering, subgraph selection, semantic relatedness, and learning-to-rank in a novel way. In addition to this novel task and a complete evaluation of our approach, we introduce a novel dataset to foster further research on this problem. To enable a thorough investigation of the problem of gist understanding, we produce a gold standard of over 300 image-caption pairs and over 8000 gist annotations covering a wide variety of topics at different levels of abstraction. We use this dataset to experimentally benchmark the contribution of different kinds of signals from heterogeneous sources, namely image and text. The best result, with a Mean Average Precision (MAP) of 0.69, indicates that by combining both dimensions we are able to better understand the meaning of our image-caption pairs than when using language or vision information alone. Our supervised approach relies on the availability of human-annotated gold standard datasets.
Annotating images with possibly complex topic labels is arguably a very time-consuming task that must rely on expert human annotators. We accordingly investigate whether parts of this process could be automated using automatic image annotation and caption generation techniques. Our results indicate the general feasibility of an end-to-end approach to gist detection when replacing one of the two dimensions with automatically generated input, i.e., using automatically generated image tags or generated captions. However, we also show experimentally that state-of-the-art image and text understanding is better at understanding literal meanings of image-caption pairs, with non-literal pairs being instead generally more difficult to detect, thus paving the way for future work on understanding the message of images beyond their literal content.

Journal Article•10.1016/J.DATAK.2018.10.002•

An ontology-based approach to knowledge representation for Computer-Aided Control System Design

Carmen Benavides¹, Isaías García¹, Héctor Alaiz¹, Luis Quesada²•Institutions (2)

University of León¹, University College Cork²

1 Nov 2018

TL;DR: A study of the use of knowledge models represented in ontologies for building Computer Aided Control Systems Design (CACSD) tools with the root locus method, presenting the results and benefits found.

Abstract: Different approaches have been used in order to represent and build control engineering concepts for the computer. Software applications for these fields are becoming more and more demanding each day, and new representation schemas are continuously being developed. This paper describes a study of the use of knowledge models represented in ontologies for building Computer Aided Control Systems Design (CACSD) tools. The use of this approach allows the construction of formal conceptual structures that can be stated independently of any software application and be used in many different ones. In order to show the advantages of this approach, an ontology and an application have been built for the domain of design of lead/lag controllers with the root locus method, presenting the results and benefits found.

Journal Article•10.1016/J.DATAK.2018.05.007•

A Comprehensive Study: Sentence Compression with Linguistic Knowledge-enhanced Gated Neural Network

Yang Zhao¹, Xiaoyu Shen², Hajime Senuma¹, Akiko Aizawa³•Institutions (3)

University of Tokyo¹, Max Planck Society², National Institute of Informatics³

1 Sep 2018

TL;DR: A gating mechanism is introduced and a gated neural network that selectively exploits linguistic knowledge for deletion-based sentence compression is proposed that leads to better compression upon both automatic metrics and human evaluation, compared to previous competitive compression methods.

Abstract: Sentence compression aims to shorten a sentence into a compression while remaining grammatical and preserving the underlying meaning of the original sentence. Previous works have recognized that linguistic features such as parts-of-speech tags and dependency labels are helpful to compression generation. In this work, we introduce a gating mechanism and propose a gated neural network that selectively exploits linguistic knowledge for deletion-based sentence compression. An extensive experiment was conducted on four downstream datasets, showing that the proposed gated neural network method leads to better compression on both automatic metrics and human evaluation, compared to previous competitive compression methods. We also observed that the compressions generated by the proposed gated neural network share more grammatical relations with the ground-truth compressions than those of the baseline method, indicating that important grammatical relations, such as the subject or object of a sentence, are more likely to be kept in the compression by the proposed method. Furthermore, visualization analysis is conducted to explore the selective use of linguistic features, suggesting that the gate mechanism could condition the predicted compression on different linguistic features.

Journal Article•10.1016/J.DATAK.2018.04.005•

Cardinality constraints and functional dependencies over possibilistic data

Tania Roblot¹, Sebastian Link¹•Institutions (1)

University of Auckland¹

1 Sep 2018

TL;DR: This framework empowers users to model uncertainty in an intuitive way, without the requirement to put a precise value on it, and shows how to visualize any given set of cardinality constraints and functional dependencies in the form of an Armstrong sketch.

Abstract: Modern applications require advanced techniques and tools to process large volumes of uncertain data. For that purpose we study cardinality constraints and functional dependencies as a declarative mechanism to control the occurrences and interrelationships of uncertain data. Uncertainty is modeled qualitatively by assigning to each object a degree of possibility by which the object occurs in an uncertain instance. Cardinality constraints and functional dependencies are assigned a degree of certainty that stipulates on which objects they hold. Our framework empowers users to model uncertainty in an intuitive way, without the requirement to put a precise value on it. Our class of cardinality constraints and functional dependencies enjoys a natural possible world semantics, which is exploited to establish several tools to reason about them. We characterize the associated implication problem axiomatically and algorithmically in linear input time. Furthermore, we show how to visualize any given set of our cardinality constraints and functional dependencies in the form of an Armstrong sketch. Even though the problem of finding an Armstrong sketch is precisely exponential, our algorithm computes a sketch with conservative use of time and space. Data engineers may therefore compute Armstrong sketches that they can jointly inspect with domain experts in order to consolidate the set of cardinality constraints and functional dependencies meaningful for a given application domain.

Journal Article•10.1016/J.DATAK.2018.07.002•

BEstream: Batch Capturing with Elliptic Function for One-Pass Data Stream Clustering

Niwan Wattanakitrungroj¹, Saranya Maneeroj¹, Chidchanok Lursinsap¹•Institutions (1)

Chulalongkorn University¹

1 Sep 2018

TL;DR: A newly proposed set of algorithms captures the data in the form of streaming batches and identifies the clusters afterwards based on the structure of adaptive hyper-elliptic micro-cluster components, which is more suitable for capturing data than the other structures in the compared previous methods.

Abstract: Tremendous amounts of data are generated as streams with various distributions in applications across areas such as business, science, engineering, and medicine. This creates a new problem of space and time complexity: the incoming data can overflow the memory of the analysing machine, and the flow of data may contain scattered portions of data from different clusters. This situation leads to incorrect clustering results. The challenge of clustering streaming data is that the data are continuously growing, unstable, and non-existent from time to time. This paper proposes the concept of discard-after-cluster based on the structure of adaptive hyper-elliptic micro-cluster components. Instead of gradually including each datum into its true cluster, a newly proposed set of algorithms captures the data in the form of streaming batches and identifies the clusters afterwards. The number of micro-clusters can be increased or decreased according to the dynamic distribution of incoming data as well as the overlap conditions of micro-clusters. A set of new recursive functions for updating parameters, checking overlap conditions, removing micro-clusters, and merging micro-clusters after discarding previously clustered data is introduced. The proposed algorithm was tested on synthetic and real data sets. The elliptic micro-cluster structure is more suitable for capturing data than the other structures in the compared previous methods. In addition, our method, named BEstream, showed more efficient results than previous data stream clustering algorithms based on the Rand index and normalized mutual information measures.

Journal Article•10.1016/J.DATAK.2018.07.010•

Accurate and efficient profile matching in knowledge bases

Jorge Martinez-Gil, Alejandra Lorena Paoletti, Gábor Rácz¹, Attila Sali¹, Klaus-Dieter Schewe•Institutions (1)

Alfréd Rényi Institute of Mathematics¹

1 Sep 2018

TL;DR: This article presents a matching theory that uses filters in lattices to represent profiles, and matching values in the interval [0, 1], where the higher the matching value, the better the fit.

Abstract: A profile describes a set of properties, e.g. a set of skills a person may have, a set of skills required for a particular job, or a set of abilities a football player may have with respect to a particular team strategy. Profile matching aims to determine how well a given profile fits a requested profile and vice versa. The approach taken in this article is grounded in a matching theory that uses filters in lattices to represent profiles, and matching values in the interval [0,1]: the higher the matching value, the better the fit. Such lattices can be derived from knowledge bases to represent the knowledge about profiles. An interesting question is how human expertise concerning the matching can be exploited to obtain the most accurate matchings. It will be shown that if a set of filters together with matching values by some human expert is given, then under some mild plausibility assumptions a matching measure can be determined such that the computed matching values preserve the relevant rankings given by the expert. A second question concerns the efficient querying of databases of profile instances. For matching queries that result in a ranked list of profile instances matching a given one, it will be shown how corresponding top-k queries can be evaluated on grounds of pre-computed matching values. In addition, it will be shown how the matching queries can be exploited for gap queries that determine how profile instances need to be extended in order to improve in the rankings.
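
A minimal sketch can make the notions of filter, matching value and top-k query more tangible. Everything in it is an assumption for illustration rather than the article's definitions: a small concept hierarchy stands in for the lattice, a filter is approximated by the upward closure of a set of skills, and the matching value is taken to be the weighted share of the requested filter that the given filter covers.

```python
from heapq import nlargest


def up_closure(concepts, parents):
    """Upward closure of a set of concepts in a hierarchy (stand-in for a filter)."""
    closed, stack = set(), list(concepts)
    while stack:
        c = stack.pop()
        if c not in closed:
            closed.add(c)
            stack.extend(parents.get(c, ()))
    return closed


def matching_value(given, requested, weights):
    """Weighted share of the requested filter covered by the given one, in [0, 1]."""
    total = sum(weights[c] for c in requested)
    return sum(weights[c] for c in given & requested) / total if total else 1.0


# Hypothetical skill hierarchy: each concept maps to its more general concepts.
parents = {
    "java": ["programming"], "python": ["programming"],
    "programming": ["skill"], "sql": ["databases"], "databases": ["skill"],
}
weights = {c: 1.0 for c in ["java", "python", "programming", "sql", "databases", "skill"]}

requested = up_closure({"java", "sql"}, parents)
candidates = {
    "alice": up_closure({"python", "sql"}, parents),
    "bob": up_closure({"java"}, parents),
}
scores = {name: matching_value(f, requested, weights) for name, f in candidates.items()}
top_k = nlargest(2, scores, key=scores.get)
```

Because the scores depend only on the stored filters and the requested one, they can be pre-computed, which is what makes ranked top-k retrieval cheap in this sketch.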

Journal Article•10.1016/J.DATAK.2018.05.010•

How to Repair Inconsistency in OWL 2 DL Ontology Versions

Leila Bayoudhi¹, Najla Sassi¹, Wassim Jaziri¹•Institutions (1)

Taibah University¹

1 Jul 2018

TL;DR: An a priori inconsistency approach is proposed to generate consistent OWL 2 DL ontology versions; it relies on OWL 2 DL change kits, which anticipate inconsistencies upon each change request on an ontology version.

Abstract: Semantic modeling knowledge formalisms, such as ontologies, have to follow the continuous evolution and changes of knowledge. However, changes to an ontology should never affect its consistency. An ontology needs to remain in a consistent state throughout its whole engineering process. In the literature, most approaches check or repair ontology inconsistencies in an a posteriori way. In this paper, an a priori inconsistency approach is proposed to generate consistent OWL 2 DL ontology versions. It relies on OWL 2 DL change kits, which anticipate inconsistencies upon each change request on an ontology version. The proposed approach predicts potential inconsistencies, provides an a priori repair action and applies the required changes. Consistency rules were defined and used to check not only logical inconsistencies but also syntactical invalidities and style issues. A Protégé plugin was implemented to validate our approach.
