TL;DR: This paper comparatively analyzes 11 proposed frameworks for entity matching and considers both frameworks that do and frameworks that do not use training data to semi-automatically find an entity matching strategy for a given match task.
Abstract: Entity matching is a crucial and difficult task for data integration. Entity matching frameworks provide several methods and their combination to effectively solve different match tasks. In this paper, we comparatively analyze 11 proposed frameworks for entity matching. Our study considers both frameworks which do or do not utilize training data to semi-automatically find an entity matching strategy to solve a given match task. Moreover, we consider support for blocking and the combination of different match algorithms. We further study how the different frameworks have been evaluated. The study aims at exploring the current state of the art in research prototypes of entity matching frameworks and their evaluations. The proposed criteria should be helpful to identify promising framework approaches and enable categorizing and comparatively assessing additional entity matching frameworks and their evaluations.
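To make the roles of blocking and matcher combination concrete, here is a minimal Python sketch. It is not taken from any of the 11 surveyed frameworks: the blocking key (first three letters of the name), the 0.7/0.3 weights, the 0.8 threshold, and the record fields are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def similarity(a, b):
    # Generic string similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(records, threshold=0.8):
    # Blocking: only records sharing a key are compared with each other.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Combination of two matchers as a weighted sum (assumed weights).
            score = (0.7 * similarity(a["name"], b["name"])
                     + 0.3 * similarity(a["city"], b["city"]))
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Leipzig"},
    {"id": 2, "name": "Jonathon Smith", "city": "Leipzig"},
    {"id": 3, "name": "Maria Garcia", "city": "Madrid"},
]
matches = match(records)
print(matches)
```

Frameworks that use training data would learn the weights and threshold from labeled pairs instead of fixing them by hand as done here.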
TL;DR: BeAware!, a framework for ontology-driven information systems that aims to increase an operator's situation awareness, introduces the concept of spatio-temporal primitive relations between observed real-world objects, thereby improving the reusability of the framework.
Abstract: Information overload is a severe problem for human operators of large-scale control systems as, for example, encountered in the domain of road traffic management. Operators of such systems are at risk of lacking situation awareness, because existing systems focus on the mere presentation of the available information on graphical user interfaces, thus endangering the timely and correct identification, resolution, and prevention of critical situations. In recent years, ontology-based approaches to situation awareness featuring a semantically richer knowledge model have emerged. However, current approaches are either highly domain-specific or have, in case they are domain-independent, shortcomings regarding their reusability. In this paper, we present our experience gained from the development of BeAware!, a framework for ontology-driven information systems aiming at increasing an operator's situation awareness. In contrast to existing domain-independent approaches, BeAware!'s ontology introduces the concept of spatio-temporal primitive relations between observed real-world objects, thereby improving the reusability of the framework. To show its applicability, a prototype of BeAware! has been implemented in the domain of road traffic management. An overview of this prototype and lessons learned for the development of ontology-driven information systems complete our contribution.
TL;DR: After introducing the collaboration process, this paper presents different ways to integrate background knowledge into it and discusses how such integration benefits the collaborative process.
Abstract: The aim of collaborative clustering is to make different clustering methods collaborate in order to reach an agreement on the partitioning of a common dataset. As different clustering methods can produce different partitionings of the same dataset, finding a consensual clustering from these results is often a hard task. The collaboration aims to make the methods agree on the partitioning through a refinement of their results. This process tends to make the results more similar. In this paper, after the introduction of the collaboration process, we present different ways to integrate background knowledge into it. Indeed, in recent years, the integration of background knowledge in clustering algorithms has been the subject of a lot of interest. This integration often leads to an improvement of the quality of the results. We discuss how such integration in the collaborative process is beneficial and we present experiments in which background knowledge is used to guide collaboration.
TL;DR: The most relevant step in the framework is Multidimensional Design by Examples (MDBE), a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements; it is a fully automatic approach that also handles and analyzes the end-user requirements automatically.
Abstract: It is widely accepted that the conceptual schema of a data warehouse must be structured according to the multidimensional model. Moreover, it has been suggested that the ideal scenario for deriving the multidimensional conceptual schema of the data warehouse would consist of a hybrid approach (i.e., a combination of data-driven and requirement-driven paradigms). Thus, the resulting multidimensional schema would satisfy the end-user requirements and would be conciliated with the data sources. Most current methods follow either a data-driven or requirement-driven paradigm and only a few use a hybrid approach. Furthermore, hybrid methods are unbalanced and do not benefit from all of the advantages brought by each paradigm. In this paper we present our approach for multidimensional design. The most relevant step in our framework is Multidimensional Design by Examples (MDBE), which is a novel method for deriving multidimensional conceptual schemas from relational sources according to end-user requirements. MDBE introduces several advantages over previous approaches, which can be summarized as three main contributions. (i) The MDBE method is a fully automatic approach that handles and analyzes the end-user requirements automatically. (ii) Unlike data-driven methods, we focus on data of interest to the end-user. However, the user may not be aware of all the potential analyses of the data sources and, in contrast to requirement-driven approaches, MDBE can propose new multidimensional knowledge related to concepts already queried by the user. (iii) Finally, MDBE proposes meaningful multidimensional schemas derived from a validation process. Therefore, the proposed schemas are sound and meaningful.
TL;DR: This paper uses the Web as a massive learning corpus to retrieve data and to infer information distribution using highly contextualized queries aimed at improving the quality of the result.
Abstract: Class descriptors such as attributes, features or meronyms are rarely considered when developing ontologies. Even WordNet only includes a reduced amount of part-of relationships. However, these data are crucial for defining concepts such as those considered in classical knowledge representation models. Some attempts have been made to extract those relations from text using general meronymy detection patterns; however, there has been very little work on learning expressive class attributes (including associated domain, range or data values) at an ontological level. In this paper we take this background into consideration when proposing and implementing an automatic, non-supervised and domain-independent methodology to extend ontological classes in terms of learning concept attributes, data-types, value ranges and measurement units. In order to present a general solution and minimize the data sparseness of pattern-based approaches, we use the Web as a massive learning corpus to retrieve data and to infer information distribution using highly contextualized queries aimed at improving the quality of the result. This corpus is also automatically updated in an adaptive manner according to the knowledge already acquired and the learning throughput. Results have been manually checked by means of an expert-based concept-per-concept evaluation for several well distinguished domains showing reliable results and a reasonable learning performance.
TL;DR: With the implementation of this architecture, the practicality of automating ontology generation through ontology reuse is shown; the encouraging results yield five lessons pertinent to future studies of automated ontology reuse.
Abstract: Realizing the Semantic Web involves creating ontologies, a tedious and costly challenge. Reuse can reduce the cost of ontology engineering. Semantic Web ontologies can provide useful input for ontology reuse. However, the automated reuse of such ontologies remains underexplored. This paper presents a generic architecture for automated ontology reuse. With our implementation of this architecture, we show the practicality of automating ontology generation through ontology reuse. We experimented with a large generic ontology as a basis for automatically generating domain ontologies that fit the scope of sample natural language web pages. The results were encouraging, resulting in five lessons pertinent to future automated ontology reuse study.
TL;DR: This work proposes an effective Fuzzy-based Multi-label Document Clustering (FMDC) approach that integrates fuzzy association rule mining with an existing ontology WordNet to alleviate problems of document clustering.
Abstract: With the rapid growth of text documents, document clustering has become one of the main techniques for organizing large numbers of documents into a small number of meaningful clusters. However, there still exist several challenges for document clustering, such as high dimensionality, scalability, accuracy, meaningful cluster labels, overlapping clusters, and extracting semantics from texts. In order to improve the quality of document clustering results, we propose an effective Fuzzy-based Multi-label Document Clustering (FMDC) approach that integrates fuzzy association rule mining with an existing ontology, WordNet, to alleviate these problems. In our approach, the key terms are extracted from the document set, and the initial representation of all documents is further enriched by using hypernyms of WordNet in order to exploit the semantic relations between terms. Then, a fuzzy association rule mining algorithm for texts is employed to discover a set of highly-related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, each document is dispatched into more than one target cluster by referring to these candidate clusters, and then the highly similar target clusters are merged. We conducted experiments to evaluate the performance based on Classic, Re0, R8, and WebKB datasets. The experimental results show that our approach outperforms the influential document clustering methods with higher accuracy. Therefore, our approach not only provides more general and meaningful labels for documents, but also effectively generates overlapping clusters.
TL;DR: The compression strategy proposed in ECM-DS lays the basis for a novel class of intelligent applications over data streams where the knowledge on actual streams is integrated with and correlated to the knowledge related to expired events that are considered critical for the target OLAP analysis scenario.
Abstract: An innovative event-based lossy compression model for effective and efficient OLAP over data streams, called ECM-DS, is presented and experimentally assessed in this paper. The main novelty of our compression approach with respect to traditional data stream compression techniques lies in exploiting the semantics of the reference application scenario in order to drive the compression process by means of the "degree of interestingness" of events occurring in the target stream. This ultimately improves the quality of retrieved approximate answers to OLAP queries over data streams, and, in turn, the quality of complex knowledge discovery tasks over data streams developed on top of ECM-DS, and implemented via ad-hoc data stream mining algorithms. Overall, the compression strategy we propose in this research lays the basis for a novel class of intelligent applications over data streams where the knowledge on actual streams is integrated with and correlated to the knowledge related to expired events that are considered critical for the target OLAP analysis scenario. Finally, a comprehensive experimental evaluation over several classes of data stream sets clearly confirms the benefits deriving from the event-based data stream compression approach proposed in ECM-DS.
TL;DR: This paper elaborates on the design of a relational RDF store, called RDFProv, which is optimized for scientific workflow provenance querying and management, and proposes three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema.
Abstract: Provenance metadata has become increasingly important to support scientific discovery reproducibility, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFProv, which is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema, and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, showed that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed the production quality and capability of the RDFProv system. 
Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.
TL;DR: This paper presents an ontology mapping software framework that has been designed and implemented to help users in designing and/or exploiting comprehensive mapping systems, based on a library of mapping modules implementing functions such as discovering mappings or evaluating mapping strategies.
Abstract: Ontology mapping, or matching, aims at identifying correspondences among entities in different ontologies. Several strands of research come up with algorithms often combining multiple mapping strategies to improve the mapping accuracy. However, few approaches have systematically investigated the requirements of a mapping system both from the functional (i.e., the features that are required) and user point of view (i.e., how the user can exploit these features). This paper presents an ontology mapping software framework that has been designed and implemented to help users (both expert and non-expert) in designing and/or exploiting comprehensive mapping systems. It is based on a library of mapping modules implementing functions such as discovering mappings or evaluating mapping strategies. In particular, the strategy predictor module of the designed framework, for each specific mapping task, can "predict" mapping modules to be exploited and parameter values (e.g., weights and thresholds). The implemented system, called UFOme, assists users during the various phases of a mapping task execution by providing a user friendly ontology mapping environment. The UFOme implementation and its prediction capabilities and accuracy were evaluated on the Ontology Alignment Evaluation Initiative tests with encouraging results.
TL;DR: This article presents C-Phrase, a natural language interface system that can be configured by normal, non-specialized, web-based technical teams and introduces the evaluation metric of willingness that complements the standard metrics of precision and recall.
Abstract: This article presents C-Phrase, a natural language interface system that can be configured by normal, non-specialized, web-based technical teams. C-Phrase models queries in an extended version of Codd's tuple calculus and uses synchronous context-free grammars with lambda-expressions to represent semantic grammars. Given an arbitrary relational database, authors rapidly build an NLI using what we term the name-tailor-define protocol. We present a small study demonstrating the effectiveness of this approach for the GEO corpus and we introduce the evaluation metric of willingness that complements the standard metrics of precision and recall. However, our true evaluation comes as we open-source C-Phrase.
TL;DR: SKYRANK is a framework for ranking the skyline points in the absence of a user-defined preference function, thereby discovering a limited subset of the most interesting points of the skyline set and is extended to handle top-k preference skyline queries, when the user's preferences are available.
Abstract: Skyline queries aim to help users make intelligent decisions over complex data by discovering a set of interesting points, when different and often conflicting criteria are considered. Unfortunately, as the dimensionality of the dataset grows, the skyline operator loses its discriminating power and returns a large fraction of the data. The huge size of the result set hinders decision-making and motivates the ranking of skyline points. Therefore, users prefer to retrieve the top-k skyline points instead of the whole skyline set. In this paper, we propose SKYRANK, a framework for ranking the skyline points in the absence of a user-defined preference function, thereby discovering a limited subset of the most interesting points of the skyline set. For this purpose, we define the skyline graph, which relies on the dominance relationships between the skyline points for different subsets of dimensions (subspaces). SKYRANK applies well-known authority-based ranking algorithms on the skyline graph and, as described in this paper, discovers the importance of a skyline point exploiting the subspace dominance relationships. Furthermore, we extend SKYRANK to handle top-k preference skyline queries, when the user's preferences are available. Our experimental evaluation illustrates the complexity of the dominance relationships and the ranking ability of our framework.
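The dominance relationships and subspace skylines mentioned above can be illustrated with a short Python sketch. It does not implement SKYRANK: the skyline graph and the authority-based ranking are not specified in this abstract, so the score below (the number of proper subspaces in which a skyline point is still a skyline point) is a crude stand-in, and "smaller is better" on every dimension is an assumption.

```python
from itertools import combinations


def dominates(p, q, dims):
    # p dominates q on dims: no worse on every dimension, strictly better on one.
    # Assumption: smaller values are preferred on every dimension.
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def skyline(points, dims):
    # Points that no other point dominates on the given dimensions.
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]


def subspace_scores(points, n_dims):
    # Simplified proxy, not SKYRANK: count the proper subspaces in which
    # each full-space skyline point also belongs to the subspace skyline.
    full = tuple(range(n_dims))
    scores = {p: 0 for p in skyline(points, full)}
    for k in range(1, n_dims):
        for dims in combinations(full, k):
            sub = set(skyline(points, dims))
            for p in scores:
                if p in sub:
                    scores[p] += 1
    return scores


points = [(1, 9), (3, 3), (9, 1), (5, 5)]
full_skyline = skyline(points, (0, 1))
scores = subspace_scores(points, 2)
print(full_skyline, scores)
```

Here (5, 5) drops out because (3, 3) dominates it, and the remaining three points are incomparable in the full space, which is the situation that motivates ranking the skyline in the first place.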
TL;DR: A method to integrate external knowledge sources such as DBpedia and OpenCyc into an ontology learning system that automatically suggests labels for unknown relations in domain ontologies based on large corpora of unstructured text is presented.
Abstract: This paper presents a method to integrate external knowledge sources such as DBpedia and OpenCyc into an ontology learning system that automatically suggests labels for unknown relations in domain ontologies based on large corpora of unstructured text. The method extracts and aggregates verb vectors from semantic relations identified in the corpus. It composes a knowledge base which consists of (i) verb centroids for known relations between domain concepts, (ii) mappings between concept pairs and the types of known relations, and (iii) ontological knowledge retrieved from external sources. Applying semantic inference and validation to this knowledge base improves the quality of suggested relation labels. A formal evaluation compares the accuracy and average ranking precision of this hybrid method with the performance of methods that solely rely on corpus data and those that are only based on reasoning and external data sources.
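As a rough illustration of component (i), the following Python sketch builds verb centroids for known relations and ranks candidate labels for an unknown relation by cosine similarity. The relation labels, verbs, and counts are invented for the example, and the semantic inference and validation against external sources described above are not modeled.

```python
from collections import Counter
from math import sqrt


def centroid(vectors):
    # Average the verb-count vectors observed for one known relation type.
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {verb: count / len(vectors) for verb, count in total.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(verb, 0.0) for verb, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def suggest_labels(unknown, centroids):
    # Rank known relation labels by similarity to the unknown relation's verbs.
    return sorted(centroids,
                  key=lambda label: cosine(unknown, centroids[label]),
                  reverse=True)


# Hypothetical verb vectors for two known relation types.
known = {
    "produces": [Counter(emit=3, generate=2), Counter(emit=1, produce=4)],
    "regulates": [Counter(limit=3, control=2)],
}
centroids = {label: centroid(vecs) for label, vecs in known.items()}

unknown_relation = Counter(emit=2, produce=1)
ranking = suggest_labels(unknown_relation, centroids)
print(ranking)
```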
TL;DR: This work presents a flexible and customizable template-based mechanism for the representation of a conceptual ETL design as a narrative, which is the most natural means of communication and does not require particular technical skills or familiarity with any specific model.
Abstract: Extract-Transform-Load (ETL) processes constitute the back stage of Data Warehouse architectures. Several studies characterize the ETL design as a time-consuming and error-prone procedure. A critical phase in the ETL lifecycle involves the early communications and design steps that aim at producing a conceptual ETL design. Various research approaches have dealt with the conceptual modeling of ETL processes, but all share two inconveniences: they require intensive human effort from the designers to create them, as well as technical knowledge from the business people to understand them. In this paper, we focus on the second aspect and provide a method for the representation of a conceptual ETL design as a narrative, which is the most natural means of communication and does not require particular technical skills or familiarity with any specific model. Specifically, this work builds upon previously proposed techniques that automate the conceptual design by leveraging Semantic Web technology. The key idea is to map the involved data stores, either source or target, to a domain ontology and then, to use a reasoner for producing the ETL design. We discuss how linguistic techniques can be used for the establishment of a common application vocabulary. We present a flexible and customizable template-based mechanism for the representation of the ETL design as a narrative. Finally, we discuss issues related to the production of meaningful reports and we provide implementation details.
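A template-based rendering of an ETL design as a narrative might look like the Python sketch below. The operation names, template wording, and data store names are hypothetical; the ontology mapping and the reasoner that would produce the design steps are outside its scope.

```python
from string import Template

# Hypothetical templates, one per conceptual ETL operation.
TEMPLATES = {
    "extract": Template("Extract $attrs from $source."),
    "filter": Template("Keep only the records of $source where $condition."),
    "load": Template("Load the result into $target."),
}


def narrate(steps):
    # Render each design step with its template and join the sentences.
    sentences = []
    for step in steps:
        template = TEMPLATES[step["op"]]
        sentences.append(template.substitute(step["args"]))
    return " ".join(sentences)


# A made-up conceptual design of the kind a reasoner could output.
design = [
    {"op": "extract", "args": {"attrs": "name and price", "source": "Products"}},
    {"op": "filter", "args": {"source": "Products", "condition": "price is not null"}},
    {"op": "load", "args": {"target": "DW_Products"}},
]
narrative = narrate(design)
print(narrative)
```

Keeping the templates in a plain dictionary is what makes such a mechanism customizable: the wording can be changed or translated without touching the rendering code.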
TL;DR: An agile information modeling technique, called Anchor Modeling, is proposed that offers non-destructive extensibility mechanisms, thereby enabling robust and flexible management of changes in a data warehouse environment.
Abstract: Maintaining and evolving data warehouses is a complex, error prone, and time consuming activity. The main reason for this state of affairs is that the environment of a data warehouse is in constant change, while the warehouse itself needs to provide a stable and consistent interface to information spanning extended periods of time. In this article, we propose an agile information modeling technique, called Anchor Modeling, that offers non-destructive extensibility mechanisms, thereby enabling robust and flexible management of changes. A key benefit of Anchor Modeling is that changes in a data warehouse environment only require extensions, not modifications, to the data warehouse. Such changes, therefore, do not require immediate modifications of existing applications, since all previous versions of the database schema are available as subsets of the current schema. Anchor Modeling decouples the evolution and application of a database, which when building a data warehouse enables shrinking of the initial project scope. While data models were previously made to capture every facet of a domain in a single phase of development, in Anchor Modeling fragments can be iteratively modeled and applied. We provide a formal and technology independent definition of anchor models and show how anchor models can be realized as relational databases together with examples of schema evolution. We also investigate performance through a number of lab experiments, which indicate that under certain conditions anchor databases perform substantially better than databases constructed using traditional modeling techniques.
TL;DR: This paper considers the problem of web page usage prediction in a web site by modeling users' navigation history and web page content with weighted suffix trees and finds that its prediction quality is fairly good and in many cases superior.
Abstract: In this paper we consider the problem of web page usage prediction in a web site by modeling users' navigation history and web page content with weighted suffix trees. This prediction of a user's navigation can be exploited either in an on-line recommendation system in a web site or in a web page cache system. The proposed method has the advantage that it demands a constant amount of computational effort per user action and consumes a relatively small amount of extra memory space. These features make the method ideal for an on-line working environment. Finally, we have evaluated the proposed scheme with experiments on various web site log files and web pages and have found that its prediction quality is fairly good and in many cases superior.
TL;DR: ResearchCyc, a version of Cyc that attempts to capture common sense knowledge of the real world, is analyzed and the insights acquired are used to generate suggestions for improving the usability of upper level ontologies.
Abstract: Repositories of knowledge about the real world are intended to serve as surrogates for the meaning and context of terms and concepts. These are being developed at two levels: (1) individual domain ontologies that capture concepts about a particular application domain; and (2) upper level ontologies that contain massive amounts of knowledge about the real world and are domain independent. This paper analyzes ResearchCyc, a version of Cyc, that attempts to capture common sense knowledge of the real world. Experience in applying ResearchCyc to web query processing is reported and the insights acquired are used to generate suggestions for improving the usability of upper level ontologies.
TL;DR: This paper proposes a set of criteria for evaluating ontology extraction tools and concludes that such tools still lack the ability to fully automate the extraction process and thus require functional performance improvement.
Abstract: Ontologies are a key component of the Semantic Web; thus, they are widely used in various applications. However, most ontologies are still built manually, a time-consuming activity which requires many resources. Several tools such as ontology editing tools, ontology merging tools, and ontology extraction tools have therefore been proposed to speed up ontology development. To minimize building time, one promising solution is the automation of the ontology development process. Consequently, the need for an automatic ontology extraction tool has increased in the last two decades and many tools have been developed for this purpose. However, there is still no comprehensive framework for evaluating such tools. In this paper, we proposed a set of criteria for evaluating ontology extraction tools and carried out an evaluation experiment on four ontology extraction tools (i.e., OntoLT, Text2Onto, OntoBuilder, and DODDLE-OWL) using our proposed evaluation framework. Based on the results of our experiment, we concluded that ontology extraction tools still lack the ability to automate the extraction process fully and thus require functional performance improvement.
TL;DR: This paper builds on the assumption that very large ontologies can be efficiently handled using database management systems (DBMS) and proposes to implement reasoning into the DBMS via a set of PL/SQL stored procedures, designed to speed up ontology querying.
Abstract: A major obstacle to the development of ontologies in support of the Semantic Web is the poor capability of current ontology techniques to handle very large ontologies, in particular regarding scalability of reasoners. This paper builds on the assumption that very large ontologies can be efficiently handled using database management systems (DBMS), designed to provide the best performance in storing, updating, and managing large volumes of data. To enhance DBMS with the reasoning functionality that characterizes ontology management, we propose to implement reasoning into the DBMS via a set of PL/SQL stored procedures. These procedures support all usual reasoning tasks: class subsumption, property subsumption, class satisfiability, ABox consistency, and ABox realization. They perform these tasks at update time and materialize all inferred knowledge (facts and axioms) in the database. In contrast to the query-time inferencing used in most existing work, our approach is designed to speed up ontology querying, which is assumed to be the most frequent and therefore most critical usage of ontologies. The paper discusses querying patterns and reports on benchmarking (with the LUBM benchmark) the performance of our prototype, called OntoMinD, compared to Oracle with Semantic Technologies. Benchmark results demonstrate the appropriateness of our approach.
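The trade-off between update-time materialization and query-time inferencing can be sketched for the single task of class subsumption. This is a Python stand-in, not the PL/SQL procedures of the prototype; the class name `MaterializedHierarchy` and its methods are hypothetical, and the class names in the example merely follow the style of a university ontology.

```python
class MaterializedHierarchy:
    """Keeps the transitive closure of subclass axioms up to date on insert."""

    def __init__(self):
        # All (subclass, superclass) pairs, asserted and inferred.
        self.closure = set()

    def add_subclass(self, sub, sup):
        # Update time: everything at or below `sub` becomes a subclass of
        # everything at or above `sup`; the inferred pairs are stored.
        below = {sub} | {a for (a, b) in self.closure if b == sub}
        above = {sup} | {b for (a, b) in self.closure if a == sup}
        self.closure |= {(a, b) for a in below for b in above}

    def is_subclass(self, sub, sup):
        # Query time: a plain lookup, no reasoning needed.
        return (sub, sup) in self.closure


hierarchy = MaterializedHierarchy()
hierarchy.add_subclass("GraduateStudent", "Student")
hierarchy.add_subclass("Student", "Person")
hierarchy.add_subclass("Person", "Agent")

inferred = hierarchy.is_subclass("GraduateStudent", "Agent")
print(inferred)
```

The cost moves to `add_subclass`, which is acceptable under the stated assumption that queries are far more frequent than updates.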
TL;DR: It was pointed out that the data weighted clustering approach has its unique advantages when mining the outliers of the large scale data sets, when clustering the data set for better clustering results, and especially when these two tasks are done simultaneously.
Abstract: This paper proposes a new kind of data weighted fuzzy c-means clustering approach. Different from most existing fuzzy clustering approaches, the data weighted clustering approach considers the internal connectivity of all data points. An exponent impact factors vector and an influence exponent are introduced to the new model. Together they influence the clustering process. The data weighted clustering can simultaneously produce three categories of parameters: fuzzy membership degrees, exponent impact factors and the cluster prototypes. A new fuzzy algorithm, DWG-K, is developed by combining the data weighted approach and the G-K. Two groups of numerical experiments were executed. Group 1 demonstrates the clustering performance of the DWG-K. The counterpart is the G-K. The results show the DWG-K can obtain better clustering quality and meanwhile it holds the same level of computational efficiency as the G-K holds. Group 2 checks the ability of the DWG-K in mining the outliers. The counterpart is the well-known LOF. The results show the DWG-K has considerable advantage over the LOF in computational efficiency. And the outliers mined by the DWG-K are global. It was pointed out that the data weighted clustering approach has its unique advantages when mining the outliers of the large scale data sets, when clustering the data set for better clustering results, and especially when these two tasks are done simultaneously.
TL;DR: This work presents CSTL, a language for writing automated tests of executable schemas written in UML/OCL, and describes a prototype implementation of a test processor that includes a test manager and a test interpreter that coordinates the execution of the tests.
Abstract: Conceptual schemas of information systems can be tested. The testing of conceptual schemas may be an important and practical means for their validation. We present a list of five kinds of tests that can be applied to conceptual schemas. Two of them require schemas comprising both the structural and the behavioral parts, but we show that it is possible and useful to test incomplete schema fragments, even if they consist of only a few entity and relationship types, integrity constraints and derivation rules. We present CSTL, a language for writing automated tests of executable schemas written in UML/OCL. CSTL includes language primitives for each of the above kinds of tests. CSTL follows the style of the modern xUnit testing frameworks. We describe a prototype implementation of a test processor, which includes a test manager and a test interpreter that coordinates the execution of the tests. Tests written in CSTL can be executed as many times as needed.
TL;DR: An empirical study about the nature of methods, diagrams, and home-grown conceptual models as reflected in real practice at IBM, identifying the models as artifacts of "enterprise conceptual modeling".
Abstract: Business analysts, business architects, and solution consultants use a variety of practices and methods in their quest to understand business. The resulting work products often end up being transitioned into the formal world of software requirement definitions or as recommendations for all kinds of business activities. We describe an empirical study about the nature of these methods, diagrams, and home-grown conceptual models as reflected in real practice at IBM. We identify the models as artifacts of "enterprise conceptual modeling". We study important features of these models, suggest practical classifications and characterizations, and distinguish them from drawings. Specifically, we look into context, type, methods and complexity to determine how enterprise conceptual models are used. Our survey shows that the "enterprise conceptual modeling" arena presents a variety of descriptive models, each used by a relatively small group of colleagues. Together they form a spectrum that extends from "drawings" on one end to "standards" on the other.
TL;DR: Running SDM on small repositories of project management applications and scheduling systems, it is found that the approach may provide reasonable draft domain models, whose comprehensibility, correctness, completeness, and consistency levels are satisfactory.
Abstract: A domain model, which captures the common knowledge and the possible variability allowed among applications in a domain, may assist in the creation of other valid applications in that domain. However, to create such domain models is not a trivial task: it requires expertise in the domain, reaching a very high level of abstraction, and providing flexible, yet formal, artifacts. In this paper an approach, called Semi-automated Domain Modeling (SDM), to create draft domain models from applications in those domains, is presented. SDM takes a repository of application models in a domain and matches, merges, and generalizes them into sound draft domain models that include the commonality and variability allowed in these domains. The similarity of the different elements is measured, with consideration of syntactic, semantic, and structural aspects. Unlike ontology and schema integration, these models capture both structural and behavioral aspects of the domain. Running SDM on small repositories of project management applications and scheduling systems, we found that the approach may provide reasonable draft domain models, whose comprehensibility, correctness, completeness, and consistency levels are satisfactory.
TL;DR: The method semi-automatically expands abbreviations/acronyms and annotates compound nouns, with minimal manual effort and empirically proves that the normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching results.
Abstract: Schema matching is the problem of finding relationships among concepts across data sources that are heterogeneous in format and in structure. Starting from the "hidden meaning" associated with schema labels (i.e. class/attribute names) it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) helps in associating a "meaning" to schema labels. However, the performance of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns, abbreviations, and acronyms. We address this problem by proposing a method to perform schema label normalization which increases the number of comparable labels. The method semi-automatically expands abbreviations/acronyms and annotates compound nouns, with minimal manual effort. We empirically prove that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching results.
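As a rough illustration of what schema label normalization can look like, the following minimal sketch splits compound labels into tokens and expands abbreviations from a lookup table. It is not the authors' method: the `ABBREVIATIONS` table and the `normalize_label` function are hypothetical, and the semi-automatic annotation against a lexical resource described in the abstract is not modelled.

```python
import re
from typing import List

# Hypothetical abbreviation table; a real system would draw on a
# lexical resource and user feedback rather than a fixed dictionary.
ABBREVIATIONS = {"qty": "quantity", "cust": "customer", "addr": "address"}


def normalize_label(label: str) -> List[str]:
    """Split a schema label into lowercase tokens and expand known abbreviations."""
    # Insert a space at camelCase boundaries, e.g. "custAddr" -> "cust Addr".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Split on whitespace, underscores, and hyphens.
    tokens = re.split(r"[\s_\-]+", spaced)
    return [ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens if t]
```

After normalization, labels such as `custAddr` and `customer_address` yield the same token list, which is the sense in which normalization increases the number of comparable labels.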
TL;DR: A taxonomy of semantic conflicts is provided, the main features of each of them are analyzed and an OWL/SWRL modelling for certain realistic scenarios related to information systems is provided.
Abstract: Nowadays, managers of information systems use ontologies and rules as a powerful tool to express the desired behaviour for the system. However, the use of rules may lead to conflicting situations where the antecedents of two or more rules are fulfilled, but their consequents indicate contradictory facts or actions. These conflicts can be categorised in two different groups, modality and semantic conflicts, depending on whether the inconsistency is owing to the rule language expressiveness or due to the nature of the actions. While there exist certain proposals to detect and solve modality conflicts, the problem becomes more complex with semantic ones. Additionally, current techniques to detect semantic conflicts usually do not consider the use of standard information models. This paper provides a taxonomy of semantic conflicts, analyses the main features of each of them and provides an OWL/SWRL modelling for certain realistic scenarios related to information systems. It also describes different conflict detection techniques that can be applied to semantic conflicts and their pros and cons. Finally, this paper provides a comparison of these techniques based on performance measurements taken in a realistic scenario and suggests a better approach. This approach is then used in other scenarios related to information systems in which different types of semantic conflicts may appear.
TL;DR: This paper introduces the OO-Method COSMIC Function Points (OOmCFP) procedure, which has been systematically designed to measure the functional size of object-oriented applications generated from their conceptual models by means of model transformations.
Abstract: The accurate measurement of the functional size of applications that are automatically generated in MDA environments is a challenge for the software development industry. This paper introduces the OO-Method COSMIC Function Points (OOmCFP) procedure, which has been systematically designed to measure the functional size of object-oriented applications generated from their conceptual models by means of model transformations. The OOmCFP procedure is structured in three phases: a strategy phase, a mapping phase, and a measurement phase. Finally, a case study is presented to illustrate the use of OOmCFP, as well as an analysis of the results obtained.
TL;DR: The refereed proceedings of the 19th International Conference on Applications of Natural Language to Information Systems, NLDB 2014, held in Montpellier, France, in June 2014 are presented in this paper.
Abstract: This book constitutes the refereed proceedings of the 19th International Conference on Applications of Natural Language to Information Systems, NLDB 2014, held in Montpellier, France, in June 2014. The 13 long papers, 8 short papers, 14 poster papers, and 7 demo papers presented together with 2 invited talks in this volume were carefully reviewed and selected from 73 submissions. The papers cover the following topics: syntactic, lexical and semantic analysis; information extraction; information retrieval; and sentiment analysis and social networks.
TL;DR: The concept of an ontological profile is described, which is a semantic extension of an ontology where each ontology concept is given a description in terms of a vector of weighted keywords.
Abstract: An ontology is a formal conceptualization of a domain, specifying the concepts of the domain and the relations between them. It is, however, not a straightforward task to use this knowledge for information retrieval purposes. In this paper we describe the concept of an ontological profile, which is a semantic extension of an ontology where each ontology concept is given a description in terms of a vector of weighted keywords. An experiment has been conducted with a prototype search engine using ontological profiles for query expansion. The evaluation shows encouraging results compared to standard keyword-based search. Furthermore, we describe the notion of context in an information retrieval setting and address how we can combine semantics and context in search based on query expansion.
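To make the idea of a vector of weighted keywords per concept concrete, here is a minimal sketch of query expansion driven by such a profile. The `PROFILE` data, the weights, and the `expand_query` function are illustrative assumptions and are not taken from the paper; the sketch simply adds the highest-weighted keywords of every concept that a query term names.

```python
from typing import Dict, List, Tuple

# Hypothetical ontological profile: concept -> {keyword: weight}.
PROFILE: Dict[str, Dict[str, float]] = {
    "vehicle": {"car": 0.9, "truck": 0.7, "engine": 0.4},
    "road": {"highway": 0.8, "lane": 0.5},
}


def expand_query(
    terms: List[str],
    profile: Dict[str, Dict[str, float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the original terms (weight 1.0) plus the top_k keywords of each matched concept."""
    expanded: Dict[str, float] = {t.lower(): 1.0 for t in terms}
    for term in terms:
        keywords = profile.get(term.lower(), {})
        # Highest weight first; ties broken alphabetically for determinism.
        best = sorted(keywords.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for keyword, weight in best:
            expanded[keyword] = max(expanded.get(keyword, 0.0), weight)
    return sorted(expanded.items(), key=lambda kv: (-kv[1], kv[0]))
```

The weighted pairs returned here could be passed to a search engine as boosted query terms; how the prototype in the paper actually weights and combines them is not stated in the abstract.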
TL;DR: An efficient index buffer management scheme, called IBSF, is proposed, which eliminates redundant index units in the index buffer and then delays the time that the index buffer requires to become full, which significantly reduces the number of write operations to a flash memory when constructing a B-tree.
Abstract: Recently, NAND flash memory has become one of the best storage media for various embedded systems such as MP3 players, mobile phones and laptops because of its shock resistance, low power consumption, and non-volatility. However, since it has very distinct characteristics, including erase-before-write and asymmetric read/write speeds, the performance of disk-based systems and applications may degrade dramatically when they are adopted directly on flash memory storage systems. In particular, when a B-tree is constructed on NAND flash memory, record insertion, deletion, and reorganization may cause intensive overwrite operations. These may result in severe performance degradation when building the B-tree. In this paper, we propose an efficient index buffer management scheme, called IBSF, which eliminates redundant index units in the index buffer and then delays the time that the index buffer requires to become full. Consequently, IBSF significantly reduces the number of write operations to a flash memory when constructing a B-tree. We also show that IBSF yields a better performance on a flash memory by comparing it to the related technique through various experiments.
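The effect of removing redundant index units can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the IBSF algorithm itself: the class `DedupIndexBuffer` is hypothetical, two index units are treated as redundant when they refer to the same key, and each flush of the buffer is counted as a single flash write.

```python
from typing import Dict, Tuple


class DedupIndexBuffer:
    """Toy index buffer in which a newer unit for a key replaces the older one."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.units: Dict[int, Tuple[str, int]] = {}  # key -> (operation, value)
        self.flash_writes = 0  # simulated count of flushes to flash

    def put(self, key: int, op: str, value: int) -> None:
        # Only a unit for a new key can fill the buffer; a unit for a key
        # already present overwrites the old one and takes no extra slot.
        if key not in self.units and len(self.units) >= self.capacity:
            self.flush()
        self.units[key] = (op, value)

    def flush(self) -> None:
        # Assumption: the whole buffer is written out in one flash write.
        if self.units:
            self.flash_writes += 1
            self.units.clear()
```

Because repeated updates to the same key occupy one slot, the buffer fills more slowly and is flushed less often, which is the mechanism the abstract credits for the reduced number of write operations.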
TL;DR: A novel and simple method (query clauses) is proposed to represent expanded queries which may alleviate some of the negative effects of term correlation in query expansion algorithms and this method is applied to improve stemming.
Abstract: In this paper we deal with two issues. First, we discuss the negative effects of term correlation in query expansion algorithms, and we propose a novel and simple method (query clauses) to represent expanded queries which may alleviate some of these negative effects. Second, we discuss a method to optimize local query-expansion methods using genetic algorithms, and we apply this method to improve stemming. We evaluate this method with the novel query representation method and show very significant improvements for the problem of stemming optimization.