TL;DR: This paper introduces the concept of workflow mining, presents a common format for workflow logs, discusses the most challenging problems, and presents some of the workflow mining approaches available today.
Abstract: Many of today's information systems are driven by explicit process models. Workflow management systems, but also ERP, CRM, SCM, and B2B, are configured on the basis of a workflow model specifying the order in which tasks need to be executed. Creating a workflow design is a complicated, time-consuming process, and typically there are discrepancies between the actual workflow processes and the processes as perceived by the management. To support the design of workflows, we propose the use of workflow mining. The starting point for workflow mining is a so-called "workflow log" containing information about the workflow process as it is actually being executed. In this paper, we introduce the concept of workflow mining and present a common format for workflow logs. Then we discuss the most challenging problems and present some of the workflow mining approaches available today.
TL;DR: This work studies the problem of mining association rules and related time intervals by extending the well-known Apriori algorithm with effective pruning techniques and uses calendar schemas and their calendar-based patterns.
Abstract: We study the problem of mining association rules and related time intervals, where an association rule holds either in all or some of the intervals. To restrict to meaningful time intervals, we use calendar schemas and their calendar-based patterns. A calendar schema example is (year, month, day) and a calendar-based pattern within the schema is (*, 3, 15), which represents the set of time intervals each corresponding to the 15th day of a March. Our focus is finding efficient algorithms for this mining problem by extending the well-known Apriori algorithm with effective pruning techniques. We evaluate our techniques via experiments.
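The calendar-based pattern from the abstract above, (*, 3, 15) within the schema (year, month, day), can be illustrated with a minimal sketch. The helper names `matches` and `covered_days` are hypothetical and chosen only for this illustration; the sketch shows how a pattern selects time intervals and does not reproduce the paper's Apriori extension or its pruning techniques.

```python
from datetime import date, timedelta

WILDCARD = "*"  # stands for "every value" of a calendar field


def matches(pattern, day):
    """Return True if `day` belongs to the intervals described by `pattern`.

    `pattern` is a (year, month, day) triple; each field is an int or "*".
    """
    fields = (day.year, day.month, day.day)
    return all(p == WILDCARD or p == f for p, f in zip(pattern, fields))


def covered_days(pattern, start, end):
    """List every date in the closed range [start, end] matched by `pattern`."""
    n_days = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in candidates if matches(pattern, d)]


# The pattern (*, 3, 15): the 15th day of every March.
march_15 = (WILDCARD, 3, 15)
days = covered_days(march_15, date(2000, 1, 1), date(2002, 12, 31))
print(days)  # one date per year: 15 March 2000, 2001 and 2002
```

An association rule would then be checked separately on the transactions falling into each matched interval, which is where the "holds in all or some of the intervals" distinction comes in.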
TL;DR: In this article, the problem of the incremental mining of sequential patterns when new transactions or new customers are added to an original database is considered, and a new algorithm for mining frequent sequences that uses information collected during an earlier mining process to cut down the cost of finding new sequential patterns in the updated database is presented.
Abstract: In this paper, we consider the problem of the incremental mining of sequential patterns when new transactions or new customers are added to an original database. We present a new algorithm for mining frequent sequences that uses information collected during an earlier mining process to cut down the cost of finding new sequential patterns in the updated database. Our test shows that the algorithm performs significantly faster than the naive approach of mining the whole updated database from scratch. The difference is so pronounced that this algorithm could also be useful for mining sequential patterns, since in many cases it is faster to apply our algorithm than to mine sequential patterns using a standard algorithm, by breaking down the database into an original database plus an increment.
TL;DR: The DEMO (Demo Engineering Methodology for Organizations)-framework is presented, which builds on the LAP-based theoretical foundation of the DEMO methodology and is demonstrated using a small example.
Abstract: The increasing demands concerning the modifiability and connectivity of business processes cannot be met adequately anymore by relying on best practices only. There is an urgent need for a reference conceptual framework for studying, modeling, analyzing and designing business processes. The Language-Action Perspective (LAP), in particular Habermas' theory of Communicative Action offers a sound and rigid foundation for such a framework. In this paper, the DEMO (Demo Engineering Methodology for Organizations)-framework is presented. It builds on the LAP-based theoretical foundation of the DEMO methodology. Several other LAP-based frameworks have been proposed in the past years. They are evaluated in a comparative review with the DEMO-framework. Several shortcomings of these frameworks are revealed and discussed. The practical applicability of the DEMO-framework is demonstrated using a small example.
TL;DR: This paper shows that these drawbacks of hierarchical agglomerative clustering can be alleviated by closely studying the dendrogram, and proposes methods that exploit this characteristic to reduce the time and memory complexities significantly and to make validation very efficient and accurate.
Abstract: Clustering is the task of grouping similar objects into clusters. A prominent and useful class of algorithm is hierarchical agglomerative clustering (HAC) which iteratively agglomerates the closest pair until all data points belong to one cluster. It outputs a dendrogram showing all N levels of agglomerations where N is the number of objects in the dataset. However, HAC methods have several drawbacks: (1) high time and memory complexities for clustering, and (2) inefficient and inaccurate cluster validation. In this paper we show that these drawbacks can be alleviated by closely studying the dendrogram. Empirical study shows that most HAC algorithms follow a trend where, except for a number of top levels of the dendrogram, all lower levels agglomerate clusters which are very small in size and close in proximity to other clusters. Methods are proposed that exploit this characteristic to reduce the time and memory complexities significantly and to make validation very efficient and accurate. Analyses and experiments show the effectiveness of the proposed method.
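The agglomeration loop described in the abstract above can be sketched as follows. This is a naive single-linkage version over one-dimensional points, written only to show how the merge levels of a dendrogram arise; the linkage criterion, the function names, and the sample data are assumptions made for illustration, and the paper's complexity-reducing methods are not reproduced.

```python
from itertools import combinations


def single_link(a, b, points):
    """Distance between clusters a and b: the closest pair of members."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def hac(points):
    """Naive HAC: repeatedly merge the closest pair of clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    i.e. the levels of the dendrogram from bottom to top.
    """
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        merges.append((a, b, single_link(a, b, points)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return merges


points = [1.0, 1.5, 5.0, 5.25, 20.0]
merges = hac(points)
for a, b, dist in merges:
    print(a, b, dist)
```

On this toy input the early merges join small, nearby clusters and only the last levels join large, distant ones, which is the kind of trend the abstract reports for the lower dendrogram levels.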
TL;DR: The goal of this article is to make the communication norms underlying various LAP workflow loop models (DEMO, ActionWorkflow) explicit and to contrast them with the auditing norms of internal control.
Abstract: The language/action perspective (LAP) as originally introduced by Winograd and Flores has inspired several tools and information system design methodologies. The goal of this article is to make the communication norms underlying various LAP workflow loop models (DEMO, ActionWorkflow) explicit and to contrast them with the auditing norms of internal control. It appears that the communicative action paradigm embedded in DEMO and the customer satisfaction orientation of ActionWorkflow lead to norms which resemble the ones required by internal control, but there are some important differences. For that reason, we propose an extended workflow loop model that distinguishes between customer relations and agency relations. Whereas current LAP approaches do not take agency relations explicitly into account, the extended workflow loop model allows us to analyze the effects of delegation on communicative processes. A framework is offered for the normative analysis of workflows based on a number of formalized communication norms.
TL;DR: Three new solutions to the problem of visualizing temporal intervals and their relations for querying databases containing several histories are proposed and the expressivity of the visual vocabularies with respect to Allen's Interval Algebra is discussed.
Abstract: A crucial component for turning any temporal reasoning system into a real-world application that can be adopted by a wide base of users is given by its user interface. After analyzing and discussing the state of the art for the visualization of temporal intervals and relations, this paper proposes three new solutions to the problem of visualizing temporal intervals and their relations for querying databases containing several histories. The metaphors exploited in the proposed visual vocabularies are based on real-world, concrete objects, such as strips, springs, weights, and wires. We discuss the expressivity of the visual vocabularies with respect to the well-known Allen's Interval Algebra. A method for mapping queries composed by the visual vocabularies into SQL queries is then described and discussed. The proposed solutions were evaluated with two proper user studies: the first focused on determining which of the adopted metaphors are more frequently perceived and understood in a correct way and was based on a questionnaire; the second considered the two solutions which scored better in the first phase and studied them with a more thorough experiment, which was also based on user interfaces implementing the two proposals. The visual vocabulary which provided the best results has been adopted in a medical system for visual querying clinical temporal databases.
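Allen's Interval Algebra, against which the abstract above measures its visual vocabularies, distinguishes thirteen basic relations between two intervals. A minimal sketch of a classifier for them follows; the function name `allen_relation` and the (start, end) tuple representation are assumptions made for this illustration and do not come from the paper.

```python
def allen_relation(a, b):
    """Return the Allen relation that holds from interval a to interval b.

    Intervals are (start, end) pairs with start < end.
    """
    (a1, a2), (b1, b2) = a, b
    if a1 >= a2 or b1 >= b2:
        raise ValueError("intervals must satisfy start < end")
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining cases: the intervals properly overlap.
    return "overlaps" if a1 < b1 else "overlapped-by"


print(allen_relation((1, 5), (3, 8)))
print(allen_relation((3, 5), (1, 8)))
```

A visual query that places one strip partly over another would correspond to a constraint such as `allen_relation(x, y) == "overlaps"` on the underlying histories.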
TL;DR: This paper provides a conceptual and an implementation model of fuzzy spatial objects that also incorporates fuzzy geometric union, intersection, and difference operations as well as fuzzy topological predicates for an important kind of spatial vagueness called spatial fuzziness.
Abstract: Uncertainty management for geometric data is currently an important problem in spatial databases, image databases, and geographic information systems. Spatial entities do not always have homogeneous interiors and sharply defined boundaries; frequently their interiors and boundaries are partially or totally indeterminate and vague. For an important kind of spatial vagueness called spatial fuzziness, this paper provides a conceptual and an implementation model of fuzzy spatial objects that also incorporates fuzzy geometric union, intersection, and difference operations as well as fuzzy topological predicates. In particular, this model is based neither on Euclidean space nor on infinite-precision arithmetic, both of which lead to a lack of numerical robustness and to topological inconsistency in computer implementations; it rests on a finite, discrete geometric domain called a grid partition, which takes into account the finite-precision number systems available in computers. Last but not least, this paper contributes to a uniform treatment of vector and raster data.
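As a rough illustration of fuzzy geometric set operations on a discrete grid, the sketch below stores a fuzzy region as a 2-D list of membership degrees in [0, 1] and combines regions cell by cell. It assumes the standard min/max (Zadeh) operators; the paper's own definitions over grid partitions may differ, and all function names here are hypothetical.

```python
def _combine(a, b, op):
    """Apply a binary operator cell by cell to two equally sized grids."""
    return [[op(x, y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuzzy_union(a, b):
    # Assumed operator: membership in the union is the larger degree.
    return _combine(a, b, max)


def fuzzy_intersection(a, b):
    # Assumed operator: membership in the intersection is the smaller degree.
    return _combine(a, b, min)


def fuzzy_difference(a, b):
    # Assumed operator: intersect a with the complement (1 - degree) of b.
    return _combine(a, b, lambda x, y: min(x, 1 - y))


# Two small fuzzy regions over the same 2 x 2 grid.
region_a = [[0.0, 0.5], [1.0, 0.25]]
region_b = [[0.5, 0.5], [0.25, 0.75]]

print(fuzzy_union(region_a, region_b))
print(fuzzy_intersection(region_a, region_b))
print(fuzzy_difference(region_a, region_b))
```

Because every cell holds a finite-precision number and the grid itself is finite, the operations never leave the discrete domain, which is the robustness argument the abstract makes against Euclidean, infinite-precision models.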
TL;DR: This paper provides a declarative model-theoretic semantics for Xyleme tree queries, a way of checking tree query containment, and a characterization of tree queries as a composition of branch queries, and presents a method for semi-automatically generating the mapping relation.
Abstract: Xyleme is a huge warehouse integrating XML data of the Web. Xyleme considers a simple data model with data trees and tree types for describing the data sources, and a simple query language based on tree queries with boolean conditions. The main components of the data model are a mediated schema modeled by an abstract tree type, as a view of a set of tree types associated with actual data trees, called concrete tree types, and a mapping expressing the connection between the mediated schema and the concrete tree types. The first contribution of this paper is formal: we provide a declarative model-theoretic semantics for Xyleme tree queries, a way of checking tree query containment, and a characterization of tree queries as a composition of branch queries. The other contributions are algorithmic and handle the potentially huge size of the mapping relation which is a crucial issue for semantic integration and query evaluation in Xyleme. First, we propose a method for pre-evaluating queries at compile time by storing some specific meta-information about the mapping into map translation tables. These map translation tables summarize the set of all the branch queries that can be generated from the mediated schema and the set of all the mappings. Then, we propose different operators and strategies for relaxing queries which, having an empty map translation table, will have no answer if they are evaluated against the data. Finally, we present a method for semi-automatically generating the mapping relation.
TL;DR: The proposed model supports all the schema changes which are usually considered in the OODB literature, for which an operational semantics and a formal analysis of their correct behaviour is provided.
Abstract: In this paper we present a formal model for the support of temporal schema versions in object-oriented databases. Its definition is partially based on a generic (ODMG compatible) object model and partially introduces new concepts. The proposed model supports all the schema changes which are usually considered in the OODB literature, for which an operational semantics and a formal analysis of their correct behaviour is provided. Semantic issues arising from the introduction of temporal schema versioning in a conventional or temporal database (concerning the interaction between the intensional and extensional levels of versioning and the management of data in the presence of multiple schema versions) are also considered.
TL;DR: MOVIE is described, a complete, implemented and evaluated solution to the problem of incrementally maintaining materialized OQL views in ODMG-compliant object databases, and throws light into how the effectiveness of incremental maintenance is affected by issues such as database size, and the complexity and selectivity of views.
Abstract: View materialization is an important technique for high performance query processing, data integration and replication. Solutions to the problem of incrementally maintaining materialized views are very relevant. So far, most work on this problem has been confined to relational settings and solutions have not been comprehensively evaluated. This paper describes MOVIE, a complete, implemented and evaluated solution to the problem of incrementally maintaining materialized OQL views in ODMG-compliant object databases. The evaluation throws light into how the effectiveness of incremental maintenance is affected by issues such as database size, and the complexity and selectivity of views.
TL;DR: This work uses both maximum and minimum cardinality constraints in defining the properties and their structural validity criteria yielding a complete analysis of the structural validity of recursive, binary, and ternary relationship types.
Abstract: We explore the criteria that contribute to the structural validity of modeling structures within the entity-relationship (ER) diagram. Our approach examines cardinality constraints in conjunction with the degree of the relationship to address constraint consistency, state compliance, and role uniqueness issues to derive a complete and comprehensive set of decision rules. Unlike most other analyses, which use only maximum cardinality constraints, we have used both maximum and minimum cardinality constraints in defining the properties and their structural validity criteria, yielding a complete analysis of the structural validity of recursive, binary, and ternary relationship types. Our study evaluates these relationships as part of the overall diagram and our rules address these relationships as they coexist in a path structure within the model. The contribution of this paper is to provide a comprehensive set of decision rules to determine the structural validity of any ERD containing recursive, binary, and ternary relationships. These decision rules can be readily applied to real-world data models regardless of their complexity. The rules can easily be incorporated into the database modeling and design process, or extended into CASE tool implementations.
TL;DR: Algorithms for automatically constructing UML diagrams from XML DTDs are presented, enabling fast and easy graphical browsing of XML data sources on the web, and an integration architecture is presented.
Abstract: Extensible Markup Language (XML) is fast becoming the new standard for data representation and exchange on the World Wide Web, e.g., in B2B e-commerce. Modern enterprises need to combine data from many sources in order to answer important business questions, creating a need for integration of web-based XML data. Previous web-based data integration efforts have focused almost exclusively on the logical level of data models, creating a need for techniques that focus on the conceptual level in order to communicate the structure and properties of the available data to users at a higher level of abstraction. The most widely used conceptual model at the moment is the Unified Modeling Language (UML). This paper presents algorithms for automatically constructing UML diagrams from XML DTDs, enabling fast and easy graphical browsing of XML data sources on the web. The algorithms capture important semantic properties of the XML data such as precise cardinalities and aggregation (containment) relationships between the data elements. As a motivating application, it is shown how the generated diagrams can be used for the conceptual design of data warehouses based on web data, and an integration architecture is presented. The choice of data warehouses and On-Line Analytical Processing as the motivating application is another distinguishing feature of the presented approach.
TL;DR: This work presents an algorithmic methodology for refining two-level terminologic networks where concepts are classified into high-level semantic types, with these types constituting a portion of the concepts' semantics.
Abstract: Capturing the semantics of concepts in a terminology has been an important problem in AI. A two-level approach has been proposed where concepts are classified into high-level semantic types, with these types constituting a portion of the concepts' semantics. We present an algorithmic methodology for refining such two-level terminologic networks. A new network is produced consisting of "pure" semantic types and intersection types. Concepts are uniquely re-assigned to these new types. Overall, these types form a better conceptual abstraction, with each exhibiting uniform semantics. Using them, it becomes easier to detect classification errors. The methodology is applied to the UMLS.
TL;DR: The paper proposes a framework for business interaction based on a language/action perspective that is inspired by a similar framework constructed by Weigand et al.
Abstract: In order to perform business modelling as a part of information systems development, there is a need for frameworks and methods. The paper proposes a framework for business interaction based on a language/action perspective. The framework is an architecture of five generic layers. The first layer concept is 'business act', which functions as the basic unit of analysis. The following four layer concepts are 'action pair', 'exchange', 'business transaction', and 'transaction group'. The framework is inspired by a similar framework constructed by Weigand et al. The paper makes a critical examination of this framework as a basis for the proposed framework.
TL;DR: A new technique is presented based on a new method to measure the similarity of two documents, that represent the actual and the previous version of the monitored page, that has been effectively used to discover changes in selected portions of the original document.
Abstract: In this paper we present a new technique for detecting changes in Web documents. The technique is based on a new method to measure the similarity of two documents, that represent the actual and the previous version of the monitored page. The technique has been effectively used to discover changes in selected portions of the original document. The proposed technique has been implemented in the CMW system providing a change monitoring service on the Web. The main features of CMW are the detection of changes on selected portions of web documents and the possibility to express complex queries on the changed information. For instance, a query can require to check if the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction web pages proved the effectiveness of the proposed approach.
TL;DR: In this paper, the authors provide a semantic foundation for the vacuuming of transaction-time databases and provide options for user, application, and database interactions in response to queries and updates against vacuumed data.
Abstract: A wide range of real-world database applications, including financial and medical applications, are faced with accountability and traceability requirements. These requirements lead to the replacement of the usual update-in-place policy by an append-only policy that retains all previous states in the database. This policy results in so-called transaction-time databases which are ever-growing. A variety of physical storage structures and indexing techniques as well as query languages have been proposed for transaction-time databases, but the support for physical removal of data, termed vacuuming, has received only little attention. Such vacuuming is called for by, e.g., the laws of many countries and the policies of many businesses. Although necessary, with vacuuming, the database's perfect recollection of the past may be compromised via, e.g., selective removal of records pertaining to past states. This paper provides a semantic foundation for the vacuuming of transaction-time databases. The main focus is to establish a foundation for the correct processing of queries and updates against vacuumed databases. However, options for user, application, and database interactions in response to queries and updates against vacuumed data are also outlined.
TL;DR: A mathematical framework is introduced in which a formal semantics for object identity can be built irrespective of computer-related things like object identifiers, memory allocations, etc.
Abstract: We introduce a mathematical framework where a formal semantics for object identity can be built irrespective of computer-related things like object identifiers, memory allocations, etc. Then, on this base, we build formal semantics for a few major constructs of conceptual modeling (CM) such as association, aggregation, generalization, isA- and isPartOf-relationships. We also give a formal meaning to the two fundamental dichotomies of CM: objects vs. values and entities vs. relationships. On the syntactical side, the language we use for specifying our formal semantic constructs is graph-based and brief: specifications are directed graphs consisting only of three kinds of items--nodes, arrows and marked diagrams. The latter are configurations of nodes and arrows closed in some technical sense and marked with predicate labels taken from a predefined signature. We show that this format does provide a universal abstract syntax for the entire CM-field. Then any particular CM-notation appears as a particular visualization superstructure (concrete syntax) over the same basic specification format as above.
TL;DR: This volume discusses the role of communicative action in organizations and the question which aggregation levels of communication can be distinguished, and describes a pattern framework for business interactions that makes a distinction between five interaction layers.
Abstract: In today's information society, communication plays a crucial role for individuals cooperating to achieve mutual goals. It is the basic mechanism for coordination in organizations. In principle, communication is something between human agents, but the process can be supported by electronic media (for example, email or negotiation support systems), and in other cases can be delegated almost completely to software agents or other systems. Communication is one of the main functions of our present-day information systems. Traditionally, information systems were focused on data management, and the communication function was recognized relatively late. Since 1980, a new paradigm has evolved in the field of information systems: the language/action perspective (LAP), in particular through the pioneering work of Flores and Winograd. As diverse as the applications of LAP in the last twenty years have been, they all have in common the fundamental agreement that language is not only used for representing and sharing information, but also to perform actions, e.g. promises, orders, declarations. The focus is on the pragmatic aspects of language, i.e. how language is used in particular contexts to achieve practical goals such as agreements and mutual understandings. Over the years, many of the LAP ideas have found their way into the IS and AI fields, for example, in the design of Agent Communication Languages. In 1996, a first Int. Workshop on the Language/Action Perspective was held in Tilburg, The Netherlands. Since then, an annual workshop has been held in several countries. The four papers presented in this volume were originally submitted and discussed at one of the two most recent workshops (Montreal, 2001, and Delft, 2002). In the first paper, Jan Dietz from Delft University of Technology presents DEMO, a LAP-based methodology that has been developed within a rigorous theoretical framework. DEMO has been applied in many practical cases.
The paper discusses the role of communicative action in organizations and the question which aggregation levels of communication can be distinguished. The question of aggregation and abstraction levels in communicative action has been the subject of many recent debates and is also addressed in the second paper by Mikael Lind and Göran Goldkuhl (University College of Borås, Linköping University). This paper is focused on interorganizational communication, as in e-commerce applications, and describes a pattern framework for business interactions that makes a distinction between five interaction layers. The aim of the third paper by Hans Weigand and Aldo de Moor (Tilburg University) is a critical analysis of the communication norms underlying the various LAP approaches. The first
TL;DR: A freshness-driven adaptive dynamic content caching technique, which monitors response time and invalidation cycle length and dynamically adjusts caching policies is proposed and implemented within NEC's Cache Portal Web acceleration solution.
Abstract: Both response time and content freshness are essential to e-commerce applications on the Web. One option to achieve good response time is to build a high performance Web site by deploying state-of-the-art IT infrastructure with large network and server capacities. With such a system architecture, freshness of the content delivered is limited by the network latency since when users receive the contents, the contents may have changed at the server. With the wide availability of content delivery networks, many e-commerce Web applications utilize edge cache servers to cache and deliver dynamic contents at locations much closer to users, avoiding network latency. By caching a large number of dynamic content pages in the edge cache servers, response time can be reduced, benefiting from higher cache hit rates. However, this is achieved at the expense of higher invalidation cost. In turn, a higher invalidation cost leads to a longer invalidation cycle (time to perform invalidation check on the pages in caches) at the expense of freshness of cached dynamic content. In this paper, we propose a freshness-driven adaptive dynamic content caching technique, which monitors response time and invalidation cycle length and dynamically adjusts caching policies. We have implemented the proposed technique within NEC's Cache Portal Web acceleration solution. We have conducted experiments to evaluate the effectiveness of the proposed freshness-driven adaptive dynamic content caching technique. The experimental results show that the proposed technique consistently maintains the best content freshness to users. The experimental results also show that even a Web site with dynamic content caching enabled can further benefit from deployment of our solution, with an improvement in content freshness of up to 10 times, especially during heavy user request traffic and long network latency.
TL;DR: This paper presents two novel indexing schemes to compute the skyline of a set of points progressively and shows that the proposed algorithms provide quick initial response time as compared to existing algorithms.
Abstract: Many decision support applications are characterized by several features: (1) the query is typically based on multiple criteria; (2) there is no single optimal answer (or answer set); (3) because of (2), users are typically looking for satisficing answers; (4) for the same query, different users, dictated by their personal preferences, may find different answers meeting their needs. As such, it is important for the DBMS to present all interesting answers that may fulfill a user's need. In this paper, we focus on the set of interesting answers called the skyline. Given a set of points, the skyline comprises the points that are not dominated by other points. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. We present two novel indexing schemes to compute the skyline of a set of points progressively. Unlike most existing algorithms that require at least one pass over the dataset to return the first interesting point, our algorithms return interesting points gradually as they are identified. The first algorithm, Bitmap, is completely non-blocking and exploits a bitmap structure to quickly identify whether a point is an interesting point or not. The second method, Index, exploits a transformation mechanism and a B+-tree index to return skyline points in batches. Our extensive performance study shows that the proposed algorithms provide quick initial response time as compared to existing algorithms. Moreover, both schemes can also outperform existing techniques in terms of total response time. While Index is superior in most cases, Bitmap is effective when the number of distinct values per dimension is small as well as when the number of skyline points is large.
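The dominance definition in the abstract above translates directly into code. The sketch below is a naive quadratic-time skyline computation that compares every pair of points; it is not the Bitmap or Index scheme proposed in the paper, and it assumes that smaller values are better in every dimension. The function names and the sample data are hypothetical.

```python
def dominates(p, q):
    """True if p is as good or better than q everywhere and better somewhere.

    Assumption for this sketch: smaller values are better in all dimensions.
    """
    as_good_everywhere = all(a <= b for a, b in zip(p, q))
    better_somewhere = any(a < b for a, b in zip(p, q))
    return as_good_everywhere and better_somewhere


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km).
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2), (50, 9)]
print(skyline(hotels))
```

This version must examine the whole dataset before it can report any result, which is the blocking behaviour the progressive algorithms in the abstract are designed to avoid.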
TL;DR: A method is defined that deals with view updating and integrity constraint maintenance in an integrated way and it is shown that it is sound and complete.
Abstract: We deal with view updating and integrity constraint maintenance. View updating is concerned with translating a request to update derived facts into updates of the underlying base facts. Integrity constraint maintenance is aimed at performing the necessary repairs to guarantee that a set of base fact updates does not violate database consistency. We define a method that deals with these problems in an integrated way and we show that it is sound and complete. Soundness ensures that our method obtains only correct solutions, while completeness guarantees that we obtain all valid minimal solutions. We also propose a set of techniques to provide an efficient implementation of our method.
TL;DR: A new normal form is given for OLAP cube design, together with synthesis and decomposition algorithms that produce normalised OLAP cube schemata and so control the structural sparsity resulting from inter-dimensional functional dependencies.
Abstract: A poorly designed OLAP (on-line analytical processing) cube can have a size much larger than the volume of information, potentially leading to problems with performance and usability. We give a new normal form for OLAP cube design and synthesis and decomposition algorithms to produce normalised OLAP cube schemata. OLAP cube normalisation controls the structural sparsity resulting from inter-dimensional functional dependencies. We assume that functional dependencies are used to describe the constraints of the application universe of discourse. Our methods help the user to identify cube schemata with structural sparsity, and to change the design in order to obtain more economy of space.
TL;DR: An approach to enhance this analysis of multidimensional data by preparing the data set so that the analyst can explore it in a more systematic and effective manner is proposed.
Abstract: On-line analytical processing (OLAP) provides an interactive query-driven analysis of multidimensional data based on a set of navigational operators like roll-up or slice and dice. In most cases, the analyst is expected to use these operations intuitively to find interesting patterns in a huge amount of data of high dimensionality. In this paper, we propose an approach to enhance this analysis by preparing the data set so that the analyst can explore it in a more systematic and effective manner. More precisely we define a measurement of the quality of the representation of multidimensional data and we present a framework for investigating the computation of appropriate representations. We identify the problems of computing such representations and study them w.r.t. an OLAP restructuring operator.
TL;DR: This framework provides a generic graphical knowledge representation model based on Sowa's conceptual structures that may be applied to various domains and may accept, for this purpose, many different ontological extensions.
Abstract: The main contribution of this paper is to lay down a conceptual framework for document semantics modeling. This framework provides a generic graphical knowledge representation model based on Sowa's conceptual structures. Modeling primitives are introduced to represent factual and ontological knowledge that can be expressed in electronic documents. Binding features are proposed so as to keep knowledge representation and knowledge formulation linked together. This framework may be applied to various domains and may accept, for this purpose, many different ontological extensions. Thus an extension is provided so as to properly handle the particular kind of knowledge encountered in the legal domain.
TL;DR: The implementation of SISYPHUS' chunk-oriented file system is presented, along with the core architecture of the system and the reasoning behind various design choices and implementation solutions.
Abstract: In this article, we present the design and implementation of SISYPHUS, a storage manager for data cubes that provides an efficient physical base for performing on-line analytical processing (OLAP) operations. OLAP poses new requirements to the physical storage layer of a database management system. Special characteristics of OLAP cubes such as multidimensionality, hierarchical structure of dimensions, data sparseness, etc., are difficult to handle with ordinary record-oriented storage managers. The SISYPHUS storage manager is based on a chunk-based data model that enables the hierarchical clustering of data with a very low storage cost. In this article we present the implementation of SISYPHUS' chunk-oriented file system and the core architecture of the system, and we discuss the reasoning behind various design choices and implementation solutions.
TL;DR: It is shown how second-order decision tables can be used to restructure acquired tabular knowledge into a condensed but logically equivalent second-order table and the results of experiments with such restructuring are presented.
Abstract: Decision tables are widely used in many knowledge-based and decision support systems. They allow relatively complex logical relationships to be represented in an easily understood form and processed efficiently. This paper describes second-order decision tables (decision tables that contain rows whose components have sets of atomic values) and their role in knowledge engineering to: (1) support efficient management and enhance comprehensibility of tabular knowledge acquired by knowledge engineers, and (2) automatically generate knowledge from a tabular set of examples. We show how second-order decision tables can be used to restructure acquired tabular knowledge into a condensed but logically equivalent second-order table. We then present the results of experiments with such restructuring. Next, we describe SORCER, a learning system that induces second-order decision tables from a given database. We compare SORCER with IDTM, a system that induces standard decision tables, and a state-of-the-art decision tree learner, C4.5. Results show that in spite of its simple induction methods, on the average over the data sets studied, SORCER has the lowest error rate.
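The restructuring idea in this abstract (condensing a table of atomic values into a logically equivalent table whose components are sets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's restructuring procedure or SORCER: the greedy merge rule, the function names `to_second_order` and `classify`, and the weather-style example data are all hypothetical. The sketch assumes the input rows are distinct and consistent.

```python
def to_second_order(rows):
    """Condense (conditions, decision) rows into rows whose components are sets.

    Two rows are merged when they share a decision and differ in exactly one
    component; the merged component is the union of the two sets, so the
    merged row covers exactly the cases the two rows covered.
    """
    table = [(tuple(frozenset([v]) for v in cond), dec) for cond, dec in rows]
    merged = True
    while merged:
        merged = False
        for i in range(len(table)):
            for j in range(i + 1, len(table)):
                (ci, di), (cj, dj) = table[i], table[j]
                if di != dj:
                    continue
                diff = [k for k in range(len(ci)) if ci[k] != cj[k]]
                if len(diff) == 1:
                    k = diff[0]
                    union = ci[:k] + (ci[k] | cj[k],) + ci[k + 1:]
                    table[i] = (union, di)
                    del table[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return table


def classify(table, case):
    """Return the decision of the first row whose sets contain every value."""
    for cond, dec in table:
        if all(value in allowed for value, allowed in zip(case, cond)):
            return dec
    return None


# Hypothetical first-order table: (outlook, temperature) -> play?
rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "hot"), "yes"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
]

condensed = to_second_order(rows)
for cond, dec in condensed:
    print([sorted(s) for s in cond], "->", dec)
```

The five original rows collapse into three, and every original case still receives the same decision. A greedy merge like this does not guarantee the smallest possible table; it only preserves equivalence.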
TL;DR: This paper presents an approach to estimating the soundness and completeness of queries expressed in the ALCQI DL, based on estimating the cardinalities of query answers, and offers some suggestions as to how estimates for cardinalities of subqueries can be used to aid users in improving the soundness and completeness of query plans.
Abstract: Information integration systems allow users to express queries over high-level conceptual models. However, such queries must subsequently be evaluated over collections of sources, some of which are likely to be expensive to use or subject to periods of unavailability. As such, it would be useful if information integration systems were able to provide users with estimates of the consequences of omitting certain sources from query execution plans. Such omissions can affect both the soundness (the fraction of returned answers which are correct) and the completeness (the fraction of correct answers which are returned) of the answer set returned by a plan. Many recent information integration systems have used conceptual models expressed in description logics (DLs). This paper presents an approach to estimating the soundness and completeness of queries expressed in the ALCQI DL. Our estimation techniques are based on estimating the cardinalities of query answers. We have conducted some statistical evaluation of our techniques, the results of which are presented here. We also offer some suggestions as to how estimates for cardinalities of subqueries can be used to aid users in improving the soundness and completeness of query plans.
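The two measures can be made concrete with a small sketch: soundness as the fraction of returned answers that are correct, and completeness as the fraction of correct answers that are returned. The code computes them exactly from known answer sets, which is only an illustration of the definitions; the paper instead estimates them from cardinalities without materialising the answers. The function names, the convention for empty sets, and the sample data are assumptions.

```python
def soundness(returned, correct):
    """Fraction of returned answers that are correct."""
    if not returned:
        return 1.0  # assumption: an empty answer set is vacuously sound
    return len(returned & correct) / len(returned)


def completeness(returned, correct):
    """Fraction of correct answers that are returned."""
    if not correct:
        return 1.0  # assumption: nothing to find means nothing is missed
    return len(returned & correct) / len(correct)


# Hypothetical answer sets for one query plan that omits some sources.
correct_answers = {"a", "b", "c", "d"}
returned_answers = {"a", "b", "e"}

s = soundness(returned_answers, correct_answers)
c = completeness(returned_answers, correct_answers)
print(f"soundness={s:.3f} completeness={c:.3f}")
```

Here two of the three returned answers are correct and two of the four correct answers are returned. Both ratios depend only on the sizes of the sets involved, which is why cardinality estimates are enough to approximate them.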
TL;DR: The proposed approach is based on the exploitation of a new model, called E-SDR-Network, for representing and handling, at the extensional level, heterogeneous data sources, ranging from databases to XML documents, object exchange model graphs and other semi-structured data.
Abstract: In this paper we propose an approach for the extensional integration of data sources with heterogeneous representation formats. The proposed approach is based on the exploitation of a new model, called E-SDR-Network, for representing and handling, at the extensional level, heterogeneous data sources, ranging from databases to XML documents, object exchange model graphs and other semi-structured data. Due to the specific features of E-SDR-Network, the proposed extensional integration methodology is capable of: (i) easily handling null or unknown values, (ii) producing consistent query answers from possibly inconsistent data and (iii) reconstructing, at the extensional level, the content of each data source involved in the integration task. Finally, we show that E-SDR-Network and the proposed extensional integration algorithm are the counterpart, at the extensional level, of the SDR-Network conceptual model and the associated intensional integration algorithm, already proposed in the literature. Therefore, on the whole, we obtain a complete approach consisting of two components that together perform both the intensional and the extensional integration of data sources having heterogeneous data representation formats.