TL;DR: This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first class concept and proposes two modeling approaches, one based on a design pattern and the other on dedicated data types.
Abstract: Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first class concept. A specific concern is to allow enriching trajectories with semantic annotations allowing users to attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.
TL;DR: A set of 18 change patterns and seven change support features are suggested to foster the systematic comparison of existing process management technology with respect to process change support and to facilitate the selection of technologies for realizing flexible PAISs.
Abstract: Companies increasingly adopt process-aware information systems (PAISs), which offer promising perspectives for more flexible enterprise computing. The emergence of different process support paradigms and the lack of methods for comparing existing approaches enabling PAIS changes have made the selection of adequate process management technology difficult. This paper suggests a set of 18 change patterns and seven change support features to foster the systematic comparison of existing process management technology with respect to process change support. While the proposed patterns are all based on empirical evidence from several large case studies, the suggested change support features constitute typical functionalities provided by flexible PAISs. Based on the proposed change patterns and features, we provide a detailed analysis and evaluation of selected approaches from both academia and industry. The presented work will not only facilitate the selection of technologies for realizing flexible PAISs, but can also be used as a reference for implementing flexible PAISs.
TL;DR: This study proposes the Isolated Items Discarding Strategy (IIDS), which can be applied to any existing level-wise utility mining method to reduce candidates and improve performance; experimental results reveal that FUM and DCG+ are more efficient than ShFSM and DCG, respectively.
Abstract: Traditional methods of association rule mining consider the appearance of an item in a transaction, whether or not it is purchased, as a binary variable. However, customers may purchase more than one of the same item, and the unit cost may vary among items. Utility mining, a generalized form of the share mining model, attempts to overcome this problem. Since the Apriori pruning strategy cannot identify high utility itemsets, developing an efficient algorithm is crucial for utility mining. This study proposes the Isolated Items Discarding Strategy (IIDS), which can be applied to any existing level-wise utility mining method to reduce candidates and to improve performance. The most efficient known models for share mining are ShFSM and DCG, which also work adequately for utility mining. By applying IIDS to ShFSM and DCG, the two methods FUM and DCG+ were implemented, respectively. For both synthetic and real datasets, experimental results reveal that FUM and DCG+ are more efficient than ShFSM and DCG, respectively. Therefore, IIDS is an effective strategy for utility mining.
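A minimal sketch of why Apriori-style pruning does not carry over to utility mining, assuming the common definition of itemset utility as quantity times a per-item unit value, summed over the transactions that contain the whole itemset. The toy data and the names `TRANSACTIONS`, `UNIT_PROFIT` and `itemset_utility` are hypothetical illustrations; this is not IIDS, FUM or DCG+.

```python
# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"A": 1, "B": 1, "C": 3},
]
# Hypothetical per-item unit value (assumed, not from the paper).
UNIT_PROFIT = {"A": 1, "B": 5, "C": 2}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit value of the itemset's items over every
    transaction that contains the whole itemset."""
    total = 0
    for transaction in transactions:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


u_a = itemset_utility({"A"}, TRANSACTIONS, UNIT_PROFIT)
u_ab = itemset_utility({"A", "B"}, TRANSACTIONS, UNIT_PROFIT)
print(u_a, u_ab)
```

In this toy data the superset {A, B} has a higher utility than its subset {A}, so a low-utility itemset cannot be discarded on the assumption that all of its supersets are low-utility as well.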
TL;DR: An automatic and unsupervised methodology is presented that discovers domain-related verbs, extracts non-taxonomically related concepts and labels relationships, using the Web as corpus; encouraging results are reported for several domains.
Abstract: In recent years, much effort has been put into ontology learning. However, the knowledge acquisition process is typically focused on the taxonomic aspect. The discovery of non-taxonomic relationships is often neglected, even though it is a fundamental point in structuring domain knowledge. This paper presents an automatic and unsupervised methodology that addresses the non-taxonomic learning process for constructing domain ontologies. It is able to discover domain-related verbs, extract non-taxonomically related concepts and label relationships, using the Web as corpus. The paper also discusses how the obtained relationships can be automatically evaluated against WordNet and presents encouraging results for several domains.
TL;DR: The thematic and citation structures of Data and Knowledge Engineering (DKE) (1985-2007) are identified based on text analysis and citation analysis of the bibliographic records of full papers published in the journal.
Abstract: The thematic and citation structures of Data and Knowledge Engineering (DKE) (1985-2007) are identified based on text analysis and citation analysis of the bibliographic records of full papers published in the journal. Temporal patterns are identified by detecting abrupt increases of frequencies of noun phrases extracted from titles and abstracts of DKE papers over time. Conceptual structures of the subject domain are identified by clustering analysis. Concept maps and network visualizations are presented to illustrate salient patterns and emerging thematic trends. A variety of statistics are reported to highlight key contributors and DKE papers that have made profound impacts.
TL;DR: A formal two-step approach for constructing customized process views on structured process models by hiding and omitting activities from the non-customized view that are not requested by the process consumer is described.
Abstract: To enable effective cross-organizational collaborations, process providers have to offer external views on their internal processes to their partners. A process view hides details of an internal process that are secret to or irrelevant for the partners. This paper describes a formal two-step approach for constructing customized process views on structured process models. First, a non-customized process view is constructed from an internal structured process model by aggregating internal activities the provider wishes to hide. Second, a customized process view is constructed by hiding and omitting activities from the non-customized view that are not requested by the process consumer. The feasibility of the approach is shown by means of a case study.
TL;DR: A framework for defining semantic constraints over processes in such a way that they can express real-world domain knowledge and are still manageable concerning the effort for maintenance and semantic process verification is introduced.
Abstract: Adaptivity in process management systems is key to their successful applicability in practice. Approaches have already been developed to ensure system correctness after arbitrary process changes at the syntactical level (e.g., avoiding inconsistencies such as deadlocks or missing input parameters after a process change). However, errors may still be caused at the semantical level (e.g., violation of business rules). Therefore, the integration and verification of domain knowledge will mark a milestone in the development of adaptive process management technology. In this paper, we introduce a framework for defining semantic constraints over processes in such a way that they can express real-world domain knowledge on the one hand and are still manageable concerning the effort for maintenance and semantic process verification on the other hand. This can be used to detect semantic conflicts (e.g., drug incompatibilities) when modeling process templates, applying ad hoc changes at process instance level, and propagating process template modifications to already running process instances, even if they have already been individually modified themselves; i.e., we present techniques to ensure semantic correctness for single and concurrent changes which are, in addition, minimal regarding the set of semantic constraints to be checked. Together with further optimizations of the semantic checks based on certain process meta model properties, this allows for efficiently verifying processes. Altogether, the framework presented in this paper provides the basis for process management systems which are adaptive and semantic-aware at the same time.
TL;DR: To manage processes of realistic size, the problem of analyzing the interaction between WS-BPEL processes is addressed and a concept of a flexible model generation is presented which allows the generation of compact Petri net models.
Abstract: We address the problem of analyzing the interaction between WS-BPEL processes. We present a technology chain that starts out with a WS-BPEL process and translates it into a Petri net model. On the model we decide controllability of the process (the existence of a partner process, such that both can interact properly) and compute its operating guideline (a characterization of all properly interacting partner processes). To manage processes of realistic size, we present a concept of a flexible model generation which allows the generation of compact Petri net models. A case study demonstrates the value of this technology chain.
TL;DR: A new algorithmic approach for sanitizing raw data from sensitive knowledge in the context of mining of association rules relies on the maxmin criterion which is a method in decision theory for maximizing the minimum gain and builds upon the border theory of frequent itemsets.
Abstract: In this paper, we are proposing a new algorithmic approach for sanitizing raw data from sensitive knowledge in the context of mining of association rules. The new approach (a) relies on the maxmin criterion which is a method in decision theory for maximizing the minimum gain and (b) builds upon the border theory of frequent itemsets. Experimental results indicate the effectiveness of the proposed methodology both with respect to the hiding results as well as with respect to the time performance compared to similar state of the art approaches.
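The maxmin criterion named in the abstract can be illustrated in isolation. The sketch below picks, among several options, the one whose worst-case gain is largest. The payoff table and the name `maxmin_choice` are hypothetical; how the paper defines gains for data sanitization is not stated in the abstract and is not modeled here.

```python
def maxmin_choice(payoffs):
    """Return the option whose minimum (worst-case) gain is largest."""
    return max(payoffs, key=lambda option: min(payoffs[option]))


# Hypothetical gains of three options under three possible outcomes.
payoffs = {
    "option_1": [3, 9, 4],   # worst case 3
    "option_2": [5, 6, 5],   # worst case 5
    "option_3": [1, 12, 8],  # worst case 1
}
best = maxmin_choice(payoffs)
print(best)
```

Note that the criterion ignores the best cases entirely: `option_3` has the largest single gain but is rejected because of its poor worst case.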
TL;DR: The proposed hybrid fuzzy time series model, which combines two advanced methods, the cumulative probability distribution approach (CPDA) and rough set rule induction, to forecast stock markets, shows greatly improved performance in stock market forecasting compared to other fuzzy time series models.
Abstract: This study proposes a hybrid fuzzy time series model with two advanced methods, cumulative probability distribution approach (CPDA) and rough set rule induction, to forecast stock markets. To improve forecasting accuracy, three refining processes of fuzzy time series are provided in the proposed model: (1) using CPDA to discretize the observations in training datasets based on the characteristics of data distribution, (2) generating rules (fuzzy logical relationships) by rough set algorithm and (3) producing forecasting results based on rule support values from rough set algorithm. To verify the forecasting performance of the proposed model in detail, two empirical stock markets (TAIEX and NYSE) are used as evaluating databases; two other methodologies, proposed by Chen and Yu, are used as comparison models, and two different evaluation methods (moving windows) are used. The proposed model shows a greatly improved performance in stock market forecasting compared to other fuzzy time series models.
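The abstract does not spell out how CPDA discretizes observations. One plausible reading, sketched here purely as an assumption, is to fit a normal distribution to the training data and cut it into intervals of equal cumulative probability, so that interval widths follow the data distribution. The function names, the normality assumption and the price data are all hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, stdev


def equal_probability_cuts(observations, n_intervals):
    """Cut points that split a fitted normal distribution into
    n_intervals intervals of equal cumulative probability (assumed scheme)."""
    dist = NormalDist(fmean(observations), stdev(observations))
    return [dist.inv_cdf(i / n_intervals) for i in range(1, n_intervals)]


def interval_index(value, cuts):
    """0-based index of the interval (linguistic value) containing value."""
    return bisect_right(cuts, value)


# Hypothetical closing prices of a stock index.
prices = [6400, 6550, 6480, 6700, 6620, 6810, 6750, 6900]
cuts = equal_probability_cuts(prices, n_intervals=4)
labels = [interval_index(p, cuts) for p in prices]
print(cuts, labels)
```

Compared with equal-width intervals, this scheme yields narrower intervals where observations are dense and wider ones in the tails.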
TL;DR: A new record linkage method is presented, specific for rank swapping, which obtains more links than standard ones, and has the consequence that rank swapping has a higher disclosure risk than believed up to now.
Abstract: Nowadays, the need for privacy motivates the use of methods that protect a microdata file while both minimizing the disclosure risk and preserving the data utility. A very popular microdata protection method is rank swapping. Record linkage is the standard mechanism used to measure the disclosure risk of a microdata protection method. In this paper we present a new record linkage method, specific for rank swapping, which obtains more links than standard ones. The consequence is that rank swapping has a higher disclosure risk than believed up to now. Motivated by this, we present two new variants of the rank swapping method, which make the new record linkage technique unsuitable. Therefore, the real disclosure risk of these new methods is lower than that of standard rank swapping.
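For readers unfamiliar with the protection method under attack, here is a minimal sketch of standard rank swapping for a single numeric attribute, assuming the usual formulation: values are ranked, and each value is swapped with a randomly chosen partner whose rank differs by at most p percent of the number of records. The function name `rank_swap` and the data are hypothetical; the paper's record linkage method and its two new variants are not reproduced.

```python
import random


def rank_swap(values, p, seed=0):
    """Swap each value with a random partner at most p% of n ranks away."""
    rng = random.Random(seed)
    n = len(values)
    window = max(1, int(n * p / 100))
    # order[r] is the index of the record holding the r-th smallest value.
    order = sorted(range(n), key=lambda i: values[i])
    result = list(values)
    swapped = [False] * n  # indexed by rank
    for r in range(n):
        if swapped[r]:
            continue
        candidates = [s for s in range(r + 1, min(n, r + window + 1))
                      if not swapped[s]]
        if not candidates:
            continue  # no unswapped partner left inside the window
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        result[i], result[j] = values[j], values[i]
        swapped[r] = swapped[s] = True
    return result


# Hypothetical attribute values (e.g. ages) of ten records.
values = [52, 17, 88, 34, 61, 45, 73, 29, 95, 10]
masked = rank_swap(values, p=20)
print(masked)
```

Because every masked value stays within a bounded rank distance of the original, an intruder who knows the window can restrict the set of candidate matches, which is the kind of structure a method-specific record linkage can exploit.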
TL;DR: The results demonstrate the potential of a sensitivity-based estimate, as well as the local modeling of prediction error with regression trees; the best average performance was achieved by the bagging variance approach, which performed best with neural networks, bagging and locally weighted regression.
Abstract: The paper compares different approaches to estimate the reliability of individual predictions in regression. We compare the sensitivity-based reliability estimates developed in our previous work with four approaches found in the literature: variance of bagged models, local cross-validation, density estimation, and local modeling. By combining pairs of individual estimates, we compose a combined estimate that performs better than the individual estimates. We tested the estimates by running data from 28 domains through eight regression models: regression trees, linear regression, neural networks, bagging, support vector machines, locally weighted regression, random forests, and generalized additive model. The results demonstrate the potential of a sensitivity-based estimate, as well as the local modeling of prediction error with regression trees. Among the tested approaches, the best average performance was achieved by estimation using the bagging variance approach, which achieved the best performance with neural networks, bagging and locally weighted regression.
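The bagging variance estimate mentioned above can be sketched as follows, assuming its usual form: train several models on bootstrap resamples and use the variance of their predictions for one input as a (lower is better) reliability score for that prediction. A simple least-squares line stands in for the base model; `fit_line`, `bagging_variance` and the data are hypothetical and not the authors' experimental setup.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate bootstrap sample (a single distinct x): constant model.
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return mean_y - slope * mean_x, slope


def bagging_variance(points, x_new, n_models=50, seed=0):
    """Return (bagged prediction, variance of the individual predictions)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: resample the training set with replacement.
        sample = [rng.choice(points) for _ in points]
        intercept, slope = fit_line(sample)
        predictions.append(intercept + slope * x_new)
    return statistics.fmean(predictions), statistics.pvariance(predictions)


# Hypothetical noisy training data around y = 2x + 1.
noise = random.Random(1)
train = [(x, 2 * x + 1 + noise.gauss(0, 1)) for x in range(10)]
prediction, variance = bagging_variance(train, x_new=4.5)
print(prediction, variance)
```

A large variance signals that the bagged models disagree at `x_new`, so the individual prediction there should be trusted less.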
TL;DR: An approach to process mining is introduced that extends classical discovery mechanisms by means of an abstraction method aimed at producing a taxonomy of workflow models, showing how the taxonomical view of the process can effectively support an explorative ex-post analysis, hinged on the different kinds of process execution discovered from the logs.
Abstract: Process mining techniques have been receiving great attention in the literature for their ability to automatically support process (re)design. Typically, these techniques discover a concrete workflow schema modelling all possible execution patterns registered in a given log, which can be exploited subsequently to support forthcoming enactments. In this paper, an approach to process mining is introduced that extends classical discovery mechanisms by means of an abstraction method aimed at producing a taxonomy of workflow models. The taxonomy is built to capture the process behavior at different levels of detail. Indeed, the most-detailed mined models, i.e., the leaves of the taxonomy, are meant to support the design of concrete workflows, as happens with existing techniques in the literature. The other models, i.e., non-leaf nodes of the taxonomy, represent instead abstract views over the process behavior that can be used to support advanced monitoring and analysis tasks. All the techniques discussed in the paper have been implemented, tested, and made available as a plugin for a popular process mining framework (ProM). A series of tests, performed on different synthesized and real datasets, evidenced the capability of the approach to characterize the behavior encoded in input logs in a precise and complete way, achieving compelling conformance results even in the presence of complex behavior and noisy data. Moreover, encouraging results have been obtained in a real-life application scenario, where it is shown how the taxonomical view of the process can effectively support an explorative ex-post analysis, hinged on the different kinds of process execution discovered from the logs.
TL;DR: The notion of Collective Intelligence (CI) in the realm of the Social Web and its potential to become a new computing paradigm for creating solutions or strategies to tackle wicked problems where the synergistic interactions of a group of people with diverse cultural and professional backgrounds are requested are explored.
Abstract: Since the first formal specifications of modern computing machinery as laid out by Alan Turing and his contemporary fellows, we have been witnessing, during the last three decades, an evolutionary path in computing towards more personalized and contextualized data and knowledge artifacts. Information sharing, co-ordination, co-operation and, to some extent, collaboration among machines has been envisioned for complex problem solving. A prominent example of this problem solving approach has been the Fifth Generation Computer Systems (FGCS) project as launched in Japan in the 1980s and based on the concept of calculation using massive parallelism in logic and hardware. Grid and Distributed Computing, also known as Future Generation Computer Systems, is another similar attempt to exploit massive parallelism in order to solve complex problems. In this article, we explore the notion of Collective Intelligence (CI) in the realm of the Social Web and its potential to become a new computing paradigm for creating solutions or strategies to tackle wicked problems where the synergistic interactions of a group of people with diverse cultural and professional backgrounds are requested.
TL;DR: Verifying BPEL Workflows Under Authorisation Constraints and Optimizing Exception Handling in Workflows Using Process Restructuring are discussed.
TL;DR: A new record linkage technique, specific for microaggregation, is presented, which obtains more correct links than standard techniques, implying that the tested microaggregation methods have a higher disclosure risk than believed up to now.
Abstract: The aim of data protection methods is to protect a microdata file while both minimizing the disclosure risk and preserving the data utility. Microaggregation is one of the most popular such methods among statistical agencies. Record linkage is the standard mechanism used to measure the disclosure risk of a microdata protection method. However, only standard, and quite generic, record linkage methods are usually considered, whereas more specific record linkage techniques can be more appropriate to evaluate the disclosure risk of some protection methods. In this paper we present a new record linkage technique, specific for microaggregation, which obtains more correct links than standard techniques. We have tested the new technique with MDAV microaggregation and two other microaggregation methods, based on projections, that we propose here for the first time. The direct consequence is that these microaggregation methods have a higher disclosure risk than believed up to now.
TL;DR: This work presents a methodology for sequence classification, which employs sequential pattern mining and optimization, in a two-stage process, and it is compared with similar sequence classification approaches.
Abstract: We present a methodology for sequence classification, which employs sequential pattern mining and optimization, in a two-stage process. In the first stage, a sequence classification model is defined, based on a set of sequential patterns and two sets of weights are introduced, one for the patterns and one for classes. In the second stage, an optimization technique is employed to estimate the weight values and achieve optimal classification accuracy. Extensive evaluation of the methodology is carried out, by varying the number of sequences, the number of patterns and the number of classes and it is compared with similar sequence classification approaches.
TL;DR: A recently-developed constrained cascade generalization method is applied in entity matching and it is shown that this method outperforms the base classification methods in terms of classification accuracy, especially in the dirtiest case.
Abstract: To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records representing semantically corresponding entities in the real world, across the sources. While decision tree techniques have been used to learn entity matching rules, most decision tree learners have an inherent representational bias, that is, they generate univariate trees and restrict the decision boundaries to be axis-orthogonal hyper-planes in the feature space. Cascading other classification methods with decision tree learners can alleviate this bias and potentially increase classification accuracy. In this paper, the authors apply a recently-developed constrained cascade generalization method in entity matching and report on empirical evaluation using real-world data. The evaluation results show that this method outperforms the base classification methods in terms of classification accuracy, especially in the dirtiest case.
TL;DR: By using the ID-tree and datasets to mine closed inter-transaction itemsets, the ICMiner can embed effective pruning strategies to avoid costly candidate generation and repeated support counting.
Abstract: In this paper, we propose an efficient algorithm, called ICMiner (Inter-transaction Closed patterns Miner), for mining closed inter-transaction itemsets. Our proposed algorithm consists of two phases. First, we scan the database once to find the frequent items. For each frequent item found, the ICMiner converts the original transaction database into a set of domain attributes, called a dataset. Then, it enumerates closed inter-transaction itemsets using an itemset-dataset tree, called an ID-tree. By using the ID-tree and datasets to mine closed inter-transaction itemsets, the ICMiner can embed effective pruning strategies to avoid costly candidate generation and repeated support counting. The experimental results show that the proposed algorithm outperforms the EH-Apriori, FITI, ClosedPROWL, and ITP-Miner algorithms in most cases.
TL;DR: It is shown that more-flexible generalization schemes produce higher-quality anonymizations and that the bottom-up approach works better than the top-down approach for small k values and a small number of quasi-identifier attributes.
Abstract: When releasing microdata for research purposes, one needs to preserve the privacy of respondents while maximizing data utility. An approach that has been studied extensively in recent years is to use anonymization techniques such as generalization and suppression to ensure that the released data table satisfies the k-anonymity property. A major thread of research in this area aims at developing more flexible generalization schemes and more efficient searching algorithms to find better anonymizations (i.e., those that have less information loss). This paper presents three new generalization schemes that are more flexible than existing schemes. This flexibility can lead to better anonymizations. We present a taxonomy of generalization schemes and discuss their relationship. We present enumeration algorithms and pruning techniques for finding optimal generalizations in the new schemes. Through experiments on real census data, we show that more-flexible generalization schemes produce higher-quality anonymizations and that the bottom-up approach works better than the top-down approach for small k values and a small number of quasi-identifier attributes.
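The k-anonymity property that the generalization schemes aim to satisfy can be checked with a few lines of code: every combination of quasi-identifier values must occur in at least k records. The sketch below also applies one simple, fixed generalization (masking trailing ZIP code digits) to show its effect. The table, the attribute names and the helpers `is_k_anonymous` and `generalize_zip` are hypothetical, and none of the paper's three new schemes or search algorithms is implemented.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs >= k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())


def generalize_zip(row, digits_kept):
    """Return a copy of row with trailing ZIP digits replaced by '*'."""
    zip_code = row["zip"]
    masked = zip_code[:digits_kept] + "*" * (len(zip_code) - digits_kept)
    return {**row, "zip": masked}


# Hypothetical microdata; "zip" and "age" are the quasi-identifiers.
rows = [
    {"zip": "47906", "age": "20-29", "disease": "flu"},
    {"zip": "47907", "age": "20-29", "disease": "cold"},
    {"zip": "47305", "age": "30-39", "disease": "flu"},
    {"zip": "47302", "age": "30-39", "disease": "asthma"},
]
QI = ("zip", "age")

before = is_k_anonymous(rows, QI, k=2)
generalized = [generalize_zip(row, digits_kept=4) for row in rows]
after = is_k_anonymous(generalized, QI, k=2)
print(before, after)
```

Masking fewer digits loses less information, which is why more flexible schemes that generalize only as much as needed can yield better anonymizations.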
TL;DR: It is shown that the data definition and manipulation framework of XML and XQuery can effectively support temporal models and historical queries without requiring extensions to the current standards.
Abstract: By storing the successive versions of a document in an incremental fashion, XML repositories and data warehouses achieve: (i) the efficient preservation of critical information and (ii) the ability to support historical queries on the evolution of documents and their contents. In this paper, we present efficient techniques for managing multi-version document histories and supporting powerful temporal queries on such documents. Our approach consists of: (i) concisely representing the successive versions of a document as an XML document that implements a temporally-grouped data model and (ii) using XML query languages, such as XQuery, to express complex queries on the content of a particular version, and on the temporal evolution of the document elements and contents. We show that the data definition and manipulation framework of XML and XQuery can effectively support temporal models and historical queries without requiring extensions to the current standards; in fact, this approach is effective at representing and querying the histories of relational database tables, which are difficult to manage using SQL. These conclusions emerge through a number of interesting case studies presented in this paper that include W3C documents, the UCLA course catalog, and the CIA World Factbook.
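A temporally-grouped representation keeps all versions of an element together, each stamped with its validity period. The sketch below assumes hypothetical `tstart` and `tend` attributes holding ISO dates and uses Python's standard `xml.etree.ElementTree` as a stand-in for XQuery, to show a snapshot query and a history query over such a document. The document content and the function names are illustrative, not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical temporally-grouped document: one element per version,
# with an assumed validity interval [tstart, tend] in ISO date format.
DOC = """
<employee>
  <name tstart="2001-01-01" tend="9999-12-31">Bob</name>
  <title tstart="2001-01-01" tend="2003-06-30">Engineer</title>
  <title tstart="2003-07-01" tend="9999-12-31">Manager</title>
</employee>
"""


def value_at(root, tag, date):
    """Snapshot query: text of the tag's version valid on the given date."""
    for elem in root.findall(tag):
        # ISO dates compare correctly as plain strings.
        if elem.get("tstart") <= date <= elem.get("tend"):
            return elem.text
    return None


def history(root, tag):
    """History query: every version of the tag with its validity period."""
    return [(e.get("tstart"), e.get("tend"), e.text) for e in root.findall(tag)]


root = ET.fromstring(DOC)
print(value_at(root, "title", "2002-05-01"))
print(history(root, "title"))
```

Both queries are ordinary path selections with attribute comparisons, which is consistent with the abstract's point that no extension to the query language is needed.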
TL;DR: A working prototype system which is based on Fuzzy Clustering CRM (FC-CRM) has been developed and presented to validate the proposed approach and illustrate how it handles the dynamic inflow of new documents.
Abstract: Internet technology enables companies to capture new customers, track their performances and online behavior, and customize communications, products, services, and prices. Analyses of customers and customer interactions for electronic customer relationship management (e-CRM) can be performed by way of using data mining (DM), optimization methods, or combined approaches. One key issue in the analysis of access patterns on the Web is the clustering and classification of Web documents. Generally, the classification is based on analytical models which assume a pre-fixed set of keywords (attributes) with a predefined list of categories. This assumption is not realistic for large and evolving collections of documents such as the World Wide Web. We propose a new approach to solve the problem of an unknown number of evolving categories. The approach begins with the classification of test documents into a set of initial categories. A working prototype system which is based on Fuzzy Clustering CRM (FC-CRM) has been developed and presented to validate the proposed approach and illustrate how it handles the dynamic inflow of new documents.
TL;DR: A privacy-preserving protocol is presented for filling in missing values using a lazy decision tree imputation algorithm for data that is horizontally partitioned between two parties.
Abstract: Handling missing data is a critical step to ensuring good results in data mining. Like most data mining algorithms, existing privacy-preserving data mining algorithms assume data is complete. In order to maintain privacy in the data mining process while cleaning data, privacy-preserving methods of data cleaning will be required. In this paper, we address the problem of privacy-preserving data imputation of missing data. Specifically, we present a privacy-preserving protocol for filling in missing values using a lazy decision tree imputation algorithm for data that is horizontally partitioned between two parties. The participants of the protocol learn only the imputed values; the computed decision tree is not learned by either party.
TL;DR: This paper extends @tXSchema to support versioning of the schema itself, and introduces the concept of a bundle, which is an XML document that references a base (non-temporal) schema, temporal annotations describing how the document can change, and physical annotations describing where timestamps are placed.
Abstract: The W3C XML Schema recommendation defines the structure and data types for XML documents, but lacks explicit support for time-varying XML documents or for a time-varying schema. In previous work we introduced @tXSchema, which is an infrastructure and suite of tools to support the creation and validation of time-varying documents, without requiring any changes to XML Schema. In this paper we extend @tXSchema to support versioning of the schema itself. We introduce the concept of a bundle, which is an XML document that references a base (non-temporal) schema, temporal annotations describing how the document can change, and physical annotations describing where timestamps are placed. When the schema is versioned, the base schema and temporal and physical schemas can themselves be time-varying documents, each with their own (possibly versioned) schemas. We describe how the validator can be extended to validate documents in this seemingly precarious situation of data that changes over time, while its schema and even its representation are also changing.
TL;DR: This research presents a methodology for processing web queries that employs semantic knowledge about different application domains from ResearchCyc, as well as linguistic knowledge from WordNet, to improve web query processing.
Abstract: Although search engines are very useful for obtaining information from the World Wide Web, users still have problems obtaining the most relevant information when processing their web queries. Prior research has attempted to use different types of knowledge to improve web query processing with various levels of success. This research presents a methodology for processing web queries that employs semantic knowledge about different application domains from ResearchCyc, as well as linguistic knowledge from WordNet. An analysis of different queries from different application domains using the semantic and linguistic knowledge illustrates how more relevant results can be obtained.
TL;DR: An ontology for representing relevant semantic properties of services and processes is provided, and an algorithm for value-based service selection is presented, and two real life case studies show the effectiveness of the approach.
Abstract: A major objective in business interactions consists in enhancing the business perspective over service provision by developing strategies and tools to provide support in the selection of services according to the value they have for businesses. This means providing a way to determine the value of services according to specific business criteria, and conceive technologies that support the sharing of knowledge involved in service provision. In this paper we present an approach based on semantic repositories. The repository enables a business perspective over service provision, based on the association between services and business processes, and is related to the problem of supporting businesses in the value-driven service selection. This perspective is addressed in the paper by exploiting expressive semantic representations and reasoning. An ontology for representing relevant semantic properties of services and processes is provided, and an algorithm for value-based service selection is presented. Two real life case studies show the effectiveness of the approach.
TL;DR: MeDEA, which stands for Metamodel-based Database Evolution Architecture, is a generic evolution architecture that allows us to maintain the traceability between the different artifacts involved in any database development process.
Abstract: One of the most important challenges that software engineers (designers, developers) still have to face in their everyday work is the evolution of working database systems. As a step toward solving this problem, in this paper we propose MeDEA, which stands for Metamodel-based Database Evolution Architecture. MeDEA is a generic evolution architecture that allows us to maintain the traceability between the different artifacts involved in any database development process. MeDEA is generic in the sense that it is independent of the particular modeling techniques being used. In order to achieve this, a metamodeling approach has been followed for the development of MeDEA. The other basic characteristic of the architecture is the inclusion of a specific component devoted to storing the translation of conceptual schemas to logical ones. This component, which is one of the most noteworthy contributions of our approach, enables any modification (evolution) realized on a conceptual schema to be traced to the corresponding logical schema, without having to regenerate this schema from scratch, and furthermore to be propagated to the physical and extensional levels.
TL;DR: In this paper, the authors classify incomplete decision tables into three types according to their consistency and introduce four new measures for evaluating the decision performance of a decision-rule set extracted from an incomplete decision table.
Abstract: As two classical measures, approximation accuracy and consistency degree can be extended for evaluating the decision performance of an incomplete decision table. However, when the values of these two measures are equal to zero, they cannot give elaborate depictions of the certainty and consistency of an incomplete decision table. To overcome this shortcoming, we first classify incomplete decision tables into three types according to their consistency and introduce four new measures for evaluating the decision performance of a decision-rule set extracted from an incomplete decision table. We then analyze how each of these four measures depends on the condition granulation and decision granulation of each of the three types of incomplete decision tables. Experimental analyses on three practical data sets show that the four new measures appear to be well suited for evaluating the decision performance of a decision-rule set extracted from an incomplete decision table and are much better than the two extended measures.
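As a rough illustration of the two extended measures mentioned above, the sketch below computes approximation accuracy and consistency degree for a small incomplete decision table. It assumes the commonly used tolerance relation for incomplete tables (a missing value, written `*` here, matches any value) and the standard rough-set definitions of lower and upper approximations; the abstract does not state these definitions, and the four new measures are not reproduced. All function names and the toy data are hypothetical.

```python
MISSING = "*"  # assumed marker for an unknown attribute value


def tolerant(x, y):
    """Two objects are tolerant if they agree on every attribute
    where both values are known (assumed tolerance relation)."""
    return all(a == b or a == MISSING or b == MISSING for a, b in zip(x, y))


def tolerance_classes(conditions):
    """For each object index, the set of object indices it is tolerant with."""
    n = len(conditions)
    return [
        {j for j in range(n) if tolerant(conditions[i], conditions[j])}
        for i in range(n)
    ]


def decision_classes(decisions):
    """Partition object indices by decision value."""
    classes = {}
    for i, d in enumerate(decisions):
        classes.setdefault(d, set()).add(i)
    return list(classes.values())


def extended_measures(conditions, decisions):
    """Return (approximation accuracy, consistency degree)."""
    classes = tolerance_classes(conditions)
    lower_total = 0  # objects certainly belonging to some decision class
    upper_total = 0  # objects possibly belonging, summed over decision classes
    for target in decision_classes(decisions):
        lower_total += sum(1 for s in classes if s <= target)
        upper_total += sum(1 for s in classes if s & target)
    accuracy = lower_total / upper_total
    consistency = lower_total / len(conditions)
    return accuracy, consistency


# Toy table: two condition attributes, one decision attribute.
conditions = [("high", "yes"), ("high", "*"), ("low", "no"), ("low", "no")]
decisions = ["buy", "skip", "skip", "skip"]
accuracy, consistency = extended_measures(conditions, decisions)

# A fully inconsistent table drives both measures to zero, which is the
# situation in which the abstract says they stop being informative.
zero_accuracy, zero_consistency = extended_measures([("a",), ("a",)], ["x", "y"])
```

In the toy table, objects 0 and 1 are tolerant with each other but carry different decisions, so only objects 2 and 3 fall in a lower approximation.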
TL;DR: This paper introduces a formal model of privacy protection, called k-unlinkability, to prevent trail re-identification in distributed data, and guarantees that sensitive data trails are linkable to no fewer than k identities.
Abstract: In the past, data holders protected the privacy of their constituents by issuing separate disclosures of sensitive (e.g., DNA) and identifying data (e.g., names). However, individuals visit many places and their location-visit patterns, or "trails", can re-identify seemingly anonymous data. In this paper, we introduce a formal model of privacy protection, called k-unlinkability, to prevent trail re-identification in distributed data. The model guarantees that sensitive data trails are linkable to no fewer than k identities. We develop a graph-based model and illustrate how k-unlinkability is a more appropriate solution to this privacy problem compared to alternative privacy protection models.
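The guarantee stated above can be sketched as a simple check. This is only an illustration under a loud assumption: a sensitive trail is treated as linkable to an identity when every location in the sensitive trail also appears in that identity's trail. The abstract does not define linkage or the graph-based model, so the linkage rule, the function names, and the toy data are all hypothetical.

```python
def linkable_identities(sensitive_trail, identity_trails):
    """Identities whose visited locations cover the sensitive trail
    (assumed linkage rule: subset of locations)."""
    return {
        name
        for name, visited in identity_trails.items()
        if sensitive_trail <= visited
    }


def is_k_unlinkable(sensitive_trails, identity_trails, k):
    """True if every sensitive trail is linkable to at least k identities."""
    return all(
        len(linkable_identities(trail, identity_trails)) >= k
        for trail in sensitive_trails.values()
    )


# Toy data: locations are hospital codes, trails are sets of locations.
identity_trails = {
    "alice": {"H1", "H2"},
    "bob": {"H1", "H2", "H3"},
    "carol": {"H3"},
}
sensitive_trails = {
    "dna1": {"H1", "H2"},  # covered by alice and bob
    "dna2": {"H3"},        # covered by bob and carol
}

ok_for_2 = is_k_unlinkable(sensitive_trails, identity_trails, k=2)
ok_for_3 = is_k_unlinkable(sensitive_trails, identity_trails, k=3)

# A trail that only one identity covers would be re-identified under this rule.
unique_links = linkable_identities({"H1", "H2", "H3"}, identity_trails)
```

Under this linkage rule, releasing a sensitive trail covering all three locations would single out one identity, which is the kind of disclosure a k-unlinkability check is meant to flag.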
TL;DR: An efficient approach to the identification of similar subtrees of XML documents in a collection, relying on ad-hoc indexing structures; it supports different notions of similarity and can therefore be customized to different application domains.
Abstract: Due to the heterogeneous nature of XML data for internet applications, exact matching of queries is often inadequate. The need arises to quickly identify subtrees of XML documents in a collection that are similar to a given pattern. Similarity involves both tags, which are not required to coincide, and structure, in which not all the relationships among nodes in the tree structure are strictly preserved. In this paper we present an efficient approach to the identification of similar subtrees, relying on ad-hoc indexing structures. The approach makes it possible to quickly detect, in a heterogeneous document collection, the minimal portions that exhibit some similarity with the pattern. These candidate portions are then ranked according to their actual similarity. The approach supports different notions of similarity and can thus be customized to different application domains. In the paper, three different similarity measures are proposed and compared. The approach is experimentally validated and the experimental results are extensively discussed.