TL;DR: An indexing classification scheme is suggested, and some of the current trends in indexing methods, which indicate a clear shift towards hybrid indexing, are discussed.
Abstract: With the rapid emergence of XML as a data exchange standard over the Web, storing and querying XML data have become critical issues. The two main approaches to storing XML data are (1) to employ traditional storage such as relational database, object-oriented database and so on, and (2) to create an XML-specific native storage. The storage representation affects the efficiency of query processing. In this paper, firstly, we review the two approaches for storing XML data. Secondly, we review various query optimization techniques such as indexing, labeling and join algorithms to enhance query processing in both approaches. Next, we suggest an indexing classification scheme and discuss some of the current trends in indexing methods, which indicate a clear shift towards hybrid indexing.
TL;DR: This survey divides the existing approaches to keyword search on XML into several classes based on the problem they tackled, and performs a comprehensive analysis of these works.
Abstract: Keyword search is a user-friendly approach for users to retrieve information from XML data. Since an XML document can have a large size and contain a lot of information, an XML keyword search result should be a fragment of an XML document dynamically constructed at query time, which is achievable due to the structuredness of XML. Processing keyword searches on XML has several challenges, e.g., what are the elements in the XML document that are relevant to the query? How to generate the results efficiently and rank the results meaningfully? How to present the results to the user in a way such that the user can quickly find the desired information? In this survey, we review the papers in the literature that attempted to address these problems. We divide the existing approaches into several classes based on the problem they tackled, and perform a comprehensive analysis of these works.
TL;DR: A novel labeling scheme is introduced, called extended Dewey, which effectively extends the existing Dewey labeling scheme to combine the types and identifiers of elements in a label, and to avoid the scan of labels for internal query nodes to accelerate query processing (in I/O cost).
Abstract: Finding all the occurrences of a tree pattern in an XML database is a core operation for efficient evaluation of XML queries. The Dewey labeling scheme is commonly used to label an XML document to facilitate XML query processing by recording information on the path of an element. In order to improve the efficiency of XML tree pattern matching, we introduce a novel labeling scheme, called extended Dewey, which effectively extends the existing Dewey labeling scheme to combine the types and identifiers of elements in a label, and to avoid the scan of labels for internal query nodes to accelerate query processing (in I/O cost). Based on extended Dewey, we propose a series of holistic XML tree pattern matching algorithms. We first present TJFast to answer an XML twig pattern query. To efficiently answer a generalized XML tree pattern, we then propose GTJFast, an optimization that exploits the non-output nodes. In addition, we propose TJFastTL and GTJFastTL based on the tag+level data partition scheme to further reduce I/O costs by level pruning. Finally, we report our comprehensive experimental results to show that our set of XML tree pattern matching algorithms are superior to existing approaches in terms of the number of elements scanned, the size of intermediate results and query performance.
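The plain Dewey scheme that this abstract builds on can be illustrated with a minimal sketch. It assumes each label is the tuple of child ordinals on the path from the root; the helper names `dewey_labels`, `is_ancestor`, and `is_parent` are hypothetical. This is not extended Dewey or TJFast, only the basic path-prefix idea they start from.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Map each Dewey label (a tuple of child ordinals) to its element tag."""
    labels = {}

    def visit(elem, label):
        labels[label] = elem.tag
        # The i-th child extends its parent's label with the ordinal i.
        for i, child in enumerate(elem, start=1):
            visit(child, label + (i,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a, b):
    """a is the parent of b when a is b with its last component removed."""
    return len(a) + 1 == len(b) and b[:len(a)] == a


doc = ET.fromstring(
    "<bib><book><title/><author/></book><book><title/></book></bib>"
)
labels = dewey_labels(doc)
for label, tag in sorted(labels.items()):
    print(".".join(map(str, label)), tag)
```

Because a label encodes the whole root-to-node path, ancestor and parent tests need only the two labels and no access to the document itself.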
TL;DR: It is concluded that XML Schema validation with a hardened XML Schema is capable of fending XML Signature Wrapping attacks, but bears some pitfalls and disadvantages as well.
Abstract: In the context of security of Web Services, the XML Signature Wrapping attack technique has lately received increasing attention. Following a broad range of real-world exploits, general interest in applicable countermeasures rises. However, few approaches for countering these attacks have been investigated closely enough to make any claims about their effectiveness. In this paper, we analyze the effectiveness of the specific countermeasure of XML Schema validation in terms of fending Signature Wrapping attacks. We investigate the problems of XML Schema validation for Web Services messages, and discuss the approach of Schema Hardening, a technique for strengthening XML Schema declarations. We conclude that XML Schema validation with a hardened XML Schema is capable of fending XML Signature Wrapping attacks, but bears some pitfalls and disadvantages as well.
TL;DR: The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework.
Abstract: MapReduce/Hadoop has gained acceptance as a framework to process, transform, integrate, and analyze massive amounts of Web data on the Cloud. The MapReduce model (simple, fault tolerant, data parallelism on elastic clouds of commodity servers) is also attractive for processing enterprise and scientific data. Despite XML ubiquity, there is yet little support for XML processing on top of MapReduce. In this paper, we describe ChuQL, a MapReduce extension to XQuery, with its corresponding Hadoop implementation. The ChuQL language incorporates records to support the key/value data model of MapReduce, leverages higher-order functions to provide clean semantics, and exploits side-effects to fully expose to XQuery developers the Hadoop framework. The ChuQL implementation distributes computation to multiple XQuery engines, providing developers with an expressive language to describe tasks over big data.
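The key/value data model of MapReduce mentioned in this abstract can be sketched on a single machine in plain Python. This is not ChuQL, XQuery, or Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample documents are assumptions chosen only to show how XML input becomes key/value records that are grouped and reduced.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_phase(doc_id, xml_text):
    """Emit one (tag, 1) record for every element of one XML document."""
    for elem in ET.fromstring(xml_text).iter():
        yield elem.tag, 1


def shuffle(records):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single output record."""
    return key, sum(values)


def run_job(inputs):
    """Run map, shuffle, and reduce sequentially over {doc_id: xml_text}."""
    records = [
        record
        for doc_id, text in inputs.items()
        for record in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(records).items())


inputs = {
    "d1": "<log><hit/><hit/><miss/></log>",
    "d2": "<log><hit/></log>",
}
counts = run_job(inputs)
print(counts)
```

In a real deployment the map and reduce calls would run on different machines; the sketch only keeps the record flow visible.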
TL;DR: The aim of this paper is to show the implementation of the general approach transforming any XML Schemas into generated ontologies automatically using XSLT.
Abstract: Designing domain ontologies from scratch is a time-consuming process. In many cases, both the terminologies and the syntactic structures of domain data models are already described in form of XML Schemas. XSLT transformations are used to lift the syntactic level of XML documents to the semantic level of OWL ontologies by mapping any XML Schemas to generated ontologies automatically. Ontology engineers base domain ontologies on generated ontologies to enrich the information located in the XML schemas with additional domain specific semantic information. The aim of this paper is to show the implementation of the general approach transforming any XML Schemas into generated ontologies automatically using XSLT.
TL;DR: Evaluation results based on the dataset from the ISO/IEC standardization of the vehicle to grid communication interface (V2G CI) prove the applicability of the generated XML-based Web services of restricted devices in terms of message size, performance, and code footprint.
Abstract: Embedded network programming remains a highly complex task for developers since unique characteristics of such networks have to be faced: one of them is the communication between a diversity of resource-constrained nodes. Another one is the infrastructure dynamics. The widely used standardized Web service technologies would perfectly meet such unique characteristics and ease the development of applications. Such technologies that enable, e.g., requesting or subscribing to service data, however, usually process plain XML documents, which are not suitable for small embedded devices with very limited resources. This is due to XML's verbosity, its bandwidth usage, and its associated processing overhead. The paper addresses these issues and describes an innovative and optimized source code generation technique by means of W3C's Efficient XML Interchange (EXI) format for developing XML-based Web services for the embedded domain. This offers developers a seamless use of the widespread service protocols in the embedded domain as well. Evaluation results based on the dataset from the ISO/IEC standardization of the vehicle to grid communication interface (V2G CI) prove the applicability of the generated XML-based Web services of restricted devices in terms of message size, performance, and code footprint.
TL;DR: An overview is provided of the facilities of the XSUpdate language and of the Eχup system, the corresponding engine for processing schema modification and document adaptation statements.
Abstract: Data on the Web are mostly in XML format, and the need often arises to update their structure, commonly described by an XML Schema. When a schema is modified, the effects of the modification on documents need to be addressed. XSUpdate is a language for easily identifying parts of an XML Schema, applying a modification primitive to them, and finally defining an adaptation for associated documents, while Eχup is the corresponding engine for processing schema modification and document adaptation statements. The purpose of this demonstration is to provide an overview of the facilities of the XSUpdate language and of the Eχup system.
TL;DR: A system and method for dynamically retrieving, manipulating, updating, creating, and displaying data from sources of Extensible Markup Language (XML) documents is presented in this article.
Abstract: A system and method for dynamically retrieving, manipulating, updating, creating, and displaying data from sources of Extensible Markup Language (XML) documents. The program memory comprises system-user entered data definitions and business rules. The system imports XML document data into the system data definitions, processes the data using the business rules definitions and exports XML documents. The system can automatically create XML document formats from its data definitions and can automatically create its data definitions from XML document formats. The system-user can also define the mapping between XML document formats and the system data definitions. The system data definition is the combination of a Relational data model, an Object data model, and an XML data model.
TL;DR: The approach presented in this paper extends an existing XML conceptual model with the support for multiple versions of the model, and it is possible to define a set of changes between two versions of a schema.
Abstract: One of the key characteristics of XML applications is their dynamic nature. When a system grows and evolves, old user requirements change and/or new requirements accumulate. Apart from changes in the interface, it is also necessary to modify the existing documents with each new version, so they are valid against the new specification. The approach presented in this paper extends an existing XML conceptual model with the support for multiple versions of the model. Thanks to this extension, it is possible to define a set of changes between two versions of a schema. This work contains an outline of an algorithm that compares two versions of a schema and produces a revalidation script in XSL.
TL;DR: This study investigates the characteristics and challenges in building Open Geospatial Consortium Inc. (OGC) catalog service, and presents a general lightweight XML adapter for relational tables, followed by a general OGC catalog service solution based on this adapter.
TL;DR: A conceptual approach, and its implementation, to integrate external syntactic data representations with organizational internal semantic data representations by using the notion of heterogeneous mappings which are established between the two types of representations are presented.
Abstract: XML-based standards have been widely used to enable and ease Business-to-Business (B2B) integration. Examples of standards include cXML, CIDX and ebXML. While these XML-based standards are syntactic, contemporary organizations have available new means to structure their internal data representations using semantic descriptions, such as RDF(S) and OWL. This scenario poses an interesting challenge: "How to reconcile external XML-based standards and internal OWL-based representations in B2B integration scenarios?" In this paper, we present a conceptual approach, and its implementation, to integrate external syntactic data representations with organizational internal semantic data representations by using the notion of heterogeneous mappings which are established between the two types of representations. The application developed, B2BISS, enables an effective management of mappings. As the number of mappings stored in the repository increases over time, organizations can gradually rely on a semi-automatic to automatic B2B integration.
TL;DR: An overview on existing research related to XML document/grammar comparison is provided, presenting the background and discussing the various techniques related to the problem, as well as discussing some prominent application domains.
Abstract: XML document comparison is becoming an ever more popular research issue due to the increasingly abundant use of XML. Likewise, a growing interest fosters the development of XML grammar matching and comparison, due to the proliferation of heterogeneous XML data sources, particularly on the Web. Nonetheless, the process of comparing XML documents with XML grammars, i.e., XML document and grammar similarity evaluation, has not yet received the attention it deserves. In this paper, we provide an overview on existing research related to XML document/grammar comparison, presenting the background and discussing the various techniques related to the problem. We also discuss some prominent application domains, ranging over document classification and clustering, document transformation, grammar evolution, selective dissemination of XML information, XML querying, as well as alert filtering in intrusion detection systems and Web Services matching and communications.
TL;DR: This work presents an algebraic approach for propagating source updates to XML materialized views expressed in a powerful XML tree pattern formalism and highlights the benefits of this approach over existing algorithms through a series of experiments.
Abstract: Materialized views can bring important performance benefits when querying XML documents. In the presence of XML document changes, materialized views need to be updated to faithfully reflect the changed document. In this work, we present an algebraic approach for propagating source updates to XML materialized views expressed in a powerful XML tree pattern formalism. Our approach differs from the state of the art in the area in two important ways. First, it relies on set-oriented, algebraic operations, to be contrasted with node-based previous approaches. Second, it exploits state-of-the-art features of XML stores and XML query evaluation engines, notably XML structural identifiers and associated structural join algorithms. We present algorithms for determining how updates should be propagated to views, and highlight the benefits of our approach over existing algorithms through a series of experiments.
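The structural identifiers and structural joins this abstract refers to can be sketched as follows. The sketch assumes one common form of identifier, a (start, end) region assigned by a depth-first traversal, and a generic stack-based ancestor/descendant join; the names `Region`, `assign_regions`, and `structural_join` are hypothetical, and this is not the paper's view-maintenance algorithm.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

Region = namedtuple("Region", "tag start end")


def assign_regions(root):
    """Number elements on entry and exit; a contains d iff
    a.start < d.start and d.end < a.end."""
    regions = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(regions)
        regions.append(None)  # reserve the slot to keep document order
        for child in elem:
            visit(child)
        counter += 1
        regions[slot] = Region(elem.tag, start, counter)

    visit(root)
    return regions


def structural_join(ancestors, descendants):
    """Return (a, d) pairs where a contains d. Both inputs must be sorted
    by start and come from the same well-nested document."""
    result, stack, i = [], [], 0
    for d in descendants:
        # Push every candidate ancestor that opens before d.
        while i < len(ancestors) and ancestors[i].start < d.start:
            a = ancestors[i]
            while stack and stack[-1].end < a.start:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop candidates that closed before d opened.
        while stack and stack[-1].end < d.start:
            stack.pop()
        # Everything left on the stack is a nested chain containing d.
        result.extend((a, d) for a in stack)
    return result


doc = ET.fromstring(
    "<bib><book><title/><author/></book><book><title/></book></bib>"
)
regions = assign_regions(doc)
books = [r for r in regions if r.tag == "book"]
titles = [r for r in regions if r.tag == "title"]
pairs = structural_join(books, titles)
print([(a.start, d.start) for a, d in pairs])
```

The join reads each input list once and works on whole sets of identifiers, which is the set-oriented style the abstract contrasts with node-at-a-time processing.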
TL;DR: A system and method are presented for delivering content in real time using advanced messaging technology that reduces the risk of content being lost or dropped in transmission; a custom, simplified XML format delivers real-time textual, numeric, and metadata content directly to subscribers.
Abstract: A system and method for delivering content in real-time using advanced messaging technology that reduces the risk of content being lost or dropped in transmission. The system and method utilize a custom, simplified XML format to deliver real-time textual, numeric, and metadata content directly to subscribers. The XML tag set specifies all of the information needed to package, process, and distribute real-time content messages and includes an advanced tagging structure that allows granular content customization. Messages are built on the fly using multi-channel data processing techniques. The XML delivery system and method offers an array of real-time market-specific page-based “Alert” services and aggregated newswires with accompanying real-time numeric data feeds. These feeds contain proprietary assessments and other price data across a broad spectrum of global and regional commodity markets, including oil, petrochemicals, metals, electric power, natural gas, coal, and risk.
TL;DR: This paper designs the TwigTable algorithm to incorporate property and value information, which can be correctly discovered in any XML data, into query processing, and proposes three object-based optimization techniques for TwigTable.
Abstract: In this paper, we demonstrate how the semantic information, such as value, property, object class and relationship between object classes in XML data impacts XML query processing. We show that the lack of using semantics causes different problems in value management and content search in existing approaches. Motivated on solving these problems, we propose a semantic approach for XML twig pattern query processing. In particular, we design TwigTable algorithm to incorporate property and value information into query processing. This information can be correctly discovered in any XML data. In addition, we propose three object-based optimization techniques to TwigTable. If more semantics of object classes are known in an XML document, we can process queries more efficiently with these semantic optimizations. Last, we show the benefits of our approach by a comprehensive experimental study.
TL;DR: This work focuses on XML data integration by studying rewritings of XML target schemas in terms of source schemas, and considers Visibly pushdown Automata (VPAs), which accept Visibly Pushdown Languages (VPLs), which are the basis of formalisms for specifying XML schemas.
TL;DR: In this paper, a hybrid navigation/streaming format for XML documents is proposed to allow efficient storage and processing of queries on the XML data that provides the benefits of both navigation and streaming and ameliorates the disadvantages of each.
Abstract: A method for storing XML documents in a hybrid navigation/streaming format is provided to allow efficient storage and processing of queries on the XML data that provides the benefits of both navigation and streaming and ameliorates the disadvantages of each. Each XML document to be stored is independently analyzed to determine a combination of navigable and streamable storage format that optimizes the processing of the data for anticipated access patterns.
TL;DR: In this paper, an XML template having one or more nodes is received and mapping information indicating an association of data and nodes of the uploaded XML template is obtained. Once the mapping is received, the structure of the XML template is determined.
Abstract: An XML template having one or more nodes is received. Mapping information indicating an association of data and nodes of the uploaded XML template is obtained. Once the mapping is received, the structure of the XML template is determined. Based on the determined structure and the mapping provided, an XML based SQL query is generated. The generated SQL query can be executed to provide the XML document.
TL;DR: A new mapping method is developed to overcome the limitations of existing methods and is shown to be efficient in terms of removing relation redundancy.
Abstract: The eXtensible Markup Language (XML) has recently emerged as a standard for data representation and interchange on the web. Given its popularity across applications, storing and querying XML data have become critical issues in exploiting the full power of this technology. Since relational databases are a widely used technology for storing and querying data, replacing them with pure XML databases is not a good choice and is a very expensive process. It is thus crucial to map XML data into relational data, and this process occurs frequently. Many mapping methods exist in the literature, so determining which one is best is important. The intention of this paper is to examine the existing mapping methods in terms of generating a good relational schema. Finally, a new mapping method is developed to overcome their limitations, and it is shown to be efficient in terms of removing relation redundancy.
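To make the idea of mapping XML data into relational data concrete, here is a minimal sketch of one simple, generic mapping: a single edge table with one row per element. It is not the new method this abstract proposes; the table layout, the column names, and the `shred` helper are assumptions for illustration, using Python's standard `sqlite3` and `xml.etree.ElementTree` modules.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store one XML document as rows of an edge table (one row per element).
    Ids restart on every call, so this sketch handles a single document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edge ("
        "id INTEGER PRIMARY KEY, parent INTEGER, pos INTEGER, "
        "tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id, pos):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        text = (elem.text or "").strip() or None
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (node_id, parent_id, pos, elem.tag, text),
        )
        # pos records sibling order so the document can be rebuilt.
        for i, child in enumerate(elem, start=1):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 1)


conn = sqlite3.connect(":memory:")
shred(
    "<bib><book><title>XML</title></book>"
    "<book><title>SQL</title></book></bib>",
    conn,
)
# The path book/title becomes a self-join on the parent column.
titles = [
    row[0]
    for row in conn.execute(
        "SELECT t.text FROM edge AS b JOIN edge AS t ON t.parent = b.id "
        "WHERE b.tag = 'book' AND t.tag = 'title' ORDER BY t.id"
    )
]
print(titles)
```

An edge table works for any document without a schema, but every path step costs a join, which is one reason schema-aware mappings that produce fewer, wider relations are studied.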
TL;DR: A unified definition is presented, the key properties including validation of XML graphs against different XML schema languages are outlined, and a software package is provided that enables others to make use of these ideas.
TL;DR: This paper proposes a mechanism for generating the relational schema from a set of integrated XML files, which includes defining a set of mapping rules from the OWL (Ontology Web Language) ontology to the relational format.
Abstract: Many applications require storing XML data, which can be achieved by using a relational database (RDB). In order to accomplish that, we need a set of transformation rules that maps the XML structure to a collection of relations. However, XML files from the same application domain might have different structures, making the mapping process to a unique relational schema more difficult. To overcome this, we can previously generate an integrated schema that represents the individual XML structures, and then map it to the relational format. Afterwards, the original XML files are stored into the database. In our proposal, the integrated schema is represented as an ontology. In this paper, we propose a mechanism for generating the relational schema from a set of integrated XML files, which includes defining a set of mapping rules from the OWL (Ontology Web Language) ontology to the relational format. The mapping process is implemented in OntoRel tool.
TL;DR: It is shown that Visibly Pushdown Languages are closed under the defined language operators and this enables us to expand the schemas (for XML) in order to account for flexible or constrained evolution.
TL;DR: A notion of compactness is formally defined which allows for comparing documents, and the update-based method is shown to produce time-stamped XML documents that are more satisfactory with respect to space-efficiency than the general method.
Abstract: The management of temporal data is a crucial issue in many applications. Recently, XML has become the standard for data exchange and representation. Consequently, important efforts have been made on the development of temporal extensions for XML. This paper investigates how to generate or maintain space-efficient time-stamped documents. We formally define a notion of compactness which allows for comparing documents. Then, we present two methods. For the first one, called general method, no restriction is made on the evolution of the XML documents whereas for the second one, called update-based method, changes are assumed to be specified by updates. For both methods, the issue is to enable processing very large documents, to use existing engines and to comply with the XQuery Update Facility. The two methods are compared in terms of space-efficiency. The update-based method produces time-stamped XML documents that are more satisfactory with respect to space-efficiency than the general method. This goes to show that the update-based method effectively takes advantage of the updates.
TL;DR: A new XML clustering algorithm that relies solely on document structure and the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups is put forward.
Abstract: With the vastly growing data resources on the Internet, XML is one of the most important standards for document management. Not only does it provide enhancements to document exchange and storage, but it is also helpful in a variety of information retrieval tasks. Document clustering is one of the most interesting research areas that utilize XML's semi-structural nature. In this paper, we put forward a new XML clustering algorithm that relies solely on document structure. We propose the use of maximal frequent subtrees and an operator called Satisfy/Violate to divide documents into groups. The algorithm is experimentally evaluated on real and synthetic data sets with promising results.
TL;DR: A conceptual model for XML data is exploited to generate SAWSDL-enriched XML schemas and, mainly, to automatically generate the so-called lifting and lowering schema mappings in the form of XSLT scripts.
Abstract: With the introduction of the SAWSDL W3C recommendation, the possibility of enriching web service interfaces with semantic model references surfaced as a foundation for semantic web services. However, the recommendation says neither what the semantic model should be nor what to do with the actual XML data. In this paper, we exploit our conceptual model for XML data to generate SAWSDL-enriched XML schemas, but mainly to automatically generate the so-called lifting and lowering schema mappings in the form of XSLT scripts. These scripts can be used to transform the XML data produced by the web service into RDF data (lifting) and vice versa (lowering). In the RDF state, the data can be manipulated using the knowledge given by a corresponding ontology mapped to our model. The reasoning power granted by the ontology description can also be exploited.
TL;DR: This work has developed a parallel simplified XPath language using Compute Unified Device Architecture (CUDA) on GPU, and evaluates the model on a recent NVIDIA GPU in comparison with its counterpart on eight-core CPU.
Abstract: As XML is playing a crucial role in web services, databases, and document processing, efficient processing of XML queries has become an important issue. On the other hand, due to the increasing number of users, high throughput of XML queries is also required to execute tens of thousands of queries in a short time. Given the great success of GPGPU (General-Purpose computations on the Graphics Processors), we propose a parallel XML query model based on GPU, which mainly consists of two efficient task distribution strategies, to improve the efficiency and throughput of XML queries. We have developed a parallel simplified XPath language using Compute Unified Device Architecture (CUDA) on GPU, and evaluate our model on a recent NVIDIA GPU in comparison with its counterpart on eight-core CPU. The experiment results show that our model achieves both higher throughput and efficiency than CPU-based XML query.
TL;DR: This work isolates a set of five requirements that must be fulfilled in order to have a faithful representation of the XML data-exchange problem by a relational translation, and demonstrates that these requirements naturally suggest the inlining technique for data-exchange tasks.
Abstract: We consider data exchange for XML documents: given source and target schemas, a mapping between them, and a document conforming to the source schema, construct a target document and answer target queries in a way that is consistent with source information. The problem has primarily been studied in the relational context, in which data-exchange systems have also been built. Since many XML documents are stored in relations, it is natural to consider using a relational system for XML data exchange. However, there is a complexity mismatch between query answering in relational and XML data exchange, which indicates that restrictions have to be imposed on XML schemas and mappings, and on XML shredding schemes, to make the use of relational systems possible. We isolate a set of five requirements that must be fulfilled in order to have a faithful representation of the XML data-exchange problem by a relational translation. We then demonstrate that these requirements naturally suggest the inlining technique for data-exchange tasks. Our key contribution is to provide shredding algorithms for schemas, documents, mappings and queries, and demonstrate that they enable us to correctly perform XML data-exchange tasks using a relational system.