TL;DR: This survey reviews two classes of major XML query processing techniques for twig pattern matching: the relational approach and the native approach. It suggests that integrating the two could result in higher query processing performance and significantly reduce system reengineering costs.
Abstract: Extensible markup language (XML) is emerging as a de facto standard for information exchange among various applications on the World Wide Web. There has been a growing need for developing high-performance techniques to query large XML data repositories efficiently. One important problem in XML query processing is twig pattern matching, that is, finding in an XML data tree D all matches that satisfy a specified twig (or path) query pattern Q. In this survey, we review, classify, and compare major techniques for twig pattern matching. Specifically, we consider two classes of major XML query processing techniques: the relational approach and the native approach. The relational approach directly utilizes existing relational database systems to store and query XML data, which enables the use of all important techniques that have been developed for relational databases, whereas in the native approach, specialized storage and query processing systems tailored for XML data are developed from scratch to further improve XML query performance. As implied by existing work, XML data querying and management are developing in the direction of integrating the relational approach with the native approach, which could result in higher query processing performance and also significantly reduce system reengineering costs.
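To make the twig pattern matching problem concrete, here is a minimal sketch of a naive matcher. It is not one of the surveyed relational or native algorithms; the `TwigNode` class and the function names are hypothetical, and the matcher returns only the elements that match the twig root rather than full match tuples.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class TwigNode:
    """One node of a twig query (hypothetical helper type)."""
    tag: str                 # element name to match
    axis: str = "/"          # edge from the parent query node: "/" child, "//" descendant
    children: list = field(default_factory=list)


def candidates(elem, axis):
    """Elements reachable from elem over a child or descendant edge."""
    if axis == "/":
        return list(elem)
    it = elem.iter()
    next(it)                 # iter() yields elem itself first; skip it
    return list(it)


def matches(elem, q):
    """True if the subtree rooted at elem embeds the twig rooted at q."""
    if elem.tag != q.tag:
        return False
    return all(
        any(matches(c, qc) for c in candidates(elem, qc.axis))
        for qc in q.children
    )


def twig_match(root, q):
    """All elements of the data tree D that match the root of twig Q."""
    return [e for e in root.iter() if matches(e, q)]
```

For example, the query `book[title]//name` can be written as `TwigNode("book", children=[TwigNode("title"), TwigNode("name", "//")])`. This brute-force recursion revisits subtrees many times, which is the kind of cost that specialized twig matching techniques aim to avoid.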
TL;DR: The authors propose an approach to enabling dynamic VoiceXML in an X+V page of a multimodal application that operates in a multimodal browser on a multimodal device supporting multiple modes of interaction, including a voice mode and one or more non-voice modes.
Abstract: Enabling dynamic VoiceXML in an X+V page of a multimodal application implemented with the multimodal application operating in a multimodal browser on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to a VoiceXML interpreter, including representing by the multimodal browser an XML element of a VoiceXML dialog of the X+V page as an ECMAScript object, the XML element comprising XML content; storing by the multimodal browser the XML content of the XML element in an attribute of the ECMAScript object; and accessing the XML content of the XML element in the attribute of the ECMAScript object from an ECMAScript script in the X+V page.
TL;DR: A theoretically complete algorithm is provided that always infers the correct XSD when a sufficiently large corpus of XML documents is available and a variant of this algorithm is presented that works well on real-world data sets.
Abstract: Although the presence of a schema enables many optimizations for operations on XML documents, recent studies have shown that many XML documents in practice either do not refer to a schema, or refer to a syntactically incorrect one. It is therefore of utmost importance to provide tools and techniques that can automatically generate schemas from sets of sample documents. While previous work in this area has mostly focused on the inference of Document Type Definitions (DTDs for short), we will consider the inference of XML Schema Definitions (XSDs for short) --- the increasingly popular schema formalism that is turning DTDs obsolete. In contrast to DTDs where the content model of an element depends only on the element's name, the content model in an XSD can also depend on the context in which the element is used. Hence, while the inference of DTDs basically reduces to the inference of regular expressions from sets of sample strings, the inference of XSDs also entails identifying from a corpus of sample documents the contexts in which elements bear different content models. Since a seminal result by Gold implies that no inference algorithm can learn the complete class of XSDs from positive examples only, we focus on a class of XSDs that captures most XSDs occurring in practice. For this class, we provide a theoretically complete algorithm that always infers the correct XSD when a sufficiently large corpus of XML documents is available. In addition, we present a variant of this algorithm that works well on real-world (and therefore incomplete) data sets.
TL;DR: A device control system is described that includes at least one device operable by the system, a processor, and two or more communication components, each of which includes an XML parser for parsing the XML document and extracting the message data.
Abstract: A device control system including at least one device operable by the system, at least one processor, software executing on the at least one processor for receiving message data and determining a corresponding XML document type, software executing on the at least one processor for generating a XML document based on the XML document type, the XML document including the message data, software executing on the processor for packetizing the XML document, and two or more communication components, each communication component including an XML parser for parsing the XML document and extracting the message data.
TL;DR: This paper evaluates existing range-based and prefix-based labeling schemes and then proposes its own scheme based on DeweyIDs. It experimentally explores the scheme's suitability as a general and immutable node labeling mechanism, stresses its synergetic potential for query processing and locking, and shows how it can be implemented efficiently.
Abstract: We explore suitable node labeling schemes used in collaborative XML DBMSs (XDBMSs, for short) supporting typical XML document processing interfaces. Such schemes have to provide holistic support for essential XDBMS processing steps for declarative as well as navigational query processing and, with the same importance, lock management. In this paper, we evaluate existing range-based and prefix-based labeling schemes, before we propose our own scheme based on DeweyIDs. We experimentally explore its suitability as a general and immutable node labeling mechanism, stress its synergetic potential for query processing and locking, and show how it can be implemented efficiently. Various compression and optimization measures deliver surprising space reductions, frequently reducing the size of the storage representation (compared to an already space-efficient encoding scheme) to less than 20-30% on average, which underlines their practical relevance.
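The appeal of prefix-based labels for both query processing and locking can be illustrated with a small sketch. It assumes plain dotted Dewey labels such as `1.3.5`; the paper's actual DeweyID encoding, its handling of insertions, and its compression are not modeled here, and all function names are hypothetical.

```python
def parse_dewey(label):
    """Turn a dotted label such as '1.3.5' into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


def ancestors(a):
    """All ancestor labels of a, root first, derived from the label alone.

    No document access is needed, which is what makes such labels
    attractive for placing locks along the ancestor path.
    """
    return [a[:i] for i in range(1, len(a))]


def level(a):
    """Depth of the node; the root has level 1."""
    return len(a)


def document_order(labels):
    """Sort dotted labels into document (preorder) order.

    Tuple comparison is lexicographic, so a prefix sorts before its
    extensions and siblings sort by their last component.
    """
    return sorted(labels, key=parse_dewey)
```

Comparing the parsed tuples rather than the raw strings matters: as strings, `"1.10"` would sort before `"1.9"`.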
TL;DR: An application program interface (API) is provided for requesting, storing, and accessing data within a health integration network. It facilitates secure and seamless access to the centrally stored data by offering authentication/authorization, as well as the ability to receive requests in an extensible language format, such as XML, and to return the resulting data in XML format.
Abstract: An application program interface (API) is provided for requesting, storing, and otherwise accessing data within a health integration network. The API facilitates secure and seamless access to the centrally stored data by offering authentication/authorization, as well as the ability to receive requests in an extensible language format, such as XML, and to return the resulting data in XML format. The data can also have transformation, style and/or schema information associated with it, which can be returned in the resulting XML and/or applied to the data beforehand by the API. The API can be utilized in many environment architectures, including XML over HTTP and a software development kit (SDK).
TL;DR: A lossless schema mapping algorithm to generate a database schema from a DTD, which makes several improvements over existing algorithms, and two linear data mapping algorithms based on DOM and SAX, respectively, to map ordered XML data to relational data are proposed.
TL;DR: This paper has developed an application-oriented and domain-specific benchmark called "Transaction Processing over XML" (TPoX), which exercises all aspects of XML databases, including storage, indexing, logging, transaction processing, and concurrency control.
Abstract: XML database functionality has been emerging in "XML-only" databases as well as in the major relational database products. Yet, there is no industry standard XML database benchmark to evaluate alternative implementations. The research community has proposed several benchmarks which are all useful in their respective scope, such as evaluating XQuery processors. However, they do not aim to evaluate a database system in its entirety and do not represent all relevant characteristics of a real-world XML application. Often they only define read-only single-user tests on a single XML document. We have developed an application-oriented and domain-specific benchmark called "Transaction Processing over XML" (TPoX). It exercises all aspects of XML databases, including storage, indexing, logging, transaction processing, and concurrency control. Based on our analysis of real XML applications, TPoX simulates a financial multi-user workload with XML data conforming to the FIXML standard. In this paper we describe TPoX and present early performance results. We also make its implementation publicly available.
TL;DR: The ability to compute XML key propagation is a first step toward establishing a connection between XML data and its relational representation at the semantic level.
TL;DR: This paper proposes two O(|D||Q|)-time stream-querying algorithms, LQ and EQ, which are based on the lazy strategy and on the eager strategy, respectively, and are the first XPath stream-querying algorithms that achieve O(|D||Q|) time performance.
Abstract: In this paper we address the problem of evaluating XPath queries over streaming XML data. We consider a practical XPath fragment called Univariate XPath, which includes the commonly used '/' and '//' axes and allows *-node tests and arbitrarily nested predicates. It is well known that this XPath fragment can be efficiently evaluated in O(|D||Q|) time in the non-streaming environment, where |D| is the document size and |Q| is the query size. However, this is not necessarily true in the streaming environment, since streaming algorithms have to satisfy stricter requirements than non-streaming algorithms, in that all data must be read sequentially in one pass. Therefore, it is not surprising that state-of-the-art stream-querying algorithms have higher time complexity than O(|D||Q|). In this paper we revisit the XPath stream-querying problem, and show that Univariate XPath can be efficiently evaluated in O(|D||Q|) time in the streaming environment. Specifically, we propose two O(|D||Q|)-time stream-querying algorithms, LQ and EQ, which are based on the lazy strategy and on the eager strategy, respectively. To the best of our knowledge, LQ and EQ are the first XPath stream-querying algorithms that achieve O(|D||Q|) time performance. Further, our algorithms achieve O(|D||Q|) time performance without trading off space performance. Instead, they have better buffering-space performance than state-of-the-art stream-querying algorithms. In particular, EQ achieves optimal buffering-space performance. Our experimental results show that our algorithms have not only good theoretical complexity but also considerable practical performance advantages over existing algorithms.
TL;DR: The core part of the paper describes the infrastructural services for XML document storage with compressed DeweyIDs, the principles and methods for navigational and declarative processing of queries, as well as the lock modes and protocols to enable efficient collaboration.
Abstract: Implementation techniques for relational database management systems (DBMSs) have proven their efficiency and robustness in many existing systems. However, many of these concepts and mechanisms cannot be used when implementing a native XML DBMS (XDBMS) because of substantial differences in the processing properties of natively stored XML documents as compared to relational tables. Therefore, we have to develop new and appropriate techniques with ACID transaction guarantees tailored to the processing characteristics of tree documents and the operations on them. For this reason, we want to provide for an efficient infrastructure of XDBMSs consisting of tree node addressing and indexing together with fine-grained locking of tree nodes. In this respect, our prime and novel contribution is to reveal the potential of our prefix-based node labeling called DeweyIDs supporting record addressing, indexing, and locking protocols. In this paper, we first sketch our version of prefix-based node labeling and summarize a quantitative study on them. An overview of our layered XDBMS architecture indicates the concepts and functionalities to be reused from relational DBMS implementations. The core part of the paper describes the infrastructural services for XML document storage with compressed DeweyIDs, the principles and methods for navigational and declarative processing of queries, as well as the lock modes and protocols to enable efficient collaboration. Selected empirical experiments evaluate the XTC system performance and support our system assessment.
TL;DR: This paper proposes the Probabilistic Interval XML (PIXML for short) data model, and provides an operational semantics that may be used to compute answers to queries and that is correct for a large class of probabilistic instances.
Abstract: Interest in XML databases has been expanding rapidly over the last few years. In this paper, we study the problem of incorporating probabilistic information into XML databases. We propose the Probabilistic Interval XML (PIXML for short) data model in this paper. Using this data model, users can express probabilistic information within XML markups. In addition, we provide two alternative formal model-theoretic semantics for PIXML data. The first semantics is a “global” semantics which is relatively intuitive, but is not directly amenable to computation. The second semantics is a “local” semantics which supports efficient computation. We prove several correspondence results between the two semantics. To our knowledge, this is the first formal model theoretic semantics for probabilistic interval XML. We then provide an operational semantics that may be used to compute answers to queries and that is correct for a large class of probabilistic instances.
TL;DR: Instead of extracting copies of nodes from XML data, reference-based SQL/XML operators return a reference to a node, which can be used to determine whether that node comes logically before, after, or is the same as another node.
Abstract: Techniques for processing reference-based SQL/XML operators are provided. Instead of extracting copies of one or more nodes from XML data, a reference-based operator returns a reference to a node. Such a reference is used to determine, for example, whether the corresponding node comes logically before, after, or is the same as another node. An SQL/XML query that includes a reference-based operator may be the original query, or may be generated (e.g., rewritten) from a non-SQL/XML query, such as an XQuery query. One or more physical rewrites may be performed on the SQL/XML query, depending on how the XML data is stored and/or whether an XML index exists for the XML data.
TL;DR: Computer-implemented methods and computer-readable storage media are disclosed for facilitating browser-based, what-you-see-is-what-you-get (WYSIWYG) editing of an extensible markup language (XML) file.
Abstract: Computer-implemented methods and computer-readable storage media are disclosed for facilitating browser-based, what-you-see-is-what-you-get (WYSIWYG) editing of an extensible markup language (XML) file. A browser executing on a local computing system is used to access a hypertext markup language (HTML) representation of an extensible markup language (XML) file. The HTML representation includes a plurality of elements of the XML file formatted in accordance with an extensible stylesheet language (XSL) transform associated with the XML file. A plurality of editing handlers is inserted within the HTML representation to facilitate modifying the HTML representation and applying the changes to the XML file. A user is permitted to modify the HTML representation for purposes of applying the modifications to the XML file.
TL;DR: A stealing-based dynamic load-balancing mechanism, called ThreadCrew, is presented, by which multiple threads are able to process disjoint parts of the XML document in parallel with balanced load distribution; a novel mechanism to trace the stealing actions is also provided.
Abstract: A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has been regarded as the performance bottleneck in most systems and applications. On the other hand, the multicore processor, which emerged as a solution to the clock-speed limitation of modern CPUs, has become increasingly prevalent. Leveraging the parallelism provided by multicore resources to speed up software execution is becoming the trend in software development. In this paper, we present a parallel processing model for the XML document. The model is not designed just for a specific XML processing task; instead, it is a general model with which we are able to explore various kinds of parallel XML document processing. The kernel of the model is a stealing-based dynamic load-balancing mechanism, called ThreadCrew, by which multiple threads are able to process disjoint parts of the XML document in parallel with balanced load distribution. The model also provides a novel mechanism to trace the stealing actions, so that the equivalent sequential result can be obtained by gluing the multiple parallel-running results together. To show the feasibility and effectiveness of our approaches, we present our C# implementation of parallel XML serialization in this paper. Our empirical study shows that our parallel XML serialization algorithm can improve XML serialization performance significantly on a multicore machine.
TL;DR: ViST is a novel index structure for searching XML documents that uses tree structures as the basic unit of query to avoid expensive join operations. It provides a unified index on both the content and the structure of XML documents, which gives it a performance advantage over methods that index only content or only structure.
Abstract: The present invention provides a ViST (or “virtual suffix tree”), which is a novel index structure for searching XML documents. By representing both XML documents and XML queries in structure-encoded sequences, it is shown that querying XML data is equivalent to finding (non-contiguous) subsequence matches. A variety of XML queries, including those with branches, or wild-cards (‘*’ and ‘//’), can be expressed by structure-encoded sequences. Unlike index methods that disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers, ViST uses tree structures as the basic unit of query to avoid expensive join operations. Furthermore, ViST provides a unified index on both content and structure of the XML documents, hence it has a performance advantage over methods indexing either just content or structure. ViST supports dynamic index update, and it relies solely on B+Trees without using any specialized data structures that are not well supported by common database management systems (hereinafter referred to as “DBMSs”).
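The idea of querying by non-contiguous subsequence matching can be sketched as follows. This is only an illustration under simplifying assumptions: each node is encoded as a (symbol, prefix) pair in preorder, text values are used directly as symbols, wildcards are not handled, and neither the virtual suffix tree nor the B+Tree index is built. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode(elem, prefix=()):
    """Preorder list of (symbol, prefix) pairs for the subtree at elem.

    prefix is the tuple of tags on the path from the root down to the
    parent of the symbol. Text content becomes its own symbol whose
    prefix ends with the enclosing element's tag.
    """
    path = prefix + (elem.tag,)
    seq = [(elem.tag, prefix)]
    text = (elem.text or "").strip()
    if text:
        seq.append((text, path))
    for child in elem:
        seq.extend(encode(child, path))
    return seq


def is_subsequence(query_seq, doc_seq):
    """True if query_seq occurs in doc_seq in order, gaps allowed."""
    it = iter(doc_seq)
    # Each any() call resumes scanning where the previous one stopped.
    return all(any(q == d for d in it) for q in query_seq)


def matches(doc_xml, query_xml):
    """Encode both the document and the query, then test for a subsequence match."""
    doc_seq = encode(ET.fromstring(doc_xml))
    query_seq = encode(ET.fromstring(query_xml))
    return is_subsequence(query_seq, doc_seq)
```

Both the document and the query go through the same encoding, so no per-branch sub-queries or joins are needed. Plain subsequence matching on its own may accept sequences that are not true structural matches when sibling subtrees share tags, so a real index needs additional checks beyond this sketch.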
TL;DR: In this paper, the system provides a Wrapper class for the XML Document class and the Element class, which can be used to access external components as required by a user application.
Abstract: Systems and methods for loading XML documents on demand are described. The system provides a Wrapper class for the XML Document class and the Element class. A user application then utilizes the Wrapper class in the same way that the Element class and Document class would be used to access any element in the XML Document. The Wrapper class loads external components as required. The external component retrieval is completely transparent to the user application and the user application is able to access the entire XML document as if it were completely loaded into a DOM object in memory. Accordingly, each element is accessible in a random manner. In one configuration, the XML document components or external components are stored in a database in a BLOB field as a Digital Document. The system uses external components to efficiently use resources as compared to systems using Xlink and external entities.
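A minimal sketch of the on-demand loading idea, assuming external components are marked in the parent document by a placeholder element that carries a reference. The placeholder tag `external`, its `ref` attribute, and the `ComponentStore` class (standing in for the database BLOB field) are all hypothetical and not taken from the source.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "external"   # hypothetical tag marking a not-yet-loaded component


class ComponentStore:
    """Stand-in for the database holding external components as BLOBs."""

    def __init__(self, blobs):
        self.blobs = blobs      # maps component reference -> XML text
        self.fetches = []       # records which components were actually loaded

    def fetch(self, ref):
        self.fetches.append(ref)
        return self.blobs[ref]


class LazyElement:
    """Wrapper used like an element; loads external components on access."""

    def __init__(self, elem, store):
        self._elem = elem
        self._store = store

    @property
    def tag(self):
        return self._elem.tag

    @property
    def text(self):
        return self._elem.text

    def children(self):
        """Yield wrapped children, swapping placeholders for real content."""
        for i, child in enumerate(list(self._elem)):
            if child.tag == PLACEHOLDER:
                loaded = ET.fromstring(self._store.fetch(child.get("ref")))
                self._elem[i] = loaded      # replace placeholder in the tree
                child = loaded
            yield LazyElement(child, self._store)

    def find(self, tag):
        """First child with the given tag, or None."""
        for child in self.children():
            if child.tag == tag:
                return child
        return None
```

The caller navigates with `find` and `children` without knowing which parts were stored externally. Because a loaded component replaces its placeholder in the tree, each component is fetched at most once.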
TL;DR: A system for providing XML-based asynchronous and interactive feeds for Web applications is described; it offers a highly efficient and extensible XML JavaScript framework that allows easy insertion of a comment/news feed control into any Web page.
Abstract: A system for providing XML-based asynchronous and interactive feeds for Web applications that provides a highly efficient and extensible XML Javascript framework allowing easy insertion of a comment/news feed control into any Web page. The framework allows for reading of any XML format and provides a new and easy way for modifying the look-and-feel of the control via HTML templates with familiar XPath bindings. The rendering performed through the system supports both flat and indented (“threaded”) views for a comment thread. The system improves the parsing speed of incoming XML, and supports a flexible event model for others to develop plug-ins and mashups in the spirit of Web 2.0.
TL;DR: This paper presents a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data.
Abstract: Clio is an existing schema-mapping tool that provides user-friendly means to manage and facilitate the complex task of transformation and integration of heterogeneous data such as XML over the Web or in XML databases. By means of mappings from source to target schemas, Clio can help users conveniently establish the precise semantics of data transformation and integration. In this paper we study the problem of how to efficiently implement such data transformation (i.e., generating target data from the source data based on schema mappings). We present a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and discuss methodologies and algorithms for implementing these phases. In particular, we elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data (including duplicate elimination). We compare our transformation framework with alternative methods such as using XQuery or SQL/XML provided by current commercial databases. The results demonstrate that the three-phase framework, simple as it is, is highly scalable and outperforms the alternative methods by orders of magnitude.
TL;DR: This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time, and features a small memory footprint that increases cache locality, whilst still supporting standard APIs and necessary database operations, such as queries and updates, efficiently.
Abstract: As XML database sizes grow, the amount of space used for storing the data and auxiliary data structures becomes a major factor in query and update performance. This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information theoretic minimum, while insertions and deletions can be performed in near constant time as well. As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient.
TL;DR: A system and method are described for partitioning XML-based content into fragments, generating transport packets that encapsulate the fragments, and streaming the encapsulated fragments to a receiver, such as a mobile device.
Abstract: A system and method for partitioning XML-based content into fragments, where transport packets are generated for encapsulating the fragments and streaming the encapsulated fragments to a receiver, such as a mobile device. Fragmentation of the XML-based content can be performed either with or without regard for any underlying XML syntax or structure. In either case, certain relevant fragmentation information is encapsulated with the fragmented XML-based content in the transport packets that allow for various reconstruction, error concealment, and retransmission schemes for presenting the streamed XML-based content on/to the receiver.
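As an illustration of fragmentation without regard for XML syntax, the sketch below splits the serialized content at fixed byte offsets and attaches fragmentation information to each packet. The packet fields (`doc_id`, `seq`, `total`) are assumptions chosen for the example and are not the header layout of the described system.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Transport packet carrying one fragment plus fragmentation info."""
    doc_id: int      # identifies which XML content the fragment belongs to
    seq: int         # position of this fragment, starting at 0
    total: int       # total number of fragments for this content
    payload: bytes


def fragment(xml_bytes, doc_id, max_payload):
    """Split content at byte offsets, ignoring XML structure."""
    chunks = [
        xml_bytes[i:i + max_payload]
        for i in range(0, len(xml_bytes), max_payload)
    ] or [b""]
    return [Packet(doc_id, i, len(chunks), c) for i, c in enumerate(chunks)]


def reassemble(packets):
    """Return (content, missing_seqs) from packets received in any order.

    content is None while fragments are missing; the list of missing
    sequence numbers is what a retransmission request would carry.
    """
    if not packets:
        return None, []
    total = packets[0].total
    by_seq = {p.seq: p.payload for p in packets}
    missing = [i for i in range(total) if i not in by_seq]
    if missing:
        return None, missing
    return b"".join(by_seq[i] for i in range(total)), []
```

Since the split ignores element boundaries, a single fragment is generally not well-formed XML by itself. A structure-aware variant would cut at element boundaries instead, so that a receiver could present or conceal individual fragments independently.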
TL;DR: A taxonomy of changes for XML schema evolution is described and guidelines for writing queries in such a way that they continue to operate as expected across evolving schemas are proposed.
Abstract: In XML databases, new schema versions may be released as frequently as once every two weeks. This poster describes a taxonomy of changes for XML schema evolution. It examines the impact of those changes on schema validation and query evaluation. Based on that study, it proposes guidelines for XML schema evolution and for writing queries in such a way that they continue to operate as expected across evolving schemas.
TL;DR: This research develops a space efficient DOM parser, called SEDOM, based on a new compression approach and a set of manipulation algorithms, which enable many DOM operations to be performed when the data are in the compressed format, and allow individual parts of a document to be compressed, decompressed and manipulated.
Abstract: In many XML applications, parsing is a key operation. When the processing involves modifying data, accessing it randomly, and/or traversing it in an order different from the one in which elements are stored, a DOM parser has to be used. A major problem with using a DOM parser is memory consumption. The size of a DOM tree created from an XML document may be as large as 10 times the size of the original document. Maintaining the tree of a big document requires a large amount of memory. It may cause costly swapping. In the worst cases, a DOM parser cannot handle a document at all because of its size. In this research, we develop a space efficient DOM parser, called SEDOM. It is based on a new compression approach and a set of manipulation algorithms, which enable many DOM operations to be performed when the data are in the compressed format, and allow individual parts of a document to be compressed, decompressed and manipulated. It can be used to efficiently manipulate very large XML documents. In this paper, we describe SEDOM, and compare its performance with three existing DOM parsers and an XML compressor.
TL;DR: This chapter discusses and compares the most relevant similarity measures and their employment for XML document clustering and compares link-based similarity approaches developed for Web data clustering for XML documents.
TL;DR: The authors present a method and system for performing a limited set of operations on data (addition, subtraction, multiplication, and division) using XML streams that conform to an XML schema.
Abstract: The present invention provides a method and system for performing operations on data using XML streams. An XML schema defines a limited set of operations that may be performed on data. These operations include addition, subtraction, multiplication and division. The operations are placed in an XML stream that conforms to the XML schema. The XML stream may perform one or more of the defined operations on the data. The limited set of operations allows data to be validated and processed without excessive overhead.
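A small sketch of applying such an XML stream of operations to data. The element and attribute names (`operations`, `add`, `subtract`, `multiply`, `divide`, `field`, `value`) are invented for illustration, and the whitelist of tags stands in for validation against the XML schema.

```python
import operator
import xml.etree.ElementTree as ET

# The limited set of operations; anything else is rejected.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def apply_stream(xml_text, data):
    """Apply each operation in the stream, in order, to a copy of data."""
    result = dict(data)
    for op in ET.fromstring(xml_text):
        if op.tag not in OPS:
            raise ValueError(f"operation not allowed: {op.tag}")
        name = op.get("field")
        result[name] = OPS[op.tag](result[name], float(op.get("value")))
    return result


stream = """
<operations>
  <add field="balance" value="5"/>
  <multiply field="balance" value="2"/>
  <divide field="balance" value="4"/>
</operations>
"""
updated = apply_stream(stream, {"balance": 10.0})   # (10 + 5) * 2 / 4
```

Keeping the operation set this small means each stream can be checked with a simple lookup before anything is executed.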
TL;DR: XSAGs are the first scalable query language for XML streams that allows for actual data transformations rather than just document filtering and the XSAG formalism provides a strong intuition for which queries can or cannot be processed scalably on streams.
Abstract: We introduce the notion of XML Stream Attribute Grammars (XSAGs). XSAGs are the first scalable query language for XML streams (running strictly in linear time with bounded memory consumption independent of the size of the stream) that allows for actual data transformations rather than just document filtering. XSAGs are also relatively easy to use for humans. Moreover, the XSAG formalism provides a strong intuition for which queries can or cannot be processed scalably on streams. We introduce XSAGs together with the necessary language-theoretic machinery, study their theoretical properties such as expressiveness and complexity, and discuss their implementation.
TL;DR: It is shown how the system can be used to tackle various two-level transformation scenarios, such as XML schema evolution coupled with document migration, and hierarchical-relational data mappings that convert between XML documents and SQL databases.
Abstract: A two-level data transformation consists of a type-level transformation of a data format coupled with value-level transformations of data instances corresponding to that format. We have implemented a system for performing two-level transformations on XML schemas and their corresponding documents, and on SQL schemas and the databases that they describe. The core of the system consists of a combinator library for composing type-changing rewrite rules that preserve structural information and referential constraints. We discuss the implementation of the system's core library, and of its SQL and XML front-ends in the functional language Haskell. We show how the system can be used to tackle various two-level transformation scenarios, such as XML schema evolution coupled with document migration, and hierarchical-relational data mappings that convert between XML documents and SQL databases.
TL;DR: For model checking of software code, a transition system and an extensible markup language (XML) representation of the data are generated; labels for the transition system are then produced by querying the XML representation using a (markup) query language.
Abstract: The invention concerns model program analysis of software code using model checking. Initially, a transition system (22) and an extensible markup language (XML) (24) representation of the data is generated. Next, labels (26) for the transition system are generated by querying the XML representation of the data using (markup) query language. The labels and the structure of the transition system are then used as input to model checking techniques to analyse the software code (28). It is an advantage of the invention that the problem of labelling a transition system can be transformed into the XML domain so that detailed information about the software code can be extracted using queries in a format that can be run in the XML domain which are well known. At the same time the transformation to the XML domain does not prevent the use of efficient model checking technologies.
TL;DR: In XPath query evaluation, indices similar to those used in relational database systems - namely, value indices on tags and text values - are first used, together with structural join algorithms, which turn out to be simple and efficient.
Abstract: Supporting efficient access to XML data using XPath [3] continues to be an important research problem [6, 12]. XPath queries are used to specify node-labeled trees which match portions of the hierarchical XML data. In XPath query evaluation, indices similar to those used in relational database systems - namely, value indices on tags and text values - are first used, together with structural join algorithms [1, 2, 19]. This approach turns out to be simple and efficient. However, the structural containment relationships native to XML data are not directly captured by value indices.