TL;DR: A technique is presented that represents the tree structure of an XML document efficiently by “compressing” it, which allows queries to be executed directly without prior decompression.
Abstract: Implementations that load XML documents and give access to them via, e.g., the DOM, suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. Here a technique is presented that represents the tree structure of an XML document efficiently. The representation exploits the high regularity in XML documents by “compressing” their tree structure, that is, by detecting and removing repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This allows queries (and in particular, bulk operations) to be executed directly without prior decompression. For certain tasks like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.
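The idea of detecting and removing repeated tree patterns can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it only shares identical subtrees (hash-consing a tree into a DAG), and the names `Node`, `compress`, `count_nodes`, and `make_book` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """An element node: a tag plus an ordered tuple of children."""
    tag: str
    children: Tuple["Node", ...] = ()


def compress(node: Node, pool: Dict[tuple, Node]) -> Node:
    """Return a shared instance of `node`; identical subtrees are stored once."""
    shared = tuple(compress(child, pool) for child in node.children)
    # Children are already shared (and kept alive by the pool),
    # so their identities are enough to identify the whole subtree.
    key = (node.tag, tuple(id(child) for child in shared))
    if key not in pool:
        pool[key] = Node(node.tag, shared)
    return pool[key]


def count_nodes(node: Node) -> int:
    """Traverse along edges; works the same on the tree and on the DAG."""
    return 1 + sum(count_nodes(child) for child in node.children)


def make_book() -> Node:
    """Build a fresh, unshared <book><title/><author/></book> subtree."""
    return Node("book", (Node("title"), Node("author")))


tree = Node("bib", (make_book(), make_book(), make_book()))
pool: Dict[tuple, Node] = {}
dag = compress(tree, pool)
print(count_nodes(tree), len(pool))  # 10 nodes in the tree, 4 stored after sharing
```

Traversal code such as `count_nodes` runs unchanged on the shared structure, which is the property the abstract relies on for querying without decompression.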
TL;DR: This paper develops a method to perform holistic twig pattern matching on XML documents partitioned using various streaming schemes and can process a large class of twig patterns consisting of both ancestor-descendant and parent-child relationships and avoid generating redundant intermediate results.
Abstract: Searching for all occurrences of a twig pattern in an XML document is an important operation in XML query processing. Recently a holistic method, TwigStack [2], has been proposed. The method avoids generating large intermediate results which do not contribute to the final answer and is CPU and I/O optimal when twig patterns only have ancestor-descendant relationships. Another important direction of XML query processing is to build structural indexes [3][8][13][15] over XML documents to avoid unnecessary scanning of source documents. We regard XML structural indexing as a technique to partition XML documents and call it a streaming scheme in our paper. In this paper we develop a method to perform holistic twig pattern matching on XML documents partitioned using various streaming schemes. Our method avoids unnecessary scanning of irrelevant portions of XML documents. More importantly, depending on the streaming scheme used, it can process a large class of twig patterns consisting of both ancestor-descendant and parent-child relationships and avoid generating redundant intermediate results. Our experiments demonstrate the applicability and the performance advantages of our approach.
TL;DR: The TurboXPath path processor is proposed, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document, and can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources.
Abstract: Efficient querying of XML streams will be one of the fundamental features of next-generation information systems. In this paper we propose the TurboXPath path processor, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document. TurboXPath can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources. Internally, TurboXPath uses a tree-shaped path expression with multiple outputs to drive the execution. The result of a query execution is a sequence of tuples of XML fragments matching the output nodes. Based on a streamed execution model, TurboXPath scales up to large documents and has limited memory consumption for increased concurrency. Experimental evaluation of a prototype demonstrates performance gains compared to other state-of-the-art path processors.
TL;DR: A method and apparatus are provided for translating queries, such as path expressions and SQL/XML constructs, into SQL statements to be executed against an XML index, which reduces processor time compared with applying path expressions directly to the original XML documents to extract the desired information.
Abstract: A method and apparatus is provided for translating queries, such as path expressions and SQL/XML constructs, into SQL statements to be executed against an XML index, which improves processor time as opposed to applying path expressions directly to the original XML documents to extract the desired information. Simple path expressions, filter expressions, descendant axes, wildcards, logical expressions, relational expressions, literals, and other path expressions are all translated into SQL for efficient querying of an XML index. Similarly, rules for translating SQL/XML constructs into SQL are provided.
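As a rough illustration of translating a path expression into SQL over an index, the following Python sketch handles only simple child paths and a single descendant step (`//`). The table `path_index(doc_id, path, value)` and the function `path_to_sql` are assumptions made for this example and do not come from the patent; SQLite's `LIKE` is also case-insensitive for ASCII by default, which a real translation would need to account for.

```python
import sqlite3
from typing import List, Tuple


def _escape_like(text: str) -> str:
    """Escape LIKE wildcards so element names are matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def path_to_sql(path: str) -> Tuple[str, List[str]]:
    """Translate a simple path into (sql, params) over the hypothetical path_index table."""
    base = "SELECT doc_id, value FROM path_index WHERE "
    if path.count("//") > 1:
        raise ValueError("this sketch supports at most one '//' step")
    if "//" not in path:
        # Plain child-axis path: an exact match on the stored path.
        return base + "path = ?", [path]
    head, tail = path.split("//")
    exact = f"{head}/{tail}"  # descendant at depth one
    pattern = f"{_escape_like(head)}/%/{_escape_like(tail)}"  # deeper descendants
    return base + "(path = ? OR path LIKE ? ESCAPE '\\')", [exact, pattern]


# Hypothetical index contents, one row per (document, root-to-leaf path, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE path_index (doc_id INTEGER, path TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO path_index VALUES (?, ?, ?)",
    [
        (1, "/po/item/price", "10"),
        (1, "/po/price", "99"),
        (1, "/po/item/name", "pen"),
        (2, "/po/xprice", "0"),
    ],
)

sql, params = path_to_sql("/po//price")
matches = sorted(conn.execute(sql, params).fetchall())
print(matches)  # [(1, '10'), (1, '99')]
```

The query touches only the index table, never the stored documents, which is the source of the savings the abstract describes.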
TL;DR: This paper analyzes a scheme for XML access control with symmetric encryption and secret sharing in the context of the rigorous models of modern cryptography, and obtains formal results in simple, symbolic terms close to the vocabulary of Miklau and Suciu.
Abstract: Some promising recent schemes for XML access control employ encryption for implementing security policies on published data, avoiding data duplication. In this paper we study one such scheme, due to Miklau and Suciu. That scheme was introduced with some intuitive explanations and goals, but without precise definitions and guarantees for the use of cryptography (specifically, symmetric encryption and secret sharing). We bridge this gap in the present work. We analyze the scheme in the context of the rigorous models of modern cryptography. We obtain formal results in simple, symbolic terms close to the vocabulary of Miklau and Suciu. We also obtain more detailed computational results that establish security against probabilistic polynomial-time adversaries. Our approach, which relates these two layers of the analysis, continues a recent thrust in security research and may be applicable to a broad class of systems that rely on cryptographic data protection.
TL;DR: Techniques, systems and apparatus are presented for automatically generating a schema from an initial document constructed in an XML-compatible format; they can be implemented as software operating on a computer system, as a computer module, as a computer program product, and as a series of related devices and products.
Abstract: Techniques, systems and apparatus for automatically generating a schema using an initial document constructed in an XML-compatible format are disclosed. A method involves providing an initial XML document, analyzing the XML document to identify the XML data structures in the document, and generating a data framework that corresponds to the data structure of the XML document. The data items of the initial XML document are analyzed to determine data constraints based on the data items of the initial XML. Schema are then generated based on the data framework generated and the data constraints determined from the raw XML data. These principles can be implemented as software operating on a computer system, as a computer module, as a computer program product and as a series of related devices and products.
TL;DR: The proposed methodology for building XML data warehouses covers processes including data cleaning and integration, summarization, intermediate XML documents, and updating/linking existing documents and creating fact tables, and uses XQuery in all of these processes.
Abstract: Developing a data warehouse for XML documents involves two major processes: one of creating it, by processing XML raw documents into a specified data warehouse repository; and the other of querying it, by applying techniques to better answer users’ queries. This paper focuses on the first part; that is identifying a systematic approach for building a data warehouse of XML documents, specifically for transferring data from an underlying XML database into a defined XML data warehouse. The proposed methodology on building XML data warehouses covers processes including data cleaning and integration, summarization, intermediate XML documents, and updating/linking existing documents and creating fact tables. In this paper, we also present a case study on how to put this methodology into practice. We utilise the XQuery technology in all of the above processes.
TL;DR: Foreword Preface I: XML Technologies 1. HTML and Web Pages 2. XML Documents 3. Navigating XML Trees with XPath 4. Schema Languages 5. Transforming XML documents with XSLT 6. Querying XML Documents with XQuery
Abstract: Foreword Preface I: XML Technologies 1. HTML and Web Pages 2. XML Documents 3. Navigating XML Trees with XPath 4. Schema Languages 5. Transforming XML Documents with XSLT 6. Querying XML Documents with XQuery 7. XML Programming II: Web Technologies 8. The HTTP Protocol 9. Programming Web Applications with Servlets 10. Programming Web Applications with JSP 11. Web Services 12. A Complete Application Bibliography Index
TL;DR: A new Labelling Scheme for Dynamic XML data (LSDX) is proposed that supports the representation of the ancestor-descendant and sibling relationships between nodes and facilitates fast update of XML data.
Abstract: In order to facilitate query processing for XML data, several path indexing, labelling and numbering schemes have been proposed. However, if XML data need to be updated frequently, most of these approaches will need to re-compute existing labels, which is rather time consuming. In this paper, we propose a new Labelling Scheme for Dynamic XML data (LSDX) that supports the representation of the ancestor-descendant relationship and the sibling relationship between nodes. Moreover, LSDX supports the process of updating XML data without the need to re-label existing nodes, hence facilitating fast update. Some experiments have been conducted to show its effectiveness.
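To show what it means for labels to encode ancestor-descendant and sibling relationships, here is a minimal Python sketch using plain prefix (Dewey-style) labels. This is explicitly not LSDX, whose label format is not given in the abstract; the names `child_label`, `is_ancestor`, and `is_sibling` are hypothetical. Plain prefix labels like these must be recomputed when a node is inserted between two existing siblings, which is the update cost the abstract says LSDX avoids.

```python
from typing import Tuple

Label = Tuple[int, ...]


def child_label(parent: Label, position: int) -> Label:
    """Label of the child at 1-based `position` under `parent`."""
    return parent + (position,)


def is_ancestor(a: Label, d: Label) -> bool:
    """True when `a` is a proper prefix of `d`."""
    return len(a) < len(d) and d[: len(a)] == a


def is_sibling(x: Label, y: Label) -> bool:
    """True when two distinct labels share the same parent prefix."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]


root: Label = (1,)
book = child_label(root, 1)
title = child_label(book, 1)
author = child_label(book, 2)

print(is_ancestor(root, title), is_sibling(title, author))  # True True
```

Both relationships are decided from the labels alone, without touching the document.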
TL;DR: A prototype compiler for XJ is built, and preliminary experiments demonstrate that the performance of XJ programs can approach that of traditional low-level API-based interfaces, while providing a higher level of abstraction.
Abstract: The increased importance of XML as a data representation format has led to several proposals for facilitating the development of applications that operate on XML data. These proposals range from runtime API-based interfaces to XML-based programming languages. The subject of this paper is XJ, a research language that proposes novel mechanisms for the integration of XML as a first-class construct into Java™. The design goals of XJ distinguish it from past work on integrating XML support into programming languages --- specifically, the XJ design adheres to the XML Schema and XPath standards. Moreover, it supports in-place updates of XML data thereby keeping with the imperative nature of Java. We have built a prototype compiler for XJ, and our preliminary experiments demonstrate that the performance of XJ programs can approach that of traditional low-level API-based interfaces, while providing a higher level of abstraction.
TL;DR: This work provides the notion of security views for characterizing information accessible to authorized users, a transformed (sanitized) DTD schema that can be used by users for query formulation and optimization and proposes a number of generalizations for security policies.
Abstract: We investigate a generalization of the notion of XML security view introduced by Stoica and Farkas [17] and later refined by Fan et al. [8]. The model consists of access control policies specified over DTDs with XPath expressions for data-dependent access control policies. We provide the notion of security views for characterizing information accessible to authorized users. This is a transformed (sanitized) DTD schema that can be used by users for query formulation and optimization. Then we show an algorithm to materialize an "authorized" version of the document from the view and an algorithm to construct the view from an access control specification. We also propose a number of generalizations for security policies.
TL;DR: The proposed XML parser, Deltarser, is adaptive since it partially parses and then remembers XML document fragments that it has not met before, and processes safely since its partial parsing correctly checks the well-formedness of documents.
Abstract: XML (Extensible Markup Language) processing can incur significant runtime overhead in XML-based infrastructural middleware such as Web service application servers. This paper proposes a novel mechanism for efficiently processing similar XML documents. Given a new XML document as a byte sequence, the XML parser proposed in this paper normally avoids syntactic analysis but simply matches the document with previously processed ones, reusing those results. Our parser is adaptive since it partially parses and then remembers XML document fragments that it has not met before. Moreover, it processes safely since its partial parsing correctly checks the well-formedness of documents. Our implementation of the proposed parser complies with the JSR 63 standard of the Java API for XML Processing (JAXP) 1.1 specification. We evaluated Deltarser performance with messages using Google Web services. Compared to Piccolo (and Apache Xerces), it effectively parses 35% (106%) faster in a server-side use-case scenario, and 73% (126%) faster in a client-side use-case scenario.
TL;DR: This work presents XSugar, which makes it possible to manage dual syntax for XML languages, and statically checks that the transformations are reversible and that all XML documents generated from the alternative syntax are valid according to a given XML schema.
Abstract: XML is successful as a machine processable data interchange format, but it is often too verbose for human use. For this reason, many XML languages permit an alternative more legible non-XML syntax. XSLT stylesheets are often used to convert from the XML syntax to the alternative syntax; however, such transformations are not reversible since no general tool exists to automatically parse the alternative syntax back into XML.
We present XSugar, which makes it possible to manage dual syntax for XML languages. An XSugar specification is built around a context-free grammar that unifies the two syntaxes of a language. Given such a specification, the XSugar tool can translate from alternative syntax to XML and vice versa. Moreover, the tool statically checks that the transformations are reversible and that all XML documents generated from the alternative syntax are valid according to a given XML schema.
TL;DR: This work provides a formal framework for XML Schema-driven decompositions, which encompasses the decompositions proposed in prior work and extends them with decompositions that employ denormalized tables and binary-coded XML fragments.
Abstract: XML database systems emerge as a result of the acceptance of the XML data model. Recent works have followed the promising approach of building XML database management systems on underlying RDBMS’s. Achieving query processing performance reduces to two questions: (i) How should the XML data be decomposed into data that are stored in the RDBMS? (ii) How should the XML query be translated into an efficient plan that sends one or more SQL queries to the underlying RDBMS and combines the data into the XML result? We provide a formal framework for XML Schema-driven decompositions, which encompasses the decompositions proposed in prior work and extends them with decompositions that employ denormalized tables and binary-coded XML fragments. We provide corresponding query processing algorithms that translate the XML query conditions into conditions on the relational tables and assemble the decomposed data into the XML query result. Our key performance focus is the response time for delivering the first results of a query. The most effective of the described decompositions have been implemented in XCacheDB, an XML DBMS built on top of a commercial RDBMS, which serves as our experimental basis. We present experiments and analysis that point to a class of decompositions, called inlined decompositions, that improve query performance for full results and first results, without significant increase in the size of the database.
TL;DR: This tutorial will provide an insight into how XML functionality fits into relational database management systems as seen by three major relational vendors: IBM, Microsoft and Oracle.
Abstract: As XML has evolved from a document markup language to a widely-used format for exchange of structured and semistructured data, managing large amounts of XML data has become increasingly important. A number of companies, including both established database vendors and startups, have recently announced new XML database systems or new XML functionality integrated into existing database systems. This tutorial will provide an insight into how XML functionality fits into relational database management systems as seen by three major relational vendors: IBM, Microsoft and Oracle.
TL;DR: In this paper, a method of parsing an XML data stream comprises receiving XML data streams containing a namespace prefix and an associated element tag name, which are converted into a token that uniquely represents a namespace specification that is associated with the prefix and the element tag.
Abstract: In one embodiment, a method of parsing an XML data stream comprises receiving an XML data stream containing a namespace prefix and an associated element tag name. The element tag name is associated with an element tag. The namespace prefix and the element tag name are converted into a token that uniquely represents a namespace specification that is associated with the namespace prefix and the element tag. A stack is defined and is configured to receive one or more tokens during parsing of the XML data stream. Parsing of the XML data stream is performed without requiring an XML tree structure comprising an XML document embodied by the XML data stream, to be built.
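The combination of tokenizing namespace-qualified tag names and tracking them on a stack, without building a tree, can be sketched with Python's standard `xml.sax` module. This is only an illustration under assumptions: the patent does not specify the token format, so the integer tokens, the `TokenizingHandler` class, and the `tokenize` helper are hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class TokenizingHandler(ContentHandler):
    """Maps each (namespace URI, local name) pair to a small integer token."""

    def __init__(self):
        super().__init__()
        self.token_ids = {}  # (uri, local name) -> token
        self.stack = []      # tokens of the currently open elements
        self.seen = []       # tokens in document order

    def startElementNS(self, name, qname, attrs):
        # `name` is (uri, local name); the prefix itself is already resolved,
        # so two prefixes bound to the same URI yield the same token.
        token = self.token_ids.setdefault(name, len(self.token_ids))
        self.stack.append(token)
        self.seen.append(token)

    def endElementNS(self, name, qname):
        self.stack.pop()


def tokenize(xml_bytes: bytes) -> TokenizingHandler:
    """Stream-parse `xml_bytes`; no document tree is ever constructed."""
    handler = TokenizingHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler


doc = b'<a:order xmlns:a="urn:x" xmlns:b="urn:x"><a:item/><b:item/></a:order>'
result = tokenize(doc)
print(result.seen)  # [0, 1, 1]
```

Memory use is bounded by the nesting depth (the stack) and the number of distinct names (the token table), not by the document size.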
TL;DR: This work extends the usual XML data model with symbolic representations of cryptographic values and uses predicates on this data model to describe the semantics of security elements and of sample protocols distributed with the Microsoft WSE implementation of WS-Security.
TL;DR: To enforce the access constraints on user queries, the Secure Query Rewrite (SQR) is proposed -- a set of rules that can be used to rewrite a user XPath query on the security view into an equivalent XQuery expression against the original data, with the guarantee that the users only see information in the view but not any data that was blocked.
Abstract: Being able to express and enforce role-based access control on XML data is a critical component of XML data management. However, given the semi-structured nature of XML, this is non-trivial, as access control can be applied on the values of nodes as well as on the structural relationship between nodes. In this context, we adopt and extend a graph editing language for specifying role-based access constraints in the form of security views. A Security Annotated Schema (SAS) is proposed as the internal representation for the security views and can be automatically constructed from the original schema and the security view specification. To enforce the access constraints on user queries, we propose Secure Query Rewrite (SQR) -- a set of rules that can be used to rewrite a user XPath query on the security view into an equivalent XQuery expression against the original data, with the guarantee that the users only see information in the view but not any data that was blocked. Experimental evaluation demonstrates the efficiency and the expressiveness of our approach.
TL;DR: A hardware XML accelerator is described that includes one or more processors (e.g., CMT processors), one or more hardware XML parser units, one or more cryptographic units, and various interfaces (e.g., to memory, a network, a communication bus).
Abstract: A method and apparatus for accelerating processing of a structured document. A hardware XML accelerator includes one or more processors (e.g., CMT processors), one or more hardware XML parser units, one or more cryptographic units and various interfaces (e.g., to memory, a network, a communication bus). An XML document may be processed in its entirety or may be parsed in segments (e.g., as it is received). A parser unit parses a document or segment character by character, validates characters, assembles tokens from the document, extracts data, generates token headers (to describe tokens and data) and forwards the token headers and data for consumption by an application. A cryptographic unit may enforce web security, XML security or some other security scheme, by providing encryption/decryption functionality, computing digital signatures, etc. Software processing, bus utilization and latencies (e.g., memory, bus) are greatly reduced, thereby providing significantly improved XML processing and security processing throughput.
TL;DR: The XTreeNet project unifies the publish/subscribe and query/response models with a single common XML aware overlay network for XML-based information producers and consumers.
Abstract: XML is becoming a ubiquitous format for information exchange on the Internet. To alleviate the problems of "whom to ask" and "whom to tell" when connecting XML information producers with consumers over the network, content-based querying and dissemination of information have been investigated in the literature. Our XTreeNet project unifies the publish/subscribe and query/response models with a single common XML aware overlay network for XML-based information producers and consumers. This integrated framework lends itself to a variety of applications.
TL;DR: A unified web-based voice messaging system provides voice application control between a web browser and an application server via a hypertext transport protocol (HTTP) connection on an Internet Protocol (IP) network.
Abstract: A unified web-based voice messaging system provides voice application control between a web browser and an application server via a hypertext transport protocol (HTTP) connection on an Internet Protocol (IP) network. The application server executes the voice-enabled web application by runtime execution of a first set of extensible markup language (XML) documents that define the voice-enabled web application to be executed. In addition, control data for the voice-enabled web application, and log files that record events that occur during execution of the voice-enabled web application, are generated and processed using an XML tag format. A second set of XML documents specifies application parameters and control information to be used by the application runtime environment for execution of the first set of XML documents. The second set of XML documents enables the application server to maintain a generic application runtime environment, enabling applications to share common control information and provide personalized services for subscribers based on respective user-specific control attributes. The generation of log files using an XML tag format enables the log files to use a standardized XML structure that includes log element type, log element attribute, and log element data information. Hence, logs may be written for individual user sessions and overall application information, where the XML log tags may be sufficiently descriptive to be understood using any XML viewer or analyzed by a custom log parser configured for locating prescribed XML tags related to a corresponding operation, for example billing, trace routing, etc.
TL;DR: This article reports the first set of results on benchmarking a set of XML database implementations using two XML benchmarks, and selected implementations represent a wide range of approaches, including RDBMS-based systems with document-independent and document-dependent XML-relational schema mapping approaches, and XML native engines based on an Object-Oriented Model and the Document Object Model.
Abstract: XML is emerging as a major standard for representing data on the World Wide Web. Recently, many XML storage models have been proposed to manage XML data. In order to assess an XML database's abilities to deal with XML queries, several benchmarks have also been proposed, including XMark and XMach. However, no reported studies using those benchmarks were found that can provide users with insights on the impacts of a variety of storage models on XML query performance. In this article, we report our first set of results on benchmarking a set of XML database implementations using two XML benchmarks. The selected implementations represent a wide range of approaches, including RDBMS-based systems with document-independent and document-dependent XML-relational schema mapping approaches, and XML native engines based on an Object-Oriented Model and the Document Object Model. Comprehensive experiments were conducted to study relative performance of different approaches and the important issues that affect XML query performance, such as path expression query processing, effectiveness of various partitioning, label-path, and indexing structures.
TL;DR: In this paper, an XML-based solution for data publishing in the presence of an untrusted publisher is presented, which makes use of non-conventional digital signature techniques and queries over encrypted data.
Abstract: Web-based third-party architectures for data publishing are today receiving growing attention, due to their scalability and the ability to efficiently manage large numbers of users and great amounts of data. A third-party architecture relies on a distinction between the Owner and the Publisher of information. The Owner is the producer of information, whereas the Publisher provides data management services and query processing functions for (a portion of) the Owner's information. In such an architecture, there are important security concerns, especially if we do not want to make any assumption on the trustworthiness of the Publishers. Although approaches have been proposed [4, 5] providing partial solutions to this problem, no comprehensive framework has so far been developed that is able to support all the most important security properties in the presence of an untrusted Publisher. In this paper, we develop an XML-based solution to this problem, which makes use of non-conventional digital signature techniques and queries over encrypted data.
TL;DR: A binary XML format and implementation for scientific data called Binary XML for Scientific Applications (BXSA) is presented, and it is shown that performance is comparable to that of commonly used scientific data formats such as netCDF.
Abstract: XML provides flexible, extensible data models and type systems for structured data, and has found wide acceptance in many domains. XML processing can be slow, however, especially for scientific data, thus leading to the conventional wisdom that XML is not appropriate for such data. Instead, data is stored in specialized binary formats, and is transmitted via work-arounds such as attachments and base64 encoding. Though these work-arounds can be useful, they nonetheless relegate scientific data to second-class status within the Web services framework; and they generally require yet another API, data model, and type system. An alternative solution is to use more efficient encodings of XML, often known as "binary XML". Using XML uniformly throughout an application simplifies and unifies design and development. In this paper we present a binary XML format and implementation for scientific data called Binary XML for Scientific Applications (BXSA). We show that performance is comparable to that of commonly used scientific data formats such as netCDF. These results challenge the prevailing practice of handling control and data separately in scientific applications, with Web services for control and specialized binary formats for data.
TL;DR: A method and apparatus are provided for rewriting a database command containing an embedded XML expression so that the rewritten database command recites a text function in lieu of the embedded XML expression.
Abstract: A method and apparatus for rewriting a database command containing an embedded XML expression such that the rewritten database command recites a text function, in lieu of the embedded XML expression, is provided. Advantageously, a DBMS may take advantage of the efficiencies in storing XML data within the database, while avoiding the generation of unnecessary XML elements in processing the query when the XML elements contribute nothing to the outcome of the query. Cost-based or rule-based analysis may be performed to determine how to rewrite a received database command. The database server may functionally evaluate the text function or may use an index defined on a column of the database. The text function may function as a primary filter or may reference a column upon which an index is defined, wherein the index operates at the same or higher level than a column being referenced in the embedded XML expression.
TL;DR: The classification of XML indexing techniques identifies current practices and trends, offering insight into how developers can improve query processing and select the best solution for particular contexts.
Abstract: XML's increasing diffusion makes efficient XML query processing and indexing all the more critical. Given the semistructured nature of XML documents, however, general query processing techniques won't work. Researchers have proposed several specialized indexing methods that offer query processors efficient access to XML documents, although none are yet fully implemented in commercial products. In this article the classification of XML indexing techniques identifies current practices and trends, offering insight into how developers can improve query processing and select the best solution for particular contexts.
TL;DR: This paper proposes a novel approach to XML access control through rule functions that are managed separately from the documents, and shows the scalability of the scheme by comparing the accessibility evaluation cost of two rule function models.
Abstract: XML documents are frequently used in applications such as business transactions and medical records involving sensitive information. Typically, parts of documents should be visible to users depending on their roles. For instance, an insurance agent may see the billing information part of a medical document but not the details of the patient's medical history. Access control on the basis of data location or value in an XML document is therefore essential. In practice, the number of access control rules is on the order of millions, which is a product of the number of document types (in 1000's) and the number of user roles (in 100's). Therefore, the solution requires high scalability and performance. Current approaches to access control over XML documents have suffered from scalability problems because they tend to work on individual documents. In this paper, we propose a novel approach to XML access control through rule functions that are managed separately from the documents. A rule function is an executable code fragment that encapsulates the access rules (paths and predicates), and is shared by all documents of the same document type. At runtime, the rule functions corresponding to the access request are executed to determine the accessibility of document fragments. Using synthetic and real data, we show the scalability of the scheme by comparing the accessibility evaluation cost of two rule function models. We show that rule functions generated on a per-user basis are more efficient for XML databases.
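A rule function, as described above, is executable code that encapsulates paths and predicates and is shared by all documents of one type. The Python sketch below illustrates that shape only; the registry `RULES`, the decorator `rule`, the `accessible` helper, the role names, and the example paths are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Tuple

# A rule function receives the fragment's path and its field values.
RuleFunction = Callable[[str, dict], bool]

# One rule function per (document type, role), kept apart from the documents.
RULES: Dict[Tuple[str, str], RuleFunction] = {}


def rule(doc_type: str, role: str):
    """Register a rule function for a document type and role."""
    def register(fn: RuleFunction) -> RuleFunction:
        RULES[(doc_type, role)] = fn
        return fn
    return register


def _under(path: str, prefix: str) -> bool:
    """True when `path` is `prefix` itself or lies below it."""
    return path == prefix or path.startswith(prefix + "/")


@rule("medical_record", "insurance_agent")
def agent_rule(path: str, fragment: dict) -> bool:
    # Path rule: billing information only.
    return _under(path, "/record/billing")


@rule("medical_record", "physician")
def physician_rule(path: str, fragment: dict) -> bool:
    # Path rule plus a value predicate on the fragment.
    return _under(path, "/record") and not fragment.get("sealed", False)


def accessible(doc_type: str, role: str, path: str, fragment: dict) -> bool:
    """Run the matching rule function; deny when no rule is registered."""
    fn = RULES.get((doc_type, role))
    return fn is not None and fn(path, fragment)


print(accessible("medical_record", "insurance_agent", "/record/billing/total", {}))
print(accessible("medical_record", "insurance_agent", "/record/history", {}))
```

Because the rules live in the registry rather than in each document, adding a document of an existing type requires no new access control state.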
TL;DR: This tutorial focuses on DB2's XML support for schema evolution, especially DB2's schema repository and document-level validation, against the background of DB2's native XML storage, indexing, navigation and query processing through SQL/XML and XQuery.
Abstract: DB2 provides native XML storage, indexing, navigation and query processing through both SQL/XML and XQuery using the XML data type introduced by SQL/XML. In this tutorial we focus on DB2's XML support for schema evolution, especially DB2's schema repository and document-level validation.
TL;DR: In this article, a method of creating a new XML document having at least a root element and a declaration is presented, which is based on retrieving from storage a new fragment XML document comprising at least one XML template for the new XML file.
Abstract: A method of creating a new XML document having at least a root element and a declaration. The method comprises retrieving from storage a new fragment XML document comprising at least one XML template for a new XML file that itself has a root element. Then, at least one XML template is selected and the selected XML template is used to create an XML document. User and programmer interfaces, as well as device and system structures that can implement the method, also are provided.