TL;DR: The XRANK system is presented, designed to handle the novel features of XML keyword search, which naturally generalizes a hyperlink based HTML search engine such as Google and can be used to query a mix of HTML and XML documents.
Abstract: We consider the problem of efficiently producing ranked results for keyword search queries over hyperlinked XML documents. Evaluating keyword search queries over hierarchical XML documents, as opposed to (conceptually) flat HTML documents, introduces many new challenges. First, XML keyword search queries do not always return entire documents, but can return deeply nested XML elements that contain the desired keywords. Second, the nested structure of XML implies that the notion of ranking is no longer at the granularity of a document, but at the granularity of an XML element. Finally, the notion of keyword proximity is more complex in the hierarchical XML data model. In this paper, we present the XRANK system that is designed to handle these novel features of XML keyword search. Our experimental results show that XRANK offers both space and performance benefits when compared with existing approaches. An interesting feature of XRANK is that it naturally generalizes a hyperlink based HTML search engine such as Google. XRANK can thus be used to query a mix of HTML and XML documents.
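As a rough illustration of the first point above (results are nested elements rather than whole documents), the following minimal sketch returns the most deeply nested elements whose text contains every keyword. It is not the XRANK algorithm: it does no ranking, ignores hyperlinks and proximity, and uses plain substring matching. The function name `deepest_matches` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def deepest_matches(elem, keywords):
    """Return the most deeply nested elements whose subtree text contains every keyword."""
    text = " ".join(elem.itertext()).lower()
    if not all(k.lower() in text for k in keywords):
        return []  # this subtree cannot contain a match
    # Prefer matches found inside children; fall back to this element itself.
    nested = [m for child in elem for m in deepest_matches(child, keywords)]
    return nested or [elem]


# Hypothetical sample document.
doc = ET.fromstring(
    "<proceedings><paper><title>XML keyword search</title></paper>"
    "<paper><title>Relational joins</title></paper></proceedings>"
)
hits = deepest_matches(doc, ["xml", "search"])
hit_tags = [h.tag for h in hits]
```

Here the result is the nested `title` element of the first paper, not the enclosing `paper` or `proceedings` element.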
TL;DR: An exhaust gas recirculator in an internal combustion engine wherein a part of the exhaust gas is supplied to a suction manifold only when the throttle valve is opened and for a time interval immediately afterwards.
Abstract: An exhaust gas recirculator in an internal combustion engine wherein a part of the exhaust gas is supplied to a suction manifold only when the throttle valve is opened and for a time interval immediately afterwards, and the supply of the exhaust gas is stopped during steady operation of the engine at the degree of throttle valve opening reached in that supply period.
TL;DR: eXist is an Open Source native XML database system that supports keyword search on element and attribute contents; an enhanced indexing scheme at the architecture's core supports quick identification of structural node relationships.
Abstract: With the advent of native and XML enabled database systems, techniques for efficiently storing, indexing and querying large collections of XML documents have become an important research topic. This paper presents the storage, indexing and query processing architecture of eXist, an Open Source native XML database system. eXist is tightly integrated with existing tools and covers most of the native XML database features. An enhanced indexing scheme at the architecture's core supports quick identification of structural node relationships. Based on this scheme, we extend the application of path join algorithms to implement most parts of the XPath query language specification and add support for keyword search on element and attribute contents.
TL;DR: This month’s column deals with metadata for XML, primarily the W3C’s XML Schema Recommendation, which is often seen as highly complex, but quite powerful.
Abstract: This month’s column deals with metadata for XML, primarily the W3C’s XML Schema Recommendation. XML Schema is often seen as highly complex, but quite powerful. We have worked with Chuck Campbell in several standards arenas, including the SQL standard and XML Query. Chuck is an invited expert to the W3C’s XML Schema WG, so we invited him to write a column outlining the features and futures of XML Schema. Jim Melton and Andrew Eisenberg
TL;DR: In this article, techniques are described for determining whether an XML component operation in a database command can be transformed into a relational database operation on the underlying relational database constructs, one that does not involve the XML component operation, and for rewriting the command accordingly.
Abstract: Techniques for executing database commands include receiving a database command that includes an XML component operation that operates on an XML construct that is based on a first set of one or more relational database constructs. It is determined whether the XML component operation can be transformed to a relational database operation on a particular set of one or more relational database constructs of the first set, which does not involve the XML component operation. If it is determined that the XML component operation can be transformed, then the XML component operation is rewritten as a particular relational database operation that operates on the particular set and that does not involve the XML component operation. The particular relational database operation is evaluated. In another aspect, techniques include determining a primitive set of XML generation operations and replacing non-primitive XML generation operations with one or more operations from the primitive set.
TL;DR: The main result of the paper is that typechecking for k-pebble transducers is decidable, and therefore, typechecking can be performed for a broad range of XML transformation languages, including XML-QL and a fragment of XSLT.
TL;DR: A static analysis technique that can identify at compile time which parts of the input document are needed to answer an arbitrary XQuery, and a loading algorithm that takes the resulting information to build a projected document, which is smaller than the original document, and on which the query yields the same result.
Abstract: XQuery is not only useful to query XML in databases, but also to applications that must process XML documents as files or streams. These applications suffer from the limitations of current main-memory XQuery processors which break for rather small documents. In this paper we propose techniques, based on a notion of projection for XML, which can be used to drastically reduce memory requirements in XQuery processors. The main contribution of the paper is a static analysis technique that can identify at compile time which parts of the input document are needed to answer an arbitrary XQuery. We present a loading algorithm that takes the resulting information to build a projected document, which is smaller than the original document, and on which the query yields the same result. We implemented projection in the Galax XQuery processor. Our experiments show that projection reduces memory requirements by a factor of 20 on average, and is effective for a wide variety of queries. In addition, projection results in some speedup during query evaluation.
TL;DR: A complete framework for distributed and replicated dynamic XML documents, and an algorithm that, for a given peer, chooses data and services that the peer should replicate to improve the efficiency of maintaining and querying its dynamic data are described.
Abstract: The advent of XML as a universal exchange format, and of Web services as a basis for distributed computing, has fostered the emergence of a new class of documents: dynamic XML documents. These are XML documents where some data is given explicitly while other parts are given only intensionally by means of embedded calls to Web services that can be called to generate the required information. By the sole presence of Web services, dynamic documents already inherently include some form of distributed computation. A higher level of distribution that also allows (fragments of) dynamic documents to be distributed and/or replicated over several sites is highly desirable in today's Web architecture, and in fact is also relevant for regular (non dynamic) documents. The goal of this paper is to study new issues raised by the distribution and replication of dynamic XML data. Our study has originated in the context of the Active XML system [1, 3, 22] but the results are applicable to many other systems supporting dynamic XML data. Starting from a data model and a query language, we describe a complete framework for distributed and replicated dynamic XML documents. We provide a comprehensive cost model for query evaluation and show how it applies to user queries and service calls. Finally, we describe an algorithm that, for a given peer, chooses data and services that the peer should replicate to improve the efficiency of maintaining and querying its dynamic data.
TL;DR: In this paper, a SQL statement includes a particular operator that operates on a first instance of XML type that represents a first set of XML elements; during execution of the SQL statement, the particular operator is evaluated by generating an ordered collection of instances of XML type.
Abstract: Techniques for managing XML data in an SQL compliant DBMS include receiving an SQL statement. The SQL statement includes a particular operator that operates on a first instance of XML type that represents a first set of XML elements. During execution of the SQL statement, the particular operator is evaluated by generating an ordered collection of instances of XML type. Each different instance in the ordered collection is based on a different XML element from the first set; and there is an instance in the ordered collection for every XML element from either the first set or from the first set and its descendents. When descendents are included, each entry in the ordered collection indicates a level in the XML tree. In another aspect, an aggregate operator in the SQL statement operates on a collection of instances, with associated levels, to generate a single instance of XML type.
TL;DR: Comparison with relational database performance shows that the desired response times and transaction rates over XML data cannot be achieved without major improvements in XML parsing technology; the paper identifies research topics which are most promising for XML parser performance in database systems.
Abstract: XML parsing is generally known to have poor performance characteristics relative to transactional database processing. Yet, its potentially fatal impact on overall database performance is being underestimated. We report real-world database applications where XML parsing performance is a key obstacle to a successful XML deployment. There is a considerable share of XML database applications which are prone to fail at an early and simple road block: XML parsing. We analyze XML parsing performance and quantify the extra overhead of DTD and schema validation. Comparison with relational database performance shows that the desired response times and transaction rates over XML data cannot be achieved without major improvements in XML parsing technology. Thus, we identify research topics which are most promising for XML parser performance in database systems.
TL;DR: This paper is the first attempt at describing the XML Web and the documents contained in it and shows that, despite its short history, XML already permeates the Web, both in terms of generic domains and geographically.
Abstract: Although originally designed for large-scale electronic publishing, XML plays an increasingly important role in the exchange of data on the Web. In fact, it is expected that XML will become the lingua franca of the Web, eventually replacing HTML. Not surprisingly, there has been a great deal of interest in XML both in industry and in academia. Nevertheless, to date no comprehensive study of the XML Web (i.e., the subset of the Web made of XML documents only) or of its contents has been made. This paper is the first attempt at describing the XML Web and the documents contained in it. Our results are drawn from a sample of a repository of the publicly available XML documents on the Web, consisting of about 200,000 documents. Our results show that, despite its short history, XML already permeates the Web, both in terms of generic domains and geographically. Also, our results about the contents of the XML Web provide valuable input for the design of algorithms, tools and systems that use XML in one form or another.
TL;DR: This paper focuses on the engineering and the experimental evaluation of the MARS system, a system for publishing as XML data from mixed (relational+XML) proprietary storage, while supporting redundancy in storage for tuning purposes.
Abstract: We present a system for publishing as XML data from mixed (relational+XML) proprietary storage, while supporting redundancy in storage for tuning purposes. The correspondence between public and proprietary schemas is given by a combination of LAV- and GAV-style views expressed in XQuery. XML and relational integrity constraints are also taken into consideration. Starting with client XQueries formulated against the public schema, the system achieves the combined effect of rewriting-with-views, composition-with-views and query minimization under integrity constraints to obtain optimal reformulations against the proprietary schema. The paper focuses on the engineering and the experimental evaluation of the MARS system.
TL;DR: XML elements and related processes for validation of XML data files are disclosed in this paper, where the elements are used to specify validation rules that are used by a real-time validation tool to validate data in a node of an XML data file.
Abstract: XML elements and related processes for validation of XML data files are disclosed. These elements are used to specify validation rules that are used by a real-time validation tool to validate data in a node of an XML data file. These elements also are used to specify error messages to be displayed when a node is found to be invalid. Further, they are used to associate executable code with a node that can be executed when the node is modified.
TL;DR: This work provides an O(m log n) incremental validation algorithm using an auxiliary structure of size O(n), where n is the size of the document and m the number of updates, a significant improvement over brute-force re-validation from scratch.
Abstract: We investigate the incremental validation of XML documents with respect to DTDs and XML Schemas, under updates consisting of element tag renamings, insertions and deletions. DTDs are modeled as extended context-free grammars and XML Schemas are abstracted as "specialized DTDs", which allows element types to be decoupled from element tags. For DTDs, we exhibit an O(m log n) incremental validation algorithm using an auxiliary structure of size O(n), where n is the size of the document and m the number of updates. For specialized DTDs, we provide an O(m log² n) incremental algorithm, again using an auxiliary structure of size O(n). This is a significant improvement over brute-force re-validation from scratch.
TL;DR: A lightweight fact extractor is presented that utilizes XML tools, such as XPath and XSLT to extract static information from C++ source code programs to facilitate the use of a wide variety of XML tools.
Abstract: A lightweight fact extractor is presented that utilizes XML tools, such as XPath and XSLT to extract static information from C++ source code programs. The source code is first converted into an XML representation, srcML, to facilitate the use of a wide variety of XML tools. The method is deemed lightweight because only a partial parsing of the source is done. Additionally, the technique is quite robust and can be applied to incomplete and noncompilable source code. The trade off to this approach is that queries on some low level details cannot be directly addressed. This approach is applied to a fact extractor benchmark as comparison with other, heavier weight, fact extractors. Fact extractors are widely used to support understanding tasks associated with maintenance, reverse engineering and various other software engineering tasks.
TL;DR: In this article, a system and method for the efficient indexing and delivery of information to interested users who have expressed an interest in or subscribed to information items that are continuously released or published by some data source in XML format is presented.
Abstract: The present invention provides a system and method for the efficient indexing and delivery of information to interested users who have expressed an interest in or “subscribed” to information items that are continuously released or “published” by some data source in XML format. Previously, publish and subscribe systems accepted keyword-based subscription profiles and did not support subscription to XML documents according to their structures. A direct approach to implementing an XML-based publish and subscribe system, checking each user profile against an XML document, is very time consuming. The present invention, though, provides an efficient method to identify interested subscribers for each XML document by indexing queries utilizing a graphical structure of nodes. When an XML document is published, the index identifies all matched expressions in the index and delivers at least a portion of an XML document to a user who has expressed an interest in receiving this information.
TL;DR: In this paper, a system and method for querying a stream of XML data in a single pass using standard XQuery expressions is presented, comprising an expression parser that receives a query and generates a parse tree; a SAX events API that receives the stream of XML data and generates a stream of SAX events; an evaluator that receives the parse tree and the stream of SAX events and buffers matching fragments; and a tuple constructor that joins the fragments into tuple results.
Abstract: A system and method for querying a stream of XML data in a single pass using standard XQuery expressions. The system comprises: an expression parser that receives a query and generates a parse tree; a SAX events API that receives the stream of XML data and generates a stream of SAX events; an evaluator that receives the parse tree and stream of SAX events and buffers fragments from the stream of SAX events that meet the evaluation criteria; and a tuple constructor that joins fragments to form a set of tuple results that satisfies the query for the stream of XML data.
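The single-pass idea can be sketched with Python's standard `xml.sax` module. This is only a loose illustration under strong assumptions: it matches one fixed absolute path instead of parsing an XQuery expression, and it omits the tuple constructor entirely. The class name `PathCollector` and the sample input are hypothetical.

```python
import xml.sax


class PathCollector(xml.sax.ContentHandler):
    """Buffer the text of every element found at one absolute path, in a single pass."""

    def __init__(self, path):
        super().__init__()
        self.target = path.strip("/").split("/")  # e.g. ["catalog", "book", "title"]
        self.stack = []     # names of the currently open elements
        self.buffer = None  # text fragments of the element being matched, if any
        self.results = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self.buffer = []  # start buffering a matching fragment

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.stack == self.target and self.buffer is not None:
            self.results.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()


handler = PathCollector("/catalog/book/title")
xml.sax.parseString(
    b"<catalog><book><title>A</title></book><book><title>B</title></book></catalog>",
    handler,
)
titles = handler.results
```

Only the open-element stack and the fragment currently being matched are held in memory, which is the property that makes event-based evaluation attractive for streams.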
TL;DR: This work presents a streaming algorithm for evaluating XPath expressions that use backward axes (parent and ancestor) and forward axes in a single document-order traversal of an XML document that significantly outperforms a traditional nonstreaming XPath engine.
Abstract: We present a streaming algorithm for evaluating XPath expressions that use backward axes (parent and ancestor) and forward axes in a single document-order traversal of an XML document. Other streaming XPath processors handle only forward axes. We show through experiments that our algorithm significantly outperforms (by more than a factor of two) a traditional nonstreaming XPath engine. Furthermore, our algorithm scales better because it retains only the relevant portions of the input document in memory. Our engine successfully processes documents over 1 GB in size, whereas the traditional XPath engine degrades considerably in performance for documents over 100 MB in size and fails to complete for documents of size over 200 MB.
TL;DR: A program product, system and method for transforming data between an XML representation and a relational database system wherein a mapping description is created in a mark-up language such as XML and XSL is described in this article.
Abstract: A program product, system and method for transforming data between an XML representation and a relational database system wherein a mapping description is created in a mark-up language such as XML and XSL. The mapping description specifies a set of conditions for source data to satisfy. When mapping to XML, an XML output format is specified in the mapping description and the data is formatted accordingly. When mapping to an RDBMS, actions to be executed on the RDBMS tables are specified in the mapping description and the actions are performed.
TL;DR: In this paper, an XML index can be implemented as a node table and the node table may have a B+-tree structure and be populated by shredding the XML values in the primary table.
Abstract: Storing and querying XML data in a primary table or document utilizes an index of XML data and includes creating a primary table structure, creating a primary XML index commensurate with the primary table structure, populating the primary table and the primary XML index, and running a query on the XML data in a primary table by utilizing the XML index. The XML index can be implemented as a node table. The node table may have a B+-tree structure and be populated by shredding the XML values in the primary table. The XML data may be stored as binary large objects in an XML column of the primary table. Secondary XML indexes may be created to assist in the search and retrieval of XML data stored in the primary table. Both the primary XML index and the secondary XML index tables may be created using data definition language statements.
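To make the notion of shredding concrete, here is a minimal sketch that flattens an XML value into node-table rows. The row layout `(node_id, parent_id, tag, text)` is an assumption chosen for illustration, not the layout described in the source, and a plain Python list stands in for the B+-tree.

```python
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Return node-table rows (node_id, parent_id, tag, text) in document order."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1  # ids are assigned in document order
        rows.append((node_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows


# Hypothetical XML value taken from one row of a primary table.
rows = shred("<book><title>XML</title><author><name>Ann</name></author></book>")

# A query can then run against the node table instead of reparsing the XML.
title_values = [text for (_, _, tag, text) in rows if tag == "title"]
```

Once the rows exist, a lookup by tag or by parent becomes an ordinary table scan or index probe rather than a parse of the stored XML value.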
TL;DR: In QRS, reefs (regions expressed by floating-point numbers), a variant of regions, are used for expressing node-numbers, and thus they can be used for detecting ancestor-descendant relationship among nodes for the purpose of efficient query processing.
Abstract: Update management of XML documents is an increasingly important research issue in XML databases, because contents of XML documents evolve as time goes by. Even so, XML databases should be able to effectively process XML queries as well as updates on the documents. We propose a robust node-numbering scheme for XML documents named QRS (quartering-regions scheme). In QRS, reefs (regions expressed by floating-point numbers), a variant of regions, are used for expressing node-numbers. Reefs are almost compatible with regions, and thus they can be used for detecting ancestor-descendant relationships among nodes for the purpose of efficient query processing. Moreover, reefs can cope with updates by utilizing gaps between reefs in terms of floating-point numbers. Consequently, we can avoid node renumbering as much as possible.
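The two properties mentioned above, containment tests and insertion into gaps, can be sketched as follows. This is a generic region-numbering illustration, not the QRS definition: the `Region` class and the gap-splitting rule in `insert_in_gap` (use the middle half of the gap) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: float
    end: float


def is_ancestor(a: Region, d: Region) -> bool:
    """True if region a strictly contains region d."""
    return a.start < d.start and d.end < a.end


def insert_in_gap(left: float, right: float) -> Region:
    """Allocate a new region inside the open gap (left, right).

    Hypothetical rule: take the middle half of the gap, leaving a quarter
    free on each side for later insertions.
    """
    if not left < right:
        raise ValueError("gap must be non-empty")
    width = right - left
    return Region(left + width / 4, right - width / 4)


root = Region(0.0, 100.0)
first = Region(10.0, 20.0)
second = Region(30.0, 40.0)

# Insert a new sibling between `first` and `second` without renumbering either.
new = insert_in_gap(first.end, second.start)
```

Because the endpoints are floating-point numbers, a gap can be subdivided repeatedly, although finite precision eventually limits how often this works before renumbering is needed.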
TL;DR: It is shown that while most XML path query processing techniques work off SAX events, in some cases it pays off to preprocess the input document, augmenting it with auxiliary information that can be used to evaluate the queries faster.
Abstract: XML path queries form the basis of complex filtering of XML data. Most current XML path query processing techniques can be divided into two groups. Navigation-based algorithms compute results by analyzing an input document one tag at a time. In contrast, index-based algorithms take advantage of precomputed numbering schemes over the input XML document. We introduce a new index-based technique, index-filter, to answer multiple XML path queries. Index-filter uses indexes built over the document tags to avoid processing large portions of the input document that are guaranteed not to be part of any match. We analyze index-filter and compare it against Y-filter, a state-of-the-art navigation-based technique. We show that both techniques have their advantages, and we discuss the scenarios under which each technique is superior to the other one. In particular, we show that while most XML path query processing techniques work off SAX events, in some cases it pays off to preprocess the input document, augmenting it with auxiliary information that can be used to evaluate the queries faster. We present experimental results over real and synthetic XML documents that validate our claims.
TL;DR: In this paper, a system and method for validating an extensible markup language (XML) document and reporting schema violations in real-time is presented, where a parallel tree is maintained that includes nodes corresponding to non-native XML elements of the XML document.
Abstract: A system and method for validating an extensible markup language (XML) document and reporting schema violations in real time. A parallel tree is maintained that includes nodes corresponding to non-native XML elements of the XML document. When changes occur to the XML document, the non-native XML elements corresponding to the changes are marked. The nodes corresponding to the marked non-native XML elements are validated against an XML schema that corresponds to the non-native XML markup. The elements and nodes corresponding to errors in the non-native XML markup are then reported to the user according to display indicators in the XML document and the parallel tree.
TL;DR: A semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas, with particular relevance to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships.
Abstract: Web warehousing plays a key role in providing the managers with up-to-date and comprehensive information about their business domain. On the other hand, since XML is now a standard de facto for the exchange of semi-structured data, integrating XML data into web warehouses is a hot topic. In this paper we propose a semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas. In the proposed methodology, design is carried out by first creating a schema graph, then navigating its arcs in order to derive a correct multidimensional representation. Differently from previous approaches in the literature, particular relevance is given to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships. The approach is implemented in a prototype that reads an XML Schema and produces in output the logical schema of the warehouse.
TL;DR: In this article, a system and method for XML query cursor implementation through the steps of query translation and processing, query result navigation, and positioned update is described. Given a user's navigation patterns, the system and method select either a multi-cursor, outer union, or hybrid approach as an optimal implementation for an XML query cursor.
Abstract: A system and method are provided for XML query cursor implementation through the steps of query translation and processing, query result navigation, and positioned update. An XML query cursor implemented in Interface Definition Language (IDL) as well as an extension to XQuery, an XML query language, is described. These steps are addressed by one of three approaches: multi-cursor, outer union, or hybrid. In each approach, XML data is assumed to be stored in a relational database with a mapping that maps each element to a row in a relational database table. In each approach, a system and method provide for cursor movements and positioned updates in increments of a node, sub-tree, or entire document. Given a user's navigation patterns, a system and method is provided to select either a multi-cursor, outer union, or hybrid approach as an optimal implementation for an XML query cursor.
TL;DR: In this paper, a pre-boot execution environment for XML-based security and key management services is described, where XML console in and console out interfaces are loaded, and corresponding API's are published to enable use of the interfaces by various firmware and software components.
Abstract: Methods and systems to support XML-based security and key management services in a pre-boot execution environment. During pre-boot, XML console in and console out interfaces are loaded, and corresponding APIs are published to enable use of the interfaces by various firmware and software components. A network stack is set up to enable XML content received at the network interface to be forwarded to the XML console in interface and XML content provided at the XML content out interface to be sent out via the network interface. Security operations may then be performed to authenticate a client system hosting the XML interfaces, to authenticate remote servers with which the client system may communicate, and to validate boot images provided to the computer system. Key management services are also supported.
TL;DR: A temporal XML query language, τXQuery, is presented, in which valid time support is added to XQuery by minimally extending the syntax and semantics of XQuery and by adopting a stratum approach which maps a τXQuery query to a conventional XQuery.
Abstract: As with relational data, XML data changes over time with the creation, modification, and deletion of XML documents. Expressing queries on time-varying (relational or XML) data is more difficult than writing queries on nontemporal data. In this paper, we present a temporal XML query language, τXQuery, in which we add valid time support to XQuery by minimally extending the syntax and semantics of XQuery. We adopt a stratum approach which maps a τXQuery query to a conventional XQuery. The paper focuses on how to perform this mapping, in particular, on mapping sequenced queries, which are by far the most challenging. The critical issue of supporting sequenced queries (in any query language) is time-slicing the input data while retaining period timestamping. Timestamps are distributed throughout an XML document, rather than uniformly in tuples, complicating the temporal slicing while also providing opportunities for optimization. We propose four optimizations of our initial maximally-fragmented time-slicing approach: selected node slicing, copy-based per-expression slicing, in-place per-expression slicing, and idiomatic slicing, each of which reduces the number of constant periods over which the query is evaluated. While performance tradeoffs clearly depend on the underlying XQuery engine, we argue that there are queries that favor each of the five approaches.
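The notion of constant periods used above can be illustrated with a small sketch. It assumes half-open `(begin, end)` timestamps and simply cuts the timeline at every boundary that appears in the data, which corresponds loosely to a maximally fragmented slicing; it is not the paper's implementation, and the function name `constant_periods` is hypothetical.

```python
def constant_periods(periods):
    """Split the timeline at every period boundary.

    `periods` is an iterable of half-open (begin, end) pairs. Within each
    returned slice, the set of periods that are valid does not change.
    Slices that fall in a gap between input periods are also returned.
    """
    bounds = sorted({t for period in periods for t in period})
    return list(zip(bounds, bounds[1:]))


# Hypothetical valid-time timestamps found on three elements of a document.
stamps = [(1, 5), (3, 8), (5, 10)]
slices = constant_periods(stamps)
```

A sequenced query would then be evaluated once per slice, which is why reducing the number of slices, as the four optimizations aim to do, lowers the evaluation cost.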
TL;DR: XQBE is a dialect of XQuery inspired by QBE (query by example), the user-friendly query language supported by MS Access; XQBE starts from hierarchical structures, coherent with the hierarchical nature of XML.
Abstract: XQuery, the standard query language for XML, is increasingly popular among computer scientists with a SQL background, since queries in XQuery and SQL require comparable skills to be formulated. However, the number of these experts is limited, and the availability of easier XQuery "dialects" could be extremely valuable. With this motivation in mind, we designed XQBE, a dialect of XQuery inspired by the QBE language (query by example). QBE, initially proposed as an alternative to SQL, has then become popular as the user-friendly query language supported by MS Access. XQBE starts from hierarchical structures, coherent with the hierarchical nature of XML, and uses one or more structures to denote the input documents, and one structure to denote the XML document produced in output. These structures are annotated to express selection predicates; explicit bindings connecting the nodes of these structures visualize the input/output mappings.
TL;DR: In this article, a document authoring system is disclosed, consisting of a server including an authoring management program to prepare a document including tables, and terminals connected via the Internet and including browsing software (browser).
Abstract: There is disclosed a document authoring system mainly comprising a server including an authoring management program to prepare a document including tables, and terminals connected via the Internet and including browsing software (browser). The authoring management program comprises a data management function portion which interprets instruction data concerning change/addition sent from these terminals and attaches a unique identification number to the data, an editing management function portion (DOM of XML) which adds/changes XML format data by a request from the terminal, a data conversion function portion (XSLT, CSS, XML parser) which converts the XML format data into HTML format data, a script tool (script group) which supplies an auxiliary operation function to the terminal, and a publication function portion (Web browser, server program) which publishes the HTML format data on the Internet in an accessible mode.