TL;DR: In this paper, an extensible binary mark-up language for XML-based data storage and communications has been proposed, which is compatible with existing XML standards and provides significantly improved efficiencies for XML based data storage, particularly for narrow and low bandwidth communication media.
Abstract: An extensible binary mark-up language is disclosed that is compatible with existing XML standards yet provides significantly improved efficiencies for XML-based data storage and communications, particularly for narrow and low bandwidth communication media. A corresponding extensible non-binary mark-up language is also disclosed that is compatible with the XML standard. This dual-representation common message format (CMF) allows standard XML tools to be utilized in viewing and editing XML-based data and allows a CMF parser to be utilized to convert the XML formatted information into an extensible binary representation for actual communication through a medium or storage on a wide range of media. Advantages include a very compact, yet flexible and extensible binary data representation (CMF-B) for a corresponding extensible mark-up language (CMF-X), a data packaging scheme that allows for the effective transport of XML-based data over existing data channels, including narrow-bandwidth channels that utilize existing network protocols, and a CMF parser that allows for seamless conversion between CMF-B and CMF-X.
TL;DR: This paper proposes an algorithm named Hash Count to find ELCA (Exclusive LCA) semantics, which is first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou, and compares it with the state-of-the-art algorithms.
Abstract: Keyword search is integrated in many applications on account of the convenience to convey users' query intention. Recently, answering keyword queries on XML data has drawn the attention of web and database communities, because the success of this research will relieve users from learning complex XML query languages, such as XPath/XQuery, and/or knowing the underlying schema of the queried XML data. As a result, information in XML data can be discovered much more easily. To model the result of answering keyword queries on XML data, many LCA (lowest common ancestor) based notions have been proposed. In this paper, we focus on ELCA (Exclusive LCA) semantics, which was first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou. We propose an algorithm named Hash Count to find ELCAs efficiently. Our analysis shows the complexity of the Hash Count algorithm is O(kd|S1|), where k is the number of keywords, d is the depth of the queried XML document and |S1| is the frequency of the rarest keyword. This complexity is the best result known so far. We also evaluate the algorithm on a real DBLP dataset, and compare it with the state-of-the-art algorithms. The experimental results demonstrate the advantage of the Hash Count algorithm in practice.
TL;DR: jmzML, a Java API for the Proteomics Standards Initiative mzML data standard, can handle arbitrarily large files in minimal memory, allowing easy and efficient processing of mzML files using the Java programming language.
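To make the ELCA semantics concrete, the following is a brute-force sketch only; it is not the Hash Count algorithm and does not achieve the O(kd|S1|) bound. It assumes nodes are identified by Dewey labels written as Python tuples (for example `(0, 1)` is the second child of the root), and the function names are hypothetical. A node counts as an ELCA if, for every keyword, it still has an occurrence left after discarding occurrences that sit inside a child subtree already containing all keywords.

```python
from itertools import chain


def contains_all(node, keyword_lists):
    """True if the subtree rooted at `node` holds at least one hit per keyword."""
    n = len(node)
    return all(any(hit[:n] == node for hit in hits) for hits in keyword_lists)


def is_witness(v, hit, keyword_lists):
    """True if `hit` lies under `v` and is not absorbed by a child of `v`
    whose subtree already contains every keyword."""
    n = len(v)
    if hit[:n] != v:
        return False            # hit is outside v's subtree
    if len(hit) == n:
        return True             # the hit is v itself
    child = hit[:n + 1]         # child of v on the path down to the hit
    return not contains_all(child, keyword_lists)


def naive_elca(keyword_lists):
    """Return all ELCA nodes (as Dewey tuples) in document order."""
    # Candidates: every ancestor-or-self of some keyword occurrence.
    candidates = {hit[:i]
                  for hit in chain.from_iterable(keyword_lists)
                  for i in range(1, len(hit) + 1)}
    return [v for v in sorted(candidates)
            if all(any(is_witness(v, h, keyword_lists) for h in hits)
                   for hits in keyword_lists)]
```

With hits `[(0, 0, 0), (0, 1)]` for the first keyword and `[(0, 0, 1), (0, 2)]` for the second, both `(0, 0)` and the root `(0,)` qualify: the root keeps the witnesses `(0, 1)` and `(0, 2)` once the occurrences under `(0, 0)` are excluded.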
Abstract: We here present jmzML, a Java API for the Proteomics Standards Initiative mzML data standard. Based on the Java Architecture for XML Binding and XPath-based XML indexer random-access XML parser, jmzML can handle arbitrarily large files in minimal memory, allowing easy and efficient processing of mzML files using the Java programming language. jmzML also automatically resolves internal XML references on-the-fly. The library (which includes a viewer) can be downloaded from http://jmzml.googlecode.com.
TL;DR: The semantics of query answering in this setting is defined and an algorithm for inferring precise mapping rules from informal schema correspondences is developed, which handles an expressive fragment of XQuery and works both along and against the direction of mapping rules.
Abstract: Peers in a peer-to-peer data management system often have heterogeneous schemas and no mediated global schema. To translate queries across peers, we assume each peer provides correspondences between its schema and a small number of other peer schemas. We focus on query reformulation in the presence of heterogeneous XML schemas, including data-metadata conflicts. We develop an algorithm for inferring precise mapping rules from informal schema correspondences. We define the semantics of query answering in this setting and develop a query translation algorithm. Our translation handles an expressive fragment of XQuery and works both along and against the direction of mapping rules. We describe the HePToX heterogeneous P2P XML data management system which incorporates our results. We report the results of extensive experiments on HePToX on both synthetic and real datasets. We demonstrate our system utility and scalability on different P2P distributions.
TL;DR: In this article, horizontal and vertical fragmentation techniques are generalised from the relational datamodel to XML and splitting is introduced as a third kind of fragmentation, and it is shown how relational techniques for defining reasonable fragments can be applied to the case of XML.
Abstract: The world-wide web (WWW) is often considered to be the world's largest database and the eXtensible Markup Language (XML) is then considered to provide its datamodel. Adopting this view we have to deal with a distributed database. This raises the question, how to obtain a suitable distribution design for XML documents. In this paper horizontal and vertical fragmentation techniques are generalised from the relational datamodel to XML. Furthermore, splitting will be introduced as a third kind of fragmentation. Then it is shown how relational techniques for defining reasonable fragments can be applied to the case of XML.
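The abstract does not give the formal definitions, so the following is only an illustrative sketch by analogy with relational selection and projection. It assumes a flat document whose root holds repeated record-like children; the function names and the choice to keep each record's attributes in the vertical fragment (so fragments could be re-joined on a key) are assumptions, and the paper's third technique, splitting, is not shown.

```python
import copy
import xml.etree.ElementTree as ET


def horizontal_fragments(root, predicate):
    """Selection-like: route each child subtree to one of two fragments."""
    matching = ET.Element(root.tag)
    rest = ET.Element(root.tag)
    for child in root:
        target = matching if predicate(child) else rest
        target.append(copy.deepcopy(child))
    return matching, rest


def vertical_fragment(root, keep_tags):
    """Projection-like: keep every record but only the listed sub-elements."""
    fragment = ET.Element(root.tag)
    for child in root:
        # Attributes (e.g. an id) are copied so fragments can be re-joined.
        record = ET.SubElement(fragment, child.tag, child.attrib)
        for sub in child:
            if sub.tag in keep_tags:
                record.append(copy.deepcopy(sub))
    return fragment
```

For example, `horizontal_fragments(root, lambda p: p.findtext("city") == "X")` would place the records for one city at one site, while `vertical_fragment(root, {"name"})` would publish only the `name` sub-elements of every record.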
TL;DR: This survey paper presents an overview of the different proposals that use XML within data warehousing technology, which range from using XML data sources for regular warehouses to those using full XML warehousing solutions.
TL;DR: The paper evaluates the claims made by Google by developing an algorithm to map an existing XML document to the Protocol Buffer format and drawing conclusions on the efficiency and effectiveness of this format as compared to XML.
Abstract: The world is shrinking each day through the use of the Internet, and people are communicating better than before in this widely distributed network. There is a great need to manage this communication over various networks supporting different specifications. One of the widely used techniques for this type of data management is the XML data interchange format. Google developers recently introduced Protocol Buffer as an alternative to XML, claiming that it overcomes the shortcomings suffered by XML. This paper compares the XML and Protocol Buffer data formats by extensive analysis of the two. The paper evaluates the claims made by Google by developing an algorithm to map an existing XML document to the Protocol Buffer format and drawing conclusions on the efficiency and effectiveness of this format as compared to XML. It is hoped that this work will contribute to the upcoming research in this field, as people are looking for a more robust data interchange format for the future of the Internet.
TL;DR: The design of the first complete field programmable gate array (FPGA) accelerator capable of XML well-formed checking, schema validation, and tree construction at a throughput of 1 cycle per byte (CPB) is detailed.
Abstract: Extensible Markup Language (XML) is playing an increasingly important role in web services and database systems. However, the task of XML parsing is often the bottleneck, and as a result, the target of acceleration using custom hardware or multicore CPUs. In this paper, we detail the design of the first complete field programmable gate array (FPGA) accelerator capable of XML well-formed checking, schema validation, and tree construction at a throughput of 1 cycle per byte (CPB). This is a significant advancement from 40 CPB, the best previously reported commercial result. We demonstrate our design on a Xilinx Virtex-5 board, which successfully saturates a 1 Gbps Ethernet link.
TL;DR: This article proposes a novel distributed index structure and a clustering strategy for streaming XML data that enables energy- and latency-efficient broadcasting of XML data and demonstrates that it is effective for wireless broadcasting of XML data and outperforms the previous methods.
Abstract: In this article, we address the problem of delayed query processing raised by tree-based index structures in wireless broadcast environments, which increases the access time of mobile clients. We propose a novel distributed index structure and a clustering strategy for streaming XML data that enables energy- and latency-efficient broadcasting of XML data. We first define the DIX node structure to implement a fully distributed index structure which contains the tag name, attributes, and text content of an element, as well as its corresponding indices. By exploiting the index information in the DIX node stream, a mobile client can access the stream with shorter latency. We also suggest a method of clustering DIX nodes in the stream, which can further enhance the performance of query processing in the mobile clients. Through extensive experiments, we demonstrate that our approach is effective for wireless broadcasting of XML data and outperforms the previous methods.
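As a back-of-the-envelope check on what cycles per byte means for a 1 Gbps link, the sketch below computes the minimum clock rate a single parsing pipeline would need to keep up with the wire. It is only arithmetic under stated assumptions: it ignores Ethernet framing overhead, assumes one byte stream processed by one pipeline, and says nothing about the clock rate the authors actually used. The function name is hypothetical.

```python
def min_clock_hz(link_bits_per_s, cycles_per_byte):
    """Clock rate needed to consume bytes as fast as the link delivers them."""
    bytes_per_s = link_bits_per_s / 8
    return bytes_per_s * cycles_per_byte


ONE_GBPS = 1_000_000_000

# At 1 CPB a 1 Gbps stream needs at least 125 MHz of parsing clock.
at_1_cpb = min_clock_hz(ONE_GBPS, 1)

# At 40 CPB the same stream would need 5 GHz on a single pipeline.
at_40_cpb = min_clock_hz(ONE_GBPS, 40)
```

Under these assumptions, a parser running at 40 CPB could not keep up with a 1 Gbps link on a single pipeline at typical clock rates, whereas 1 CPB needs only 125 MHz.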
TL;DR: This thesis describes the design of a full-fledged XML storage and query architecture, which represents the core of the Open Source database system BASEX, and introduces a survey on state-of-the-art XML query languages.
Abstract: After its introduction in 1998, XML has quickly emerged as the de facto exchange format for textual data. Only ten years later, the amount of information that is being processed day by day, locally and globally, has virtually exploded, and no end is in sight. Correspondingly, many XML documents and collections have become much too large for being retrieved in their raw form – and this is where database technology gets into the game. This thesis describes the design of a full-fledged XML storage and query architecture, which represents the core of the Open Source database system BASEX. In contrast to numerous other works on XML processing, which either focus on theoretical aspects or practical implementation details, we have tried to bring the two worlds together: well-established and novel concepts from database technology and compiler construction are consolidated to a powerful and extensible software architecture that is supposed to both withstand the demands of complex real-life applications and comply with all the intricacies of the W3C Recommendations. In the Storage chapter, existing tree encodings are explored, which allow XML documents to be mapped to a database. The Pre/Dist/Size triple is chosen as the most suitable encoding and further optimized by merging all XML node properties into a single tuple, compactifying redundant information, and inlining attributes and numeric values. The address ranges of numerous large-scale and real-life XML instances are analyzed to find an optimal tradeoff between maximum document and minimum database size. The process of building a database is described in detail, including the import of tree data other than XML and the creation of main memory database instances. As one of the distinguishing features, the resulting storage is enriched by light-weight structural, value and full-text indexes, which speed up query processing by orders of magnitude.
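A Pre/Dist/Size encoding can be illustrated with a minimal sketch, assuming a simplified reading of the triple: `pre` is the position of a node in document (preorder) order, `dist` is the difference between a node's `pre` value and that of its parent, and `size` is the number of nodes in the subtree including the node itself. The sketch covers element nodes only and gives the root a `dist` of 0; the handling of document, attribute and text nodes, and the tuple compaction described in the thesis, are deliberately left out.

```python
import xml.etree.ElementTree as ET


def pre_dist_size(xml_text):
    """Return one row per element with its pre, dist and size values."""
    rows = []

    def visit(elem, parent_pre):
        pre = len(rows)                      # next free preorder position
        row = {
            "pre": pre,
            "dist": 0 if parent_pre is None else pre - parent_pre,
            "size": 1,
            "name": elem.tag,
        }
        rows.append(row)
        for child in elem:
            visit(child, pre)
        row["size"] = len(rows) - pre        # self plus all descendants

    visit(ET.fromstring(xml_text), None)
    return rows
```

With this layout the parent of a row sits at `pre - dist`, and the descendants of a row occupy the contiguous range from `pre + 1` to `pre + size - 1`, which is what makes a flat table usable for tree navigation.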
The Querying chapter is introduced with a survey on state of the art XML query languages. We give some insight into the design of an XQuery processor and then focus on the optimization of queries. Beside classical concepts, such as constant folding or static typing, many optimizations are specific to XML: location paths are rewritten to access less XML nodes, and FLWOR expressions are reorganized to reduce the algorithmic com-
TL;DR: This collection represents an understanding of XML processing technologies in connection with both advanced applications and the latest XML processing technologies that is of primary importance.
Abstract: Advanced Applications and Structures in XML Processing: Label Streams, Semantics Utilization and Data Query Technologies reflects the significant research results and latest findings of scholars worldwide, working to explore and expand the role of XML. This collection represents an understanding of XML processing technologies in connection with both advanced applications and the latest XML processing technologies that is of primary importance. It provides the opportunity to understand topics in detail and discover XML research at a comprehensive level.
TL;DR: A new relational schema named XLight is introduced for storing and processing XML data, and it is compared with some similar methods.
Abstract: Because of increasing use of XML data on the internet, the need for an efficient method of storing and querying XML data is vital. So far, two major types of system for XML data management have been introduced: XML Enabled systems and XML native systems. The former uses relational database system for storing and querying XML data and the latter is a special XML database system which is based on XML data model. Since relational database systems are more mature than XML native systems, it seems that the use of abilities and efficiencies of relational systems is more economical. In this article, we have introduced a new relational schema named XLight for storing and processing XML data, and we have compared it with some similar methods.
TL;DR: An overview of the recent research in dynamic XML labelling schemes is provided and a set of properties are defined that represent a more holistic dynamic labelling scheme and presented through an evaluation matrix for most of the existing schemes that provide update functionality.
Abstract: The adoption of XML as the default data interchange format and the standardisation of the XPath and XQuery languages has resulted in significant research in the development and implementation of XML databases capable of processing queries efficiently. The ever-increasing deployment of XML in industry and the real-world requirement to support efficient updates to XML documents has more recently prompted research in dynamic XML labelling schemes. In this paper, we provide an overview of the recent research in dynamic XML labelling schemes. Our motivation is to define a set of properties that represent a more holistic dynamic labelling scheme and present our findings through an evaluation matrix for most of the existing schemes that provide update functionality.
TL;DR: This work describes techniques for fast and scalable generation and aggregation of XML data, in which each result of an XML query is represented by a data structure that stores locators pointing to fragments in the XML documents rather than the tags or the fragments themselves, and is serialized only on demand.
Abstract: Techniques for fast and scalable generation and aggregation of XML data are described. In an example embodiment, an XML query that requests data from XML documents is received. The XML query is evaluated to determine one or more XML results. For each particular XML result, evaluating the XML query comprises: instantiating a particular data structure that represents the particular XML result, where the particular data structure is encoded in accordance with tags specified in the XML query but does not store the tags; and storing, in the particular data structure, one or more locators that respectively point to one or more fragments in the XML documents, where the particular data structure stores the one or more locators but does not store the one or more fragments. On demand, in response to a request indicating the particular XML result, a serialized representation of the particular XML result is generated based at least on the particular data structure.
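One possible reading of this scheme is sketched below. All names are hypothetical and the layout is an assumption, not the patented design: the tags named in the query live once in a shared template, each result holds only locators (document id plus character offsets), and text is copied out of the source documents only when a serialized result is requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """Points at a fragment of a source document without copying it."""
    doc_id: str
    start: int
    end: int


def serialize(template, locators, documents):
    """Build the serialized result on demand.

    `template` lists the wrapper tags from the query, in order; the result
    itself is just `locators`, aligned positionally with the template.
    """
    parts = []
    for tag, loc in zip(template, locators):
        fragment = documents[loc.doc_id][loc.start:loc.end]
        parts.append(f"<{tag}>{fragment}</{tag}>")
    return "".join(parts)
```

Because a result is only a short list of offsets, many results can be held or aggregated cheaply, and the cost of building strings is paid only for the results that are actually requested.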
TL;DR: In this article, the authors propose an architecture that extends conventional computer programming languages that compile into an instance of an extensible markup language (XML) document object model (DOM) to provide support for XML literals in the underlying programming language.
Abstract: An architecture that extends conventional computer programming languages that compile into an instance of an extensible markup language (XML) document object model (DOM) to provide support for XML literals in the underlying programming language. This architecture facilitates a convenient short cut by replacing the complex explicit construction required by conventional systems to create an instance of a DOM with a concise XML literal which conventional compilers can translate into the appropriate code. The architecture allows these XML literals to be embedded with expressions, statement blocks or namespaces to further enrich the power and versatility. In accordance therewith, context information describing the position and data types that an XML DOM can accept can be provided to the programmer via, for example, an integrated development environment. Additionally, the architecture supports escaping XML identifiers, a reification mechanism, and a conversion mechanism to convert between collections and singletons.
TL;DR: An implementation of a three-way XML merge algorithm that is faster, uses less memory and is more precise than existing tools is presented and a graphical interface for visualizing and resolving conflicts is provided.
Abstract: XML has become the standard document representation for many popular tools in various domains. When multiple authors collaborate to produce a document, they must be able to work in parallel and periodically merge their efforts into a single work. While there exist a small number of three-way XML merging tools, their performance could be improved in several areas and they lack any form of user interface for resolving conflicts. In this paper, we present an implementation of a three-way XML merge algorithm that is faster, uses less memory and is more precise than existing tools. It uses a specialized versioning tree data structure that supports node identity and change detection. The algorithm applies the traditional three-way merge found in GNU diff3 to the children of changed nodes. The editing operations it supports are addition, deletion, update, and move. A graphical interface for visualizing and resolving conflicts is also provided. An evaluation experiment was conducted comparing the proposed algorithm with three other tools on randomly generated XML data.
TL;DR: This research concludes that for XML-based data, a doubling of bandwidth potential is achievable and CPU burdens minimized when EXI is applied.
Abstract: The Department of Defense (DoD) Network-Centric data sharing strategy for the Global Information Grid (GIG) is to XMLize all data. The goal of this strategy is to ensure all data is visible, usable and interoperable, when and where needed, to accelerate decision cycles. However, this XML-based data approach comes at the cost of limiting real-time network edge device connectivity to the GIG because they are seldom able to meet the necessary bandwidth and processing requirements due to XML's intrinsic nature of being verbose and often complex to process. This research explores a powerful and robust solution to XML's network depth limits by means of the World Wide Web Consortium's (W3C) proposed alternative XML format, Efficient XML Interchange (EXI). The EXI format removes redundant tags and values from XML documents and encodes numeric content in a binary format. This format delivers significant file size savings and processing efficiencies compared to existing practices. The evolution of XML's path to EXI is summarized based on the results of the XML Binary Characterization (XBC) working group and the W3C's design points of XML. This is followed by recommended steps for EXI development and enterprise integration, focusing on a public open source licensing philosophy. EXI algorithms are described with detailed explanations, Java code samples, and part-task test XML documents. Experiments are conducted evaluating the effectiveness of EXI for DoD tactical use, followed by a recommended optimal EXI configuration. Several predictive models of EXI's performance are presented to give potential EXI adopters a tool for measuring the expected EXI benefit for various XML domains. This research concludes that for XML-based data, a doubling of bandwidth potential is achievable and CPU burdens minimized when EXI is applied.
TL;DR: This paper presents the methodology for XQuery query processing over distributed XML databases, which comprises the steps of query decomposition, data localization, and global optimization.
Abstract: The increasing volume of data stored as XML documents makes fragmentation techniques an alternative to the performance issues in query processing. Fragmented databases are feasible only if there is a transparent way to query the distributed database. Fragments allow for intra-query parallel processing and data reduction. This paper presents our methodology for XQuery query processing over distributed XML databases. The methodology comprises the steps of query decomposition, data localization, and global optimization. This methodology can be used in an XML database or in a system that publishes homogeneous views of semi-autonomous databases. An implementation has been done and experimental results can achieve performance improvements of up to 95% when compared to the centralized environment.
TL;DR: This paper studies the performance evaluation of storing XML documents into relational databases and identifies which mapping approach is best suited for which business environment.
Abstract: XML has emerged as the standard for information representation over the Internet. It is critical to store and query XML data to exploit the full power of the new technology. However, most enterprises today have long secured the use of relational databases. Thus, simply replacing relational databases with a pure XML database is not a good choice. It is thus crucial to map XML data into relational data. This paper studies the performance evaluation of storing XML documents into relational databases and identifies which mapping approach is best suited for which business environment. The performance results for all approaches are presented and a number of interesting results obtained from these evaluations are highlighted.
TL;DR: This paper formulates the XML signature and encryption as the core of web services security technology, and describes how to create and verify XML signature, how to encrypt and decrypt XML data.
Abstract: With the development of web services applications, some issues of web services security are increasingly prominent. As a platform-independent language, XML is widely used for its high extensibility. After analyzing traditional web services security technology, this paper formulates XML signature and encryption as the core of web services security technology, and describes how to create and verify an XML signature and how to encrypt and decrypt XML data. The application of XML signature and encryption in Web services security is illustrated.
TL;DR: A set of transformation rules are specified that are able to automatically generate not only the corresponding XML structure of the DW from secure conceptual DW models, but also the security rules specified within the DW XML structure, thus allowing us to implement both aspects simultaneously.
Abstract: Data Warehouses (DWs) are currently considered to be the cornerstone of Business Intelligence (BI) systems. Security is a key issue in DWs since the business information that they manage is crucial and highly sensitive, and should be carefully protected. However, the increasing amount of data available on the Web signifies that more and more DW systems are considering the Web as the primary data source through which to populate their DWs. XML is therefore widely accepted as being the principal means through which to provide easier data and metadata interchange among heterogeneous data sources from the Web and the DW systems. Although security issues have been considered during the whole development process of traditional DWs, current research lacks approaches with which to consider security when the target platform is based on the Web and XML technologies. The idiosyncrasy of the unstructured and semi-structured data available on the Web definitely requires particular security rules that are specifically tailored to these systems in order to permit their particularities to be captured correctly. In order to tackle this situation, in this paper, we propose a methodological approach based on the Model Driven Architecture (MDA) for the development of Secure XML DWs. We therefore specify a set of transformation rules that are able to automatically generate not only the corresponding XML structure of the DW from secure conceptual DW models, but also the security rules specified within the DW XML structure, thus allowing us to implement both aspects simultaneously. A case study is provided at the end of the paper to show the benefits of our approach.
TL;DR: In this article, an efficient linear algorithm for mapping XML data to relational data is proposed, which is based on the authors' previously proposed inlining algorithm for mapping DTDs to relational schemas and can be easily adapted to other inlining algorithms.
Abstract: XML has emerged as the standard for representing and exchanging data on the World Wide Web. It is critical to have efficient mechanisms to store and query XML data to exploit the full power of this new technology. Several researchers have proposed to use relational databases to store and query XML data. While several algorithms of schema mapping and query mapping have been proposed, the problem of mapping XML data to relational data, i.e., mapping an XML INSERT statement to a sequence of SQL INSERT statements, has not been addressed thoroughly in the literature. In this paper, we propose an efficient linear algorithm for mapping XML data to relational data. This algorithm is based on our previous proposed inlining algorithm for mapping DTDs to relational schemas and can be easily adapted to other inlining algorithms.
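The idea of turning one XML insert into a sequence of SQL INSERT statements can be sketched as follows. This is not the paper's algorithm: the `book`/`author` schema, the element names and the function names are all hypothetical, chosen to mimic what an inlining-style mapping might produce, with single-valued children stored as columns of the parent's table and repeated children stored in their own table with a foreign key. The element is visited once, so the number of statements grows linearly with the input.

```python
import sqlite3
import xml.etree.ElementTree as ET


def create_schema(conn):
    """Hypothetical relational schema for <book> elements."""
    conn.execute("CREATE TABLE book(id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
    conn.execute("CREATE TABLE author(book_id INTEGER REFERENCES book(id), name TEXT)")


def book_to_inserts(xml_text, book_id):
    """Map one <book> element to a list of (sql, params) pairs."""
    book = ET.fromstring(xml_text)
    # Single-valued children are inlined as columns of the book row.
    statements = [(
        "INSERT INTO book(id, title, year) VALUES (?, ?, ?)",
        (book_id, book.findtext("title"), int(book.findtext("year"))),
    )]
    # Repeated children become one row each in a separate table.
    for author in book.findall("author"):
        statements.append((
            "INSERT INTO author(book_id, name) VALUES (?, ?)",
            (book_id, author.findtext("name")),
        ))
    return statements


def insert_book(conn, xml_text, book_id):
    """Execute the generated statements inside one transaction."""
    with conn:
        for sql, params in book_to_inserts(xml_text, book_id):
            conn.execute(sql, params)
```

Emitting the parent row first keeps the foreign-key order valid, and using bound parameters rather than string concatenation avoids quoting problems in element text.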
TL;DR: A quadruple model based on XML node index codes is proposed as a spatio-temporal data storage model, and a Native XML Database (NXD) is used to store the spatio-temporal data.
Abstract: This article uses the XML/GML language to describe a feature-based spatio-temporal data model (FBSTDM), but traditional relational or object-oriented databases do not actually support storage of this semi-structured data. This paper proposes a quadruple model based on XML node index codes as the spatio-temporal data storage model and uses a Native XML Database (NXD) to store the spatio-temporal data.
TL;DR: A method for the semiautomatic transition from the design models of a Web application to a running implementation using the XML publishing framework Cocoon which provides a very flexible way to generate documents comprising XSLT and XSP processors.
Abstract: In this paper we present a method for the semiautomatic transition from the design models of a Web application to a running implementation. The design phase consists of constructing a set of UML models such as the conceptual model, the navigation model and the presentation model. We use the UML extension mechanisms, i.e. stereotypes, tagged values and OCL constraints, thereby defining a UML Profile for the Web application domain. We show how these design models can automatically be mapped to XML documents with a structure conforming to their respective XML Schema definitions. Further on we demonstrate techniques how XML documents for the conceptual model are automatically mapped to conceptual DOM objects (Document Object Model). DOM objects corresponding to interactional objects are automatically derived from conceptual DOM objects and/or other interactional DOM objects. The XSLT mechanism serves to transform the logical presentation objects representing the user interface to physical presentation objects, e.g. HTML or WAP pages. Finally we present a production system architecture for Web applications using the XML publishing framework Cocoon which provides a very flexible way to generate documents comprising XSLT and XSP (eXtensible server pages) processors.
TL;DR: This book aims to provide a single account of current studies in soft computing approaches to XML data management to provide the state of the art information to researchers, practitioners, and graduate students of the Web intelligence.
Abstract: This book covers in a great depth the fast growing topic of techniques, tools and applications of soft computing in XML data management. It is shown how XML data management (like model, query, integration) can be covered with a soft computing focus. This book aims to provide a single account of current studies in soft computing approaches to XML data management. The objective of the book is to provide the state of the art information to researchers, practitioners, and graduate students of the Web intelligence, and at the same time serving the information technology professional faced with non-traditional applications that make the application of conventional approaches difficult or impossible.
TL;DR: This paper proposes a general mediation framework to facilitate the storage of the new incoming data in XML format into the relational databases of the legacy health information systems and vice versa and has the capacity to preserve the integrity constraints of the relational schema.
Abstract: Providing a transparent and automatic communication between health information systems for the purpose of exchanging patients' data among healthcare professionals is deemed as one of the most challenging problems in eHealth. Indeed, data storage in health information systems is mainly performed in relational databases, whereas eXtensible Markup Language (XML) is seen as the de facto standard for exchanging data among health organizations. Automating data interchange between relational databases and XML documents remains however a challenge. In this paper, we propose a general mediation framework to facilitate the storage of the new incoming data in XML format into the relational databases of the legacy health information systems and vice versa. The proposed mediation architecture is based on the XML technology and its related languages and derivatives (XML Schema, eXtensible Stylesheet Language Transformations (XSLT)…), which provide powerful tools for sharing, converting and exchanging information. The adopted methodology consists in converting the database model into an XML schema and in performing an automatic, reliable and efficient mapping between the schemas representing the exchanged source and target data by means of the XSLT language. Our approach has the capacity to preserve the integrity constraints of the relational schema, which allows to check the XML infosets for anomalies or incoherencies before updating the relational database from the XML document. It also captures the hierarchy of the tables in the target database, which guarantees that the automatically generated Structured Query Language (SQL) queries will be correctly performed. Moreover, our mediator includes a rule base allowing a coherent and secure mapping between the exchanged data sources for ensuring the database integrity.
TL;DR: VIREX provides an interactive approach for querying and integrating relational databases to produce XML documents and the corresponding schemas and supports VRXQuery, which is a visual naive-users-oriented query language that allows users to specify queries and define views directly on the interactive diagram as a sequence of mouse clicks with minimum keyboard input.
Abstract: VIREX provides an interactive approach for querying and integrating relational databases to produce XML documents and the corresponding schemas. VIREX connects to each database specified by the user; analyzes the catalogue to derive an interactive diagram equivalent to the extended entity-relationship diagram; allows the user to display sample records from the tables in the database; allows the user to rename columns and relations by modifying directly the interactive diagram; facilitates the conversion of the relational database into XML; and derives the XML schema. VIREX works even when the catalogue of the relational database is missing; it extracts the required catalogue information by analyzing the database content. Further, VIREX supports VRXQuery, which is a visual naive-users-oriented query language that allows users to specify queries and define views directly on the interactive diagram as a sequence of mouse clicks with minimum keyboard input. The user is expected to interactively decide on certain factors to be considered in producing the XML result. Such factors include: 1) selecting the relations/attributes to be converted into XML; 2) specifying a predicate to be satisfied by the information to be converted into XML; 3) deciding on the order of nesting between the relations to be converted into XML; 4) ordering for the result. VRXQuery supports selection, projection, nesting/join, union, difference, and order-by. As the result of a query, VIREX displays on the screen the XML schema that satisfies the specified characteristics and generates colored (easy to read) XML document(s). Further, VIREX allows the user to display and review the SQL and XQuery equivalent to each query expressed in VRXQuery.
TL;DR: In this paper, column values that are to be stored for shredded XML documents are separately analyzed for a XML document to determine whether to store a particular column in column-major format or row major format, and what compression technique to use.
Abstract: A database server exploits the power of compression and a form of storing relational data referred to as column-major format, to store XML documents in shredded form. The column values that are to be stored for shredded XML documents are separately analyzed for a XML document to determine whether to store a particular column in column-major format or row-major format, and what compression technique to use, if any.
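The per-column decision can be illustrated with a small sketch. The abstract does not state the criteria used, so the heuristic here is an assumption: a column of shredded values is stored column-major with run-length encoding when its values form few runs, and is otherwise left row-major and uncompressed. The function names and the `max_run_ratio` threshold are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def choose_layout(rows, columns, max_run_ratio=0.5):
    """Decide a storage layout and compression scheme for each column.

    `rows` are tuples of shredded values; `columns` names each position.
    """
    plan = {}
    for index, name in enumerate(columns):
        column = [row[index] for row in rows]
        runs = run_length_encode(column)
        if len(runs) <= max_run_ratio * len(column):
            plan[name] = ("column-major", "rle", runs)
        else:
            plan[name] = ("row-major", None, column)
    return plan
```

A column holding the same element name for every shredded row compresses to a single run, while a column of distinct text values gains nothing from run-length encoding and is left as it is.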