TL;DR: An extensible programming framework is proposed that separates platform-specific optimizations from application codes and allows the performance of an existing application to be improved incrementally without cluttering the code.
Abstract: This paper proposes an extensible programming framework to separate platform-specific optimizations from application codes. The framework allows programmers to define their own code translation rules for the special demands of individual systems, compilers, libraries, and applications. Code translation rules associated with user-defined compiler directives are defined in an external file, and the application code is merely annotated with the directives. For code transformations based on the rules, the framework exposes the abstract syntax tree (AST) of an application code as an XML document to expert programmers. Hence, the XML document of an AST can be transformed using any XML-based technology. Our case studies using real applications demonstrate that the framework is effective at separating platform-specific optimizations from application codes and at incrementally improving the performance of an existing application without cluttering the code.
TL;DR: A bidirectional XML update language called BIFLUX (BIdirectional FunctionaL Updates for XML), inspired by the FLUX XML update language, is proposed, with a clear and well-behaved bidirectional semantics and a decidable static type system based on regular expression types.
Abstract: Different XML formats are widely used for data exchange and processing, and it is often necessary to convert between them in both directions. Standard XML transformation languages, like XSLT or XQuery, are unsatisfactory for this purpose since they require writing a separate transformation for each direction. Existing bidirectional transformation languages aim to close this gap by allowing programmers to write a single program that denotes both transformations. However, they often 1) induce a more cumbersome programming style than their traditionally unidirectional relatives, to establish the link between source and target formats, and 2) offer limited configurability, by making implicit assumptions about how modifications to both formats should be translated that may not be easy to predict. This paper proposes a bidirectional XML update language called BIFLUX (BIdirectional FunctionaL Updates for XML), inspired by the FLUX XML update language. Our language adopts a novel bidirectional programming by update paradigm, where a program succinctly and precisely describes how to update a source document with a target document, in an intuitive way, such that there is a unique "inverse" source query for each update program. BIFLUX extends FLUX with bidirectional actions that describe the connection between source and target formats. We introduce a core BIFLUX language, with a clear and well-behaved bidirectional semantics and a decidable static type system based on regular expression types.
TL;DR: A novel two-tier index structure is proposed to facilitate access to XML documents in an on-demand broadcast system; it provides the clients with an overall image of all the XML documents available at the server side and hence enables the clients to locate complete result sets accordingly.
Abstract: XML data broadcast is an efficient way to disseminate semistructured information in wireless mobile environments. In this paper, we propose a novel two-tier index structure to facilitate access to XML documents in an on-demand broadcast system. It provides the clients with an overall image of all the XML documents available at the server side and hence enables the clients to locate complete result sets accordingly. A pruning strategy is developed to cut down the index size, and a two-tier structure is proposed to further remove any redundant information. In addition, two index distribution strategies, namely naive distribution and partial distribution, have been designed to interleave the index information with the XML documents in the wireless channels. Theoretical analysis and simulation experiments are also presented to show the benefits of our indexing methods.
TL;DR: A template is introduced that describes the main components of an XML Matcher, their role and behavior, and illustrates how each of these components has been implemented in some popular XML Matchers.
Abstract: Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
TL;DR: This paper presents how to shift from XML encryption to JSON encryption; JSON is a lightweight data format that is interchangeable with a programming language's built-in data structures, which eliminates translation time and reduces complexity and processing time.
Abstract: JavaScript Object Notation (JSON) is a lightweight data-interchange format. It is easy for humans to read and write. Its data format is interchangeable with a programming language's built-in data structures, which eliminates translation time and reduces complexity and processing time. Moreover, JSON has the same strengths as XML. Therefore, we argue that it is better to shift from XML security to JSON security. In this paper, we present how to shift from XML encryption to JSON encryption.
TL;DR: This work proposes parallel versions of two prominent tree labeling schemes based on the MapReduce framework, and presents techniques for runtime workload balancing and data repartitioning to solve performance issues caused by data skewness and MapReduce's inherent limitations.
Abstract: The volume of XML data has become enormous and is still growing quickly, as much data is now encoded in XML by virtue of its simplicity and extensibility. While tree labeling algorithms play a crucial role in XML query processing, conventional algorithms are all sequential and therefore fail to label a large volume of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose parallel versions of two prominent tree labeling schemes based on the MapReduce framework. We then present techniques for runtime workload balancing and data repartitioning to solve performance issues caused by data skewness and MapReduce's inherent limitations. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms, while remaining robust against data skewness.
TL;DR: A formal theory of 8 plus 3 structural patterns for XML elements is introduced, and their identifiability in a number of different XML vocabularies is verified, allowing the creation of visualization and content extraction tools that are completely independent of the schema.
Abstract: Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights into the expected characteristics of a markup language, as well as any regularity that may span vocabularies and languages, and that are more fundamental and frequent than plain content models. In this paper we explore the idea of structural patterns in XML vocabularies, by examining the characteristics of elements as they are used, rather than as they are defined. We introduce from the ground up a formal theory of 8 plus 3 structural patterns for XML elements, and verify their identifiability in a number of different XML vocabularies. The results allowed the creation of visualization and content extraction tools that are completely independent of the schema and without any previous knowledge of the semantics and organization of the XML vocabulary of the documents.
TL;DR: A comparative analysis of the various schemes available to efficiently store and query the temporal and multi-versioned XML documents based on temporal, change management, versioning, and querying support is provided.
Abstract: Extensible Markup Language (XML) documents are associated with time in two ways: (1) XML documents evolve over time and (2) XML documents contain temporal information. The efficient management of the temporal and multi-versioned XML documents requires optimized use of storage and efficient processing of complex historical queries. This paper provides a comparative analysis of the various schemes available to efficiently store and query the temporal and multi-versioned XML documents based on temporal, change management, versioning, and querying support. Firstly, the paper studies the multi-versioning control schemes to detect, manage, and query change in dynamic XML documents. Secondly, it describes the storage structures used to efficiently store and retrieve XML documents. Thirdly, it provides a comparative analysis of the various commercial tools based on change management, versioning, collaborative editing, and validation support. Finally, the paper presents some future research and development directions for the multi-versioned XML documents.
TL;DR: An optimization approach that takes into consideration the semantics of the dataset in order to deal with the complexity of multi-disciplinary domains in Big Data, in particular when the data is represented as XML documents is adopted.
TL;DR: This chapter aims to give a reasonably comprehensive definition and motivation for the various aspects of the generic XML language and also to illustrate these aspects with some existing XML dialects or vocabularies.
Abstract: This chapter aims to give a reasonably comprehensive definition and motivation for the various aspects of the generic XML language and also to illustrate these aspects with some existing XML dialects or vocabularies. We describe elements, attributes, child elements, and the hierarchical structure of XML. We talk about “well-formedness” of an XML document and how to identify errors in a document’s structure. We discuss the use of namespaces and end with a brief discussion of validating documents with respect to DTDs and XML Schema. Readers already familiar with all aspects of XML can skip this chapter and read about the functions used to work with XML in R, which are the subject of each of Chapters 3, 4, 5, and 6.
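As a small illustration of the well-formedness check mentioned in this abstract, the following sketch uses Python's standard `xml.etree.ElementTree` parser rather than the R functions the chapter itself covers. The helper name `check_well_formed` and the two sample documents are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, message)."""
    try:
        ET.fromstring(text)
        return True, None
    except ET.ParseError as err:
        # ParseError.position is a (line, column) pair locating the problem.
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"


good = "<book id='1'><title>XML</title></book>"
bad = "<book><title>XML</book>"  # <title> is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

This only tests well-formedness; validating against a DTD or XML Schema needs a validating parser, which the standard library module does not provide.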
TL;DR: This paper adopts a novel dynamic encoding scheme, tailored for both static and dynamic possibilistic XML documents, to effectively avoid re-labeling after updates, and proposes an efficient algorithm to handle the problem of dynamic twig queries in possibilistic XML documents.
TL;DR: This paper gives formalized representations of XML data sources, including Document Type Definitions (DTDs), XML Schemas, and XML documents, proposes formal approaches for transforming the XML data sources into ontologies, discusses the correctness of the transformations, and provides several transformation examples.
Abstract: The eXtensible Markup Language (XML) has gained wide acceptance as the standard for representing and exchanging data on the Web. Unfortunately, XML covers the syntactic level but lacks semantics, and thus cannot be directly used for the Semantic Web. Currently, finding a way to utilize XML data for the Semantic Web is a challenging research problem. Ontologies can formally represent shared domain knowledge and enable semantic interoperability. Therefore, in this paper, we investigate how to represent and reason about XML with ontologies. Firstly, we give formalized representations of XML data sources, including Document Type Definitions (DTDs), XML Schemas, and XML documents. On this basis, we propose formal approaches for transforming the XML data sources into ontologies, and we also discuss the correctness of the transformations and provide several transformation examples. Furthermore, following the proposed approaches, we implement a prototype tool that can automatically transform XML into ontologies. Finally, we apply the transformed ontologies for reasoning about XML, so that some reasoning problems of XML may be checked by the existing ontology reasoners.
TL;DR: This paper compares two of the most popular data formats for web applications, JSON and XML, in terms of features, use, and performance testing results.
Abstract: Choosing an appropriate data format for your data makes a difference in programming and performance speed of your application. In this paper, we'll compare two of the most popular data formats for web applications, JSON and XML, in terms of features, use, and performance testing results.
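To make the feature comparison concrete, here is a minimal sketch that reads the same record from JSON and from XML using only Python's standard library. The record and its field names are invented for illustration and do not come from the paper.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
json_text = '{"title": "XML basics", "year": 2014, "tags": ["xml", "json"]}'
xml_text = (
    "<paper><title>XML basics</title><year>2014</year>"
    "<tags><tag>xml</tag><tag>json</tag></tags></paper>"
)

# JSON maps directly onto dicts, lists, strings, and numbers.
from_json = json.loads(json_text)

# XML yields a tree of elements; types and list structure are recovered by hand.
root = ET.fromstring(xml_text)
from_xml = {
    "title": root.findtext("title"),
    "year": int(root.findtext("year")),
    "tags": [tag.text for tag in root.findall("tags/tag")],
}

print(from_json == from_xml)
```

The extra mapping step on the XML side is the kind of translation overhead such comparisons usually point to; it says nothing about parsing speed, which would need a proper benchmark.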
TL;DR: This paper surveys the existing XML fragmentation approaches in literature, comparing their features and highlighting their drawbacks, and establishes a map of the area to establish a consensus in the database community as to what an XML fragment is.
Abstract: Efficient document processing is a must when large volumes of XML data are involved. In such critical scenarios, a well-known solution to this problem is to distribute (map) the data among several processing nodes, and then distribute the processing accordingly, taking advantage of parallelism. This is the approach taken by distributed databases and MapReduce environments. Fragmentation techniques play an important role in these scenarios. They provide a way to "cut" the database into pieces and distribute the pieces over a network. This way, queries can also be "cut" into sub-queries that run in parallel, thus achieving better performance when compared to the centralized environment. However, there is no consensus in the database community as to what an XML fragment is. In fact, several approaches in literature present definitions of XML fragments. In addition to query processing, using XML fragmentation techniques may also be helpful when managing XML documents distributed across the web or clouds. This paper surveys the existing XML fragmentation approaches in literature, comparing their features and highlighting their drawbacks. Our contribution resides in establishing a map of the area.
TL;DR: ReLab, a subtree-based labeling scheme that generates labels using depth-first traversal, is introduced; the evaluation indicates that ReLab outperformed the Dietz and region numbering schemes in terms of the time taken to generate labels for each XML node.
Abstract: XML has become the de facto standard for real-world applications over the WWW. Thus, efficient data and query processing is critical to ensure fast response times for user queries. Response time is often influenced by the complexity of the labeling scheme, which is used not only for unique identification of XML nodes but also for determining structural relationships. The labeling scheme adopted is vital to ensure that query processing is done correctly and promptly. In this paper, we introduce ReLab, a subtree-based labeling scheme which generates labels using depth-first traversal. Our experimental evaluation indicated that ReLab outperformed the Dietz and region numbering schemes in terms of the time taken to generate labels for each XML node.
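For readers unfamiliar with labeling schemes, the sketch below shows plain region numbering, one of the baselines named in this abstract, generated by a single depth-first traversal. It is not ReLab itself, whose details the abstract does not give; the function names and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def region_labels(root):
    """Assign each element a (start, end) label in one depth-first pass."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter          # number taken when the element is entered
        for child in node:
            visit(child)
        counter += 1             # number taken when the element is left
        labels[node] = (start, counter)

    visit(root)
    return labels


def is_ancestor(labels, a, b):
    """a is an ancestor of b iff a's region strictly contains b's region."""
    (a_start, a_end), (b_start, b_end) = labels[a], labels[b]
    return a_start < b_start and b_end < a_end


root = ET.fromstring("<a><b><c/></b><d/></a>")
labels = region_labels(root)
by_tag = {node.tag: label for node, label in labels.items()}
print(by_tag)
print(is_ancestor(labels, root, root[0][0]))   # a is an ancestor of c
print(is_ancestor(labels, root[0], root[1]))   # b is not an ancestor of d
```

With such labels a structural relationship is decided by two integer comparisons, which is why label generation time and label size are the usual evaluation criteria. The recursive traversal is fine for a sketch but would need an explicit stack for very deep documents.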
TL;DR: An XML data cube model which offers the complete views to observe XML data, and a basic algorithm to implement its building process on Hadoop is presented and an optimized algorithm more suitable for this kind of XML data is proposed.
Abstract: XML has become a widely used and well structured data format for digital document handling and message transmission. To find useful knowledge in XML data, data warehouse and OLAP applications aimed at providing supports for decision making should be developed. Apache Hadoop is an open source cloud computing framework that provides a distributed file system for large scale data processing. In this paper, we discuss an XML data cube model which offers us the complete views to observe XML data, and present a basic algorithm to implement its building process on Hadoop. To improve the efficiency, an optimized algorithm more suitable for this kind of XML data is also proposed. The experimental results given in the paper prove the effectiveness of our optimization strategies.
TL;DR: A novel Nearest Common Object Node semantics (NCON), which includes not just common object ancestors but also common object descendants is introduced, which outperforms the state-of-the-art approaches in terms of both effectiveness and efficiency.
Abstract: It is well known that some XML elements correspond to objects (in the sense of object-orientation) and others do not. The question we consider in this paper is what benefits we can derive from paying attention to such object semantics, particularly for the problem of keyword queries. Keyword queries against XML data have been studied extensively in recent years, with several lowest-common-ancestor based schemes proposed for this purpose, including SLCA, MLCA, VLCA, and ELCA. It can be seen that identifying objects can help these techniques return more meaningful answers than just the LCA node (or subtree) by returning objects instead of nodes. It is more interesting to see that object semantics can also be used to benefit the search itself. For this purpose, we introduce a novel Nearest Common Object Node semantics (NCON), which includes not just common object ancestors but also common object descendants. We have developed XRich, a system for our NCON-based approach, and used it in our extensive experimental evaluation. The experimental results show that our proposed approach outperforms the state-of-the-art approaches in terms of both effectiveness and efficiency.
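The LCA-based schemes listed in this abstract can be illustrated with a naive sketch of smallest-LCA (SLCA) keyword search over Dewey labels. It is not the NCON semantics or the XRich system, which the abstract does not specify; the function names, the two-keyword restriction, and the sample bibliography are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Map Dewey labels (tuples of child positions) to elements."""
    labels = {}
    stack = [(root, (1,))]
    while stack:
        node, label = stack.pop()
        labels[label] = node
        for position, child in enumerate(node, start=1):
            stack.append((child, label + (position,)))
    return labels


def lca_label(a, b):
    """The LCA's Dewey label is the longest common prefix of both labels."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def slca(root, kw1, kw2):
    """Return (label, tag) for the smallest LCAs of elements containing kw1 and kw2."""
    labels = dewey_labels(root)
    hits1 = [l for l, n in labels.items() if kw1 in (n.text or "")]
    hits2 = [l for l, n in labels.items() if kw2 in (n.text or "")]
    candidates = {lca_label(a, b) for a in hits1 for b in hits2}
    # Drop every candidate that is a proper prefix (ancestor) of another one.
    smallest = {
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    }
    return sorted((c, labels[c].tag) for c in smallest)


doc = ET.fromstring(
    "<bib>"
    "<book><title>XML</title><author>Smith</author></book>"
    "<book><title>JSON</title><author>Jones</author></book>"
    "</bib>"
)
print(slca(doc, "XML", "Smith"))
print(slca(doc, "XML", "Jones"))
```

The second query only meets at the root, which shows why returning a bare LCA node can be uninformative and why object-aware semantics are of interest. The pairwise loop is quadratic and is meant only to show the idea.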
TL;DR: A basic framework for detecting conflicts and overlaps in fuzzy XML documents is developed and an approach for reconciling the fuzzy XML data collected from different data sources is proposed.
TL;DR: A querying framework called FXPath, based on fuzzy logic, is suggested; it proposes the use of fuzzy predicates for the definition of more "vague" and softer queries, and introduces a function called "deep-similar", which aims at substituting XPath's typical "deep-equal" function.
Abstract: XML has become a widespread format for data exchange over the Internet. The current state of the art in querying XML data is represented by XPath and XQuery, both of which define binary predicates. In this paper, we advocate that binary selection can at times be restrictive due to the very nature of XML, and to the uses that are made of it. We therefore suggest a querying framework, called FXPath, based on fuzzy logic. In particular, we propose the use of fuzzy predicates for the definition of more "vague" and softer queries. We also introduce a function called "deep-similar", which aims at substituting XPath's typical "deep-equal" function. Its goal is to provide a degree of similarity between two XML trees, assessing whether they are similar both structure-wise and content-wise. In this paper we present the formal syntax and semantics of Fuzzy XPath, and discuss implementation issues.
TL;DR: XML transformation that focuses on each XML construct transforming to a class diagram is described, and can be used as an alternative solution to show a complete reverse XML schema.
Abstract: XML reverse engineering is a research area that focuses on obtaining a conceptual model from an XML schema. To address integration issues, previous XML reverse engineering researchers have applied reverse methods to an XML schema or document in order to generate a class diagram. However, the XML constructs are not used in their entirety, so the generated class diagram is incomplete. Therefore, this paper describes an XML transformation that focuses on transforming each XML construct to a class diagram. In order to generate a complete class diagram, a formal method is used. There are several steps involved in constructing and transforming each XML construct into a class diagram. In order to ensure the formalization is complete, the ebXML case study is used, and the results obtained show that this method can be used as an alternative solution to show a complete reverse XML schema.
TL;DR: A novel XML labeling scheme is proposed that helps quick determination of structural relationship among XML nodes and supports dynamic updates without relabeling nodes in case of update occurrences.
Abstract: Rapid development of XML technology over the World Wide Web has motivated the need for query optimization, especially in a dynamic environment. As such, a good XML labeling scheme to ensure fast query processing is crucial. Although many labeling schemes were proposed in the past, only a few support structural relationships efficiently. Therefore, in this paper, we propose a novel XML labeling scheme that helps quick determination of structural relationships among XML nodes and supports dynamic updates without relabeling nodes when updates occur.
TL;DR: An intensive experimental evaluation on real-world benchmark XML corpora reveals a higher effectiveness of XML co-clustering in comparison with state-of-the-art approaches to XML clustering, by viewing the task as parametric with respect to the XML features.
Abstract: XML co-clustering is a promising method to overcome the limited effectiveness of traditional XML clustering approaches, due to the exploitation of the mutual relationships between XML documents and their respective XML features while clustering both simultaneously. To shed light on this so far unexplored research direction, we conduct a systematic study of the effectiveness of XML co-clustering, by viewing the task as parametric with respect to the XML features. Thus, the definition and exploitation of three distinct types of XML features, which are respectively informative of the content, structure and both aspects of the XML documents, allows an in-depth investigation of all three different instances of the XML co-clustering task, i.e., XML co-clustering by content alone, structure alone as well as both structure and content. XML co-clustering relies on a non-negative matrix trifactorization technique that efficiently processes large-scale input data, which is especially useful with large corpora of text-centric XML documents. The relevance of the structural and content features of the XML documents is assessed through a new weighting scheme. An intensive experimental evaluation on real-world benchmark XML corpora reveals a higher effectiveness of XML co-clustering in comparison with state-of-the-art approaches to XML clustering. Insights are also provided on the effectiveness of XML feature clustering.
TL;DR: jqcML is a Java application programming interface (API) for the qcML data format, which provides the ability to read, write, and work in a uniform manner with qcML data from different sources, including the XML-based qcML file format and the relational database qcDB.
Abstract: The awareness that systematic quality control is an essential factor to enable the growth of proteomics into a mature analytical discipline has increased over the past few years. To this aim, a controlled vocabulary and document structure have recently been proposed by Walzer et al. to store and disseminate quality-control metrics for mass-spectrometry-based proteomics experiments, called qcML. To facilitate the adoption of this standardized quality control routine, we introduce jqcML, a Java application programming interface (API) for the qcML data format. First, jqcML provides a complete object model to represent qcML data. Second, jqcML provides the ability to read, write, and work in a uniform manner with qcML data from different sources, including the XML-based qcML file format and the relational database qcDB. Interaction with the XML-based file format is obtained through the Java Architecture for XML Binding (JAXB), while generic database functionality is obtained by the Java Persistence API (JPA). jqcML is released as open-source software under the permissive Apache 2.0 license and can be downloaded from https://bitbucket.org/proteinspector/jqcml .
TL;DR: In this article, a system and method for converting an XML-based format document (e.g., DOCX) into a template that can be stored, accessed, and/or populated using web services is described.
Abstract: The system and method are disclosed for converting an XML-based format document (e.g., DOCX) into a template that can be stored, accessed, and/or populated using web services. The XML-based format document can include content control tags that can be converted to XML elements and/or scheme information. Further, a unique ID can be assigned to the XML-based format document and the document can be stored as a template associated with the unique ID. A web service can respond to the document ID, apply the scheme information for the document (validate the data), and populate the control tags using XML elements received from another computer.
TL;DR: A secure and efficient XML labeling scheme called Secure Dewey Coding (SDC) is proposed that prevents information leaks and keeps memory space and time minimal; label size and generation time both decreased significantly.
Abstract: XML is the commonly utilized content specification format for data interchange over the Internet. In the publish/subscribe model, the producer is the source of an XML document and disseminates the XML content to the consumer through a mediator called the publisher. The producer labels the XML document and defines access control policies for the consumers. Securely labeled XML documents are encrypted and sent to the publisher with the consumers' access details. Encryption is used to provide confidentiality and integrity for XML content dissemination. The consumer queries the publisher for its accessible content. Here, the XML label plays a vital role, as it locates the XML content uniquely. The objective is to design a secure label that identifies each XML tag uniquely and does not reveal any additional information about the source XML document. Also, the XML label size should be small and the label generation time short. We propose a secure and efficient XML labeling scheme called Secure Dewey Coding (SDC) that prevents information leaks and keeps memory space and time minimal. The implementation results showed that the XML label size was reduced by a maximum of 68% and by 59% on average, and that the generation time also decreased significantly.
TL;DR: This work proposes a method to integrate XPath and keyword search so that users can accurately express their search demands and shows that the proposed scheme can process queries over XML streams practically.
Abstract: With the rise of Web search engines, processing keyword search over XML and XML streams has drawn much attention from many researchers. Compared to conventional query methods, keyword search has several benefits for its simplicity and its user-friendliness in querying XML databases. Therefore, a great deal of effort has been put on this search paradigm by trying to improve the quality of search results of pure keyword search, where only keywords are allowed as a query. However, due to the vagueness of keyword search, it is hard to accurately express real search intention with just keyword search. We observe that there are many cases where the combination of path-based query and keyword search is a better choice and can deal with such a challenge. To address this problem, we propose a method to integrate XPath and keyword search so that users can accurately express their search demands. The experimental results show that the proposed scheme can process queries over XML streams practically.
TL;DR: This work proposes a method to integrate XPath with keyword search so that users can express their search demands in more specific ways.
Abstract: Recently, a great deal of attention has been focused on processing keyword search over static XML data and XML streams. Keyword search is becoming more popular for its simplicity and its user-friendliness in querying XML databases. However, it is hard to express real search intention with just keyword search. There are many cases where the combination of a path-based query and keyword search can address this issue. To address this problem, we propose a method to integrate XPath with keyword search so that users can express their search demands in more specific ways.
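The idea of combining a path-based query with keywords can be sketched as follows: a path expression first narrows the candidate elements, and the keywords then filter them by content. This is an illustration over an in-memory document, not the authors' stream-processing method; `path_keyword_search` and the sample data are assumptions, and `ElementTree.findall` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


def path_keyword_search(root, path, keywords):
    """Return elements selected by `path` whose subtree text contains every keyword."""
    results = []
    for element in root.findall(path):
        # Concatenate all text in the element's subtree, case-insensitively.
        text = " ".join(element.itertext()).lower()
        if all(keyword.lower() in text for keyword in keywords):
            results.append(element)
    return results


doc = ET.fromstring(
    "<bib>"
    "<book><title>XML streams</title><author>Lee</author></book>"
    "<article><title>XML streams</title><author>Kim</author></article>"
    "<book><title>Databases</title><author>Lee</author></book>"
    "</bib>"
)

hits = path_keyword_search(doc, ".//book", ["xml", "lee"])
print([hit.findtext("title") for hit in hits])
```

A pure keyword query for "xml" would also return the article; the path constraint is what lets the user say that only books are wanted.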
TL;DR: By converting XPath expressions into custom stacks, this solution is the first to provide support for complex XPath structural constructs, such as parent-child and ancestor-descendant relations, whilst allowing wildcarding and recursion.
Abstract: Publish-subscribe systems present the state of the art in information dissemination to multiple users. Such systems have evolved from simple topic-based to the current XML-based systems. XML-based pub-sub systems provide users with more flexibility by allowing the formulation of complex queries on the content as well as the structure of the streaming messages. Messages that match a given user query are forwarded to the user. This article examines how to exploit the parallelism found in XPath filtering. Using an incoming XML stream, parsing and matching thousands of user profiles are performed simultaneously by matching engines. We show the benefits and trade-offs of mapping the proposed filtering approach onto FPGAs, processing streams of XML at wire speed, and GPUs, providing the flexibility of software. This is in contrast to conventional approaches bound by the sequential aspect of software computing, associated with a large memory footprint. By converting XPath expressions into custom stacks, our solution is the first to provide support for complex XPath structural constructs, such as parent-child and ancestor-descendant relations, whilst allowing wildcarding and recursion. The measured speedups resulting from the GPU and FPGA accelerations versus single-core CPUs are up to 6.6X and 2.5 orders of magnitude, respectively. The FPGA approaches are up to 31X faster than software running on 12 CPU cores.
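A sequential software sketch may help show what stack-based matching of parent-child (`/`) and ancestor-descendant (`//`) steps with wildcards looks like. It is not the authors' FPGA or GPU design, which the abstract does not detail; the state encoding, the function names, and the restriction to simple name steps are all assumptions made for this example.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_path(query):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", query)


def filter_stream(xml_bytes, query):
    """Yield the path of every element matched by `query`, in document order."""
    steps = parse_path(query)
    # Each stack entry is (exact, inherited):
    #   exact     = counts k such that steps[:k] match, ending exactly at this element
    #   inherited = union of `exact` over this element and all of its ancestors
    stack = [({0}, {0})]  # virtual document root: zero steps matched
    tags = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            parent_exact, parent_inherited = stack[-1]
            exact = set()
            for i, (axis, name) in enumerate(steps):
                if name not in ("*", elem.tag):
                    continue
                # '/' needs the previous step at the parent, '//' at any ancestor.
                source = parent_exact if axis == "/" else parent_inherited
                if i in source:
                    exact.add(i + 1)
            stack.append((exact, parent_inherited | exact))
            tags.append(elem.tag)
            if len(steps) in exact:
                yield "/" + "/".join(tags)
        else:
            stack.pop()
            tags.pop()
            elem.clear()  # release content that is no longer needed


doc = b"<a><b><c/></b><x><b><d/></b></x><b/></a>"
print(list(filter_stream(doc, "/a//b/*")))
```

The stack depth equals the current element depth, so memory use is bounded by document depth rather than document size. Running one such matcher per user profile is the part the article parallelizes in hardware.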