TL;DR: A set of recommendations (not a standard) is presented that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding.
Abstract: This paper argues for, and presents, a modest approach to XML encoding for use by the majority of contemporary linguists who need to engage in corpus construction. While extensive standards for corpus encoding exist - most notably, the Text Encoding Initiative’s Guidelines and the Corpus Encoding Standard based on them - these are rather heavyweight approaches, implicitly intended for major corpus-building projects, which are rather different from the increasingly common efforts in corpus construction undertaken by individual researchers in support of their personal research goals. Therefore, there is a clear benefit to be had from a set of recommendations (not a standard) that outlines general best practices in the use of XML in corpora without going into any of the more technical aspects of XML or the full weight of TEI encoding. This paper presents such a set of suggestions, dubbed Modest XML for Corpora, and posits that such a set of pointers to a limited level of XML knowledge could work as part of the normal, general training of corpus linguists. The Modest XML recommendations cover the following set of things, which, according to the foregoing argument, are sufficient knowledge about XML for most corpus linguists’ day-to-day needs: use of tags; adding attribute value pairs; recommended use of attributes; nesting of tags; encoding of special characters; XML well-formedness; a collection of de facto standard tags and attributes; going beyond the basic de facto standard tags; and text headers.
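As an illustration of two items on that list, XML well-formedness and the encoding of special characters, the following is a minimal sketch using Python's standard library. The tag and attribute names (`text`, `s`, `id`, `n`) and the helper `is_well_formed` are hypothetical and are not taken from the Modest XML recommendations themselves.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical corpus snippets; element and attribute names are illustrative.
good = '<text id="t1"><s n="1">Fish &amp; chips</s></text>'
bad_nesting = '<text><s>Fish and chips</text></s>'  # tags overlap instead of nesting
bad_char = '<text><s>Fish & chips</s></text>'       # unescaped ampersand

# Special characters in raw corpus text must be escaped before being wrapped in tags.
raw = "Fish & chips <fresh>"
escaped = escape(raw)
repaired = f"<text><s>{escaped}</s></text>"

for label, snippet in [("good", good), ("bad_nesting", bad_nesting),
                       ("bad_char", bad_char), ("repaired", repaired)]:
    print(label, is_well_formed(snippet))
```

A check of this kind only establishes well-formedness; it says nothing about whether the tags follow any particular standard or schema.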
TL;DR: An extensible programming framework to separate platform-specific optimizations from application codes, and to incrementally improve the performance of an existing application without messing up the code is proposed.
Abstract: This paper proposes an extensible programming framework to separate platform-specific optimizations from application codes. The framework allows programmers to define their own code translation rules for special demands of individual systems, compilers, libraries, and applications. Code translation rules associated with user-defined compiler directives are defined in an external file, and the application code is just annotated by the directives. For code transformations based on the rules, the framework exposes the abstract syntax tree (AST) of an application code as an XML document to expert programmers. Hence, the XML document of an AST can be transformed using any XML-based technologies. Our case studies using real applications demonstrate that the framework is effective to separate platform-specific optimizations from application codes, and to incrementally improve the performance of an existing application without messing up the code.
TL;DR: A bidirectional XML update language called BIFLUX (BIdirectional FunctionaL Updates for XML), inspired by the FLUX XML update language, is proposed, with a clear and well-behaved bidirectional semantics and a decidable static type system based on regular expression types.
Abstract: Different XML formats are widely used for data exchange and processing, and it is often necessary to convert between them in both directions. Standard XML transformation languages, like XSLT or XQuery, are unsatisfactory for this purpose since they require writing a separate transformation for each direction. Existing bidirectional transformation languages aim to fill this gap by allowing programmers to write a single program that denotes both transformations. However, they often 1) induce a more cumbersome programming style than their traditionally unidirectional relatives, to establish the link between source and target formats, and 2) offer limited configurability, by making implicit assumptions about how modifications to both formats should be translated that may not be easy to predict. This paper proposes a bidirectional XML update language called BIFLUX (BIdirectional FunctionaL Updates for XML), inspired by the FLUX XML update language. Our language adopts a novel bidirectional programming by update paradigm, where a program succinctly and precisely describes how to update a source document with a target document, in an intuitive way, such that there is a unique "inverse" source query for each update program. BIFLUX extends FLUX with bidirectional actions that describe the connection between source and target formats. We introduce a core BIFLUX language, with a clear and well-behaved bidirectional semantics and a decidable static type system based on regular expression types.
TL;DR: Publishers and editors should now adopt JATS XML for journal publishing because it is an essential format to present readers with a more user-friendly interface.
Abstract: In the era of information technology, scholarly journals cannot escape the rising tide of technological advancement. To be exposed more easily to readers, the web forms of scholarly journals and articles become more important year after year. Furthermore, there is a trend of print journals closing, and a significant emergence of online journals. Journal Article Tag Suite (JATS) extensible markup language (XML) became a National Information Standards Organization standard language in online journal publishing in 2012. It is an essential format to present readers with a more user-friendly interface. JATS XML was developed from PubMed Central (PMC) XML, which was a deposit form of articles to PMC. Editors and other publishing-related personnel should be able to understand the concept and production process of XML files. When JATS XML is produced, a variety of web presentation views can be generated, such as PubReader and epub 3.0. Further, JATS XML can be easily converted to digital object identifier CrossRef XML, CrossMark XML, and FundRef XML. Small scholarly society journal editors and publishers can promote the visibility of their journals by depositing JATS XML files to PMC or ScienceCentral. Owing to these benefits of JATS XML, publishers and editors should now adopt JATS XML for journal publishing.
TL;DR: A novel two-tier index structure is proposed to facilitate access to XML documents in an on-demand broadcast system; it provides the clients with an overall image of all the XML documents available at the server side and hence enables the clients to locate complete result sets accordingly.
Abstract: XML data broadcast is an efficient way to disseminate semistructured information in wireless mobile environments. In this paper, we propose a novel two-tier index structure to facilitate access to XML documents in an on-demand broadcast system. It provides the clients with an overall image of all the XML documents available at the server side and hence enables the clients to locate complete result sets accordingly. A pruning strategy is developed to cut down the index size and a two-tier structure is proposed to further remove any redundant information. In addition, two index distribution strategies, namely naive distribution and partial distribution, have been designed to interleave the index information with the XML documents in the wireless channels. Theoretical analysis and simulation experiments are also put forward to show the benefits of our indexing methods.
TL;DR: A template is introduced that describes the main components of an XML Matcher, their role and behavior, and illustrates how each of these components has been implemented in some popular XML Matchers.
Abstract: Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was investigated mainly for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
TL;DR: This paper presents how to shift from XML encryption to JSON encryption; JSON is a lightweight data format that is interchangeable with a programming language's built-in data structures, which eliminates translation time and reduces complexity and processing time.
Abstract: JavaScript Object Notation (JSON) is a lightweight data-interchange format. It is easy for humans to read and write. It has a data format that is interchangeable with a programming language's built-in data structures, which eliminates translation time and reduces complexity and processing time. Moreover, JSON has the same strengths as XML. Therefore, it is better to shift from XML security to JSON security. In this paper, we will present how to shift from XML encryption to JSON encryption.
TL;DR: A formal theory of 8 plus 3 structural patterns for XML elements is introduced, and their identifiability in a number of different XML vocabularies is verified, allowing the creation of visualization and content extraction tools that are completely independent of the schema.
Abstract: Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights into the expected characteristics of a markup language, as well as any regularity that may span vocabularies and languages, and that are more fundamental and frequent than plain content models. In this paper we explore the idea of structural patterns in XML vocabularies, by examining the characteristics of elements as they are used, rather than as they are defined. We introduce from the ground up a formal theory of 8 plus 3 structural patterns for XML elements, and verify their identifiability in a number of different XML vocabularies. The results allowed the creation of visualization and content extraction tools that are completely independent of the schema and without any previous knowledge of the semantics and organization of the XML vocabulary of the documents.
TL;DR: A comparative analysis of the various schemes available to efficiently store and query the temporal and multi-versioned XML documents based on temporal, change management, versioning, and querying support is provided.
Abstract: Extensible Markup Language (XML) documents are associated with time in two ways: (1) XML documents evolve over time and (2) XML documents contain temporal information. The efficient management of the temporal and multi-versioned XML documents requires optimized use of storage and efficient processing of complex historical queries. This paper provides a comparative analysis of the various schemes available to efficiently store and query the temporal and multi-versioned XML documents based on temporal, change management, versioning, and querying support. Firstly, the paper studies the multi-versioning control schemes to detect, manage, and query change in dynamic XML documents. Secondly, it describes the storage structures used to efficiently store and retrieve XML documents. Thirdly, it provides a comparative analysis of the various commercial tools based on change management, versioning, collaborative editing, and validation support. Finally, the paper presents some future research and development directions for the multi-versioned XML documents.
TL;DR: The design and implementation techniques for the EXIP framework for embedded Web development are presented, consisting of a highly efficient EXI processor, a tool for EXI data binding based on templates, and a CoAP/EXI/XHTML Web page engine.
Abstract: Developing and deploying Web applications on networked embedded devices is often seen as a way to reduce the development cost and time to market for new target platforms. However, the size of the messages and the processing requirements of today's Web protocols, such as HTTP and XML, are challenging for the most resource-constrained class of devices that could also benefit from Web connectivity. New Web protocols using binary representations have been proposed for addressing this issue. Constrained Application Protocol (CoAP) reduces the bandwidth and processing requirements compared to HTTP while preserving the core concepts of the Web architecture. Similarly, Efficient XML Interchange (EXI) format has been standardized for reducing the size and processing time for XML structured information. Nevertheless, the adoption of these technologies is lagging behind due to lack of support from Web browsers and current Web development toolkits. Motivated by these problems, this article presents the design and implementation techniques for the EXIP framework for embedded Web development. The framework consists of a highly efficient EXI processor, a tool for EXI data binding based on templates, and a CoAP/EXI/XHTML Web page engine. A prototype implementation of the EXI processor is herein presented and evaluated. It can be applied to Web browsers or thin server platforms using XHTML and Web services for supporting human-machine interactions in the Internet of Things. This article contains four major results: (1) theoretical and practical evaluation of the use of binary protocols for embedded Web programming; (2) a novel method for generation of EXI grammars based on XML Schema definitions; (3) an algorithm for grammar concatenation that produces normalized EXI grammars directly, and hence reduces the number of iterations during grammar generation; (4) an algorithm for efficient representation of possible deviations from the XML schema.
TL;DR: XMLMATE leverages program structure, existing XML schemas, and XML inputs to generate, mutate, recombine, and evolve valid XML inputs, and detected 31 new unique failures in production code.
Abstract: Generating system inputs satisfying complex constraints is still a challenge for modern test generators. We present XMLMATE, a search-based test generator specially aimed at XML-based systems. XMLMATE leverages program structure, existing XML schemas, and XML inputs to generate, mutate, recombine, and evolve valid XML inputs. Over a set of seven XML-based systems, XMLMATE detected 31 new unique failures in production code, all triggered by system inputs and thus true alarms.
TL;DR: An optimization approach is adopted that takes into consideration the semantics of the dataset in order to deal with the complexity of multi-disciplinary domains in Big Data, in particular when the data is represented as XML documents.
TL;DR: This chapter aims to give a reasonably comprehensive definition and motivation for the various aspects of the generic XML language and also to illustrate these aspects with some existing XML dialects or vocabularies.
Abstract: This chapter aims to give a reasonably comprehensive definition and motivation for the various aspects of the generic XML language and also to illustrate these aspects with some existing XML dialects or vocabularies. We describe elements, attributes, child elements, and the hierarchical structure of XML. We talk about “well-formedness” of an XML document and how to identify errors in a document’s structure. We discuss the use of namespaces and end with a brief discussion of validating documents with respect to DTDs and XML Schema. Readers already familiar with all aspects of XML can skip this chapter and read about the functions used to work with XML in R, which are the subject of each of Chapters 3, 4, 5, and 6.
TL;DR: This paper adopts a novel dynamic encoding scheme which is tailored for both static and dynamic possibilistic XML documents to effectively avoid re-labeling after updates, and proposes an efficient algorithm to handle the problem of dynamic twig queries in possibilistic XML documents.
TL;DR: This paper gives formalized representations of XML data sources, including Document Type Definitions (DTDs), XML Schemas, and XML documents, and proposes formal approaches for transforming the XML data sources into ontologies, and discusses the correctness of the transformations and provides several transformation examples.
Abstract: The eXtensible Markup Language (XML) has reached wide acceptance as the relevant standard for representing and exchanging data on the Web. Unfortunately, XML covers the syntactic level but lacks semantics, and thus cannot be directly used for the Semantic Web. Currently, finding a way to utilize XML data for the Semantic Web is a challenging research problem. It is known that ontologies can formally represent shared domain knowledge and enable semantic interoperability. Therefore, in this paper, we investigate how to represent and reason about XML with ontologies. Firstly, we give formalized representations of XML data sources, including Document Type Definitions (DTDs), XML Schemas, and XML documents. On this basis, we propose formal approaches for transforming the XML data sources into ontologies, and we also discuss the correctness of the transformations and provide several transformation examples. Furthermore, following the proposed approaches, we implement a prototype tool that can automatically transform XML into ontologies. Finally, we apply the transformed ontologies for reasoning about XML, so that some reasoning problems of XML may be checked by the existing ontology reasoners.
TL;DR: This paper compares two of the most popular data formats for web applications, JSON and XML, in terms of features, use, and performance testing results.
Abstract: Choosing an appropriate data format for your data makes a difference in programming and performance speed of your application. In this paper, we'll compare two of the most popular data formats for web applications, JSON and XML, in terms of features, use, and performance testing results.
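To make the feature comparison concrete, here is a minimal sketch that round-trips the same record through both formats with Python's standard library. The record and its field names are hypothetical; the sketch illustrates the difference in handling only and says nothing about the paper's performance results.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; all values are strings so both formats round-trip exactly.
record = {"id": "42", "title": "example", "lang": "en"}

# JSON: the parsed result is already a native dict.
json_text = json.dumps(record)
from_json = json.loads(json_text)

# XML: the record must be built as a tree and walked again after parsing.
root = ET.Element("record")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = value
xml_text = ET.tostring(root, encoding="unicode")
from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

print(json_text)
print(xml_text)
```

In this sketch the JSON side maps directly onto a built-in data structure, whereas the XML side needs an explicit mapping between elements and dictionary keys.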
TL;DR: This paper surveys the existing XML fragmentation approaches in the literature, comparing their features and highlighting their drawbacks, and establishes a map of the area, in which there is no consensus as to what an XML fragment is.
Abstract: Efficient document processing is a must when large volumes of XML data are involved. In such critical scenarios, a well-known solution to this problem is to distribute (map) the data among several processing nodes, and then distribute the processing accordingly, taking advantage of parallelism. This is the approach taken by distributed databases and MapReduce environments. Fragmentation techniques play an important role in these scenarios. They provide a way to "cut" the database into pieces and distribute the pieces over a network. This way, queries can also be "cut" into sub-queries that run in parallel, thus achieving better performance when compared to the centralized environment. However, there is no consensus in the database community as to what an XML fragment is. In fact, several approaches in the literature present definitions of XML fragments. In addition to query processing, using XML fragmentation techniques may also be helpful when managing XML documents distributed along the web or clouds. This paper surveys the existing XML fragmentation approaches in the literature, comparing their features and highlighting their drawbacks. Our contribution resides in establishing a map of the area.
TL;DR: ReLab, a subtree-based labeling scheme which generates labels using depth-first traversal, is introduced, and it is indicated that ReLab outperformed the Dietz and region numbering schemes in terms of the time taken to generate labels for each XML node.
Abstract: XML has become the de facto standard for real-world applications on the WWW. Thus, efficient data and query processing is critical to ensure fast response times for user queries. Response time is often influenced by the complexity of the labeling scheme, which is used not only for unique identification of XML nodes but also for determining structural relationships. The labeling scheme adopted is vital to ensure query processing is done flawlessly and promptly. In this paper, we introduce ReLab, a subtree-based labeling scheme which generates labels using depth-first traversal. Our experimental evaluation indicated that ReLab outperformed the Dietz and region numbering schemes in terms of the time taken to generate labels for each XML node.
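For readers unfamiliar with labeling by depth-first traversal, the sketch below shows a Dietz-style scheme, one of the baselines named above, in which each node receives its preorder and postorder rank. It is not ReLab, whose details are not given in the abstract; the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def prepost_labels(root):
    """Assign (preorder, postorder) ranks to every element via depth-first traversal."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank when the node is first entered
        counters["pre"] += 1
        for child in node:
            visit(child)
        labels[node] = (pre, counters["post"])  # rank when the node is left
        counters["post"] += 1

    visit(root)
    return labels


def is_ancestor(labels, a, d):
    """a is an ancestor of d iff a starts before d and finishes after d."""
    pre_a, post_a = labels[a]
    pre_d, post_d = labels[d]
    return pre_a < pre_d and post_a > post_d


# Hypothetical sample document.
doc = ET.fromstring("<a><b><c/></b><d/></a>")
b, d = doc[0], doc[1]
c = b[0]
labels = prepost_labels(doc)
print({node.tag: label for node, label in labels.items()})
```

With labels of this kind, a structural relationship is decided by comparing two pairs of integers rather than by walking the tree, which is why the choice of labeling scheme bears on query response time.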
TL;DR: An XML data cube model which offers the complete views to observe XML data, and a basic algorithm to implement its building process on Hadoop is presented and an optimized algorithm more suitable for this kind of XML data is proposed.
Abstract: XML has become a widely used and well-structured data format for digital document handling and message transmission. To find useful knowledge in XML data, data warehouse and OLAP applications aimed at providing support for decision making should be developed. Apache Hadoop is an open source cloud computing framework that provides a distributed file system for large scale data processing. In this paper, we discuss an XML data cube model which offers complete views for observing XML data, and present a basic algorithm to implement its building process on Hadoop. To improve the efficiency, an optimized algorithm more suitable for this kind of XML data is also proposed. The experimental results given in the paper demonstrate the effectiveness of our optimization strategies.
TL;DR: A basic framework for detecting conflicts and overlaps in fuzzy XML documents is developed and an approach for reconciling the fuzzy XML data collected from different data sources is proposed.
TL;DR: A querying framework, called FXPath, based on fuzzy logics is suggested, which proposes the use of fuzzy predicates for the definition of more "vague" and softer queries, and introduces a function called "deep-similar", which aims at substituting XPath's typical "deep-equal" function.
Abstract: XML has become a widespread format for data exchange over the Internet. The current state of the art in querying XML data is represented by XPath and XQuery, both of which define binary predicates. In this paper, we advocate that binary selection can at times be restrictive due to the very nature of XML, and to the uses that are made of it. We therefore suggest a querying framework, called FXPath, based on fuzzy logics. In particular, we propose the use of fuzzy predicates for the definition of more "vague" and softer queries. We also introduce a function called "deep-similar", which aims at substituting XPath's typical "deep-equal" function. Its goal is to provide a degree of similarity between two XML trees, assessing whether they are similar both structure-wise and content-wise. In this paper we present the formal syntax and semantics of Fuzzy XPath, and discuss implementation issues.
TL;DR: An XML transformation that focuses on transforming each XML construct into a class diagram is described, and can be used as an alternative solution to show a complete reverse XML schema.
Abstract: XML reverse engineering is a research area that focuses on obtaining a conceptual model from an XML schema. For integration purposes, previous XML reverse engineering researchers have applied reverse methods to an XML schema or document in order to generate a class diagram. However, these approaches do not use all XML constructs, which is needed to generate a complete class diagram. Therefore, this paper describes an XML transformation that focuses on transforming each XML construct into a class diagram. In order to generate a complete class diagram, a formal method is used. There are several steps involved in constructing and transforming each XML construct into a class diagram. In order to ensure the formalization is complete, the ebXML case study is used, and the results obtained show that this method can be used as an alternative solution to show a complete reverse XML schema.
TL;DR: A novel XML labeling scheme is proposed that helps quick determination of structural relationship among XML nodes and supports dynamic updates without relabeling nodes in case of update occurrences.
Abstract: Rapid development of XML technology over the World Wide Web has motivated the need for query optimization, especially in a dynamic environment. As such, a good XML labeling scheme to ensure fast query processing is crucial. Although many labeling schemes were proposed in the past, only a few support structural relationships efficiently. Therefore, in this paper, we propose a novel XML labeling scheme that helps quick determination of structural relationships among XML nodes and supports dynamic updates without relabeling nodes when updates occur.
TL;DR: The objective of this research is to investigate and study the XML indexing techniques in terms of their structures and identify the main limitations of these techniques and any other open issues.
Abstract: The rapid development of XML technology improves the WWW, since XML data has many advantages and has become a common technology for transferring data across the Internet. Therefore, the objective of this research is to investigate and study XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers the most common XML indexing techniques and performs a comparison between them. Subsequently, this work makes an argument to find out these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes. So, all the indexes become large in order to perform well, and none of them is suitable for all users’ requirements. However, each one of these techniques has some advantages.
TL;DR: In this article, a method of encoding an Efficient XML Interchange (EXI) document to represent a JavaScript Object Notation (JSON) document without use of a binary-type JSON representation solution may include fetching a set of tokens associated with the JSON document.
Abstract: A method of encoding an Efficient XML Interchange (EXI) document to represent a JavaScript Object Notation (JSON) document without use of a binary-type JSON representation solution may include fetching a set of tokens associated with the JSON document. The method may also include determining one or more terminal types associated with the set of tokens. The method may also include determining one or more current names and one or more current distances for the set of tokens based in part on the terminal type for the tokens in the set. The method may also include encoding an EXI document representing the JSON document based on the one or more current names and the one or more current distances for the set of tokens associated with the JSON document.
TL;DR: An intensive experimental evaluation on real-world benchmark XML corpora reveals a higher effectiveness of XML co-clustering in comparison with state-of-the-art approaches to XML clustering, by viewing the task as parametric with respect to the XML features.
Abstract: XML co-clustering is a promising method to surpass the effectiveness of traditional XML clustering approaches, due to the exploitation of the mutual relationships between XML documents and their respective XML features while clustering both simultaneously. To shed light on this so far unexplored research direction, we conduct a systematic study of the effectiveness of XML co-clustering, by viewing the task as parametric with respect to the XML features. Thus, the definition and exploitation of three distinct types of XML features, which are respectively informative of the content, structure and both aspects of the XML documents, allows an in-depth investigation of all three different instances of the XML co-clustering task, i.e., XML co-clustering by content alone, structure alone as well as both structure and content. XML co-clustering relies on a non-negative matrix trifactorization technique, which efficiently processes large-scale input data and is especially useful with large corpora of text-centric XML documents. The relevance of the structural and content features of the XML documents is assessed through a new weighting scheme. An intensive experimental evaluation on real-world benchmark XML corpora reveals a higher effectiveness of XML co-clustering in comparison with state-of-the-art approaches to XML clustering. Insights are also provided on the effectiveness of XML feature clustering.
TL;DR: In this article, a system and method for converting an XML-based format document (e.g., DOCX) into a template that can be stored, accessed, and/or populated using web services is described.
Abstract: A system and method are disclosed for converting an XML-based format document (e.g., DOCX) into a template that can be stored, accessed, and/or populated using web services. The XML-based format document can include content control tags that can be converted to XML elements and/or scheme information. Further, a unique ID can be assigned to the XML-based format document and the document can be stored as a template associated with the unique ID. A web service can respond to the document ID, apply the scheme information for the document (validate the data), and populate the control tags using XML elements received from another computer.
TL;DR: A secure and efficient XML labeling scheme called Secure Dewey Coding (SDC) is proposed that prevents information leaks and assures minimal memory space and time; the label generation time also decreased significantly.
Abstract: XML is the commonly utilized content specification format for data interchange over the Internet. In the Publish/Subscribe model, the producer is the source of an XML document and disseminates the XML content to the consumer using a mediator called the publisher. The producer labels the XML document and defines access control policies for the consumers. Securely labeled XML documents are encrypted and sent to the publisher with the consumers' access details. Encryption is used to provide confidentiality and integrity for XML content dissemination. The consumer queries the publisher for their accessible content. Here, the XML label plays a vital role, as it locates the XML content uniquely. The objective is to design a secure label that identifies each XML tag uniquely and does not reveal any additional information about the source XML document. Also, the XML label size should be optimal, with low label generation time. We propose a secure and efficient XML labeling scheme called Secure Dewey Coding (SDC) that prevents information leaks and assures minimal memory space and time. The implementation results of the proposed XML labeling scheme showed that the XML label size has been reduced to a maximum and an average of 68% and 59% respectively and the generation time also decreased significantly.
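As background for the labeling discussion, the sketch below shows plain Dewey labeling, the conventional scheme on which the name Secure Dewey Coding builds. It is not SDC itself, whose construction is not described in the abstract; the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Label each element with its path of 1-based sibling positions from the root."""
    labels = {}

    def visit(node, label):
        labels[node] = label
        for position, child in enumerate(node, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(anc, desc):
    """In Dewey labeling, an ancestor's label is a proper prefix of the descendant's."""
    return len(anc) < len(desc) and desc[:len(anc)] == anc


# Hypothetical sample document.
doc = ET.fromstring("<a><b><c/></b><d/></a>")
labels = dewey_labels(doc)
for node, label in labels.items():
    print(node.tag, ".".join(str(part) for part in label))
```

A plain Dewey label exposes a node's depth and its position among its siblings, which is the kind of additional information about the source document that the abstract says a secure label should not reveal.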