TL;DR: In this research, a new lightweight application that adopts a novel model mapping approach was developed using Simple API for XML (SAX) parser and proves to perform significantly better than the DOM-based algorithm in terms of mapping and reconstruction time, and memory efficiency.
Abstract: Extensible Markup Language (XML) is the standard medium for data exchange among businesses over the Internet, hence the need for effective management. However, since XML was not designed for storage and retrieval, its management has become an open research area in the database community. Existing mapping techniques for XML-to-relational database adopt either the structural mapping or the model mapping. Though numerous mapping approaches have been developed, mapping and reconstruction time had been problematic, especially when the document size is large and can hardly fit into main memory. In this research, an application codenamed XAPP, a new lightweight application that adopts a novel model mapping approach was developed using Simple API for XML (SAX) parser. XAPP accepts a document with or without Document Type Definition (DTD). It implements two algorithms: one maps XML data to a relational database and improves mapping time, and the other reconstructs an XML document from a relational database to improve reconstruction time and minimise memory usage. The performance of XAPP was analysed and compared with the Document Object Model (DOM) algorithm. XAPP proves to perform significantly better than the DOM-based algorithm in terms of mapping and reconstruction time, and memory efficiency. The correctness of XAPP was also verified.
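The streaming idea behind SAX-based mapping can be illustrated with a minimal sketch. This is not XAPP's algorithm: the single `node` table, its columns, and the names `NodeTableHandler` and `map_xml_to_db` are assumptions made for illustration only. The point is that a SAX handler writes one row per element as events arrive, so the whole document never has to be held in memory the way a DOM tree is.

```python
import sqlite3
import xml.sax


class NodeTableHandler(xml.sax.ContentHandler):
    """Streams SAX events into a generic node table (hypothetical schema)."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.stack = []    # ids of the currently open elements
        self.text = {}     # id -> list of text chunks seen so far
        self.next_id = 1

    def startElement(self, name, attrs):
        node_id = self.next_id
        self.next_id += 1
        parent_id = self.stack[-1] if self.stack else None
        self.conn.execute(
            "INSERT INTO node (id, parent_id, name, text) VALUES (?, ?, ?, '')",
            (node_id, parent_id, name),
        )
        self.stack.append(node_id)
        self.text[node_id] = []

    def characters(self, content):
        # Text may arrive in several chunks; collect them per open element.
        if self.stack:
            self.text[self.stack[-1]].append(content)

    def endElement(self, name):
        node_id = self.stack.pop()
        text = "".join(self.text.pop(node_id)).strip()
        self.conn.execute("UPDATE node SET text = ? WHERE id = ?", (text, node_id))


def map_xml_to_db(xml_bytes, conn):
    """Parse xml_bytes with SAX and store every element as one row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node "
        "(id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT, text TEXT)"
    )
    xml.sax.parseString(xml_bytes, NodeTableHandler(conn))
    conn.commit()


conn = sqlite3.connect(":memory:")
map_xml_to_db(b"<order><item>pen</item><item>ink</item></order>", conn)
rows = conn.execute("SELECT id, parent_id, name, text FROM node ORDER BY id").fetchall()
```

Only the stack of open elements is kept in memory, so memory use grows with document depth rather than document size. Reconstruction would walk the `parent_id` links in `id` order.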
TL;DR: In this article, a solution that transforms a Microsoft Excel Open XML Spreadsheet (XLSX) file into a shallow-structured XML file used at Typefi, called Content XML (CXML), using XSLT and XProc is presented.
Abstract: This paper presents a solution that transforms a Microsoft Excel Open XML Spreadsheet (XLSX) file into a shallow-structured XML file used at Typefi, called Content XML (CXML), using XSLT and XProc. This solution has three main research areas: The XProc pipeline to read the Excel file content. Transform Excel tables to a CALS table using XSLT functions. Transform Excel charts and embedded images. Significant information is read from the chart.xml and converted to a Scalable Vector Graphics (SVG) file, and then referenced as an image in the output XML. The XLSX file can contain various elements such as tables, charts, and graphics. This solution does not yet use all the features available within the Excel XML but is a work in progress and future improvements will be guided by customer requests.
TL;DR: In this article, a parallel approach to XML parsing based on NEM-XML is presented, and the experimental results show that the parallel XML parsing algorithm improves XML parsing performance significantly and scales well.
Abstract: As the de facto standard for data representation and data exchange, XML has become very popular on the Internet. Improving XML parsing performance is key to promoting its further development and application. Parallel computing is a key technology for solving computationally intensive problems. This paper presents a parallel approach to XML parsing based on NEM-XML. The experimental results show that our parallel XML parsing algorithm improves XML parsing performance significantly and scales well.
TL;DR: In this paper, an XLink extension "dbxlink" has been proposed, which allows for modeling interlinked XML instances as integrated views where XLinks are resolved in a transparent way.
Abstract: XML (eXtensible Markup Language) is the de facto standard for exchanging information and for representing data in the World Wide Web. In contrast to the document-centric perspective given by the well-known language HTML, which defines the human-readable content and the layout of web pages, XML offers more flexibility and expressiveness. XML documents are not required to be self-contained but may rather have links to other XML resources. For expressing such links between XML documents, the W3C (World Wide Web Consortium) proposed XLink - but mainly for browsing purposes. If the linked documents are considered from the data-centric viewpoint, it shows that XLink does not specify how the referenced instances should be handled. In particular, it is not possible to query along links, though the W3C XML Query (XQuery) Requirements explicitly state that this has to be guaranteed. In order to cope with these issues, an XLink extension "dbxlink" has been proposed. It allows for modeling interlinked XML instances as integrated views where XLinks are resolved in a transparent way. In particular, it is possible to query these instances with XPath and XQuery. In this work, the dbxlink model is described and it is investigated how to query distributed XML instances interlinked with a simple kind of XLinks according to this approach. Different strategies are analyzed and emerging problems like the handling of cyclic instances are treated. It is shown how to extend XPath-based query systems in order to be able to handle queries with respect to dbxlink. Furthermore, optimization techniques like special caching strategies are proposed. The results of these investigations have been used to conduct a proof-of-concept implementation of the dbxlink approach as an extension to the open source XML database system eXist.
TL;DR: In this article, the authors proposed an efficient prefix-based labeling scheme that uses a hexagonal pattern, which avoids the need for node relabeling when XML documents are updated at random locations, avoids duplicated labels by creating a new label for every inserted node, and reduces the size and time costs of the updated labels.
Abstract: To improve XML query processing, it is necessary to label XML documents efficiently for the indexing process because it allows the structural relationships between the XML nodes to be preserved without having to access the original document. However, XML data on the Web is updated as time passes, which means that the dynamic updating of XML data is an issue that may need to be handled by an XML labeling scheme specifically designed for dynamic updates. Previous XML labeling schemes have limitations when updates take place. For example, many node labels need to be relabeled, many duplicate labels occur during this relabeling process, and the size and time costs of the updated labels are high. Therefore, this paper proposes an efficient prefix-based labeling scheme that uses a hexagonal pattern. The proposed labeling scheme has three main advantages: (i) it avoids the need for node relabeling when XML documents are updated at random locations, (ii) it avoids duplicated labels by creating a new label for every inserted node, and (iii) it reduces the size and time costs of the updated labels. The proposed scheme is evaluated against the three most recent prefix-based labeling schemes in terms of the size and time costs of the updated labels. In addition, the ability of the proposed labeling scheme to handle several updates (such as insertions) in XML documents is also evaluated. The evaluations show that the proposed labeling scheme outperforms previously developed prefix-based labeling schemes in terms of both size and time costs, particularly for large-scale XML datasets, resulting in improved query processing performance. Moreover, the proposed scheme efficiently supports frequent updates at arbitrary positions. The paper concludes with several suggestions for further research.
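To make the idea of prefix-based labeling concrete, here is a minimal sketch of a plain Dewey-style scheme. It is not the hexagonal scheme from the abstract, and the function names are assumptions for illustration. Each node's label is its parent's label plus a child ordinal, so ancestor relationships can be decided from the labels alone without touching the document.

```python
import xml.etree.ElementTree as ET


def assign_prefix_labels(root):
    """Return {label: tag}; a label is the parent's label plus a child ordinal."""
    labels = {}

    def visit(elem, label):
        labels[label] = elem.tag
        for ordinal, child in enumerate(elem, start=1):
            visit(child, label + (ordinal,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """a is a proper ancestor of b exactly when a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


root = ET.fromstring("<a><b><c/></b><d/></a>")
labels = assign_prefix_labels(root)
```

With integer ordinals like these, inserting a node between two existing siblings forces the later siblings and their descendants to be relabeled. That is the relabeling cost that dynamic labeling schemes aim to avoid.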
TL;DR: In this article, the authors proposed two methods: vectorization representation of XML documents and further feature extraction. The experimental results show that all-path feature extraction can represent the main features of an XML document effectively, which is an important step toward handling power grid XML documents efficiently.
Abstract: XML documents store the information of the new power system source load interaction. They are characterised by self-description, extensibility, structure, and content, which makes them widely used. Improving the method of extracting elements from XML documents is very helpful in solving the problem of distributed object operation measurement in the power grid. To classify or analyze XML documents better, based on the theoretical analysis of principal component analysis and the study of the text representation model, this paper proposes two methods: vectorization representation of XML documents and further feature extraction. The experimental results show that the all-path feature extraction method can represent the main features of an XML document effectively, and is an important step toward the later efficient handling of power grid XML documents.
TL;DR: Antora as discussed by the authors is a static site generator for software documentation that runs on Node.js and uses AsciiDoc for its source content, but it does not natively handle complex content generation needs.
Abstract: Static website generation has long been an effective use case for XML and XSLT. Today, static site generators remain popular, but they rarely use XML. Antora is a static site generator for software documentation. It runs on Node.js and uses AsciiDoc for its source content. It has desirable features including git integration, site versioning, and pluggable modern UI bundles. However, Antora doesn't natively handle complex content generation needs. Now, thanks to SaxonJS and Antora's new extension mechanism, we can weave in the power of XML and XSLT. The docca project generates reference documentation for Boost C++ libraries via an Antora extension that invokes SaxonJS, seamlessly integrating auto-generated and manually-authored content into the result. This presentation introduces key project components (Doxygen, Antora, AsciiDoc, and SaxonJS running on Node.js) and includes sample code, a demo, and a brief discussion of other ways XML and SaxonJS might complement AsciiDoc and Antora.
TL;DR: This study focuses on constructing a separate XML document validator and validating XML documents against the defined XSD rules and the critical differences between XSD and DTD.
Abstract: Extensible Markup Language (XML) is a markup language that is developed to organize the structure of information in a text file. The data in XML formatted documents are represented by specifying a number of tags and determining the structural relationship between those tags. It has a simple structure and can be handled by any text editor. Therefore, XML formatted data is being commonly used to transfer and share data between different applications and organizations without having to convert the format of the data (Yang, 2019).
In the XML world, “well-formed” and “valid” are the two most frequently used terms. A well-formed XML document is free from errors that can prevent the document from being parsed, such as spelling, punctuation, grammar, and syntax errors. In addition to having well-formed markup, a valid XML document must conform to a document type definition; this means the document must be semantically correct and match a described standard of schemas and relationships (Appel, 2020). There are two standards of document type definition that can be used to validate an XML document: one is DTD, or Document Type Definition, which is used to identify the legal structure and name the legal elements of an XML document (Dykes and Tittel, 2011), and the other is XSD, or XML Schema Definition. XSD is a diagrammatic representation that defines the valid structure of an XML document; it enables specifying the building blocks of an XML data set such as elements and attributes and their data types, the number of child elements, and fixed and default values of the elements and attributes that can appear in the documents (XML Schema Tutorial, 2020). In some applications the process of validating XML documents is combined with parsing the document. However, in other cases the processes of parsing and validating the XML documents need to be separated. This study focuses on constructing a separate XML document validator and validating XML documents against the defined XSD rules. A Java program is used to perform this experiment. Furthermore, the critical differences between XSD and DTD are also mentioned.
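The well-formed half of the distinction above can be checked with a parser alone. The sketch below is not the study's Java validator; the helper name `is_well_formed` is an assumption. It covers well-formedness only. Validating against an XSD needs a schema-aware library, which is not shown here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; says nothing about schema validity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


ok = is_well_formed("<note><to>Ann</to></note>")
broken = is_well_formed("<note><to>Ann</note>")  # <to> is never closed
```

A document that passes this check can still be invalid, for example if it uses an element that the schema does not declare.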
TL;DR: In this article, a framework is proposed that flattens and converts tree-structured data into structured data, while maintaining the information of the architecture and composition of the XML format, in order to gain more information from event logs.
Abstract: Many applications, including event logs and web pages, use the XML format for utilizing, keeping, transferring and displaying data. Thus, the volume of data expressed in XML has increased rapidly. Much research has been done to extract and mine information from XML documents. Mining XML documents allows an understanding of the architecture and composition of XML documents. Generally, frequent subtree mining is one of the methods to mine XML documents. Frequent subtree mining searches for relations between data in a tree-structured database. Due to the architecture and the composition of the XML format, normal data mining and statistical analysis are difficult to perform. This paper suggests a framework that flattens and converts tree-structured data into structured data, while maintaining the information of the architecture and the composition of the XML format. To gain more information from event logs, converting from a semistructured format into structured data makes it possible to apply a variety of data mining techniques and statistical tests. Keywords: Flatten Sequential Structure Model, XML Format Event Logs, Data Mining, Statistical Analysis.
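As a rough illustration of flattening, the sketch below turns one XML event into a flat record whose keys are root-to-node paths, so the position in the tree survives in the column name. It is not the paper's Flatten Sequential Structure Model; the `flatten` helper and the sample event are assumptions.

```python
import xml.etree.ElementTree as ET


def flatten(elem, path=""):
    """Yield (path, value) pairs for every attribute and non-empty text node."""
    here = f"{path}/{elem.tag}"
    for name, value in elem.attrib.items():
        yield f"{here}/@{name}", value
    text = (elem.text or "").strip()
    if text:
        yield here, text
    for child in elem:
        yield from flatten(child, here)


log = ET.fromstring('<event id="7"><user>ann</user><action>login</action></event>')
row = dict(flatten(log))
```

Collecting the pairs into a dict keeps only the last value when sibling elements share a tag. A fuller version would add a positional index to the path to keep repeated siblings apart.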
TL;DR: In this article, the authors present a formal description of Simple Link and Extended Link semantics, based on a specification as an abstract data type (ADT), and providing Extended Links with a 3rd Party Link semantics.
Abstract: XML (short for eXtensible Markup Language) is a meta-language for the representation of digital data. XML has had an enormous impact on modern computer science and the IT industry since its advent in 1997, for several reasons: XML is simple and easily accessible. Using Unicode as encoding, XML can be viewed and authored/edited with common text editors, and due to the context-free and well-formed structure of XML document types, it is easy to provide efficient parsers for processing XML documents. Also, XML's concept of definable document types enables a structured representation of almost arbitrary digital data, with the document type modeling the domain of the data, which makes XML a very powerful and flexible standard for data representation, particularly regarding the Web. The XLink standard is an extension to XML for defining references between XML documents, inspired by the hyperlink concept from hypertext. XLink defines two types of links: Simple Links are unidirectional links from one document to another, similar to HTML hyperlinks. Extended Links create graph-based relationships (arcs) between portions of XML (resources) over multiple XML documents. Within the LinXIS project, models and query evaluation for XLink have been investigated: in a logical data model, a Simple Link is given the semantics of an embedded view that "imports" the referenced data from a remote document into the link-defining document. The participating XML data, together with the Simple Links, define a virtual instance (a single-document view on the distributed data) according to the logical data model. Extended Links define relations between XML resources, but in contrast to Simple Links, they are not defined inside the participating resources but apart from them.
This allows a semantics to be defined for Extended Links, with an Extended Link defining views that combine and extend the participating resources from a 3rd party perspective, without need for write access to them, thus extending the Simple Links logical data model. The logical data model described above provides a semantics for the evaluation of XPath queries over distributed XML data: A query may be evaluated not on a (physical) XML document, but on the virtual instance defined by the given Simple and Extended Links. The query evaluation may "follow" along a Simple Link, continuing the evaluation process on the referenced, physically remote data. For Extended Links, queries can be evaluated on the integrated view combining the sources referenced by an Extended Link, based on the 3rd party semantics of the link. A previous PhD thesis, which also emerged from the LinXIS project, introduced the data model for Simple Links and investigated techniques and algorithms for XPath query evaluation on the linked XML data. As part of the work, the data model was implemented on the basis of the Open Source XML database system eXist, thus creating a Simple-Link-enhanced XML database prototype. The present work extends the focus from Simple to Extended Links: The work includes a formal description of both Simple Link and Extended Link semantics, based on a specification as an abstract data type (ADT), and providing Extended Links with a 3rd Party Link semantics. Also, the basic concepts for query evaluation with respect to 3rd Party Links are investigated. The algorithms as well as the logical data model for 3rd Party Links are implemented by further enhancement of the eXist-based prototype, providing the query evaluation unit with that semantics. The prototype is tested within a case study, evaluating the prototype's functional behavior and performance.
The case study is followed by a discussion of the proposed 3rd Party Link approach, addressing its applicability in terms of its design, performance and its relevance within a rapidly evolving Web infrastructure. The work is completed by a conclusion addressing the previously discussed issues, and giving an overview over related research as well as over perspectives and further work.
TL;DR: In this paper, a fault diagnosis IETM display system based on XML is preliminarily developed by combining the related technologies mentioned above, and experiments show that XML technology can effectively manage the information in fault diagnosis and display it to users in an interactive way.
Abstract: Fault diagnosis IETM is a digital technical manual for the field of fault diagnosis that integrates editing, management, display and publication. Its development faces problems such as complex data management and difficult interaction. The successful application of XML technology is the key to solving these problems. In this paper, based on the S1000D IETM development standard, the main characteristics of XML and its related technologies are first introduced, and then the application of XML and its related technologies in fault diagnosis IETM is analyzed and studied, including the creation of the fault information data module, the display control of fault information and the comprehensive management of data. A conversion method between XML document data and database data is proposed, and the design process of fault diagnosis IETM based on XML is discussed. Finally, a fault diagnosis IETM display system based on XML is preliminarily developed by combining the related technologies mentioned above. Experiments show that XML technology can effectively manage the information in fault diagnosis IETM and display it to users in an interactive way, which has important theoretical value and certain practical significance for developing fault diagnosis IETM and exploring the deeper application of XML in IETM.
TL;DR: In this article, a clustering-based labeling scheme (CLS) was proposed to address some limitations and challenges of indexing XML data, and a set of experiments was identified and designed for the evaluation purpose.
Abstract: XML is a common technique for formatting and exchanging data across the Internet. Updating and retrieving XML data is an active research area. In addition, indexing XML data is a significant task for improving the efficiency of XML queries. Labelling nodes is the technique used for indexing XML data efficiently. The Clustering-based Labelling Scheme (CLS) was proposed to address some limitations and challenges of indexing XML data. This paper aims to evaluate the performance of the clustering-based labelling scheme. Thus, a set of experiments was identified and designed for the evaluation purpose. Moreover, an appropriate dataset for testing the CLS scheme was identified. Subsequently, the experiments were carried out, and the results were analysed and assessed to obtain the findings. The results of these experiments suggest that the CLS achieved the target results with few exceptions. Therefore, this evaluation showed an improvement in the performance and efficiency of labelling XML documents.
Abstract: One of the main aims of the so-called Web of Data is to be able to handle heterogeneous resources where data can be expressed in either XML or RDF. The design of programming languages able to handle both XML and RDF data is a key target in this context. In this paper we present a framework called XQOWL that makes it possible to handle XML and RDF/OWL data with XQuery. XQOWL can be considered an extension of the XQuery language that connects XQuery with SPARQL and OWL reasoners. XQOWL embeds SPARQL queries (via the Jena SPARQL engine) in XQuery and enables calls to OWL reasoners (HermiT, Pellet and FaCT++) from XQuery. It allows queries against XML and RDF/OWL resources to be combined, as well as reasoning with RDF/OWL data. Therefore, input data can be either XML or RDF/OWL and output data can be formatted in XML (also using the RDF/OWL XML serialization).
Abstract: Query evaluation in an XML database requires reconstructing XML subtrees rooted at nodes found by an XML query. Since XML subtree reconstruction can be expensive, one approach to improve query response time is to use reconstruction views - materialized XML subtrees of an XML document, whose nodes are frequently accessed by XML queries. For this approach to be efficient, the principal requirement is a framework for view selection. In this work, we are the first to formalize and study the problem of XML reconstruction view selection. The input is a tree $T$, in which every node $i$ has a size $c_i$ and profit $p_i$, and the size limitation $C$. The target is to find a subset of subtrees rooted at nodes $i_1,\cdots, i_k$ respectively such that $c_{i_1}+\cdots +c_{i_k}\le C$, and $p_{i_1}+\cdots +p_{i_k}$ is maximal. Furthermore, there is no overlap between any two subtrees selected in the solution. We prove that this problem is NP-hard and present a fully polynomial-time approximation scheme (FPTAS) as a solution.
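The selection problem stated above can be made concrete with a small exact solver. This is a pseudo-polynomial dynamic program, not the FPTAS from the abstract, and it assumes non-negative integer sizes. The function name, the dict-based tree encoding, and the sample numbers are all illustrative assumptions. Choosing a node takes its whole subtree, so none of its descendants may be chosen as well.

```python
def best_profit(children, size, profit, root, capacity):
    """Max total profit of non-overlapping subtrees with total size <= capacity."""

    def solve(v):
        # table[c] = best profit inside v's subtree using total size <= c,
        # first without selecting v itself: combine the children knapsack-style.
        table = [0] * (capacity + 1)
        for child in children.get(v, []):
            child_table = solve(child)
            table = [
                max(table[c - k] + child_table[k] for k in range(c + 1))
                for c in range(capacity + 1)
            ]
        # Alternatively select v's whole subtree, which excludes all descendants.
        for c in range(size[v], capacity + 1):
            table[c] = max(table[c], profit[v])
        return table

    return solve(root)[capacity]


# Hypothetical tree: node 0 has children 1 and 2; node 1 has child 3.
children = {0: [1, 2], 1: [3]}
size = {0: 10, 1: 4, 2: 3, 3: 2}
profit = {0: 9, 1: 5, 2: 4, 3: 3}
answer = best_profit(children, size, profit, root=0, capacity=7)
```

With capacity 7 the best choice is the subtrees rooted at nodes 1 and 2 (size 7, profit 9). The running time grows with the square of the capacity. An exact method like this stops being practical for large size limits, and since the problem is NP-hard, this is where an approximation scheme becomes useful.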
Abstract: A distributed XML document is an XML document that spans several machines. We assume that a distribution design of the document tree is given, consisting of an XML kernel-document T[f1,...,fn] where some leaves are "docking points" for external resources providing XML subtrees (f1,...,fn, standing, e.g., for Web services or peers at remote locations). The top-down design problem consists in, given a type (a schema document that may vary from a DTD to a tree automaton) for the distributed document, "propagating" locally this type into a collection of types, that we call typing, while preserving desirable properties. We also consider the bottom-up design which consists in, given a type for each external resource, exhibiting a global type that is enforced by the local types, again with natural desirable properties. In the article, we lay out the fundamentals of a theory of distributed XML design, analyze problems concerning typing issues in this setting, and study their complexity.
De Meo, Pasquale, Ferrara, Emilio, Ursino, Domenico
9 Mar 2022
Abstract: Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
TL;DR: In this paper, the authors propose a pattern that abstracts the similarity metrics that have been applied by the existing XML matching approaches and provides to data/software engineers a skeleton of abstract classes that the engineers should extend for implementing the automated calculation of element similarity.
Abstract: XML is one of the standard ways for representing and exchanging information on the Web. However, XML schemas that represent the same/similar information are usually heterogeneous. To reconcile the heterogeneity of XML schemas, there are many approaches in the literature that match the elements of XML schemas to each other. The calculation step that is in common among all the matching approaches is the similarity calculation among XML elements. We extract a pattern that abstracts the similarity metrics that have been applied by the existing XML matching approaches. The pattern provides to data/software engineers a skeleton of abstract classes that the engineers should extend for implementing the automated calculation of element similarity.
TL;DR: This paper proposes a new XML diff algorithm called jats‐diff, able to support bijection between higher‐level modifications made by the authors, such as structural changes and restyling, and the changes detected between XML documents.
Abstract: The writing of digital text documents has become a longer process that usually goes through revision rounds. Document comparison is important for the human reader interested in changes made by the authors. These documents contain structural data using text‐centric XML as one of their main storage systems. Current XML diff algorithms are able to represent differences with a limited number of edit operations: insert, delete, move and update. This approach does not fit the scope of digital text document comparison where the human reader needs to understand actual modifications made by the author. With JATS being a text‐centric XML vocabulary, we propose within this paper a new XML diff algorithm called jats‐diff, able to support bijection between higher‐level modifications made by the authors, such as structural changes and restyling, and the changes detected between XML documents. In addition, jats‐diff provides similarity information between different nodes in order to measure the impact of the text changes on the XML tree.
Abstract: Data validation is becoming more and more important with the ever-growing amount of data being consumed and transmitted by systems over the Internet. It is important to ensure that the data being sent is valid, as it may contain entry errors that are then consumed by different systems, causing further errors. XML has become the de facto standard for data transfer. The XML Schema Definition language (XSD) was created to help XML structural validation and provide a schema for data type restrictions; however, it does not allow for more complex situations. In this article we introduce a way to provide rule-based XML validation and correction through the extension and improvement of our SRML metalanguage. We also explore the option of applying it in a database as a trigger for CRUD operations, allowing more granular dataset validation on an atomic level and more complex dataset record validation rules.
Abstract: "The ever-increasing adoption of XML has created a need to ensure that XML query languages perform efficiently. Query optimization and transformation for XML query languages, both syntactically and semantically, have received much attention from research communities in recent years. However, due to the fast progress of the application of XML data management solutions, XML-Enabled Database Management Systems still face several challenges. Among these challenges is query processing, especially the processing of XML queries specified with XPath axes and redundancies that may exist in predicates used in XML queries. Semantic query optimization utilizes constraints in XML schemas to directly optimize a given query with a set of optimization rules. Due to the current complexity of the XML data structure which is enabled by rich semantics in XML Schemas, semantic query optimization should be performed in a more systematic manner. For a complete solution, this research proposes a series of semantic transformations to transform given XML queries to semantically equivalent, but more efficient, XML queries for optimization purposes, by using the semantics provided in XML Schemas. The proposed semantic transformations are grouped into three categories: (1) Semantic Path Transformations, (2) Semantic Transformations for XPath Queries Specified with Predicates, and (3) Semantic Transformations for XPath Queries Specified with XPath Axes. After a semantic transformation is applied to an XML query, the equivalent semantic XML query can be processed more efficiently by an XML data management system and returns the same result set. The proposed semantic transformations are then translated into a series of algorithms which are implemented and empirically evaluated for their efficiency and effectiveness. 
The experimental studies were carried out by using both real data sets (DBLP) and Benchmark data sets (Michigan) to illustrate that the majority of semantic transformations achieved significantly improved performance in XML query processing; this also enabled the research presented here to identify semantic transformations as optimization devices". -- Abstract.