TL;DR: In this article, the authors propose an approach to securely parse XML documents and prevent a large number of XML and XXE attacks using an incremental genetic algorithm (IGA) and a self-evolving feature set.
Abstract: XML documents provide a platform-independent data representation and transportation facility that enables communication among heterogeneous Web applications and Web and cloud computing services. The wide usage of XML documents and their parsing makes them prone to cyberattacks. Attackers exploit hidden vulnerabilities in document parsers and inject ever-changing malicious payloads, posing a threat to the confidentiality, integrity, and availability of Web resources. In this paper, the authors propose an approach to securely parse XML documents and prevent a large number of XML and XXE attacks. The detection rules for blocking malicious documents are self-evolving and update their feature set using the incremental genetic algorithm. The secure XML parsing pattern provides a security guideline and supplements existing parsers by facilitating the detection of malicious payloads.
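To make the idea of a detection rule in front of the parser concrete, here is a minimal Python sketch. It is not the authors' IGA-based method: it applies a single fixed rule (reject any document that declares a DOCTYPE, which is where XXE entity declarations live) and then hands the document to the standard-library parser. The names `UnsafeXMLError` and `parse_without_dtd` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class UnsafeXMLError(ValueError):
    """Raised when a document matches a rejection rule (hypothetical name)."""


def parse_without_dtd(xml_text: str) -> ET.Element:
    """Parse xml_text, rejecting any document that declares a DOCTYPE."""

    def reject_doctype(name, sysid, pubid, has_internal_subset):
        # Entity declarations used in XXE payloads can only appear in a DTD,
        # so refusing the DOCTYPE itself blocks them before they are read.
        raise UnsafeXMLError(f"DOCTYPE declaration not allowed: {name!r}")

    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = reject_doctype
    # First pass: raises UnsafeXMLError on a DOCTYPE, ExpatError if malformed.
    checker.Parse(xml_text, True)
    # Second pass: build the element tree from the vetted text.
    return ET.fromstring(xml_text)
```

A static rule like this is deliberately stricter and simpler than a self-evolving feature set; it only illustrates where such a check sits relative to the parser.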
TL;DR: In this article, the authors use ODM-XML as the instrument to represent the SDTM annotated case report form (CRF) as both machine-readable and human-readable metadata for specification and documentation.
Abstract: Using ODM-XML as the instrument to represent the SDTM annotated case report form (CRF) as both machine-readable and human-readable metadata for specification and documentation yields vast savings in time, cost, effort, and aggravation over other ways of producing these documents. Because the human-readable versions are created simply by adding a translating style sheet to the ODM-XML, in exactly the same way as the familiar creation of the human-readable Define-XML, those savings are self-evident: document creation becomes completely automatic. Additionally, the style sheet automates the cumbersome task of creating links between the Define-XML document and the SDTM annotated CRF document, provided the SDTM origins in the Define-XML are created as named destinations. By creating one document (the PDF version) from the other (the XML version) via validated code, multiple control and verification activities can be reduced to just a few samples.
TL;DR: Transforming lax XML element trees for simplified processing of HTML tables.
Abstract: When processing HTML tables within XML documents, we generally find the lax element schema for the HTML table specification increases the complexity of the processing system. By first transforming HTML tables to a normalized representation we avoid the need for different logic paths deep inside our code. Ensuring there is a single normalized element hierarchy for any given table also simplifies the task of comparing different versions of each table. This paper describes the design and implementation of XSLT that transforms lax HTML table element structures within an XML document to conform to a strict content model.
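As an illustration of what normalizing a lax table structure can mean, the sketch below applies one such rule: rows placed directly under `<table>` are moved into a `<tbody>`. The paper implements its transformation in XSLT; this is a Python stand-in using the standard library, and both the function name `normalize_tables` and the choice of rule are assumptions, not the paper's content model.

```python
import xml.etree.ElementTree as ET


def normalize_tables(root: ET.Element) -> ET.Element:
    """Move <tr> elements that sit directly under <table> into a <tbody>."""
    # Snapshot the tables first so the tree is not modified while iterating.
    for table in list(root.iter("table")):
        loose_rows = [child for child in table if child.tag == "tr"]
        if not loose_rows:
            continue  # already in the strict form for this rule
        tbody = table.find("tbody")
        if tbody is None:
            tbody = ET.SubElement(table, "tbody")
        # Simplification: loose rows are appended after any existing tbody rows.
        for row in loose_rows:
            table.remove(row)
            tbody.append(row)
    return root
```

After this pass, downstream code can assume every row is reachable as `table/tbody/tr`, which is the kind of single logic path the abstract argues for.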
Reza Samavi, Mariano P. Consens, Shahan Khatchadourian, Thodoros Topaloglou
13 Sep 2023
TL;DR: DescribeX is a novel visualization technique for analyzing PSI-MI XML collections at the instance level, providing insights into schema usage, common patterns, and evolution.
Abstract: PSI-MI has been endorsed by the protein informatics community as a standard XML data exchange format for protein-protein interaction datasets. While many public databases support the standard, there is a degree of heterogeneity in the way the proposed XML schema is interpreted and instantiated by different data providers. Analysis of schema instantiation in large collections of XML data is a challenging task that is unsupported by existing tools. In this study we use DescribeX, a novel visualization technique of (semi-)structured XML formats, to quantitatively and qualitatively analyze PSI-MI XML collections at the instance level with the goal of gaining insights about schema usage and to study specific questions such as: adequacy of controlled vocabularies, detection of common instance patterns, and evolution of different data collections. Our analysis shows DescribeX enhances understanding the instance-level structure of PSI-MI data sources and is a useful tool for standards designers, software developers, and PSI-MI data providers.
TL;DR: The XML binding technology covered in Chapter 22 introduced a valuable data exchange format, simplifying and standardizing communication in heterogeneous and widespread computer networks. But with the addition of schemas and namespaces, XML documents became more and more complex, and in order to send a few bytes formatted as XML, a lot of boilerplate data had to be transmitted.
Abstract: The XML binding technology covered in Chapter 22 introduced a valuable data exchange format, simplifying and standardizing communication in heterogeneous and widespread computer networks. The use of schemas and namespaces for elements and attributes enabled semantics and version consistency in XML, thus allowing for XML document validity checks. Unfortunately, with all those additions, XML documents became more and more complex, and in order to send a few bytes formatted as XML, a lot of boilerplate data had to be transmitted, too. Especially for browser-to-server communication, XML became overkill. For this reason, web application developers switched to the leaner JSON (JavaScript Object Notation) data format, making it necessary for the Java world to develop tools and libraries that could handle JSON.
TL;DR: In this article, a new XML stream structure is proposed for broadcasting XML data by compressing and summarizing the information about how XML nodes are put together; it can be used to answer a wide range of XML queries.
Abstract: XML data warehouses give decision-support systems a chance to use complicated data by giving them a place to start. But native-XML database management systems are slow right now, so it is important to look into ways to make them faster. This study presents two strategies to think about. To start, we suggest using a link index that was made with the fact that XML warehouses have a lot of dimensions in mind. The join procedures are taken out, but the data from the first warehouse stays the same. Second, we show how to choose XML materialized views by grouping the query load to show how this can be done. To prove that these ideas work, we built a set of decision support XQuery tools and ran them against an XML data warehouse. We compared the results obtained with and without our optimization methods. Our tests show that it works, even though the queries themselves are a little hard to understand and the datasets themselves are very big. XML has emerged as the widely accepted standard for transmitting data over mobile wireless networks. In these kinds of networks, mobile clients can use a wireless broadcast channel to send queries to get the XML data they need. Because mobile devices are so small and have so little storage space and a short battery life, it may be hard for customers to download the whole XML data set on one of these devices. To solve this problem, you need to index XML data so that mobile clients only have to download the parts of the file they need. Users who want to access only certain parts of the XML content in an XML stream could use one of several indexing methods. Still, the indexing methods that are used now add more data to an XML stream that is already very large. This research comes up with a new XML stream structure for broadcasting XML data by compressing and summarizing the information about how XML nodes are put together. This study was conducted in the United Kingdom. 
When data is summarized before being sent, the time it takes to retrieve it in XML format over a wireless broadcast channel can be reduced. The proposed XML stream structure also includes indexes that help clients skip over irrelevant data, which could reduce the battery drain on mobile devices during XML query processing. We also found that our proposed XML stream design outperformed its predecessors in terms of access and tuning times for processing XML queries over the XML data stream. Our architecture can therefore be used to answer a wide range of XML queries.
TL;DR: In this paper, the author presents a case study examining the curricula of Hungarian and foreign universities and educational organizations and how they teach XML and related technologies, and summarizes efforts in teaching XML, using XML in education, and other methodological innovations.
Abstract: In my dissertation, I summarize my efforts in teaching XML, using XML in education, and other methodological innovations. The paper starts with a literature review and a study of information technology as a profession. This is followed by a section on XML education, which opens with a case study examining the curricula of Hungarian and foreign universities and educational organizations and how they teach XML and related technologies. Next, I present my course „Adatkezelés – XML” („Data Management – XML”), which is taught at ELTE, and formulate my first thesis: the basics of XML technologies can be presented at the early stages of current university information technology training with my curriculum on XML and related technologies, so that students can build on this during their later studies. The next part of the dissertation covers the use of XML in education. I introduce the curriculum I developed and tested, in which I used XML to teach text processing. My second thesis concerns this: XML provides a good example for teaching text processing and project work. I also studied the creation of a new markup language by extending XML, so that XML is not only used in some tasks but also applied. With the presented algorithm markup language, the syntax of coding can be taught, and native-language keywords and their related documentation can help students understand algorithmic thinking and the structure of source code. My third thesis follows from this: the education of beginner programmers can be helped by the Algorithm Markup Language (AML), an extension of XML that defines its own types. I implemented, and tested with my students, a web application for visualizing and editing algorithms defined in this new markup language. My fourth thesis is: the effectiveness of learning can be enhanced by my XML-based application for visualizing and editing algorithms.
I continued the development to manage not only algorithms but also specification (data, types, pre- and postconditions), testing, and documentation. First, I developed an Excel-based toolkit to lead my students through the coordinated steps of the coding process. Then I combined this with the previous application into a new web application. I published this application, which supports students’ work from design through coding to documentation. My fifth thesis is therefore: my XML-based application, built by analogy with the tools used by methodologies widespread in industry, supports the teaching of systematic programming. In the paper, I describe the methodology for applying the presented software. Based on experience from lessons and feedback, my theses have been verified and my students have successfully learned the required knowledge.
TL;DR: In this paper, a method is proposed to convert XML documents into a structured format without ignoring the structural information, which allows more data mining techniques and statistical tests to be conducted to extract information from business process log data.
Abstract: The volume of extensible markup language (XML) documents is increasing every day due to the development of the internet and the use of the XML format in business process log files. Storing business process log data in XML is preferable because the format is extensible and stores data irrespective of how it will be represented. However, mining XML data poses challenges due to its complex data structure and dimensions. This paper proposes a method to convert an XML document into a structured format without ignoring the structural information. Converting semi-structured business process log data into a structured format allows more data mining techniques and statistical tests to be conducted to extract information from the business process log data. The experiment in this study performs a t-test on a set of synthetic data and a set of real-world data to show that information in a business process log can be extracted through standard statistical tests. Empirical results show that statistical analysis can be conducted on business process log data, especially in XML format, after the flatten sequential structure model (FSSM) is applied.
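The abstract does not spell out how FSSM works, so the following Python sketch should be read only as a generic illustration of flattening XML into rows while keeping structural information. Each record becomes a dictionary whose keys are element paths with sibling positions, so the order of repeated events survives. The function names and the `trace`/`event` element names in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(elem, path=None):
    """Return {path: value} for every attribute and leaf under elem."""
    path = path or elem.tag
    row = {f"{path}/@{name}": value for name, value in elem.attrib.items()}
    children = list(elem)
    if not children:
        row[path] = (elem.text or "").strip()
        return row
    counts = {}
    for child in children:
        # Number same-tag siblings so sequence position is kept in the key.
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
        row.update(flatten(child, child_path))
    return row


def log_to_rows(xml_text, record_tag):
    """Flatten every <record_tag> element in the log into one row."""
    root = ET.fromstring(xml_text)
    return [flatten(record) for record in root.iter(record_tag)]
```

For a log such as `<log><trace id="t1"><event><name>start</name></event></trace></log>`, calling `log_to_rows(xml_text, "trace")` yields one dictionary per trace, which can then be loaded into a table for standard statistical tests.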
TL;DR: In this paper, a web-based server application is developed as an XML records editor that provides display forms for the creation and editing of XML documents and is able to adapt to the internal resources of the system used.
Abstract: End users can work effectively with large amounts of data from many heterogeneous sources through graphical web interfaces built on XML technologies, which can display any file structure presented in XML format. As a data exchange method between applications on the Web, XML still lacks capabilities for identifying web resources and the systems that use them, and for expressing the knowledge provided by XML documents. In this study, a web interface (a web-based server application) has been developed as an XML records editor that provides display forms for creating and editing XML documents and is able to adapt to the internal resources of the system used. The technology is based on transforming the XSD data set schema by means of XSLT transformations. Screen forms are generated on the server side and are provided to the user with all the necessary tools for correct input and/or editing of heterogeneous data. A distinguishing characteristic of this technology is the ability to display both properly and improperly formed XML data. The developed graphical interface allows any application to automatically exchange and read information from other applications without human intervention, which significantly improves performance and ease of use. This software solution could be used both as an independent module for building and editing data presented in XML format, and as a built-in module plugged into various server software for heterogeneous information management systems.
Abstract: XML is nowadays considered the standard meta-language for document markup and data representation. XML is widely employed in Web-related applications as well as in database applications, and there is also growing interest within the literary community in developing tools that support document-oriented retrieval operations. The purpose of this article is to show the basic new requirements of this kind of application and to present the main features of a typed query language, called Tequyla-TX, designed to support them.
Abstract: Submission note: A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy to the School of Engineering and Mathematical Sciences, Faculty of Science, Technology and Engineering, La Trobe University, Bundoora.
Keyword searches in XML data have attracted much attention because they can liberate users from the steep learning curve of query languages and the schemas of XML data. However, due to the inherent ambiguity of keyword queries, using keyword searches to query XML data poses several challenges. This thesis proposes approaches to deal with the three main problems in XML keyword searches: identifying relevant results of a query; result ranking for top-k queries; and keyword searching over multiple XML databases. First, this thesis introduces a novel semantics called dominant lowest common ancestor (DLCA) to define relevant results of XML keyword searches. Ranking criteria are proposed, and algorithms are introduced to rapidly identify the dominant results. Experiments have been conducted to demonstrate the superiority of our work in comparison with some state-of-the-art approaches in the literature. Second, this thesis also proposes an approach to tackle the many-result problem, which is caused by short and ambiguous keyword queries. It proposes a ranking function by applying principles of probabilistic models with a consideration of data dependencies in XML data. An algorithm is developed to efficiently retrieve top-k results based on the well-known threshold algorithm. The empirical results show that the proposed approach outperforms existing approaches in a variety of situations. Finally, this thesis proposes an approach to resolve the problem of searching over multiple XML data sources to avoid the high cost of searching in numerous, potentially irrelevant data sources. The approach summarizes the data sources as succinct synopses for the rapid filtering of non-promising sources. A ranking function is introduced to effectively rank the relevance of each data source to the given query. Experiments are conducted to confirm the superiority of the proposed approach.
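The thesis's DLCA semantics is not defined in this abstract, so the sketch below shows only the common baseline that LCA-style semantics build on: the smallest lowest common ancestor (SLCA), that is, a node whose subtree contains every query keyword while no descendant's subtree does. It is a simplified Python illustration that matches keywords against whitespace-separated words in element text; the function name `slca` is an assumption.

```python
import xml.etree.ElementTree as ET


def slca(root, keywords):
    """Return nodes whose subtree holds all keywords and no child's does."""
    wanted = {k.lower() for k in keywords}
    results = []
    if not wanted:
        return results

    def visit(node):
        # Keywords found in this node's own text.
        found = wanted & set((node.text or "").lower().split())
        child_has_all = False
        for child in node:
            child_found = visit(child)
            if child_found == wanted:
                child_has_all = True  # a smaller answer exists below
            found |= child_found
            # Tail text follows the child but belongs to this node.
            found |= wanted & set((child.tail or "").lower().split())
        if found == wanted and not child_has_all:
            results.append(node)
        return found

    visit(root)
    return results
```

Ranking the returned nodes, or pruning them further as a dominance-based semantics would, is left out here.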