TL;DR: This paper conducts a large-scale analysis of 30 XML parsers across six programming languages, using an evaluation framework that applies different variants of 17 XML parser attacks to provide insight into each parser's configuration.
Abstract: The Extensible Markup Language (XML) has become a widely used data structure for web services, Single Sign-On, and various desktop applications. The core of the entire XML processing is the XML parser. Attacks on XML parsers, such as the Billion Laughs and the XML External Entity (XXE) attack, have been known since 2002. Nevertheless, even experienced companies such as Google and Facebook were recently affected by such vulnerabilities.
In this paper we systematically analyze known attacks on XML parsers and deal with challenges and solutions of them. Moreover, as a result of our in-depth analysis we found three novel attacks.
We conducted a large-scale analysis of 30 different XML parsers of six different programming languages. We created an evaluation framework that applies different variants of 17 XML parser attacks and executed a total of 1459 attack vectors to provide a valuable insight into a parser's configuration. We found vulnerabilities in 66% of the default configurations of all tested parsers. In addition, we comprehensively inspected parser features to prevent the attacks, show their unexpected side effects, and propose secure configurations.
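One common countermeasure against both Billion Laughs and XXE is to refuse any document that carries a DOCTYPE declaration. The sketch below illustrates that idea with Python's standard `xml.parsers.expat` module. The names `parse_without_dtd`, `DoctypeForbidden`, and `BILLION_LAUGHS_SAMPLE` are hypothetical, and this is not necessarily one of the secure configurations the paper proposes.

```python
from xml.parsers import expat

# A tiny Billion Laughs style document (two levels only, harmless in size).
BILLION_LAUGHS_SAMPLE = (
    '<!DOCTYPE lolz [\n'
    '  <!ENTITY a "lol">\n'
    '  <!ENTITY b "&a;&a;&a;">\n'
    ']>\n'
    '<lolz>&b;</lolz>'
)


class DoctypeForbidden(ValueError):
    """Raised when a document carries a DOCTYPE declaration."""


def parse_without_dtd(xml_text):
    """Parse xml_text and return element names in document order.

    Any DOCTYPE declaration aborts parsing before entity declarations
    inside it are processed.
    """
    names = []
    parser = expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise DoctypeForbidden("DOCTYPE %r is not allowed" % name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return names
```

Rejecting every DTD is a blunt policy: it also blocks legitimate documents that rely on internal entities, which is the kind of side effect the abstract warns about.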
TL;DR: A taxonomy of the types of XML injection attacks is discussed, and a constraint solver is used to derive four different ways to mutate XML messages, turning them into attacks (tests) automatically, a result that is much better than what a state-of-the-art tool based on fuzz testing could achieve.
Abstract: XML is extensively used in web services for integration and data exchange. Its popularity and wide adoption make it an attractive target for attackers and a number of XML-based attack types have been reported recently. This raises the need for cost-effective, automated testing of web services to detect XML-related vulnerabilities, which is the focus of this paper. We discuss a taxonomy of the types of XML injection attacks and use it to derive four different ways to mutate XML messages, turning them into attacks (tests) automatically. Further, we consider domain constraints and attack grammars, and use a constraint solver to generate XML messages that are both malicious and valid, thus making it more difficult for any protection mechanism to recognise them. As a result, such messages have a better chance to detect vulnerabilities. Our evaluation on an industrial case study has shown that a large proportion (78.86%) of the attacks generated using our approach could circumvent the first layer of security protection, an XML gateway (firewall), a result that is much better than what a state-of-the-art tool based on fuzz testing could achieve.
TL;DR: A new mapping approach, known as XAncestor, which consists of two algorithms: an XML mapping algorithm (XtoDB) and a query mapping algorithm that translates XPath queries into corresponding SQL queries based on the constructed RDB in order to reduce the query response time.
Abstract: XML has become a common language for data exchange on the Web, so it needs to be managed effectively. There are four central problems in XML data management: capture, storage, retrieval, and exchange. Even though numerous database systems are available, the relational database (RDB) is often used to store and query the content of XML documents. Therefore the processes of mapping from XML to RDB and vice versa occur frequently. Numerous researchers have proposed approaches to map hierarchically structured XML documents into the tabular format of a RDB. However, the previously developed approaches have faced problems in terms of storage and query response time. If the design of a RDB is inefficient, the number of join operations between tables increases when a query is executed, which affects the query response time. To overcome this limitation, this paper proposes a new mapping approach, known as XAncestor, which consists of two algorithms: an XML mapping algorithm (XtoDB) and a query mapping algorithm (XtoSQL). XtoDB maps XML documents to a fixed RDB with less storage space. XtoSQL translates XPath queries into corresponding SQL queries based on the constructed RDB in order to reduce the query response time i.e., the time taken to execute the translated SQL query. XAncestor is then developed as a prototype in order to test its effectiveness. The results of XAncestor are compared with those produced by five similar approaches. The comparison proves that XAncestor performs better than the previously developed approaches in terms of effectiveness and scalability. The correctness of XAncestor is also verified. The paper concludes with some recommendations for further work.
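To make the two-step idea concrete (map XML into fixed tables, then translate XPath into SQL), here is a minimal sketch using `sqlite3` and `xml.etree.ElementTree`. The table layout (`path`, `leaf`) and the functions `load_xml` and `xpath_to_sql` are assumptions made for illustration. They are not the actual XtoDB schema or the XtoSQL algorithm, and only simple absolute child-axis paths are handled.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_xml(conn, xml_text):
    """Store every text-bearing element as (ancestor path, value) in two fixed tables."""
    conn.execute("CREATE TABLE path (path_id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE leaf (node_id INTEGER PRIMARY KEY, path_id INTEGER, value TEXT)"
    )

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:
            # Each distinct ancestor path is stored once and shared by all its leaves.
            conn.execute("INSERT OR IGNORE INTO path (path) VALUES (?)", (path,))
            (path_id,) = conn.execute(
                "SELECT path_id FROM path WHERE path = ?", (path,)
            ).fetchone()
            conn.execute(
                "INSERT INTO leaf (path_id, value) VALUES (?, ?)", (path_id, text)
            )
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")


def xpath_to_sql(xpath):
    """Translate a simple absolute child-axis XPath such as /a/b/c into SQL."""
    if not xpath.startswith("/") or "//" in xpath or "[" in xpath:
        raise ValueError("only simple absolute child-axis paths are supported here")
    sql = (
        "SELECT leaf.value FROM leaf "
        "JOIN path ON leaf.path_id = path.path_id "
        "WHERE path.path = ? ORDER BY leaf.node_id"
    )
    return sql, (xpath,)
```

Because the whole ancestor path is stored as one string, a path query needs a single join regardless of document depth, which is the kind of join reduction the abstract argues for.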
TL;DR: Two algorithms that are based on either traditional inverted lists or newly proposed LLists to improve the overall performance and several algorithms based on hash search to simplify the operation of finding CA nodes from all involved LLists are proposed.
Abstract: Efficiently answering XML keyword queries has attracted much research effort in the last decade. The key factors resulting in the inefficiency of existing methods are the common-ancestor-repetition (CAR) and visiting-useless-nodes (VUN) problems. To address the CAR problem, we propose a generic top-down processing strategy to answer a given keyword query w.r.t. LCA/SLCA/ELCA semantics. By “top-down”, we mean that we visit all common ancestor (CA) nodes in a depth-first, left-to-right order; by “generic”, we mean that our method is independent of the query semantics. To address the VUN problem, we propose to use child nodes, rather than descendant nodes, to test the satisfiability of a node $v$ w.r.t. the given semantics. We propose two algorithms that are based on either traditional inverted lists or our newly proposed LLists to improve the overall performance. We further propose several algorithms that are based on hash search to simplify the operation of finding CA nodes from all involved LLists. The experimental results verify the benefits of our methods according to various evaluation metrics.
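The top-down idea for SLCA semantics can be sketched as follows: visit common-ancestor (CA) nodes depth-first and left-to-right, and decide whether a CA node is an SLCA by looking only at its children. This is a simplified illustration on an in-memory `ElementTree`. It precomputes keyword coverage in one pass instead of using inverted lists or LLists, and the function name `slca_nodes` is hypothetical.

```python
import xml.etree.ElementTree as ET


def slca_nodes(root, keywords):
    """Return the elements that are smallest common ancestors of all keywords."""
    wanted = {k.lower() for k in keywords}
    covered = {}  # element -> subset of wanted keywords found in its subtree

    def collect(elem):
        # A node matches a keyword through its tag name or its own text.
        words = set((elem.text or "").lower().split()) | {elem.tag.lower()}
        found = words & wanted
        for child in elem:
            found |= collect(child)
        covered[elem] = found
        return found

    collect(root)
    results = []

    def visit(elem):
        # Top-down: descend only through nodes that are common ancestors.
        if covered[elem] != wanted:
            return
        ca_children = [c for c in elem if covered[c] == wanted]
        if not ca_children:
            results.append(elem)  # no child is a CA, so elem is an SLCA
        for child in ca_children:
            visit(child)  # left-to-right, depth-first

    visit(root)
    return results
```

Testing children rather than all descendants works because any descendant that is a CA makes the child on the path to it a CA as well.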
TL;DR: The proposed technique proved that using xml entities and XSLT transforms is more efficient in terms of coding effort and deployment complexity when compared to mapping the schema using object oriented scripting language such as C#.
Abstract: This paper proposes an XML-entity-based architectural implementation to improve integration between multiple third-party vendor software systems with incompatible XML schemas. The implementation showed that the lines of code changed to map the schema between in-house software and three other vendor schemas decreased by 5.2%, indicating an improvement in quality. The schema mapping development time decreased by 3.8% and overall release time decreased by 5.3%, indicating an improvement in productivity. The proposed technique showed that using XML entities and XSLT transforms is more efficient in terms of coding effort and deployment complexity than mapping the schema using an object-oriented language such as C#.
TL;DR: A security specification model (SecFHIR) is proposed to support the development of intuitive policy schemes that are mapping directly to the healthcare environment and efficiently simplify the security administration and achieve fine-grained access control.
Abstract: Patients receiving medical treatment in distinct healthcare institutions have their information deeply fragmented between very different locations. All this information (probably in different formats) may be used or exchanged to deliver professional healthcare services. As the exchange of information/interoperability is a key requirement for the success of the healthcare process, various predefined e-health standards have been developed. Such standards are designed to facilitate information interoperability in common formats. Fast Healthcare Interoperability Resources (FHIR) is a new open healthcare data standard that aims to provide electronic healthcare interoperability. FHIR was coined in 2014 to address limitations caused by the ad-hoc implementation and the distributed nature of modern medical care information systems. Patient’s data or resources are structured and standardized in FHIR through a highly readable format such as XML or JSON. However, despite the unique features of FHIR, it is not a security protocol, nor does it provide any security-related functionality. In this paper, we propose a security specification model (SecFHIR) to support the development of intuitive policy schemes that map directly to the healthcare environment. The formal semantics for SecFHIR are based on the well-established typing and the platform-independent properties of XML. Specifically, patients’ data are modeled in FHIR using XML documents. In our model, we assume that these XML resources are defined by a set of schemes. Since an XML Schema is itself a well-formed XML document, the permission specification can be easily integrated into the schema itself; the specified permissions are then applied to instance objects without any change. In other words, our security model (SecFHIR) defines permissions at the XML schema level, which implicitly specify the permissions on XML resources.
SecFHIR can combine these schemes to support complex constraints over XML resources. This results in reusable permissions, which efficiently simplify the security administration and achieve fine-grained access control. We also discuss the core elements of the proposed model, as well as the integration with the FHIR framework.
TL;DR: A survey on keyword search over XML documents that mainly focuses on defining semantics for XML keyword search and on the corresponding algorithms to find answers based on these semantics.
Abstract: Since XML has become a standard for information exchange over the Internet, more and more data are represented as XML. XML keyword search has attracted a lot of interest because it provides a simple and user-friendly interface to query XML documents. This paper provides a survey on keyword search over XML documents. We mainly focus on the topics of defining semantics for XML keyword search and the corresponding algorithms to find answers based on these semantics. We classify existing works for XML keyword search into three main types, which are tree-based approaches, graph-based approaches and semantics-based approaches. For each type of approach, we further classify works into sub-classes and especially we summarize, make comparison and point out the relationships among sub-classes. In addition, for each type of approach, we point out the common problems they suffer
TL;DR: This research investigates compressing XML labels via different prefix-encoding methods in order to reduce the occurrence of any overflow problems and improve query performance.
Abstract: XML is the de-facto standard for data representation and communication over the web, so there is a lot of interest in querying XML data. Most approaches require the data to be labelled to indicate structural relationships between elements. This is simple when the data does not change but complex when it does. In the day-to-day management of XML databases over the web, it is usual that more information is inserted over time than deleted. Frequent insertions can lead to large labels which have a detrimental impact on query performance and can cause overflow problems. Many researchers have shown that prefix encoding usually gives the highest compression ratio in comparison to other encoding schemes. Nonetheless, none of the existing prefix encoding methods has been applied to XML labels. This research investigates compressing XML labels via different prefix-encoding methods in order to reduce the occurrence of any overflow problems and improve query performance. The paper also presents a comparison between the performances of several prefix-encodings in terms of encoding/decoding time and compressed code size.
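As an illustration of what prefix-encoding a label means, the sketch below encodes the integer components of a Dewey-style label with Elias gamma, a textbook prefix code. The abstract does not name the encodings it compares, so the choice of Elias gamma and the function names are assumptions made purely for illustration.

```python
def gamma_encode(n):
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # (len - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def encode_label(components):
    """Concatenate the gamma codes of all label components."""
    return "".join(gamma_encode(c) for c in components)


def decode_label(bits):
    """Inverse of encode_label for a well-formed bit string."""
    components, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        components.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return components
```

Because no code word is a prefix of another, the components can be concatenated without separators and still be decoded unambiguously. Small components, which are the common case, get the shortest codes.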
TL;DR: An XML file format for storing data from computations in algebra and geometry is described and a formal specification based on a RELAX-NG schema is presented.
Abstract: We describe an XML file format for storing data from computations in algebra and geometry. We also present a formal specification based on a RELAX-NG schema.
TL;DR: This paper argues that the LCA based techniques still require users to be well versed with the XML schema and also the data to be able to obtain meaningful query results, and presents a novel system, Generic Keyword Search (GKS), which returns ‘meaningful’ information from any XML node, which contains a subset of keywords in the search query Q.
Abstract: XML and JSON have become the default formats to exchange the information for web application or within enterprises. Keyword Search over XML data has been motivated by the need to relieve users from writing difficult XQueries since otherwise users are required to know the complex XML schema. In existing XML keyword search techniques the XML nodes returned for a keyword query are the Lowest Common Ancestor (LCA) nodes for the query keywords. In this paper, we argue that the LCA based techniques still require users to be well versed with the XML schema and also the data to be able to obtain meaningful query results. To address these shortcomings, we present a novel system, Generic Keyword Search (GKS), for a given keyword query Q, instead of identifying (and returning information) only from LCA nodes, GKS returns ‘meaningful’ information from any XML node, which contains a subset of keywords in the search query Q. GKS response includes LCA nodes, if any, that would have been returned by LCA based techniques. GKS is also able to find highly relevant keywords and XML schema elements, deeper analytical insights called DI in the XML data in the context of the user query. DI enables users to navigate the XML data and to refine their queries even if they are not familiar with the data and the schema. Our experiments on real data sets show that GKS is able to return highly relevant responses to keyword queries efficiently.
TL;DR: This paper identifies the problems of existing keyword search methods and points out that the main reason of these problems is due to the unawareness of the Object-Relationship-Attribute (ORA) semantics in XML/RDB, and proposes an ORA-Semantics based keyword search in XML and RDB.
Abstract: Keyword search in XML and relational databases (RDB) has gained popularity as it provides a user-friendly way to explore structured data. Existing works on XML and RDB keyword search rely only on the structures of XML/RDB data and/or schemas, and this causes serious problems of returning incomplete answers, meaningless answers and overwhelming answers. In this paper, we identify the problems of existing keyword search methods and point out that the main reason for these problems is the unawareness of the Object-Relationship-Attribute (ORA) semantics in XML/RDB. We exploit the ORA semantics in XML and RDB, and capture these semantics by constructing the Object tree for XML, and the Object-Relationship-Mixed (ORM) data graph for RDB, respectively. Based on the Object tree and the ORM data graph, we propose an ORA-Semantics based keyword search in XML and RDB. Our semantic approach can avoid the problems of existing methods and improves the completeness and correctness of keyword search. In addition, we extend the keyword query language to include keywords that match the metadata, i.e., the names of tags in XML and the names of relations and attributes in RDB. These keywords reduce the ambiguities of queries and enable us to infer users' search intention more precisely. Finally, we incorporate aggregate functions and GROUPBY into keyword queries to retrieve statistical information from XML and RDB.
TL;DR: S2CX is presented, an approach that allows efficient evaluation of SQL/XML queries on any relational database system, whether or not it supports SQL/XML, and whose query evaluation scales better, i.e., the larger the dataset, the faster the approach is compared to SQL/XML query evaluation in Oracle 11g and in DB2.
TL;DR: This paper presents TwigStack-MR, which simultaneously processes several twig pattern queries for a massive volume of XML data based on the MapReduce framework, and uses the MapReduce framework, with the full characteristics of distributed environments, to process twig queries efficiently.
Abstract: Twig pattern query is the core operation of XML process, which directly affects the efficiency of XML data query. It is a challenge to manipulate massive XML data, especially on distributed cluster, such as how to effectively ensure the completeness and correctness of the query results, and minimize communication costs between the various machines. In this paper, we present TwigStack-MR, which simultaneously processes several twig pattern queries for a massive volume of XML data based on MapReduce framework. We first split the large scale XML data file into file-splits as input to the distributed storage system. Then we present the distributed twig algorithm, processing different subtrees of the document tree in parallel. Finally we use the MapReduce framework, full characteristics of distributed environments, to process twig query efficiently. The experimental results show that our approach is efficient and scalable on this issue.
TL;DR: Web applications, web services and mobile computing use sensitive data extensively; in some scenarios confidential data is stored in XML nodes, and encryption is the solution for keeping confidential data in the XML file.
Abstract: Web applications, web services and mobile computing use sensitive data extensively. In some scenarios, confidential data is stored in an XML node. Encryption is the solution for keeping confidential data in the XML file. In a symmetric algorithm, encryption and decryption use the same key. Blowfish is a proven symmetric encryption algorithm. Encrypting an entire XML document using Blowfish leads to performance degradation, and encrypting the entire document is not necessary. Semi-confidential information can be encrypted by a custom encryption method using a key parameter. The parameter indicates the sensitivity of the data.
TL;DR: XML-based publish/subscribe (pub/sub) systems have been receiving a great deal of attention from the academic community and the industry as mentioned in this paper, however, not much research has considered using the system or the communication model in the context of XML publication messages delivery.
TL;DR: This paper proposes an approach for managing changes to XML namespaces defined in XML Schemas, and their effects on XML documents that are valid to these schemas, while keeping track of all XML schema and XML instance versions.
Abstract: In XML databases, several works have dealt with changes of basic components of XML Schemas: element and attribute declarations, simple types, and complex type definitions. However, there is no work that has dealt with changes to advanced concepts of XML Schemas like XML namespaces, local/global qualified/unqualified declarations, and schema definition styles. In this paper, we deal with XML namespace evolution. To the best of our knowledge, we are the first to study such a topic (and in an environment that supports schema versioning). More precisely, we propose an approach for managing changes to XML namespaces defined in XML Schemas, and their effects on XML documents that are valid to these schemas, while keeping track of all XML schema and XML instance versions.
TL;DR: A temporal extension of the W3C XQuery Update Facility (XUF) language, named tauXUF (Temporal XUF), which allows manipulating temporal XML data in tauXSchema; both the syntax and the semantics of the update expressions of the XUF language are extended to support temporal aspects.
Abstract: Although temporal XML data are being stored and manipulated by several XML-based applications in different domains (e.g., e-commerce, e-health), there is neither a temporal XML update language proposed by researchers nor built-in support provided by existing XML DBMSs and tools, for maintaining such data. Furthermore, in the well known temporal XML framework tauXSchema, there are no features for inserting, deleting or updating temporal XML instances. In this paper, we bridge these gaps by proposing a temporal extension of the W3C XQuery Update Facility (XUF) language, named tauXUF (Temporal XUF), which allows manipulating temporal XML data in tauXSchema. With tauXUF both the syntax and the semantics of the update expressions of the XUF language are extended to support temporal aspects. Examples are also provided to motivate and illustrate our proposal.
TL;DR: This paper implements a method to identify document order (DO) and sibling relationships using EDC and SDC labels for various real-time XML documents, and results show that identifying DO and sibling relationships using SDC labels performs better than using EDC labels for processing XML queries.
Abstract: XML emerged as a de-facto standard for data representation and information exchange over the World Wide Web. By utilizing the document object model (DOM), an XML document can be viewed as an XML DOM tree. Nodes of an XML tree are labeled to uniquely identify every node by following a labeling scheme. This paper proposes a method to efficiently identify two structural relationships, namely document order (DO) and the sibling relationship, that exist between XML nodes, using two secure labeling schemes: enhanced Dewey coding (EDC) and secure Dewey coding (SDC). These structural relationships influence the performance of XML queries, so they need to be identified efficiently. This paper implements the method to identify DO and sibling relationships using EDC and SDC labels for various real-time XML documents. Experimental results show that identifying DO and sibling relationships using SDC labels performs better than using EDC labels for processing XML queries.
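For readers unfamiliar with Dewey-style labels, the sketch below shows how document order and the sibling relationship can be read directly off plain Dewey labels such as `1.2.3`. It uses ordinary Dewey coding as an assumption. The EDC and SDC schemes studied in the paper are not reproduced here, and the function names are hypothetical.

```python
def parse_label(label):
    """Turn a Dewey label such as '1.2.10' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def precedes(a, b):
    """True if node a comes before node b in document (pre-order) order."""
    # Tuple comparison is component-wise, and a proper prefix sorts first,
    # so an ancestor always precedes its descendants.
    return parse_label(a) < parse_label(b)


def are_siblings(a, b):
    """True if a and b are distinct nodes with the same parent."""
    la, lb = parse_label(a), parse_label(b)
    return la != lb and len(la) == len(lb) and la[:-1] == lb[:-1]
```

Comparing parsed integers rather than raw strings matters: as strings, `1.10` would sort before `1.2`, which is wrong for document order.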
TL;DR: Describes the workflow implemented to convert a dictionary saved as a PDF file into an XML document, its subsequent import into an XML-aware database, and the process to edit, add and delete entries.
Abstract: In this article we describe the workflow implemented to convert a dictionary saved as a PDF file into an XML document, its subsequent import into an XML-aware database, and the process to edit, add and delete entries. The conversion process was challenging given the format of the PDF file and the fine-grained detail of the XML schema that was used. For that, an iterative filtering approach was used. To store the dictionary we decided to use an XML-aware database (eXist-DB) that stores each dictionary entry as a separate resource. It can be queried using a web interface developed using XQuery. The lexicographers can edit entries using the oXygen XML editor, reading and storing them directly in the database. In order to guarantee incremental backups, a mechanism was defined to import the XML database into a Git repository. Finally, a couple of programs were created in order to prepare regular reports on the dictionary revision process, as well as to back it up in a Git repository.
TL;DR: This work proposes two new approaches for XML data compression and compares their solutions with three algorithms: WAP Binary Extensible Markup Language (WBXML), Xmill and Efficient XML Interchange (EXI).
Abstract: Integration of information systems is essential to organizations. Therefore, it is necessary to make different technologies interoperate. Extensible Markup Language (XML) is often used for data exchange because it is self-descriptive and platform-independent. However, XML is a verbose language which may bring problems related to the size of documents. This work proposes two new approaches for XML data compression and compares our solutions with three algorithms: WAP Binary Extensible Markup Language (WBXML), Xmill and Efficient XML Interchange (EXI). The comparison is based on compression rate and compression time for files with different sizes.
TL;DR: This paper proposes a cluster-based technique wherein all parent nodes for a node are aggregated to compute its label by two-step MapReduce jobs, and shows the advantages over a single machine-based system.
Abstract: Massive XML (Extensible Markup Language) data are available on the web. XML data labeling schemes have been suggested for structural query processing of massive XML data. Notable schemes include interval- based, prefix-based, and prime number-based labeling schemes. Of these, the prime number labeling scheme has the advantage of query processing by simple arithmetic operations. However, a parallel algorithm for this scheme does not exist. The requirement that all parents' labels have to be multiplied to obtain the label of a node makes it difficult to label XML data in a parallel fashion. To address the issue, in this paper, we propose a cluster-based technique wherein all parent nodes for a node are aggregated to compute its label by two-step MapReduce jobs. Our experiments on real-world XML datasets showed the advantages over a single machine-based system.
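The arithmetic behind prime number labeling can be sketched on a single machine: each node receives its own prime, its label is that prime multiplied by its parent's label, and an ancestor test reduces to a divisibility check. This is only the sequential baseline, not the two-step MapReduce technique the paper proposes, and the function names are hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def primes():
    """Yield 2, 3, 5, ... by trial division (fine for small documents)."""
    found = []
    for n in itertools.count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def label_tree(root):
    """Return {element: label}, where label = own prime * parent's label."""
    gen = primes()
    labels = {}

    def walk(elem, parent_label):
        labels[elem] = parent_label * next(gen)
        for child in elem:
            walk(child, labels[elem])

    walk(root, 1)
    return labels


def is_ancestor(labels, ancestor, node):
    """True if ancestor is a proper ancestor of node (divisibility test)."""
    return ancestor is not node and labels[node] % labels[ancestor] == 0
```

The dependency is visible in `walk`: a node's label cannot be computed before its parent's, which is what makes naive parallel labeling difficult.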
TL;DR: A technique was developed during the SIMU project, which allows a lossless transformation between XML and the relatively new specification called CBOR (RFC-7049), so that mobile terminals may be connected to such systems with high performance and low bandwidth usage.
Abstract: Many systems use XML as a standardized information transfer between different components. The possibility to describe the data structure through definition documents enables both sender and receiver to validate information sent and received. However, the XML format requires relatively much bandwidth for a transfer via a network. This feature leads to a very high network load in case of applications with a huge amount of data to be transferred. The developed SIEM-like system from the SIMU project (www.simu-project.de) is based on the IF-MAP protocol and therefore uses the XML-based SOAP to represent and transfer data. To reduce network load, a technique was developed during the SIMU project, which allows a lossless transformation between XML and the relatively new specification called CBOR (RFC-7049). Thus, mobile terminals may be connected to such systems with high performance and low bandwidth usage.
TL;DR: A new technique called BFilter is proposed which performs the XML message filtering and matching operation by leveraging branch points in both the XML publication document and user requests or queries, and has a better performance than the well-known YFilter.
Abstract: XML message filtering and matching are important operations for the application layer XML message multicast. As a publish/subscribe system and a specific case of content-based multicast in the application layer, XML message multicast depends highly on the data filtering and matching processes. As the XML applications emerge, efficient XML message filtering and matching become more desirable. Many XML filtering techniques have been proposed in the literature. Most of those techniques do not address complex queries with predicates, twig patterns or branches; some require post-processing or a special coding scheme, which is either time consuming or becomes difficult for management for dynamic changes of user queries. This paper addresses the existing gap in the literature and proposes a new technique called BFilter which performs the XML message filtering and matching operation by leveraging branch points in both the XML publication document and user requests or queries. BFilter evaluates user queries that use backward matching branch points to delay further matching processes until branch points match in the XML publication document and the user query. Using the backward branch point matching technique, XML message filtering can be performed more efficiently as the probability of mismatching in the matching process is reduced. A number of experiments have been conducted and the results demonstrate that for complex queries, BFilter has a better performance than the well-known YFilter.
TL;DR: This work proposes an ontology-based query refinement model for semi-structured information retrieval that consists in reformulating a query by adding attributes from domain ontologies extracted from XML schemas.
Abstract: The characteristics of XML (eXtensible Markup Language) documents have favored the need to develop specific and flexible querying systems while taking into account the coexistence of both structural and content information. The ultimate goal of these systems is to respond to different user expectations which tend to return appropriate answers to their preferences. However, people have often insufficient knowledge about XML data structure and contents, thus frequently obtaining empty answers or having to reformulate the queries several times. To solve this problem, we propose an ontology-based query refinement model for semi-structured information retrieval. It consists in reformulating a query by adding attributes from domain ontologies extracted from XML schemas.
TL;DR: This work introduces Seshat -- the content-based Domain Specific XML stream processing engine for meeting the needs of different subscribing applications, which provides the full support for Boolean logic operators including negation and also supports supplemental operators, such as substring search.
Abstract: Modern applications often have to process and filter information in XML format for reasons of interoperability. Often times those XML messages arrive from publisher at unpredictable rates and must be processed in near real-time to answer complex filtering queries. Towards this end, we introduce Seshat -- the content-based Domain Specific XML stream processing engine for meeting the needs of different subscribing applications. Seshat provides the full support for Boolean logic operators including negation and also supports supplemental operators, such as substring search. Its simple query framework enables filtering queries with variable substitution predicates. We describe the query processing engine and also the implementation details. Seshat engine can be potentially deployed in publish-subscribe brokers for selective message filtering and replication as well as in subscribing applications that need to process the arriving XML messages independently for the purpose of validation. We provide preliminary performance results of the filtering engine and its simple Domain Specific Language processing queries on several real-world XML datasets.
TL;DR: This paper presents an efficient distributed XPath query processing using MapReduce, which simultaneously processes queries for a massive volume of XML data.
Abstract: The volume of XML data is tremendous in many areas, especially in data logging and scientific areas. XPath query is the core operation of XML process. It is a challenge to query massive XML data stored in a distributed manner. In this paper, we present an efficient distributed XPath query processing using MapReduce, which simultaneously processes queries for a massive volume of XML data. We first use virtual nodes to split the large scale XML data file into filesplits to the distributed storage system. Then we present the distributed XPath query algorithm to compute different fragments of the document tree in parallel using the MapReduce framework. Furthermore, in order to handle the large XML data efficiently, we build the partitional index and use random access mechanism to perform the query. The experimentation shows that our approach is efficient and scalable on this issue.
TL;DR: An optimization model for XML data processing based on a heuristic algorithm to extract data from XPath views is proposed and experimental results reveal the effectiveness of the heuristic method used to solve queries on XML documents.
Abstract: Web services allow middleware access to a relational database and require data representation in XML format. The XML views obtained from relational databases can be accessed by using XPath queries. This article proposes an optimization model for XML data processing based on a heuristic algorithm to extract data from XPath views. To this end, the author uses various XPath query classes temporarily stored in cache, as XPath views. For each view selected from cache, a compensation query can be found and composed with in order to solve an XML data query. Experimental results reveal the effectiveness of the heuristic method used to solve queries on XML documents.
TL;DR: This research work generates an equivalent XML schema from an existing data warehouse schema for an organization which does not have an XML platform to manage its web data.
Abstract: A data warehouse is one of the powerful tools for analytical processing. XML, on the other hand, is widely used to handle data in the web environment. XML to data warehouse integration is a subject of interest for business organizations that want to use semi-structured XML for analytical processing. However, in this research work we approached the problem in the reverse direction. Here we generate an equivalent XML schema from the existing data warehouse schema for an organization which does not have an XML platform to manage its web data. The proposed reverse engineering framework uses one of the existing methodologies for converting an XML schema to a data warehouse schema, applied in reverse. Moreover, we have established a formalism to prove the soundness and correctness of both conversion mechanisms.