TL;DR: Two database approaches, Extensible Markup Language (XML) and JavaScript Object Notation (JSON), were investigated to evaluate their suitability for handling thousands of publication records; the results showed JSON to be the better choice for query retrieval speed and CPU usage.
Abstract: Big data is the latest industry buzzword for large volumes of structured and unstructured data that can be difficult to process and analyze. Most organizations are looking for the best approach to manage and analyze large volumes of data, especially for decision making. XML is chosen by many organizations because it is a powerful approach for retrieval and storage. However, with the XML approach, the execution time for retrieving large volumes of data is still considerably inefficient due to several factors. In this contribution, two database approaches, Extensible Markup Language (XML) and JavaScript Object Notation (JSON), were investigated to evaluate their suitability for handling thousands of publication records. The results showed that JSON is the better choice for query retrieval speed and CPU usage, both of which are essential for coping with the characteristics of publication data. XML and JSON technologies are still relatively new in comparison to relational databases. Nevertheless, JSON demonstrates greater potential to become a key database technology for handling huge data sets as data volumes increase annually.
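As a rough illustration of the kind of comparison described above, the following minimal sketch runs the same query over a small publication data set stored once as XML and once as JSON. The record layout, field names, and function names are assumptions made for this example; they are not taken from the paper, and the sketch does not reproduce its measurements.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical publication records; the field names are illustrative only.
XML_DOC = """<publications>
  <publication id="1"><title>Sample Title</title><year>2015</year></publication>
  <publication id="2"><title>Another Title</title><year>2016</year></publication>
</publications>"""

JSON_DOC = json.dumps({"publications": [
    {"id": 1, "title": "Sample Title", "year": 2015},
    {"id": 2, "title": "Another Title", "year": 2016},
]})


def titles_from_xml(doc, year):
    """Parse the XML text and return titles published in the given year."""
    root = ET.fromstring(doc)
    return [p.findtext("title") for p in root.iter("publication")
            if p.findtext("year") == str(year)]


def titles_from_json(doc, year):
    """Parse the JSON text and return titles published in the given year."""
    data = json.loads(doc)
    return [p["title"] for p in data["publications"] if p["year"] == year]
```

Both functions can be timed with the standard `timeit` module on a larger data set to compare retrieval speed on a given machine; any such numbers depend on the parser and the data and should not be read as the paper's results.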
TL;DR: This contribution looks at and evaluates new ways to map XML documents to relational database systems preserving their hierarchical structure using the advanced features available following the SQL:2003 standard, which defines complex structure and collection types.
TL;DR: The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types.
Abstract: While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back-and-forth on the level of tree languages and we formally study the size trade-offs between the two languages.
TL;DR: This paper proposes an efficient approach, the mini-XML, for mapping XML into a relational database; path techniques and position information are used to represent complex node relationships.
Abstract: In recent years, XML technology has won wide attention from both industry and academia. It can be used to mark up data, define data types, and define custom markup languages. It is a cross-platform, context-dependent technology in the Internet environment and an effective tool for today's distributed structured information. The S-XML is a new approach for storing semi-structured data; it supports querying nodes in XML with SQL statements and has shown impressive performance on many classic data sets. However, it is difficult to store XML data in a relational database, and the S-XML spends much more time and space to store the data. In this paper, we propose an efficient approach, the mini-XML, for mapping XML into a relational database. In addition, path techniques and position information are used to represent complex node relationships. Finally, two experiments are conducted to show that the proposed method achieves better performance by decreasing storage time and storage space, especially when dealing with large amounts of data.
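The general idea of storing XML nodes in a relational table together with a path and a position can be sketched as follows. This is a generic illustration under assumed names (the `node` table, its columns, and the `shred` function are hypothetical); it is not the mini-XML or S-XML schema, which the abstract does not specify.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store every element as one row holding its root-to-node path and
    its 1-based position among its siblings (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "path TEXT, position INTEGER, text TEXT)"
    )

    def visit(elem, parent_id, parent_path, position):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip()
        cur = conn.execute(
            "INSERT INTO node (parent_id, path, position, text) "
            "VALUES (?, ?, ?, ?)",
            (parent_id, path, position, text),
        )
        node_id = cur.lastrowid
        for i, child in enumerate(elem, start=1):
            visit(child, node_id, path, i)

    visit(ET.fromstring(xml_text), None, "", 1)
    conn.commit()


def texts_at(conn, path):
    """Answer a simple path query with plain SQL."""
    rows = conn.execute(
        "SELECT text FROM node WHERE path = ? ORDER BY id", (path,)
    )
    return [r[0] for r in rows]
```

With a connection from `sqlite3.connect(":memory:")`, calling `shred` on a document and then `texts_at(conn, "/lib/book/title")` returns the title texts in document order. The `parent_id` and `position` columns allow the hierarchy and sibling order to be reconstructed from the flat table.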
TL;DR: XML data provides a new approach to data management that positively affects how many organizations manage and exchange data, and native XML databases show a tremendous performance improvement in handling large XML documents.
Abstract: Nowadays, there is a genuine need for a database system that can store, retrieve and manipulate XML-based data so that data can be exchanged over the web efficiently. XML provides a noteworthy boost to web-based and business-to-business (B2B) applications. Data were normally stored in relational databases and XML was used as a medium to transport data between web-based and business-to-business (B2B) applications. XML quickly became the de facto standard for deploying applications that managed large volumes of data and either wanted to be able to communicate with other businesses or to expose their data on the web. The paper helps to explore and compare the performance of XML-based databases and native XML databases (NXDs). This model was supported by XML-enabled databases (XEDs), which adopted a simple strategy to store XML data: each XML document is decomposed and its data are stored within tables. The fundamental difference between XEDs and NXDs is that the latter adopt the XML data model for storing XML data. Much like hierarchical and object databases, they are able to preserve the hierarchy and ordering of nodes of XML documents in a much more efficient manner than XML-enabled databases, hence the tremendous performance improvement in handling large XML documents. XML data provides a new approach to data management, which positively impacts how many organizations manage and exchange
TL;DR: The major objective of this paper is to explore and compare the two approaches and to derive criteria that can serve as a guideline for selecting the best approach in each circumstance.
Abstract: With the increasing popularity of XML data and a great need for a database management system able to store, retrieve and manipulate XML-based data in an efficient manner, database research communities and software industries have tried to respond to this requirement. XML-enabled databases and native XML databases are two approaches that have been proposed to address this challenge. These two approaches are legacy database systems which are extended to store, retrieve and manipulate XML-based data. The major objective of this paper is to explore and compare the two approaches and to derive criteria that can serve as a guideline for selecting the best approach in each circumstance. In general, native XML database systems are more capable than XML-enabled database systems of managing XML-based data.
TL;DR: This paper devises a holistic algorithm for matching tree-patterns over heterogeneous fuzzy XML data and generates the matches by one scan on the relevant data associated with the tree-pattern, which eliminates re-scanning unnecessary portions of XML documents and redundant intermediate results.
Abstract: Dealing with heterogeneous data underlying fuzzy XML databases is challenging for any task of document management and knowledge discovery, since the structural heterogeneity and uncertainty of the large number of XML data sources make it difficult to effectively answer the structured query, especially the tree-pattern query. To address this issue, we propose a novel framework for managing fuzzy XML queries in a heterogeneous environment in this paper. In particular, we devise a holistic algorithm for matching tree-patterns over heterogeneous fuzzy XML data. Our approach adopts a compact stack technique and generates the matches by one scan on the relevant data associated with the tree-pattern, which eliminates re-scanning unnecessary portions of XML documents and redundant intermediate results. Finally, a comprehensive experimental evaluation conducted on real and synthetic data sets is carried out to show the significance of our approach as a solution for querying heterogeneous data in fuzzy XML documents.
TL;DR: Two new algorithms and the associated indexing structures are developed and shown to perform correctly in processing both independent and/or inter-linked XML documents.
TL;DR: A new structure for streaming XML data is proposed which guarantees confidentiality of the XML data over the wireless stream, and an access mechanism is proposed to efficiently process XML queries over the encrypted XML stream.
TL;DR: In this study, three categories of XML node labelling are analysed to address the open problem of each category, and the execution time and storage space required for labelling an XML tree are compared.
Abstract: The flexible nature of XML documents has motivated researchers to use them for data transmission and storage in different domains. The hierarchical structure of XML documents makes them attractive for research on processing user queries based on labelling, where each label describes the structural position of a node in the tree. In this study, three categories of XML node labelling are analysed to address the open problem of each category. A number of experiments are executed to compare the execution time and storage space required for labelling an XML tree.
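To make the idea of a structural label concrete, here is a minimal sketch of one widely described family, interval (containment) labelling. Whether this family is one of the three categories analysed in the study is an assumption; the abstract does not name them. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def interval_labels(root):
    """Assign each element a (start, end, level) label in one depth-first
    traversal. A node's interval encloses the intervals of all its
    descendants."""
    labels = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1
        labels[elem] = (start, counter, level)

    visit(root, 0)
    return labels


def is_ancestor(a, d):
    """True if the node labelled a is a proper ancestor of the node labelled d."""
    return a[0] < d[0] and d[1] < a[1]


def is_parent(a, d):
    """True if a is an ancestor of d exactly one level above it."""
    return is_ancestor(a, d) and d[2] == a[2] + 1
```

Structural relationships are then decided by comparing labels alone, without walking the tree. A known trade-off of this family is that inserting nodes may require relabelling, which is the kind of cost a comparison of labelling categories would weigh against label size.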
TL;DR: This paper proposes a novel framework called XClusterMaint which serves for both clustering and maintenance of XML documents, and proposes an improved approach which uses a lazy maintenance scheme to improve the performance of cluster maintenance.
Abstract: Web data clustering has been widely studied in the data mining communities. However, dynamic maintenance of web data clusters is still a challenging task. In this paper, we propose a novel framework called XClusterMaint which serves for both clustering and maintenance of XML documents. For clustering, we take both structure and content into account and propose an efficient solution for grouping the documents based on the combination of structure and content similarity. For maintenance, we propose an incremental approach for maintaining the existing clusters dynamically when we receive new incoming XML documents. Since the dynamic maintenance of the clusters is computationally expensive, we also propose an improved approach which uses a lazy maintenance scheme to improve the performance of cluster maintenance. The experimental results on real datasets verify the efficiency of the proposed clustering and maintenance model.
TL;DR: This work aims to define a system to extract data regardless of the nature of their model and make one query enough to retrieve data from different models, which are XML and relational in this case.
TL;DR: This paper reviews various XML keyword query processing techniques and highlights some of the important issues associated with the respective techniques, along with the improvements made to address those issues and thereby improve the overall efficiency of XML keyword search query processing.
Abstract: Keyword search is gaining popularity for querying XML data nowadays, as it relieves users from understanding the complex schemas of XML documents and query languages such as XQuery and XPath. Various query processing techniques and efficient algorithms have been proposed recently to address keyword search over XML data. The most popular techniques for XML keyword search today use the query semantics ELCA (Exclusive LCA) and SLCA (Smallest LCA), both based on LCA (Lowest Common Ancestor). Among these, ELCA captures more meaningful results compared with LCA and SLCA. However, these techniques can result in redundant computation due to problems like common-ancestor-repetition (CAR) and visiting-useless-node (VUN). Irregular schemas of a given XML document and missing elements in it are also problems to consider in keyword query processing over XML data. In this paper we review various XML keyword query processing techniques. We also highlight some of the important issues associated with the respective techniques and the improvements made to address those issues, thereby improving the overall efficiency of XML keyword search query processing.
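The SLCA semantics mentioned above can be illustrated with a deliberately naive sketch over Dewey labels, where each node is identified by the tuple of child positions on the path from the root. The function names and the brute-force strategy are assumptions chosen for clarity; the techniques surveyed in the paper are designed precisely to avoid this kind of redundant computation.

```python
from itertools import product


def lca(labels):
    """Lowest common ancestor of Dewey labels: their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if all(p == parts[0] for p in parts):
            prefix.append(parts[0])
        else:
            break
    return tuple(prefix)


def slca(keyword_lists):
    """Brute-force SLCA: compute the LCA of every combination of one match
    per keyword, then keep only those LCAs that have no proper descendant
    among the other LCAs."""
    candidates = {lca(combo) for combo in product(*keyword_lists)}
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )
```

For example, if one keyword matches the nodes `(1, 1, 1)` and `(1, 2, 1)` and a second keyword matches `(1, 1, 2)` and `(1, 3)`, the candidate LCAs are `(1, 1)` and `(1,)`, and only `(1, 1)` is returned because `(1,)` has a descendant that already covers both keywords. The number of combinations grows multiplicatively with the match lists, so this form is suitable only for small inputs.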
TL;DR: The graph modeling, storage and processing possibilities of XML data are analysed, and it is shown that modeling XML data as a graph and processing it with graph processors is beneficial in many contexts.
Abstract: XML is a standard format for data exchange over the Internet. A huge amount of information is also tagged and stored in XML format. Processing XML data has its difficulties due to the schema-centric and semi-structured nature of the major portion of existing XML data. The tree structure embedded in the data makes it more complicated to process. XML processing using RDBMS systems and native XML databases like BaseX and eXist-DB has its own limitations. Native XML databases are not suitable for distributed processing, so they are bound to the resources of a single system, which are not enough for big data processing. Graph databases and graph database technologies have been emerging in the recent past. They are also suitable for processing big data due to the parallel processing features of graph data processors. Modeling XML data as a graph and processing it with graph processors is beneficial in many contexts. In this paper the graph modeling, storage and processing possibilities of XML data are analysed. The major graph database Neo4j and the GraphX graph processor extension embedded in the Apache Spark distributed in-memory processing system are utilized for querying XML data.
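A minimal sketch of the modeling step, under the assumption that each XML element becomes a graph node and each parent-child relationship becomes a directed edge. The `CHILD` edge type, the dictionary layout, and the function name are hypothetical; the paper's actual graph model and its Neo4j or GraphX loading steps are not given in the abstract and are not reproduced here.

```python
import xml.etree.ElementTree as ET


def xml_to_graph(xml_text):
    """Flatten an XML tree into a node list and an edge list.
    Node ids are assigned in document order, starting at 0."""
    nodes, edges = [], []

    def visit(elem, parent_id):
        node_id = len(nodes)
        nodes.append({
            "id": node_id,
            "tag": elem.tag,
            "attrs": dict(elem.attrib),
            "text": (elem.text or "").strip(),
        })
        if parent_id is not None:
            edges.append((parent_id, node_id, "CHILD"))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return nodes, edges
```

Node and edge lists of this shape are a common interchange form for bulk-loading into a graph store or a distributed graph processor, which is where the parallel processing benefits described above would come into play.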
TL;DR: This paper surveys state-of-the-art XML indices and discusses the main issues, tradeoffs and future trends in XML indexing, and presents an index that is specifically designed for the particular architecture of XML data warehouses.
Abstract: With XML becoming a standard for business information representation and exchange, storing, indexing, and querying XML documents have rapidly become major issues in database research. In this context, query processing and optimization are primordial, native-XML databases not being mature yet. Data structures such as indices, which help enhance performances substantially, are extensively researched, especially since XML data bear numerous specificities with respect to relational data. In this paper, we survey state-of-the-art XML indices and discuss the main issues, tradeoffs and future trends in XML indexing. We also present an index that we specifically designed for the particular architecture of XML data warehouses.
TL;DR: This investigation has resulted in the proposal of a novel prefix-encoding method named “Elias-Fibonacci of order 3”, which has achieved the fastest encoding time of all prefix-encoding methods studied in this thesis, whereas Fibonacci encoding was found to require the minimum storage.
Abstract: The flexibility and self-describing nature of XML has made it the most common mark-up language used for data representation over the Web. XML data is naturally modelled as a tree, where the structural tree information can be encoded into labels via XML labelling scheme in order to permit answers to queries without the need to access original XML files. As the transmission of XML data over the Internet has become vibrant, it has also become necessary to have an XML labelling scheme that supports dynamic XML data. For a large-scale and frequently updated XML document, existing dynamic XML labelling schemes still suffer from high growth rates in terms of their label size, which can result in overflow problems and/or ambiguous data/query retrievals.
This thesis considers the compression of XML labels. A novel XML labelling scheme, named “Base-9”, has been developed to generate labels that are as compact as possible and yet provide efficient support for queries to both static and dynamic XML data. A Fibonacci prefix-encoding method has been used for the first time to store Base-9’s XML labels in a compressed format, with the intention of minimising the storage space without degrading XML querying performance. The thesis also investigates the compression of XML labels using various existing prefix-encoding methods. This investigation has resulted in the proposal of a novel prefix-encoding method named “Elias-Fibonacci of order 3”, which has achieved the fastest encoding time of all prefix-encoding methods studied in this thesis, whereas Fibonacci encoding was found to require the minimum storage.
Unlike current XML labelling schemes, the new Base-9 labelling scheme ensures the generation of short labels even after large, frequent, skewed insertions. The advantages of such short labels as those generated by the combination of applying the Base-9 scheme and the use of Fibonacci encoding in terms of storing, updating, retrieving and querying XML data are supported by the experimental results reported herein.
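For readers unfamiliar with Fibonacci prefix encoding, the following minimal sketch shows the standard Fibonacci code for positive integers. It illustrates only the general encoding technique; it is not the Base-9 labelling scheme or the "Elias-Fibonacci of order 3" method, neither of which is specified in the abstract, and the function names are assumptions.

```python
def fib_encode(n):
    """Standard Fibonacci code of a positive integer: the Zeckendorf
    representation written lowest term first, followed by a closing '1'.
    Every codeword therefore ends in '11' and contains no other '11'."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for n >= 1")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number greater than n
    bits = ["0"] * len(fibs)
    remaining = n
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= remaining:
            bits[i] = "1"
            remaining -= fibs[i]
    return "".join(bits) + "1"


def fib_decode_stream(bits):
    """Decode a concatenation of Fibonacci codewords back into integers.
    No separators are needed because '11' marks the end of each codeword."""
    values = []
    a, b, total, prev = 1, 2, 0, "0"
    for bit in bits:
        if bit == "1" and prev == "1":
            values.append(total)
            a, b, total, prev = 1, 2, 0, "0"
            continue
        if bit == "1":
            total += a
        a, b = b, a + b
        prev = bit
    return values
```

Because the code is self-delimiting, a sequence of label components can be stored back to back in a bit string and still be split unambiguously, and small values receive short codewords.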
TL;DR: A new index structure is proposed which combines siblings of the terminal nodes into one path and efficiently processes twig queries with fewer lookups and joins.
Abstract: Querying nested data has become one of the most challenging issues in retrieving desired information from the Web. Today diverse applications generate a tremendous amount of data in different formats. The data and information exchanged on the Web are commonly expressed in nested representations such as XML and JSON. Unlike data in traditional database systems, they do not have a rigid schema. In general, nested data is managed by storing the data and its structure separately, which significantly reduces data retrieval performance. Ensuring the efficiency of queries which locate the exact positions of elements has become a major challenge. Different indexing structures have been proposed in the literature to improve the performance of query processing on nested structures. Most past research on nested structures concentrates on the structure alone. This paper proposes a new index structure which combines siblings of the terminal nodes into one path and efficiently processes twig queries with fewer lookups and joins. The proposed approach is compared with some of the existing approaches. The results show that queries are processed with better performance than with the existing approaches.
TL;DR: This paper presents an approach for the design and development of a custom notation for an existing XML-based language together with a translator between the new notation and XML; the approach supports iterative design of the language's concrete syntax, allowing its modification based on user feedback.
Abstract: In spite of its popularity, XML provides a poor user experience, and many domain-specific languages can be improved by introducing a custom, more human-friendly notation. This paper presents an approach for the design and development of a custom notation for an existing XML-based language together with a translator between the new notation and XML. The approach supports iterative design of the language's concrete syntax, allowing its modification based on user feedback. The translator is developed using a model-driven approach. It is based on an explicit representation of the language's abstract syntax (metamodel) that can be augmented with mappings to both XML and the custom notation. We provide recommendations for application of the approach and demonstrate them on a case study of a language for the definition of graphs.
TL;DR: This article transforms a large collection of XPath expressions into multiple FSA-based query indexes and then processes XML streams in parallel by virtue of the index-level parallelism, and presents an in-memory MapReduce model that enables processing a large collection of twig pattern joins over XML streams simultaneously.
Abstract: The multicore architecture has been the norm for all computing systems in recent years as it provides CPU-level support for parallelism. However, existing algorithms for processing XML streams do not fully take advantage of this facility since they have not been devised to run in parallel. In this article, we propose several methods to parallelize the finite state automata (FSA)-based XML stream processing technique efficiently. We transform a large collection of XPath expressions into multiple FSA-based query indexes and then process XML streams in parallel by virtue of the index-level parallelism. Each core works only with its own query index so that no synchronization issue occurs while filtering XML streams with multiple path patterns given by users. We also present an in-memory MapReduce model that enables processing a large collection of twig pattern joins over XML streams simultaneously. Twig pattern joins in our approach are performed by multiple hardware threads in a shared and balanced way. Extensive experiments show that our algorithm outperforms conventional algorithms with an 8-core CPU by up to ten times for processing 10 million XPath expressions over XML streams.
TL;DR: This chapter shows the need for better mapping techniques from Relational Database (RDB) all the way to Resource Description Framework (RDF), including coverage of each data model's limitations and benefits for getting better results.
Abstract: This chapter shows the need for better mapping techniques from Relational Database (RDB) all the way to Resource Description Framework (RDF). This includes coverage of each data model's limitations and benefits for getting better results. Each form of data being transformed has its own importance in the field of data science. RDB is a well-known back-end storage for information used by many kinds of applications, especially web, desktop, remote, embedded, and network-based applications. EXtensible Markup Language (XML), in turn, is the well-known standard for transferring data among all computer-related resources regardless of their type, shape, place, capability and capacity, because its form is understandable by applications. Finally, RDF is the semantically enriched and simple form available in the Semantic Web. It comes in handy when linked data is used to make intelligent inference better and more efficient. Multiple algorithms are built to support the experiments of this system and to substantiate the study.
TL;DR: A semi-automatic solution is proposed to introduce semantics into the XML database and to enrich the answers to queries; it gave very encouraging preliminary results.
Abstract: The introduction of data semantics in various fields of science by referring to an ontological database is becoming more and more necessary. With the proliferation of domain ontologies and the large volume of data to be processed, it has become necessary to have data management systems based on ontological systems. Such a system can be exploited via the web, as is the case with XML databases, which will allow us to use semantic databases via the Internet and to enrich responses to XML queries by using a domain terminology ontology. Also, XML presents a flexible hierarchical model suitable for representing huge amounts of data with no absolute and fixed schema. In order to highlight the usefulness of introducing semantics into the XML database and of enriching the answers to queries, we propose a semi-automatic solution that we applied to pharmaceutical databases, and which gave very encouraging preliminary results.
TL;DR: XMLValue is presented to automatically recommend XML attribute values using association rules and NLP techniques; it is general enough to support a variety of frameworks and has real-time performance for code assistance.
Abstract: Frameworks are popularly used to reduce implementation complexity and improve productivity. Unfortunately, most frameworks are quite complex and not well documented. Hence, programming correctly and effectively with a framework is still a great challenge. One of the significant obstacles to using a framework smoothly is the complicated attribute value configuration of XML files. To overcome these difficulties, we present XMLValue to automatically recommend XML attribute values using association rules and NLP techniques. Experimental results show that our tool is efficient and effective for mining reusable configuration snippets, significantly shortens development time for framework-based programming, is general enough to support a variety of frameworks, and has real-time performance for code assistance.
TL;DR: In this paper, a computer-implemented method for offloading extensible markup language (XML) data to a distributed file system may include receiving a command to populate a distributed file system with an XML table of a database.
Abstract: A computer-implemented method for offloading extensible markup language (XML) data to a distributed file system may include receiving a command to populate a distributed file system with an XML table of a database. The XML table may be queried in response to the command. The source data in the XML table may be offloaded, by a computer processor, to the distributed file system in response to the querying. The offloading may include converting the source data to a string version of the source data and converting the string version of the source data back into XML format.
TL;DR: This work proposes a new hybrid matcher algorithm, called TRC-matcher, that is targeted at matching business-oriented XML schemas with little or no user assistance; the efficiency of the new algorithm is based on a new content profiling algorithm and on an intelligent combination of the results of multiple matching algorithms.
Abstract: Modern society depends on access to a wide range of information that is located in heterogeneous data sources. Schema matching is the task of finding relationships among data source elements automatically. However, most existing schema matching software is semi-automatic, meaning that it needs a lot of interaction from an expert familiar with the systems being integrated. In this work, we propose a new hybrid matcher algorithm, called TRC-matcher, that is targeted at matching business-oriented XML schemas with little or no user assistance. When compared to previously published schema matching methods, the efficiency of the new algorithm is based on a new content profiling algorithm and on an intelligent combination of the results of multiple matching algorithms. In addition, an enhanced version of the TRC-matcher is introduced that combines machine learning methods with a few new matching algorithms.
TL;DR: A simulation of JSON- and XML-output-based web services was carried out to choose the better communication service for native apps; it is observed that JSON can largely be adopted for web applications, being a very thin object which can be communicated and conveyed to any native application.
Abstract: JSON is broadly known as one of the formats used by RESTful services. Its standard for communicating data objects is the key-value pair format. RESTful services have largely adopted it for asynchronous communication between the browser and the server. Currently it is used by AJAX. In REST services, a request can be identified by a particular request to a URL, which gives responses in XML, JSON or HTML. This is possible since REST services are able to cope with large data processing tasks. A simulation of JSON- and XML-output-based web services was carried out. The simulation results help to choose the better communication service for native apps. With the help of the simulation results, we observe that JSON can largely be adopted for web applications, being a very thin object which can be communicated and conveyed to any native application.
TL;DR: Today, digital watermarking technology has emerged as an effective tool for relational databases and eXtensible Mark-up Language (XML) data in order to protect copyright, detect tampering, trace traitors, and maintain the integrity of the data.
Abstract: Today, digital watermarking technology has emerged as an effective tool for relational databases and eXtensible Mark-up Language (XML) data in order to protect copyright, detect tampering, trace traitors, and maintain the integrity of the data.
TL;DR: This work designs and develops a data exchange system that supports the mapping and integration of different structures of literature resource, and parsing resources at the same time, so that users can upload and verify the XML schema files according to their individual demands for data exchange.
Abstract: With the rapid development of computer technology, the barriers to communication among different systems caused by system heterogeneity or data structure have been broken down. However, the demands for personalized content and for accuracy in resource exchange and delivery are becoming increasingly high. Existing literature resources such as papers, patents and books have different formats and structures, which leads to many problems in content delivery and inheritance. Thus, based on XML technology, we design and develop a data exchange system. This system supports the mapping and integration of literature resources with different structures, and parses resources at the same time, so that users can upload and verify XML schema files according to their individual demands for data exchange.