TL;DR: This paper proposes DDE (Dynamic DEwey), a novel labeling scheme tailored for both static and dynamic XML documents; it completely avoids re-labeling, and its label quality is more resilient to the number and order of insertions than that of existing approaches.
Abstract: Labeling schemes lie at the core of query processing for many XML database management systems. Designing labeling schemes for dynamic XML documents is an important problem that has received a lot of research attention. Existing dynamic labeling schemes, however, often sacrifice query performance and introduce additional labeling cost to facilitate arbitrary updates even when the documents actually seldom get updated. Since the line between static and dynamic XML documents is often blurred in practice, we believe it is important to design a labeling scheme that is compact and efficient regardless of whether the documents are frequently updated or not. In this paper, we propose a novel labeling scheme called DDE (for Dynamic DEwey) which is tailored for both static and dynamic XML documents. For static documents, the labels of DDE are the same as those of Dewey, which yield compact size and high query performance. When updates take place, DDE can completely avoid re-labeling, and its label quality is the most resilient to the number and order of insertions among the existing approaches. In addition, we introduce Compact DDE (CDDE), which is designed to optimize the performance of DDE for insertions. Both DDE and CDDE can be incorporated into existing systems and applications that are based on the Dewey labeling scheme with minimal effort. Experimental results demonstrate the benefits of our proposed labeling schemes over the previous approaches.
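Since the abstract states that DDE labels coincide with Dewey labels on static documents, a minimal sketch of plain Dewey labeling may help make the starting point concrete. It covers only the static case; DDE's handling of insertions is not shown, and the function names (dewey_labels, is_ancestor) are hypothetical choices for this illustration rather than anything taken from the paper.

```python
import xml.etree.ElementTree as ET


def dewey_labels(root):
    """Assign Dewey labels: the root gets (1,), and the i-th child of a
    node labeled p gets p + (i,). Returns a dict mapping label -> tag."""
    labels = {}

    def visit(node, label):
        labels[label] = node.tag
        for i, child in enumerate(node, start=1):
            visit(child, label + (i,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """a is a proper ancestor of b iff a is a proper prefix of b."""
    return len(a) < len(b) and b[:len(a)] == a


doc = ET.fromstring("<a><b><d/></b><c/></a>")
labels = dewey_labels(doc)
for label in sorted(labels):  # tuple order coincides with document order
    print(".".join(map(str, label)), labels[label])
```

Structural relationships are decided from the labels alone: ancestry is a prefix test, and comparing two label tuples gives document order, which is what makes such labels attractive for query processing.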
TL;DR: This article focuses on the verification of temporal properties of runs of Active XML systems, specified in a tree-pattern-based temporal logic, Tree-LTL, which allows expressing a rich class of semantic properties of the application.
Abstract: Active XML is a high-level specification language tailored to data-intensive, distributed, dynamic Web services. Active XML is based on XML documents with embedded function calls. The state of a document evolves depending on the result of internal function calls (local computations) or external ones (interactions with users or other services). Function calls return documents that may be active, and so may activate new subtasks. The focus of this article is on the verification of temporal properties of runs of Active XML systems, specified in a tree-pattern-based temporal logic, Tree-LTL, which allows expressing a rich class of semantic properties of the application. The main results establish the boundary of decidability and the complexity of automatic verification of Tree-LTL properties.
TL;DR: A semiconductor device and a method for fabricating it are disclosed that keep the threshold voltage constant despite a decreased channel width. The device includes first and second conductive type wells in a substrate, first and second gate electrodes on respective gate insulating films, each gate electrode doped with the conductive type opposite to that of its well except for counter-doped edges in the channel width direction, and isolating regions formed between the wells, the gate insulating films, and the gate electrodes.
Abstract: A semiconductor device and a method for fabricating the same are disclosed, which can maintain a constant threshold voltage despite decreased channel width. The device includes first and second conductive type wells in a substrate; first and second gate insulating films on the first and second conductive type wells; a first gate electrode on the first gate insulating film, the first gate electrode being doped with a second conductive type except for edges of the first gate electrode in a channel width direction, which are counter-doped with a first conductive type; a second gate electrode on the second gate insulating film, the second gate electrode being doped with a first conductive type except for edges of the second gate electrode in a channel width direction, which are counter-doped with a second conductive type; and isolating regions formed between the first and second conductive type wells, the first and second gate insulating films, and the first and second gate electrodes.
TL;DR: An attempt has been made to evaluate the quality of XML schema documents (XSD) written in W3C XML Schema language with a metric, which measures the complexity due to the internal architecture of XSD components, and due to recursion.
Abstract: The eXtensible Markup Language (XML) has been gaining extraordinary acceptance from many diverse enterprise software companies for their object repositories, data interchange, and development tools. Further, many different domains, organizations and content providers have been publishing and exchanging information via the Internet using XML and standard schemas. Efficient implementation of XML in these domains requires well-designed XML schemas. From this point of view, the design of XML schemas plays an extremely important role in the software development process and needs to be quantified for ease of maintainability. In this paper, an attempt has been made to evaluate the quality of XML schema documents (XSD) written in the W3C XML Schema language. We propose a metric that measures the complexity due to the internal architecture of XSD components and due to recursion. It is a single metric that covers all major factors responsible for the complexity of XSD. The metric has been empirically and theoretically validated, demonstrated with examples and supported by comparison with other well-known structure metrics applied to XML schema documents.
TL;DR: This work proposes a "pure hardware" based solution, which utilizes XPath query blocks on FPGA to solve the filtering problem and achieves drastically better throughput than the existing software or mixed (hardware/software) architectures.
Abstract: The growing amount of XML encoded data exchanged over the Internet increases the importance of XML based publish-subscribe (pub-sub) and content based routing systems. The input in such systems typically consists of a stream of XML documents and a set of user subscriptions expressed as XML queries. The pub-sub system then filters the published documents and passes them to the subscribers. Pub-sub systems are characterized by very high input ratios, therefore the processing time is critical. In this paper we propose a "pure hardware" based solution, which utilizes XPath query blocks on FPGA to solve the filtering problem. By utilizing the high throughput that an FPGA provides for parallel processing, our approach achieves drastically better throughput than the existing software or mixed (hardware/software) architectures. The XPath queries (subscriptions) are translated to regular expressions which are then mapped to FPGA devices. By introducing stacks within the FPGA we are able to express and process a wide range of path queries very efficiently, on a scalable environment. Moreover, the fact that the parser and the filter processing are performed on the same FPGA chip eliminates expensive communication costs (that a multi-core system would need), thus enabling very fast and efficient pipelining. Our experimental evaluation reveals more than one order of magnitude improvement compared to traditional pub/sub systems.
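The step of translating XPath subscriptions into regular expressions can be illustrated in software. The sketch below is not the FPGA design from the paper: it assumes a restricted XPath subset (only the child axis "/", the descendant axis "//", and the "*" wildcard, with relative paths treated as absolute) and matches the regex against root-to-node tag paths kept on a stack while the document streams by. The names xpath_to_regex and document_matches are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET


def xpath_to_regex(xpath):
    """Translate a linear XPath using only '/', '//' and '*' into a regex
    over root-to-node tag paths written as '/a/b/c'."""
    pattern = ""
    axis = "/"
    # Tokens are either an axis separator or a name test.
    for tok in re.findall(r"//|/|[^/]+", xpath):
        if tok in ("/", "//"):
            axis = tok
            continue
        name = "[^/]+" if tok == "*" else re.escape(tok)
        if axis == "//":
            # Descendant axis: skip any number of intermediate elements.
            pattern += "(?:/[^/]+)*/" + name
        else:
            pattern += "/" + name
    return re.compile(pattern)


def document_matches(xml_text, xpath):
    """Stream the document, keeping the open tags on a stack, and report
    whether any element's root-to-node path matches the subscription."""
    regex = xpath_to_regex(xpath)
    stack = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if regex.fullmatch("/" + "/".join(stack)):
                return True
        else:
            stack.pop()
    return False


print(document_matches("<a><b><c/></b></a>", "/a//c"))  # True
print(document_matches("<a><b><c/></b></a>", "/a/c"))   # False
```

Predicates, attributes, and twig (branching) queries are outside this subset; handling them would need more than a single regex per subscription.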
TL;DR: An apparatus, system, and method for efficient content indexing of streaming XML document content are described, in which a forest generator generates an XML pattern forest from a set of structured index path expressions; the XML pattern forest includes trees and twigs generated from structured index path expressions uniquely associated with a namespace indicator for an XML node.
Abstract: An apparatus, system, and method are disclosed for efficient content indexing of streaming XML document content. A forest generator generates an XML pattern forest from a set of structured index path expressions; the XML pattern forest includes trees and twigs generated from structured index path expressions uniquely associated with a namespace indicator for an XML node. The XML node is identified in a stream of at least one XML document. A comparison module compares the XML node to nodes of trees and twigs of the XML pattern forest. A determination module determines a match between the XML node and an index node in one of a tree and a twig of the XML pattern forest. The index node has a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions. A storage module stores an index entry for the XML node in response to the determined match; the index entry includes an XML document identifier, an XML node name, a namespace indicator for the XML node, and XML node content.
TL;DR: TwigX-Guide is presented, a hybrid system, which takes advantage of the beautiful features of path summary in DataGuide and region encoding in TwigStack to improve complex query processing.
TL;DR: This dissertation aims to provide a history of web exceptionalism from 1989 to 2002, a period chosen in order to explore its roots as well as specific cases up to and including the year in which descriptions of “Web 2.0” began to circulate.
TL;DR: This work identifies XPath fragments which are ideally coupled with the newly introduced P(k)-partition which has its definition grounded in the well-known A(k) structural index and its associated partition.
TL;DR: This work proposes an automatic query refinement method to transform a keyword query into structured XML queries that capture the original information need and conform to the underlying XML data.
Abstract: The structural heterogeneity and complexity of XML repositories makes query formulation challenging for users who have little knowledge of XML. To assist its users, an XML retrieval system can have a keyword-based interface, relegating the task of combining textual and structural clues to the retrieval algorithm. In this work, we propose an automatic query refinement method to transform a keyword query into structured XML queries that capture the original information need and conform to the underlying XML data. We formulate query generation as a search problem, and show the effectiveness of the method in generating accurate content-and-structure queries.
TL;DR: A generic transformation of XML data into the Resource Description Framework (RDF) and its implementation by XSLT transformations is presented to solve the problem of semantic computing.
Abstract: A generic transformation of XML data into the Resource Description Framework (RDF) and its implementation by XSLT transformations is presented. It was developed by the grid integration project for robotic telescopes of AstroGrid-D to provide network communication through the Remote Telescope Markup Language (RTML) to its RDF based information service. The transformation's generality is explained by this example. It automates the transformation of XML data into RDF and thus solves this problem of semantic computing. Its design also permits the inverse transformation but this is not yet implemented.
TL;DR: This paper addresses the problem of efficiently locating relevant XML documents in a P2P network, where a user poses queries in a language such as XPath, and develops a new system called psiX that runs on top of an existing distributed hashing framework.
Abstract: One of the key challenges in a peer-to-peer (P2P) network is to efficiently locate relevant data sources across a large number of participating peers. With the increasing popularity of the extensible markup language (XML) as a standard for information interchange on the Internet, XML is commonly used as an underlying data model for P2P applications to deal with the heterogeneity of data and enhance the expressiveness of queries. In this paper, we address the problem of efficiently locating relevant XML documents in a P2P network, where a user poses queries in a language such as XPath. We have developed a new system called psiX that runs on top of an existing distributed hashing framework. Under the psiX system, each XML document is mapped into an algebraic signature that captures the structural summary of the document. An XML query pattern is also mapped into a signature. The query's signature is used to locate relevant document signatures. Our signature scheme supports holistic processing of query patterns without breaking them into multiple path queries and processing them individually. The participating peers in the network collectively maintain a collection of distributed hierarchical indexes for the document signatures. Value indexes are built to handle numeric and textual values in XML documents. These indexes are used to process queries with value predicates. Our experimental study on PlanetLab demonstrates that psiX provides an efficient location service in a P2P network for a wide variety of XML documents.
TL;DR: In this paper, a novel method is proposed, called S^3, which can selectively process the document's nodes and substantially outperform previous QTP processing methods w.r.t. response time, I/O overhead, and memory consumption - critical parameters in any real multi-user environment.
Abstract: XML queries are frequently based on path expressions where their elements are connected to each other in a tree-pattern structure, called query tree pattern (QTP). Therefore, a key operation in XML query processing is finding those elements which match the given QTP. In this paper, we propose a novel method, called S^3, which can selectively process the document's nodes. In S^3, unlike all previous methods, path expressions are not directly executed on the XML document, but first they are evaluated against a guidance structure, called QueryGuide. Enriched by information extracted from the QueryGuide, a query execution plan, called SMP, is generated to provide focused pattern matching and avoid document access as far as possible. Moreover, our experimental results confirm that S^3 and its optimized version OS^3 substantially outperform previous QTP processing methods w.r.t. response time, I/O overhead, and memory consumption - critical parameters in any real multi-user environment.
TL;DR: A method for planarizing metal plugs for device interconnections by providing a semiconductor structure with at least one device thereon and planarized using a first chemical mechanical polishing process.
TL;DR: This paper evaluates the efficient access of rich internet applications, especially in the domain of resource-limited embedded devices such as digital picture frames, and proposes a generic adaptation of the EXI format to further increase the efficiency.
Abstract: The tremendous acceptance of web applications, or more specifically, rich media applications, is about to be extended to embedded devices such as mobile phones, digital picture frames or TV sets. The Extensible Markup Language (XML) is one important pillar when we deal with such internet applications. XML is known as the interchange language of the web. Despite its outstanding features, XML is difficult to handle in the domain of embedded devices due to the processing overhead and the verbosity associated with its use. This paper evaluates the efficient access of rich internet applications, especially in the domain of resource-limited embedded devices such as digital picture frames. For this purpose, typical XML source information models for rich internet applications, such as Silverlight and SVG, are evaluated. In this context, the new Efficient XML Interchange (EXI) format is applied and studied. Finally, a generic adaptation of the EXI format is developed to further increase the efficiency. The paper concludes with a proposal for further studies on an integration of EXI-based typed interfaces to reduce the processing complexity for rich media applications on embedded devices.
TL;DR: It is shown that the interplay of XML Signature, XPath, and the XML namespace concept has severe flaws that can be exploited for an attack, and that XML namespaces in general pose real troubles to digital signatures in the XML domain.
Abstract: The XML signature wrapping attack has been one of the most discussed security issues in the Web Services security community in recent years. Until now, the issue has not been solved, and all countermeasure approaches proposed so far were shown to be insufficient. In this paper, we present yet another way to perform signature wrapping attacks by using the XML namespace injection technique. We show that the interplay of XML Signature, XPath, and the XML namespace concept has severe flaws that can be exploited for an attack, and that XML namespaces in general pose real troubles to digital signatures in the XML domain. Additionally, we present and discuss some new approaches in countering the proposed attack vector.
TL;DR: In this paper, a method and apparatus for building and using a persistent XML tree index for navigating an XML document is described, which is stored separately from the XML document content, and thus is able to optimize performance through the use of fixed-sized index entries.
Abstract: A method and apparatus are provided for building and using a persistent XML tree index for navigating an XML document. The XML tree index is stored separately from the XML document content, and thus is able to optimize performance through the use of fixed-sized index entries. The XML document hierarchy need not be constructed in volatile memory, so creating and using the XML tree index scales even for large documents. To evaluate a path expression including descendent or ancestral syntax, navigation links can be read from persistent storage and used directly to find the nodes specified in the path expression. The use of an abstract navigational interface allows applications to be written that are independent of the storage implementation of the index and the content. Thus, the XML tree index can index documents stored at least in a database, a persistent file system, or as a sequence of in memory.
TL;DR: Through empirical evaluation, it is shown that ParDOM yields better scalability than PXP on commodity multicore processors, and can process a wide-variety of XML datasets with complex structures which PXP fails to parse.
Abstract: The extensible markup language XML has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation performed on an XML document for it to be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores, and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, that builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks and parsed in parallel. In the second phase, partial DOM node tree structures created during the first phase, are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme --- each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23] --- a recently proposed parallel DOM parsing algorithm --- on commodity multicore processors. Furthermore, ParDOM can process a wide-variety of XML datasets with complex structures which PXP fails to parse.
TL;DR: This paper designs an adaptive XML keyword search approach, called XBridge, that can derive the semantics of a keyword query and generate a set of effective structured queries by analyzing the given keywords and the schemas of XML data sources.
Abstract: Recently, keyword search has attracted a great deal of attention in XML databases. It is hard to directly improve the relevancy of XML keyword search because many keyword-matched nodes may not contribute to the results. To address this challenge, in this paper we design an adaptive XML keyword search approach, called XBridge, that can derive the semantics of a keyword query and generate a set of effective structured queries by analyzing the given keyword query and the schemas of XML data sources. To efficiently answer a keyword query, we only need to evaluate the generated structured queries over the XML data sources with any existing XQuery search engine. In addition, we extend our approach to process top-k keyword search based on the execution plan to be proposed. The quality of the returned answers can be measured using the context of the keyword-matched nodes and the contents of the nodes together. The effectiveness and efficiency of XBridge are demonstrated with an experimental performance study on real XML data.
TL;DR: This work has adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node, for a distributed solution to be effective and presents both a parallel and distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved.
Abstract: An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large sized XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected. We present both a parallel and distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved. We have adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node, for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of potential speedup that can be achieved for representative XML data sets.
TL;DR: This paper studies a new information-access paradigm for XML data, called "Inks," in which the system searches the underlying data "on the fly" as the user types query keywords, and reports an implementation and experimental evaluation of the proposed search algorithms.
Abstract: In a traditional keyword-search system over XML data, a user composes a keyword query, submits it to the system, and retrieves relevant subtrees. In the case where the user has limited knowledge about the data, often the user feels "left in the dark" when issuing queries, and has to use a try-and-see approach for finding information. In this paper, we study a new information-access paradigm for XML data, called "Inks," in which the system searches on the underlying data "on the fly" as the user types in query keywords. Inks extends existing XML keyword search methods by interactively answering queries. We propose effective indices, early-termination techniques, and efficient search algorithms to achieve a high interactive speed. We have implemented our algorithm, and the experimental results show that our method achieves high search efficiency and result quality.
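The idea of answering a query "on the fly" as keywords are typed can be sketched with a plain prefix lookup over a sorted vocabulary. This is only an illustration of the interaction style: the indices, early-termination techniques, and ranking used by Inks are not described in the abstract and are not reproduced here, and the names build_index and prefix_search are hypothetical.

```python
import bisect
import xml.etree.ElementTree as ET


def build_index(root):
    """Map each lowercase token to the tag paths of the elements whose
    direct text contains it. Returns (sorted vocabulary, index dict)."""
    index = {}

    def visit(node, path):
        here = path + "/" + node.tag
        for token in (node.text or "").lower().split():
            index.setdefault(token, set()).add(here)
        for child in node:
            visit(child, here)

    visit(root, "")
    return sorted(index), index


def prefix_search(vocab, index, prefix):
    """Return the paths of all elements containing a token that starts
    with the (possibly partial) keyword typed so far."""
    prefix = prefix.lower()
    hits = set()
    # Tokens sharing a prefix are contiguous in the sorted vocabulary.
    for token in vocab[bisect.bisect_left(vocab, prefix):]:
        if not token.startswith(prefix):
            break
        hits |= index[token]
    return hits


doc = ET.fromstring(
    "<lib><book><title>XML search</title></book>"
    "<paper><title>xpath filtering</title></paper></lib>"
)
vocab, index = build_index(doc)
for typed in ("x", "xp"):  # simulate the user typing one more character
    print(typed, sorted(prefix_search(vocab, index, typed)))
```

Each additional keystroke narrows the candidate set, which is why the result list can be refreshed without waiting for a complete query.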
TL;DR: This work proposes that semantically similar data are harmonized when extracting data from XML-based data sources and introduces a constructor algebra, which is a powerful tool in the harmonization of XML data.
Abstract: There are numerous approaches for integrating data from heterogeneous data sources. A common background assumption is that the data sources remain quite stable and are known in advance. Hence an integration system can be built to manipulate them. In practice there is, however, often a demand for supporting ad hoc information needs concerning unexpected autonomous data sources containing volatile data. A different approach is therefore needed. We propose that semantically similar data are harmonized when extracting data from XML-based data sources. We introduce a constructor algebra, which is a powerful tool in the harmonization of XML data. This algebra is able to form, for any XML data source, a unique relational representation, called an XML relation. We demonstrate that the XML relation representation supports the grouping and aggregation of data needed, for example, in OLAP (online analytical processing) style applications.
TL;DR: This paper provides a general view, an X-Ray, of Web-available XSD files by identifying which XSD constructs are more and less frequently used, together with an evolution perspective based on XSD files collected in 2005 and 2008.
Abstract: XML has conquered its place as the most used standard for representing Web data. An XML schema may be employed for similar purposes of those from database schemas. There are different languages to write an XML schema, such as DTD and XSD. In this paper, we provide a general view, an X-Ray, on Web-available XSD files by identifying which XSD constructs are more and less frequently used. Furthermore, we provide an evolution perspective, showing results from XSD files collected in 2005 and 2008. Hence, we can also draw some conclusions on what trends seem to exist in XSD usage. The results of such study provide relevant information for developers of XML applications, tools and algorithms in which the schema has a distinguished role.
TL;DR: The framework ensures high autonomy to participating sources as it does not rely on a global schema or on semantic mappings between schemas, and defines a query language and its associated semantics that allow collecting as much information as possible from several heterogeneous XML sources.
Abstract: We propose a framework for querying heterogeneous XML data sources. The framework ensures high autonomy to participating sources as it does not rely on a global schema or on semantic mappings between schemas. The basic intuition is that of extending traditional approaches for approximate query evaluation, by providing techniques for combining partial answers coming from different sources, possibly on the basis of limited knowledge about the local schemas (i.e., key constraints). We define a query language and its associated semantics, that allows us to collect as much information as possible from several heterogeneous XML sources. We provide algorithms for query evaluation and characterize the complexity of the query language. Finally, we validate the approach in a medical application scenario.
TL;DR: A novel approach for mapping an existing object-oriented database into XML and vice versa, where the object graph is derived based on characteristics of the XML schema and the links are simulated in terms of nesting to get a simulated object graph.
Abstract: This paper presents a novel approach for mapping an existing object-oriented database into XML and vice versa. The major motivation for this study is the need to facilitate platform-independent exchange of the content of object-oriented databases and the need to store XML in a structured database. The object-oriented model and XML have many features in common, and thus the two-way mapping from object-oriented databases into XML (and vice versa) should be less problematic. To achieve the mapping, what we call the object graph is derived based on characteristics of the schema to be mapped. For an object-oriented schema, the object graph simply summarizes and includes all nesting and inheritance links, which are the basics of the object-oriented model. Then, the inheritance is simulated in terms of nesting to get a simulated object graph. This way, everything in a simulated object graph is directly representable in XML format. Finally, we handle the mapping of the actual data from the object-oriented database into corresponding XML document(s). On the other hand, the common features between the object-oriented model and XML make it more attractive to map from XML into an object-oriented database; such a mapping preserves database specifics. To achieve the mapping, the object graph is derived based on characteristics of the XML schema; it simply summarizes and includes all complex and simple elements and the links, which are the basics of the XML schema. Then, the links are simulated in terms of nesting to get a simulated object graph. This way, everything in a simulated object graph is directly representable in an object-oriented database. Finally, we handle the mapping of the actual data from XML document(s) into the corresponding object-oriented database. Summary: The paper presents an original two-way mapping between object-oriented databases and XML.
TL;DR: It is shown that, given a suitable translation of LTL formulæ into XQuery expressions, such runtime monitoring of choreography constraints is possible by feeding the trace of messages to a streaming XQuery processor.
Abstract: A wide range of web service choreography constraints on the content and sequentiality of messages can be translated into Linear Temporal Logic (LTL). Although they can be checked statically on abstractions of actual services, it is desirable that violations of these specifications be also detected at runtime. In this paper, we show that, given a suitable translation of LTL formulae into XQuery expressions, such runtime monitoring of choreography constraints is possible by feeding the trace of messages to a streaming XQuery processor. The forward-only fragment of LTL is introduced; it represents the fragment of LTL supported by available streaming engines.
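To give a feel for forward-only runtime monitoring of a message trace, the sketch below checks one response-style constraint in Python. It is not the paper's LTL-to-XQuery translation: the property (every trigger message is later followed by a response message), the finite-trace reading of it, and the name monitor_response are all assumptions made for this illustration.

```python
def monitor_response(trace, trigger, response):
    """Read the trace once, front to back, and check that every `trigger`
    message is followed later by a `response` message.

    Returns None if the constraint holds on this finite trace, otherwise
    the index of the earliest trigger that is still unanswered at the end.
    """
    pending = None  # index of the earliest unanswered trigger, if any
    for i, message in enumerate(trace):
        if message == response:
            pending = None  # answers every trigger seen so far
        elif message == trigger and pending is None:
            pending = i
    return pending


ok_trace = ["login", "order", "confirm", "logout"]
bad_trace = ["login", "order", "confirm", "order"]
print(monitor_response(ok_trace, "order", "confirm"))   # None
print(monitor_response(bad_trace, "order", "confirm"))  # 3
```

The monitor keeps a single piece of state and never looks back at earlier messages, which is the property that makes this style of check compatible with a streaming processor.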
TL;DR: This article considers the problem of filtering a streaming XML data efficiently against a large number of branch XPath queries, and presents how to efficiently return all matching elements for each matching branch query.
Abstract: Efficient XML filtering has been the fundamental technique in recent Web service and XML publish/subscribe applications. In this article, we consider the problem of filtering a streaming XML data efficiently against a large number of branch XPath queries. To improve the performance of XML filtering, branch queries are grouped into similar queries, and the common paths between queries in the same group are identified. After performing structural matching of queries, queries are organized in a way that multiple queries can be evaluated simultaneously in the post-processing phase. In the post-processing phase, join operations are executed in a pipeline fashion, and intermediate join results are shared amongst the queries in the same group. As a result, the total number of join operations performed in the post-processing phase is significantly reduced. In addition, we also present how to efficiently return all matching elements for each matching branch query. Experiments show that our proposal is efficient and scalable compared to previous work.
TL;DR: This paper shows the value of managing XML in databases, outlines the current challenges and improvements that will hopefully promote future research directions, and provides a timely checkpoint of XML data management from an industrial perspective, drawing on experience of developing and supporting Oracle XML products.
Abstract: XML and its related technologies have now been in use for almost a decade. There has been a considerable amount of effort from both research and industry focusing on XML, XQuery/XPath, XSLT and SQL/XML processing in the database. Many research prototypes and industrial products have been built to satisfy the XML use cases. This paper reviews several use cases where XML databases are leveraged to build real-world XML applications. We discuss the lessons learnt in supporting both data-centric and document-centric XMLDB applications within a single database system and the need for the implementation of different XML storage, index and query optimisation techniques for different XML use cases. We show the value of managing XML in databases, the current challenges and improvements that will hopefully promote future research directions. This paper also provides a timely checkpoint of XML data management from an industrial perspective with experience of developing and supporting Oracle XML products.
TL;DR: It turns out that efficient evaluation of a large class of queries is realizable in models where distributional nodes are probabilistically independent, which makes the evaluation of twig patterns with projection tractable in the most expressive family of p-documents, among those considered.
Abstract: We survey recent results on modeling and querying probabilistic XML data. The literature contains a plethora of probabilistic XML models [2, 13, 14, 18, 21, 24, 27], and most of them can be represented by means of p-documents [18] that have, in addition to ordinary nodes, distributional nodes that specify the probabilistic process of generating a random document. The above models are families of p-documents that differ in the types of distributional nodes in use. The focus of this survey is on the tradeoff between the ability to express real-world probabilistic data (in particular, by taking correlations between atomic events into account) and the efficiency of query evaluation. We concentrate on two important issues. The first is the ability to efficiently translate a p-document of one family into that of another. The second is the complexity of query evaluation over p-documents (under the usual semantics of querying probabilistic data, e.g., [4, 9, 10]). It turns out that efficient evaluation of a large class of queries (i.e., twig patterns with projection and aggregate functions) is realizable in models where distributional nodes are probabilistically independent. In other models, the evaluation of a query with projection is very often intractable. In comparison, very simple conjunctive queries are intractable over probabilistic models of relational databases, even when the tuples are probabilistically independent [9, 10]. To handle the limitation exhibited by the above tradeoff, various approaches have been proposed. The first is to allow query answers to be approximate [18], which makes the evaluation of twig patterns with projection tractable in the most expressive family of p-documents, among those considered. This tractability, however, does not carry over to nonmonotonic queries, such as twig patterns with negation or aggregation. The approach presented in [7]
TL;DR: A systematic approach to reverse engineer arbitrary XML documents into their conceptual schema, an extended DTD graph (a DTD graph with data semantics); the approach not only determines the structure of the XML document but also derives candidate data semantics from the XML element instances.
Abstract: Extensible markup language (XML) has become a standard for persistent storage and data interchange via the Internet due to its openness, self-descriptiveness, and flexibility. This article proposes a systematic approach to reverse engineer arbitrary XML documents to their conceptual schema, extended DTD graphs, which are DTD graphs with data semantics. The proposed approach not only determines the structure of the XML document, but also derives candidate data semantics from the XML element instances by treating each XML element instance as a record in a table of a relational database. One application of the determined data semantics is to verify the linkages among elements. Implicit and explicit referential linkages among XML elements are modeled by the parent-children structure and ID/IDREF(S), respectively. As a result, an arbitrary XML document can be reverse engineered into its conceptual schema in an extended DTD graph format.
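As a rough illustration of the structural half of this idea only, the sketch below infers parent-child edges and DTD-style occurrence indicators from a single XML instance. It is not the article's algorithm and derives no data semantics; the function name `infer_structure` and the rule for choosing indicators are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def infer_structure(xml_text):
    """Map (parent_tag, child_tag) to a DTD occurrence indicator: '', '?', '+' or '*'."""
    root = ET.fromstring(xml_text)
    # For every element type, record one Counter of child tags per instance,
    # treating each instance like a record in a table.
    per_instance = defaultdict(list)
    for elem in root.iter():
        per_instance[elem.tag].append(Counter(child.tag for child in elem))

    edges = {}
    for parent, instances in per_instance.items():
        child_tags = set().union(*instances)
        for child in sorted(child_tags):
            lo = min(c[child] for c in instances)   # fewest occurrences under any parent instance
            hi = max(c[child] for c in instances)   # most occurrences under any parent instance
            if lo >= 1:
                edges[(parent, child)] = "" if hi == 1 else "+"
            else:
                edges[(parent, child)] = "?" if hi == 1 else "*"
    return edges
```

The indicators are only candidates, since they reflect what one document happens to contain; a second document could weaken `+` to `*` or a required child to an optional one.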