TL;DR: It is shown that the emptiness and finiteness problems are decidable for ranges of DTLdmso programs and that the ranges are closed under intersection with generalized Document Type Definitions (DTDs).
Abstract: Based on the recursion mechanism of the XML transformation language XSL, the document transformation language DTL is defined. First the instantiation DTLreg is considered that uses regular expressions as pattern language. This instantiation closely resembles the navigation mechanism of XSL. For DTLreg the complexity of relevant decision problems such as termination of programs, usefulness of rules and equivalence of selection patterns, is addressed. Next, a much more powerful abstraction of XSL is considered that uses monadic second-order logic formulas as pattern language (DTLmso). If DTLmso is restricted to top-down transformations (DTLdmso), then a computational model can be defined which is a natural generalization to unranked trees of top-down tree transducers with look-ahead. The look-ahead can be realized by a straightforward bottom-up pre-processing pass through the document. The size of the output of an XSL program is at most exponential in the size of the input. By restricting copying in XSL a decidable fragment of DTLdmso programs is obtained which induces transformations of linear size increase (safe DTLdmso). It is shown that the emptiness and finiteness problems are decidable for ranges of DTLdmso programs and that the ranges are closed under intersection with generalized Document Type Definitions (DTDs).
TL;DR: It is shown that in contrast to the undecidability result of [11], the implication and finite implication problems for Pc are decidable in cubic time and are finitely axiomatizable.
Abstract: Path constraints have been studied for semistructured data modeled as a rooted edge-labeled directed graph [4, 11-13]. In this model, the implication problems associated with many natural path constraints are undecidable [11, 13]. A variant of the graph model, called the deterministic data model, was recently proposed in [10]. In this model, data is represented as a graph with deterministic edge relations, i.e., the edges emanating from any node in the graph have distinct labels. This model is more appropriate for representing, e.g., ACeDB [27] databases and Web sites. This paper investigates path constraints for the deterministic data model. It demonstrates the application of path constraints to, among others, query optimization. Three classes of path constraints are considered: the language Pc introduced in [11], an extension of Pc, denoted by Pcw, by including wildcards in path expressions, and a generalization of Pcw, denoted by P*c, by representing paths as regular expressions. The implication problems for these constraint languages are studied in the context of the deterministic data model. It is shown that in contrast to the undecidability result of [11], the implication and finite implication problems for Pc are decidable in cubic time and are finitely axiomatizable. Moreover, the implication problems are decidable for Pcw. However, the implication problems for P*c are undecidable.
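The determinism condition described in this abstract (edges leaving any node carry distinct labels) can be illustrated with a minimal Python sketch. The edge-triple encoding and the function names `is_deterministic` and `follow` are assumptions made for illustration; they are not taken from the paper.

```python
def is_deterministic(edges):
    """Return True if no node has two outgoing edges with the same label
    leading to different targets. `edges` is an iterable of
    (source, label, target) triples (a hypothetical encoding)."""
    targets = {}  # (source, label) -> target
    for source, label, target in edges:
        key = (source, label)
        if key in targets and targets[key] != target:
            return False
        targets[key] = target
    return True


def follow(edges, root, path):
    """Follow a sequence of labels from `root`. In a deterministic graph
    a path reaches at most one node; return it, or None if the path
    does not exist. Assumes `edges` is deterministic."""
    targets = {(s, l): t for s, l, t in edges}
    node = root
    for label in path:
        key = (node, label)
        if key not in targets:
            return None
        node = targets[key]
    return node


# Illustrative data, not from the paper.
edges = [("r", "book", "b1"), ("b1", "author", "p1"), ("b1", "title", "t1")]
print(is_deterministic(edges))
print(follow(edges, "r", ["book", "author"]))
```

Because each label leads to at most one successor, a path expression behaves like a partial function on nodes, which is the property that separates this model from the general graph model.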
TL;DR: A system of untagged union types that can accommodate variations in structure while still allowing a degree of static type checking is described, and algorithms for subtyping and typechecking are developed.
Abstract: Semistructured databases are treated as dynamically typed: they come equipped with no independent schema or type system to constrain the data. Query languages that are designed for semistructured data, even when used with structured data, typically ignore any type information that may be present. The consequences of this are what one would expect from using a dynamic type system with complex data: fewer guarantees on the correctness of applications. For example, a query that would cause a type error in a statically typed query language will return the empty set when applied to a semistructured representation of the same data.
Much semistructured data originates in structured data. A semistructured representation is useful when one wants to add data that does not conform to the original type or when one wants to combine sources of different types. However, the deviations from the prescribed types are often minor, and we believe that a better strategy than throwing away all type information is to preserve as much of it as possible. We describe a system of untagged union types that can accommodate variations in structure while still allowing a degree of static type checking.
A novelty of this system is that it involves non-trivial equivalences among types, arising from a law of distributivity for records and unions: a value may be introduced with one type (e.g., a record containing a union) and used at another type (a union of records). We describe programming and query language constructs for dealing with such types, prove the soundness of the type system, and develop algorithms for subtyping and typechecking.
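The distributivity law mentioned above (a record containing a union can be viewed as a union of records) can be sketched in Python. The `Union` class, the dict-based record encoding, and the `distribute` function are hypothetical and cover only unions that appear directly as field types; this is not the paper's type system.

```python
from itertools import product


class Union:
    """A hypothetical untagged union of alternative types."""

    def __init__(self, *alternatives):
        self.alternatives = alternatives


def distribute(record):
    """Rewrite a record type whose fields may be unions into the list of
    union-free record types it is equivalent to under distributivity.
    Only top-level field unions are handled in this sketch."""
    names = list(record)
    choices = [
        record[name].alternatives if isinstance(record[name], Union) else (record[name],)
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*choices)]


# [name: string, id: int | string] is read as
# [name: string, id: int] | [name: string, id: string]
record_type = {"name": "string", "id": Union("int", "string")}
for alternative in distribute(record_type):
    print(alternative)
```

A record with several union-typed fields expands to the product of the alternatives, so the union-of-records form can be exponentially larger than the record-of-unions form.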
TL;DR: A framework is presented for optimizing the physical distribution of workflow schemas and the mapping of sub-workflow schemas into flowcharts, together with a set of formulas that quantify the metrics used for choosing a near-optimal set of CF clusters for executing a workflow.
Abstract: A central problem in workflow concerns optimizing the distribution of work in a workflow: how should the execution of tasks and the management of tasks be distributed across multiple processing nodes (i.e., computers)? In some cases task management or execution may be at a processing node with limited functionality, and so it is useful to optimize translations of (sub-)workflow schemas into flowcharts that can be executed in a restricted environment, e.g., in a scripting language or using a flowchart-based workflow engine.
This paper presents a framework for optimizing the physical distribution of workflow schemas, and the mapping of sub-workflow schemas into flowcharts. We provide a general model for representing essentially any distribution of a workflow schema, and for representing a broad variety of execution strategies. The model is based on families of "communicating flowcharts" (CFs). In the framework, a workflow schema is first rewritten as a family of CFs that are essentially atomic and execute in parallel. The CFs can be grouped into "clusters". Several CFs can be combined to form a single CF, which is useful when executing a sub-schema on a limited processor. Local rewriting rules are used to specify equivalence-preserving transformations. We developed a set of formulas to quantify the metrics used for choosing a near-optimal set of CF clusters for executing a workflow. The current paper focuses primarily on ECA-based workflow models, such as Flowmark, Meteor and Mentor, and condition-action based workflow models, such as ThinkSheet and Vortex.
TL;DR: The SQL-AG prototype that overcomes limitations by supporting UDAs as originally proposed in Postgres and SQL3 is described, and a unified solution to both the theoretical and practical problems of UDAs is proposed.
Abstract: User-defined aggregates (UDAs) can be the linchpin of sophisticated data mining functions and other advanced database applications, but they find little support in current database systems. In this paper, we describe the SQL-AG prototype that overcomes these limitations by supporting UDAs as originally proposed in Postgres and SQL3. Then we extend the power and flexibility of UDAs by adding (i) early returns (to express online aggregation) and (ii) syntactically recognizable monotonic UDAs that can be used in recursive queries to support applications, such as Bill of Materials (BoM) and greedy algorithms for graph optimization, that cannot be expressed under stratified aggregation. This paper proposes a unified solution to both the theoretical and practical problems of UDAs, and demonstrates the power of UDAs in dealing with advanced database applications.
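The idea of early returns for online aggregation can be sketched in Python as an aggregate that emits intermediate results before the input is exhausted. The generator shape, the name `online_avg`, and the `every` parameter are assumptions for illustration only; they do not reproduce SQL-AG syntax or semantics.

```python
def online_avg(values, every=2):
    """A running average that yields an intermediate result after every
    `every` input values (the "early returns") and a final result once
    the input is exhausted. Hypothetical sketch, not SQL-AG."""
    count, total = 0, 0.0
    for value in values:
        # Per-tuple step: fold the new value into the aggregate state.
        count += 1
        total += value
        if count % every == 0:
            yield total / count  # early return
    if count:
        yield total / count  # final result


for estimate in online_avg([1, 2, 3, 4, 5], every=2):
    print(estimate)
```

A consumer can stop reading as soon as the estimate is good enough, which is the point of online aggregation: the aggregate does not have to see the whole input before producing a usable answer.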
TL;DR: In this paper, the authors show that the areas of semistructured databases and mobile computation have some surprising similarities at the technical level, such as the fact that both areas have to deal with extreme dynamicity of data and behavior.
Abstract: This paper is based on the observation that the areas of semistructured databases [1] and mobile computation [3] have some surprising similarities at the technical level. Both areas are inspired by the need to make better use of the Internet. Despite this common motivation, the technical similarities that arise seem largely accidental, but they should still permit the transfer of some techniques between the two areas. Moreover, if we can take advantage of the similarities and generalize them, we may obtain a broader model of data and computation on the Internet. The ultimate source of similarities is the fact that both areas have to deal with extreme dynamicity of data and behavior. In semistructured databases, one cannot rely on uniformity of structure because data may come from heterogeneous and uncoordinated sources. Still, it is necessary to perform searches based on whatever uniformity one can find in the data. In mobile computation, one cannot rely on uniformity of structure because agents, devices, and networks can dynamically connect, move around, become inaccessible, or crash. Still, it is necessary to perform computations based on whatever resources and connections one can find on the network. We will develop these similarities throughout the paper. As a sample, consider the following arguments. First, one can regard data structures stored inside network nodes as a natural extension of network structures, since on a large time/space scale both networks and data are semistructured and dynamic. Therefore, one can think of applying the same navigational and code mobility techniques uniformly to networks and data. Second, since networks and their resources are semistructured, one can think of applying semistructured database searches to network structure. This is a well-known major problem in mobile computation, going under the name of resource discovery.
TL;DR: IES(SQL), the incremental evaluation system over an SQL-like language with grouping, arithmetics, and aggregation, is considered and it is shown that every second order query is in IES(SQL) and that there are PSPACE-complete queries in IES(SQL).
Abstract: We consider IES(SQL), the incremental evaluation system over an SQL-like language with grouping, arithmetics, and aggregation. We show that every second order query is in IES(SQL) and that there are PSPACE-complete queries in IES(SQL). We further show that every PSPACE query is in IES(SQL) augmented with a deterministic transitive closure operator. Lastly, we consider ordered databases and provide a complete analysis of a hierarchy on IES(SQL) defined with respect to arity-bounded auxiliary relations.
TL;DR: A framework for modelling the execution of active rules, based on abstract interpretation, is defined, which affords the opportunity to compare and verify existing methods for termination analysis of active rules, and also to develop new ones.
Abstract: A crucial requirement for active databases is the ability to analyse the behaviour of the active rules. A particularly important type of analysis is termination analysis. We define a framework for modelling the execution of active rules, based on abstract interpretation. Specific methods for termination analysis are modelled as specific approximations within the framework. The correctness of a method can be established by proving two generic requirements provided by the framework. This affords the opportunity to compare and verify existing methods for termination analysis of active rules, and also to develop new ones.
TL;DR: A theme emerging from the popular query languages such as Lorel, UnQL, StruQL, WebOQL, and the Ulixes/Penelope pair of the ADM model, is that navigation is considered an integral and essential part of querying and that query expressions are often dependent on the particular instance they are applied to.
Abstract: Currently, there is tremendous interest in semi-structured (SS) data management. This is spurred by data sources, such as the ACeDB [29], that are inherently less rigidly structured than traditional DBMS, by WWW documents where no hard rules or constraints are imposed and “anything goes,” and by integration of information coming from disparate sources exhibiting considerable differences in the way they structure information. Significant strides have been made in the development of data models and query languages [2, 11, 17, 6, 7], and to some extent, the theory of queries on semi-structured data [1, 23, 3, 13, 9]. The OEM model of the Stanford TSIMMIS project [2] (equivalently, its variant, independently developed at U.Penn. [11]) has emerged as the de facto standard model for semi-structured data. OEM is a light-weight object model, which unlike the ODMG model that it extends, does not impose the latter’s rigid type constraints. Both OEM and the Penn model essentially correspond to labeled digraphs. A main theme emerging from the popular query languages such as Lorel [2], UnQL [11], StruQL [17], WebOQL [6], and the Ulixes/Penelope pair of the ADM model [7], is that navigation is considered an integral and essential part of querying. Indeed, given the lack of rigid schema of semi-structured data, navigation brings many benefits, including the ability to retrieve data regardless of the depth at which it resides in a tree (e.g., see [4]). This is achieved with programming primitives such as regular path expressions and wildcards. A second, somewhat subtle, aspect of the emerging trend is that query expressions are often dependent on the particular instance they are applied to. This is not surprising, given the lack of rigid structure and the absence of the notion of a predefined schema for semi-structured data. In fact, it has been argued [4] that it is unreasonable to impose a predefined schema.
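The benefit of navigation noted in this abstract, retrieving data regardless of the depth at which it resides in a tree, can be illustrated with a small Python sketch. It assumes tree-shaped data encoded as nested dicts and lists, and the function name `find_at_any_depth` is hypothetical. It mimics a wildcard path of the form "any path, then `label`" and is not the syntax of Lorel, UnQL, or any other language named above.

```python
def find_at_any_depth(node, label):
    """Collect every value stored under an edge named `label`, no matter
    how deeply it is nested. Assumes acyclic, tree-shaped data."""
    found = []
    if isinstance(node, dict):
        for key, child in node.items():
            if key == label:
                found.append(child)
            # Keep descending: the label may occur again further down.
            found.extend(find_at_any_depth(child, label))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_at_any_depth(child, label))
    return found


# Illustrative data with titles at three different depths.
data = {"paper": {"title": "A", "refs": [{"title": "B"}, {"note": {"title": "C"}}]}}
print(find_at_any_depth(data, "title"))
```

The query does not mention `paper`, `refs`, or `note`, so it keeps working if the surrounding structure changes, which is the kind of schema independence the abstract describes.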
TL;DR: A spatial Datalog program is presented that correctly tests topological connectivity of arbitrary compact (i.e., closed and bounded) spatial databases and is guaranteed to terminate on this class of databases.
Abstract: We consider two-dimensional spatial databases defined in terms of polynomial inequalities and focus on the potential of programming languages for such databases to express queries related to topological connectivity. It is known that the topological connectivity test is not first-order expressible. One approach to obtain a language in which connectivity queries can be expressed would be to extend FO+Poly with a generalized (or Lindström) quantifier expressing that two points belong to the same connected component of a given database. For the expression of topological connectivity, extensions of first-order languages with recursion have been studied (in analogy with the classical relational model). Two such languages are spatial Datalog and FO+Poly+WHILE. Although both languages allow the expression of non-terminating programs, their (proven for FO+Poly+WHILE and conjectured for spatial Datalog) computational completeness makes them interesting objects of study.
Previously, spatial Datalog programs have been studied for more restrictive forms of connectivity (e.g., piece-wise linear connectivity) and these programs were proved to correctly test connectivity on restricted classes of spatial databases (e.g., linear databases) only.
In this paper, we present a spatial Datalog program that correctly tests topological connectivity of arbitrary compact (i.e., closed and bounded) spatial databases. In particular, it is guaranteed to terminate on this class of databases. This program is based on a first-order description of a known topological property of spatial databases, namely that locally they are conical.
We also give a very natural implementation of topological connectivity in FO+Poly+WHILE that is based on a first-order implementation of the curve selection lemma and that works correctly on arbitrary spatial database inputs. Finally, we raise the question of whether topological connectivity of arbitrary spatial databases can also be expressed in spatial Datalog.
TL;DR: This paper explains why the currently widely accepted use of the transient keyword is not appropriate in the context of orthogonal persistence, presents a more detailed definition for it, and shows how the handling of transient fields can be efficiently implemented in an orthogonally persistent system while preserving the desired semantics.
Abstract: The transient keyword of the Java™ programming language was originally introduced to prevent specific class fields from being stored by a persistence mechanism. In the context of orthogonal persistence, this is a particularly useful feature, since it allows the developer to easily deal with state that is external to the system. Such state is inherently transient and should not be stored, but instead re-created when necessary. Unfortunately, the Java Language Specification does not accurately define the semantics and correct usage of the transient keyword. This has left it open to misinterpretation by third parties and its current meaning is tied to the popular Java Object Serialisation mechanism. In this paper we explain why the currently widely-accepted use of the transient keyword is not appropriate in the context of orthogonal persistence, we present a more detailed definition for it, and we show how the handling of transient fields can be efficiently implemented in an orthogonally persistent system, while preserving the desired semantics.
TL;DR: Souk is a language-independent, component-based paradigm for data integration designed to allow the rapid construction of data integration solutions from off-the-shelf components, and to allow flexible evolution.
Abstract: Construction of complex software systems with off-the-shelf components has become a reality. Component-based frameworks tailored specifically for the domain of database integration are lacking, however. To use an existing component framework, data integrators must construct custom components specialized to the tasks of the data integration problem at hand. This approach allows other components provided by the framework to be reused, but is overly tedious and requires the integrator to employ the programming paradigms assumed by the component framework for interconnection and intercommunication between components, and manipulation of data provided by them. An alternate approach would employ a framework containing components tailored to data integration and which allows them to be interconnected using programming methods that are more natural to the domain of data integration. Souk is a language-independent, component-based paradigm for data integration. It is designed to allow the rapid construction of data integration solutions from off-the-shelf components, and to allow flexible evolution. This paper gives an overview of this paradigm.
TL;DR: This paper proposes an alternative approach and develops an algebraic language in which the traditional operators and Euclidean constructions work directly on the data represented by "semi-circular" constraints and shows that the language is closed under these operations.
Abstract: Linear constraint databases and query languages are appropriate for spatial database applications. Not only is the data model natural for representing a large portion of spatial data, such as in GIS systems, but there also exist efficient algorithms for the core operations in the query languages. However, an important limitation of the linear constraint data model is that it cannot model constructs such as "Euclidean distance." A previous attempt to extend linear constraint languages with the ability to express Euclidean distance, by Kuijpers, Kuper, Paredaens, and Vandeurzen, is to adapt two fundamental Euclidean constructions with ruler and compass in a first order logic over points. The language, however, requires the input database to be encoded in an ad hoc LPC representation so that the logic operations can apply. This causes a problem: queries in their language may sometimes depend on the encoding and thus do not have any natural meaning. In this paper, we propose an alternative approach and develop an algebraic language in which the traditional operators and Euclidean constructions work directly on the data represented by "semi-circular" constraints. By avoiding the encoding step, our language does not suffer from this problem. We show that the language is closed under these operations.
TL;DR: An algorithm is obtained which does not infer arithmetical or aggregation constraints, and reduces optimization of such query blocks to the well-studied problem of tableau minimization.
Abstract: We present a new optimization method for nested SQL query blocks with aggregation operators. The method is derived from the theory of dependency implication and tableau minimization. It unifies and generalizes previously proposed (seemingly unrelated) algorithms, and can incorporate general database dependencies given in the database schema.
We apply our method to query blocks with max, min aggregation operators. We obtain an algorithm which does not infer arithmetical or aggregation constraints, and reduces optimization of such query blocks to the well-studied problem of tableau minimization. We prove a completeness result for this algorithm: if two max, min blocks can be merged, the algorithm will detect this fact.
TL;DR: In this paper, the authors introduce a new form of attribute grammars (extended AGs) that work directly over extended context-free grammars, rather than over standard context-free grammars, and characterize the expressiveness of extended AGs in terms of monadic second-order logic.
Abstract: Document specification languages like XML model documents using extended context-free grammars. These differ from standard context-free grammars in that they allow arbitrary regular expressions on the right-hand side of productions. To query such documents, we introduce a new form of attribute grammars (extended AGs) that work directly over extended context-free grammars rather than over standard context-free grammars. Viewed as a query language, extended AGs are particularly relevant as they can take into account the inherent order of the children of a node in a document. We show that two key properties of standard attribute grammars carry over to extended AGs: efficiency of evaluation and decidability of well-definedness. We further characterize the expressiveness of extended AGs in terms of monadic second-order logic and establish the complexity of their non-emptiness and equivalence problems to be complete for EXPTIME. As an application we show that the Region Algebra expressions can be efficiently translated into extended AGs. This translation drastically improves the known upper bound on the complexity of the emptiness and equivalence test for Region Algebra expressions.
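The notion of an extended context-free grammar, where the right-hand side of a production is a regular expression over child labels, can be sketched in Python. The grammar, the `(label, children)` tree encoding, and the `conforms` function are illustrative assumptions. The sketch covers only conformance checking, not attribute evaluation or anything else specific to extended AGs.

```python
import re

# Each production maps a label to a regular expression over the ordered
# sequence of child labels. Every label is written with a trailing space
# so that labels cannot run together. Hypothetical grammar.
GRAMMAR = {
    "book": r"(title )(author )+(chapter )*",
    "title": r"",
    "author": r"",
    "chapter": r"(section )*",
    "section": r"",
}


def conforms(tree, grammar):
    """Check that a tree, given as (label, [children]), matches the
    grammar: at every node the ordered child labels must form a word
    in the regular language of that node's production."""
    label, children = tree
    if label not in grammar:
        return False
    child_word = "".join(child[0] + " " for child in children)
    if re.fullmatch(grammar[label], child_word) is None:
        return False
    return all(conforms(child, grammar) for child in children)


doc = ("book", [("title", []), ("author", []), ("author", []),
                ("chapter", [("section", []), ("section", [])])])
print(conforms(doc, GRAMMAR))
```

The number of children of a node is unbounded (any number of authors or chapters), which is why a standard context-free grammar with fixed-length right-hand sides is an awkward fit for such documents.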
TL;DR: The structured object database model ODMG is extended with the ability to handle semistructured data based on the OEM model and Lorel language, and the extensions are implemented in a system called Ozone, which enhances both ODMG/OQL and OEM/Lorel by virtue of their combination.
Abstract: Applications have an increasing need to manage semistructured data (such as data encoded in XML) along with conventional structured data. We extend the structured object database model ODMG and its query language OQL with the ability to handle semistructured data based on the OEM model and Lorel language, and we implement our extensions in a system called Ozone. In our approach, structured data may contain entry points to semistructured data, and vice-versa. The unified representation and querying of such "hybrid" data is the main contribution of our work. We retain strong typing and access to all properties of structured portions of the data while allowing flexible navigation of semistructured data without requiring full knowledge of structure. Ozone also enhances both ODMG/OQL and OEM/Lorel by virtue of their combination. For instance, Ozone allows OEM semantics to be applied to ODMG data, thus supporting semistructured-style navigation of structured data. Ozone also enables ODMG views of OEM data, allowing standard ODMG applications to access semistructured data without losing the benefits of structure. Ozone is implemented on top of the ODMG-compliant O2 database system, and it fully supports our extensions to the ODMG model and OQL.
TL;DR: In this article, a query is a pair consisting of what we want and how we want it, which can be achieved by matching a (partial) schema and the latter by specifying additional operations.
Abstract: Traditional database management requires design and ensures declarativity. In the context of semistructured data a more flexible approach is appropriate due to missing schema information. In this paper we present a query language based on schema matching. Intuitively, a query is a pair consisting of what we want and how we want it. We propose that the former can be achieved by matching a (partial) schema and the latter by specifying additional operations. We describe in some detail our notion of schema covering various concepts typically found in query languages, such as predicates, variables and paths. We outline the optimization potential that this modular approach offers and discuss how we use constraints for query processing.
TL;DR: This paper addresses the issue of designing a simple, user-friendly query language for string databases with a focus on the language FO(•), which is classical first-order logic extended with a concatenation operator, and where quantifiers range over the set of all strings.
Abstract: A string database is simply a collection of tables, the columns of which contain strings over some given alphabet. We address in this paper the issue of designing a simple, user-friendly query language for string databases. We focus on the language FO(•), which is classical first-order logic extended with a concatenation operator, and where quantifiers range over the set of all strings. We wish to capture all string queries, i.e., well-typed and computable mappings involving a notion of string genericity. Unfortunately, unrestricted quantification may allow some queries to have infinite output. This leads us to study the "safety" problem for FO(•), that is, how to build syntactic and/or semantic restrictions so as to obtain a language expressing only queries with finite output, hopefully all string queries. We introduce a family of such restrictions and study their expressiveness and complexity. We prove that none of these languages express all string queries. We prove that a family of these languages is equivalent to a simple, tractable language that we call SriQueL, standing for String Query Language, which thus emerges as a robust and natural language suitable for string querying.