TL;DR: This paper proposes a simple data structure, called a join index, for improving the performance of joins in the context of complex queries, and analysis of the join algorithm using join indices shows its excellent performance.
Abstract: In new application areas of relational database systems, such as artificial intelligence, the join operator is used more extensively than in conventional applications. In this paper, we propose a simple data structure, called a join index, for improving the performance of joins in the context of complex queries. For most of the joins, updates to join indices incur very little overhead. Some properties of a join index are (i) its efficient use of memory and adaptiveness to parallel execution, (ii) its compatibility with other operations (including select and union), (iii) its support for abstract data type join predicates, (iv) its support for multirelation clustering, and (v) its use in representing directed graphs and in evaluating recursive queries. Finally, the analysis of the join algorithm using join indices shows its excellent performance.
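To make the idea concrete, a join index can be pictured as a precomputed list of identifier pairs, one pair per matching combination of tuples. The sketch below is only an illustration under that reading, not the paper's implementation: the relation contents, the integer surrogates, and the names `build_join_index` and `join_with_index` are all assumptions.

```python
# Hypothetical relations, each keyed by a surrogate (tuple identifier).
customers = {1: ("alice", "paris"), 2: ("bob", "rome"), 3: ("carol", "paris")}
orders = {10: ("alice", "book"), 11: ("carol", "lamp"), 12: ("alice", "pen")}


def build_join_index(r, s, predicate):
    """Precompute the surrogate pairs whose tuples satisfy the join predicate."""
    return sorted(
        (rid, sid)
        for rid, r_tuple in r.items()
        for sid, s_tuple in s.items()
        if predicate(r_tuple, s_tuple)
    )


def join_with_index(r, s, join_index):
    """Answer the join by following the stored pairs; no tuples are compared."""
    return [r[rid] + s[sid] for rid, sid in join_index]


def same_name(customer, order):
    # Any Python callable works here, which loosely mirrors the idea of
    # join predicates that are not plain equality on a column.
    return customer[0] == order[0]


ji = build_join_index(customers, orders, same_name)
result = join_with_index(customers, orders, ji)
```

Here the index is built once with a full comparison, and each later join only dereferences the stored pairs.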
TL;DR: The different kinds of joins and the various implementation techniques are surveyed and they are classified based on how they partition tuples from different relations.
Abstract: The join operation is one of the fundamental relational database query operations. It facilitates the retrieval of information from two different relations based on a Cartesian product of the two relations. The join is one of the most difficult operations to implement efficiently, as no predefined links between relations are required to exist (as they are with network and hierarchical systems). The join is the only relational algebra operation that allows the combining of related tuples from relations on different attribute schemes. Since it is executed frequently and is expensive, much research effort has been applied to the optimization of join processing. In this paper, the different kinds of joins and the various implementation techniques are surveyed. These different methods are classified based on how they partition tuples from different relations. Some require that all tuples from one be compared to all tuples from another; other algorithms only compare some tuples from each. In addition, some techniques perform an explicit partitioning, whereas others are implicit.
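The contrast between comparing every pair of tuples and comparing only some of them can be sketched with two textbook equi-join routines. This is a minimal illustration, not the survey's own classification code: the function names, the tuple layout, and the sample data are assumptions.

```python
from collections import defaultdict


def nested_loop_join(r, s, key_r, key_s):
    """Compare every tuple of r with every tuple of s (no partitioning)."""
    return [(a, b) for a in r for b in s if a[key_r] == b[key_s]]


def hash_join(r, s, key_r, key_s):
    """Partition r by join key, then compare each s tuple with one bucket only."""
    buckets = defaultdict(list)
    for a in r:
        buckets[a[key_r]].append(a)
    return [(a, b) for b in s for a in buckets.get(b[key_s], [])]


# Hypothetical sample relations; position 0 is the join attribute.
r_rows = [(1, "x"), (2, "y"), (2, "z")]
s_rows = [(2, "p"), (3, "q"), (1, "r")]

all_pairs = nested_loop_join(r_rows, s_rows, 0, 0)
some_pairs = hash_join(r_rows, s_rows, 0, 0)
```

Both routines return the same matches, possibly in a different order; they differ only in how many tuple pairs are examined.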
TL;DR: The experiments show that, contrary to claims, radix-hash join is still clearly superior, and sort-merge approaches the performance of radix-hash join only when very large amounts of data are involved.
Abstract: In this paper we experimentally study the performance of main-memory, parallel, multi-core join algorithms, focusing on sort-merge and (radix-)hash join. The relative performance of these two join approaches has been a topic of discussion for a long time. With the advent of modern multi-core architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified based on the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide), and NUMA awareness (sort-merge is superior to hash join in NUMA architectures). We conduct extensive experiments on the original and optimized versions of these algorithms. The experiments show that, contrary to these claims, radix-hash join is still clearly superior, and sort-merge approaches the performance of radix-hash join only when very large amounts of data are involved. The paper also provides the fastest implementations of these algorithms, and covers many aspects of modern hardware architectures relevant not only for joins but for any parallel data processing operator.
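For readers unfamiliar with the two algorithm families, the following single-threaded sketch shows their basic shape. It is an assumption-laden illustration only: it has no SIMD, NUMA, or multi-core logic, it assumes non-negative integer keys, and the names `radix_partition`, `radix_hash_join`, and `sort_merge_join` are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def radix_partition(rows, bits):
    """Split (key, payload) rows into 2**bits partitions by the low key bits."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for key, payload in rows:
        parts[key & mask].append((key, payload))
    return parts


def radix_hash_join(r, s, bits=2):
    """Partition both inputs, then build and probe a hash table per partition."""
    out = []
    for part_r, part_s in zip(radix_partition(r, bits), radix_partition(s, bits)):
        table = defaultdict(list)
        for key, payload in part_r:
            table[key].append(payload)
        for key, payload in part_s:
            for r_payload in table.get(key, []):
                out.append((key, r_payload, payload))
    return out


def sort_merge_join(r, s):
    """Sort both inputs by key, then merge runs of equal keys."""
    r = sorted(r, key=lambda row: row[0])
    s = sorted(s, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key, i_end, j_end = r[i][0], i, j
            while i_end < len(r) and r[i_end][0] == key:
                i_end += 1
            while j_end < len(s) and s[j_end][0] == key:
                j_end += 1
            # Emit the cross product of the two equal-key runs.
            out.extend((key, a[1], b[1]) for a in r[i:i_end] for b in s[j:j_end])
            i, j = i_end, j_end
    return out


# Hypothetical sample inputs of (key, payload) rows.
r_rows = [(5, "a"), (1, "b"), (5, "c")]
s_rows = [(5, "x"), (2, "y"), (1, "z")]
hash_result = radix_hash_join(r_rows, s_rows)
merge_result = sort_merge_join(r_rows, s_rows)
```

The two functions produce the same set of matches; the sketch says nothing about their relative speed, which is what the paper measures.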
TL;DR: A survey of join algorithms with provable worst-case optimality runtime guarantees can be found in this paper, where the authors provide a simpler and unified description of these algorithms that they hope is useful for theory-minded readers, algorithm designers, and systems implementors.
Abstract: Evaluating the relational join is one of the central algorithmic and most well-studied problems in database systems. A staggering number of variants have been considered including Block-Nested loop join, Hash-Join, Grace, Sort-merge (see Graefe [17] for a survey, and [4, 7, 24] for discussions of more modern issues). Commercial database engines use finely tuned join heuristics that take into account a wide variety of factors including the selectivity of various predicates, memory, IO, etc. This study of join queries notwithstanding, the textbook description of join processing is suboptimal. This survey describes recent results on join algorithms that have provable worst-case optimality runtime guarantees. We survey recent work and provide a simpler and unified description of these algorithms that we hope is useful for theory-minded readers, algorithm designers, and systems implementors. Much of this progress can be understood by thinking about a simple join evaluation problem that we illustrate with the so-called triangle query, a query that has become increasingly popular in the last decade with the advent of social networks, biological motifs, and graph databases [36, 37].
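The triangle query asks for all (a, b, c) with (a, b) in R, (b, c) in S, and (a, c) in T. The sketch below evaluates it one variable at a time using set intersections, in the spirit of the algorithms the survey covers. It is a simplified illustration and is not claimed to carry the worst-case guarantees discussed there; the helper names and the sample edge list are assumptions.

```python
from collections import defaultdict


def index(pairs):
    """Map each first component to the set of second components paired with it."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx


def triangles(R, S, T):
    """Return all (a, b, c) with (a, b) in R, (b, c) in S and (a, c) in T."""
    r_idx, s_idx, t_idx = index(R), index(S), index(T)
    out = []
    # Bind a, then b, then c; each candidate set is an intersection of
    # the values allowed by every relation that mentions the variable.
    for a in r_idx.keys() & t_idx.keys():
        for b in r_idx[a] & s_idx.keys():
            for c in s_idx[b] & t_idx[a]:
                out.append((a, b, c))
    return out


# Hypothetical directed edge list used for all three relations.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
found = triangles(edges, edges, edges)
```

By contrast, a plan that first joins R and S pairwise can produce many intermediate tuples that the final join with T then discards.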
TL;DR: This paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples and proposes an algorithm that enables querying of ordered data sets by imposing arbitrary user-defined join predicates.
Abstract: This paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples. This problem arises in many important applications dealing with ordered inputs and multiple ranked data sets, and requiring the top solutions. We use multimedia applications as the motivating examples but the problem is equally applicable to traditional database applications involving optimal resource allocation, scheduling, decision making, ranking, etc. We propose an algorithm that enables querying of ordered data sets by imposing arbitrary user-defined join predicates. The basic version of the algorithm does not use any random access but a variation can exploit available indexes for efficient random access based on the join predicates. A special case includes the join scenario considered by Fagin [1] for joins based on identical keys, and in that case, our algorithms perform as efficiently as Fagin’s. Our main contribution, however, is the generalization to join scenarios that were previously unsupported, including cases where random access in the algorithm is not possible due to lack of unique keys. In addition, our algorithms can support multiple join levels, or nested join hierarchies, which are the norm for modeling multimedia data. We also give approximation versions of both of the above algorithms. Finally, we give strong optimality results for some of the proposed algorithms, and we study their performance empirically.