TL;DR: Multidimensional hierarchical clustering (MHC) of OLAP data is introduced as a way to speed up aggregation queries without additional storage cost for materialization, and performance measurements on real-world data for a typical star schema are presented.
Abstract: Data warehousing applications cope with enormous data sets in the range of gigabytes and terabytes. Queries usually either select a very small subset of this data or perform aggregations on a fairly large data set. Materialized views storing pre-computed aggregates are used to efficiently process queries with aggregations. This approach increases disk space requirements and slows down updates because of the view maintenance problem. Multidimensional hierarchical clustering (MHC) of OLAP data overcomes these problems while offering more flexibility for aggregation paths. Clustering is introduced as a way to speed up aggregation queries without additional storage cost for materialization. Performance and storage cost of our access method are investigated and compared to current query processing scenarios. In addition, performance measurements on real-world data for a typical star schema are presented.
TL;DR: The difficult index selection problem [GHRU97] largely disappears and the UB-Tree offers the potential to integrate OLAP with OLTP in the same processing environment.
Abstract: We investigate the usability and performance of the UB-Tree (universal B-Tree) for multidimensional data, as they arise in all relational databases and in particular in data-warehousing and data-mining applications. The UB-Tree is balanced and has all the guaranteed performance characteristics of B-Trees, i.e., it requires linear space for storage and logarithmic time for the basic operations of insertion, retrieval and deletion. Therefore it can efficiently support OLTP. In addition the UB-Tree preserves clustering of objects with respect to Cartesian distance. Therefore, it shows its main strengths for multidimensional data. It has very high potential for parallel processing. A single UB-Tree can replace a large number of secondary indexes and join indexes including foreign column join indexes (FCJ). For updates this means that only one UB-Tree must be managed instead of several secondary indexes. This reduces runtime and storage requirements substantially. For retrieval the UB-Tree has multiplicative complexity with respect to the relative size of the ranges for range queries, resulting in a dramatic performance improvement over multiple secondary indexes which have additive range query complexity. Furthermore, using the Tetris-Algorithm the UB-Tree enables reading data in any arbitrary sort order without the necessity of external sorting. Thus data need to be read only once to perform most of the operations of the relational algebra, such as ordering, grouping, aggregation, projection and joining. Therefore, the UB-Tree can support OLAP very efficiently. It is useful for geometric databases, data-warehousing and data-mining applications, but even more for databases in general, where multiple secondary indexes on one relation or FCJ-indexes to join several relations are widespread, which can all be replaced by a single UB-Tree index. 
Therefore, the difficult index selection problem [GHRU97] largely disappears and the UB-Tree offers the potential to integrate OLAP with OLTP in the same processing environment.
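The contrast between multiplicative and additive range query complexity can be illustrated with a back-of-the-envelope cost model. The sketch below is an assumption made for illustration, not the paper's cost formula: each range restriction is described by its selectivity (the fraction of the attribute's domain it covers), the UB-Tree is assumed to read roughly the product of the selectivities, and a set of secondary indexes is assumed to touch roughly their sum. The function names are hypothetical.

```python
import math


def ub_tree_fraction(selectivities):
    """Simplified model (assumption): a multidimensional index reads
    about the product of the per-dimension selectivities."""
    return math.prod(selectivities)


def secondary_index_fraction(selectivities):
    """Simplified model (assumption): one secondary index per attribute
    is scanned, so the work is about the sum of the selectivities
    (relative to the table size) before the results are intersected."""
    return sum(selectivities)


if __name__ == "__main__":
    # Hypothetical query restricting three attributes to 10%, 20% and 50%.
    sel = [0.1, 0.2, 0.5]
    print(f"multiplicative: {ub_tree_fraction(sel):.3f}")
    print(f"additive:       {secondary_index_fraction(sel):.3f}")
```

Under these assumptions, the gap between the two models widens as more attributes are restricted, since each additional restriction shrinks the product but enlarges the sum.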
TL;DR: This work proposes two techniques for bulk loading large data sets for the UB-Tree, a multidimensional index structure, which try to minimize I/O and CPU cost and are easily integrated into an RDBMS.
Abstract: We consider the issue of bulk loading large data sets for the UB-Tree, a multidimensional index structure. Especially in data warehousing (DW), data mining and OLAP, efficient bulk loading techniques are necessary, because loading occurs not continuously but only from time to time, usually with large data sets. We propose two techniques, one for initial loading, which creates a new UB-Tree, and one for incremental loading, which adds data to an existing UB-Tree. Both techniques try to minimize I/O and CPU cost. Measurements with artificial data and data of a commercial data warehouse demonstrate that our algorithms are efficient and able to handle large data sets. Like the UB-Tree itself, they are easily integrated into an RDBMS.
TL;DR: The evaluation results prove that the vertical splitting scheme based on the history-offset encoding can reduce retrieval I/O cost, while expanding the required logical address space to store large scale multidimensional datasets.
Abstract: The history-offset encoding we propose is a scheme for encoding multidimensional datasets. In general, a significant problem in implementing multidimensional databases is the saturation of the address space used to address multidimensional data. One solution to this problem is splitting the dimension attributes of the multidimensional data into more than one group, i.e., vertical splitting. We have implemented the vertical splitting scheme for large scale multidimensional datasets based on the history-offset encoding. In this paper, we describe the implementation of the constructed prototype system and experimentally evaluate and compare it with other systems. These systems include PostgreSQL, a conventionally implemented relational DBMS, and the UB tree, which follows a multidimensional approach similar to our history-offset encoding. The evaluation results show that our vertical splitting scheme can reduce retrieval I/O cost, while expanding the required logical address space to store large scale multidimensional datasets. Our method far outperforms PostgreSQL and is somewhat better than the UB tree in retrieval time. The splitting increases storage cost, but the resulting cost is not large compared with that of the other systems.
TL;DR: The UB-Tree comes closer to being a universal index than any other competing index structure, and is flexible, dynamic, relatively easy to integrate into a DBMS kernel, and provides logarithmic worst case guarantees for the basic operations of insertion, deletion, and update.
Abstract: The UB-Tree is an index structure for multidimensional point data. By name, it claims to be universal, but this imposes a huge burden, as there are few things which really prove to be universal. This thesis takes a closer look at aspects where the UB-Tree is not universal at first glance. The first aspect is the discussion of space filling curves (SFCs), in particular comparing the Z-curve and the Hilbert-curve. The Z-curve is used to cluster data indexed by the UB-Tree, and we highlight its advantages in comparison to other SFCs. While the Hilbert-curve provides better clustering, the Z-curve is superior with respect to other metrics, i.e., it is significantly more efficient for calculating addresses and for mapping queries to SFC-segments, and it is able to index arbitrary universes space-efficiently. Thus the Z-curve is more universal here. The second aspect is bulk operations on UB-Trees. Especially for data warehousing, bulk insertion and deletion are crucial operations. We present efficient algorithms for incremental insertion and deletion. The third aspect is the comparison of the UB-Tree with bitmap indexes used for an example data warehousing application. We show how the performance of bitmap indexes is increased by clustering the base table according to an SFC. Still, the UB-Tree proves to be superior. The fourth aspect is the efficient management of data with skewed data distributions. The UB-Tree adapts its partitioning to the actual data distribution, but in comparison to the R-Tree, it suffers from not being able to prune search paths leading to unpopulated space (= dead space). This is caused by partitioning the complete universe with separators. We present a novel index structure, the bounding UB-Tree (BUB-Tree), which is a variant of the UB-Tree inheriting its worst case guarantees for the basic operations while efficiently addressing queries on dead space. 
In comparison to R-Trees, its query performance is similar, while it offers superior maintenance performance and logarithmic worst case guarantees, thus being more universal than the R*-Tree. The last aspect addressed in this thesis is the management of spatial data. The UB-Tree is an index designed for point data; however, spatial objects can also be indexed efficiently with it by mapping them to higher dimensional points. We discuss different mapping methods and their performance in comparison to the RI-Tree and R*-Tree. Our conclusion: The UB-Tree comes closer to being a universal index than any other competing index structure. It is flexible, dynamic, relatively easy to integrate into a DBMS kernel, and provides logarithmic worst case guarantees for the basic operations of insertion, deletion, and update. By extending its concepts to the BUB-Tree, it is able to efficiently support skewed queries on skewed data distributions.
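The claim that Z-curve addresses are cheap to calculate can be made concrete with the standard bit-interleaving construction of a Z-order (Morton) address. The following is a minimal sketch, not the thesis's implementation: the function name `z_address`, the fixed bit width per dimension, and the choice that the first coordinate supplies the more significant bit of each group are all assumptions made for illustration.

```python
def z_address(coords, bits):
    """Interleave the bits of non-negative integer coordinates into one
    Z-order (Morton) address.

    Assumptions for this sketch: every coordinate fits into `bits` bits,
    and within each bit position the first coordinate contributes the
    more significant bit.
    """
    addr = 0
    for bit in range(bits - 1, -1, -1):      # most significant bit first
        for c in coords:
            addr = (addr << 1) | ((c >> bit) & 1)
    return addr


if __name__ == "__main__":
    # Hypothetical 2-d points in an 8 x 8 universe (3 bits per dimension).
    points = [(3, 5), (0, 0), (7, 7), (4, 1)]
    # Sorting by Z-address yields the linear order in which a
    # one-dimensional index such as a B-Tree would store the points.
    for p in sorted(points, key=lambda p: z_address(p, 3)):
        print(p, z_address(p, 3))
```

Because the address is built with shifts and masks only, its cost grows linearly with the total number of bits, which is consistent with the abstract's remark that Z-curve addresses are efficient to compute.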