TL;DR: This work introduces two Bayesian classifiers programmed in SQL: naive Bayes and a classifier based on class decomposition using K-means clustering. The latter achieves high classification accuracy, can efficiently analyze large data sets, and scales linearly.
Abstract: The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.
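The model-computation and scoring tasks described above can be illustrated with a minimal sketch. It assumes a Gaussian naive Bayes model and a horizontal table layout (one column per dimension); the table name `points`, the columns `x1`, `x2`, `g`, and all function names are hypothetical and are not taken from the paper. Only the model computation runs in SQL here; scoring is done in Python for brevity.

```python
import math
import sqlite3

def make_demo_table():
    """Create a tiny in-memory table with a horizontal layout (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x1 REAL, x2 REAL, g INTEGER)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?, ?)",
        [(1.0, 1.0, 0), (2.0, 2.0, 0), (1.5, 1.0, 0),
         (8.0, 9.0, 1), (9.0, 8.0, 1), (8.5, 9.5, 1)],
    )
    return conn

def compute_model(conn):
    """Compute per-class counts, means, and variances with a single GROUP BY."""
    rows = conn.execute("""
        SELECT g, COUNT(*),
               AVG(x1), AVG(x1 * x1) - AVG(x1) * AVG(x1),
               AVG(x2), AVG(x2 * x2) - AVG(x2) * AVG(x2)
        FROM points
        GROUP BY g
    """).fetchall()
    total = sum(row[1] for row in rows)
    return {
        g: {"prior": n / total, "mean": (m1, m2), "var": (v1, v2)}
        for g, n, m1, v1, m2, v2 in rows
    }

def score(model, x):
    """Return the class with the highest Gaussian log-posterior for point x."""
    best_class, best_lp = None, -math.inf
    for g, params in model.items():
        lp = math.log(params["prior"])
        for xi, mu, var in zip(x, params["mean"], params["var"]):
            lp += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        if lp > best_lp:
            best_class, best_lp = g, lp
    return best_class
```

Calling `score(compute_model(make_demo_table()), (1.2, 1.1))` classifies a point near the first cluster. The variance is computed from the averages of `x` and `x * x`, so the whole model needs only one pass over the table.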
TL;DR: This paper presents WideTable, a technique that speeds up analytical data processing systems by denormalizing the database and converting complex queries into simple scans on the resulting wide table.
Abstract: This paper presents a technique called WideTable that aims to improve the speed of analytical data processing systems. A WideTable is built by denormalizing the database, and then converting complex queries into simple scans on the underlying (wide) table. To avoid the pitfalls associated with denormalization, e.g. space overheads, WideTable uses a combination of techniques including dictionary encoding and columnar storage. When denormalizing the data, WideTable uses outer joins to ensure that queries on tables in the schema graph, which are now nested as embedded tables in the WideTable, are processed correctly. Then, using a packed code scan technique, even complex queries on the original database can be answered by using simple scans on the WideTable(s). We experimentally evaluate our methods in a main memory setting using the queries in TPC-H, and demonstrate the effectiveness of our methods, both in terms of raw query performance and scalability when running on many-core machines.
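The role of the outer join in denormalization can be illustrated with a small sketch. It covers only that one idea: it does not implement dictionary encoding, columnar storage, or the packed code scan. The schema (`customers`, `orders`, `wide`) and the function names are hypothetical and are not taken from the paper.

```python
import sqlite3

def build_demo():
    """Build two normalized tables and one denormalized wide table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, nation TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'FRANCE'), (2, 'JAPAN');
        INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0);

        -- The outer join keeps customer 2, who has no orders, in the wide table.
        CREATE TABLE wide AS
            SELECT c.cust_id, c.nation, o.order_id, o.total
            FROM customers AS c
            LEFT OUTER JOIN orders AS o ON o.cust_id = c.cust_id;
    """)
    return conn

def revenue_by_nation(conn):
    """A join query on the original schema becomes a scan of the wide table."""
    return conn.execute("""
        SELECT nation, SUM(total)
        FROM wide
        WHERE order_id IS NOT NULL
        GROUP BY nation
        ORDER BY nation
    """).fetchall()

def customer_count(conn):
    """A query on the embedded customers table alone is still answered correctly."""
    return conn.execute("SELECT COUNT(DISTINCT cust_id) FROM wide").fetchone()[0]
```

With an inner join, the customer without orders would disappear from `wide`, and `customer_count` would undercount. In this sketch, that is the kind of error the outer join prevents.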
TL;DR: In this article, a system is proposed to enable a database administrator to selectively denormalize a database transparently to users and programmers by keeping a record of the mapping between the denormalized fields and the base fields from which they are derived.
Abstract: A system may be used to enable a database administrator to selectively denormalize a database transparently to users and programmers. The system keeps a record of the mapping between the denormalized fields and the base fields from which they are derived. Processors access those recorded links to keep the database self-consistent and to retrieve data from denormalized fields whenever possible.
TL;DR: This article summarizes experience and recommendations for computing data set preprocessing and transformation, the most time-consuming task in data mining projects, inside a database system, and identifies practical advantages and disadvantages based on data mining users' feedback.
Abstract: In general, there is a significant amount of data mining analysis performed outside a database system, which creates many data management issues. This article presents a summary of our experience and recommendations to compute data set preprocessing and transformation inside a database system (i.e., data cleaning, record selection, summarization, denormalization, variable creation, coding), which is the most time-consuming task in data mining projects. This aspect is largely ignored in the literature. We present practical issues, common solutions, and lessons learned when preparing and transforming data sets with the SQL language, based on experience from real-life projects. We then provide specific guidelines to translate programs written in a traditional programming language into SQL statements. Based on successful real-life projects, we present time performance comparisons between SQL code running inside the database system and external data mining programs. We highlight which steps in data mining projects become faster when processed by the database system. More importantly, we identify advantages and disadvantages from a practical standpoint based on data mining users' feedback.
TL;DR: In this article, a method for denormalizing a floating point result is presented. It reuses the floating point unit's existing pipeline resources through the feedback path, using one of the exponent equalizing alignment shifters and an incrementor to round the denormalized result.
Abstract: A system and method for denormalizing a floating point result is disclosed. Denormalized operands are capable of representing much smaller values than can be represented by a number normalized under the ANSI/IEEE standard 754-1985 that governs the representation of numbers in floating point notation to ensure uniformity among floating point notation users. The majority of results will be normalized operands and therefore the floating point unit pipeline is optimized to produce normalized results but contains wider exponent fields in order to represent values received as denormalized numbers. In order to return the result as a denormalized number with the smaller ANSI/IEEE exponent field, denormalization is accomplished by using the same pipeline resources by means of the floating point unit feedback path and uses one of the exponent equalizing alignment shifters and an incrementor in order to round the denormalized result. In this way, denormalized results can be provided without stopping the dispatching of instructions, without providing status bits in the register files and rename registers and without the hold signals often present in other floating point units to accomplish denormalization.
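To make the notion of a denormalized number concrete, the following sketch inspects the bit fields of IEEE 754 double-precision values. It illustrates the number format only, not the hardware pipeline described above. Double precision is an assumption made here because Python floats are doubles, and the helper names `decompose` and `is_denormal` are hypothetical.

```python
import struct
import sys

def decompose(x):
    """Split a double into its (sign, biased exponent, fraction) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

def is_denormal(x):
    """A denormalized double has a zero exponent field and a nonzero fraction."""
    _, exponent, fraction = decompose(x)
    return exponent == 0 and fraction != 0

# Smallest positive normalized double, and a value half that size,
# which is only representable as a denormalized number.
smallest_normal = sys.float_info.min
half_of_smallest_normal = smallest_normal / 2
```

Here `is_denormal(smallest_normal)` is false while `is_denormal(half_of_smallest_normal)` is true, showing how denormalized operands represent values below the normalized range.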