TL;DR: An approach to the visual analysis of an especially challenging set of problems that exhibit a complex internal data structure; it supports iterative visual analysis by providing means to create complex, composite brushes that span multiple views and are constructed using different combination schemes.
Abstract: The analysis and exploration of multidimensional and multivariate data is still one of the most challenging areas in the field of visualization. In this paper, we describe an approach to visual analysis of an especially challenging set of problems that exhibit a complex internal data structure. We describe the interactive visual exploration and analysis of data that includes several (usually large) families of function graphs f_i(x, t). We describe analysis procedures and practical aspects of the interactive visual analysis specific to this type of data (with emphasis on the function graph characteristic of the data). We adopted the well-proven approach of multiple, linked views with advanced interactive brushing to assess the data. Standard views such as histograms, scatterplots, and parallel coordinates are used to jointly visualize data. We support iterative visual analysis by providing means to create complex, composite brushes that span multiple views and that are constructed using different combination schemes. We demonstrate that engineering applications represent a challenging but very applicable area for visual analytics. As a case study, we describe the optimization of a fuel injection system in diesel engines of passenger cars.
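The idea of composite brushes built with different combination schemes can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the attribute names (`pressure`, `rpm`), the helper names `brush` and `combine`, and the three schemes (`and`, `or`, `sub`) are hypothetical and are not taken from the paper's system.

```python
# Hypothetical records; each one would appear in several linked views.
records = [
    {"pressure": 120, "rpm": 1500},
    {"pressure": 180, "rpm": 2500},
    {"pressure": 200, "rpm": 1200},
    {"pressure": 90, "rpm": 3000},
]


def brush(rows, attribute, low, high):
    """Simple range brush: True where low <= row[attribute] <= high."""
    return [low <= row[attribute] <= high for row in rows]


def combine(mask_a, mask_b, scheme):
    """Combine two brush masks with an assumed combination scheme."""
    if scheme == "and":   # selected by both brushes
        return [a and b for a, b in zip(mask_a, mask_b)]
    if scheme == "or":    # selected by at least one brush
        return [a or b for a, b in zip(mask_a, mask_b)]
    if scheme == "sub":   # selected by the first brush but not the second
        return [a and not b for a, b in zip(mask_a, mask_b)]
    raise ValueError(f"unknown scheme: {scheme}")


# One brush per view (for example a histogram and a scatterplot).
high_pressure = brush(records, "pressure", 150, 250)
low_rpm = brush(records, "rpm", 1000, 2000)
composite = combine(high_pressure, low_rpm, "and")
```

Because every brush reduces to a mask over the same records, a composite brush can be refined step by step by combining it with further brushes from other views.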
TL;DR: The paper reports on an implementation of the new aggregation operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.
Abstract: Business Intelligence solutions, encompassing technologies such as multi-dimensional data modeling and aggregate query processing, are being applied increasingly to non-traditional data. This paper extends multi-dimensional aggregation to apply to data with associated interval values that capture when the data hold. In temporal databases, intervals typically capture the states of reality that the data apply to, or capture when the data are, or were, part of the current database state.
This paper proposes a new aggregation operator that addresses several challenges posed by interval data. First, the intervals to be associated with the result tuples may not be known in advance, but depend on the actual data. Such unknown intervals are accommodated by allowing result groups that are specified only partially. Second, the operator contends with the case where an interval associated with data expresses that the data holds for each point in the interval, as well as the case where the data holds only for the entire interval, but must be adjusted to apply to sub-intervals. The paper reports on an implementation of the new operator and on an empirical study that indicates that the operator scales to large data sets and is competitive with respect to other temporal aggregation algorithms.
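To make the point about result intervals that are not known in advance concrete, here is a minimal sketch of a textbook-style sweep over interval endpoints. It is not the operator proposed in the paper; the function name `instant_temporal_sum`, the tuple layout `(value, start, end)`, and the half-open interval convention are assumptions made for illustration. It covers only the case where a value holds at every point of its interval.

```python
def instant_temporal_sum(tuples):
    """Sum values over the segments induced by the data's own endpoints.

    tuples: list of (value, start, end), each valid on the half-open
    interval [start, end). Returns (start, end, total) for every segment
    between consecutive endpoints in which at least one tuple holds.
    Adjacent segments with equal totals are not merged.
    """
    # The result intervals are derived from the data, not given up front.
    points = sorted({p for _, start, end in tuples for p in (start, end)})
    result = []
    for lo, hi in zip(points, points[1:]):
        # A tuple contributes if it holds throughout the whole segment.
        active = [v for v, start, end in tuples if start <= lo and hi <= end]
        if active:
            result.append((lo, hi, sum(active)))
    return result


# Two tuples with overlapping validity intervals.
segments = instant_temporal_sum([(10, 1, 5), (20, 3, 8)])
```

The two input intervals yield three result intervals, none of which appears in the input as such.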
TL;DR: This paper utilizes two popularly employed indexing techniques, bitmaps and quantization, to correctly and efficiently answer queries in the presence of missing data.
Abstract: Incomplete databases, that is, databases that are missing data, are present in many research domains. It is important to derive techniques to access these databases efficiently. We first show that known indexing techniques for multi-dimensional data search break down in terms of performance when indexed attributes contain missing data. This paper utilizes two popularly employed indexing techniques, bitmaps and quantization, to correctly and efficiently answer queries in the presence of missing data. Query execution and interval evaluation are formalized for the indexing structures based on whether missing data is considered to be a query match or not. The performance of Bitmap indexes and quantization based indexes is evaluated and compared over a variety of analysis parameters for real and synthetic data sets. Insights into the conditions for which to use each technique are provided.
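The two query semantics mentioned above (missing data counts as a match, or it does not) can be illustrated with a toy bitmap index. This is a sketch under assumptions, not the paper's index structure: Python integers stand in for bit vectors, `None` marks a missing value, and the function names are hypothetical.

```python
def build_bitmap_index(values):
    """Build one bitmap per distinct value plus one bitmap for missing rows.

    Bit i of a bitmap is set when row i has that value (or is missing).
    """
    index = {}
    missing = 0
    for row, value in enumerate(values):
        if value is None:
            missing |= 1 << row
        else:
            index[value] = index.get(value, 0) | (1 << row)
    return index, missing


def range_query(index, missing, low, high, missing_matches):
    """OR together the bitmaps of all values in [low, high].

    If missing_matches is True, rows with missing data also qualify.
    """
    result = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            result |= bitmap
    if missing_matches:
        result |= missing
    return result


def rows(bitmap):
    """Decode a bitmap into the list of row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


index, missing = build_bitmap_index([3, None, 7, 5, None, 1])
strict = rows(range_query(index, missing, 2, 5, missing_matches=False))
lenient = rows(range_query(index, missing, 2, 5, missing_matches=True))
```

Keeping the missing rows in a separate bitmap means the same index can answer both kinds of query; only the final OR differs.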
TL;DR: This work applies the concept of multidimensional independent component analysis to the frequency-domain blind source separation problem and proposes a new algorithm that separates independent groups of dependent source components.
Abstract: We tackle the frequency-domain blind source separation problem in a way to avoid permutation correction. By exploiting the facts that the frequency components of a signal have some dependency and that the mixing of sources is restricted to each frequency bin, we apply the concept of multidimensional independent component analysis to the problem and propose a new algorithm that separates independent groups of dependent source components. We introduce general entropic contrast functions for this analysis and a corresponding likelihood function with a multidimensional prior that models the dependent frequency components. We assume circularity for the complex variables and derive a fast algorithm by applying Newton’s method learning rule. The algorithm separates mixed sources even in very challenging acoustic settings.
TL;DR: The problem of grouping and aggregation in the concept-oriented data model, which is based on ordering its elements within a hierarchical multidimensional space, is described and two operations of projection and de-projection are defined.
Abstract: In the paper we describe the problem of grouping and aggregation in the concept-oriented data model. The model is based on ordering its elements within a hierarchical multidimensional space. This order is then used to define all its main properties and mechanisms. In particular, it is assumed that elements positioned higher are interpreted as groups for their lower level elements. Two operations of projection and de-projection are defined for one-dimensional and multidimensional cases. It is demonstrated how these operations can be used for multidimensional analysis.
TL;DR: Multiple 2D graphs are combined into a single 2D graph in which correlations between dimensions are maintained; dimensions can be superimposed on each other and displayed while keeping the 2D format.
Abstract: Multiple two dimensional graphs are combined into a single two dimensional graph wherein correlations between dimensions are maintained. Dimensions in the graph can be superimposed on each other and displayed while maintaining the two dimensional format. This enables the two dimensional graph to be converted to three or more dimensions while maintaining the format and presentation in a two dimensional visual display.
TL;DR: The paper presents dynamic self-organizing neural networks with one-dimensional neighbourhood that can be efficiently applied to complex, multidimensional cluster-analysis problems.
Abstract: The paper presents dynamic self-organizing neural networks with one-dimensional neighbourhood that can be efficiently applied to complex, multidimensional cluster-analysis problems. The proposed networks in the course of learning are able to disconnect their neuron chains into sub-chains, to reconnect some of the sub-chains again, and to dynamically adjust the overall number of neurons in the system; all of that to fit in the best way the structures encoded in data sets. The operation of the proposed technique has been illustrated by means of three synthetic data sets, and then this technique has been tested with the use of two real-life, complex and multidimensional data sets (Optical Recognition of Handwritten Digits Database and Image Segmentation Database of Statlog Databases) available from the ftp-server of the University of California at Irvine (ftp.ics.uci.edu).
TL;DR: Analytical and experimental results demonstrate that a branch-and-bound method is highly effective in practice, outperforming alternative approaches by a significant factor.
Abstract: Given a set of N multi-dimensional points, we study the computation of φ-quantiles according to a ranking function F, which is provided by the user at runtime. Specifically, F computes a score based on the coordinates of each point; our objective is to report the object whose score is the φN-th smallest in the dataset. φ-quantiles provide a succinct summary about the F-distribution of the underlying data, which is useful for online decision support, data mining, selectivity estimation, query optimization, etc. Assuming that the dataset is indexed by a spatial access method, we propose several algorithms for retrieving a quantile efficiently. Analytical and experimental results demonstrate that a branch-and-bound method is highly effective in practice, outperforming alternative approaches by a significant factor.
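The definition of a φ-quantile under a user-supplied ranking function can be written down directly as a brute-force baseline. This sketch sorts all points and is therefore not the branch-and-bound method the abstract refers to, and it uses no spatial index; the function name `phi_quantile` and the convention of rounding φN up to an integer rank are assumptions.

```python
import math


def phi_quantile(points, score, phi):
    """Return the point whose score is the ceil(phi * N)-th smallest.

    points: list of coordinate tuples
    score:  ranking function F, supplied at query time
    phi:    fraction in (0, 1]
    """
    if not points:
        raise ValueError("points must not be empty")
    ranked = sorted(points, key=score)              # O(N log N) baseline
    k = max(1, math.ceil(phi * len(ranked)))        # 1-based rank
    return ranked[k - 1]


# F(x, y) = x + y, chosen only for illustration.
points = [(1, 2), (3, 1), (0, 5), (2, 2)]
median = phi_quantile(points, lambda p: p[0] + p[1], 0.5)
```

An index-based method would aim to avoid scoring and sorting every point; this baseline only fixes what the correct answer is.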
TL;DR: In this article, dynamic self-organizing neural networks with one-dimensional neighborhood are proposed that can be efficiently applied to complex, multidimensional cluster-analysis problems.
Abstract: The paper presents dynamic self-organizing neural networks with one-dimensional neighbourhood that can be efficiently applied to complex, multidimensional cluster-analysis problems. The proposed networks in the course of learning are able to disconnect their neuron chains into sub-chains, to reconnect some of the sub-chains again, and to dynamically adjust the overall number of neurons in the system; all of that – to fit in the best way the structures “encoded” in data sets. The operation of the proposed technique has been illustrated by means of three synthetic data sets, and then, this technique has been tested with the use of two real-life, complex and multidimensional data sets (Optical Recognition of Handwritten Digits Database and Image Segmentation Database of Statlog Databases) available from the ftp-server of the University of California at Irvine (ftp.ics.uci.edu).
TL;DR: A conceptual evolution model based on bitemporal versioning of multidimensional schemas is introduced, which allows one to modify the DW schema in an implementation-independent manner, and without affecting the operation of existing applications.
Abstract: A data warehouse (DW) is a vast collection of historical data built to support multidimensional data analysis applications. In this context, an important problem is that of evolving the implementation (multidimensional, relational) schema of a DW to incorporate new requirements. This paper introduces a conceptual evolution model based on bitemporal versioning of multidimensional schemas, which allows one to modify the DW schema (a) in an implementation-independent manner, and (b) without affecting the operation of existing applications. It also presents a SQL-like language associated with this model, which offers expressions to create and change versions of multidimensional schemas.
TL;DR: Methods and algorithms for a new EIS architecture that takes advantage of a metadatabase to achieve adaptability and flexibility are developed.
Abstract: In tandem with the growth of the Internet and e-business, the number of digital data sources has increased immensely. These data sources contain important transactional data and are generally interconnected via a network. This has created a pressing need for a suitable executive information system (EIS) that is capable of extracting data from internal and external data sources and providing data analysis on demand for business executives. On-demand data analysis requires an information integration approach that can manage rapid changes in data sources. Existing EISs commonly adopt data warehousing technology to consolidate data from multiple sources in a tailor-made fashion, and support predefined multidimensional data analysis. However, this architecture is neither adaptable to changes in local sources nor flexible enough for ad hoc analyses. This paper develops methods and algorithms for a new EIS architecture that takes advantage of a metadatabase to achieve adaptability and flexibility. A PC-based prototype is built to prove the concept.
TL;DR: This paper introduces a multi data warehouse OLAP language for integrating, combining, and analysing data from several, independent data warehouses (DWHs) and may be best compared to multi-database query languages for database integration.
Abstract: Business cooperations frequently require to analyse data across enterprises, where there is no central authority to combine and manage cross-enterprise data. Thus, rather than integrating independent data warehouses into a Distributed Data Warehouse (DDWH) for cross-enterprise analyses, this paper introduces a multi data warehouse OLAP language for integrating, combining, and analysing data from several, independent data warehouses (DWHs). The approach may be best compared to multi-database query languages for database integration. The key difference to these prior works is that they do not consider the multi-dimensional organisation of data warehouses.
TL;DR: The kernel principal components analysis-linear programming techniques for multidimensional analysis of preference (KPCA-LINMAP) evaluation model is presented; it suits reverse logistics evaluation, where indexes are numerous and weakly correlated, better than traditional PCA.
Abstract: To address the limitation of Principal Components Analysis (PCA) in dealing with nonlinear data, and in connection with the Linear Programming Techniques for Multidimensional Analysis of Preference (LINMAP), this paper presents the Kernel Principal Components Analysis-Linear Programming Techniques for Multidimensional Analysis of Preference (KPCA-LINMAP) evaluation model. In addition, the weight of each index can be obtained in this model, which makes up for another shortcoming of PCA. In reverse logistics evaluation, the indexes are numerous and the degree of correlation is not high, so the model fits this situation better than traditional PCA. Finally, the validity and the advantage of this method are verified by an example.
TL;DR: In this paper, a multi data warehouse OLAP language for integrating, combining, and analysing data from several independent data warehouses (DWHs) is proposed; the approach may be best compared to multi-database query languages for database integration.
Abstract: Business cooperations frequently require to analyse data across enterprises, where there is no central authority to combine and manage cross-enterprise data. Thus, rather than integrating independent data warehouses into a Distributed Data Warehouse (DDWH) for cross-enterprise analyses, this paper introduces a multi data warehouse OLAP language for integrating, combining, and analysing data from several, independent data warehouses (DWHs). The approach may be best compared to multi-database query languages for database integration. The key difference to these prior works is that they do not consider the multi-dimensional organisation of data warehouses. The major problems addressed and solutions provided are: (1) a classification of DWH schema and instance heterogeneities at the fact and dimension level, (2) a methodology to combine independent data cubes taking into account the special characteristics of conceptual DWH schemata, i.e., OLAP dimension hierarchies and facts, and (3) a novel query language for bridging these heterogeneities in cross-DWH OLAP queries.
TL;DR: The proposed model is used to develop a multi-dimensional algebra to facilitate operations in a data warehouse and is able to discover the kernel knowledge from the currently formulated knowledge.
TL;DR: The cascaded OLAP (COLAP) operations are proposed that enable ad-hoc specification of queries that encompass multiple stars and it is shown that COLAP can be represented by the relational algebra to demonstrate that the cascaded star can be built on top of the traditional star schema framework.
Abstract: A data warehouse is defined as subject-oriented, integrated, time-variant and nonvolatile collection of data. Often, the data representing different subjects is multi-dimensional in nature, where each dimension of each subject could again be multi-dimensional. We refer to this as hyper-dimensional nature of data. Traditional multi-dimensional data models (e.g., the star schema) cannot adequately model these data. This is because a star schema models one single multi-dimensional subject, hence a complex query crossing different subjects at different dimensional levels has to be specified as multiple queries and the results of each query must be composed together manually. In this paper, we present a novel data model, called the cascaded star model, to model hyper-dimensional data, and propose the cascaded OLAP (COLAP) operations that enable ad-hoc specification of queries that encompass multiple stars. Specifically, our COLAP operations include cascaded-roll-up, cascaded-drill-down, cascaded-slice, cascaded-dice and MCUBE. We show that COLAP can be represented by the relational algebra to demonstrate that the cascaded star can be built on top of the traditional star schema framework.
TL;DR: Clustering and Discrimination, Genetic Algorithms-based Approaches for Clustering Time Series, and Multivariate Methods in Applied Science.
Abstract: Clustering and Discrimination.- Genetic Algorithms-based Approaches for Clustering Time Series.- On the Choice of the Kernel Function in Kernel Discriminant Analysis Using Information Complexity.- Growing Clustering Algorithms in Market Segmentation: Defining Target Groups and Related Marketing Communication.- Graphical Representation of Functional Clusters and MDS Configurations.- Estimation of the Structural Mean of a Sample of Curves by Dynamic Time Warping.- Sequential Decisional Discriminant Analysis.- Regularized Sliced Inverse Regression with Applications in Classification.- Multidimensional Data Analysis and Multivariate Statistics.- Approaches to Asymmetric Multidimensional Scaling with External Information.- Variable Architecture Bayesian Neural Networks: Model Selection Based on EMC.- Missing Data in Optimal Scaling.- Simple Component Analysis Based on RV Coefficient.- Baum-Eagon Inequality in Probabilistic Labeling Problems.- Monotone Constrained EM Algorithms for Multinormal Mixture Models.- Visualizing Dependence of Bootstrap Confidence Intervals for Methods Yielding Spatial Configurations.- Automatic Discount Selection for Exponential Family State-Space Models.- A Generalization of the Polychoric Correlation Coefficient.- The Effects of MEP Distributed Random Effects on Variance Component Estimation in Multilevel Models.- Calibration Confidence Regions Using Empirical Likelihood.- Robust Methods and the Forward Search.- Random Start Forward Searches with Envelopes for Detecting Clusters in Multivariate Data.- Robust Transformation of Proportions Using the Forward Search.- The Forward Search Method Applied to Geodetic Transformations.- An R Package for the Forward Analysis of Multivariate Data.- A Forward Search Method for Robust Generalised Procrustes Analysis.- A Projection Method for Robust Estimation and Clustering in Large Data Sets.- Robust Multivariate Calibration.- Data Mining Methods and Software.- Procrustes Techniques for Text 
Mining.- Building Recommendations from Random Walks on Library OPAC Usage Data.- A Software Tool via Web for the Statistical Data Analysis: R-php.- Evolutionary Algorithms for Classification and Regression Trees.- Variable Selection Using Random Forests.- Boosted Incremental Tree-based Imputation of Missing Data.- Sensitivity of Attributes on the Performance of Attribute-Aware Collaborative Filtering.- Multivariate Methods for Customer Satisfaction and Service Evaluation.- Customer Satisfaction Evaluation: An Approach Based on Simultaneous Diagonalization.- Analyzing Evaluation Data: Modelling and Testing for Homogeneity.- Archetypal Analysis for Data Driven Benchmarking.- Determinants of Secondary School Dropping Out: a Structural Equation Model.- Testing Procedures for Multilevel Models with Administrative Data.- Multidimensional Versus Unidimensional Models for Ability Testing.- Multivariate Methods in Applied Science.- A Spatial Mixed Model for Sectorial Labour Market Data.- The Impact of the New Labour Force Survey on the Employed Classification.- Using CATPCA to Evaluate Market Regulation.- Credit Risk Management Through Robust Generalized Linear Models.- Classification of Financial Returns According to Thresholds Exceedances.- Nonparametric Clustering of Seismic Events.- A Non-Homogeneous Poisson Based Model for Daily Rainfall Data.- A Comparison of Data Mining Methods and Logistic Regression to Determine Factors Associated with Death Following Injury.
TL;DR: This article examines an example application of explorative multidimensional data analysis to cross-sectional, cross-national differences in time use in six European countries at different stages of life.
Abstract: Important time-budget methodological issues are concerned with analysing time use tables, obtainable from time-budget diaries to face the multipurpose nature, the size and the complexity of time-budget data. After a brief introduction to the main time use analysis the paper focuses on the cross-sectional analysis using the explorative multidimensional data analysis. The paper deals with the multiway methods suitable for comparing statistical studies (i.e. countries) when each of them has many variables (i.e. activities) observed on many cases (i.e. categories of population). This article examines an example of application to cross-national differences in time use in six European countries at different stages of life. The results are exemplary of the applicational steps and statistical aspects of the methods proposed rather than definitive findings.
TL;DR: This paper focuses on the analysis of the quantity and type of interaction and cooperation between students in the asynchronous discussion forum of a virtual classroom and proposes to adopt the theoretical framework of Multidimensional Analysis of Textual Data in connection with the theoretical framework of Social Network Analysis.
Abstract: The quality of the cooperation and collaboration between members is one of the crucial factors in the development of an online learning community. In this paper we focus on the analysis of the quantity and type of interaction and cooperation between students in the asynchronous discussion forum of a virtual classroom. In order to describe both the qualitative and quantitative measures of the interrelationships in the net structure we propose to adopt the theoretical framework of Multidimensional Analysis of Textual Data in connection with the theoretical framework of Social Network Analysis. The tools made available by Correspondence Analysis of the lexical table are used to derive a semantic reference space in which to locate the nodes and arcs of the communication network. The underlying interrelation structure and the evolution of the conversational themes are shown by visualizing the students that share the same vocabulary and patterns of frequent lemmas used in the forum. The role of each student in the communication process is highlighted by suitable statistical indicators defined in the framework of Social Network Analysis.
TL;DR: A conceptual data description approach to the multidimensional data model, the UML galaxy diagram, is presented in order to conduct multidimensional data analysis for multiple subjects in e-commerce.
TL;DR: A data-fitting based approach is described in order to analyse the relationships between an outer quantitative response variable and a set of principal axes and to represent by a surface response the information about the quality of the representation derived by a multidimensional data analysis.
Abstract: In this paper we propose to bring together some facets of the Response Surface Methodology used in Design of Experiments and the graphical displays arising from Multidimensional Data Analysis in order to enhance the interpretation of the traditional plots of the Factorial Techniques. By exploiting the peculiar interpretation of a response surface we describe a data-fitting based approach in order to analyse the relationships between an outer quantitative response variable and a set of principal axes. Moreover, we propose to use this interpretative-aid tool to represent by a surface response the information about the quality of the representation derived by a multidimensional data analysis.
TL;DR: The aim in the case studies was to use the same neuro-fuzzy classifier for clustering, generating understandable rules also for clusters, by adding Monte-Carlo random data to the original data and performing the clustering task with the present classifier in the medical, chemical, and biological domain.
Abstract: Classification, clustering and rule generation are important tasks in multidimensional data analysis. The combination of clustering or classification with rule generation gives an explanation for the achieved results. Especially in life science applications experts are interested in explanations to understand the underlying data. The usage of supervised neuro-fuzzy systems is a suitable approach for this combined task. Classification labels are not always available for the data when considering new problem areas in life science. Since we had already used a supervised neuro-fuzzy system for some applications, our aim in the case studies was to use the same neuro-fuzzy classifier for clustering, generating understandable rules also for clusters. To do so, we added Monte-Carlo random data to the original data and performed the clustering task with the present classifier in the medical, chemical, and biological domain.
TL;DR: In e-commerce the multidimensional data analysis for OLAP (on-line analytical processing) based on the web data needs integrating various data sources such as XML (extensible markup language) data and relational data on the conceptual level.
Abstract: In e-commerce the multidimensional data analysis for OLAP (on-line analytical processing) based on the web data needs integrating various data sources such as XML (extensible markup language) data and relational data on the conceptual level. A conceptual data description approach of multidimensional data model was presented in order to conduct multidimensional data analysis of OLAP for multiple subjects. The UML (unified modeling language) galaxy diagram, describing the multidimensional structure of the conceptual integrating data at the conceptual level, was constructed. The approach was illustrated using a case of a 2-roots UML galaxy diagram that takes one retailer and several suppliers of PC products into consideration.
TL;DR: In this paper, the authors propose a dynamic zoning technique called DMD-Zoning that can be applied to a variety of multidimensional access methods and that can fully utilize the zoning characteristics of hard disks for busy multi-user database systems.
Abstract: In emerging database applications that deal with large sets of multi-dimensional data, the performance of the query system significantly depends on the performance of its access methods and the underlying disk system. In recent years, hard disks are manufactured with multiple physical zones, where seek times and data transfer rates vary significantly across the zones. However, there is a marked lack of investigation on how to optimize multidimensional access methods given a zoned disk model. The paper proposes a novel dynamic zoning technique called DMD-Zoning that can be applied to a variety of multidimensional access methods and that can fully utilize zoning characteristics of hard disks for busy multi-user database systems.
TL;DR: In this paper, the authors propose a method that overcomes the insufficiencies of a single self-organization feature map (SOFM) network and can process noisy, multi-dimensional data well at multiple resolutions.
Abstract: This paper analyses some important characteristics of self-organization map network. Based on this analysis, we propose a method that can overcome the insufficiencies of single self-organization feature map (SOFM) network. The implementation detail of our proposed self-organizing feature map network algorithm is also discussed. Our proposed algorithm has a number of advantages. It can overcome the insufficiencies identified in other similar clustering algorithms. It is able to find clusters in different shapes and is insensitive to input data sequence. It can process noisy and multi-dimensional data well in multi-resolutions. Furthermore the proposed clustering method can find the dense or sparse areas with different data distributions. It will be convenient to discover the distribution mode and interesting relationship among data. We have conducted numerous experiments in order to justify this novel idea of spatial data clustering. It has been shown that the proposed method can be applied to spatial clustering well.
TL;DR: The main goal was to test the efficiency of the method in the context of real-life medical data, and to present a hybrid method of multidimensional data analysis.
Abstract: Visualization techniques are especially relevant to multidimensional data, the analysis of which is limited by human perception abilities. The paper presents a hybrid method of multidimensional data analysis. The main goal was to test the efficiency of the method in the context of real-life medical data. A short survey of issues and techniques concerned with data visualization is also included.
TL;DR: This paper proposes a fundamentally new class of measures, compressible measures, in order to support efficient computation of the statistical models, and substantially reduces the memory usage and the overall response time for statistical analysis of multidimensional data.
Abstract: As OLAP engines are widely used to support multidimensional data analysis, it is desirable to support in data cubes advanced statistical measures, such as regression and filtering, in addition to the traditional simple measures such as count and average. Such new measures allow users to model, smooth, and predict the trends and patterns of data. Existing algorithms for simple distributive and algebraic measures are inadequate for efficient computation of statistical measures in a multidimensional space. In this paper, we propose a fundamentally new class of measures, compressible measures, in order to support efficient computation of the statistical models. For compressible measures, we compress each cell into an auxiliary matrix with a size independent of the number of tuples. We can then compute the statistical measures for any data cell from the compressed data of the lower-level cells without accessing the raw data. Time- and space-efficient lossless aggregation formulae are derived for regression and filtering measures. Our analytical and experimental studies show that the resulting system, regression cube, substantially reduces the memory usage and the overall response time for statistical analysis of multidimensional data.
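The idea of compressing each cell into a fixed-size summary from which a regression can be recovered without the raw data can be illustrated with the standard sufficient statistics of simple linear regression. This is an analogy only: the abstract does not give the paper's auxiliary matrix or aggregation formulae, so the class name `RegressionSummary` and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RegressionSummary:
    """Fixed-size summary of a cell for fitting y = a + b * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    @classmethod
    def from_points(cls, points):
        """Compress raw (x, y) tuples into five running sums."""
        s = cls()
        for x, y in points:
            s.n += 1
            s.sx += x
            s.sy += y
            s.sxx += x * x
            s.sxy += x * y
        return s

    def merge(self, other):
        """Lossless roll-up: the summary of a union is the sum of summaries."""
        return RegressionSummary(
            self.n + other.n, self.sx + other.sx, self.sy + other.sy,
            self.sxx + other.sxx, self.sxy + other.sxy,
        )

    def fit(self):
        """Least-squares intercept and slope computed from the sums alone."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("x values are constant or the cell is empty")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Two lower-level cells whose points all lie on y = 1 + 2x.
cell_a = RegressionSummary.from_points([(0, 1), (1, 3)])
cell_b = RegressionSummary.from_points([(2, 5), (3, 7)])
intercept, slope = cell_a.merge(cell_b).fit()
```

The summary has five numbers regardless of how many tuples a cell holds, which is the property that lets a higher-level cell be computed from its children.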
TL;DR: This paper proposes the first fast solution to the problem of tracking wavelet representations of one-dimensional and multi-dimensional data streams, based on a novel stream synopsis, the Group-Count Sketch (GCS); imposing a hierarchical structure of groups over the data and applying the GCS recovers the most important wavelet coefficients with guaranteed accuracy.
Abstract: Recent years have seen growing interest in effective algorithms for summarizing and querying massive, high-speed data streams. Randomized sketch synopses provide accurate approximations for general-purpose summaries of the streaming data distribution (e.g., wavelets). The focus of existing work has typically been on minimizing space requirements of the maintained synopsis — however, to effectively support high-speed data-stream analysis, a crucial practical requirement is to also optimize: (1) the update time for incorporating a streaming data element in the sketch, and (2) the query time for producing an approximate summary (e.g., the top wavelet coefficients) from the sketch. Such time costs must be small enough to cope with rapid stream-arrival rates and the real-time querying requirements of typical streaming applications (e.g., ISP network monitoring). With cheap and plentiful memory, space is often only a secondary concern after query/update time costs.
In this paper, we propose the first fast solution to the problem of tracking wavelet representations of one-dimensional and multi-dimensional data streams, based on a novel stream synopsis, the Group-Count Sketch (GCS). By imposing a hierarchical structure of groups over the data and applying the GCS, our algorithms can quickly recover the most important wavelet coefficients with guaranteed accuracy. A tradeoff between query time and update time is established, by varying the hierarchical structure of groups, allowing the right balance to be found for a specific data stream. Experimental analysis confirms this tradeoff, and shows that all our methods significantly outperform previously known methods in terms of both update time and query time, while maintaining a high level of accuracy.
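For readers unfamiliar with the wavelet representation being tracked, the following sketch computes an unnormalized Haar decomposition of a small, fully stored array. It is an offline illustration only and does not implement the Group-Count Sketch or any streaming synopsis; the function name `haar_decompose` and the choice of unnormalized averaging and differencing are assumptions.

```python
def haar_decompose(data):
    """Unnormalized Haar wavelet decomposition of a power-of-two-length list.

    Returns [overall average, coarsest detail, ..., finest details].
    Each detail is half the difference of a pair, each average half the sum.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    current = list(data)
    details_so_far = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        details = [(a - b) / 2 for a, b in pairs]
        current = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details_so_far = details + details_so_far
    return current + details_so_far


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
```

A summary would then keep only a few coefficients, which is why recovering the most important ones quickly matters in the streaming setting.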