TL;DR: A new statistical approach to inlier-based outlier detection (finding outliers in the test set based on a training set consisting only of inliers) is proposed, which uses the ratio of training and test data densities as the outlier score.
Abstract: We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
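The closed-form uLSIF solution mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Gaussian kernel model, the function names (`gaussian_kernel`, `ulsif_fit`, `inlier_score`) and the hand-picked values of `sigma` and `lam` are assumptions, and the cross-validation step the abstract describes for tuning them is omitted.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Kernel matrix K[i, l] = exp(-||X[i] - C[l]||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def ulsif_fit(x_train, x_test, sigma=1.0, lam=0.1, n_centers=50, seed=0):
    """Fit w(x) ~ p_train(x) / p_test(x) as a non-negative kernel expansion."""
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_train))
    # Kernel centers are drawn from the training (inlier) samples.
    centers = x_train[rng.choice(len(x_train), n_centers, replace=False)]
    K_test = gaussian_kernel(x_test, centers, sigma)    # denominator samples
    K_train = gaussian_kernel(x_train, centers, sigma)  # numerator samples
    H = K_test.T @ K_test / len(x_test)
    h = K_train.mean(axis=0)
    # Regularized least-squares solution, then clip negatives to zero.
    alpha = np.linalg.solve(H + lam * np.eye(n_centers), h)
    return centers, np.maximum(alpha, 0.0)

def inlier_score(x, centers, alpha, sigma=1.0):
    """Estimated density ratio; values near zero suggest outliers."""
    return gaussian_kernel(x, centers, sigma) @ alpha
```

Test points whose score is close to zero lie where the training density is small relative to the test density, so they are the outlier candidates.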
TL;DR: The findings show that: (i) multidimensional hubs do exist and their characterization yields interesting insights and (ii) it is possible to detect the most influential dimensions that cause the different hub behaviors.
TL;DR: A new cluster-adaptive distance bound based on separating hyperplane boundaries of Voronoi clusters is proposed to complement the cluster-based index; it enables efficient spatial filtering with a relatively small preprocessing storage overhead and is applicable to Euclidean and Mahalanobis similarity measures.
Abstract: We consider approaches for similarity search in correlated, high-dimensional data sets, which are derived within a clustering framework. We note that indexing by “vector approximation” (VA-File), which was proposed as a technique to combat the “Curse of Dimensionality,” employs scalar quantization, and hence necessarily ignores dependencies across dimensions, which represents a source of suboptimality. Clustering, on the other hand, exploits interdimensional correlations and is thus a more compact representation of the data set. However, existing methods to prune irrelevant clusters are based on bounding hyperspheres and/or bounding rectangles, whose lack of tightness compromises their efficiency in exact nearest neighbor search. We propose a new cluster-adaptive distance bound based on separating hyperplane boundaries of Voronoi clusters to complement our cluster-based index. This bound enables efficient spatial filtering with a relatively small preprocessing storage overhead, and is applicable to Euclidean and Mahalanobis similarity measures. Experiments in exact nearest-neighbor set retrieval, conducted on real data sets, show that our indexing method is scalable with data set size and data dimensionality and outperforms several recently proposed indexes. Relative to the VA-File, over a wide range of quantization resolutions, it is able to reduce random IO accesses, given (roughly) the same amount of sequential IO operations, by factors reaching 100X and more.
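The idea of bounding query-to-cluster distance with separating hyperplanes can be illustrated with a simplified sketch. It assumes Euclidean distance and clusters that are exactly the Voronoi cells of their centroids, and it uses only the perpendicular bisectors between centroids; the paper's own bound may be tighter and differ in detail. The function names are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_lower_bound(q, centroids, j):
    """Lower bound on the distance from q to any point in Voronoi cell j.

    For every other centroid c_i, the bisecting hyperplane between c_i and
    c_j separates cell j from q whenever q is closer to c_i. The distance
    from q to that hyperplane is
        (||q - c_j||^2 - ||q - c_i||^2) / (2 ||c_i - c_j||),
    and the largest such value over all i is a valid lower bound.
    """
    dqj = sq_dist(q, centroids[j])
    best = 0.0
    for i, c in enumerate(centroids):
        if i == j:
            continue
        gap = math.sqrt(sq_dist(c, centroids[j]))
        if gap == 0.0:  # coincident centroids define no hyperplane
            continue
        best = max(best, (dqj - sq_dist(q, c)) / (2.0 * gap))
    return best
```

In an exact k-nearest-neighbor search, a cluster whose lower bound exceeds the current k-th best distance can be skipped without reading its members.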
TL;DR: Exploring the foundations and recent breakthroughs in the field, Statistical Learning and Data Science demonstrates how data analysis can improve personal and collective health and the well-being of our social, business, and physical environments.
Abstract: Data analysis is changing fast. Driven by a vast range of application domains and affordable tools, machine learning has become mainstream. Unsupervised data analysis, including cluster analysis, factor analysis, and low dimensionality mapping methods continually being updated, have reached new heights of achievement in the incredibly rich data world that we inhabit. Statistical Learning and Data Science is a work of reference in the rapidly evolving context of converging methodologies. It gathers contributions from some of the foundational thinkers in the different fields of data analysis to the major theoretical results in the domain. On the methodological front, the volume includes conformal prediction and frameworks for assessing confidence in outputs, together with attendant risk. It illustrates a wide range of applications, including semantics, credit risk, energy production, genomics, and ecology. The book also addresses issues of origin and evolutions in the unsupervised data analysis arena, and presents some approaches for time series, symbolic data, and functional data. Over the history of multidimensional data analysis, more and more complex data have become available for processing. Supervised machine learning, semi-supervised analysis approaches, and unsupervised data analysis, provide great capability for addressing the digital data deluge. Exploring the foundations and recent breakthroughs in the field, Statistical Learning and Data Science demonstrates how data analysis can improve personal and collective health and the well-being of our social, business, and physical environments.
TL;DR: In this article, the authors address the challenge of imputing the high-dimensional jointly distributed values of marker attributes based on overlapping marginal observations and introduce an alternative approach based on nearest neighbor imputation restricted to a cell's subpopulation.
TL;DR: This work provides an introduction to the design of video quality metrics by using data analysis methods, which are different from traditional approaches, and uses multidimensional data analysis, an extension of well-established data analysis techniques, to better exploit higher-dimensional data.
Abstract: Quality of experience (QoE) is becoming increasingly important in signal processing applications. In taking inspiration from chemometrics, we provide an introduction to the design of video quality metrics by using data analysis methods, which are different from traditional approaches. These methods do not necessitate a complete understanding of the human visual system (HVS). We use multidimensional data analysis, an extension of well-established data analysis techniques, allowing us to better exploit higher-dimensional data. In the case of video quality metrics, it enables us to exploit the temporal properties of video more properly; the complete three-dimensional structure of the video cube is taken into account in metrics' design. Starting with the well-known principal component analysis and an introduction to the notation of multiway arrays, we then present their multidimensional extensions, delivering better quality prediction results. Although we focus on video quality, the presented design principles can easily be adapted to other modalities and to even higher dimensional data sets as well.
TL;DR: This paper presents a theoretical framework called DWEVOLVE to support data warehouse evolution, which enhances the functionality of a previously designed framework by taking into account the requirements specified by the users.
Abstract: Data warehouses integrate information from numerous data sources under a unified schema and format to provide effective results from multidimensional data analysis in order to facilitate reporting and trend analysis. These information sources are dynamic in nature and keep changing owing to the autonomous nature of the transactions being carried out in the organization, along with the complexity involved in gathering requirements from the users. Requirements elicitation and collection is difficult to perform because user needs keep changing. As a consequence, the data warehouse must evolve so that it improves data quality by easily incorporating changes in the requirements as well as in the source schema. In this paper we present a theoretical framework called DWEVOLVE to support data warehouse evolution. The proposed framework enhances the functionality of a previously designed framework by taking into account the requirements specified by the users. Provisions have also been made to define and generate customized reports according to user needs.
TL;DR: In this paper, the authors present a framework for the creation, editing, manipulation and use of model-based, multidimensional analysis services (MAS) cubes, where a user can create a new MAS cube by targeting a set of facts and adding dimensions to the facts.
Abstract: Systems, methods and computer program products that provide a framework for the creation, editing, manipulation and use of model-based, multidimensional analysis services (MAS) cubes are disclosed. A method of generating a model-based MAS cube comprises creating a data source comprising a data warehouse in the memory via the processor, creating a data source view providing a dimension, a fact and an outrigger from the created data source, and creating the MAS cube comprising at least one measure group. A key performance indicator (KPI) may be calculated from the MAS cube as a scorecard of a display associated with the processor. A user of the model-based MAS cube may create a new cube by targeting a set of facts and adding dimensions to the facts.
TL;DR: To integrate and make full use of medical data effectively, a data warehouse modeling method is proposed for the hospital information system and can also be employed for a distributed-hospital medical service system.
Abstract: Background: Existing hospital information systems with simple statistical functions cannot meet current management needs. It is well known that hospital resources are distributed with private property rights among hospitals, such as in the case of the regional coordination of medical services. In this study, to integrate and make full use of medical data effectively, we propose a data warehouse modeling method for the hospital information system. The method can also be employed for a distributed-hospital medical service system. Methods: To ensure that hospital information supports the diverse needs of health care, the framework of the hospital information system has three layers: datacenter layer, system-function layer, and user-interface layer. This paper discusses the role of a data warehouse management system in handling hospital information from the establishment of the data theme to the design of a data model to the establishment of a data warehouse. Online analytical processing tools assist user-friendly multidimensional analysis from a number of different angles to extract the required data and information. Results: Use of the data warehouse improves online analytical processing and mitigates deficiencies in the decision support system. The hospital information system based on a data warehouse effectively employs statistical analysis and data mining technology to handle massive quantities of historical data, and summarizes from clinical and hospital information for decision making. Conclusions: This paper proposes the use of a data warehouse for a hospital information system, specifically a data warehouse for the theme of hospital information to determine latitude, modeling and so on. The processing of patient information is given as an example that demonstrates the usefulness of this method in the case of hospital information management. Data warehouse technology is an evolving technology, and more and more decision support information extracted by data mining and with decision-making technology is required for further research.
TL;DR: What partial order is needed for and how it is used to solve typical data analysis tasks like logical navigation, multidimensional analysis and reasoning about data are discussed.
Abstract: The concept-oriented data model (COM) is an emerging approach to data modeling which is based on three novel principles: duality, inclusion and order. These three structural principles provide a basis for modeling domain-specific identities, object hierarchies and data semantics. In this paper these core principles of COM are presented from the point of view of object data models (ODM). We describe the main data modeling construct, called concept, as well as two relations in which it participates: inclusion and partial order. Concepts generalize conventional classes by extending them with identity class. Inclusion relation generalizes inheritance by making objects elements of a hierarchy. We discuss what partial order is needed for and how it is used to solve typical data analysis tasks like logical navigation, multidimensional analysis and reasoning about data.
TL;DR: This paper proposes a conceptual ETL framework, called UCLEAN, for an object-oriented data warehouse design; the framework takes the users' requirements into account.
Abstract: A data warehouse is used to provide effective results from multidimensional data analysis. The accuracy and correctness of these results depend on the quality of the data. To improve data quality, data must be properly extracted, transformed and loaded into the data warehouse. This ETL process is the key to the success of a data warehouse. In this paper we propose a conceptual ETL framework, called UCLEAN, for an object-oriented data warehouse design. This framework takes the users' requirements into account. The data is extracted from different UML sources and is converted into a multidimensional model. It is then cleaned and loaded into the data warehouse. We validate the effectiveness of the framework through a case study.
TL;DR: Applying multidimensional data analysis technology to compile statistics on and analyze the basic Environmental Impact Assessment (EIA) data of the electric industry yields analysis results on the distribution, investment, resource consumption, pollutants, and environmental impact of power construction projects in different periods and different regions.
Abstract: Multidimensional data analysis can observe and process data from several angles, obtaining useful information for management decision-making departments and providing effective support by turning business data into management data. Based on SQL Server 2008, we adopt multidimensional data analysis technology to compile statistics on and analyze the basic Environmental Impact Assessment (EIA) data of the electric industry. This yields analysis results on the distribution, investment, resource consumption, pollutants, and environmental impact of power construction projects in different periods and different regions, so that users can analyze the construction and development of the electric industry from multiple angles, and it provides effective, scientific data support for environmental management and decision-making.
TL;DR: A fast and computationally simple subspace-based algorithm for mobile positioning with the use of time-of-arrival (TOA) measurements of three base stations (BSs) is derived and analysed.
Abstract: The problem of locating and tracking a mobile station (MS) in which real-time computation is needed has received considerable attention. In this letter, a fast and computationally simple subspace-based algorithm for mobile positioning with the use of time-of-arrival (TOA) measurements of three base stations (BSs) is derived and analysed. Since a Lagrange multiplier is introduced to avoid eigendecomposition of the multidimensional similarity matrix, the proposed algorithm offers very competitive performance at low computational complexity.
TL;DR: This contribution covers both conceptual foundations and applications of such methods for pattern recognition and analysis to a wide scope of radiological data sets, such as structural and functional segmentation in Magnetic Resonance Imaging (MRI), ranging from functional MRI for human brain mapping to the monitoring of disease progression in multiple sclerosis by automatic lesion segmentation.
Abstract: Technical innovations in radiology, such as advanced cross-sectional imaging methods, have opened up new vistas for the exploration of structure and function of the human body enabling both high spatial and temporal resolution. However, these techniques have led to vast amounts of data whose precise and reliable visual analysis by radiologists requires a considerable amount of human intervention and expertise, thus resulting in a cost factor of substantial economic relevance. Hence, the computer-assisted analysis of biomedical image data has moved into the focus of interest as an issue of high priority research efforts. In this context, innovative approaches to exploratory analysis of huge complex spatio-temporal patterns play a key role to improve computer-assisted signal and image processing in radiology. Examples of such approaches are various unsupervised vector quantization methods or supervised function approximation techniques, such as Generalized Radial-Basis-Functions (GRBF) neural networks. Recent developments motivated by concepts of computational intelligence are the 'Deformable Feature Map' (DM) as an algorithm for self-organized model adaptation, the 'Mutual Connectivity Analysis' (MCA) as an instrument for the analysis of large time-series ensembles and the 'Exploratory Observation Machine' (XOM) as a novel general framework for learning by self-organization, three methods that the author has invented and applied to biomedical real-world applications.
This contribution covers both conceptual foundations and applications of such methods for pattern recognition and analysis to a wide scope of radiological data sets, such as structural and functional segmentation in Magnetic Resonance Imaging (MRI), ranging from functional MRI for human brain mapping to the monitoring of disease progression in multiple sclerosis by automatic lesion segmentation, as well as novel approaches to image time-series analysis in MRI mammography for breast cancer diagnosis. Current projects related to the modeling of speech production and to genome-wide expression analysis of microarray data in bioinformatics confirm the broad applicability of the presented methods.
TL;DR: This work provides an introduction to the design of video quality metrics by using data analysis methods, which are different from traditional approaches, and uses multidimensional data analysis, an extension of well-established data analysis techniques, to better exploit higher-dimensional data.
Abstract: Quality of Experience is becoming increasingly important in signal processing applications. In taking inspiration from chemometrics, we provide an introduction to the design of video quality metrics by using data analysis methods, which are different from traditional approaches. These methods do not necessitate a complete understanding of the human visual system. We use multidimensional data analysis, an extension of well-established data analysis techniques, allowing us to better exploit higher-dimensional data. In the case of video quality metrics, it enables us to exploit the temporal properties of video more properly; the complete three-dimensional structure of the video cube is taken into account in the metrics' design. Starting with the well-known principal component analysis and an introduction to the notation of multi-way arrays, we then present their multidimensional extensions, delivering better quality prediction results. Although we focus on video quality, the presented design principles can easily be adapted to other modalities and to even higher-dimensional datasets as well.
TL;DR: From the perspective of MDA, a multidimensional data model is created and implemented using the J2EE architecture and Flash RIA technology, which provides users with good visual modeling and data display interfaces.
Abstract: The existing management information systems in universities are often designed for specific management applications, and there are still many problems and shortcomings in data analysis and decision support. This paper analyzes the main problems, then, from the perspective of MDA, creates a multidimensional data model and builds a multidimensional data analysis system for educational data warehousing and data mining. This system is implemented using the J2EE architecture and Flash RIA technology, which provides users with good visual modeling and data display interfaces.
TL;DR: A novel method for selecting and materializing views based on OLAP signatures and GRASP (Greedy Randomized Adaptive Search) is proposed; GRASP yields a hybrid method that traverses the solution space in a comprehensive manner, as purely random approaches do.
Abstract: Although the materialization of views reduces the execution time of OLAP queries, the materialization of a large number of views may exceed computer storage thresholds. Thus, given a certain storage cost threshold, there is a need for selecting the best views to be materialized, i.e. views that fit the storage requirements and provide the lowest response time to process OLAP queries. Several solutions have been proposed in the literature to solve this problem. However, most studies have adopted strictly greedy or purely random approaches. Also, most of them do not encompass the entire cycle of execution of multidimensional analysis, or do not specify and implement the whole cycle of multidimensional query execution. In this paper, we address these issues by proposing a novel method for selecting and materializing views based on OLAP signatures and GRASP (Greedy Randomized Adaptive Search). On the one hand, using OLAP signatures and their relationships with descriptions of the data cube, we are able to identify which views should be materialized for being more beneficial to the user query processing. On the other hand, using GRASP allows us to define a hybrid method, which traverses the solution space in a comprehensive manner as performed in purely random approaches, while examining only the regions of the search space with a great concentration of good solutions generated by a greedy approach. GRASP was compared to other VSP algorithms, namely Pick by Size (PBS) and Ant Colony Optimization (ACO), and performance tests indicated that compared to PBS, the proposed method obtained a time reduction of about 20.4% in query processing. In addition, GRASP was more scalable than PBS, since it selected and materialized a smaller set of views, even when there was a wide range of possible views to be chosen. Also, GRASP obtained nearly the same query runtime as ACO (i.e. a small performance loss of about 2.84% was obtained by GRASP), but a shorter time for the selection of views than the ACO algorithm (i.e. a gain in processing time of about 77% was produced by GRASP).
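The general GRASP pattern applied to view selection under a storage budget can be sketched as below. This is an illustrative toy, not the paper's method: OLAP signatures are not modelled, each view is assumed to have a fixed additive benefit and a positive size (a hypothetical cost model), and all function names are made up for the example.

```python
import random

def total(views, chosen, idx):
    """Sum benefit (idx=0) or size (idx=1) over the chosen view names."""
    return sum(views[v][idx] for v in chosen)

def construct(views, budget, alpha, rng):
    """Greedy randomized construction; views maps name -> (benefit, size)."""
    chosen, free = [], budget
    while True:
        fits = [v for v in views if v not in chosen and views[v][1] <= free]
        if not fits:
            return chosen
        ratio = {v: views[v][0] / views[v][1] for v in fits}
        hi, lo = max(ratio.values()), min(ratio.values())
        # Restricted candidate list: views close enough to the greedy best.
        rcl = [v for v in fits if ratio[v] >= hi - alpha * (hi - lo)]
        pick = rng.choice(rcl)
        chosen.append(pick)
        free -= views[pick][1]

def local_search(views, budget, chosen):
    """Swap one chosen view for an unchosen one while the benefit improves."""
    improved = True
    while improved:
        improved = False
        for out in list(chosen):
            for new in views:
                if new in chosen:
                    continue
                cand = [v for v in chosen if v != out] + [new]
                if (total(views, cand, 1) <= budget
                        and total(views, cand, 0) > total(views, chosen, 0)):
                    chosen, improved = cand, True
                    break
            if improved:
                break
    return chosen

def grasp(views, budget, iterations=50, alpha=0.3, seed=0):
    """Repeat construction plus local search and keep the best solution."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        sol = local_search(views, budget, construct(views, budget, alpha, rng))
        if total(views, sol, 0) > total(views, best, 0):
            best = sol
    return best
```

With `alpha=0` the construction is purely greedy and with `alpha=1` it is purely random among the views that fit, which is the hybrid behaviour the abstract alludes to.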
TL;DR: In this paper, a dimension importance level calculation part 102 extracts a plurality of combinations between pieces of m-dimensional (m is an integer ≥ 1 and COPYRIGHT: (C)2011,JPO&INPIT
Abstract: PROBLEM TO BE SOLVED: To selectively present an important dimension by defining an importance level of a dimension, in general multidimensional data analysis. SOLUTION: A dimension importance level calculation part 102 extracts a plurality of combinations between pieces of m-dimensional (m is an integer ≥1 and COPYRIGHT: (C)2011,JPO&INPIT
TL;DR: This paper presents advanced techniques for the analysis and exploitation of data stored in a multidimensional database.
Abstract: Computerized accounting systems have grown more complex in recent years because of the competitive economic environment. With the help of data analysis solutions such as OLAP and data mining, however, they can perform multidimensional data analysis, detect fraud, and discover knowledge hidden in data, ensuring that such information is useful for decision making within the organization. The literature offers many definitions of data mining, but all boil down to the same idea: a process that extracts new information from large data collections, information that would be very difficult to obtain without the aid of data mining tools. Information obtained through data mining has the advantage that it not only answers the question of what is happening but also explains why certain things are happening. In this paper we present advanced techniques for the analysis and exploitation of data stored in a multidimensional database.
TL;DR: A general model for supply chain management based on the first principles of linear algebra, in particular on tensorial calculus is proposed, capable of exploiting recent parallel and distributed technologies, and subdividing tensor objects into sub-blocks, and processing them independently.
Abstract: In current trends of consumer products market, there is a growing significance of the role of retailers in the governance of supply chains. RFID is a promising infrastructure-less technology, allowing to connect an object with its virtual counterpart, i.e., its representation within information systems. However, the amount of RFID data in supply chain management is vast, posing significant challenges for attaining acceptable performance on their analysis. Current approaches provide hard-coded solutions, with high consumption of resources; moreover, these exhibit very limited flexibility dealing with multidimensional queries, at various levels of granularity and complexity. In this paper we propose a general model for supply chain management based on the first principles of linear algebra, in particular on tensorial calculus. Leveraging our abstract algebraic framework, our technique allows both quick decentralized on-line processing, and centralized off-line massive business logic analysis, according to needs and requirements of supply chain actors. Experimental results show that our approach, utilizing recent linear algebra techniques can process analysis efficiently, when compared to recent approaches. In particular, we are able to carry out the required computations even in high memory constrained environments, such as on mobile devices. Moreover, when dealing with massive amounts of data, we are capable of exploiting recent parallel and distributed technologies, subdividing our tensor objects into sub-blocks, and processing them independently.
TL;DR: The idea is to approximate histogram data using B-spline functions in order to synthesize the information within the data through some characteristic function parameters, which become the new data to be analyzed with methodologies of multidimensional data analysis.
Abstract: The aim of this work is to propose a new approach for dealing with histogram data in the symbolic data analysis framework. The idea is to approximate histogram data using B-spline functions in order to synthesize the information within the data through some characteristic function parameters. These parameters will be the new data that can subsequently be analyzed with methodologies of multidimensional data analysis.
TL;DR: Computerized accounting systems have grown in complexity through data analysis solutions such as OLAP and data mining, which support multidimensional analysis of financial-accounting data: potential fraud can be detected, information hidden in the data can be revealed, and trends for certain indicators can be established, thereby providing useful information for a company's decision making.
Abstract: At present, one can say that the sound running of a company largely depends on the quantity and quality of the information it relies on when making decisions. The information needed to underpin decisions is obtained from a high-performing information system that presents data quickly, concisely and accurately, and that also provides the opportunity for complex analyses and predictions. In such circumstances, computerized accounting systems, too, have grown in complexity through data analysis solutions such as OLAP and data mining, which support multidimensional analysis of financial-accounting data: potential fraud can be detected, information hidden in the data can be revealed, and trends for certain indicators can be established, thereby providing useful information for a company's decision making.
TL;DR: It is concluded that data warehousing and data mining are essentials for an effective decision support system for handling uncertain climate data consolidated with SAS application system.
TL;DR: The use of set theory to organize the data of a relational database into the required structure is described, with the database of an informational questionnaire system taken as the input data.
Abstract: Summary. This investigation describes the use of set theory to organize the data of a relational database into the required structure. The database of an informational questionnaire system is taken as the input data. The data is organized for subsequent transfer into a multidimensional data analysis system. Key words: information questionnaire system; relational database; multidimensional analysis (MDA); set theory. Introduction. Fast and flexible data analysis is one of the main requirements when constructing a business-analytics architecture. All the transactions of the detail layer are held in the relational data storage. Using this data, users would like to obtain the final information, add their own calculations, and analyze the data with the help of a mechanism for creating ad hoc queries. Programs for multidimensional data analysis are products that stand between the database and the electronic worksheet (in terms of their operating principle) and form a special product class of their own [1]. Historically they evolved from electronic worksheets, but nowadays they more closely resemble databases. The most important feature of an electronic worksheet is the ability to set up links between cells with the help of formulas. The most important feature of programs for multidimensional data analysis is the ability to easily form a multidimensional data cube from the received data and to modify its on-screen view. Nowadays there are programs of three types: 1) add-ons to popular electronic worksheet systems; 2) add-ons to popular database management systems; 3) standalone products that are capable of importing data from different sources.
TL;DR: This talk presents various issues and solutions on scalable mining and analysis of information networks, and illustrates how to apply network analysis technique to solve classical frequent item-set mining in a more efficient top-down fashion.
Abstract: With the ubiquity of information networks and their broad applications, there have been numerous studies on the construction, online analytical processing, and mining of information networks in multiple disciplines, including social network analysis, World-Wide Web, database systems, data mining, machine learning, and networked communication and information systems. Moreover, with a great demand of research in this direction, there is a need to understand methods for analysis of information networks from multiple disciplines. In this talk, we will present various issues and solutions on scalable mining and analysis of information networks. These include data integration, data cleaning and data validation in information networks, summarization, OLAP and multidimensional analysis in information networks. Finally, we illustrate how to apply network analysis technique to solve classical frequent item-set mining in a more efficient top-down fashion.
TL;DR: This paper designs a data warehouse model of agricultural production and builds an effective and viable agricultural production data warehouse by using several key technologies: multidimensional data analysis, cubes, materialized view selection, and materialized view maintenance.
Abstract: This paper designs a data warehouse model of agricultural production. It builds an effective and viable agricultural production data warehouse by using several key technologies: multidimensional data analysis, cubes, materialized view selection, and materialized view maintenance. Finally, it provides a solution to the problems of effectively managing and maintaining high-capacity heterogeneous data.
TL;DR: This work demonstrates the most efficient and accurate initialization strategy for MDS algorithms, reducing considerably computational load and rendering MDS methodology much more useful in the analysis of high-dimensional data such as functional genomics datasets.
Abstract: Motivation: Multidimensional scaling (MDS) is a well-known multivariate statistical analysis method used for dimensionality reduction and visualization of similarities and dissimilarities in multidimensional data. The advantage of MDS with respect to singular value decomposition (SVD) based methods such as principal component analysis is its superior fidelity in representing the distances between different instances, especially for high-dimensional geometric objects. Here, we investigate the importance of the choice of initial conditions for MDS, and show that SVD is the best choice to initialize MDS. Furthermore, we demonstrate that the use of the first principal components of SVD to initialize the MDS algorithm is more efficient than an iteration through all the principal components. Adding stochasticity to the molecular dynamics simulations typically used for MDS of large datasets, contrary to previous suggestions, likewise does not increase accuracy. Finally, we introduce a k nearest neighbor method to analyze the local structure of the geometric objects and use it to control the quality of the dimensionality reduction.
Results: We demonstrate what is, to our knowledge, the most efficient and accurate initialization strategy for MDS algorithms, which considerably reduces the computational load. SVD-based initialization renders MDS methodology much more useful in the analysis of high-dimensional data such as functional genomics datasets.
Contact: arndt@ihes.fr
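The abstract above mentions a k nearest neighbor check on local structure as a quality control for dimensionality reduction, without giving its details. One common way to realize such a check, shown here purely as an assumed illustration and not as the authors' exact procedure, is to measure what fraction of each point's k nearest neighbors in the original space are still among its k nearest neighbors in the embedding. The function names `knn_indices` and `neighborhood_preservation` are hypothetical.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors.

    Uses brute-force Euclidean distances; ties are broken by index so the
    result is deterministic.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def neighborhood_preservation(high, low, k):
    """Fraction of k-NN relations in `high` that survive in the embedding `low`.

    Assumed quality score: 1.0 means every local neighborhood is preserved,
    0.0 means none of the original neighbors remain neighbors.
    """
    high_nn = knn_indices(high, k)
    low_nn = knn_indices(low, k)
    shared = sum(len(a & b) for a, b in zip(high_nn, low_nn))
    return shared / (k * len(high))

# Toy data: four 3-D points that vary only along the first axis.
high = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
good_embedding = [(0,), (1,), (5,), (6,)]   # keeps the first coordinate
bad_embedding = [(0,), (5,), (1,), (6,)]    # scrambles the neighborhoods
```

On this toy data the good embedding scores 1.0 and the scrambled one scores 0.0 for k = 1. The brute-force search is quadratic in the number of points, so a spatial index would be needed for large datasets.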
TL;DR: This work proposes two categories of novel anonymization methods based on approximate nearest-neighbor (NN) search in high-dimensional spaces, which is efficiently performed through locality-sensitive hashing (LSH) and two data transformations that capture the correlation in the underlying data: reduction to a band matrix and Gray encoding-based sorting.
Abstract: Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is to enforce privacy-preserving paradigms, such as k-anonymity and l-diversity, while minimizing the information loss incurred in the anonymizing process (i.e., maximizing data utility). Existing techniques work well for fixed-schema data with low dimensionality. Nevertheless, certain applications require privacy-preserving publishing of transactional data (or basket data), which involve hundreds or even thousands of dimensions, rendering existing methods unusable. We propose two categories of novel anonymization methods for sparse high-dimensional data. The first category is based on approximate nearest-neighbor (NN) search in high-dimensional spaces, which is efficiently performed through locality-sensitive hashing (LSH). In the second category, we propose two data transformations that capture the correlation in the underlying data: 1) reduction to a band matrix and 2) Gray encoding-based sorting. These representations facilitate the formation of anonymized groups with low information loss, through an efficient linear-time heuristic. We show experimentally, using real-life data sets, that all our methods clearly outperform the existing state of the art. Among the proposed techniques, NN-search yields superior data utility compared to the band matrix transformation, but incurs higher computational overhead. The data transformation based on Gray code sorting performs best in terms of both data utility and execution time.
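The intuition behind Gray encoding-based sorting can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: each transaction is assumed to be a fixed-length 0/1 tuple over the item universe, the sort key is the position of that bit pattern in the reflected binary Gray code sequence, and groups are formed by cutting the sorted list into consecutive runs of at least k transactions. The names `gray_rank` and `gray_sorted_groups` are hypothetical.

```python
def gray_rank(bits):
    """Position of a bit pattern in the reflected binary Gray code sequence.

    Decodes the pattern (most significant bit first) from Gray code to an
    ordinary integer: each binary bit is the XOR of all Gray bits so far.
    """
    rank = 0
    running_xor = 0
    for bit in bits:
        running_xor ^= bit
        rank = (rank << 1) | running_xor
    return rank

def gray_sorted_groups(transactions, k):
    """Sort transactions by Gray rank and cut them into groups of size >= k.

    Adjacent codewords in the Gray sequence differ in one bit, so
    transactions that end up next to each other tend to share many items.
    A short trailing group is merged into the previous one.
    """
    ordered = sorted(transactions, key=gray_rank)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return groups

# Toy basket data over three items (invented for this sketch).
transactions = [(1, 0, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
groups = gray_sorted_groups(transactions, 2)
```

On this toy input each resulting group pairs two transactions that differ in a single item, which is the property that keeps information loss low when a group is generalized.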