TL;DR: This paper proposes a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results, and enables annotating with unlimited vocabulary and is highly scalable and robust to outliers.
Abstract: Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this paper, we propose a novel attempt at model-free image annotation, which is a data-driven approach that annotates images by mining their search results. Some 2.4 million images with their surrounding text are collected from a few photo forums to support this approach. The entire process is formulated in a divide-and-conquer framework where a query keyword is provided along with the uncaptioned image to improve both the effectiveness and efficiency. This is helpful when the collected data set is not dense everywhere. In this sense, our approach contains three steps: 1) the search process to discover visually and semantically similar search results, 2) the mining process to identify salient terms from textual descriptions of the search results, and 3) the annotation rejection process to filter out noisy terms yielded by Step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, the other is to implement it as a distributed system, of which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than 1 second. Since no training data set is required, our approach enables annotating with unlimited vocabulary and is highly scalable and robust to outliers. Experimental results on both real Web images and a benchmark image data set show the effectiveness and efficiency of the proposed algorithm. It is also worth noting that, although the entire approach is illustrated within the divide-and-conquer framework, a query keyword is not crucial to our current implementation. We provide experimental results to prove this.
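The abstract above mentions mapping high-dimensional visual features into hash codes but does not say which hashing scheme is used. The following is a minimal sketch of one common choice, sign-based random hyperplane hashing, with Hamming distance for comparison. The scheme, the feature dimension, the code length and the names `make_hasher` and `hamming_distance` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def make_hasher(dim, n_bits, seed=0):
    """Return a function mapping a dim-dimensional feature vector to n_bits bits.

    Assumption: random hyperplane hashing; the paper's actual hash is not specified.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, dim))  # one random hyperplane per bit

    def hash_features(features):
        # Each bit records on which side of a hyperplane the vector falls.
        return (planes @ np.asarray(features, dtype=float)) > 0

    return hash_features

def hamming_distance(code_a, code_b):
    """Number of bit positions in which two hash codes differ."""
    return int(np.count_nonzero(code_a != code_b))

# Hypothetical 64-dimensional visual feature vector hashed to 32 bits.
hasher = make_hasher(dim=64, n_bits=32)
query = np.random.default_rng(42).normal(size=64)
code_query = hasher(query)
```

Comparing short binary codes by Hamming distance is much cheaper than comparing the original feature vectors, which is the usual motivation for hashing in a real-time search step.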
TL;DR: A unified framework for generalized LDA is proposed, which elucidates the properties of various algorithms and their relationships, and shows that the matrix computations involved in LDA-based algorithms can be simplified so that the cross-validation procedure for model selection can be performed efficiently.
Abstract: High-dimensional data are common in many domains, and dimensionality reduction is the key to cope with the curse-of-dimensionality. Linear discriminant analysis (LDA) is a well-known method for supervised dimensionality reduction. When dealing with high-dimensional and low sample size data, classical LDA suffers from the singularity problem. Over the years, many algorithms have been developed to overcome this problem, and they have been applied successfully in various applications. However, there is a lack of a systematic study of the commonalities and differences of these algorithms, as well as their intrinsic relationships. In this paper, a unified framework for generalized LDA is proposed, which elucidates the properties of various algorithms and their relationships. Based on the proposed framework, we show that the matrix computations involved in LDA-based algorithms can be simplified so that the cross-validation procedure for model selection can be performed efficiently. We conduct extensive experiments using a collection of high-dimensional data sets, including text documents, face images, gene expression data, and gene expression pattern images, to evaluate the proposed theories and algorithms.
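The singularity problem mentioned above can be illustrated with a small sketch: when there are fewer samples than dimensions, the within-class scatter matrix is rank-deficient and cannot be inverted. Adding a small ridge term is one widely used workaround. The sketch below assumes that regularized variant, with a hypothetical `reg` value and synthetic data; it is not the unified framework proposed in the paper.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

def regularized_lda(X, y, reg=1e-3, n_components=1):
    """Leading discriminant directions of (Sw + reg * I)^{-1} Sb.

    Assumption: ridge regularization, one of several generalized LDA variants.
    """
    Sw, Sb = scatter_matrices(X, y)
    d = Sw.shape[0]
    M = np.linalg.solve(Sw + reg * np.eye(d), Sb)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)  # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real

# Synthetic example: 10 samples in 50 dimensions, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = np.array([0] * 5 + [1] * 5)
X[y == 1] += 1.0
Sw, _ = scatter_matrices(X, y)
rank_Sw = np.linalg.matrix_rank(Sw)  # at most n_samples - n_classes, far below 50
W = regularized_lda(X, y)
```

Here `Sw` is a 50 x 50 matrix whose rank is at most 8, so classical LDA, which requires its inverse, is not applicable without some modification such as the ridge term.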
TL;DR: A novel feature selection filter for supervised learning, which relies on the efficient estimation of the mutual information between a high-dimensional set of features and the classes, and which is able to outperform algorithms which individually rank features, as well as a greedy algorithm based on the maximal relevance and minimal redundancy criterion.
Abstract: We propose a novel feature selection filter for supervised learning, which relies on the efficient estimation of the mutual information between a high-dimensional set of features and the classes. We bypass the estimation of the probability density function with the aid of the entropic-graphs approximation of Renyi entropy, and the subsequent approximation of the Shannon entropy. Thus, the complexity does not depend on the number of dimensions but on the number of patterns/samples, and the curse of dimensionality is circumvented. We show that it is then possible to outperform algorithms which individually rank features, as well as a greedy algorithm based on the maximal relevance and minimal redundancy criterion. We successfully test our method both in the contexts of image classification and microarray data classification. For most of the tested data sets, we obtain better classification results than those reported in the literature.
TL;DR: P3C is the first projected clustering algorithm for both numerical and categorical data that can effectively discover projected clusters in the data while minimizing the number of required parameters.
Abstract: Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to set or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input, and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute either disjoint or overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.
TL;DR: This work presents an approach at the conceptual level for identifying problematic situations in fact-dimension relationships, and applying a normalization process to transform this conceptual MD model into a summarizability-compliant model that avoids erroneous analysis of data.
Abstract: Multidimensional analysis allows decision makers to efficiently and effectively use data analysis tools, which mainly depend on multidimensional (MD) structures of a data warehouse such as facts and dimension hierarchies to explore the information and aggregate it at different levels of detail in an accurate way. A conceptual model of such MD structures serves as an abstract basis for the subsequent implementation according to one specific technology. However, there is a semantic gap between a conceptual model and its implementation which complicates an adequate treatment of summarizability issues, which in turn may lead to erroneous results of data analysis tools and cause the failure of the whole data warehouse project. To bridge this gap for relationships between facts and dimensions, we present an approach at the conceptual level for (i) identifying problematic situations in fact-dimension relationships, (ii) defining these relationships in a conceptual MD model, and (iii) applying a normalization process to transform this conceptual MD model into a summarizability-compliant model that avoids erroneous analysis of data. Furthermore, we also describe our Eclipse-based implementation of this normalization process.
TL;DR: In this paper, a multidimensional analysis of poverty of Italian households is performed on the basis of SHIW data using fuzzy set theory, and a set of composite indicators is constructed in order to analyze different dimensions of poverty.
Abstract: Using fuzzy set theory, a multidimensional analysis of poverty of Italian households is performed on the basis of SHIW data. A set of composite indicators is constructed in order to analyze different dimensions of poverty. A unidimensional poverty ratio is calculated for each indicator, thus allowing a comparison among indicators on the dimensions of poverty. Finally, a multidimensional poverty ratio is obtained.
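As a rough illustration of how a unidimensional fuzzy poverty ratio can be computed for one indicator, the sketch below assigns each household a degree of membership in the set of the poor and averages those degrees. The linear membership function, the two thresholds and the income figures are hypothetical; the abstract does not state which membership function or aggregation the authors use.

```python
def poverty_membership(value, poor_threshold, non_poor_threshold):
    """Degree of membership in the 'poor' set for one indicator value.

    Assumption: membership falls linearly from 1 to 0 between the two thresholds.
    """
    if value <= poor_threshold:
        return 1.0
    if value >= non_poor_threshold:
        return 0.0
    return (non_poor_threshold - value) / (non_poor_threshold - poor_threshold)

def fuzzy_poverty_ratio(values, poor_threshold, non_poor_threshold):
    """Unidimensional poverty ratio: the mean membership over all households."""
    memberships = [
        poverty_membership(v, poor_threshold, non_poor_threshold) for v in values
    ]
    return sum(memberships) / len(memberships)

# Hypothetical monthly incomes for four households.
incomes = [500.0, 1000.0, 1500.0, 2000.0]
ratio = fuzzy_poverty_ratio(incomes, poor_threshold=500.0, non_poor_threshold=1500.0)
```

Unlike a crisp poverty line, this assigns the household at 1000.0 a partial membership of 0.5, so the ratio reflects degrees of deprivation rather than a yes/no count.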
TL;DR: A new approach combining original hypertree construction techniques for multidimensional clustering results visualization with novel cluster labeling techniques based on the use of cluster content evaluation criteria, like the F-measure on cluster properties is presented.
Abstract: Hyperbolic visualization represents a useful tool for the interpretation of complex data analysis results, whenever it can be combined with efficient labeling strategies. In this paper, we first present a new approach combining original hypertree construction techniques for multidimensional clustering results visualization with novel cluster labeling techniques based on the use of cluster content evaluation criteria, like the F-measure on cluster properties. The first part of the paper briefly presents the cluster hypertree construction principle. The main part of the paper focuses on the presentation of the labeling techniques. It illustrates that the scope of the proposed techniques can be extended from single cluster labeling to labeling of hierarchical structures, like hypertrees. Finally, using specific evaluation criteria, we show that the proposed methods are more efficient than usual labeling methods, both for single cluster labeling and for hierarchical labeling. The experimental context of the paper is a bibliographic database of 2127 PASCAL references related to the geological domain.
TL;DR: This paper discusses and applies the proposed model for the semantic annotation of the schema of a datacube, which is the basis for OLAP analysis and contains information derived from the Data Warehouse schema.
Abstract: In the information system view, the reference architecture for strategic and decision support is based on the Data Warehouse architecture, which enables flexible and multidimensional analysis of strategic indexes by means of OLAP tools and reports. In this paper we propose a novel model for semantic annotation of Data Warehouse schema that takes into account domain ontologies as well as a mathematical ontology. Such an ontology describes mathematical formulas underlying elements of the datacube schema, including the semantics of operands and operators. In particular, we discuss and apply the proposed model for the semantic annotation of the schema of a datacube, which is the basis for OLAP analysis and contains information derived from the Data Warehouse schema. In the paper, an illustrative case study together with some examples of analysis based on this kind of annotation are provided.
TL;DR: A multistep approach aims to the identification of potentially interesting items exploiting well-known techniques of multidimensional data analysis using association rules, which represent one of the most powerful and largely used approaches to detect the presence of regularities and paths in large databases.
TL;DR: In this paper the so-called generalized convolution, being in fact an adequate adaptation of the well known circular convolution concept to any invertible block-transform, is proposed, developed, and analysed.
Abstract: In this paper the so-called generalized convolution, being in fact an adequate adaptation of the well known circular convolution concept to any invertible block-transform, is proposed, developed, and analysed. First the proposed idea is summarized for a one-dimensional case. Then it is extended to multidimensional problems. The presented generalized convolution concept is based on the earlier A-convolution. This idea is recalled at the beginning and a set of techniques for studying the dependence of the respective coefficients on the arbitrary transform, is suggested. The generalized convolution matrix, being an extension of that for the circular convolution, is introduced and adapted to an arbitrary invertible transform. The filtering problem is then defined and presented in the transform domain. The multidimensional analysis starts with two-dimensional problems, then it is continued with formulas for multidimensional filtering tasks. The paper is illustrated with examples computed for twenty carefully selected transforms. Among them are Haar, Hadamard, Hartley, Karhunen-Loeve and a family of 16 discrete trigonometric transforms.
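For the classical case that the abstract starts from, circular convolution can be computed by pointwise multiplication in the DFT domain. The sketch below checks that identity and wraps it in a function that accepts any invertible transform matrix `T`. Treating "multiply pointwise in the transform domain, then invert" as the general recipe is only one plausible reading of the abstract; the paper's actual generalized convolution matrix may be defined differently, and the function names here are hypothetical.

```python
import numpy as np

def circular_convolution(h, x):
    """Direct circular convolution: y[i] = sum_j h[j] * x[(i - j) mod n]."""
    n = len(x)
    return np.array(
        [sum(h[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]
    )

def transform_domain_product(T, h, x):
    """Compute T^{-1} ((T h) * (T x)) for an invertible transform matrix T.

    Assumption: this is an illustrative reading, not the paper's definition.
    """
    return np.linalg.solve(T, (T @ h) * (T @ x))

n = 8
rng = np.random.default_rng(1)
h = rng.normal(size=n)
x = rng.normal(size=n)

F = np.fft.fft(np.eye(n))  # the n x n DFT matrix
y_direct = circular_convolution(h, x)
y_dft = transform_domain_product(F, h, x).real
```

With `T` set to the DFT matrix the two results agree to numerical precision. For other transforms, such as Hadamard or Hartley, the same recipe yields a different operation, which is the kind of dependence on the transform that the paper studies.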
TL;DR: This paper presents a cluster-outlier iterative detection algorithm that detects clusters and outliers from a different perspective for noisy data sets, and can be applied in many fields such as pattern recognition, data clustering and signal processing.
Abstract: Nowadays many data mining algorithms focus on clustering methods. There are also many approaches designed for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable from each other, especially for data sets with noise. Thus, it is necessary to treat clusters and outliers as concepts of the same importance in data analysis. In this paper, we present a cluster-outlier iterative detection algorithm that detects clusters and outliers from a different perspective for noisy data sets. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The adjustment and modification of the clusters and outliers are performed iteratively until a certain termination condition is reached. This data processing algorithm can be applied in many fields such as pattern recognition, data clustering and signal processing.
TL;DR: In this article, a highly usable multidimensional data analysis method for performing interactive analysis on, for example, medical/administrative data stored in a hospital information system to support knowledge discovery about clinical decision-making is proposed.
Abstract: A highly usable multidimensional data analysis method for performing interactive analysis on, for example, medical/administrative data stored in a hospital information system to support knowledge discovery about clinical decision-making is proposed. A multidimensional data analysis apparatus (200) includes: a database (201) separately holding an interval table I indicating intervals and a hierarchy table T indicating a hierarchical structure of each dimension of multidimensional data; an interval selection operation unit (202c) selecting an interval I′ having a user-requested property c from the interval table I, using an interval selection operation g; a join operation unit (202b) joining a set of intervals with a join operation β in the interval selected by the interval selection operation unit (202c), using the join operation β; and an aggregation operation unit (202a) generating a multidimensional cube from a result in the join operation unit (202b), using an aggregation operation α.
TL;DR: In this article, the authors investigate the problem of enabling block level parallelism, for multi-dimensional data sets, with arbitrary but static causal dependency between blocks that constitute the data set, and propose a method for synchronizing the execution of multiple processors to respect the dependency structure and calculate the total processing time as a function of the number of parallel processors.
Abstract: In this paper, we investigate the problem of enabling block-level parallelism for multi-dimensional data sets with arbitrary but static causal dependency between the blocks that constitute the data set. As the use of video and other multi-dimensional data sets becomes more commonplace and the algorithms used to process them become more complex, there is a greater need for effective parallelization schemes. We describe a method for synchronizing the execution of multiple processors to respect the dependency structure and calculate the total processing time as a function of the number of parallel processors. We also provide an algorithm to calculate the optimal starting times for each processor, which enables them to continuously process blocks without the need for synchronizing with other processors, under the assumption that the time to process each block is fixed.
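To make the idea of precomputed starting times concrete, the sketch below treats one special case: a 2-D grid of blocks, one processor per row, and each block depending on a block in the row above that lies `dep_offset` columns to its right (a wavefront pattern). This dependency pattern, the one-processor-per-row assignment and all function names are assumptions chosen for illustration; the paper handles arbitrary static dependencies and a variable number of processors.

```python
def row_start_times(rows, dep_offset, block_time):
    """Start time for each row's processor so that it never has to wait.

    Assumption: block (r, c) depends on block (r - 1, c + dep_offset),
    and dep_offset is smaller than the number of columns.
    """
    delay = (dep_offset + 1) * block_time
    return [r * delay for r in range(rows)]

def schedule_is_valid(rows, cols, dep_offset, block_time, starts):
    """Check that every block starts only after its dependency has finished."""
    for r in range(1, rows):
        for c in range(cols):
            dep_c = min(c + dep_offset, cols - 1)  # clamp at the right edge
            dep_finish = starts[r - 1] + (dep_c + 1) * block_time
            if starts[r] + c * block_time < dep_finish:
                return False
    return True

rows, cols, dep_offset, block_time = 4, 6, 1, 1.0
starts = row_start_times(rows, dep_offset, block_time)
total_time = starts[-1] + cols * block_time  # last row finishes last
valid = schedule_is_valid(rows, cols, dep_offset, block_time, starts)
```

Because the processing time per block is fixed, each processor can run its row back to back once it has been started at its offset, and no run-time synchronization is required in this special case.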
TL;DR: The relational database model is probably the most frequently used database model today but it doesn’t perform very well with complex queries and analysis of very large sets of data.
Abstract: The relational database model is probably the most frequently used database model today. It has its strengths, but it doesn’t perform very well with complex queries and analysis of very large sets ...
TL;DR: A modeling approach based on fragmentation techniques to manage the various issues related to the management of multistructured documents (representation, storage, reconstruction, and management of concurrent structures) is proposed.
Abstract: More than half of information used in organizations is stored in documents. The diversity of their uses implies several descriptions and thus several decomposition needs of these documents (and so several structures). Hence, designing models and tools to exploit simultaneously these various kinds of structures presents a real challenge. To fulfill these needs, we propose a modeling approach based on fragmentation techniques to manage the various issues related to the management of multistructured documents (representation, storage, reconstruction, and management of concurrent structures). Three exploitation modes are proposed according to the different kinds of information searched: information, data interrogation and multidimensional analysis.
TL;DR: This paper describes CubeLink in terms of the underlying components, their integration, and the formal mapping from multidimensional data analysis into link analysis, provides a formal basis for one particular form of automatic schema-ontology mapping from OLAP data cubes to semantic graph databases, and points the way towards future "intelligent" OLAP data cubes equipped with meta-data about their dimensional typing.
Abstract: The challenges facing the Department of Homeland Security (DHS) require not only multi-dimensional, but also multi-scale data analysis. In particular, the ability to seamlessly move from summary information, such as trends, into detailed analysis of individual entities, while critical for law enforcement, typically requires manually transferring information among multiple tools. Such time-consuming and error prone processes significantly hamper the analysts' ability to quickly explore data and identify threats. As part of a DHS Science and Technology effort, we have been developing and deploying for Immigration and Customs Enforcement the CubeLink system integrating information between relational data cubes and link analytical semantic graphs. In this paper we describe CubeLink in terms of the underlying components, their integration, and the formal mapping from multidimensional data analysis into link analysis. In so doing, we provide a formal basis for one particular form of automatic schema-ontology mapping from OLAP data cubes to semantic graphs databases, and point the way towards future "intelligent" OLAP data cubes equipped with meta-data about their dimensional typing.
TL;DR: A new paradigm for the multidimensional analysis of the geographic information: the Geographic OLAP, which combines the semantic component of geographic information and the flexibility of spatial analysis.
Abstract: Data warehouses in association with On Line Analytical Processing (OLAP) can enable effective Decision Support Systems. These solutions are based on the multidimensional paradigm, which allows a multidimensional analysis of huge datasets thanks to the concepts of dimension, fact, measure and OLAP operators. Spatial OLAP (SOLAP) integrates spatial information in data warehousing and OLAP systems. In this paper, we detail all different definitions of spatial dimension, spatial measure and spatio-multidimensional operators. A panorama of conceptual SOLAP models is provided. These solutions do not completely integrate the semantic component of geographic information and the flexibility of spatial analysis. The contribution of this paper is to introduce a new paradigm for the multidimensional analysis of the geographic information: the Geographic OLAP.
TL;DR: This case study analyzes the most important financial and economic indicators of several organizations in the same sector of activity, selects performance ratios, and determines a particular analysis model, a score-type function, using modern techniques of multidimensional analysis such as Data Mining.
Abstract: To help companies move from the desire to obtain performance to action through better and faster decisions, facilities like data warehouses, business intelligence systems and decision support systems are now available. Performance is an abstract but extremely important concept, which has to be put into proper perspective from the point of view of the organization, of the performance indicators taken into consideration, of the models proposed for establishing performance and of the chosen software. The objectives of the case study are the analysis of the most important financial and economic indicators at the level of some organizations from the same sector of activity, the selection of performance ratios, and the determination of a particular model of analysis, a score-type function, for which we will use modern techniques of multidimensional analysis, like Data Mining techniques. In our case, the data used to calculate the economic and financial ratios were collected from companies' annual reports, using the Internet as the primary source. Also, a comparative multidimensional analysis of the organizations, product types, sales areas and customers can offer a panoramic view over the performance in that sector of activity.
TL;DR: In this paper, a multi-stage optimal component analysis (OCA) algorithm is proposed to learn a subspace that is optimal for data discrimination based on the K-nearest neighbor classifier.
Abstract: Learning data representations is a fundamental challenge in modeling neural processes and plays an important role in applications such as object recognition. Optimal component analysis (OCA) formulates the problem in the framework of optimization on a Grassmann manifold and a stochastic gradient method is used to estimate the optimal basis. OCA has been successfully applied to image classification problems arising in a variety of contexts. However, as the search space is typically very high dimensional, OCA optimization often requires expensive computational cost. In multi-stage OCA, we first hierarchically project the data onto several low-dimensional subspaces using standard techniques, then OCA learning is performed hierarchically from the lowest to the highest levels to learn about a subspace that is optimal for data discrimination based on the K-nearest neighbor classifier. One of the main advantages of multi-stage OCA lies in the fact that it greatly improves the computational efficiency of the OCA learning algorithm without sacrificing the recognition performance, thus enhancing its applicability to practical problems. In addition to the nearest neighbor classifier, we illustrate the effectiveness of the learned representations on object classification used in conjunction with classifiers such as neural networks and support vector machines.
TL;DR: The idea proposed in this thesis is to transform the histogram data by means of an approximation function in order to control the error deriving from empirical data.
Abstract: This work aims at analysing complex phenomena through the construction of appropriate models, followed by an analysis of the characteristic parameters of the models. It derives from the need to work not only with empirical values but also with functions that are able to smooth the histogram and make it possible to omit values that could be outliers. According to the classical theory of measure, the data generated by a “correct” model are more “real” than the empirical ones, because they are purified of sampling error and measurement error. We should never forget that there are no “real” models, but rather models that approximate reality more or less accurately. Many models can be compatible with the same empirical data. The idea proposed in this thesis is to transform the histogram data by means of an approximation function in order to control the error deriving from empirical data. What we look for is the right compromise between model and error. Our target is to work with comparable models so that the techniques of Multidimensional Data Analysis can be applied. For that reason, all the histograms will be transformed into models through approximation by functions of the same family. In that case we would work with data that have been synthesized through a model, and from there we would obtain N models for each variable, all corresponding to the i-th observation. Models constructed that way can be synthesized through parameters and through an appropriate quality index of adaptation. We will then pass on to the analysis of the resulting data through adequate techniques of Multidimensional Analysis.
TL;DR: This paper focuses on the extrapolative features of the Lee-Carter model (1992) and proposes a reading of the method in the framework of Multidimensional Data Analysis, with the aim of proposing a data analysis strategy that exploits the analytical and geometrical properties of the Lee-Carter method.
Abstract: In the framework of demographic processes, different approaches can be identified. According to their aims, these approaches can be differentiated in extrapolative and structural methods. The first focus on the homogeneity of trends in order to obtain projection. The second are based on structural models relating demographic variables to other kinds of variables (geographical, social, economical, etc.). Nowadays, this distinction is not so clear and the joint use of explorative and explanatory approaches is increasing. In this paper, we focus on the extrapolative features of the Lee-Carter model (1992) and propose a reading of such method in the framework of the Multidimensional Data Analysis. Our aim is to propose a data analysis strategy exploiting the analytical and geometrical properties of the Lee-Carter method.
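The link between the Lee-Carter method and multidimensional data analysis comes from its standard estimation procedure. The model writes the log death rate at age x and time t as a_x + b_x k_t plus an error term, and the parameters are usually fitted through a singular value decomposition of the centered log-rate matrix. The sketch below is a minimal version of that standard fit on synthetic data, using the usual identification constraints (b sums to 1, k sums to 0). The function name and the data are hypothetical, and it does not reproduce the specific strategy proposed in the paper.

```python
import numpy as np

def fit_lee_carter(log_rates):
    """Fit log m[x, t] ~ a[x] + b[x] * k[t] by a rank-1 SVD.

    log_rates has one row per age group and one column per year.
    Constraints applied: sum(b) == 1 and sum(k) == 0.
    """
    a = log_rates.mean(axis=1)            # average age profile
    centered = log_rates - a[:, None]     # each row now sums to zero over time
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scale = U[:, 0].sum()                 # rescaling makes b sum to 1
    b = U[:, 0] / scale
    k = S[0] * Vt[0] * scale
    return a, b, k

# Synthetic, exactly rank-1 example: 5 age groups, 6 years.
a_true = np.array([-6.0, -5.5, -5.0, -4.0, -3.0])
b_true = np.array([0.1, 0.15, 0.2, 0.25, 0.3])           # sums to 1
k_true = np.array([5.0, 3.0, 1.0, -1.0, -3.0, -5.0])     # sums to 0
log_rates = a_true[:, None] + np.outer(b_true, k_true)

a_hat, b_hat, k_hat = fit_lee_carter(log_rates)
```

The first singular vectors play the same role here as the first principal axis in a principal component analysis of the age-by-year table, which is why the method lends itself to a geometric reading.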
TL;DR: In this paper, a reverse-engineering process is used to obtain a logical representation of data sources, mark this logical representation with multidimensional concepts, and derive a conceptual multi-dimensional model from the marked model.
Abstract: Data warehouses integrate several operational sources to provide a multidimensional analysis of data, thus improving the decision making process. Therefore, an in-depth analysis of these data sources is crucial for data warehouse development. Traditionally, this analysis has been based on a set of informal guidelines or heuristics to support the manual discovery of multidimensional elements in well-known documentation. As a result, this task may become highly tedious and prone to failure. In this paper, MDA (Model Driven Architecture) is used to design a reverse engineering process in which the following tasks are performed: (i) obtain a logical representation of data sources, (ii) mark this logical representation with multidimensional concepts, and (iii) derive a conceptual multidimensional model from the marked model.
TL;DR: In the second phase of the research, the existing risk management systems were analyzed, and an OLAP-based solution with a multidimensional model for organizing data was proposed with a view to its implementation.
Abstract: OLAP (On-Line Analytical Processing) performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. It has become the foundation for Intelligent Solutions including Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end-users to perform ad-hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making. To achieve the main objective of the three-year research PN2 Project “Sustainable Management System of Resources Used for Monitoring and Evaluating the Environmental Risks in Order to Prevent the Negative Effects and to Manage Crises Situations MEMDUR” (code D11-037/18.09.2007, webpage: http://memdur.ssai.valahia.ro), in the second phase of the research an analysis of the existing risk management systems was carried out and a solution based on OLAP systems was proposed with a view to its implementation. The OLAP system design is based on the solution offered by Microsoft and proposes a multidimensional model for organizing data, which allows the data to be analyzed through various visualizations of multidimensional cubes. The multitude of possible views over the data collection, together with the choice of convenient measures, can be taken into account in the risk calculation when a realistic picture is needed of the air pollution rate resulting from the different pollutants produced by industrial processes (Radulescu 2005). Looking forward, a SOLAP (Spatial On-Line Analytical Processing) instrument will integrate the advantages offered by the Geographic Information System (GIS) with those of OLAP applications. The classical data warehouse architecture (structured on three levels: data level, OLAP server level and client level) can be extended to a SOLAP model where spatial data (with related attributes), measures and hierarchies have to be added. At the OLAP server level, the metadata warehouse integrates concepts like the relational model, the multidimensional model and GIS concepts (Rivest et al. 2003).
TL;DR: A suitable use of multidimensional data analysis (MDA) is proposed to investigate the associations characterizing the indicators/attributes of a system, in order to assess the impact of an adopted policy by measuring system performance.
Abstract: The present paper focuses on ex post analysis to assess the impact of an adopted policy by measuring system performance. Since accurate impact assessment requires in-depth knowledge of the structure underlying the system, this contribution proposes a suitable use of multidimensional data analysis (MDA) to investigate the associations characterizing the indicators/attributes of the system. The general aim is to identify homogeneous subsets of objects that are described by subsets of attributes. This approach was designed to study student performance in Italian universities, with a focus on student careers. The example data set is a data mart selected from the University of Macerata database and refers to the students at the Economics Faculty from 2001 to 2007.
TL;DR: A log folder generation part of a client terminal acquires and transmits data on the operating environment or executing state of an application through entry points provided in the application, during execution of an essential module or upon abnormal termination.
Abstract: PROBLEM TO BE SOLVED: To provide a system capable of acquiring multidimensional data on the operating environment or executing state of an application in order to perform multidimensional analysis thereof. SOLUTION: In the application analysis support system, a log folder generation part 12 of a client terminal 2 acquires and transmits data on the operating environment or executing state of the application 22 through entry points 24_1 to 24_7 provided in the application 22, during execution of an essential module or the like of the application 22 or upon abnormal termination. An application analysis support device 1, which acquires the data, performs predetermined analysis based on the plurality of acquired data and displays the result as image data such as a graph. COPYRIGHT: (C)2008, JPO&INPIT
TL;DR: A parametric statistics and multidimensional analysis algorithm is applied to target recognition in quad-polarimetric SAR images, taking the backscattering characteristics of land cover targets as the basis of analysis.
Abstract: The diverse parameters decomposed from quad-polarimetric SAR data can serve as an important basis for target recognition, classification and other applications. In target recognition, because a single parameter, or two parameters produced by the same decomposition algorithm, gives only a one-sided view, the results extracted from different parameters differ markedly. In this paper, a parametric statistics and multidimensional analysis algorithm is applied to target recognition in quad-polarimetric SAR images. The proposed algorithm takes the backscattering characteristics of land cover targets as the basis of analysis. Through parametric statistics, the dominant scattering mechanisms of each decomposition algorithm are picked out, such as the volume scattering and sphere scattering of paddy. These become the data for the multidimensional analysis, in which the target's location in the parameter space is the basis of target recognition. Applying the proposed algorithm to land cover and land use improved the monitoring capability of quad-polarimetric SAR.
TL;DR: A powerful approach for noisy data reconstruction and also for data compression based on the developed algorithms for tensor factorization and decomposition, which has many potential applications in computational neuroscience, multi-sensory, multidimensional data analysis and text mining.
Abstract: In this paper we present a powerful approach for noisy data reconstruction and also for data compression based on our algorithms for tensor factorization and decomposition [10], [9]. This approach has many potential applications in computational neuroscience, multi-sensory, multidimensional data analysis and text mining. Our algorithms are locally stable and work well for sufficiently sparse data, even in the heavily noisy case. Moreover, the proposed approach offers promising applications for real-world data. The extensive experimental results confirm the validity and high performance of the developed algorithms not only for synthetic benchmarks but also for real-world data, especially when the multi-layer hierarchical approach is used [4], [9].
TL;DR: This special issue presents a selection of nine papers on various topics in multidimensional data analysis, drawn from very different fields; most of them combine stochastic modeling with approximation techniques that are either combinatorial or geometrical.
TL;DR: Analysis of caCORE system and systemic integration experiment proved that building a large-scale oncological data system integrating distributed data is extremely complex and requires a lot of effort to understand its structures, prepare data mappings and implement integration procedures.
Abstract: Multidimensional analysis, trend analysis, summaries and drill-downs, as the data warehousing methods of choice, provided a rich, valuable and detailed perspective on cancer threats along virtually any dimension covered by the data. They made it possible to model the risk of cancer including age, race, sex and survival chances, among others, and to spot the most dangerous and highest-incidence cancers. They revealed how little survival chances and treatment efficiency have increased over the last 30 years and how little early diagnosis has improved, presented trends and the changes in them, showed changes in cancer risk related to place of residence, and emphasized the importance of risk mitigation through screening and a healthy lifestyle. These methods also turned out to be easy, requiring less computer science knowledge than one might expect. With little support from IT staff, oncology domain professionals can easily benefit from vast data sets and the analytical power applied to them. Data mining algorithms evaluated on skin melanoma data managed to extract what is already known in the domain. Therefore, when oncology professionals use them on less generic data, data mining can be expected to have the potential to extend experts' knowledge. Neural networks, decision trees and clusters showed higher prediction accuracy than Naive Bayes classifiers and association rules, but it is advisable to merge the results from many algorithms. Findings from particular algorithms are often disjoint and, when combined, reveal more despite varying predictive performance. Analysis of the caCORE system and a system integration experiment showed that building a large-scale oncological data system that integrates distributed data is extremely complex. Integrating with it requires a lot of effort to understand its structures, prepare data mappings and implement integration procedures. Close cooperation between IT and oncology professionals is mandatory.
Suggestions were made to simplify the generic caCORE data model (ontology) or split it into smaller parts, and to expose as much integration functionality as possible through web interfaces or encapsulated classes, to decrease the complexity of the process. With these changes, caCORE would be fully feasible and could be considered the future of applying data warehousing and data mining techniques in oncology, providing a distributed, common-model-compliant dataset and leveraging the power of the research community.