TL;DR: In this paper, a configuration interface is described for allowing users to specify object oriented policies, which allow any data structures to be applied with respect to a payload of a received packet stream, including any portions of HTTP traffic.
Abstract: Systems and methods for configuring and evaluating policies that direct processing of one or more data streams are described. A configuration interface is described for allowing users to specify object oriented policies. These object oriented policies may allow any data structures to be applied with respect to a payload of a received packet stream, including any portions of HTTP traffic. A configuration interface may also allow the user to control the order in which policies and policy groups are executed, in addition to specifying actions to be taken if one or more policies are undefined. Systems and methods for processing the policies may allow efficient processing of object-oriented policies by applying potentially complex data structures to unstructured data streams. A device may also interpret and process a number of flow control commands and policy group invocation statements to determine an order of execution among a number of policies and policy groups. These policy configurations and processing may allow configuration and processing of complex network behaviors relating to load balancing, VPNs, SSL offloading, content switching, application security, acceleration, and caching.
TL;DR: In this paper, a graphical user interface (GUI) control based on the metadata associated with the search hits is constructed and displayed with search results in a standard view, and the metadata in the search results is arranged in a tabular view which is embedded in the display of search results and rendered invisible until selected by the user.
Abstract: Records in databases or unstructured files are enriched with metadata and are indexed for retrieval by a search engine. In response to a search request, a graphical user interface (GUI) control based on the metadata associated with the search hits is constructed and displayed with the search results in a standard view. Selection of a metadata value via the GUI control filters the previously matched records down to those matching the selected value. The metadata in the search results is arranged in a tabular view which is embedded in the display of search results and rendered invisible until selected by the user. Reports can be constructed from an identifier of each returned record set for presenting, analyzing and modifying the data, and for generating further reports.
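The metadata-driven filtering described in this abstract can be illustrated with a minimal sketch. The records, the field names (`region`, `type`), and the helper functions below are hypothetical and are not taken from the source; the sketch only shows how counts per metadata value could feed a filter control and how selecting a value narrows the hits.

```python
from collections import Counter

# Hypothetical search hits: each record carries metadata added at indexing time.
hits = [
    {"id": 1, "title": "Q1 sales report", "meta": {"region": "EMEA", "type": "report"}},
    {"id": 2, "title": "Q1 sales memo", "meta": {"region": "APAC", "type": "memo"}},
    {"id": 3, "title": "Q2 sales report", "meta": {"region": "EMEA", "type": "report"}},
]


def facet_counts(records, field):
    """Count how many hits carry each value of one metadata field."""
    return Counter(r["meta"][field] for r in records if field in r["meta"])


def filter_by(records, field, value):
    """Narrow the previously matched records to those with the selected value."""
    return [r for r in records if r["meta"].get(field) == value]


region_facet = facet_counts(hits, "region")    # values a GUI control could list
emea_hits = filter_by(hits, "region", "EMEA")  # result of selecting "EMEA"
```

A GUI control would render the keys of `region_facet` as selectable values; the filter is applied to the existing hit list rather than re-running the search.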
TL;DR: In this paper, the authors present a method and system for integrating an enterprise's structured and unstructured data to provide users and enterprise applications with efficient and intelligent access to that data.
Abstract: Disclosed herein is a method and system for integrating an enterprise's structured and unstructured data to provide users and enterprise applications with efficient and intelligent access to that data. Queries can be directed toward both an enterprise's structured and unstructured data using standardized database query formats such as SQL commands. A coprocessor can be used to hardware-accelerate data processing tasks (such as full-text searching) on unstructured data as necessary to handle a query. Furthermore, traditional relational database techniques can be used to access structured data stored by a relational database to determine which portions of the enterprise's unstructured data should be delivered to the coprocessor for hardware-accelerated data processing.
TL;DR: This paper first preprocesses the data using natural language processing techniques such as tokenizing, stemming, and part-of-speech tagging, then uses the maximum entropy method to classify Arabic text documents.
Abstract: In organizations, a large amount of information exists in text documents. Therefore, it is important to use text mining to discover knowledge from this unstructured data. Automatic text classification is considered one of the important applications of text mining. It is the process of assigning a text document to one or more predefined categories based on its content. This paper focuses on classifying Arabic text documents. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In our approach, we first preprocessed the data using natural language processing techniques such as tokenizing, stemming, and part-of-speech tagging. Then, we used the maximum entropy method to classify Arabic documents. We evaluated our approach on real data, then compared the results with other existing systems.
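As a rough illustration of the classification step, a maximum entropy classifier over bag-of-words features is equivalent to multinomial logistic regression. The sketch below is not the authors' implementation: the toy English documents stand in for preprocessed Arabic tokens, stemming and part-of-speech tagging are omitted, and the function names, learning rate, and epoch count are assumptions chosen for the example.

```python
import math
from collections import defaultdict


def tokenize(text):
    # Whitespace tokenization only; stemming and POS tagging are omitted here.
    return text.lower().split()


def predict_proba(weights, tokens):
    """Softmax over per-class sums of token weights."""
    scores = {c: sum(w[t] for t in tokens if t in w) for c, w in weights.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}


def train_maxent(docs, labels, epochs=200, lr=0.5):
    """Fit weights[label][token] by stochastic gradient ascent on the log-likelihood."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for text, gold in zip(docs, labels):
            tokens = tokenize(text)
            probs = predict_proba(weights, tokens)
            for c in classes:
                # Gradient term: observed indicator minus predicted probability.
                error = (1.0 if c == gold else 0.0) - probs[c]
                for t in tokens:
                    weights[c][t] += lr * error
    return weights


def classify(weights, text):
    probs = predict_proba(weights, tokenize(text))
    return max(probs, key=probs.get)


# Toy training data (hypothetical).
docs = [
    "team wins football match",
    "player scores goal match",
    "bank raises interest rates",
    "market shares fall bank",
]
labels = ["sports", "sports", "economy", "economy"]
model = train_maxent(docs, labels)
```

Calling `classify(model, "football goal")` then picks the class whose learned token weights give the highest probability; tokens never seen in training are ignored.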
TL;DR: In this paper, the problem of matching relevant content to user queries is addressed by grouping concepts and phrases, where each such group represents one possible user intention (as implied by the query phrase or keyword) and each such grouping is analyzed to provide relevant content, including but not limited to, unstructured data like world wide web, categorized data and paid listings.
Abstract: Methods and apparatus for a new approach to the problem of matching relevant content to user queries. Instead of looking for the exact keyword, the invention expands it into groupings of concepts and phrases, where each such group represents one possible user intention (as implied by the query phrase or keyword). Each such grouping is analyzed to provide relevant content, including but not limited to, unstructured data like world wide web, categorized data and paid listings. The provided method can better capture user intentions even for cases where there is no click-through information.
TL;DR: As online communities proliferate, developing effective solutions to support their information needs becomes increasingly important; the authors call this problem community information management, or CIM for short.
Abstract: Community Information Management: There are many communities on the Web. Some are based on common interests, such as communities of movie goers, database researchers, and bioinformaticians, while others are based on a shared purpose, such as organization intranets and online technical support groups. Community members often want to discover, monitor, and query entities and relationships in their community. For example, database researchers might want to know if there is a connection between two given researchers, where a given paper has been cited in the past week, or what of interest has happened in the last 24 hours. Answering such questions often requires retrieving raw, largely unstructured data from multiple sources (e.g., home pages, DBLP, mailing lists), then inferring and monitoring semantic information. Examples of such inference and monitoring include recognizing entity mentions (e.g., “J. Gray”, “SIGMOD-04”), deciding if two mentions (e.g., “J. Gray” and “Jim Gray”) refer to the same real-world entity, recognizing that a relationship (e.g., co-authoring, advising, giving a talk) exists between two entities, detecting new entities (e.g., new workshops), and inferring that a relationship (e.g., affiliation with a university) has ceased to exist. The above inference and monitoring tasks are well known to be difficult [1, 2, 3, 7, 10]. As online communities proliferate, developing effective solutions to support their information needs becomes increasingly important. We call this problem community information management, or CIM for short.
TL;DR: In this article, the authors use data from performance metrics to drive the behavior of geometric shapes to visualize business performance and create new composite objects that show magnitude, patterns of structured and unstructured data, interrelationships, causalities, and dependencies.
Abstract: Users are enabled to utilize data from performance metrics to drive the behavior of geometric shapes to visualize business performance and create new composite objects that show magnitude, patterns of structured and unstructured data, interrelationships, causalities, and dependencies. Presentations are then rendered in a performance metric application or in another application through an embeddable user interface using the geometric shapes and composite objects. Automatic update of presented information in response to changes in the underlying data is enabled through the use of composite objects.
TL;DR: In this article, Latent Metonymical Analysis and Indexing (LMai) is used to identify the relationship between the words in a set of given documents (unstructured data).
Abstract: The present invention relates to Latent Metonymical Analysis and Indexing (LMai), a novel concept for advanced or unsupervised machine learning techniques that uses a statistical approach to identify the relationship between the words in a set of given documents (unstructured data). This approach does not necessarily need training data to make decisions on matching related words together; it is able to do the classification by itself. All that is needed is to give the algorithm a set of natural documents. The method classifies the relationships automatically, without any human guidance during the process, as shown in FIGS. 6 and 7.
TL;DR: In this article, a data analysis system that includes an information mining engine for extracting structured data from unstructured data, a data store for storing the extracted structured data, data received from third party data sources, and data collected from sensors monitoring insured property is described.
Abstract: A data analysis system that includes an information mining engine for extracting structured data from unstructured data, a data store for storing the extracted structured data, data received from third party data sources, and data received from sensors monitoring insured property is described. The system also includes a business logic processor that synergistically analyzes the structured data extracted by the text mining engine, the data received from the sensor, and the data received from the third party data source to make an insurance evaluation.
TL;DR: DBconnect is introduced, a prototype that exploits the social network coded within the DBLP database by drawing on a new random walk approach to reveal interesting knowledge about the research community and even recommend collaborations.
Abstract: Extracting information from large collections of structured, semi-structured or even unstructured data can be a considerable challenge when much of the hidden information is implicit within relationships among entities within the data. Social networks are such data collections in which relationships play a vital role in the knowledge these networks can convey. A bibliographic database is an essential tool for the research community, yet finding and making use of relationships comprised within such a social network is difficult. In this paper we introduce DBconnect, a prototype that exploits the social network coded within the DBLP database by drawing on a new random walk approach to reveal interesting knowledge about the research community and even recommend collaborations.
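The abstract does not specify its new random walk approach, so the sketch below swaps in a standard random walk with restart over a co-authorship graph to show how such a walk can rank potential collaborators. The graph, the author names, the restart probability, and the `recommend` helper are all hypothetical.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Power-iterate visit probabilities for a walker that restarts at `start`."""
    scores = {node: 1.0 if node == start else 0.0 for node in graph}
    for _ in range(iterations):
        # Restart mass returns to the start node on every step.
        nxt = {node: (restart if node == start else 0.0) for node in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                continue
            # The remaining mass is split evenly among co-authors.
            share = (1.0 - restart) * scores[node] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        scores = nxt
    return scores


def recommend(graph, author, k=2):
    """Rank authors who are not yet co-authors of `author` by walk score."""
    scores = random_walk_with_restart(graph, author)
    candidates = [n for n in graph if n != author and n not in graph[author]]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)[:k]


# Hypothetical undirected co-authorship graph as adjacency lists.
coauthors = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol", "erin"],
    "erin": ["dave"],
}
suggestions = recommend(coauthors, "alice")
```

Authors closer to the start node in the graph accumulate more probability mass, so they rank ahead of more distant ones.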
TL;DR: In this article, a computer readable medium is configured to receive a query, to map the query to an unstructured data source, to dispatch a request based on the query, and to aggregate the query results in a structured data store.
Abstract: A computer readable medium is configured to receive a query, to map the query to an unstructured data source, to dispatch a request based on the query to the unstructured data source, to aggregate data returned by the unstructured data source in a structured data store, and to issue the query against the structured data store.
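The sequence in this abstract (dispatch a request to an unstructured source, aggregate what comes back into a structured store, then issue the query against that store) can be sketched with an in-memory SQLite database. Everything specific here is an assumption made for illustration: the sample documents, the invoice pattern, the table layout, and the function names. The mapping from query to source is simplified to a keyword passed in by the caller.

```python
import re
import sqlite3

# Hypothetical unstructured source: free-text notes keyed by document name.
DOCUMENTS = {
    "note1.txt": "Invoice 1001 from Acme for 250 USD",
    "note2.txt": "Invoice 1002 from Globex for 975 USD",
    "note3.txt": "Meeting moved to Friday",
}
PATTERN = re.compile(r"Invoice (\d+) from (\w+) for (\d+) USD")


def dispatch(keyword):
    """Send a keyword request to the unstructured source and return raw matches."""
    return [text for text in DOCUMENTS.values() if keyword.lower() in text.lower()]


def aggregate(conn, raw_texts):
    """Extract fields from the returned text and load them into a table."""
    conn.execute("CREATE TABLE invoices (number INTEGER, vendor TEXT, amount INTEGER)")
    for text in raw_texts:
        match = PATTERN.search(text)
        if match:
            number, vendor, amount = match.groups()
            conn.execute(
                "INSERT INTO invoices VALUES (?, ?, ?)",
                (int(number), vendor, int(amount)),
            )


def run_query(sql, keyword):
    """Populate a temporary structured store, then run the SQL against it."""
    conn = sqlite3.connect(":memory:")
    try:
        aggregate(conn, dispatch(keyword))
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = run_query("SELECT vendor, amount FROM invoices WHERE amount > 500", "invoice")
```

The structured store exists only for the lifetime of one query in this sketch; a persistent store would avoid repeating the extraction on every request.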
TL;DR: In this paper, the authors extract and edit medical related quality of care information for reporting, using unstructured data to create structured information and then derive quality measures automatically from the structured information.
Abstract: Medical related quality of care information is extracted and edited for reporting. Patient records are mined. The mining may include mining unstructured data to create structured information. Measures are derived automatically from the structured information. A user may then edit the measures, data points used to derive the measures, or other quality metric based on expert review. The editing may allow for a better quality report. Tools may be provided to configure reports, allowing generation of new or different reports.
TL;DR: In this paper, the authors present methods, systems, and software for querying heterogeneous business data comprising structured data and unstructured data, which may be stored across one or more repositories.
Abstract: The present disclosure relates to methods, systems, and software for querying heterogeneous business data comprising structured data and unstructured data. The structured data and unstructured data may be stored across one or more repositories. The combined query may be initiated when the system receives a query for the heterogeneous business data and automatically parses the received query into sub-queries. Each sub-query can be associated with either structured or unstructured data stored in one of the repositories. At least one of the sub-queries can include a portion of the received query. The results of the various sub-queries can be merged automatically using business logic.
TL;DR: In this paper, a method and system that utilizes OLAP and supporting data structures for making predictions about business locations is presented, where relationships are automatically extracted from the utilizable data by employing machine learning.
Abstract: A method and system that utilizes OLAP and supporting data structures for making predictions about business locations. The method includes providing a spatial map and analyzing heterogeneous data having a spatial component to find utilizable data. Relationships are automatically extracted from the utilizable data by employing machine learning. The step of automatically extracting relationships includes generating a composite indicator, which correlates spatial data with unstructured data. The extracted relationships are presented on a spatial map to make a prediction about at least one business location. Preferably, the predictions are presented as a rank-ordered list on the spatial map and a heat map overlays the spatial map to indicate predictions about particular regions.
TL;DR: The research methodology pertaining to this study is explained by reasoning out the selection of case study research approach coupled with semi-structured interviews to identify and generate main concepts of the study.
Abstract: Content analysis is a research technique used to organise large amounts of textual data into standardised formats from which suggestions and conclusions can be drawn. Content analysis can be carried out quantitatively by counting words or qualitatively by coding. The former approach refers to counting the frequency of keywords and the latter refers to identifying similar themes or concepts in the data set. This paper discusses the use of conceptual content analysis with computerised software to analyse data gathered from semi-structured interviews. The context of the research within which content analysis is used is to identify the influence of performance measurement on construction research activities. The paper first explains the research methodology pertaining to this study by reasoning out the selection of the case study research approach coupled with semi-structured interviews. The paper then discusses how the information gathered from semi-structured interviews is fed into the computerised software to identify and generate the main concepts of the study.
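The quantitative variant mentioned in this abstract, counting keyword frequencies, can be sketched in a few lines. This illustrates only the counting approach, not the conceptual analysis software used in the paper; the transcripts and the keyword list are made up for the example.

```python
import re
from collections import Counter


def keyword_frequencies(transcripts, keywords):
    """Count occurrences of each keyword across all interview transcripts."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for text in transcripts:
        # Lower-case the text and split it into alphabetic words.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in wanted:
                counts[word] += 1
    return counts


# Hypothetical interview excerpts.
transcripts = [
    "Performance measurement helps us track research outputs.",
    "We measure performance of each research project.",
]
frequencies = keyword_frequencies(transcripts, ["performance", "research", "measurement"])
```

Exact word matching treats "measure" and "measurement" as different keywords, which is one reason qualitative coding of themes is used alongside simple counts.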
TL;DR: This work proposes a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text and describes the technical challenges involved, motivated in part by the experiences with an early 90M-page prototype.
Abstract: The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web’s unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.
TL;DR: In Mining the Talk, two leading-edge IBM researchers introduce a new approach to unlocking the business value hidden in virtually any form of unstructured data, from word processing documents to websites, emails to instant messages.
Abstract: Leverage Unstructured Data to Become More Competitive, Responsive, and Innovative. In Mining the Talk, two leading-edge IBM researchers introduce a revolutionary new approach to unlocking the business value hidden in virtually any form of unstructured data, from word processing documents to websites, emails to instant messages. The authors review the business drivers that have made unstructured data so important and explain why conventional methods for working with it are inadequate. Then, writing for business professionals, not just data mining specialists, they walk step-by-step through exploring your unstructured data, understanding it, and analyzing it effectively. Next, you'll put IBM's techniques to work in five key areas: learning from your customer interactions; hearing the voices of customers when they're not talking to you; discovering the “collective consciousness” of your own organization; enhancing innovation; and spotting emerging trends. Whatever your organization, Mining the Talk offers you breakthrough opportunities to become more responsive, agile, and competitive.
Identify your key information sources and what can be learned about them. Discover the underlying structure inherent in your unstructured information. Create flexible models that capture both domain knowledge and business objectives. Create visual taxonomies: “pictures” of your data and its key interrelationships. Combine structured and unstructured information to reveal hidden trends, patterns, and relationships. Gain insights from “informal talk” by customers and employees. Systematically leverage knowledge from technical literature, patents, and the Web. Establish a sustainable process for creating continuing business value from unstructured data. Contents: Preface; Acknowledgements; Chapter 1: Introduction; Chapter 2: Mining Customer Interactions; Chapter 3: Mining the Voice of the Customer; Chapter 4: Mining the Voice of the Employee; Chapter 5: Mining to Improve Innovation; Chapter 6: Mining to See the Future; Chapter 7: Future Applications; Appendix: The IBM Unstructured Information Modeler Users Manual.
TL;DR: In this paper, the authors provide various embodiments of systems, methods, and software for managing archived data, such as metadata indexing, metadata parsing, and metadata attribute indexing.
Abstract: This disclosure provides various embodiments of systems, methods, and software for managing archived data. For example, software for archiving data may receive a request to archive an unstructured data object and archive the unstructured data object into an archive object in an offline storage media. The archive object is associated with one or more metadata attributes. The request may be received from an exposed API method embedded within a communicably coupled business application. The software may receive identification of an archive index via the request from the exposed API, where the archive index points to the offline storage media and is based on one or more metadata attribute criteria. The software may parse the archive object into the metadata attributes according to at least a subset of the attribute criteria and populate the archive index with the one or more metadata attributes indexing the archive object.
TL;DR: This paper describes an IWMS called BIwTL (Business Information Warehouse Toolkit and Language) that automates and simplifies IWMS tasks by devising a high-level declarative information warehousing language, GIWL, and building the runtime system components for such a language.
Abstract: Rapidly leveraging information analytics technologies to mine the mounting information in structured and unstructured forms, derive business insights and improve decision making is becoming increasingly critical to today's business successes. One of the key enablers of the analytics technologies is an Information Warehouse Management System (IWMS) that processes different types and forms of information, builds, and maintains the information warehouse (IW) effectively. Although traditional multi-dimensional data warehousing techniques, coupled with the well-known ETL processes (Extract, Transform, Load) may meet some of the requirements in an IWMS, in general, they fall short on several major aspects: 1. They often lack comprehensive support for both structured and unstructured data processing; 2. they are database-centric and require detailed database and data warehouse knowledge to perform IWMS tasks, and hence they are tedious and time-consuming to operate and learn; 3. they are often inflexible and insufficient in coping with a wide variety of on-going IW maintenance tasks, such as adding new dimensions and handling regular and lengthy data updates with potential failures and errors. To cope with such issues, this paper describes an IWMS, called BIwTL (Business Information Warehouse Toolkit and Language), that automates and simplifies IWMS tasks by devising a high-level declarative information warehousing language, GIWL, and building the runtime system components for such a language. BIwTL hides system details, e.g., databases, full text indexers, and data warehouse models, from users by automatically generating appropriate runtime scripts and executing them based on the GIWL language specification. Moreover, BIwTL supports structured and unstructured information processing by embedding flexible data extraction and transformation capabilities, while ensuring high performance processing for large datasets. 
In addition, this paper systematically studied the core tasks around information warehousing and identified five key areas. In particular, we describe our technologies in three areas, i.e., constructing an IW, data loading, and maintaining an IW. We have implemented such technologies in BIwTL 1.0 and validated it in real world environments with a number of customers. Our experience suggests that BIwTL is light-weight, simple, efficient, and flexible.
TL;DR: This paper presents an unstructured data processing benchmark suite, provides detailed descriptions of the workloads in the suite, and discusses the larger space of application characteristics that each of them captures.
Abstract: A large fraction of the data that will be stored and accessed in future systems is expected to be unstructured, in the form of images, audio files, etc. Therefore, it is very important to design future I/O subsystems to provide efficient storage of, and access to, these vast and continuously growing repositories of unstructured data. To facilitate system design and evaluation, we first need benchmarks that capture the processing and I/O access characteristics of applications that operate on unstructured data. In this paper, we present an unstructured data processing benchmark suite that we have developed. We provide detailed descriptions of the workloads in the benchmark suite and discuss the larger space of application characteristics that each of them captures.
TL;DR: In this paper, a rule-based content mining system is described that extracts content from structured or unstructured data by converting a received file into a processable XML (pXML) file and applying rules to generate a semantic XML file.
Abstract: A system for facilitating rule-based content mining to extract content from structured or unstructured data receives a file that contains structured or unstructured data, or a mixture of both. The system then generates a processable extensible markup language (pXML) file based on the received file. The system further extracts content from the pXML file based on one or more rules and generates a semantic XML file based on a specified format.
TL;DR: In this paper, a security parameter index (SPI) of a set of structured, semi-structured or unstructured data is determined based on the content of the set of data.
Abstract: Some embodiments of high granularity reactive measures for selective pruning of information have been presented. The system and apparatus embody algorithms to automatically evaluate the security based significance (also referred to as “information enthalpy”) of a given set of structured, semi-structured or unstructured data. This is also termed the security parameter index (SPI), represented by a numerical value, and is regarded as the intrinsic property of a given set of structured, semi-structured or unstructured data. In one embodiment, a security parameter index (SPI) of a set of data is determined based on content of the set of data. If the SPI is above a predetermined threshold, then a security quotient (Sq) of the set of data is further determined based on the SPI and an action to be performed on the set of data in the current situation. Based on the value of the Sq, a data leak prevention policy is automatically defined and enforced on the set of data in the current situation. The system and apparatus also embody a Security Map that enumerates the security based inter-relationship between Agents, Data Set(s) and permissible Action(s) that can be invoked on the data. The Security Map enables automatic and dynamic generation and enforcement of security policies to prevent data leak.
TL;DR: A methodological framework for more objective E-R data modeling by eliminating the structured content-dependent metadata associated with the unstructured data is proposed, and a system called the human brain image database system (HBIDS) is developed accordingly.
Abstract: Data modeling is an essential first step for data preparation in any data mining procedure. Conventional entity-relational (E-R) data modeling is lossy, irreproducible, and time-consuming, especially when dealing with unstructured image data associated with complex systems like the human brain. We propose a methodological framework for more objective E-R data modeling by eliminating the structured content-dependent metadata associated with the unstructured data. The proposed method is applied to epilepsy-related image data and a system called the human brain image database system (HBIDS) is developed accordingly. Supported with navigation, segmentation, data fusion, and feature extraction modules, HBIDS provides a content-based support environment (C-BASE). Such an environment potentially provides unlimited (ad hoc) query support with a reproducible and efficient database schema. By switching between different modalities of data while confining the feature extractors within the object(s) of interest, HBIDS yields anatomically specific query results. The price of such a scheme is large storage requirements and relatively high computational cost. Examples of navigation through unstructured image data and content-based retrieval are presented in this paper. The results show the potential of HBIDS in content-based data management for decision support systems in real-life medical applications.
TL;DR: A conceptual model for clustering that helps focus on the data-mining process at the adequate abstraction level, and an extension of the unified modeling language (UML) by means of the UML profiling mechanism that allows clustering data-mining models to be designed on top of the MD model of a DW.
Abstract: Clustering can be considered the most important unsupervised learning technique for finding similar behaviors (clusters) in large collections of data. Data warehouses (DWs) can help users to analyze stored data, because they contain preprocessed data for analysis purposes. Furthermore, the multidimensional (MD) model of DWs intuitively represents the underlying system. However, most clustering data-mining techniques are applied at a low level of abstraction to complex unstructured data. While there are several approaches for clustering on DWs, there is still no conceptual model for clustering that facilitates modeling with this technique on the MD model of a DW. Here, we propose (i) a conceptual model for clustering that helps focus on the data-mining process at the adequate abstraction level and (ii) an extension of the unified modeling language (UML) by means of the UML profiling mechanism, allowing us to design clustering data-mining models on top of the MD model of a DW. This will allow us to avoid duplicating the time-consuming preprocessing stage and to simplify the clustering design on top of DWs, improving the discovery of knowledge.
TL;DR: A system designed to satisfy three primary goals: real-time concept mining of high-volume data streams; dynamic organization of concepts into a relational hierarchy; adaptive reorganization of the concept hierarchy in response to evolving circumstances and user feedback.
Abstract: We are concerned with the general problem of concept mining: discovering useful associations, relationships, and groupings in large collections of data. Mathematical transformation algorithms have proven effective at reducing the content of multilingual, unstructured data into a vector that describes the content. Such methods are particularly desirable in fields undergoing information explosions, such as network traffic analysis, bioinformatics, and the intelligence community. In response, concept mining methodology is being extended to improve performance and permit hardware implementation, because traditional methods are not sufficiently scalable. Hardware-accelerated systems have proven effective at automatically classifying such content when topics are known in advance. Our complete system builds on our past work in this area, presented in the Aerospace 2005 and 2006 conferences, where we described a novel algorithmic approach for extracting semantic content from unstructured text document streams. However, there is an additional need within the intelligence community to cluster related sets of content without advance training. To allow this function to happen at high speed, we have implemented a system that hierarchically clusters streaming content. The method, streaming hierarchical partitioning, is designed to be implemented in hardware and handle extremely high ingestion rates. As new documents are ingested, they are dynamically organized into a hierarchy, which has a fixed maximal size. Once this limit is reached, documents must consequently be excreted at a rate equaling their ingestion. The choice of documents to excrete is a point of interest: we present several autonomous heuristics for doing so intelligently, as well as a proposal for incorporating user interaction to focus attention on concepts of interest.
A related desideratum is robust accommodation of concept drift, the gradual change in the distribution and content of the document stream over time. Accordingly, we present and analyze experimental results for document streams evolving over time under several regimes. Current and proposed methods for concisely and informatively presenting derived content from streaming hierarchical clustering to the user for analysis are presented in this context. To support our claims of eventual hardware implementation and real-time performance with a high ingestion rate, we provide a detailed hardware-ready design, with asymptotic analysis and performance predictions. The system has been prototyped and tested on a Xeon processor as well as on a PowerPC embedded within a Xilinx Virtex2 FPGA. In summary, we describe a system designed to satisfy three primary goals: (1) real-time concept mining of high-volume data streams; (2) dynamic organization of concepts into a relational hierarchy; (3) adaptive reorganization of the concept hierarchy in response to evolving circumstances and user feedback.
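The fixed-size constraint described in this abstract, where each ingested document beyond capacity forces one to be excreted, can be illustrated with a small sketch. This is not the authors' streaming hierarchical partitioning: it uses a flat set of clusters rather than a hierarchy, and the `BoundedClusterStore` class, the cosine threshold, and the "oldest document of the largest cluster" eviction heuristic are all assumptions chosen for brevity.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BoundedClusterStore:
    """Flat, capacity-bounded clustering of a document stream (hypothetical)."""

    def __init__(self, capacity, threshold=0.8):
        self.capacity = capacity
        self.threshold = threshold
        self.clusters = []  # each cluster is a deque of (doc_id, vector)
        self.size = 0

    def ingest(self, doc_id, vector):
        """Add a document; return the id of the evicted document, or None."""
        best, best_sim = None, self.threshold
        for cluster in self.clusters:
            # Compare against the newest member so clusters can follow drift.
            sim = cosine(vector, cluster[-1][1])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            best = deque()
            self.clusters.append(best)
        best.append((doc_id, vector))
        self.size += 1
        return self._evict() if self.size > self.capacity else None

    def _evict(self):
        """Assumed heuristic: drop the oldest document of the largest cluster."""
        largest = max(self.clusters, key=len)
        doc_id, _ = largest.popleft()
        if not largest:
            self.clusters.remove(largest)
        self.size -= 1
        return doc_id


store = BoundedClusterStore(capacity=3)
stream = [("a", (1.0, 0.0)), ("b", (0.9, 0.1)), ("c", (0.0, 1.0)), ("d", (0.1, 0.9))]
evicted = [store.ingest(doc_id, vec) for doc_id, vec in stream]
print(evicted, store.size)
```

Once the store is full, every call to `ingest` returns exactly one evicted id, which mirrors the requirement that excretion keep pace with ingestion; swapping `_evict` for another rule is where the alternative heuristics or user feedback mentioned above would plug in.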