TL;DR: This paper presents a general system architecture tailored to perform searching, filtering, compression, encryption, and other operations on unstructured data streaming from a disk system by providing for parallelism, hardware-application specialization and reconfiguration, and hardware placement near the disk systems.
Abstract: This paper presents a general system architecture tailored to perform searching, filtering, compression, encryption, and other operations on unstructured data streaming from a disk system. The system achieves high performance on such applications by providing for parallelism, hardware-application specialization and reconfiguration, and hardware placement near the disk systems. A limited prototype of a single compute node has been implemented and is described. The prototype is tailored to applications involving complex searching and its performance is compared to a pure software implementation having the same search capabilities. Performance is considered in terms of data set size, query string hit rate and query complexity. Performance results as a function of these parameters are presented and the results indicate that, for data set sizes above 1.4 MB, the prototype compute node is between one and two orders of magnitude faster than a pure software implementation. At high data set sizes, on an individual node, speedups of about 200 and a sustained throughput of 300 MB/sec have been achieved.
TL;DR: This paper demonstrates a system that can accurately produce a volume rendering of an unstructured mesh with a first-order approximation to any classification method and is capable of rendering over 300 thousand tetrahedra per second yet is independent of the classification scheme used.
Abstract: In this paper, we describe an unstructured mesh volume renderer. Our renderer is interactive and accurately integrates light intensity an order of magnitude faster than previous methods. We employ a projective technique that takes advantage of the expanded programmability of the latest 3D graphics hardware. We also analyze an optical model commonly used for scientific volume rendering and derive a new method to compute it that is very accurate but computationally feasible in real time. We demonstrate a system that can accurately produce a volume rendering of an unstructured mesh with a first-order approximation to any classification method. Furthermore, our system is capable of rendering over 300 thousand tetrahedra per second yet is independent of the classification scheme used.
TL;DR: This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems.
Abstract: Unstructured information represents the vast majority of data collected and accessible to enterprises. Exploiting this information requires systems for managing and extracting knowledge from large collections of unstructured data and applications for discovering patterns and relationships. This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems. It also introduces the Unstructured Information Management Architecture (UIMA) and provides the context for the unstructured information management (UIM) papers that follow.
TL;DR: In this article, a machine-learned statistical model is used to map between unstructured data and structured data, which can be easily trained for new and different locations or domains and can accommodate for inputs which are unseen in the training data.
Abstract: The present invention uses a machine-learned statistical model to map between unstructured data and structured data. By using machine learning techniques, the present parsing engine can be very quickly and easily trained for new and different locations or domains and can also accommodate for inputs which are unseen in the training data.
TL;DR: In this article, a system and related techniques generate and maintain a unified product index, to perform searching and browsing in an online product, service, content or information catalog, to enable users to transparently move between browsing the Web site and searching the web site, refining their search activity in a seamless fashion.
Abstract: A system and related techniques generate and maintain a unified product index, to perform searching and browsing in an online product, service, content or information catalog. A user investigating, for example, a set of retail offerings of digital cameras may for instance browse through a Web site layout or taxonomy to locate products of interest, such as cameras having resolution in the range of 3-4 megapixels or in the price range of $200-300. Alternatively, the user may input search terms in a search dialogue box to locate those or other features. Unlike conventional e-commerce platforms in which search may be performed against structured databases while browsing may access unstructured HTML or other descriptive material, according to the invention navigation and searching may be integrated and both access a structured index derived from product descriptions as well as traditional SQL or other structured data. A user may thus transparently move between browsing the Web site and searching the Web site, refining their search or browsing activity in a seamless fashion. The HTML or other unstructured data may in one regard be processed by an index engine to identify product attributes such as type (electronics), size, price, weight or other specifications as well as attribute values, which may be sorted or stored in a separate table. A set of results generated by conducting an initial search may thus be continued or refined by further browsing down to a particular product or other level of detail. Conversely a user who has browsed to a given level of detail, such as digital camcorders priced less than $700, may initiate a search through products at that level of the hierarchy without resetting search terms or position in the taxonomy. Greater ease of use, more efficient location of products or services and fewer dead-end pathways are thus achieved.
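The interplay of browsing and searching over one structured index can be illustrated with a minimal sketch. It is not the patented implementation: the class name `UnifiedIndex`, its methods, and the sample products are hypothetical, and the attributes are supplied directly rather than extracted from HTML by an index engine.

```python
from typing import Any, Dict, Optional, Set


class UnifiedIndex:
    """Hypothetical single index that serves both browsing and keyword search."""

    def __init__(self) -> None:
        # Structured attributes per product (assumed already extracted).
        self.attributes: Dict[str, Dict[str, Any]] = {}
        # Inverted index: description word -> product ids.
        self.postings: Dict[str, Set[str]] = {}

    def add(self, product_id: str, description: str, category: str, price: float) -> None:
        self.attributes[product_id] = {"category": category, "price": price}
        for word in description.lower().split():
            self.postings.setdefault(word, set()).add(product_id)

    def browse(self, category: str, max_price: float,
               within: Optional[Set[str]] = None) -> Set[str]:
        """Narrow by taxonomy position, optionally inside earlier search results."""
        candidates = set(self.attributes) if within is None else within
        return {p for p in candidates
                if self.attributes[p]["category"] == category
                and self.attributes[p]["price"] <= max_price}

    def search(self, query: str, within: Optional[Set[str]] = None) -> Set[str]:
        """Keyword search, optionally inside the current browse level."""
        result = set(self.attributes) if within is None else set(within)
        for word in query.lower().split():
            result &= self.postings.get(word, set())
        return result


index = UnifiedIndex()
index.add("cam1", "digital camcorder with optical zoom", "camcorder", 650.0)
index.add("cam2", "digital camcorder with night mode", "camcorder", 550.0)
index.add("cam3", "professional camcorder with optical zoom", "camcorder", 1200.0)

# Browse to "camcorders priced under $700", then search within that level.
level = index.browse("camcorder", max_price=700.0)
hits = index.search("optical zoom", within=level)
```

Because both operations return and accept plain sets of product ids, a search can continue from a browse position and a browse can refine search results without resetting either.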
TL;DR: In this paper, a search request is received from a requester including one or more search terms and a set of objects are located that fulfill the search request, and a relevancy score is computed for the objects based on whether the object(s) include structured or unstructured data.
Abstract: A search request is received from a requester including one or more search terms. In response to the search request, one or more objects are located that fulfill the search request. A relevancy score is computed for the one or more objects based on whether the object(s) include structured or unstructured data. The relevancy scores enable the requestor to determine the content relevancy of one or more objects.
TL;DR: In this article, a data processing method for automatically identifying the underlying syntaxes of unstructured data items is presented, where data items are strings that include incomplete syntactical information but implicitly are characterized by a nontrivial syntax.
Abstract: A data processing method for automatically identifying the underlying syntaxes of unstructured data items, where unstructured data items are strings that include incomplete syntactical information but implicitly are characterized by a nontrivial syntax. The method comprises receiving input of unstructured data items into a processing machine memory; and recognizing the underlying syntaxes of the data items by the processing machine by applying pattern recognition techniques, wherein this step comprises identifying potential syntax components; and combining the components until the underlying syntaxes emerge.
TL;DR: A selection of web-based tools for annotating genes with biological information can be found in this article, where the authors discuss a selection of these tools with regards to their scope, limitations and ease of use.
TL;DR: This paper explores the advantages and highlights the potential limitations of using off-line information to generate approximations of dynamic systems; store-and-retrieve approaches utilize previously generated information about the kinetic evolution of the reaction system in order to build implicit approximations.
TL;DR: The most important schemes contributed from the various communities to date are surveyed, by commenting on the following aspects: optimization techniques, the role of normalizations, setting the parameters, computing time, quality of results, and the integration of external knowledge.
Abstract: Dimension reduction techniques have been a successful avenue for automatically extracting the “concepts” underlying unstructured data, a task that naturally arises in fields as diverse as information retrieval, image processing, social science, etc. It is surprising how much can be achieved for this task using only the raw data itself, without resorting to any additional knowledge or intelligence. We will survey the most important schemes contributed from the various communities to date, by commenting on the following aspects: optimization techniques, the role of normalizations, setting the parameters, computing time, quality of results, and the integration of external knowledge.
TL;DR: A logical model and algebra which represents a step further in the process of bridging the gap between different data modeling approaches, based on set theory and data graphs, which deals with large collections of data instances and is orthogonal to the language used to internally manipulate them.
Abstract: At the present time, the way in which we manage data depends on its structural features. In this report we propose a logical model and algebra which represents a step further in the process of bridging the gap between different data modeling approaches. In particular, the focus is on structured and semistructured data. Our model is based on set theory, as in the relational context, and on data graphs, logical data structures which are simple, mathematically defined, and very expressive. Our approach is parametric and flexible enough to adapt to heterogeneous application contexts, by simply tuning its parameters. The algebra is composed of a small and expressive set of operators. It deals with large collections of data instances, and it is orthogonal to the language used to internally manipulate them. In this way, we clearly distinguish between two levels of data manipulation: the internal one, for navigation and modification of data graphs, and the external one, for manipulation of sets of instances. 1. Department of Computer Science, University of Bologna, Mura Anteo Zamboni 7, 40127 Bologna, Italy 2. Department of Mathematics and Informatics, University of Camerino, Via Madonna delle Carceri 9, 62032 Camerino MC, Italy
TL;DR: The lessons learned from these experiments include the application of the rich metadata message approach, choosing large size of unstructured data but limiting the structured message's sizes, and minimizing COTS software customization.
Abstract: Web services have been opening a wide avenue for software integration. We have reported our experiments with three applications that are built by utilizing and providing Web services for geographic information systems (GIS). The services are designed to handle a large number of concurrent requests. It is clear that performance has to be the central consideration in design of GIS Web services. The lessons learned from these experiments include the application of the rich metadata message approach, choosing large size of unstructured data but limiting the structured message's sizes, and minimizing COTS software customization.
TL;DR: This work presents a system architecture for a digital library that includes provisions for storing and retrieving data with differing structure levels (ranging from highly unstructured to highly structured), and instantiated this architecture via a practical implementation that integrates technologies such as relational and novel XML databases as well as indexing and information retrieval techniques.
Abstract: This work presents an approach to the design of digital libraries by raising and addressing issues involved in supporting services for collections of structured, semi-structured and unstructured objects. We present a system architecture for a digital library that includes provisions for storing and retrieving data with differing structure levels (ranging from highly unstructured to highly structured). We have successfully instantiated this architecture via a practical implementation that integrates technologies such as relational and novel XML databases as well as indexing and information retrieval techniques. A number of digital library applications are using these facilities for handling differing structure levels in an integrated fashion.
TL;DR: In this paper, a document distribution path is developed as a directional graph that is a representation of the historic dependencies between documents, which is constructed in real time as documents are created.
Abstract: A technique for efficient representation of dependencies between electronically-stored documents, such as in an enterprise data processing system. A document distribution path is developed as a directional graph that is a representation of the historic dependencies between documents, which is constructed in real time as documents are created. The system preferably maintains a lossy hierarchical representation of the documents indexed in such a way that allows for fast queries for similar but not necessarily equivalent documents. A distribution path, coupled with a document similarity service, can be used to provide a number of applications, such as a security solution that is capable of finding and restricting access to documents that contain information that is similar to other existing files that are known to contain sensitive information.
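A minimal sketch of the two ingredients described above: a directed graph of document dependencies built as documents are created, and a lossy similarity lookup. The names `DistributionPath`, `fingerprint`, and `jaccard` are hypothetical, word shingles stand in for the unspecified lossy representation, and the similarity query is a linear scan rather than the hierarchical index the abstract mentions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def fingerprint(text: str, k: int = 3) -> Set[str]:
    """Lossy representation (assumed): the set of k-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class DistributionPath:
    """Directed graph of 'derived from' edges, updated as documents are added."""

    def __init__(self) -> None:
        self.children: Dict[str, List[str]] = defaultdict(list)
        self.prints: Dict[str, Set[str]] = {}

    def add(self, doc_id: str, text: str, derived_from: Optional[str] = None) -> None:
        self.prints[doc_id] = fingerprint(text)
        if derived_from is not None:
            self.children[derived_from].append(doc_id)

    def descendants(self, doc_id: str) -> Set[str]:
        """All documents that historically depend on doc_id."""
        seen: Set[str] = set()
        stack = [doc_id]
        while stack:
            for child in self.children.get(stack.pop(), []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def similar_to(self, doc_id: str, threshold: float = 0.5) -> Set[str]:
        """Documents whose fingerprint is similar, but not necessarily equal."""
        target = self.prints[doc_id]
        return {d for d, p in self.prints.items()
                if d != doc_id and jaccard(target, p) >= threshold}
```

For the security use case, a policy could restrict access to the union of `descendants(sensitive_id)` and `similar_to(sensitive_id)`, so that both tracked copies and untracked near-duplicates of a sensitive file are covered.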
TL;DR: A data warehouse is described, developed from the registration web pages at Union College, which allows faculty and students to get on-line access to course enrollment trends, classroom availability, student class schedules, and other pertinent information.
Abstract: Data warehousing is the ability to collect information from various data repositories and combine them into a single structured repository that can be queried for new information such as performance trends, decision modeling, predictions, and association rules. Internet web sites are data repositories containing useful but unstructured data. In this paper, we describe a data warehouse, developed from the registration web pages at Union College, which allows faculty and students to get on-line access to course enrollment trends, classroom availability, student class schedules, and other pertinent information. The results of this project were so successful in the type of information that could be obtained that the administration became concerned about student privacy issues.
TL;DR: A recommender system based on uncontrolled or unstructured data would allow auction websites to benefit from this technology and help provide a higher level of service to their customers.
Abstract: Recommender systems have become an important tool for marketing products to customers by learning and monitoring their behavior. Many of the current recommender systems available use structured information to make the "suggestions" which over time have been proven to accurately suggest products which customers would want to view. Amazon.com has implemented an excellent recommender system that uses structured data. A recommender system based on uncontrolled or unstructured data would allow auction websites to benefit from this technology. Websites such as ebay.com and ubid.com could integrate a recommender system into their websites and provide a higher level of service to their customers.
TL;DR: Transforming and merging heterogeneous information from various formats into a unique format (what they call Format Fusion) is a good basis for Intelligence Fusion and becomes a material upon which powerful data analysis can be performed.
Abstract: People have to deal with an impressive continuum of representations, from fully numeric and structured to totally textual and unstructured. Solving this situation of heterogeneity is a prerequisite to information fusion processes and algorithms. However, in the human brain, the distinction between structured and unstructured data simply does not exist. Humans are easily able to merge information coming from heterogeneous sources. How can computers mimic this extraordinary capability? The solution is to represent information in machines in a way that is similar to the way it is represented in the human brain. This subject was studied for years in the field of Artificial Intelligence, and, as early as the 1950s, the concept of Semantic Nets arrived to meet this challenge. Semantic Nets are an extremely efficient and human friendly way of representing complex information. The authors started developing and using a tool dedicated to the management of Semantic Nets in the early 1990s. Their first experiments with Ideliance showed that Semantic Nets can play an important role in Symbolic Intelligence Fusion. Transforming and merging heterogeneous information from various formats (databases, tables, messages, texts) into a unique format (what they call Format Fusion) is a good basis for Intelligence Fusion. First, it offers an efficient support for "manual" seamless inspection and navigation of the whole set of information. Second, it becomes a material upon which powerful data analysis (distance and cluster computation) can be performed. They call this process "Litteratus Calculus." Their conjecture is that the objects resulting from this analysis form the backbone of the Intelligence Fusion process, which, ultimately, is the domain of human decision. Twenty-two briefing charts summarize the presentation.
TL;DR: A flexible mining architecture able to define and validate a process, generally applicable in different e-business sectors, for providing new added value e-Knowledge services is presented.
Abstract: In this paper we present a flexible mining architecture able to define and validate a process, generally applicable in different e-business sectors, for providing new added value e-Knowledge services. The architecture is designed on the Model-View-Controller pattern in order to get a clear separation of the component functionalities and covers the whole process of Knowledge Discovery in Databases (KDD) and in Text (KDT) for the extraction of patterns starting from structured and unstructured data. When a service request comes to the system, it is received by a Controller that calls one or more Miners to provide the results. The Miners represent the Model of the system and, by using a Kernel, dynamically activate either the KDD or the KDT process, depending on the type of service. As View, the system makes use of the CWM standard for the representation of metadata about models and results of the mining processes. The Kernel includes two different Focuses, for the selection of structured and unstructured data. The results of the Focuses are passed to a single Pattern Extraction step, where Web and Data Mining algorithms are collected for the analysis. Finally, an Evaluation step interprets the utility of the extracted patterns. The proposed solutions are suited to a distributed mining environment, where a set of services are managed and made available as a means of meeting the diverse needs of the e-business world.
TL;DR: The WEAVE® system automates the structuring and collating of unstructured data from multiple on-line Websites and uses this coherent view of MRO data to allow a user to quickly locate and compare MRO products.
Abstract: Gleaning consistent and complete data from multiple sources of unstructured information is often a difficult and time consuming process. In this paper we outline the WEAVE® system which automates the structuring and collating of unstructured data from multiple on-line Websites. WEAVE® is presented in the context of the maintenance, repair, and operations supply chain. The underlying knowledge representation for WEAVE® is an MRO product ontology. This ontology drives classification of product descriptions harvested from Websites and attribute value extraction from the descriptions. The system uses logic programming to manage the ontology driven classification and extraction and the Java 2 Enterprise Edition platform and Open Business Engine workflow engine to continuously harvest and collate data from multiple MRO catalog Websites. It uses this coherent view of MRO data to allow a user to quickly locate and compare MRO products.
TL;DR: This dissertation contains a set of contributions that deal with search or classification of non-textual information, including search in music, classifying digitally sampled music, visualization and navigation in search results, and classifying images and Internet sites.
Abstract: This dissertation contains a set of contributions that deal with search or classification of non-textual information. Each contribution can be considered a solution to a specific problem, in an attempt to map out a common ground. The problems cover a wide range of research fields, including search in music, classifying digitally sampled music, visualization and navigation in search results, and classifying images and Internet sites. On classification of digitally sampled music, a method for extracting the rhythmic tempo was disclosed. The method proved to work on a large variety of music types with a constant audible rhythm. Furthermore, these rhythmic properties proved useful in classifying songs into music groups or genres. On search in music, a technique is presented that is based on rhythm and pitch correlation between the notes in a query theme and the notes in a set of songs. The scheme is based on a dynamic programming algorithm which attempts to minimize the error between a query theme and a song. This operation includes finding the best alignment, taking into account skipped notes and additional notes, use of different keys, tempo variations, and variances in pitch and time information. On image classification, a system for classifying whole Internet sites based on the image content was proposed. The system was composed of two parts: an image classifier and a site classifier. The image classifier was based on skin detection, object segmentation, and shape, texture and color feature extraction with a training scheme that used genetic algorithms. The image classification method was able to classify images with an accuracy of 90%. By classifying multi-image Internet web sites this accuracy was drastically increased using the assumption that a site only contains one type of image. This assumption can be defended for most cases. On search result visualization and navigation, a system was developed involving the use of a state-of-the-art search engine together with a graphical front end to improve the user experience associated with search in unstructured data. Both structured and unstructured data with the help of entity extraction can be indexed in a modern search engine. Combining this with a multidimensional visualization based on heatmaps with navigation capabilities was shown to improve the data value and search experience on current search systems.
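The dynamic-programming alignment used for search in music can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's algorithm: pitches are taken to be MIDI-style integers, key independence is obtained by comparing successive pitch intervals, skipped and additional notes are approximated by a fixed `gap_cost` per unmatched interval, and tempo or timing information is ignored. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def intervals(pitches: Sequence[int]) -> List[int]:
    """Successive pitch differences; identical for a theme played in any key."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def best_match(query: Sequence[int], song: Sequence[int],
               gap_cost: float = 1.0) -> Tuple[float, int]:
    """Return (minimum alignment error, end position in the song's intervals).

    The query may begin anywhere in the song at no cost, so the first
    row of the table is all zeros.
    """
    q, s = intervals(query), intervals(song)
    prev = [0.0] * (len(s) + 1)
    for i in range(1, len(q) + 1):
        # Column 0: the first i query intervals matched against nothing.
        curr = [prev[0] + gap_cost] + [0.0] * len(s)
        for j in range(1, len(s) + 1):
            mismatch = abs(q[i - 1] - s[j - 1])
            curr[j] = min(
                prev[j - 1] + mismatch,  # align query interval with song interval
                prev[j] + gap_cost,      # query interval with no counterpart in the song
                curr[j - 1] + gap_cost,  # song interval with no counterpart in the query
            )
        prev = curr
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


song = [55, 57, 67, 69, 71, 72, 60]
theme = [60, 62, 64, 65]              # same contour as 67, 69, 71, 72
error, end = best_match(theme, song)  # error is 0.0 for an exact transposed match
```

Ranking a collection then amounts to calling `best_match` for each song and sorting by the returned error.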
TL;DR: One set of papers in this year's Data Warehousing and Business Intelligence Minitrack investigates how the field is meeting changing business needs.
Abstract: Data warehousing and business intelligence continues to evolve as a field to address new trends in the marketplace, such as compliance and privacy, managing and leveraging unstructured data, and real-time, tactical decision making. One set of papers in this year's Data Warehousing and Business Intelligence Minitrack investigates how the field is meeting changing business needs.
TL;DR: This paper gives a preview of the most popular method, based on RDF triples, and suggests a way to automate topic map creation from unstructured information sources, which can be applied in the information systems development domain when analyzing vast unstructured data repositories in preparation for system design.
Abstract: There is an increasing interest in automating creation of semantic structures, especially topic maps, by taking advantage of existing, structured information resources. This paper gives a preview of the most popular method — based on RDF triples, and suggests a way to automate topic map creation from unstructured information sources. The method can be applied in information systems development domain when analyzing vast unstructured data repositories in preparation for system design, or when migrating large amounts of unstructured data from legacy systems. There are two innovative methods presented in the paper — Term Crawling (TC) and Clustering History Projection (CHP), which are used in order to build a topic map based on free text documents downloaded from the Internet. A sample tool, which uses described techniques, has been implemented. The preliminary results that have been achieved on the test collection are presented in concluding sections of the article.
TL;DR: Information retrieval (IR) deals with the representation, storage, and organization of unstructured data, applying natural-language processing, semantic relationships, linguistic analyses, behavioral histories, and fuzzy statistical techniques to help human beings quickly find and retrieve the information they seek.
Abstract: Information retrieval (IR) deals with the representation, storage, and organization of unstructured data, applying natural-language processing, semantic relationships, linguistic analyses, behavioral histories, and fuzzy statistical techniques to help human beings quickly find and retrieve the information they seek.
TL;DR: The KDD-based technology intelligence system KDD/TIS is presented and evaluated; it encompasses techniques for mining quantitative, structured, as well as qualitative semi- and unstructured data, addresses all relevant environmental dimensions, and is designed as an open system.
Abstract: As a consequence of the increasing complexity and dynamics of the environment in which companies operate, technology intelligence systems (TISs) - aiming primarily at the early discovery of new technologies with strategic relevance - have recently begun to gain importance. Two generic approaches, an information technology-driven and a human resource-driven approach, have been suggested for the construction of such systems. The recent developments in knowledge discovery in databases (KDD), with respect to text mining, offer the possibility to integrate the various approaches available under a common architecture. Based on discussions of KDD and TI, the KDD-based technology intelligence system KDD/TIS is presented and evaluated. It integrates information technology and human resource-driven approaches and encompasses techniques for mining quantitative, structured, as well as qualitative semi- and unstructured data. It addresses all relevant environmental dimensions and is designed as an open system. Based on the evaluation of KDD/TIS, future research requirements are identified.
TL;DR: This paper first introduces what text mining is, then analyzes the operating process and implementation techniques of text mining, and finally outlines the future of text mining applications.
Abstract: Text mining is an important aspect of information mining and is used for knowledge discovery in text information. Text mining mainly deals with incomplete data, unstructured data and character data. This paper first introduces what text mining is, then analyzes the operating process and implementation techniques of text mining, and finally outlines the future of text mining applications.
TL;DR: This paper has developed and used a tool employing an indigenous technique, Recursive Noise Removal (RNR), based on the crossing minimization paradigm, for automated detection of hidden patterns in Agro-Meteorological data.
TL;DR: The concept of cell marketing is described and why it is positioned as more specific and customer focused than segmentation processes alone and its broader applicability to any industry segment via named entity extraction methods from unstructured data sources.
Abstract: Cell marketing is a new marketing method derived from complex systems research. It provides marketers with new insights as to 'who's who in the (customers & prospects) zoo'. This paper describes for the first time the concept of cell marketing and why it is positioned as more specific and customer focused than segmentation processes alone. A case study is presented which involves a telecommunications company's attempts to more effectively target premium offers to key parties within its existing base. The paper also describes usage for customer acquisition purposes and its broader applicability to any industry segment via named entity extraction methods from unstructured data sources.
TL;DR: With further emergence of the Web and multimedia the importance of systems that can manage and search audio-visual data has arisen, and information retrieval systems took over, providing search methods for unstructured textual documents.
Abstract: Database management systems have been extensively used for more than 30 years as a standard tool for manipulating large amounts of alphanumeric data. They allow efficient and fast access to stored data taking the advantage of the fact that data is structured. With the emergence of the Web, large volumes of unstructured data have become available. As database management systems were not capable of storing and searching that data efficiently, information retrieval systems took over, providing search methods for unstructured textual documents. However, with further emergence of the Web and multimedia the importance of systems that can manage and search audio-visual data has arisen.
TL;DR: A new web-based tool for functional annotations of microarray data that enables evaluating the functional significance of experiment results through statistical indexes and graphical views is created.
Abstract: Microarray technology is generating a massive amount of unstructured data. To uncover useful information, this vast quantity of data needs to be studied with statistical approaches and enriched with biologically relevant annotations. Most of the databases and tools created to annotate genes are useful for studying one gene at a time but do not allow batch analyses of many genes, as microarray experiments require. We created a new web-based tool for functional annotations of microarray data that enables evaluating the functional significance of experiment results through statistical indexes and graphical views.