TL;DR: Machine learning and semantic orientation approaches, adapted to the movie review domain, yield results comparable to or even better than previous findings; movie review mining also proves more challenging than many other types of review mining.
Abstract: Web content mining is intended to help people discover valuable information from large amounts of unstructured data on the web. Movie review mining classifies movie reviews into two polarities: positive and negative. As a type of sentiment-based classification, movie review mining is different from other topic-based classifications. Few empirical studies have been conducted in this domain. This paper investigates movie review mining using two approaches: machine learning and semantic orientation. The approaches are adapted to the movie review domain for comparison. Our results are comparable to or even better than previous findings. We also find that movie review mining is a more challenging application than many other types of review mining. The challenges lie in the fact that factual information is always mixed with real-life review data and that ironic words are used in writing movie reviews. Future work for improving existing approaches is also suggested.
TL;DR: A system and method of making unstructured data available to structured data analysis tools is presented; it includes middleware software that can be used in combination with structured data tools to perform analysis on both structured and unstructured data.
Abstract: A system and method of making unstructured data available to structured data analysis tools. The system includes middleware software that can be used in combination with structured data tools to perform analysis on both structured and unstructured data. Data can be read from a wide variety of unstructured sources. The data may then be transformed with commercial data transformation products that may, for example, extract individual pieces of data and determine relationships between the extracted data. The transformed data and relationships may then be passed through an extraction/transform/load (ETL) layer and placed in a structured schema. The structured schema may then be made available to commercial or proprietary structured data analysis tools.
TL;DR: In this article, a user interface for parsing unstructured data using pattern recognition is presented, where the data that parses according to a pattern may be placed in a column associated with the pattern in a tabular user interface.
Abstract: A user interface for parsing unstructured data using pattern recognition. The patterns used in parsing data are formed from regular expressions. The parsed data may be displayed in a first format and unmatched strings in the unstructured text may be displayed in a second format. A format may comprise a desired color, font or any other user interface parameter. In addition, the data that parses according to a pattern may be placed in a column associated with the pattern in a tabular user interface, for example a spreadsheet like Excel™. Associating a pattern with a position to display successful matches in allows for breaking unstructured text into pieces associated with a particular field or column. Modification of the patterns allows for more and more of the unstructured text to match the patterns and when the data has been parsed to the desired level, the data may be imported into a database.
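The pattern-to-column idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three regular expressions, the column names, and the `parse_line` helper are hypothetical and do not come from the patent, and the sketch returns plain data rather than driving a user interface.

```python
import re

# Hypothetical patterns: each regular expression is tied to one column name.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def parse_line(text):
    """Return (row, unmatched).

    row maps each column name to the first match of its pattern (or None);
    unmatched is the text left over once matched spans are blanked out.
    """
    row = {}
    spans = []
    for column, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[column] = match.group(0) if match else None
        if match:
            spans.append(match.span())
    # Replace every matched span with spaces, then collapse whitespace.
    chars = list(text)
    for start, end in spans:
        chars[start:end] = " " * (end - start)
    unmatched = " ".join("".join(chars).split())
    return row, unmatched


row, unmatched = parse_line(
    "Invoice sent 2024-03-01 to bob@example.com for $19.99 total"
)
print(row)
print(unmatched)
```

A user interface of the kind described could show the `row` values in their columns and render `unmatched` in a second format; adding or refining patterns shrinks the unmatched remainder.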
TL;DR: An appliance, a process and a computer program product for processing unstructured or semi-structured digital data in a file system are presented, in which logical access to data is carried out jointly with physical access through a transparent, common access mechanism for both types of access.
Abstract: The present invention concerns an appliance, a process and a computer program product for the processing of unstructured or semi-structured digital data in a file system. In order to create an appliance, a process and a computer program product which allow simple, reliable, high-performance and purpose oriented management of every manner of digital, stored, unstructured data, it is proposed that, when accessing data, logical access be carried out jointly with physical access and, when doing so, a particularly transparent, common access mechanism be implemented for both types of access.
TL;DR: Methods and systems for managing unstructured data are presented, in which a client creates a profile (at least a first user-defined label and a user identifier) for a portion of data and transmits both to a server, which stores them automatically so that the data can later be found by a query on the label.
Abstract: Methods and systems for managing unstructured data. Embodiments involve providing a portion of data within a client in the networked computing system. A profile is created that is associated with the portion of data, the profile having at least a first user defined label and a user identifier. The portion of data and the profile are transmitted from the client to a server in the networked computing system. The portion of data, for example, a file, and the first user defined label are automatically stored into a data structure, such as a file and an associated database, on the server in response to receipt of the portion of data and the profile by the server. The data structure is subsequently identified in response to a query by the user seeking data associated with the first user defined label.
TL;DR: Systems and methods are presented for capturing unstructured data from online information services, determining speaker attributes and semantic attributes for the captured items, analyzing them together to generate processed information, and presenting that information in a report.
Abstract: Systems and methods for analysis and generating reports presenting analysis involving capturing unstructured data from online information services. Speaker attributes and semantic attributes associated with items of the captured data are determined. The captured data, speaker attributes, and semantic attributes are analyzed to generate processed information based on the captured data. A report is generated to present the processed information.
TL;DR: An adaptive information retrieval system is presented that includes a database component to store structured and unstructured data values, a search component that queries those values, and a learning component, associated with the search or database component, that facilitates retrieval of desired information.
Abstract: The subject invention relates to systems and methods that employ automated learning techniques to database and information retrieval systems in order to facilitate knowledge capabilities for users and systems. In one aspect, an adaptive information retrieval system is provided. The system includes a database component to store structured and unstructured data values. A search component queries the data values from the database, wherein a learning component associated with the search component or the database component is provided to facilitate retrieval of desired information.
TL;DR: In this article, the authors present a system and method for managing electronic files and tasks in a way that is intuitive to the users, mimicking their environment, but "Process-Blind".
Abstract: A system and method for managing electronic files and tasks in a way that is intuitive to the users, mimicking their environment, but “Process-Blind”. The system and method comprise a configurable structure that facilitates the accurate filing and subsequent locating of electronic files in underlying document/record management applications. The manager enables users, with permissions, to communicate with each other regarding these files and their work through ad-hoc workflows that are created by the user as needed, then retained as documentation of events. The manager provides an efficient, secure, auditable handling of unstructured data that is free of predetermined inflexible processes and is not dependent on specific underlying document management software.
TL;DR: Experimental results indicate that, compared with previous research on English reviews, the performance of both approaches for sentiment classification of Chinese reviews is acceptable, and that the support vector machine approach outperforms the semantic orientation approach.
Abstract: Web content mining is intended to help people discover valuable information from large amounts of unstructured data on the Web. Sentiment classification aims to mine the Web content of product reviews by classifying the reviews into positive or negative opinions. Such classification approaches could help both consumers and sellers make their decisions, but the task is complicated and challenging. This paper compares the SVM approach and the semantic orientation approach for sentiment classification of Chinese reviews and also proposes some improvements to sentiment classification approaches. Experimental results indicate that, compared with previous research on English reviews, the performance of both approaches for sentiment classification of Chinese reviews is acceptable, and that the support vector machine approach outperforms the semantic orientation approach.
TL;DR: In extracting themes from unstructured data, both text mining packages were only marginally helpful, implying that a text mining approach, which is based on analysis units other than terms, may be more powerful in extracting themes, an idea touched upon in the conclusion section.
Abstract: The purpose of this article is to review two text mining packages, namely, WordStat and SAS TextMiner. WordStat is developed by Provalis Research. SAS TextMiner is a product of SAS. We review the features offered by each package on each of the following key steps in analyzing unstructured data: (1) data preparation, including importing and cleaning; (2) performing association analysis; and (3) presenting the findings, including illustrative quotes and graphs. We also evaluate each package on its ability to help researchers extract major themes from a dataset. Both packages offer a variety of features that effectively help researchers run associations and present results. However, in extracting themes from unstructured data, both packages were only marginally helpful. The researcher still needs to read the data and make all the difficult decisions. This finding stems from the fact that the software can search only for specific terms in documents or categorize documents based on common terms. Respondents, h...
TL;DR: This paper proposes a novel approach wherein the application specifies its information needs using only a SQL query on the structured data, and this query is automatically "translated" into a set of keywords that can be used to retrieve relevant unstructured data.
Abstract: Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of seamlessly integrating critical business information distributed across both structured and unstructured data sources. In existing information integration solutions, the application needs to formulate the SQL logic to retrieve the needed structured data on one hand, and identify a set of keywords to retrieve the related unstructured data on the other. This paper proposes a novel approach wherein the application specifies its information needs using only a SQL query on the structured data, and this query is automatically "translated" into a set of keywords that can be used to retrieve relevant unstructured data. We describe the techniques used for obtaining these keywords from (i) the query result, and (ii) additional related information in the underlying database. We further show that these techniques achieve high accuracy with very reasonable overheads.
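As a rough illustration of source (i), keywords drawn from the query result, the sketch below runs a SQL query against an in-memory SQLite database and ranks the text tokens that appear in the result. Everything here is an assumption made for illustration: the `orders` table, the `keywords_from_query` helper, and the plain frequency ranking are not the paper's technique, and source (ii), related information elsewhere in the database, is not covered.

```python
import sqlite3
from collections import Counter


def keywords_from_query(conn, sql, top_n=5):
    """Run sql and return the most frequent text tokens in its result.

    Ranking is by descending frequency, then alphabetically, so the
    output is deterministic. This is a stand-in scoring rule only.
    """
    counts = Counter()
    for row in conn.execute(sql):
        for value in row:
            if isinstance(value, str):
                counts.update(token.lower() for token in value.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:top_n]]


# Hypothetical structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Acme Corp", "laser printer", "EU"),
        ("Acme Corp", "printer toner", "EU"),
        ("Globex", "office chair", "US"),
    ],
)

keywords = keywords_from_query(
    conn, "SELECT customer, product FROM orders WHERE region = 'EU'", top_n=3
)
print(keywords)
```

The resulting keyword list could then be handed to any keyword search engine over the unstructured sources; the application itself only ever writes the SQL query.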
TL;DR: In this paper, a technique for optimizing the archival and management of data stored as XML documents is presented, which is capable of handling mixed data including highly structured data and unstructured data.
Abstract: A technique for optimizing the archival and management of data stored as XML documents is capable of handling mixed data including highly structured data and unstructured data. The technique maps the structured data to a relational database while storing the unstructured data in its native XML format. The data is updated using a rules database that maps updating rules against attributes and classes of elements within the documents. A document checking/validation engine performs the updates based on rule verification. A search engine searches the documents using both a path index table and a weighted content index.
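The storage split described here, structured fields mapped to relational columns and unstructured content kept as native XML, might look like the following minimal sketch. The document layout, the `reports` table, and the `archive` helper are hypothetical; the rules database, validation engine, path index and weighted content index mentioned in the abstract are not modeled.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mixed document: <author> and <year> are highly structured,
# <body> is free text with inline markup.
DOC = """<report id="r1">
  <author>Kim</author>
  <year>2004</year>
  <body>Free text with <em>inline</em> markup.</body>
</report>"""


def archive(conn, xml_text):
    """Store structured fields as columns and the body as native XML text."""
    root = ET.fromstring(xml_text)
    author = root.findtext("author")
    year = int(root.findtext("year"))
    # Serialize the unstructured element unchanged (strip trailing whitespace).
    body_xml = ET.tostring(root.find("body"), encoding="unicode").strip()
    conn.execute(
        "INSERT INTO reports (id, author, year, body_xml) VALUES (?, ?, ?, ?)",
        (root.get("id"), author, year, body_xml),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (id TEXT PRIMARY KEY, author TEXT, year INTEGER, body_xml TEXT)"
)
archive(conn, DOC)

stored = conn.execute("SELECT id, author, year, body_xml FROM reports").fetchone()
print(stored)
```

Structured columns can then be queried with ordinary SQL, while `body_xml` can be parsed again whenever the original markup is needed.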
TL;DR: In this paper, a system and method for processing a document to generate a set of related documents is presented, which includes a textual analytics system that analyzes unstructured data contained in a source document, and extracts structured information about the source document.
Abstract: A system and method for processing a document to generate a set of related documents. A system is provided that includes a textual analytics system that analyzes unstructured data contained in a source document and extracts a set of structured information about the source document; and a compare system that identifies a set of related documents by comparing the set of structured information with metadata indexed from a set of publications.
TL;DR: A method for processing data is provided in which unstructured data and structured data are captured, associated, and correlated to define a link between them, and are then stored in a data structure based on that link.
Abstract: A method for processing data is provided. In this method, unstructured data and structured data are captured and the unstructured data is associated with the structured data. After capture, the unstructured data and the structured data are correlated to define a link between the unstructured data and the structured data. The unstructured data and the structured data then are stored in a data structure based on the link. A system for processing data also is described.
TL;DR: This work considers the question of modeling an application domain whose data may be partially structured and partially unstructured, and proposes the concept of malleable schemas as a modeling tool that enables incorporating both structured and unstructured data from the very beginning, and evolving one's model as it becomes more structured.
Abstract: Large-scale information integration, and in particular, search on the World Wide Web, is pushing the limits on the combination of structured data and unstructured data. By its very nature, as we combine a large number of information sources, our ability to model the domain in a completely structured way diminishes. We argue that in order to build applications that combine structured and unstructured data, there is a need for a new modeling tool. We consider the question of modeling an application domain whose data may be partially structured and partially unstructured. In particular, we are concerned with applications where the border between the structured and unstructured parts of the data is not well defined, not well known in advance, or may evolve over time. We propose the concept of malleable schemas as a modeling tool that enables incorporating both structured and unstructured data from the very beginning, and evolving one’s model as it becomes more structured. A malleable schema begins the same way as a traditional schema, but at certain points gradually becomes vague, and we use keywords to describe schema elements such as classes and properties. The important aspect of malleable schemas is that a modeler can capture the important aspects of the domain at modeling time without having to commit to a very strict schema. The vague parts of the schema can later evolve to have more structure, or can remain as such. Users can pose queries in which references to schema elements can be imprecise, and the query processor will consider closely related schema elements as well.
TL;DR: It is shown that, despite the apparent diversity, two basic principles underlie the recent approaches: first, use structured machines to learn structured data; second, learn representations instead of handcrafting them, which proved very successful for handling structured data, to the point of generating a novel branch of numerical machine learning.
TL;DR: A system and method of retrieving data from a database comprising unstructured data is presented, which comprises specifying a text analytic component in an unstructured text query at query-runtime; submitting the query to a web service database; filtering unstructured text data in that database based on constraints defined in the text analytic component; and receiving the filtered unstructured text data, wherein the text analytic component comprises metadata requirements.
Abstract: A system and method of retrieving data from a database comprising unstructured data comprises specifying a text analytic component in an unstructured text query at query-runtime; submitting the unstructured text query to a web service database; filtering unstructured text data in the web service database based on constraints defined in the text analytic component in the query; and receiving the filtered unstructured text data based on the submitted query from the web service database, wherein the text analytic component comprises metadata requirements. Preferably, the constraints comprise any of positive sentiments regarding an unstructured text document and negative sentiments regarding the unstructured text document. Alternatively, the constraints may comprise any of name spotting constraints, address spotting constraints, date spotting constraints, and entity spotting constraints. The filtering preferably occurs using a web-based callback service specified in a WFQL XML document. The database is preferably run on a WebFountain platform.
TL;DR: A method and system are provided for determining and analyzing clinical trend data, in which unstructured data at a medical facility is classified into structured data using a rule database and associated with various parameter identifiers.
Abstract: A method and system are provided for determining and analyzing clinical trend data. Unstructured data at a medical facility is classified utilizing a rule database into structured data, and associated with various parameter identifiers. The structured data is then statistically analyzed and graphical displays and/or reports are produced showing the relationships between data associated with the various parameters. The aggregation of large amounts of data from various sites permits trends to be recognized that would otherwise not be apparent.
TL;DR: The aim of the paper is to define and validate a process of knowledge discovery, starting from structured and unstructured data, in a distributed and heterogeneous environment by integrating Data and Web Mining techniques.
Abstract: This paper presents a flexible mining system built on a multi-tier architecture. The architecture of the system is designed on the Model-View-Controller design pattern. The aim of the paper is to define and validate a process of knowledge discovery, starting from structured and unstructured data, in a distributed and heterogeneous environment by integrating Data and Web Mining techniques. The system is able to provide new added value e-Knowledge services in order to meet the diverse e-Knowledge needs of e-business, by driving the user through the three stages of the knowledge discovery, data preparation, data and web mining and result analysis, in an innovative unique process of workflow that integrates a set of existing tools. Each e-Knowledge service represents the result of an orchestration of reusable building blocks, with well defined tasks, able to interoperate among them. Finally, a case study is shown.
TL;DR: The goal is to discover interesting and powerful functional integrations that permit inferential, analogical, and intelligent IR technologies to exploit each other’s strengths to mitigate their weaknesses.
Abstract: Our project is aimed at integrating and extending inferential, analogical, and intelligent IR technologies to create power tools for intelligence analysts. Our goal is to discover interesting and powerful functional integrations that permit these technologies to exploit each other’s strengths to mitigate their weaknesses. From the perspective of knowledge-based AI technology, a key goal of the project is to extend the reach of such systems into the world of unstructured data and text. From the perspective of IR technology, it is to leverage the application of inferential and analogical techniques to structured representations in order to achieve significant new functionality.
TL;DR: The aggregation service presented here eliminates the need for extensive information acquisition that is necessary for most learning algorithms, and users instead have full control over the aggregation process, including the distribution of their presence status information to interested watchers.
Abstract: Personal presence systems are widely used to become aware of other users' availability and willingness to communicate before actually contacting them. After early systems focused on ad-hoc text messaging only and relied on manual updates of status descriptions, modern applications not only integrate multimedia communication but also facilitate automatic detection of status changes. The status of devices owned by a particular user can then be used to infer the user's presence status automatically, i.e. without explicit user interaction. To achieve this, unstructured data provided from various sources is aggregated and set into relation with user-specific context information. The aggregation service presented here eliminates the need for extensive information acquisition that is necessary for most learning algorithms. Users instead have full control over the aggregation process, including the distribution of their presence status information to interested watchers.
TL;DR: A fast, reliable, accurate and on-demand way of cleansing transactional data and generating an integrated view of spend is described; the system is currently in the process of being deployed by IBM for use in its BTO practice.
Abstract: The development of an aggregate view of the procurement spend across an enterprise using transactional data is increasingly becoming a very important and strategic activity. Not only does it provide a complete and accurate picture of what the enterprise is buying and from whom, it also allows it to consolidate suppliers, as well as negotiate better prices. The importance, as well as the complexity, of this cleansing exercise is further magnified by the increasing popularity of Business Transformation Outsourcing (BTO) wherein enterprises are turning over non-core activities, such as indirect procurement, to third parties, who now need to develop an integrated view of spend across multiple enterprises in order to optimize procurement and generate maximum savings. However, the creation of such an integrated view of procurement spend requires the creation of a homogeneous data repository from disparate (heterogeneous) data sources across various geographic and functional organizations throughout the enterprise(s). Such repositories get transactional data from various sources such as invoices, purchase orders, account ledgers. As such, the transactions are not cross-indexed, refer to the same suppliers by different names, and use different ways of representing information about the same commodities. Before an aggregated spend view can be developed, this data needs to be cleansed, primarily to normalize the supplier names and correctly map each transaction to the appropriate commodity code. Commodity mapping, in particular, is made more difficult by the fact that it has to be done on the basis of unstructured text descriptions found in the various data sources. We describe an on-demand system to automatically perform this cleansing activity using techniques from information retrieval and machine learning. 
Built on standard integration and application infrastructure software, this system provides enterprises with a fast, reliable, accurate and on-demand way of cleansing transactional data and generating an integrated view of spend. This system is currently in the process of being deployed by IBM for use in its BTO practice.
TL;DR: A new architectural approach is described that enables the processing of very large data sets, yielding two orders of magnitude performance gain over conventional approaches.
Abstract: While improvements in the density of semiconductor circuitry have been dramatic, the density improvements in magnetic storage have been even greater. We now store much more data than we have time to process, implying that techniques for processing these data need to be significantly altered. This paper describes a new architectural approach that enables the processing of very large data sets, yielding two orders of magnitude performance gain over conventional approaches.
TL;DR: A system and method facilitating a unified framework for accessing structured and unstructured data is provided, in which a source document's data is parsed into a data document component providing a hierarchical representation of the data and a data set component providing a relational representation of at least a portion of it.
Abstract: A system and method facilitating a unified framework for accessing structured and unstructured data is provided. The invention includes a source document having data that is parsed into a data document component providing a hierarchical representation of data associated with the source document and a data set component providing a relational representation of at least a portion of the data associated with the source document. The invention further provides for a schema defining a structure of the relational representation to be associated with the source document and/or inferred by the data set component. Data stored in the data document component and the data set component are synchronized, thus a change made to data stored in the data set component is reflected in data stored in the data document component. Further, a change made to data stored in the data document component is reflected in data stored in the data set component if utilized according to the schema. The invention further provides for a service to access the hierarchical representation of data associated with the source document and/or a designer to access the relational representation of data associated with the source document.
TL;DR: This document recognizes the text mining market as composed of pure players (companies which develop text mining software, e.g. Clearforest, Inxight, Temis), indirect players (which integrate text mining into their offering, e.g. IBM, SAS, SPSS), and partial players (companies which use text mining to improve their core business, e.g. Fast, Verity), and notes that hundreds of other companies are already working successfully in the worldwide market.
Abstract: Several tools are already available in the quickly growing text mining (or, more generally, unstructured data) market. We recognize the text mining market as composed of pure players (companies which develop text mining software, e.g. Clearforest, Inxight, Temis), indirect players (which integrate text mining into their offering, e.g. IBM, SAS, SPSS), and partial players (companies which use text mining to improve their core business, e.g. Fast, Verity). A section is dedicated to each player that provided us with its company description, along with a cumulative section listing the best-known players in the text mining arena. Hundreds of other companies are already working successfully in the worldwide market. For a list of them, go to www.kdnuggets.com.
TL;DR: In this article, the authors propose a hierarchical organization of the conditional logic of a COTS system, which permits a high level of control over aggregated complex rule-based processing, and provides dynamic behavior.
Abstract: Processing provides a high level of automated decision support using COTS computing software and hardware. By combining information gathered from multiple structured and unstructured data sources and converting to a common protocol shared with the conditional decision logic, the operator is freed from the task of continually monitoring the situation for compliance with pre-established rules. By organizing the conditional and simulation logic of the system in a hierarchical manner, rules are applied to data-based entities, their interactions, and the overall operational situation, and then to established procedures. The hierarchical organization of the conditional logic permits a high level of control over aggregated complex rule-based processing, and provides dynamic behavior, allowing modifications of the entire system processing to be based on the simplest human interaction or a single change in the state of one data item gathered by the system.
TL;DR: The KAARE (knowledge availability, access, retrieval and extraction) system is described, a generic business model for knowledge extraction from semi-structured and unstructured data in Web pages that provides a set of generic tools enabling effective access, retrieval and filtering of information available on the World Wide Web.
Abstract: The main objective of this paper is to describe the KAARE (knowledge availability, access, retrieval and extraction) system, a generic business model for knowledge extraction of semi structured and unstructured data from Web pages. The system is ontology driven and provides a set of generic tools that will enable an effective access, retrieval and filtering of information available on the World Wide Web. The interactive model is composed of five managers namely the query manager, the ontology manager, the search manager, the information manager, and the presentation manager. Each manager is responsible for carrying out the delegated tasks from which valid inferences can be made.
TL;DR: A method is presented that organizes semi-structured data into a taxonomy using Tag-Separated (TS) clustering, grouping retrieved documents according to a selected structured attribute type and an unstructured text attribute.
Abstract: A method organizes semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering. The method comprises retrieving documents including the semi-structured data. The semi-structured data comprises structured data including structured data fields and tags, and unstructured data. The method selects a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute. The method clusters the semi-structured data from the retrieved documents into a plurality of clusters based on the selected structured attribute type and the selected unstructured attribute type. For a categorical attribute, each category corresponds to a single cluster. For a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of the numerical attribute. For an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for the annotated text data.
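Two of the per-attribute rules in this abstract can be sketched directly. The categorical rule (one cluster per category) follows the text; for the numerical rule the abstract names no specific algorithm, so the gap-based grouping below is a stand-in chosen for brevity. The record layout and the function names are hypothetical, and the monothetic clustering of annotated text is not shown.

```python
from collections import defaultdict


def cluster_categorical(records, attr):
    """One cluster per category value, as the abstract describes."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[attr]].append(rec["id"])
    return dict(clusters)


def cluster_numerical(records, attr, max_gap):
    """Stand-in 1-D clustering (an assumption, not the patented algorithm).

    Sort by the attribute and start a new cluster whenever the gap to the
    previous value exceeds max_gap.
    """
    ordered = sorted(records, key=lambda rec: rec[attr])
    clusters = []
    for rec in ordered:
        if clusters and rec[attr] - clusters[-1][-1][attr] <= max_gap:
            clusters[-1].append(rec)
        else:
            clusters.append([rec])
    return [[rec["id"] for rec in cluster] for cluster in clusters]


# Hypothetical semi-structured records reduced to their structured fields.
records = [
    {"id": 1, "color": "red", "price": 10},
    {"id": 2, "color": "blue", "price": 12},
    {"id": 3, "color": "red", "price": 50},
    {"id": 4, "color": "blue", "price": 55},
]

by_color = cluster_categorical(records, "color")
by_price = cluster_numerical(records, "price", max_gap=5)
print(by_color)
print(by_price)
```

Applying one rule after another, for example price clusters within each color cluster, gives the nested grouping that a taxonomy needs.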
TL;DR: The NEMIS project, and especially its Working Group 5, aims to identify possible applications of text mining in the world of production and dissemination of official statistics.
Abstract: There is a tremendous increase in the number of actors in the statistical arena in terms of producers, distributors, and users due to the new options of the web technology. These actors are not sufficiently informed about the technological progress made in the field of text mining and the ways in which they can benefit from these. The NEMIS project, and especially its Working Group 5, aims to identify possible applications of text mining in the world of production and dissemination of official statistics. Examples of such applications might be advanced querying of document warehouses at websites, analysing, processing and coding the answers to open-ended questions in questionnaire data, sophisticated access to internal and external sources of statistical metainformation, or to “pull” statistical data and metadata from the web sites of sending institutions.
TL;DR: Proceedings contents: an invited paper on models and indices for integrating unstructured data with a relational database, followed by contributed papers on constrained pattern mining, condensed representations, and related topics.
Abstract: Invited Paper.- Models and Indices for Integrating Unstructured Data with a Relational Database.- Contributed Papers.- Constraint Relaxations for Discovering Unknown Sequential Patterns.- Mining Formal Concepts with a Bounded Number of Exceptions from Transactional Data.- Theoretical Bounds on the Size of Condensed Representations.- Mining Interesting XML-Enabled Association Rules with Templates.- Database Transposition for Constrained (Closed) Pattern Mining.- An Efficient Algorithm for Mining String Databases Under Constraints.- An Automata Approach to Pattern Collections.- Implicit Enumeration of Patterns.- Condensed Representation of EPs and Patterns Quantified by Frequency-Based Measures.