TL;DR: In this paper, a database of web pages and sites is provided, where entries are at least partially mapped to a set of predetermined categories and the entries are associated with keywords, for instance by automatic indexing of documents.
Abstract: A database of entries, such as Web pages and sites, is provided. The entries are at least partially mapped to a set of predetermined categories. The entries are also associated with keywords, for instance, by automatic indexing of documents. In response to a query into the database, a user is provided with a series of refinement strategies, in addition to search results. Refinement strategies comprise categories relevant for the search, selected among the set of predetermined categories. Refinement strategies also include keywords dynamically selected among keywords associated with the entries. The user may easily navigate among the results to the query, and formulate new queries.
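As an illustration of the refinement idea in the abstract above, the following minimal sketch returns search results together with suggested categories and keywords drawn from those results. The toy entries, the `search` and `refinements` function names, and the frequency-based ranking are all assumptions made for the example; the abstract does not say how categories or keywords are actually selected.

```python
from collections import Counter

# Hypothetical toy database: each entry carries keywords and, optionally,
# categories (entries are only "at least partially" mapped to categories).
ENTRIES = [
    {"url": "a.example/jaguar-cars", "categories": ["Autos"],
     "keywords": ["jaguar", "car", "dealer"]},
    {"url": "b.example/jaguar-cat", "categories": ["Animals"],
     "keywords": ["jaguar", "cat", "habitat"]},
    {"url": "c.example/big-cats", "categories": ["Animals"],
     "keywords": ["jaguar", "cat", "lion"]},
    {"url": "d.example/lions", "categories": [],
     "keywords": ["lion", "habitat"]},
]


def search(query, entries=ENTRIES):
    """Return entries whose keywords contain every query term."""
    terms = set(query.lower().split())
    return [e for e in entries if terms <= set(e["keywords"])]


def refinements(query, entries=ENTRIES, top_n=3):
    """Return results plus suggested categories and keywords.

    Assumption: suggestions are ranked by how often they occur in the results.
    """
    terms = set(query.lower().split())
    results = search(query, entries)
    category_counts = Counter(c for e in results for c in e["categories"])
    keyword_counts = Counter(
        k for e in results for k in e["keywords"] if k not in terms
    )
    return {
        "results": [e["url"] for e in results],
        "categories": [c for c, _ in category_counts.most_common(top_n)],
        "keywords": [k for k, _ in keyword_counts.most_common(top_n)],
    }
```

Calling `refinements("jaguar")` on this toy data suggests the category `Animals` first and the keyword `cat` first, which the user could then add to narrow the query.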
TL;DR: Preliminary experimental results suggest that a combination of domain-specific knowledge and information extraction techniques can be used to generalize from the examples and derive more powerful features.
Abstract: The prohibitive cost of assigning indices to textual cases is a major obstacle for the practical use of AI and Law systems supporting reasoning and arguing with cases. While progress has been made toward extracting certain facts from well-structured case texts or classifying case abstracts under Key Number concepts, these methods still do not suffice for the complexity of indexing concepts in CBR systems. In this paper, we lay out how a better example representation may facilitate classification-based indexing. Our hypotheses are that (1) abstracting from the individual actors and events in cases, (2) capturing actions in multi-word features, and (3) recognizing negation, can lead to a better representation of legal case texts for automatic indexing. We discuss how to implement these techniques with state-of-the-art NLP tools. Preliminary experimental results suggest that a combination of domain-specific knowledge and information extraction techniques can be used to generalize from the examples and derive more powerful features.
TL;DR: This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling bypassing some limitations of existing pre-defined indexing schemes.
Abstract: In this paper a system for analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples and applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume application. Integration into production lines is under execution.
TL;DR: The approach to information access discussed in this paper is very promising, and it is concluded that there is much room for further research.
Abstract: The potential of automatically generated indexes for information access has been recognized for several decades (e.g., Bush 1945 [2], Edmundson and Wyllys 1961 [4]), but the quantity of text and the ambiguity of natural language processing have made progress at this task more difficult than was originally foreseen. Recently, a body of work on development of interactive systems to support phrase browsing has begun to emerge (e.g., Anick and Vaithyanathan 1997 [1], Gutwin et al. [10], Nevill-Manning et al. 1997 [17], Godby and Reighart 1998 [9]). In this paper, we consider two issues related to the use of automatically identified phrases as index terms in a dynamic text browser (DTB), a user-centered system for navigating and browsing index terms: 1) What criteria are useful for assessing the usefulness of automatically identified index terms? and 2) Is the quality of the terms identified by automatic indexing such that they provide useful access to document content? The terms that we focus on have been identified by LinkIT, a software tool for identifying significant topics in text [7]. Over 90% of the terms identified by LinkIT are coherent and therefore merit inclusion in the dynamic text browser. Terms identified by LinkIT are input to Intell-Index, a prototype DTB that supports interactive navigation of index terms. The distinction between phrasal heads (the most important words in a coherent term) and modifiers serves as the basis for a hierarchical organization of terms. This linguistically motivated structure helps users to efficiently browse and disambiguate terms. We conclude that the approach to information access discussed in this paper is very promising, and also that there is much room for further research. In the meantime, this research is a contribution to the establishment of a solid foundation for assessing the usability of terms in phrase browsing applications.
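The head/modifier organization described in the abstract above can be sketched as a simple grouping of terms under their heads. This is only an illustration: the `group_by_head` function is hypothetical, and it assumes the head of an English noun phrase is its last word, which is a crude stand-in for the linguistic analysis that a tool such as LinkIT would perform.

```python
from collections import defaultdict


def group_by_head(phrases):
    """Group index terms under their phrasal head.

    Assumption: the head is the last word of the phrase.
    """
    index = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        if not words:
            continue  # skip empty strings
        head = words[-1]
        index[head].append(" ".join(words))
    # Sort heads and the terms under each head for stable browsing order.
    return {head: sorted(terms) for head, terms in sorted(index.items())}
```

For example, `group_by_head(["filter", "coffee filter", "oil filter", "text browser"])` places the three filter terms under the head `filter`, so a user browsing that head sees the modifiers that disambiguate it.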
TL;DR: Three statistical techniques for identifying content-bearing phrases within a natural language database are presented and the effectiveness of the methods on test data is demonstrated, and it is shown how all three methods can be combined to produce a single improved method.
Abstract: An important problem in the indexing of natural language text is how to identify those words and phrases that reflect the content of the text. In general, automatic indexing has dealt with this problem by removing instances of a few hundred common words known as stop words, and treating the remaining words as though they were content bearing. This approach is acceptable for some applications such as statistical estimates of the similarity of queries and documents for the purpose of document retrieval. However, when the indexing terms are to be examined by a human as a means of accessing the literature, it greatly improves efficiency if most of the noncontent-bearing words and phrases can be eliminated from the indexing. Here we present three statistical techniques for identifying content-bearing phrases within a natural language database. We demonstrate the effectiveness of the methods on test data, and show how all three methods can be combined to produce a single improved method.
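The stop-word baseline that the abstract above contrasts with its statistical techniques can be sketched in a few lines. The stop list here is a tiny illustrative sample (real lists contain a few hundred words), and `index_terms` is a hypothetical name; the three statistical techniques themselves are not described in the abstract and are not reproduced here.

```python
import re

# Tiny illustrative stop list; an assumption, not the list used in the paper.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "for", "that", "as"}


def index_terms(text):
    """Lowercase the text, split it into words, and drop stop words.

    Every remaining word is treated as content bearing, which is the
    simplification the abstract criticizes for human-facing indexes.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

With this sketch, `index_terms("The indexing of natural language text")` keeps `indexing`, `natural`, `language`, and `text`, whether or not each word actually reflects the content.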
TL;DR: In this paper, an automatic indexing/search device in a client device, such as a Tivo, Replay TV or the like, enables the user to record and play sound/video synchronously and store index information in a storage device by extracting and indexing meaningful/structural/summary information of a multimedia stream automatically when video/sound recording of the multimedia stream is carried out.
Abstract: A client device (set-top box) can simultaneously record and play a multimedia stream transmitted from a broadcasting station or a video/radio server. The present invention provides the user with functions for searching, filtering, or browsing previously stored streams using structural, meaningful, or summarized information, according to the user's demand. It does so by installing an automatic indexing/search device in a client device such as a Tivo, Replay TV or the like, which enables the user to record and play sound/video synchronously. The device stores index information in a storage device by automatically extracting and indexing meaningful/structural/summary information of a multimedia stream while video/sound recording of that stream is carried out.
TL;DR: An automatic video indexing method is proposed that takes advantage of a characteristic of news videos, namely that similar images have similar contents, and tags indexes selectively according to typical image content.
TL;DR: This work describes how AIDAS, a software tool, automatically divides the source data into reusable chunks, how it automatically indexes these chunks and stores them in a database to enable reuse.
Abstract: Indexing large bodies of data is necessary to enable satisfactory search results. Ontologies serve as fixed vocabularies to index data from different viewpoints. We describe how AIDAS, a software tool, automatically divides the source data (PDF documents) into reusable chunks, how it automatically indexes these chunks and stores them in a database to enable reuse.
TL;DR: In this article, the authors present a system for automatic indexing and content-based retrieval of multimedia documents. The system uses multi-modal clues, obtained from three different multimedia components: audio, video, and superimposed text.
Abstract: Content-based video retrieval is emerging as an important part in the process of utilization of various multimedia documents. In this report we present a novel system for the automatic indexing and content-based retrieval of multimedia documents. We chose the domain of Formula 1 sport videos because the manual annotation of Formula 1 races is complicated and time consuming. Our system uses multi-modal clues, obtained from three different multimedia components: audio, video, and superimposed text. The audio and video feature extraction subsystems are developed to extract important parameters from multimedia documents. We also performed text detection and recognition to extract some semantic information superimposed in the Formula 1 race video. To unify the audio and video clues we employed dynamic Bayesian networks. Many experiments that we carried out are also presented, as well as the results and conclusions drawn from them.
TL;DR: An approach to automatic indexing in which medically plausible morphological units are used for cross-lingual medical document retrieval and semantic value is added to these index terms by a compact domain-specific thesaurus.
Abstract: Assisting users to search medical information on the WWW is here considered from two perspectives: the linguistic complexity of medical terms, nominal compounds in particular, and cross-lingual relationships between monolingual medical terminologies. In order to solve the first problem, we present an approach to automatic indexing in which medically plausible morphological units are used. Semantic value is added to these index terms by a compact domain-specific thesaurus. We further discuss tools for morphological segmentation and morphosemantic normalization of HTML documents, as well as an adaptation of a standard WWW search engine for morpheme-based retrieval. The second problem, cross-lingual medical document retrieval, is dealt with by defining cross-lingual equivalence relations on the emerging morpheme sets. Keywords: World Wide Web, Information Retrieval, Medical Language Processing. Introduction: Querying the WWW for languages other than English may cause problems due to differing structural patterns of medical terminology. This claim becomes already obvious at the word level of medical language. Almost all Germanic, Romanic and Slavic languages have a much richer and diversified
TL;DR: The paper presents an approach that uses geometrical properties of fuzzy hypercube space to index and retrieve cases.
Abstract: In some applications with case-based systems, the attributes available for indexing are better described as linguistic variables instead of receiving numerical treatment. In these applications, the concept of fuzzy hypercube can be applied to give a geometrical interpretation of similarities among cases. The paper presents an approach that uses geometrical properties of fuzzy hypercube space to index and retrieve cases.
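One way to picture the geometrical interpretation mentioned in the abstract above is to treat each case as a point in the unit hypercube, with one coordinate per linguistic term holding the membership degree in [0, 1]. The sketch below is an assumption-laden illustration: the `similarity` measure (one minus the normalized Hamming distance) and the `retrieve` helper are choices made for the example, since the abstract does not state which geometrical properties the paper uses.

```python
def similarity(a, b):
    """Similarity of two cases given as membership vectors in [0, 1]^n.

    Assumption: similarity is 1 minus the normalized Hamming (L1) distance
    between the two points in the fuzzy hypercube.
    """
    if len(a) != len(b):
        raise ValueError("cases must have the same number of attributes")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def retrieve(query, case_base):
    """Return the name of the stored case closest to the query point."""
    return max(case_base, key=lambda name: similarity(query, case_base[name]))


# Hypothetical case base: coordinates are memberships in ("hot", "humid").
CASES = {
    "case_hot_humid": (0.9, 0.8),
    "case_cold_dry": (0.1, 0.2),
}
```

Here `retrieve((0.8, 0.7), CASES)` selects `case_hot_humid`, because that point lies much closer to the query inside the hypercube than `case_cold_dry` does.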
TL;DR: An attempt is made to consider the elements of indexing, their pros and cons, and the argument is made that machine indexing is far too prolific in its production of index terms.
Abstract: The growth of the Internet has highlighted the use of machine indexing. The difficulties in using the Internet as a searching device can be frustrating. The use of the term "Python" is given as an example. Machine indexing is noted as "rotten" and human indexing as "capricious." The problem seems to be a lack of a theoretical foundation for the art of indexing. What librarians have learned over the last hundred years has yet to yield a consistent approach to what really works best in preparing index terms and in the ability of our customers to search the various indexes. An attempt is made to consider the elements of indexing, their pros and cons. The argument is made that machine indexing is far too prolific in its production of index terms. Neither librarians nor computer programmers have made much progress to improve Internet indexing. Human indexing has had the same problems for over fifty years.
TL;DR: This contribution considers the different representation levels and document knowledge required to do the goal driven authoring of training material from existing technical manuals and develops tools for automatic indexing in diverse domains.
Abstract: Goal driven authoring of training material from existing technical manuals requires automatic indexing of the content of the manual. In this contribution we consider the different representation levels and document knowledge required to do the task. On that basis we have developed tools for automatic indexing in diverse domains.
TL;DR: An automatic process that determines the type of transition and extracts information from it, so that this information can be used in object extraction, and its implementation, timing, and performance analysis are introduced.
Abstract: Video indexing is a technique used to extract objects within a video sequence and index them so that they can be used for future retrieval. In many video sequences, special effects such as fade and wipe are incorporated, and in some cases, it is desirable to be able to include objects within such effects for indexing. In this paper, we introduce an automatic process that determines the type of transition and extracts information from it, so that this information can be used in object extraction. Such a process consists of four stages: shot boundary refinement, shot type determination, frame reconstruction for soft transitions, and shot classification for hard transitions. In this paper, we will give the implementation, timing, and performance analysis for each stage. Long transition analysis bridges the gap between shot boundary detection and object tracking and smoothes the process of automatic video indexing for video databases.
TL;DR: A combination of XSLT and a host language to access a database with linking information and how to merge documents and links at run-time (a posteriori cross linking) is shown.
Abstract: In this paper we describe a way to enhance existing digital library documents by adding links without modifying the stored documents themselves. We show how to use a combination of XSLT and a host language to access a database with linking information and how to merge documents and links at run-time (a posteriori cross linking). Our approach is already used in the system OMNIS/2, which is an advanced meta system for existing digital library systems and enhances existing digital library systems or retrieval systems by additional storing and indexing of user-defined multimedia documents, automatic and personal linking concepts, annotations, filtering and personalization.
TL;DR: Compact graphic display of phrases from the original text is among abstracting assistance features being prototyped in the TEXNET text network management system.
Abstract: Compact graphic display of phrases from the original text is among abstracting assistance features being prototyped in the TEXNET text network management system. Compaction is achieved by embedding subphrases and by enabling the user to select rapidly word by word. Phrases displayed would not necessarily be those selected for automatic indexing.
TL;DR: A novel local mutation-based method to detect and locate captions according to shape-regularity and connectivity of the local mutation region, which is general, effective, real time and robust enough for caption detection and retrieval, and further for use in news indexing.
Abstract: Generating semantically meaningful content of news video has been increasingly spotlighted. Captions in video include useful information for automatic annotation and indexing. Unfortunately, there are difficulties in retrieving captions for most real applications because of the following factors: low resolution characters and extremely complex backgrounds. This paper proposes a novel local mutation-based method to solve these problems. We detect and locate captions according to shape-regularity and connectivity of the local mutation region. The overall experimental results show that our approach is general, effective, real time and robust enough for caption detection and retrieval, and further for use in news indexing.
TL;DR: In this paper, a print job is carried out via a corresponding client by a standard printer driver which converts the file into PostScript (RTM) format, and if an index file (4) is missing, an automatic indexing is performed.
Abstract: From an application, a print job (1) is carried out via a corresponding client by a standard printer driver which converts the file into PostScript (RTM) format. The print job is supplied to a hardware module which provides a printer interface for the network. Here, a special software module performs a conversion to PDF format. If an index file (4) is missing, an automatic indexing is performed.
TL;DR: The CARMEN project is based on a new indexing method that provides access to heterogeneous data.
Abstract: The CARMEN project is based on a new indexing method that provides access to heterogeneous data.
TL;DR: A method for automatic classification of offensive play patterns in soccer games has been developed using the neural networks technique to obtain data that define the positions of both a player and the ball on the ground.
Abstract: A method for automatic classification of offensive play patterns in soccer games has been developed using the neural networks technique. Back-propagation (BP) neural network techniques have been applied to obtain data that define the positions of both a player and the ball on the ground. The offensive play patterns that have been formulated from the group formations enable automatic indexing of the highlights of soccer games. Excerpts from actual soccer games, including some from the 1998 French World Cup, yielded 297 video clips, which were categorized into the following five types of patterns: Left-Running 60, Right-Running 74, Center-Running 72, Corner-Kick 39, and Free-Kick 52. Examination of the results shows the following rates of satisfactory pattern recognition: Left-Running 91.7%, Right-Running 100%, Center-Running 87.5%, Corner-Kick 97.4%, and Free-Kick 75%.
TL;DR: This work starts by using morphological and syntactic information to locate candidate noun phrases, and then it uses statistical information to improve result accuracy.
Abstract: Terminological acquisition is an important issue when learning about Natural Language Processing (NLP) due to the constant terminological renewal caused by technological changes. Terms play a key role in several NLP activities such as machine translation, automatic indexing, text understanding, and information retrieval. This is especially true at this time when corpora in electronic format keep growing in number and variety. In this work we start by using morphological and syntactic information to locate candidate noun phrases, and then we use statistical information to improve result accuracy.
TL;DR: Two hypotheses involving the use of linguistic analysis to improve the automatic creation of indexes are explored, finding that if it is possible to develop algorithms that index only the words which most closely capture the document's content, the retrieval of irrelevant documents might be minimized.
Abstract: The primary goal of the ELIXIR project was to achieve improvements in the performance of text-retrieval systems by confronting some of the challenges presented by the indexing of full-text documents. We explored two hypotheses involving the use of linguistic analysis to improve the automatic creation of indexes: (1) if it is possible to develop algorithms that index only the words which most closely capture the document's content (not the complete text, as is done in existing systems), the retrieval of irrelevant documents might be minimized; (2) if it is possible to create indexing algorithms that perform a minimal syntactic parse required to capture important phrases, the index may represent the document's content even more accurately, increasing the likelihood that the article can be retrieved by a relevant query.
TL;DR: This essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable, and concludes that human analysis should be used on a much more selective basis.
Abstract: Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Perez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.
TL;DR: A large-scale ANN-based filter is trained and it is revealed that the ANN relevance prediction out-performs the prediction of the IF system and implies that the users underestimate the importance of some terms, failing to include them in their profiles.
Abstract: Information filtering (IF) systems usually filter data items by correlating a vector of terms that represent the user profile with similar vectors of terms that represent data items. Terms that represent data items can be determined by experts or automatic indexing methods. In this study we employ an artificial neural network (ANN) as an alternative method for both IF and term selection and compare its effectiveness to that of “traditional” methods. In an earlier study we developed and examined the performance of an IF system that employed content-based and stereotypic rule-based filtering methods in the domain of e-mail messages. In this study, we train a large-scale ANN-based filter, which uses meaningful terms in the same database as input, and use it to predict the relevance of those messages. Our results reveal that the ANN relevance prediction out-performs the prediction of the IF system. Moreover, we found very low correlation between the terms in the user profile (explicitly selected by the users) and the positive causal-index (CI) terms of the ANN, which indicate the relative importance of terms in messages. This implies that the users underestimate the importance of some terms, failing to include them in their profiles. This may explain the rather low prediction accuracy of the IF system.
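The "traditional" filtering step described at the start of the abstract above, correlating a user-profile term vector with item term vectors, can be sketched with cosine similarity. This is an illustration only: the whitespace tokenizer, the `filter_items` name, and the 0.3 threshold are assumptions, and the sketch does not model the paper's rule-based filtering or its ANN.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term vector (assumption: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def filter_items(profile_text, items, threshold=0.3):
    """Keep items whose similarity to the user profile meets the threshold."""
    profile = term_vector(profile_text)
    return [item for item in items
            if cosine(profile, term_vector(item)) >= threshold]
```

With the profile `"budget meeting"`, the message `"budget meeting tomorrow"` passes the filter while `"lunch menu today"` does not. A profile that omits an important term simply never matches on it, which is the weakness the abstract attributes to user-selected profiles.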
TL;DR: It is found that research comparing human versus automatic approaches has done little to change strongly held beliefs, and automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.
Abstract: Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Perez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.