TL;DR: This work covers the environment in which texts are indexed and abstracted, methods for automatic indexing and automatic abstracting, and applications to summarizing legal cases and indexing magazine articles.
Abstract: Preface. Acknowledgements. Part I: The Indexing and Abstracting Environment. 1. The Need for Indexing and Abstracting Texts. 2. The Attributes of Text. 3. Text Representations and Their Use. Part II: Methods of Automatic Indexing and Abstracting. 4. Automatic Indexing: The Selection of Natural Language Index Terms. 5. Automatic Indexing: The Assignment of Controlled Language Index Terms. 6. Automatic Abstracting: The Creation of Text Summaries. Part III: Applications. 7. Text Structuring and Categorization When Summarizing Legal Cases. 8. Clustering of Paragraphs when Summarizing Legal Cases. 9. The Creation of Highlight Abstracts of Magazine Articles. 10. The Assignment of Subject Descriptors to Magazine Articles. Summary and Future Prospects. References. Subject Index.
TL;DR: This article investigates the combination and boundary detection approaches based on mutual information for segmentation in Chinese indexing, which is the foundation of Chinese and cross-lingual information retrieval.
Abstract: Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching. The better the indexing, the better the searching result. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to be solved before such a vision is fulfilled. The most critical is to support a cross-lingual retrieval or multilingual digital library. Much work has been done on English information retrieval; however, there is relatively little work on Chinese information retrieval. In this article, we focus on Chinese indexing, which is the foundation of Chinese and cross-lingual information retrieval. The smallest indexing units in Chinese digital libraries are words, while the smallest units in a Chinese sentence are characters. However, Chinese text has no delimiter to mark word boundaries as English text does. In English or other languages using Roman or Greek-based orthographies, spacing often reliably indicates word boundaries. In Chinese, a number of characters are placed together without any delimiters indicating the boundaries between consecutive characters. In this article, we investigate the combination and boundary detection approaches based on mutual information for segmentation. The combination approach combines n-grams to form longer words.
In the combination approach, Algorithm 1 does not allow n-grams to overlap, while Algorithm 2 does. The boundary detection approach detects the segmentation points in a sentence based on the values, and the change of values, of the mutual information. Experiments are conducted to evaluate their performance. An interface of the system is also presented to show how a Chinese web page is downloaded and how the text in the page is filtered and segmented into words. The segmented words can be submitted for indexing, or new unknown words can be identified and submitted to a dictionary.
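The boundary detection idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes pointwise mutual information estimated from raw character and character-pair counts, and it places a boundary wherever the value falls below a fixed threshold, ignoring the "change of values" criterion the abstract also mentions. The function names, the toy corpus, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b); -inf if the pair is unseen."""
    pair = bigrams[(a, b)]
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_pair = pair / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_pair / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold=1.0):
    """Split a sentence wherever the PMI of two neighbouring characters is below the threshold."""
    if not sentence:
        return []
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b, unigrams, bigrams) < threshold:
            words.append(current)  # weak association: assume a word boundary here
            current = b
        else:
            current += b  # strong association: keep the characters together
    words.append(current)
    return words


# Toy corpus (hypothetical): two two-character words, each seen twice.
toy_unigrams, toy_bigrams = train_counts(["图书", "图书", "检索", "检索"])
print(segment("图书检索", toy_unigrams, toy_bigrams))  # ['图书', '检索']
```

In practice the counts would come from a large corpus, and the threshold would have to be tuned experimentally.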
TL;DR: The contribution of this thesis to this problem is to develop better content analysis methods which can be used to describe document content with better accuracy.
Abstract: This thesis discusses the problems and the methods of finding relevant information in large collections of documents. The contribution of this thesis to this problem is to develop better content analysis methods which can be used to describe document content with
TL;DR: An automatic indexing method is proposed for human motion data that converts motion data, represented as a time series of 3-D positions, into a symbol sequence representing a sequence of primitive motions.
Abstract: Content-based retrieval of multimedia information has been investigated in several research projects. In this paper, we focus on an automatic indexing method for human motion data. We convert motion data, represented as a time series of 3-D positions, into a symbol sequence. We call this conversion automatic indexing. The automatic indexing is performed with a pattern matching approach. Reference patterns are necessary for pattern matching, so we propose two methods to define primitive motions in order to make reference patterns. The first method divides motion data into segmental motion data by detecting changes in motion speed. The second method classifies segmental motions such that similar segmental motions are gathered in the same cluster. To evaluate the similarity between two segmental motions, we use the Dynamic Time Warping (DTW) method, because each segmental motion takes a different length of time even if the same person performs the same motion. Motion data can be converted into a symbol sequence which represents a sequence of primitive motions. Then, the Continuous Dynamic Programming (CDP) method is used to recognize the contents of a motion. CDP is an extension of DTW that makes it possible to recognize a motion with ease even if it is complex.
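The role of DTW in comparing segmental motions of different durations can be shown with a short sketch. This is the textbook DTW recurrence in Python, not the authors' implementation. It assumes each motion is a list of 3-D points and uses Euclidean distance as the local cost; the function name and sample data are hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two sequences of 3-D points (tuples of floats)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b repeats a frame
                cost[i][j - 1],      # seq_b advances, seq_a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same movement performed slowly and quickly aligns with zero cost.
slow = [(0, 0, 0), (0, 0, 0), (1, 0, 0), (1, 0, 0), (2, 0, 0)]
fast = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
print(dtw_distance(slow, fast))  # 0.0
```

Because repeated frames can be absorbed by the warping path, the distance depends on the shape of the trajectory rather than on how long it took, which is the property the abstract relies on.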
TL;DR: This work investigates automatic audio classification and segmentation based on silence ratio, the variation of zero-crossing rate and their combination, and shows promising classification and segmentation performance.
Abstract: To benefit from the increasing amount of information stored as audio data, automatic audio indexing and retrieval systems are required. Audio classification and segmentation are the prerequisites to automatic indexing and retrieval. We investigate automatic audio classification and segmentation based on silence ratio, the variation of zero-crossing rate and their combination. Our experimental results show promising classification and segmentation performance.
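The two features named in the abstract can be sketched as follows. The abstract does not give the exact definitions, so the ones below are assumptions: silence ratio is taken as the fraction of samples whose absolute amplitude is under a threshold, and "variation of zero-crossing rate" as the variance of per-frame zero-crossing rates. All names, the threshold, and the frame size are hypothetical.

```python
def silence_ratio(samples, amplitude_threshold=0.02):
    """Fraction of samples whose absolute amplitude is below the threshold."""
    if not samples:
        return 0.0
    silent = sum(1 for s in samples if abs(s) < amplitude_threshold)
    return silent / len(samples)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def zcr_variance(samples, frame_size=256):
    """Variance of the zero-crossing rate over consecutive non-overlapping frames."""
    frames = [
        samples[i:i + frame_size]
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    rates = [zero_crossing_rate(f) for f in frames]
    if not rates:
        return 0.0
    mean = sum(rates) / len(rates)
    return sum((r - mean) ** 2 for r in rates) / len(rates)
```

A classifier could then threshold these two values, or combine them, to label a clip; how the authors combine them is not stated in the abstract.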
TL;DR: A word spacing algorithm that automatically finds out word spacing positions is proposed based on the recognition of Eojeol components by using the sentence partition and bidirectional longest-match algorithm.
Abstract: Automatic word spacing is needed to solve the automatic indexing problem of the non-spaced documents and the space-insertion problem of the character recognition system at the end of a line. We propose a word spacing algorithm that automatically finds out word spacing positions. It is based on the recognition of Eojeol components by using the sentence partition and bidirectional longest-match algorithm. The sentence partition utilizes an extraction of Eojeol-block where the Eojeol boundary is relatively clear, and a Korean morphological analyzer is applied bidirectionally to the recognition of Eojeol components. We tested the algorithm on two sentence groups of about 4,500 Eojeols. The space-level recall ratio was 97.3% and the Eojeol-level recall ratio was 93.2%.
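The bidirectional longest-match idea can be illustrated in isolation. The sketch below is a simplification: it replaces the Korean morphological analyzer with a plain word list, omits the sentence partition step, and breaks ties in favour of the forward pass. The lexicon, the function names, and the tie-breaking rule are assumptions, and Latin letters stand in for real text.

```python
def forward_longest_match(text, lexicon, max_len):
    """Scan left to right, always taking the longest lexicon entry (or one character)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_longest_match(text, lexicon, max_len):
    """Scan right to left, always taking the longest lexicon entry (or one character)."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def insert_spaces(text, lexicon):
    """Run both passes and keep the one with fewer tokens (forward pass on ties)."""
    max_len = max((len(word) for word in lexicon), default=1)
    fwd = forward_longest_match(text, lexicon, max_len)
    bwd = backward_longest_match(text, lexicon, max_len)
    best = bwd if len(bwd) < len(fwd) else fwd
    return " ".join(best)
```

For example, `insert_spaces("thecatsat", {"the", "cat", "sat"})` returns `"the cat sat"`. The two passes can disagree on ambiguous input, which is where a morphological analyzer would be needed to choose.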
TL;DR: New similarity measures between frames are presented, and a new algorithm that uses a cross entropy defined between two histograms is proposed to detect abrupt scene changes and gradual transitions, including fade in/out and flash light scenes.
Abstract: In huge video databases, an effective video indexing method is required. While manual indexing is the most effective approach to this goal, it is slow and expensive. Thus automatic indexing is desirable, and previously various indexing tools for video databases have been developed. For efficient video indexing and retrieval, the similarity measure is an important factor. This paper presents new similarity measures between frames and proposes a new algorithm to detect scene changes using a cross entropy defined between two histograms. Experimental results show that the proposed algorithm is fast and effective compared with several conventional algorithms to detect abrupt scene changes and gradual transitions including fade in/out and flash light scenes.
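A histogram-based scene change test of this kind can be sketched as below. The abstract does not give the paper's cross-entropy formula, so this example substitutes a symmetrised Kullback-Leibler divergence between smoothed, normalised histograms. It only handles abrupt cuts, not gradual transitions. The threshold, the smoothing constant, and the function names are hypothetical.

```python
import math


def normalise(hist, eps=1e-9):
    """Turn bin counts into a probability distribution, smoothing empty bins."""
    total = sum(hist) + eps * len(hist)
    return [(h + eps) / total for h in hist]


def symmetric_kl(hist_a, hist_b):
    """KL(p||q) + KL(q||p) for two histograms with the same number of bins."""
    p, q = normalise(hist_a), normalise(hist_b)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def detect_cuts(histograms, threshold=0.5):
    """Indices of frames whose histogram diverges strongly from the previous frame."""
    return [
        i
        for i in range(1, len(histograms))
        if symmetric_kl(histograms[i - 1], histograms[i]) > threshold
    ]


# Four frames with 3-bin histograms; the content changes abruptly at frame 2.
frames = [[10, 0, 0], [10, 0, 0], [0, 0, 10], [0, 0, 10]]
print(detect_cuts(frames))  # [2]
```

Detecting fades or flashes would need a comparison over a window of frames rather than a single adjacent pair.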
TL;DR: The overall result showed the effectiveness of the proposed automatic indexing system for news video by 7 to 8%, and indicated that character existence itself is an important feature.
Abstract: Reflecting the demand for recycling and retrieval of video, we are proposing an automatic indexing system for news video that considers correspondences between textual indices and image contents. In this paper, we focus on the background image content (i.e. scene) identification portion of the system. The analysis is performed by segmenting (human) character region from background region, and was applied to actual news video for evaluation. The overall result showed the effectiveness of the proposed method by 7 to 8%, and indicated that character existence itself is an important feature. Individual observation among various scenes indicated that multiple features should be combinatorily used according to each scene, and that the data set should be exponentially extended for higher performance.
TL;DR: The RIAO 2000 conference on Content-Based Multimedia Information Access ranged from the traditional processing of text documents to the rapidly growing field of automatic indexing and retrieval of images and speech and, more generally, to all processing of audio-visual and multimedia information on various distribution venues, including the Net, with an emphasis on the use of Natural Language Processing.
Abstract: The theme of the conference is "Content-Based Multimedia Information Access". The conference scope ranges from the traditional processing of text documents to the rapidly growing field of automatic indexing and retrieval of images and speech and, more generally, to all processing of audio-visual and multimedia information on various distribution venues, including the Net.
It covers topics related to information indexing, retrieval, routing, alerting, profiling, filtering and summarizing, to text mining and data mining, to human-computer interaction in Information Retrieval and Document Processing. It emphasizes the use of Natural Language Processing in this area. It extends the field to spoken language processing (including speaker and language recognition) and to image processing (including image and video indexing, browsing and retrieval, and face recognition). It considers both technological issues and issues related to the development (architecture, best practice, standards, evaluation, resources) and use (cognitive aspects, human factors, socio-economics, security, privacy, personalization, legal aspects) of these technologies. Special interest is given to multilingual and translingual, as well as multimedia, multimodal and transmodal processes. The applications of those technologies may be found in many areas, ranging from medical applications to business intelligence.
The conference is of interest for several scientific communities, including Information Retrieval, Natural Language Processing, Spoken Language Processing, Computer Vision, Human-Computer Interaction and Digital Libraries. RIAO 2000 thereby serves as a forum for cross-discipline initiatives and innovative applications. It accompanies large initiatives which are taken worldwide, such as the Darpa TIDES program in the US, or the Human Language Technology and Information Filtering programs of the European Commission.
TL;DR: An algorithm is developed for the automatic indexing of video surveillance sequences; it recognizes specific scenarios defined by a human operator, for off-line retrieval of specific behaviour from a long stored image sequence.
Abstract: We have developed an algorithm for the automatic indexing of video surveillance sequences. The indexing consists of recognizing specific scenarios defined by a human operator. Its principal use is the off-line retrieval of specific behaviour from a long stored image sequence. From a video indexing point of view, image acquisition takes place under favourable conditions: the acquisition parameters are fixed. This makes it easy to isolate moving objects and allows advanced analysis of their behaviour. Scenarios are composed of a limited number of elementary events. They define spatio-temporal trajectories and interactions between objects. The use of fuzzy sets permits the construction of scenarios by modelling the variability of object behaviour. An end-user graphical interface helps an operator adjust the parameters of each event. In order to reduce the complexity of the recognition stage, only two categories of elementary event are used. A pattern matching approach based on possibility theory is used to manage the uncertainty of scenario recognition.
TL;DR: The results suggest that a more extended study, based upon Medline, should encompass various components: Extraction of ‘concept strings’ from titles and abstracts of records, based upon linguistic features characteristic of medical literature.
Abstract: This article discusses the possibility of the automation of sophisticated subject indexing of medical journal articles. Approaches to subject descriptor assignment in information retrieval research are usually either based upon the manual descriptors in the database or generation of search parameters from the text of the article. The principles of the Medline indexing system are described, followed by a summary of a pilot project, based upon the Amed database. The results suggest that a more extended study, based upon Medline, should encompass various components: Extraction of ‘concept strings’ from titles and abstracts of records, based upon linguistic features characteristic of medical literature. Use of the Unified Medical Language System (UMLS) for identification of controlled vocabulary descriptors. Coordination of descriptors, utilising features of the Medline indexing system. The emphasis should be on system manipulation of data, based upon input, available resources and specifically designed rules.
TL;DR: This work extracts shape, color, position, size and rotation features and provides a syntax for the description of complex objects as composition of simpler "basic" shapes to bridge the gap between low-level features extracted with image analysis and the complex structure of objects.
Abstract: We study the use of description logic for semantic indexing in image retrieval systems in order to bridge the gap between low-level features extracted with image analysis and the complex structure of objects. In our approach, starting from a region-based segmentation of images, we extract shape, color, position, size and rotation features and provide a syntax for the description of complex objects as compositions of simpler "basic" shapes. An extensional semantics allows reasoning services, such as recognition and subsumption, to be defined. A simple user interface allows users to describe prototypical images and pose structured queries by sketch.
TL;DR: The JRC’s statistical tool for controlled vocabulary keyword assignment assigns descriptors of the multilingual Eurovoc thesaurus, which can be used for cross-language document comparison and for the visualisation of multilingual document collections.
Abstract: This article presents an approach for cross-language document comparison and for the visualisation of multilingual document collections. Document comparison usually relies on the calculation of the degree of lexical overlap between documents. As this is not possible for documents written in different languages, the contents of these documents first have to be mapped onto a language-independent representation. The JRC’s statistical tool for controlled vocabulary keyword assignment assigns descriptors of the multilingual Eurovoc thesaurus, which can be used for cross-language document comparison. The language-independent sets of thesaurus descriptors make it possible to identify, for a given document, the most similar documents even if they are written in different languages. They furthermore make it possible to organise and visualise the structure and approximate contents of whole multilingual document collections in two-dimensional document maps.
TL;DR: The proposed system consists of three components: keyframe extraction, text extraction, and text recognition. It extracts text regions using a neural network that operates as a set of texture discrimination filters, and performs profile analysis with some heuristics.
Abstract: We design and implement a video indexing system. For automatic indexing of video images, character recognition information is utilized. The proposed system consists of three components: keyframe extraction, text extraction, and text recognition. It extracts text regions using a neural network that operates as a set of texture discrimination filters, and performs profile analysis with some heuristics. To ensure accurate segmentation of touching characters, character segmentation and recognition are performed in turn. The proposed system has been tested with 390 Korean news archives.
TL;DR: This chapter treats a problem in connection with automatic indexing and retrieval of documents where it cannot be guaranteed that the user queries include the actual words that occur in the documents that should be retrieved.
Abstract: This chapter treats a problem in connection with automatic indexing and retrieval of documents where it cannot be guaranteed that the user queries include the actual words that occur in the documents that should be retrieved. Fuzzy tolerance and similarity relations will be presented and the notion of hierarchical co-occurrence is defined that allows the introduction of two or more hierarchical categories of words in the documents. If the query is based on a single keyword it is possible to extend the query to the compatibility (or equivalence) class of the queried word. So, directly matching documents can be retrieved, or a class of matching words established by some sample document collection and then documents matching with words in this latter class can be retrieved. Various methods of search and retrieval will be proposed and illustrated, with the intention of real application in legal document collections.
TL;DR: A large-scale ANN-based filter is trained which uses meaningful terms in the same database of email messages as input, and is used to predict the relevancy of those messages; the ANN prediction of relevancy is very good compared to the prediction of the IF system.
Abstract: Information filtering (IF) systems usually filter data items by correlating a vector of terms (keywords) that represent the user profile with similar vectors of terms that represent the data items (e.g. documents). The terms that represent the data items can be determined by (human) experts (e.g. authors of documents) or by automatic indexing methods. In this study we employ an artificial neural-network (ANN) as an alternative method for both filtering and term selection, and compare its effectiveness to “traditional” methods. In an earlier study we developed and examined the performance of an IF system that employed content-based and stereotypic rule-based filtering methods, in the domain of e-mail messages. In this study we train a large-scale ANN-based filter which uses meaningful terms in the same database of email messages as input, and use it to predict the relevancy of those messages. Results of the study reveal that the ANN prediction of relevancy is very good, compared to the prediction of the IF system: correlation between the ANN prediction and the users’ evaluation of message relevancy ranges between 0.76 and 0.99, compared to correlation in the range of 0.41-0.77 for the IF system. Moreover, we found very low correlation between the terms in the user profile (which were selected by the users) and the positive causal-index terms of the ANN (which indicate the important terms that appear in the messages). This indicates that the users under-estimate the importance of some terms, failing to include them in their profiles. This may explain the rather low prediction accuracy of the IF system that is based on user-generated profiles.
TL;DR: This work combines both color and spatial features of a picture to index and measure the similarity of images and proposes a new automatic indexing scheme of the image database according to the clustering method.
Abstract: We propose a new approach to retrieving images from an image database. We combine both color and spatial features of a picture to index images and measure their similarity. We propose a new automatic indexing scheme for the image database based on our clustering method, which can filter images efficiently. As a technical contribution, a Seed-Filling-like algorithm that extracts the shape and spatial relationship features of the image is proposed. The system also incorporates a visual interface, which allows the user to specify objects as the spatial specification of pictures. With color identification and spatial similarity functions, preliminary experience shows that the system is able to retrieve image information with a very high level of satisfaction.
TL;DR: A novel text representation and matching scheme for Chinese text retrieval that uses both proximity and mutual information of the word pairs to represent the text content so as to overcome the high false drop, new word and phrase problems that exist in the character-based and word-based systems.
Abstract: This paper proposes a novel text representation and matching scheme for Chinese text retrieval. At present, the indexing methods of Chinese retrieval systems are either character-based or word-based. The character-based indexing methods, such as bi-gram or tri-gram indexing, have high false drops due to the mismatches between queries and documents. On the other hand, it is difficult to efficiently identify all the proper nouns, terminology of different domains, and phrases in the word-based indexing systems. The new indexing method uses both proximity and mutual information of the word pairs to represent the text content so as to overcome the high false drop, new word and phrase problems that exist in the character-based and word-based systems. The evaluation results indicate that the average query precision of proximity-based indexing is 5.2% higher than the best results of TREC-5.
TL;DR: A methodology for analysis and automatic indexing of imaged documents within an archiving and retrieval system based on a new generation Archiving and Retrieval Engine (ARE), which overcomes the bottleneck of document profiling by alleviating the existing limitations of pre-defined indexing schemes.
Abstract: In this paper a methodology for analysis and automatic indexing of imaged documents within an archiving and retrieval system is described. This system, which is being developed within the Esprit project STRETCH (STorage and RETrieval by Content of imaged documents), is based on a new generation Archiving and Retrieval Engine (ARE), which overcomes the bottleneck of document profiling by alleviating the existing limitations of pre-defined indexing schemes. The ARE exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout.
The objective of this system is twofold. First, it aims at combining direct digitalisation, based on location of information fields and OCR, with advanced techniques derived from Image Analysis and Pattern Recognition. Second, it aims at offering ease of use and programming and ability to dynamically adapt to new types of documents.
After experiments in some other document domains (i.e. invoices, medical images, circular letters), the system has been tested on bank documents. In this application, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging.
TL;DR: An implementation of a system for content-based indexing that combines textual, graphical and visual information is reported, showing how visual feedback and visual interaction help to improve the indexing parameters created using low-level image feature extraction technologies.
Abstract: Due to the cursive nature of Arabic scripts, automatic recognition of keywords using computers is very difficult. Content-based indexing using textual, graphical and visual information combined provides a more realistic and practical approach to the problem of indexing large collections of calligraphic material. Starting with low-level pattern recognition and feature extraction techniques, graphical representations of the calligraphic material can be captured to form the low-level indexing parameters. These parameters are then enhanced using textual and visual information provided by the users. Through visual feedback and visual interaction, recognized textual information can be used to enhance the indexing parameters and in turn improve the retrieval of the calligraphic material. In this paper, we report an implementation of the system and show how visual feedback and visual interaction help to improve the indexing parameters created using the low-level image feature extraction technologies.
TL;DR: A common methodology based on Delaunay triangulation is proposed to structure data and facilitate the comparison of information obtained from aerial photos with information stored in a geographical database.
Abstract: This article deals with the integration of multi-source information in the context of geographical database updating. We use aerial photos to obtain information that is compared with the information stored in the database. We also explain why and how this updating process could help us to index the aerial photos. We propose a common methodology based on Delaunay triangulation to structure data and facilitate the comparison process.
TL;DR: A knowledge-based classification system that supports automatic indexing and filtering based on semantic concepts for the dissemination of on-line real-time media is presented and extensively evaluated by applying the engine to video of basketball games.
TL;DR: A system for the automatic registration of correspondence to and from a public office based on a general methodology for the extraction, indexing, archiving, and retrieval of significant information from semi-structured documents is presented.
Abstract: Archives of optical documents are more and more massively employed, the demand driven also by the new norms sanctioning the legal value of digital documents, provided they are stored on supports that are physically unalterable. On the supply side there is now a vast and technologically advanced market, where optical memories have solved the problem of the duration and permanence of data at costs comparable to those for magnetic memories. The remaining bottleneck in these systems is the indexing. The indexing of documents with a variable structure, while still not completely automated, can be machine supported to a large degree with evident advantages both in the organization of the work, and in extracting information, providing data that is much more detailed and potentially significant for the user. We present here a system for the automatic registration of correspondence to and from a public office. The system is based on a general methodology for the extraction, indexing, archiving, and retrieval of significant information from semi-structured documents. This information, in our prototype application, is distributed among the database fields of sender, addressee, subject, date, and body of the document.
TL;DR: DIAMS is a system of distributed, collaborative agents that helps users access, manage, share and exchange information on the World Wide Web; a personal agent helps its owner find the information most relevant to current needs.
Abstract: In this paper, we present DIAMS, a system of distributed, collaborative agents to help users access, manage, share and exchange information. A DIAMS personal agent helps its owner find information most relevant to current needs. It provides tools and utilities for users to manage their information repositories with dynamic organization and virtual views. Flexible hierarchical display is integrated with indexed query search to support effective information access. Automatic indexing methods are employed to support user queries and communication between agents. Contents of a repository are kept in object-oriented storage to facilitate information sharing. Collaboration between users is aided by easy sharing utilities as well as automated information exchange. Matchmaker agents are designed to establish connections between users with similar interests and expertise. DIAMS agents provide needed services for users to share and learn information from one another on the World Wide Web.
TL;DR: This paper presents some methods for image classification in video, based on the presence, size and location of faces and captions, and argues that such classifications are highly domain-dependent and are best handled using flexible knowledge management systems.
Abstract: Indexing videos by their image content is an important issue for digital audiovisual archives. While much work has been devoted to classification and indexing methods based on perceptual qualities of images, such as color, shape and texture, there is also a need for classification and indexing of some structural properties of images. In this paper, we present some methods for image classification in video, based on the presence, size and location of faces and captions. We argue that such classifications are highly domain-dependent, and are best handled using flexible knowledge management systems (in our case, a description logics).
TL;DR: The degree of similarity in indexing terms was found to vary on a scale from completely different to completely identical with various levels in between, and both databases differed in assigning terms in various semantic categories.
Abstract: The KeyWords Plus in the Science Citation Index database represents an approach to combining citation and semantic indexing in describing the document content. This paper explores the similarities or dissimilarities between citation-semantic and analytic indexing. The dataset consisted of over 400 matching records in the SCI and MEDLINE databases on antibiotic resistance in pneumonia. The degree of similarity in indexing terms was found to vary on a scale from completely different to completely identical with various levels in between. The within-document similarity in the two databases was measured by a variation on the Jaccard Coefficient—the Inclusion Index. The average inclusion coefficient was 0.4134 for SCI and 0.3371 for MEDLINE. The 20 terms occurring most frequently in each database were identified. The two groups of terms shared the same terms that consist of the “intellectual base” for the subject. Conceptual similarity was analyzed through scatterplots of matching and nonmatching terms vs. partially identical and broader/narrower terms. The study also found that both databases differed in assigning terms in various semantic categories. Implications of this research and further studies are suggested.
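The overlap measures mentioned here can be illustrated on sets of indexing terms. The Jaccard coefficient below is the standard definition. The abstract does not define the Inclusion Index, so the directional form used here (shared terms divided by the size of one record's term set) is an assumption, chosen because the abstract reports a separate average for each database. The sample terms are made up.

```python
def jaccard(terms_a, terms_b):
    """Standard Jaccard coefficient: shared terms over all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def inclusion_index(terms_a, terms_b):
    """Assumed directional measure: share of terms in A that also occur in B."""
    a, b = set(terms_a), set(terms_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical term sets for one article as indexed in two databases.
record_one = {"antibiotic resistance", "pneumonia", "penicillin"}
record_two = {"pneumonia", "penicillin", "humans", "adult"}

print(jaccard(record_one, record_two))          # 0.4
print(inclusion_index(record_two, record_one))  # 0.5
```

Unlike the Jaccard coefficient, the directional measure gives a different value depending on which record is taken as the reference.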
TL;DR: A novel approach is proposed that automatically retrieves keywords and then uses genetic algorithms to adapt the keyword weights; it is as accurate as the PAT-tree based approach but faster and uses less memory.
Abstract: This paper proposes a novel approach to automatically retrieve keywords and then uses genetic algorithms to adapt the keyword weights. One of the contributions of the paper is to combine the Bigram model (Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. 1997. Chinese text retrieval without using a dictionary, ACM SIGIR’97, Philadelphia, PA, USA, pp. 42–49; Yang, Y.-Y., Chang, J.-S., & Chen, K.-J. 1993. Document automatic classification and ranking, Master thesis, Department of Computer Science, National Tsing Hua University) and the PAT-tree structure (Chien, L.-F., Huang, T.-I., & Chien, M.-C. 1997. Pat-tree-based keyword extraction for Chinese information retrieval, ACM SIGIR’97, Philadelphia, PA, US, pp. 50–59) to retrieve keywords. The approach extracts bigrams from documents and uses the bigrams to construct a PAT-tree to retrieve keywords. The proposed approach can retrieve any type of keyword, such as technical keywords and personal names. The effectiveness of the proposed approach is demonstrated by comparing the keywords found by this approach with those found by the PAT-tree based approach. This comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet our approach is faster and uses less memory. The study then applies genetic algorithms to tune the weights of the retrieved keywords. Moreover, several documents obtained from web sites are tested and experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.
TL;DR: Two new coding schemes are introduced, probabilistic reasoning models (PRM) and enhanced FLD (Fisher linear discriminant) models (EFM), for indexing and retrieval of large image databases with applications to face recognition.
Abstract: This paper introduces two new coding schemes, probabilistic reasoning models (PRM) and enhanced FLD (Fisher linear discriminant) models (EFM), for indexing and retrieval of large image databases with applications to face recognition. The unifying theme of the new schemes is that of lowering the space dimension ("data compression") subject to increased fitness for the discrimination index.