Top 27 papers published in the topic of Automatic indexing in 2001

Showing papers on "Automatic indexing" published in 2001

Patent•

Searching tool and process for unified search using categories and keywords

François Bourdoncle, Patrice Bertin, Eric Jeux

14 Aug 2001

TL;DR: A database of web pages and sites is provided in which entries are at least partially mapped to a set of predetermined categories and associated with keywords, for instance by automatic indexing of documents.

Abstract: A database of entries, such as Web pages and sites, is provided. The entries are at least partially mapped to a set of predetermined categories. The entries are also associated with keywords, for instance, by automatic indexing of documents. In response to a query into the database, a user is provided with a series of refinement strategies, in addition to search results. Refinement strategies comprise categories relevant for the search, selected among the set of predetermined categories. Refinement strategies also include keywords dynamically selected among keywords associated with the entries. The user may easily navigate among the results to the query, and formulate new queries.

185 citations

Proceedings Article•10.1145/383535.383540•

Improving the representation of legal case texts with information extraction methods

Stefanie Brüninghaus¹, Kevin D. Ashley¹•Institutions (1)

University of Pittsburgh¹

1 May 2001

TL;DR: Preliminary experimental results suggest that a combination of domain-specific knowledge and information extraction techniques can be used to generalize from the examples and derive more powerful features.

Abstract: The prohibitive cost of assigning indices to textual cases is a major obstacle for the practical use of AI and Law systems supporting reasoning and arguing with cases. While progress has been made toward extracting certain facts from well-structured case texts or classifying case abstracts under Key Number concepts, these methods still do not suffice for the complexity of indexing concepts in CBR systems. In this paper, we lay out how a better example representation may facilitate classification-based indexing. Our hypotheses are that (1) abstracting from the individual actors and events in cases, (2) capturing actions in multi-word features, and (3) recognizing negation, can lead to a better representation of legal case texts for automatic indexing. We discuss how to implement these techniques with state-of-the-art NLP tools. Preliminary experimental results suggest that a combination of domain-specific knowledge and information extraction techniques can be used to generalize from the examples and derive more powerful features.

69 citations

Journal Article•10.1007/PL00010904•

Automatic document classification and indexing in high-volume applications

Enrico Appiani, Francesca Cesarini¹, Anna Maria Colla, Michelangelo Diligenti², Marco Gori², Simone Marinai¹, Giovanni Soda¹•Institutions (2)

University of Florence¹, University of Siena²

01 Dec 2001-International Journal on Document Analysis and Recognition

TL;DR: This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling bypassing some limitations of existing pre-defined indexing schemes.

Abstract: In this paper a system for analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples and applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume application. Integration into production lines is under execution.

53 citations

Proceedings Article•10.1145/379437.379468•

Automatic identification and organization of index terms for interactive browsing

Nina Wacholder¹, David Evans¹, Judith L. Klavans¹•Institutions (1)

Columbia University¹

1 Jan 2001

TL;DR: The approach to information access discussed in this paper is very promising, and it is concluded that there is much room for further research.

Abstract: The potential of automatically generated indexes for information access has been recognized for several decades (e.g., Bush 1945 [2], Edmundson and Wyllys 1961 [4]), but the quantity of text and the ambiguity of natural language processing have made progress at this task more difficult than was originally foreseen. Recently, a body of work on development of interactive systems to support phrase browsing has begun to emerge (e.g., Anick and Vaithyanathan 1997 [1], Gutwin et al. [10], Nevill-Manning et al. 1997 [17], Godby and Reighart 1998 [9]). In this paper, we consider two issues related to the use of automatically identified phrases as index terms in a dynamic text browser (DTB), a user-centered system for navigating and browsing index terms: 1) What criteria are useful for assessing the usefulness of automatically identified index terms? and 2) Is the quality of the terms identified by automatic indexing such that they provide useful access to document content? The terms that we focus on have been identified by LinkIT, a software tool for identifying significant topics in text [7]. Over 90% of the terms identified by LinkIT are coherent and therefore merit inclusion in the dynamic text browser. Terms identified by LinkIT are input to Intell-Index, a prototype DTB that supports interactive navigation of index terms. The distinction between phrasal heads (the most important words in a coherent term) and modifiers serves as the basis for a hierarchical organization of terms. This linguistically motivated structure helps users to efficiently browse and disambiguate terms. We conclude that the approach to information access discussed in this paper is very promising, and also that there is much room for further research. In the meantime, this research is a contribution to the establishment of a solid foundation for assessing the usability of terms in phrase browsing applications.

47 citations
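
The head/modifier organization described in the abstract above can be illustrated with a minimal sketch. It assumes, as a simplification, that the head of an English noun phrase is its last word; this is not LinkIT's or Intell-Index's actual method, and the function name `group_by_head` and the sample terms are hypothetical.

```python
from collections import defaultdict


def group_by_head(terms):
    """Group multi-word terms under their (assumed) head word.

    Assumption: the head of an English noun phrase is its final word.
    Real systems use linguistic analysis rather than this shortcut.
    """
    hierarchy = defaultdict(list)
    for term in terms:
        words = term.lower().split()
        if not words:
            continue  # skip empty strings
        head = words[-1]
        modifiers = " ".join(words[:-1])
        hierarchy[head].append(modifiers)
    # Sort heads and modifiers so the browsing order is stable.
    return {head: sorted(mods) for head, mods in sorted(hierarchy.items())}


# Hypothetical sample terms, not taken from the paper's data.
terms = ["index term", "dynamic text browser", "phrase browser", "coherent term"]
browse = group_by_head(terms)
```

Here `browse` maps each head to its sorted modifiers, so a user looking under "browser" would see both "dynamic text" and "phrase" in one place.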

Journal Article•10.1002/1097-4571(2000)9999:9999<::AID-ASI1588>3.3.CO;2-Z•

Corpus-based statistical screening for content-bearing terms

Won Kim¹, W. John Wilbur¹•Institutions (1)

National Institutes of Health¹

01 Feb 2001-Journal of the Association for Information Science and Technology

TL;DR: Three statistical techniques for identifying content-bearing phrases within a natural language database are presented and the effectiveness of the methods on test data is demonstrated, and it is shown how all three methods can be combined to produce a single improved method.

Abstract: An important problem in the indexing of natural language text is how to identify those words and phrases that reflect the content of the text. In general, automatic indexing has dealt with this problem by removing instances of a few hundred common words known as stop words, and treating the remaining words as though they were content bearing. This approach is acceptable for some applications such as statistical estimates of the similarity of queries and documents for the purpose of document retrieval. However, when the indexing terms are to be examined by a human as a means of accessing the literature, it greatly improves efficiency if most of the noncontent-bearing words and phrases can be eliminated from the indexing. Here we present three statistical techniques for identifying content-bearing phrases within a natural language database. We demonstrate the effectiveness of the methods on test data, and show how all three methods can be combined to produce a single improved method.

31 citations
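
The stop-word baseline that the abstract above contrasts with its statistical methods can be sketched as follows. This shows only the conventional baseline, not the authors' three techniques; the stop list is a tiny illustrative sample and the function name `candidate_index_terms` is hypothetical.

```python
import re

# A tiny illustrative stop list; real lists hold a few hundred words.
STOP_WORDS = {"a", "an", "and", "as", "by", "for", "in", "is", "of", "the", "to"}


def candidate_index_terms(text, stop_words=STOP_WORDS):
    """Return the tokens that remain after removing stop words.

    Every surviving token is treated as content bearing, which is
    exactly the weakness the abstract points out.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


terms = candidate_index_terms("The indexing of natural language text is a problem")
```

Note that a generic word such as "problem" survives the filter, which suggests why a human browsing the index would benefit from stronger screening.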

Patent•

Record/play apparatus and method for extracting and searching index simultaneously

Kyoung Yoon¹, Jun Seong Bae¹, Chan Eui Cheong¹, Jieun Lee¹•Institutions (1)

LG Electronics¹

13 Jul 2001

TL;DR: An automatic indexing/search device installed in a client device, such as a TiVo, ReplayTV or the like, enables the user to record and play sound/video synchronously and stores index information in a storage device by automatically extracting and indexing meaningful/structural/summary information of a multimedia stream while the stream is being recorded.

Abstract: A client device (set-top box) can simultaneously record and play a multimedia stream transmitted from a broadcasting station or a video/radio server. The invention lets a user search, filter, or browse previously stored streams using structural/meaningful/summarized information. It does so by installing an automatic indexing/search device in a client device, such as a TiVo, ReplayTV or the like, which enables the user to record and play sound/video synchronously and which stores index information in a storage device by automatically extracting and indexing meaningful/structural/summary information of a multimedia stream while the stream is being recorded.

27 citations

Journal Article•10.1002/SCJ.1053•

An automatic video indexing method based on shot classification

Ichiro Ide¹, Koji Yamamoto¹, Reiko Hamada¹, Hidehiko Tanaka¹•Institutions (1)

University of Tokyo¹

01 Aug 2001-Systems and Computers in Japan

TL;DR: An automatic video indexing method which takes advantage of the characteristic of news videos: similar images have similar contents is proposed, which tags indexes selectively according to typical image content.

Abstract: With the recent increase in the number of news videos and in their value, their automatic indexing has been anticipated. Among various proposals for this problem, the use of textual data has been actively considered, and some applications of this type have actually been developed. Nonetheless, most of these approaches are textual data oriented, and few of them consider the correspondences between their indexes (keywords) and image contents, which are essential for an image database. This paper proposes an automatic video indexing method which takes advantage of the characteristic of news videos: similar images have similar contents. The method tags indexes selectively according to typical image content. The method has been applied to actual news videos, and indexing has been successful for between 25% and 93% of “typical shots,” and for between 75% and 100% of “captionized shots.” © 2001 Scripta Technica, Syst Comp Jpn, 32(9): 32–41, 2001

21 citations

Automatic indexing of PDF documents with ontologies

Anjo Anjewierden¹, S.C. Kabel¹•Institutions (1)

University of Amsterdam¹

1 Jan 2001

TL;DR: This work describes how AIDAS, a software tool, automatically divides the source data into reusable chunks, how it automatically indexes these chunks and stores them in a database to enable reuse.

Abstract: Indexing large bodies of data is necessary to enable satisfactory search results. Ontologies serve as fixed vocabularies to index data from different viewpoints. We describe how AIDAS, a software tool, automatically divides the source data (PDF documents) into reusable chunks, how it automatically indexes these chunks and stores them in a database to enable reuse.

21 citations

Journal Article•

Automatic Annotation of Formula 1 Races for Content-Based Video Retrieval

Vojkan Mihajlovic, Milan Petkovic

01 Dec 2001-CTIT technical report series

TL;DR: The authors present a system for automatic indexing and content-based retrieval of multimedia documents that uses multi-modal clues obtained from three different multimedia components: audio, video, and superimposed text.

Abstract: Content-based video retrieval is emerging as an important part in the process of utilization of various multimedia documents. In this report we present a novel system for the automatic indexing and content-based retrieval of multimedia documents. We chose the domain of Formula 1 sport videos because the manual annotation of Formula 1 races is complicated and time consuming. Our system uses multi-modal clues, obtained from three different multimedia components: audio, video, and superimposed text. The audio and video feature extraction subsystems are developed to extract important parameters from multimedia documents. We also performed text detection and recognition to extract some semantic information superimposed in the Formula 1 race video. To unify the audio and video clues we employed dynamic Bayesian networks. Many experiments that we carried out are also presented, as well as the results and conclusions drawn from them.

20 citations

Journal Article•

Indexing medical WWW documents by morphemes.

Stefan Schulz, Martin Honeck, Udo Hahn

01 Jan 2001-Studies in health technology and informatics

TL;DR: An approach to automatic indexing in which medically plausible morphological units are used for cross-lingual medical document retrieval and semantic value is added to these index terms by a compact domain-specific thesaurus.

Abstract: Assisting users to search medical information on the WWW is here considered from two perspectives: the linguistic complexity of medical terms, nominal compounds in particular, and cross-lingual relationships between monolingual medical terminologies. In order to solve the first problem, we present an approach to automatic indexing in which medically plausible morphological units are used. Semantic value is added to these index terms by a compact domain-specific thesaurus. We further discuss tools for morphological segmentation and morphosemantic normalization of HTML documents, as well as an adaptation of a standard WWW search engine for morpheme-based retrieval. The second problem, cross-lingual medical document retrieval, is dealt with by defining cross-lingual equivalence relations on the emerging morpheme sets. Keywords: World Wide Web, Information Retrieval, Medical Language Processing. Introduction: Querying the WWW for languages other than English may cause problems due to differing structural patterns of medical terminology. This claim becomes already obvious at the word level of medical language. Almost all Germanic, Romanic and Slavic languages have a much richer and diversified

16 citations

Proceedings Article•10.1109/NAFIPS.2001.944709•

Case-based system: indexing and retrieval with fuzzy hypercube

J. R. P. Mendes¹, Ivan Rizzo Guilherme, Celso Kazuyuki Morooka•Institutions (1)

State University of Campinas¹

25 Jul 2001

TL;DR: The paper presents an approach that uses geometrical properties of fuzzy hypercube space to make indexing and retrieval processes of cases.

Abstract: In some applications with case-based systems, the attributes available for indexing are better described as linguistic variables instead of receiving numerical treatment. In these applications, the concept of fuzzy hypercube can be applied to give a geometrical interpretation of similarities among cases. The paper presents an approach that uses geometrical properties of fuzzy hypercube space to make indexing and retrieval processes of cases.
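
As a rough illustration of the geometric idea in the abstract above, a case can be viewed as a point in the unit hypercube whose coordinates are membership grades for linguistic attributes. The sketch below is an assumption-laden example, not the paper's method: it uses a normalized city-block distance as the similarity measure, and the case names and attribute values are hypothetical.

```python
def fuzzy_similarity(a, b):
    """Similarity of two cases seen as points in the unit hypercube.

    Each coordinate is a membership grade in [0, 1]. Similarity is
    1 minus the mean absolute difference (an assumed measure, chosen
    only for illustration).
    """
    if len(a) != len(b) or not a:
        raise ValueError("cases must be non-empty and of equal dimension")
    if any(not 0.0 <= grade <= 1.0 for grade in (*a, *b)):
        raise ValueError("membership grades must lie in [0, 1]")
    distance = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - distance


def retrieve(query, case_base):
    """Return the name of the stored case closest to the query point."""
    return max(case_base, key=lambda name: fuzzy_similarity(query, case_base[name]))


# Hypothetical cases with three linguistic attributes each.
case_base = {"case_a": [0.9, 0.1, 0.5], "case_b": [0.2, 0.8, 0.5]}
best = retrieve([0.8, 0.2, 0.4], case_base)
```

With these values the query lies closer to `case_a`, so that case is retrieved.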

Journal Article•10.1300/J104V33N01_07•

Indexing-neglected and poorly understood

Masse Bloomfield

01 Jun 2001-Cataloging & Classification Quarterly

TL;DR: An attempt is made to consider the elements of indexing, their pros and cons, and the argument is made that machine indexing is far too prolific in its production of index terms.

Abstract: The growth of the Internet has highlighted the use of machine indexing. The difficulties in using the Internet as a searching device can be frustrating. The use of the term "Python" is given as an example. Machine indexing is noted as "rotten" and human indexing as "capricious." The problem seems to be a lack of a theoretical foundation for the art of indexing. What librarians have learned over the last hundred years has yet to yield a consistent approach to what really works best in preparing index terms and in the ability of our customers to search the various indexes. An attempt is made to consider the elements of indexing, their pros and cons. The argument is made that machine indexing is far too prolific in its production of index terms. Neither librarians nor computer programmers have made much progress to improve Internet indexing. Human indexing has had the same problems for over fifty years.

Proceedings Article•10.1109/ICME.2001.1237880•

Automatic indexing of text and graphics in technical manuals

Marcel Worring¹, Bob Wielinga, Anjo Anjewierden¹, F. Verster, Leon Todoran, S.C. Kabel, R. de Hoog¹•Institutions (1)

University of Amsterdam¹

1 Aug 2001

TL;DR: This contribution considers the different representation levels and document knowledge required to do the goal driven authoring of training material from existing technical manuals and develops tools for automatic indexing in diverse domains.

Abstract: Goal driven authoring of training material from existing technical manuals requires automatic indexing of the content of the manual. In this contribution we consider the different representation levels and document knowledge required to do the task. On that basis we have developed tools for automatic indexing in diverse domains.

Journal Article•10.1007/BF01201136•

Long transition analysis for digital video sequences

Wei Jyh Heng¹, King Ngi Ngan•Institutions (1)

University of Western Australia¹

01 Mar 2001-Circuits Systems and Signal Processing

TL;DR: An automatic process that determines the type of transition and extracts information from it, so that this information can be used in object extraction, and its implementation, timing, and performance analysis are introduced.

Abstract: Video indexing is a technique used to extract objects within a video sequence and index them so that they can be used for future retrieval. In many video sequences, special effects such as fade and wipe are incorporated, and in some cases, it is desirable to be able to include objects within such effects for indexing. In this paper, we introduce an automatic process that determines the type of transition and extracts information from it, so that this information can be used in object extraction. Such a process consists of four stages: shot boundary refinement, shot type determination, frame reconstruction for soft transitions, and shot classification for hard transitions. In this paper, we will give the implementation, timing, and performance analysis for each stage. Long transition analysis bridges the gap between shot boundary detection and object tracking and smoothes the process of automatic video indexing for video databases.

Automatic Indexing of Lecture Speech by Extracting Discourse Markers

Masahiro Hasegawa, Yuya Akita, Tatsuya Kawahara

31 May 2001

Book Chapter•10.1007/3-540-44796-2_9•

Enhancing Digital Library Documents by A Posteriori Cross Linking Using XSLT

Michael G. Bauer¹, Günther Specht²•Institutions (2)

Technische Universität München¹, Technische Universität Ilmenau²

4 Sep 2001

TL;DR: A combination of XSLT and a host language to access a database with linking information and how to merge documents and links at run-time (a posteriori cross linking) is shown.

Abstract: In this paper we describe a way to enhance existing digital library documents by adding links without modifying the stored documents themselves. We show how to use a combination of XSLT and a host language to access a database with linking information and how to merge documents and links at run-time (a posteriori cross linking). Our approach is already used in the system OMNIS/2, which is an advanced meta system for existing digital library systems and enhances existing digital library systems or retrieval systems by additional storing and indexing of user-defined multimedia documents, automatic and personal linking concepts, annotations, filtering and personalization.

Journal Article•10.1016/S0306-4573(00)00063-7•

Presentation of repeated phrases in a computer-assisted abstracting tool kit

Timothy C. Craven¹•Institutions (1)

University of Western Ontario¹

01 Mar 2001-Information Processing and Management

TL;DR: Compact graphic display of phrases from the original text is among abstracting assistance features being prototyped in the TEXNET text network management system.

Abstract: Compact graphic display of phrases from the original text is among abstracting assistance features being prototyped in the TEXNET text network management system. Compaction is achieved by embedding subphrases and by enabling the user to select rapidly word by word. Phrases displayed would not necessarily be those selected for automatic indexing.

Proceedings Article•10.1117/12.441561•

Detection and retrieval of captions in news video

Ming Luo¹, Xuesheng Bai¹, Guangyou Xu¹•Institutions (1)

Tsinghua University¹

25 Sep 2001

TL;DR: A novel local mutation-based method to detect and locate captions according to shape-regularity and connectivity of the local mutation region, which is general, effective, real time and robust enough for caption detection and retrieval, and further for use in news indexing.

Abstract: Generating semantically meaningful content of news video has been increasingly spotlighted. Captions in video include useful information for automatic annotation and indexing. Unfortunately, there are difficulties in retrieving captions for most real applications because of the following factors: low resolution characters and extremely complex backgrounds. This paper proposes a novel local mutation-based method to solve these problems. We detect and locate captions according to shape-regularity and connectivity of the local mutation region. The overall experimental results show that our approach is general, effective, real time and robust enough for caption detection and retrieval, and further for use in news indexing.

Patent•

Archiving method independent of platform or operating system using a combination of software and programmable hardware

Weber Gerhard, Bender Ralf

1 Mar 2001

TL;DR: A print job is carried out via a corresponding client by a standard printer driver, which converts the file into PostScript (RTM) format; if an index file (4) is missing, automatic indexing is performed.

Abstract: From an application, a print job (1) is carried out via a corresponding client by a standard printer driver which converts the file into PostScript (RTM) format. The print job is supplied to a hardware module which provides a printer interface for the network. Here, a special software module performs a conversion to PDF format. If an index file (4) is missing, an automatic indexing is performed.

Carmen: Content analysis, retrieval and metadata: Effective networking. Bericht über den middleOFTheRoad Workshop

Inka Tappenbeck, Carola Wessel

1 Jan 2001

TL;DR: The CARMEN project is based on a new indexing method that provides access to heterogeneous data.

Abstract: The CARMEN project is based on a new indexing method that provides access to heterogeneous data.

Patent•

Digital video recording automatic indexing system

ヨハンソンイヴァン

22 Jun 2001

Journal Article•

An automatic classification technique for indexing of soccer highlights using neural networks

Hyun Sook Kim, Young Kyu Yang¹•Institutions (1)

Electronics and Telecommunications Research Institute¹

01 Sep 2001-Machine Graphics & Vision International Journal archive

TL;DR: A method for automatic classification of offensive play patterns in soccer games has been developed using the neural networks technique to obtain data that define the positions of both a player and the ball on the ground.

Abstract: A method for automatic classification of offensive play patterns in soccer games has been developed using the neural networks technique. Back-propagation (BP) neural network techniques have been applied to obtain data that define the positions of both a player and the ball on the ground. The offensive play patterns that have been formulated from the group formations enable automatic indexing of the highlights of soccer games. Excerpts from actual soccer games, including some from the 1998 French World Cup, yielded 297 video clips which were categorized into the following five types of patterns: Left-Running 60, Right-Running 74, Center-Running 72, Corner-Kick 39 and Free-Kick 52. Examination of the results shows the following rates of satisfactory pattern recognition: Left-Running 91.7%, Right-Running 100%, Center-Running 87.5%, Corner-Kick 97.4% and Free-Kick 75%.

ATA – Automatic Term Acquisition

Joana Lúcio Paulo¹, Nuno J. Mamede²•Institutions (2)

Technical University of Lisbon¹, Association for Computing Machinery²

1 Jan 2001

TL;DR: This work starts by using morphological and syntactic information to locate candidate noun phrases, and then it uses statistical information to improve result accuracy.

Abstract: Terminological acquisition is an important issue when learning about Natural Language Processing (NLP) due to the constant terminological renewal caused by technological changes. Terms play a key role in several NLP activities such as machine translation, automatic indexing, text understanding, and information retrieval. This is especially true at this time when corpora in electronic format keep growing in number and variety. In this work we start by using morphological and syntactic information to locate candidate noun phrases, and then we use statistical information to improve result accuracy.

Journal Article•10.1300/J111V34N01_07•

Experimental linguistic indexing for information retrieval: ELIXIR

C. Jean Godby, Bradley C. Watson

01 Dec 2001-Journal of Library Administration

TL;DR: Two hypotheses involving the use of linguistic analysis to improve the automatic creation of indexes are explored, finding that if it is possible to develop algorithms that index only the words which most closely capture the document's content, the retrieval of irrelevant documents might be minimized.

Abstract: The primary goal of the ELIXIR project was to achieve improvements in the performance of text-retrieval systems by confronting some of the challenges presented by the indexing of full-text documents. We explored two hypotheses involving the use of linguistic analysis to improve the automatic creation of indexes: (1) if it is possible to develop algorithms that index only the words which most closely capture the document's content (not the complete text, as is done in existing systems), the retrieval of irrelevant documents might be minimized; (2) if it is possible to create indexing algorithms that perform a minimal syntactic parse required to capture important phrases, the index may represent the document's content even more accurately, increasing the likelihood that the article can be retrieved by a relevant query.

Journal Article•10.1016/S0306-4573(00)00026-1•

The nature of indexing: how humans and machines analyze messages and texts for retrieval: part I: research, and the nature of human indexing

James D. Anderson¹, Jose Perez-Carballo¹•Institutions (1)

Rutgers University¹

01 Mar 2001-Information Processing and Management

TL;DR: This essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable, and concludes that human analysis should be used on a much more selective basis.

Abstract: Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Perez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.

Journal Article•10.1016/S0306-4573(00)00030-3•

Automatic keyword identification by artificial neural networks compared to manual identification by users of filtering systems

Zvi Boger, Tsvi Kuflik¹, Peretz Shoval¹, Bracha Shapira²•Institutions (2)

Ben-Gurion University of the Negev¹, Rutgers University²

01 Mar 2001-Information Processing and Management

TL;DR: A large-scale ANN-based filter is trained and it is revealed that the ANN relevance prediction out-performs the prediction of the IF system and implies that the users underestimate the importance of some terms, failing to include them in their profiles.

Abstract: Information filtering (IF) systems usually filter data items by correlating a vector of terms that represent the user profile with similar vectors of terms that represent data items. Terms that represent data items can be determined by experts or automatic indexing methods. In this study we employ an artificial neural network (ANN) as an alternative method for both IF and term selection and compare its effectiveness to that of “traditional” methods. In an earlier study we developed and examined the performance of an IF system that employed content-based and stereotypic rule-based filtering methods in the domain of e-mail messages. In this study, we train a large-scale ANN-based filter, which uses meaningful terms in the same database as input, and use it to predict the relevance of those messages. Our results reveal that the ANN relevance prediction out-performs the prediction of the IF system. Moreover, we found very low correlation between the terms in the user profile (explicitly selected by the users) and the positive causal-index (CI) terms of the ANN, which indicate the relative importance of terms in messages. This implies that the users underestimate the importance of some terms, failing to include them in their profiles. This may explain the rather low prediction accuracy of the IF system.
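
The conventional filtering step that this abstract starts from, correlating a profile term vector with each item's term vector, can be sketched with cosine similarity. Cosine is one common choice and is an assumption here, as are the threshold value, the function names, and the sample messages; none of them come from the study, and the sketch does not cover the ANN-based filter.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)


def filter_messages(profile, messages, threshold=0.3):
    """Keep messages whose term vector is close enough to the profile.

    The threshold of 0.3 is an arbitrary illustrative value.
    """
    return [
        message
        for message in messages
        if cosine(profile, Counter(message.lower().split())) >= threshold
    ]


# Hypothetical user profile and e-mail subjects.
profile = Counter({"indexing": 2, "retrieval": 1})
messages = ["automatic indexing and retrieval methods", "lunch menu for friday"]
relevant = filter_messages(profile, messages)
```

In this sketch only the first message shares terms with the profile, so it is the only one kept. A profile that omits an important term would miss relevant messages, which is the kind of gap the abstract attributes to user-selected terms.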

Journal Article•10.1016/S0306-4573(00)00046-7•

The nature of indexing: how humans and machines analyze messages and texts for retrieval: part II: machine indexing, and the allocation of human versus machine effort

James D. Anderson¹, Jose Perez-Carballo¹•Institutions (1)

Rutgers University¹

01 Mar 2001-Information Processing and Management

TL;DR: It is found that research comparing human versus automatic approaches has done little to change strongly held beliefs, and automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.

Abstract: Does human intellectual indexing have a continuing role to play in the face of increasingly sophisticated automatic indexing techniques? In this two-part essay, a computer scientist and long-time TREC participant (Perez-Carballo) and a practitioner and teacher of human cataloging and indexing (Anderson) pursue this question by reviewing the opinions and research of leading experts on both sides of this divide. We conclude that human analysis should be used on a much more selective basis, and we offer suggestions on how these two types of indexing might be allocated to best advantage. Part I of the essay critiques the comparative research, then explores the nature of human analysis of messages or texts and efforts to formulate rules to make human practice more rigorous and predictable. We find that research comparing human versus automatic approaches has done little to change strongly held beliefs, in large part because many associated variables have not been isolated or controlled. Part II focuses on current methods in automatic indexing, its gradual adoption by major indexing and abstracting services, and ways for allocating human and machine approaches. Overall, we conclude that both approaches to indexing have been found to be effective by researchers and searchers, each with particular advantages and disadvantages. However, automatic indexing has the over-arching advantage of decreasing cost, as human indexing becomes ever more expensive.
