Top 148 papers published in the topic of Vector space model in 2003

Showing papers on "Vector space model" published in 2003

Patent•

Information data retrieval, where the data is organized in terms, documents and document corpora

13 Jan 2003

TL;DR: In this patent, a term-to-concept vector is generated for each term in the document corpus, describing the relationship between that term and each of the concept vectors. From these vectors a term-term matrix is built, which can be used to retrieve information such as the fact that a first term is related to a second term.

Abstract: The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431 - 438), documents and document corpora, where each document contains at least one term (431 - 438) and each document corpus contains at least one document. Based on a concept vector (420 - 424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431 - 438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420 - 424). On the basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431 - 438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, such as the fact that a first term (431) is related to a second term (436).
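
The abstract gives no formulas, so the following is only one plausible reading of the pipeline, sketched under stated assumptions: each document carries a concept vector, a term's term-to-concept vector is the sum of the concept vectors of the documents containing it, and the term-term matrix holds cosine similarities between those vectors. The function names and the choice of cosine similarity are hypothetical and are not taken from the patent.

```python
import math

def term_concept_vectors(doc_terms, doc_concepts, n_concepts):
    """Sum, for each term, the concept vectors of the documents it occurs in.

    doc_terms: list of term lists, one per document.
    doc_concepts: list of concept-weight vectors, one per document (assumed given).
    """
    vectors = {}
    for terms, concept in zip(doc_terms, doc_concepts):
        for term in terms:
            vec = vectors.setdefault(term, [0.0] * n_concepts)
            for i, weight in enumerate(concept):
                vec[i] += weight
    return vectors

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def term_term_matrix(vectors):
    """Return the sorted term list and the matrix of pairwise cosine similarities."""
    terms = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in terms] for a in terms]
    return terms, matrix

# Toy corpus: two documents about concept 0, one about concept 1.
docs = [["engine", "car"], ["car", "road"], ["banana"]]
concepts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
vectors = term_concept_vectors(docs, concepts, 2)
terms, sim = term_term_matrix(vectors)
```

In this toy corpus "car" and "engine" share a concept and come out as related, while "banana" is unrelated to both.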

174 citations

Topic-based Vector Space Model

Jörg Becker¹, Dominik Kuropka¹•Institutions (1)

University of Münster¹

1 Jan 2003

TL;DR: This paper shows further how the Topic-based Vector Space Model can be fully implemented within the context of relational databases and facilitates the use of this approach by generic applications.

Abstract: This paper motivates and presents the Topic-based Vector Space Model (TVSM), a new vector-based approach for document comparison. The approach does not assume independence between terms and it is flexible regarding the specification of term-similarities. Stopword-list, stemming and thesaurus can be fully integrated into the model. This paper shows further how the TVSM can be fully implemented within the context of relational databases. This facilitates the use of this approach by generic applications. At the end short comparisons with other vector-based approaches namely the Vector Space Model (VSM) and the Generalized Vector Space Model (GVSM) are presented.

97 citations

Journal Article•10.1155/S111086570321101X•

A probabilistic multimedia retrieval model and its evaluation

Thijs Westerveld, Arjen P. de Vries, Alex van Ballegooij, Franciska de Jong¹, Djoerd Hiemstra¹•Institutions (1)

University of Twente¹

01 Jan 2003-EURASIP Journal on Advances in Signal Processing

TL;DR: In this article, a probabilistic model for the retrieval of multimodal documents is presented, which is based on Bayesian decision theory and combines models for text-based search with models for visual search.

Abstract: We present a probabilistic model for the retrieval of multimodal documents. The model is based on Bayesian decision theory and combines models for text-based search with models for visual search. The textual model is based on the language modelling approach to text retrieval, and the visual information is modelled as a mixture of Gaussian densities. Both models have proved successful on various standard retrieval tasks. We evaluate the multimodal model on the search task of TREC's video track. We found that the disclosure of video material based on visual information only is still too difficult. Even with purely visual information needs, text-based retrieval still outperforms visual approaches. The probabilistic model is useful for text, visual, and multimedia retrieval. Unfortunately, simplifying assumptions that reduce its computational complexity degrade retrieval effectiveness. Regarding the question whether the model can effectively combine information from different modalities, we conclude that whenever both modalities yield reasonable scores, a combined run outperforms the individual runs.

73 citations

Proceedings Article•

KCCA for different level precision in content-based image retrieval

David R. Hardoon, John Shawe-Taylor¹•Institutions (1)

Royal Holloway, University of London¹

1 Jan 2003

TL;DR: This work uses kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text and compares the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

Abstract: We use kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text. In the application we look at two approaches of retrieving images based only on their content from a text query. The semantic space provides a common representation and enables a comparison between the text and image. We compare the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

64 citations

Automatic term extraction and document similarity in special text corpora

Evangelos E. Milios¹, Y. Zhang¹, L. Dong¹•Institutions (1)

Dalhousie University¹

1 Jan 2003

TL;DR: This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus.

Abstract: This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus. The extracted terms are then used to estimate the similarity of papers in the computer science corpus using the standard Vector Space Model. The precision of retrieval using a term-based representation is compared with that of a word-based representation, and a link-based similarity metric based on the overlap of the local neighborhoods of the papers in the citation graph. The term-based approach offers comparable performance to the word-based approach, but potentially with a much smaller vocabulary size. Automatic term extraction in special text corpora is an interesting problem, which is becoming relevant as literature in specific scientific fields such as medicine, biology and computer science explodes, making it difficult to track the evolving terminology in the fields [Kageura and Umino 1996]. Early approaches to automatic term extraction were focused on information-theoretic approaches based on mutual information in detecting collocations [Manning and Schuetze 1999]. Collocations are expressions that are composed of two or more words, the meaning of which is not easy to guess from the meanings of the component words. There are nuances in the detection of collocation that require linguistic criteria to resolve [Justeson and Katz 1995]. Shallow linguistic criteria are based on acceptable sequences of part-of-speech tags. Part-of-speech tagging can be performed automatically [Brill 1992]. A key problem is that of nesting, where subsets of consecutive words of terms consisting of multiple words would satisfy the statistical criteria for "termhood", but they would not be called terms.
In the first part of this paper, we describe experiments with a state-of-the-art method, C-value/NC-value [Frantzi et al. 2000], which combines statistical and linguistic information for automatic term extraction. We applied it to a special text corpus of computer science articles, which is of a different nature from the medical corpus on which the method was originally tested. We confirmed that the performance of the method is equally good on our corpus, and we identified some adjustments that the method required. In the second part of this paper, we use the terms extracted to estimate the similarity between two documents. We evaluate the quality of the similarity estimation based on terms in an information retrieval context. It is broadly believed that it is difficult to improve upon the bag-of-words representation as far as retrieval performance is concerned by using more sophisticated features or shallow linguistic techniques. Although retrieval based on terms did not show significant improvement over a bag-of-words representation, our long-term objective is to cluster special text corpora into subareas, and automatically generate lexical ontologies from the clusters [Ayad and Kamel 2002]. Terms in this context are of interest in themselves, and not purely as a vehicle to information retrieval. We are, furthermore, interested in similarity criteria taking into account proximity of terms [Koubarakis 2001], for which again it is essential to work with terms, not words. The use of terms instead of words may also be preferable in information dissemination, where given a database of profiles (of c

64 citations

Proceedings Article•10.1145/900051.900096•

Refinement of TF-IDF schemes for web pages using their hyperlinked neighboring pages

Kazunari Sugiyama¹, Kenji Hatano¹, Masatoshi Yoshikawa², Shunsuke Uemura¹•Institutions (2)

Nara Institute of Science and Technology¹, Nagoya University²

26 Aug 2003

TL;DR: Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page.

Abstract: In IR (information retrieval) systems based on the vector space model, the TF-IDF scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting the contents of their hyperlinked neighboring pages. In this paper, we first propose several approaches to refining the TF-IDF scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of our proposed approaches. Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page.
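
The abstract does not spell out the refinement formulas, so the following is a minimal sketch of the general idea only: compute plain TF-IDF vectors, then add a damped average of the vectors of pages that link to the target (first-level backward neighbors only). The names `tf_idf`, `refine`, `backlinks` and the damping factor `alpha` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict page_id -> list of tokens. Returns page_id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per page
    vectors = {}
    for page, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty pages
        vectors[page] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def refine(vectors, backlinks, target, alpha=0.5):
    """Add alpha times the mean vector of the pages linking to `target`."""
    refined = dict(vectors[target])
    neighbors = backlinks.get(target, [])
    for neighbor in neighbors:
        for term, weight in vectors[neighbor].items():
            refined[term] = refined.get(term, 0.0) + alpha * weight / len(neighbors)
    return refined

pages = {
    "a": ["jazz", "music"],
    "b": ["jazz", "festival"],
    "c": ["tax", "law"],
}
backlinks = {"a": ["b"]}  # page "b" links to page "a"
vectors = tf_idf(pages)
refined_a = refine(vectors, backlinks, "a")
```

After refinement, page "a" gains weight on "festival", a term that appears only on the page linking to it. Extending the loop to neighbors of neighbors would cover the second backward level mentioned in the abstract.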

63 citations

Journal Article•10.1023/A:1023293820057•

Genetic Mining of HTML Structures for Effective Web-Document Retrieval

Sun Kim¹, Byoung-Tak Zhang¹•Institutions (1)

Seoul National University¹

01 May 2003-Applied Intelligence

TL;DR: A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes, which tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.

Abstract: Web-documents have a number of tags indicating the structure of texts. Text segments marked by HTML tags have specific meaning which can be utilized to improve the performance of document retrieval systems. In this paper, we present a machine learning approach to mine the structure of HTML documents for effective Web-document retrieval. A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. Experimental evidence supports that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval, and indicates that the proposed approach significantly improves the performance in retrieval accuracy. In particular, the use of the document-structure mining approach tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.

59 citations

Proceedings Article•10.1109/ICME.2003.1221543•

Support vector machines for region-based image retrieval

Feng Jing, Mingjing Li¹, Hong-Jiang Zhang², Bo Zhang•Institutions (2)

Massachusetts Institute of Technology¹, Princeton University²

6 Jul 2003

TL;DR: A new kind of kernel that is a generalization of Gaussian kernel is proposed to resolve the issue of infeasible kernels in region-based image retrieval systems that use variable-length representations.

Abstract: In this paper, the application of support vector machines (SVM) in relevance feedback for region-based image retrieval is investigated. Both the one class SVM as a class distribution estimator and two classes SVM as a classifier are taken into account. For the latter, two representative display strategies are studied. Since the common kernels often rely on inner product or L/sub p/ norm in the input space, they are infeasible in the region-based image retrieval systems that use variable-length representations. To resolve the issue, a new kind of kernel that is a generalization of Gaussian kernel is proposed. Experimental results on a database of 10,000 general-purpose images demonstrate the effectiveness and robustness of the proposed approach.

46 citations

Journal Article•10.1016/S0306-4573(02)00044-4•

Genetic algorithms in relevance feedback: a second test and new contributions

Cristina López-Pujalte¹, Vicente P. Guerrero-Bote¹, Félix de Moya-Anegón²•Institutions (2)

University of Extremadura¹, University of Granada²

01 Sep 2003-Information Processing and Management

TL;DR: Relevance feedback genetic techniques that follow the vector space model are compared with each other and with one of the best traditional methods of relevance feedback, the Ide dec-hi method.

Abstract: The present work is the continuation of an earlier study which reviewed the literature on relevance feedback genetic techniques that follow the vector space model (the model that is most commonly used in this type of application), and implemented them so that they could be compared with each other as well as with one of the best traditional methods of relevance feedback--the Ide dec-hi method. We here carry out the comparisons on more test collections (Cranfield, CISI, Medline, and NPL), using the residual collection method for their evaluation as is recommended in this type of technique. We also add some fitness functions of our own design.

46 citations

Proceedings Article•10.1145/860435.860461•

Experimental result analysis for a generative probabilistic image retrieval model

Thijs Westerveld, Arjen P. de Vries

28 Jul 2003

TL;DR: It is shown how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight in the main problems with the visual part of the retrieval model under study.

Abstract: The main conclusion from the metrics-based evaluation of video retrieval systems at TREC's video track is that non-interactive image retrieval from general collections using visual information only is not yet feasible. We show how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight in the main problems with the visual part of the retrieval model under study. Such an analytical approach proves an important addition to standard evaluation measures.

44 citations

Journal Article•

A Personalized Search Algorithm by Using Content-Based Filtering

Zeng Chun

01 Jan 2003-Journal of Software

TL;DR: A personalized search algorithm by using content-based filtering is presented in this paper, where the user model is represented as the probability distribution over the domain classification model.

Abstract: Traditional information retrieval technologies satisfy users' needs to a great extent. However, because of their all-purpose character, they cannot satisfy every query from users with different backgrounds, with different intentions, and at different times. A personalized search algorithm using content-based filtering is presented in this paper. The user model is represented as a probability distribution over the domain classification model. A method of computing similarity and a method of revising the user model are provided. Compared with the vector space model, the probability model is more effective at describing a user's interests.

Journal Article•10.1016/S0020-0255(03)00169-5•

Category cluster discovery from distributed WWW directories

Mei-Ling Shyu¹, Choochart Haruechaiyasak¹, Shu-Ching Chen²•Institutions (2)

University of Miami¹, Florida International University²

15 Oct 2003-Information Sciences

TL;DR: The merging process is based on the clustering concept where categories with similar characteristics are merged into the same cluster under certain distributed constraints and a performance measure called cluster entropy is proposed, which determines how well the categories from different sources are distributed over the resulting clusters.

Learning the semantics of multimedia content with application to web image retrieval and classification

Alexei Vinokourov¹, David R. Hardoon, John Shawe-Taylor¹•Institutions (1)

Royal Holloway, University of London¹

1 Jan 2003

TL;DR: This work uses kernel Canonical Correlation Analysis to learn a semantic representation of Web images and their associated text and compares against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

Abstract: We use kernel Canonical Correlation Analysis to learn a semantic representation of Web images and their associated text. This representation is used in two applications. In first application we consider classification of images into one of three categories. We use SVM in the semantic space and compare against the SVM on raw data and against previously published results using ICA. In the second application we retrieve images based only on their content from a text query. The semantic space provides a common representation and enables a comparison between the text and image. We compare against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.

Journal Article•10.1023/A:1026080230789•

From Retrieval Status Values to Probabilities of Relevance for Advanced IR Applications

Henrik Nottelmann¹, Norbert Fuhr¹•Institutions (1)

University of Duisburg-Essen¹

01 Sep 2003-Information Retrieval

TL;DR: These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one and the approximation quality of the different mapping functions is compared.

Abstract: Information Retrieval systems typically sort the result with respect to document retrieval status values (RSV). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonically increasing with the probabilities of relevance (as e.g. for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a “normalisation” function which maps the retrieval status value onto the probability of relevance (“mapping functions”). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
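
To make the two kinds of mapping function concrete, here is a minimal sketch. The functional forms (a clipped linear function and a logistic function of the RSV) follow the abstract's description; the parameter names and the gradient-descent fitting routine are assumptions for illustration, since the abstract does not say how the parameters are estimated.

```python
import math

def linear_mapping(rsv, c0, c1):
    """Linear map from RSV to probability, clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, c0 + c1 * rsv))

def logistic_mapping(rsv, b0, b1):
    """Logistic map from RSV to probability of relevance."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * rsv)))

def fit_logistic(samples, steps=5000, learning_rate=0.1):
    """Fit b0, b1 by gradient descent on the log-loss (an assumed procedure).

    samples: list of (rsv, relevance) pairs with relevance in {0, 1}.
    """
    b0 = b1 = 0.0
    n = len(samples)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for rsv, relevance in samples:
            error = logistic_mapping(rsv, b0, b1) - relevance
            grad0 += error
            grad1 += error * rsv
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

# Hypothetical relevance judgements: higher RSVs are more often relevant.
judged = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
          (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1)]
b0, b1 = fit_logistic(judged)
p_low = logistic_mapping(0.1, b0, b1)
p_high = logistic_mapping(0.9, b0, b1)
```

Unlike the clipped linear map, the logistic map always yields values strictly between 0 and 1 and preserves the ranking whenever the fitted slope is positive.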

Language Modelling and Relevance

K. Sparck Jones¹, Stephen Robertson², Djoerd Hiemstra, Hugo Zaragoza²•Institutions (2)

University of Cambridge¹, Microsoft²

1 Jan 2003

TL;DR: This paper addresses three questions about the Language Modelling (LM) approach to information retrieval, all concerning LM and relevance: relevance has always been taken as fundamental to retrieval, yet it does not formally figure in the LM account.

Abstract: This paper addresses three questions about the Language Modelling (LM) approach to information retrieval. These questions are about LM and relevance. They arise because relevance has always been taken as fundamental to information retrieval (see, e.g. Saracevic [10] or Mizzaro [7]). Thus from the standpoint of retrieval theory, the presumption has been that as relevance is the key notion in retrieval (for how could it not be?), this should be explicitly recognised in any formal model of retrieval. The Probabilistic Model (PM) of retrieval does this very clearly. Turtle and Croft [13, 1] present the INQUERY network model as a probabilistic classifier testing for whether the hypothesis of relevance holds given the evidence supplied by document and query. The Vector Space Model assumes that some of the points in the multidimensional information space of a retrieval system represent relevant documents. But the LM account of what retrieval is about seems quite different: relevance does not formally figure in it at all. A retrieval model that does not mention relevance appears paradoxical. But the form in which the LM account is expressed immediately provokes the question:

Book Chapter•10.1007/978-3-540-25981-7_7•

Automatic Keyword Extraction for News Finder

José Luis Martínez-Fernández¹, Ana García-Serrano², Paloma Martínez¹, Julio Villena¹•Institutions (2)

Charles III University of Madrid¹, Technical University of Madrid²

15 Sep 2003

TL;DR: In this paper, the authors describe the automatic keyword extraction (AKE) process for news characterization that uses several linguistic techniques to improve the current state of the text-based information retrieval.

Abstract: Newspapers are one of the most challenging domains for information retrieval systems: new articles appear every day, written in different languages and with multimedia contents, and the news repositories may be updated in a matter of hours, so information extraction is crucial to the metadata contents of the news. Further approaches to “smart retrieval” have to cope with multimedia and multilingual features and have to obtain really good precision in order to reach a high degree of user satisfaction with the retrieved documents. The focus of the paper is the description of the automatic keyword extraction (AKE) process for news characterization that uses several linguistic techniques to improve the current state of the text-based information retrieval. The first prototype implemented focusing on the AKE process (www.omnipaper.org) is described and some relevant performance features are included. Finally, some conclusions and comments are given regarding the role of linguistic engineering in the web era.

Proceedings Article•10.1109/ICIT.2003.1290365•

On SVD-free latent semantic indexing for image retrieval for application in a hard industrial environment

Pavel Praks, Vaclav Snasel, Jiri Dvorsky, J. Cernohorsky

10 Dec 2003

TL;DR: A retrieval system based on latent semantic indexing (LSI) is presented, intended for practical industrial use within a control and information system operating at the coking plant at Nova hut Ostrava.

Abstract: A retrieval system based on latent semantic indexing (LSI) is presented in our paper. LSI is usually based on singular value decomposition, which is a time-consuming computation. A new, fast algorithm for computing the LSI was developed. The algorithm was tested on text and image data collections. Experimental results are given. The algorithms are intended for practical industrial use within a control and information system operating at the coking plant at Nova hut Ostrava. A description of the industrial environment is given.

Journal Article•10.1016/S0167-6393(03)00103-1•

Boosting and combination of classifiers for natural language call routing systems

Imed Zitouni¹, Hong-Kwang Jeff Kuo¹, Chin-Hui Lee¹•Institutions (1)

Alcatel-Lucent¹

01 Nov 2003-Speech Communication

TL;DR: This paper describes methods to improve a single classifier: boosting, discriminative training (DT) and automatic relevance feedback (ARF), and explores ways of deriving and combining uncorrelated classifiers in order to improve accuracy.

Book Chapter•10.1007/978-3-642-57410-8_3•

Computational Information Retrieval

Jacob Kogan¹•Institutions (1)

University of Maryland, Baltimore County¹

1 Jan 2003

TL;DR: The main goal of this note is to introduce the notion of collection dependent “same context words”, and briefly discuss further possible applications of the introduced concepts to a number of information retrieval related problems.

Abstract: The main goal of this note is to introduce the notion of collection dependent “same context words”. Two (or more) words are the “same context words” if they occur in the same (or similar) context across a given text collection. Each word w in the collection is associated with a profile P(w). The profile P(w) is the set of words occurring in sentences that contain w. We introduce a distance function in the set profiles, and use it to cluster words. Words contained in the same cluster are “same context words”. We select “same context words” for several text collections, and briefly discuss further possible applications of the introduced concepts to a number of information retrieval related problems.
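
The profile construction described above can be sketched as follows. The profile definition follows the abstract, except that each word is left out of its own profile, which is an assumption made here so that profiles compare contexts only. The abstract does not name its distance function, so Jaccard distance is used purely as an illustrative stand-in, and the clustering step is omitted.

```python
from collections import defaultdict

def build_profiles(sentences):
    """Map each word w to P(w): the words sharing at least one sentence with w."""
    profiles = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            profiles[word] |= words - {word}  # assumption: w is not in P(w)
    return dict(profiles)

def jaccard_distance(p, q):
    """1 - |p & q| / |p | q|; an assumed choice of distance between profiles."""
    union = p | q
    if not union:
        return 0.0
    return 1.0 - len(p & q) / len(union)

corpus = [
    "the cat drinks milk",
    "the dog drinks milk",
    "stocks fell sharply",
]
profiles = build_profiles(corpus)
d_cat_dog = jaccard_distance(profiles["cat"], profiles["dog"])
d_cat_stocks = jaccard_distance(profiles["cat"], profiles["stocks"])
```

Here "cat" and "dog" never co-occur but have identical profiles, so they are candidates for the same cluster, while "cat" and "stocks" share no context at all.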

Journal Article•10.1023/A:1026201013738•

A Vector space model as a methodological approach to the Triple Helix dimensionality : a comparative study of Biology and Biomedicine Centres of two European National Research Councils from a webometric view

José Luis Ortega Priego¹•Institutions (1)

Spanish National Research Council¹

01 Oct 2003-Scientometrics

TL;DR: A Vector Space Model is proposed as a new methodological approach which allows the relationships between the elements of the Triple Helix Model (University, Industry, Government) to be presented in a spatial model, using the webpages of the National Research Councils of Germany and Spain as examples.

Abstract: The aim of this paper is to propose a Vector Space Model as a new methodological approach which allows us to present the relationships between the elements of the Triple Helix Model (University, Industry, Government) in a spatial model by using the webpages of the National Research Councils of Germany and Spain as examples. Outlinks of the Biomedicine and Biology centres of these national councils were analysed with the intention of representing these relationships graphically through the Vector Space Model, which allows for Multidimensional Scaling in three dimensions. Results show a map with the differences and similarities between the Spanish and German cases. It may be concluded that these results could become a qualitative indicator of a scientific and technical reality.

Proceedings Article•10.1145/860435.860441•

Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model

Jaime Teevan¹, David R. Karger¹•Institutions (1)

Massachusetts Institute of Technology¹

28 Jul 2003

TL;DR: The thesis is that a generative document model developed by analysing actual corpora and queries will be more accurate than existing models, and thus more useful in retrieval as well as other applications; this is tested by learning the best document model from a corpus.

Abstract: Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.

Patent•

Position dependent information retrieval method, position dependent information retrieval device, position dependent information providing device, and position dependent information retrieval program

Masamichi Asukai, Tetsukazu Kai, Kenichiro Nakamura, Hiroyuki Segawa

16 May 2003

TL;DR: In this article, the problem of presenting data having a meaning close to a retrieval condition as a retrieval result when no information matched to the retrieval condition is present is solved by generating a re-retrieval item condition.

Abstract: PROBLEM TO BE SOLVED: To present data having a meaning close to a retrieval condition as a retrieval result when no information matched to the retrieval condition is present. SOLUTION: A range determination means 4 determines a range of retrieval based on position information 9a. A retrieval means 5 retrieves position dependent information matched to an inputted item condition 9b from a position dependent information database 2. A re-retrieval item condition generation means 6 traces back from a retrieval object classification item designated by the item condition 9b to an upper layer when the retrieval result satisfies a predetermined condition. The re-retrieval item condition generation means 6 sets another concept included in the classification item of the upper layer as a new retrieval object classification item to generate a re-retrieval item condition 9c. A re-retrieval means 7 performs a retrieval according to the re-retrieval item condition 9c. According to this, when no information matched to the retrieval condition is present, other information included in a concept broader than the retrieval condition is presented as the retrieval result. COPYRIGHT: (C)2005,JPO&NCIPI

Book Chapter•10.1007/3-540-36618-0_1•

Document retrieval: shallow data, deep theories; historical reflections, potential directions

Karen Sparck Jones¹•Institutions (1)

University of Cambridge¹

14 Apr 2003

TL;DR: A review of the development of statistically-based retrieval can be found in this paper, where the authors present a survey of statistical methods for document or text retrieval and their application in information and language processing.

Abstract: This paper reviews the development of statistically-based retrieval. Since the 1950s statistical techniques have clearly demonstrated their practical worth and statistical theories their staying power, for document or text retrieval. In the last decade the TREC programme, and the Web, have offered new retrieval challenges to which these methods have successfully risen. They are now one element in the much wider and very productive spread of statistical methods to all areas of information and language processing, in which innovative approaches to modelling their data and tasks are being applied.

Journal Article•10.1111/1467-8640.00215•

Semantic-Based Information Retrieval for Content Management and Security

Bo-Hyun Yun¹, Chang-Ho Seo²•Institutions (2)

Mokwon University¹, Kongju National University²

1 May 2003

TL;DR: The experimental results show that the proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements, which can improve the retrieval effectiveness of the information retrieval system.

Abstract: An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and query terms have been a serious obstacle to the enhancement of retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and their application to a Korean information retrieval system. We perform three new types of term normalizations: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm and cooccurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.

Journal Article•10.1080/02286203.2003.11442267•

Arabic Text Data Mining: a Root-Based Hierarchical Indexing Model

Taisir Eldos¹•Institutions (1)

Jordan University of Science and Technology¹

01 Jan 2003-International Journal of Modelling and Simulation

TL;DR: This article focuses on speeding up the information retrieval process in an Arabic document base by using a root-based hierarchical indexing model; the results demonstrated that a speed gain in the range of 50-100 can be achieved for typical queries.

Abstract: The world has recently witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Text data mining, as a m...

Proceedings Article•10.1109/ICIP.2003.1247066•

Kernel indexing for relevance feedback image retrieval

Jing Peng¹, Douglas R. Heisterkamp•Institutions (1)

Tulane University¹

24 Nov 2003

TL;DR: The key observation is that kernel metrics may be nonlinear and highly dynamic in the input space but remain Euclidean in induced feature space, which enables us to learn arbitrary relevance functions without changing the index in feature space.

Abstract: Relevance feedback is an attractive approach to developing flexible metrics for content-based retrieval in image and video databases. Large image databases require an index structure in order to reduce nearest neighbor computation. However, flexible metrics can alter an input space in a highly nonlinear fashion, thereby rendering the index structure useless. Few systems have been developed that address the apparent flexible metric/indexing dilemma. This paper proposes kernel indexing to try to address this dilemma. The key observation is that kernel metrics may be nonlinear and highly dynamic in the input space but remain Euclidean in induced feature space. It is this linear invariance in feature space that enables us to learn arbitrary relevance functions without changing the index in feature space. As a result, kernel indexing supports efficient relevance feedback retrieval in large image databases. Experimental results using a large set of image data are very promising.
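
The key observation, that a kernel metric is an ordinary Euclidean distance in the induced feature space, can be illustrated with the standard kernel-trick identity d²(x, y) = k(x, x) − 2k(x, y) + k(y, y). The sketch below assumes a Gaussian (RBF) kernel purely for illustration; the function names and the `gamma` value are hypothetical and not taken from the paper.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; `gamma` is an illustrative assumption."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared)


def feature_space_distance(x, y, kernel=rbf_kernel):
    """Euclidean distance between phi(x) and phi(y), computed via the kernel.

    Uses d^2 = k(x, x) - 2 k(x, y) + k(y, y), so the feature map phi
    never has to be constructed explicitly.
    """
    d2 = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative rounding errors


a, b, c = [0.0, 0.0], [1.0, 0.0], [3.0, 1.0]
# The triangle inequality holds, as it must for a Euclidean distance.
print(feature_space_distance(a, c)
      <= feature_space_distance(a, b) + feature_space_distance(b, c))
```

Because the distance obeys the usual metric axioms in feature space, a metric index built there stays valid as long as the kernel itself is unchanged.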

Proceedings Article•10.1109/IJCNN.2003.1223673•

Relevance feedback with active learning for document retrieval

Takashi Onoda, H. Murata, Seiji Yamada

20 Jul 2003

TL;DR: This paper applies active learning techniques based on Support Vector Machines to evaluate successive batches of documents (relevance feedback); experiments indicate that the approach is very useful for document retrieval.

Abstract: We investigate the following data mining problem in document retrieval: from a large data set of documents, we need to find the documents that relate to a human's interest in as few iterations of human testing or checking as possible. In each iteration, a comparatively small batch of documents is evaluated for its relation to the human's interest. We apply active learning techniques based on Support Vector Machines to evaluate successive batches, a process called relevance feedback. Experimentally, the proposed approach proves very useful for document retrieval with relevance feedback.

Journal Article•

A New Fuzzy Information Retrieval Method Based on Document Terms Reweighting Techniques

Yih-Jen Horng, Shyi-Ming Chen, Chia-Hoang Lee

01 Jan 2003-International journal of information and management sciences

TL;DR: The proposed method modifies the weights of document terms in document descriptor vectors based on the user’s relevance feedback to increase the retrieval effectiveness of the fuzzy information retrieval systems for document retrieval.

Abstract: In this paper, we present a new method for fuzzy information retrieval based on document terms reweighting techniques. The proposed method modifies the weights of document terms in document descriptor vectors based on the user’s relevance feedback. After the weights of terms in document descriptor vectors are modified, the degrees of satisfaction of relevant documents with respect to the user’s query increase, and the degrees of satisfaction of irrelevant documents with respect to the user’s query decrease. The modified document descriptor vectors can then be used as personal profiles for future query processing. The proposed method can make fuzzy information retrieval systems more flexible and more intelligent in dealing with document retrieval. It can increase the retrieval effectiveness of fuzzy information retrieval systems for document retrieval.
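
The abstract does not give the update formula, so the following is only a sketch of the general idea, with every detail an assumption: term weights lie in [0, 1], the degree of satisfaction is taken as the mean weight of the query terms, and feedback moves each query-term weight toward 1 (relevant) or toward 0 (irrelevant) at a hypothetical `rate`.

```python
def satisfaction(doc, query_terms):
    """Illustrative degree of satisfaction: mean weight of the query terms.

    Assumption: this stands in for the paper's (unspecified here) measure.
    """
    if not query_terms:
        return 0.0
    return sum(doc.get(t, 0.0) for t in query_terms) / len(query_terms)


def reweight(doc, query_terms, relevant, rate=0.3):
    """Return a new descriptor vector adjusted by one piece of user feedback.

    Relevant documents have query-term weights pulled toward 1,
    irrelevant ones toward 0; weights therefore stay within [0, 1].
    """
    updated = dict(doc)  # leave the original descriptor untouched
    for term in query_terms:
        w = updated.get(term, 0.0)
        updated[term] = w + rate * (1.0 - w) if relevant else w - rate * w
    return updated


doc = {"fuzzy": 0.4, "retrieval": 0.6}
query = ["fuzzy", "retrieval"]
print(satisfaction(reweight(doc, query, relevant=True), query) > satisfaction(doc, query))
```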

Proceedings Article•10.1145/860435.860523•

Incorporating query term dependencies in language models for document retrieval

Munirathnam Srikanth¹, Rohini K. Srihari¹•Institutions (1)

University at Buffalo¹

28 Jul 2003

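TL;DR: The abstract notes that N-gram, specifically bigram, language models that capture the context provided by the previous word(s) perform better than unigram models [7], and that term dependencies in queries can indicate when the word order constraint can be relaxed.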

Abstract: Recent advances in Information Retrieval are based on using Statistical Language Models (SLM) for representing documents and evaluating their relevance to user queries [6, 3, 4]. Language Modeling (LM) has been explored in many natural language tasks including machine translation and speech recognition [1]. In the LM approach to document retrieval, each document, D, is viewed as having its own language model, M_D. Given a query, Q, documents are ranked based on the probability, P(Q|M_D), of their language model generating the query. While the LM approach to information retrieval has been motivated from different perspectives [3, 4], most experiments have used smoothed unigram language models that assume term independence for estimating document language models. N-gram, specifically bigram, language models that capture the context provided by the previous word(s) perform better than unigram models [7]. Biterm language models [8] that ignore the word order constraint in bigram language models have been shown to perform better than bigram models. However, the word order constraint cannot always be relaxed, since a blind venetian is not a venetian blind. Term dependencies can be measured using their co-occurrence statistics. Nallapati and Allan [5] represent term dependencies in a sentence using a maximum spanning tree and generate a sentence tree language model for the story link detection task in TDT. A syntactic parse of user queries can provide clues for when the word order constraint can be relaxed. Syn-
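
The baseline described here, ranking documents by the probability that a smoothed unigram model of each document generates the query, can be sketched as follows. The abstract does not name a smoothing method, so Jelinek-Mercer interpolation with a hypothetical `lam` is assumed; the function names and sample documents are illustrative only.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q | M_D) under a unigram model with Jelinek-Mercer smoothing.

    Assumption: P(t | M_D) = lam * tf(t, D)/|D| + (1 - lam) * cf(t)/|C|.
    """
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection) if collection else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank(query, docs):
    """Return document indices ordered by decreasing query likelihood."""
    collection = [t for d in docs for t in d]
    scored = [(query_log_likelihood(query, d, collection), i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


docs = [["venetian", "blind", "window"],
        ["blind", "man", "walking"],
        ["cat", "sat", "mat"]]
print(rank(["venetian", "blind"], docs))
```

Because the unigram model treats terms independently, it scores the queries "venetian blind" and "blind venetian" identically, which is the limitation that term-dependency models aim to address.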

Next-generation information retrieval: integrating document and data retrieval based on XML

Michael Gertz, Jan-Marco Bremer

1 Jan 2003

TL;DR: This paper introduces integrated information retrieval (IIR), an XML-based retrieval approach that closes the gap between structured and semistructured data retrieval and document retrieval, and introduces the syntax and semantics of an extension of the XQuery language called XQuery/IR.

Abstract: Data retrieval comprises exact queries that allow a user to specify a precisely defined subset of a data source. Document retrieval arranges elements of a given document collection according to their relevance to a set of query terms. For querying structured and semistructured data, data retrieval and document retrieval are two valuable and complementary techniques that, however, have never been fully integrated. In this dissertation, we introduce Integrated Information Retrieval (IIR), a conceptually new retrieval approach that closes this gap. We present syntax and semantics of an extension of the XQuery language called XQuery/IR. The extended language realizes IIR based on the Extensible Markup Language (XML) and allows users to formulate new kinds of valuable queries by nesting ranked document retrieval and precise data retrieval sub-queries. Furthermore, we detail index structures and efficient query processing approaches for implementing XQuery/IR. Based on a new identification scheme for nodes in a node-labeled tree structure such as underlying XML, the index structures require only a fraction of the space of comparable, existing index structures for just data retrieval. For semistructured data such as XML data, we also present a first distribution design approach, whose realization confirms the value of the new node identification and indexing scheme for applications beyond Integrated Information Retrieval.
