TL;DR: In this invention, a term-to-concept vector is generated for each term in the document corpus, describing the relationship between that term and each of the concept vectors; a term-term matrix built from these vectors is then used to retrieve information such as the fact that a first term is related to a second term.
Abstract: The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431 - 438), documents and document corpora, where each document contains at least one term (431 - 438) and each document corpus contains at least one document. Based on a concept vector (420 - 424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431 - 438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420 - 424). On the basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431 - 438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, such as the fact that a first term (431) is related to a second term (436).
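The pipeline in this abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so two rules here are assumptions rather than the invention's method: a term-to-concept vector is taken to be the sum of the concept vectors of the documents containing the term, and term-to-term relatedness is taken to be the cosine of two term-to-concept vectors. The toy corpus, the concept vectors and all function names are hypothetical.

```python
import math

# Hypothetical toy corpus: each document is a list of terms.
docs = {
    "d1": ["engine", "car", "wheel"],
    "d2": ["car", "road"],
    "d3": ["bank", "loan"],
}
# Hypothetical concept vectors, one per document, over two concepts
# (vehicles, finance).
concept_vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
}


def term_to_concept(docs, concept_vectors):
    """Assumed rule: sum the concept vectors of all documents containing a term."""
    vectors = {}
    for doc_id, terms in docs.items():
        doc_vec = concept_vectors[doc_id]
        for term in set(terms):
            acc = vectors.setdefault(term, [0.0] * len(doc_vec))
            for i, weight in enumerate(doc_vec):
                acc[i] += weight
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_term_matrix(t2c):
    """Assumed rule: relatedness of two terms is the cosine of their vectors."""
    terms = sorted(t2c)
    return {a: {b: cosine(t2c[a], t2c[b]) for b in terms} for a in terms}


t2c = term_to_concept(docs, concept_vectors)
matrix = term_term_matrix(t2c)
```

In this toy data, `matrix["car"]["road"]` is much larger than `matrix["car"]["loan"]`, which is the kind of "first term is related to second term" fact the abstract mentions.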
TL;DR: This paper shows further how the Topic-based Vector Space Model can be fully implemented within the context of relational databases and facilitates the use of this approach by generic applications.
Abstract: This paper motivates and presents the Topic-based Vector Space Model (TVSM), a new vector-based approach for document comparison. The approach does not assume independence between terms and it is flexible regarding the specification of term-similarities. Stopword-list, stemming and thesaurus can be fully integrated into the model. This paper shows further how the TVSM can be fully implemented within the context of relational databases. This facilitates the use of this approach by generic applications. At the end, short comparisons with other vector-based approaches, namely the Vector Space Model (VSM) and the Generalized Vector Space Model (GVSM), are presented.
TL;DR: In this article, a probabilistic model for the retrieval of multimodal documents is presented, which is based on Bayesian decision theory and combines models for text-based search with models for visual search.
Abstract: We present a probabilistic model for the retrieval of multimodal documents. The model is based on Bayesian decision theory and combines models for text-based search with models for visual search. The textual model is based on the language modelling approach to text retrieval, and the visual information is modelled as a mixture of Gaussian densities. Both models have proved successful on various standard retrieval tasks. We evaluate the multimodal model on the search task of TREC's video track. We found that the disclosure of video material based on visual information only is still too difficult. Even with purely visual information needs, text-based retrieval still outperforms visual approaches. The probabilistic model is useful for text, visual, and multimedia retrieval. Unfortunately, simplifying assumptions that reduce its computational complexity degrade retrieval effectiveness. Regarding the question whether the model can effectively combine information from different modalities, we conclude that whenever both modalities yield reasonable scores, a combined run outperforms the individual runs.
TL;DR: This work uses kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text and compares the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
Abstract: We use kernel Canonical Correlation Analysis to learn a semantic representation of web images and their associated text. In the application we look at two approaches of retrieving images based only on their content from a text query. The semantic space provides a common representation and enables a comparison between the text and image. We compare the approaches against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
TL;DR: This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus.
Abstract: This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus. The extracted terms are then used to estimate the similarity of papers in the computer science corpus using the standard Vector Space Model. The precision of retrieval using a term-based representation is compared with that of a word-based representation, and a link-based similarity metric based on the overlap of the local neighborhoods of the papers in the citation graph. The term-based approach offers comparable performance to the word-based approach, but potentially with a much smaller vocabulary size. Automatic term extraction in special text corpora is an interesting problem, which is becoming relevant as literature in specific scientific fields such as medicine, biology and computer science explodes, making it difficult to track the evolving terminology in the fields [Kageura and Umino 1996]. Early approaches to automatic term extraction were focused on information-theoretic approaches based on mutual information in detecting collocations [Manning and Schuetze 1999]. Collocations are expressions that are composed of two or more words, the meaning of which is not easy to guess from the meanings of the component words. There are nuances in the detection of collocation that require linguistic criteria to resolve [Justeson and Katz 1995]. Shallow linguistic criteria are based on acceptable sequences of part-of-speech tags. Part-of-speech tagging can be performed automatically [Brill 1992]. A key problem is that of nesting, where subsets of consecutive words of terms consisting of multiple words would satisfy the statistical criteria for "termhood", but they would not be called terms.
In the first part of this paper, we describe experiments with a state-of-the-art method, C-value/NC-value [Frantzi et al. 2000], which combines statistical and linguistic information for automatic term extraction. We applied it to a special text corpus of computer science articles, which is of a different nature from the medical corpus on which the method was originally tested. We confirmed that the performance of the method is equally good on our corpus, and we identified some adjustments that the method required. In the second part of this paper, we use the terms extracted to estimate the similarity between two documents. We evaluate the quality of the similarity estimation based on terms in an information retrieval context. It is broadly believed that it is difficult to improve upon the bag-of-words representation as far as retrieval performance is concerned by using more sophisticated features or shallow linguistic techniques. Although retrieval based on terms did not show significant improvement over a bag-of-words representation, our long-term objective is to cluster special text corpora into subareas, and automatically generate lexical ontologies from the clusters [Ayad and Kamel 2002]. Terms in this context are of interest in themselves, and not purely as a vehicle to information retrieval. We are, furthermore, interested in similarity criteria taking into account proximity of terms [Koubarakis 2001], for which again it is essential to work with terms, not words. The use of terms instead of words may also be preferable in information dissemination, where given a database of profiles (of c
TL;DR: Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page.
Abstract: In IR (information retrieval) systems based on the vector space model, the TF-IDF scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting the contents of their hyperlinked neighboring pages. In this paper, we first propose several approaches to refining the TF-IDF scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of our proposed approaches. Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page.
TL;DR: A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes, which tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.
Abstract: Web-documents have a number of tags indicating the structure of texts. Text segments marked by HTML tags have specific meaning which can be utilized to improve the performance of document retrieval systems. In this paper, we present a machine learning approach to mine the structure of HTML documents for effective Web-document retrieval. A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. Experimental evidence supports that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval, and indicates that the proposed approach significantly improves the performance in retrieval accuracy. In particular, the use of the document-structure mining approach tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.
TL;DR: A new kind of kernel that is a generalization of Gaussian kernel is proposed to resolve the issue of infeasible kernels in region-based image retrieval systems that use variable-length representations.
Abstract: In this paper, the application of support vector machines (SVM) in relevance feedback for region-based image retrieval is investigated. Both the one-class SVM as a class distribution estimator and the two-class SVM as a classifier are taken into account. For the latter, two representative display strategies are studied. Since the common kernels often rely on the inner product or the L_p norm in the input space, they are infeasible in region-based image retrieval systems that use variable-length representations. To resolve the issue, a new kind of kernel that is a generalization of the Gaussian kernel is proposed. Experimental results on a database of 10,000 general-purpose images demonstrate the effectiveness and robustness of the proposed approach.
TL;DR: Comparisons on relevance feedback genetic techniques that follow the vector space model and one of the best traditional methods of relevance feedback--the Ide dec-hi method are carried out.
Abstract: The present work is the continuation of an earlier study which reviewed the literature on relevance feedback genetic techniques that follow the vector space model (the model that is most commonly used in this type of application), and implemented them so that they could be compared with each other as well as with one of the best traditional methods of relevance feedback--the Ide dec-hi method. We here carry out the comparisons on more test collections (Cranfield, CISI, Medline, and NPL), using the residual collection method for their evaluation as is recommended in this type of technique. We also add some fitness functions of our own design.
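For readers unfamiliar with the baseline named here, the Ide dec-hi update is commonly stated as: add every judged-relevant document vector to the query, then subtract only the highest-ranked non-relevant document. The sketch below follows that common formulation; the sparse dictionary representation, the function name and the toy data are assumptions for illustration, not taken from the paper.

```python
def ide_dec_hi(query, ranked_docs, relevant_ids):
    """Return a reweighted query using the Ide dec-hi rule.

    query        -- dict mapping term -> weight
    ranked_docs  -- list of (doc_id, term-weight dict) in retrieval rank order
    relevant_ids -- set of doc ids the user judged relevant
    """
    new_query = dict(query)

    # Add every judged-relevant document vector.
    for doc_id, vector in ranked_docs:
        if doc_id in relevant_ids:
            for term, weight in vector.items():
                new_query[term] = new_query.get(term, 0.0) + weight

    # Subtract only the highest-ranked non-relevant document.
    for doc_id, vector in ranked_docs:
        if doc_id not in relevant_ids:
            for term, weight in vector.items():
                new_query[term] = new_query.get(term, 0.0) - weight
            break

    return new_query


# Hypothetical toy example.
query = {"genetic": 1.0}
ranked = [
    ("d1", {"genetic": 1.0, "algorithm": 1.0}),
    ("d2", {"genetic": 1.0, "disease": 1.0}),
    ("d3", {"algorithm": 1.0, "feedback": 1.0}),
    ("d4", {"disease": 1.0}),
]
new_query = ide_dec_hi(query, ranked, relevant_ids={"d1", "d3"})
```

Here `d2` is the top-ranked non-relevant document, so it is the only one subtracted; `d4` has no effect. Some implementations additionally clip negative weights to zero, which this sketch does not do.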
TL;DR: It is shown how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight into the main problems with the visual part of the retrieval model under study.
Abstract: The main conclusion from the metrics-based evaluation of video retrieval systems at TREC's video track is that non-interactive image retrieval from general collections using visual information only is not yet feasible. We show how a detailed analysis of retrieval results -- looking beyond mean average precision (MAP) scores on topical relevance -- gives significant insight into the main problems with the visual part of the retrieval model under study. Such an analytical approach proves an important addition to standard evaluation measures.
TL;DR: A personalized search algorithm by using content-based filtering is presented in this paper, where the user model is represented as the probability distribution over the domain classification model.
Abstract: Traditional information retrieval technologies satisfy users' needs to a great extent. However, because of their all-purpose nature, they cannot satisfy every query from users with different backgrounds, with different intentions and at different times. A personalized search algorithm using content-based filtering is presented in this paper. The user model is represented as the probability distribution over the domain classification model. A method of computing similarity and a method of revising the user model are provided. Compared with the vector space model, the probability model is more effective at describing a user's interests.
TL;DR: The merging process is based on the clustering concept where categories with similar characteristics are merged into the same cluster under certain distributed constraints and a performance measure called cluster entropy is proposed, which determines how well the categories from different sources are distributed over the resulting clusters.
TL;DR: This work uses kernel Canonical Correlation Analysis to learn a semantic representation of Web images and their associated text and compares against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
Abstract: We use kernel Canonical Correlation Analysis to learn a semantic representation of Web images and their associated text. This representation is used in two applications. In the first application we consider classification of images into one of three categories. We use SVM in the semantic space and compare against the SVM on raw data and against previously published results using ICA. In the second application we retrieve images based only on their content from a text query. The semantic space provides a common representation and enables a comparison between the text and image. We compare against a standard cross-representation retrieval technique known as the Generalised Vector Space Model.
TL;DR: These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one and the approximation quality of the different mapping functions is compared.
Abstract: Information Retrieval systems typically sort the result with respect to document retrieval status values (RSV). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonously increasing with the probabilities of relevance (as e.g. for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a “normalisation” function which maps the retrieval status value onto the probability of relevance (“mapping functions”). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
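The two families of mapping functions named in this abstract can be written down in a few lines. This is a sketch of the general shapes only: the parameter names and values are hypothetical, and the abstract does not say how the parameters are fitted, so no fitting step is shown.

```python
import math


def linear_mapping(rsv, c0, c1):
    """Linear map from a retrieval status value to a probability, clipped to [0, 1]."""
    return min(1.0, max(0.0, c0 + c1 * rsv))


def logistic_mapping(rsv, b0, b1):
    """Logistic map from a retrieval status value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * rsv)))


# Hypothetical RSVs and parameters, for illustration only.
rsvs = [0.1, 0.5, 0.9]
linear_probs = [linear_mapping(r, 0.0, 1.2) for r in rsvs]
logistic_probs = [logistic_mapping(r, -4.0, 8.0) for r in rsvs]
```

With a positive slope, both functions increase monotonically with the RSV, so the document ranking is unchanged; what changes is that scores from different collections become comparable as probability estimates, which is what merging in distributed retrieval needs.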
TL;DR: This paper addresses three questions about the Language Modelling (LM) approach to information retrieval, which arise because relevance, long taken as fundamental to retrieval and explicit in the probabilistic and vector space models, does not formally figure in the LM account at all.
Abstract: This paper addresses three questions about the Language Modelling (LM) approach to information retrieval. These questions are about LM and relevance. They arise because relevance has always been taken as fundamental to information retrieval (see, e.g. Saracevic [10] or Mizzaro [7]). Thus from the standpoint of retrieval theory, the presumption has been that as relevance is the key notion in retrieval (for how could it not be?), this should be explicitly recognised in any formal model of retrieval. The Probabilistic Model (PM) of retrieval does this very clearly. Turtle and Croft [13, 1] present the INQUERY network model as a probabilistic classifier testing for whether the hypothesis of relevance holds given the evidence supplied by document and query. The Vector Space Model assumes that some of the points in the multidimensional information space of a retrieval system represent relevant documents. But the LM account of what retrieval is about seems quite different: relevance does not formally figure in it at all. A retrieval model that does not mention relevance appears paradoxical. But the form in which the LM account is expressed immediately provokes the question:
TL;DR: In this paper, the authors describe the automatic keyword extraction (AKE) process for news characterization that uses several linguistic techniques to improve the current state of the text-based information retrieval.
Abstract: Newspapers are one of the most challenging domains for information retrieval systems: new articles appear every day, written in different languages and with multimedia contents, and the news repositories may be updated in a matter of hours, so information extraction is crucial to the metadata contents of the news. Further approaches to “smart retrieval” have to cope with multimedia and multilingual features, and have to achieve very good precision in order to reach a high degree of user satisfaction with the retrieved documents. The focus of the paper is the description of the automatic keyword extraction (AKE) process for news characterization that uses several linguistic techniques to improve the current state of text-based information retrieval. The first prototype implemented, focusing on the AKE process (www.omnipaper.org), is described and some relevant performance features are included. Finally, some conclusions and comments are given regarding the role of linguistic engineering in the web era.
TL;DR: A retrieval system based on latent semantic indexing (LSI) is presented, intended for practical industrial use within a control and information system operating at a coking plant at Nova hut Ostrava.
Abstract: A retrieval system based on latent semantic indexing (LSI) is presented in our paper. LSI is usually based on singular value decomposition, which is a time-consuming computation. A new, fast algorithm for computing the LSI was developed. The algorithm was tested on text and image data collections. Experimental results are given. The algorithms are intended for practical industrial use within a control and information system operating at a coking plant at Nova hut Ostrava. A description of the industrial environment is given.
TL;DR: This paper describes methods to improve a single classifier: boosting, discriminative training (DT) and automatic relevance feedback (ARF), and explores ways of deriving and combining uncorrelated classifiers in order to improve accuracy.
TL;DR: The main goal of this note is to introduce the notion of collection dependent “same context words”, and briefly discuss further possible applications of the introduced concepts to a number of information retrieval related problems.
Abstract: The main goal of this note is to introduce the notion of collection dependent “same context words”. Two (or more) words are the “same context words” if they occur in the same (or similar) context across a given text collection. Each word w in the collection is associated with a profile P(w). The profile P(w) is the set of words occurring in sentences that contain w. We introduce a distance function in the set profiles, and use it to cluster words. Words contained in the same cluster are “same context words”. We select “same context words” for several text collections, and briefly discuss further possible applications of the introduced concepts to a number of information retrieval related problems.
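The profile construction described in this abstract can be sketched directly. Three choices below are assumptions, since the abstract does not specify them: the word itself is excluded from its own profile, the distance between profiles is the Jaccard distance, and clustering is replaced by a simple listing of word pairs whose distance falls under a threshold. The sentences and function names are hypothetical.

```python
from itertools import combinations


def build_profiles(sentences):
    """Map each word w to P(w): the words co-occurring with w in some sentence."""
    profiles = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for w in words:
            # Assumption: w is left out of its own profile.
            profiles.setdefault(w, set()).update(words - {w})
    return profiles


def jaccard_distance(a, b):
    """Assumed profile distance: 1 - |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def same_context_pairs(profiles, threshold):
    """List word pairs whose profiles are within the given distance."""
    return [
        (u, v)
        for u, v in combinations(sorted(profiles), 2)
        if jaccard_distance(profiles[u], profiles[v]) <= threshold
    ]


# Hypothetical toy collection.
sentences = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the bank approved the loan",
]
profiles = build_profiles(sentences)
pairs = same_context_pairs(profiles, threshold=0.2)
```

In this toy collection, "doctor" and "physician" have identical profiles and are the only pair returned at a threshold of 0.2. A real clustering step would group more than two words at a time.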
TL;DR: A Vector Space Model is proposed as a new methodological approach which allows the relationships between the elements of the Triple Helix Model (University, Industry, Government) to be presented in a spatial model, using the webpages of the National Research Councils of Germany and Spain as examples.
Abstract: The aim of this paper is to propose a Vector Space Model as a new methodological approach which allows us to present the relationships between the elements of the Triple Helix Model (University, Industry, Government) in a spatial model by using the webpages of the National Research Councils of Germany and Spain as examples. Outlinks of the Biomedicine and Biology centres of these national councils were analysed with the intention of representing these relationships graphically through the Vector Space Model, which allows for Multidimensional Scaling in three dimensions. Results show a map with the differences and similarities between the Spanish and German cases. It may be concluded that these results could become a qualitative indicator of a scientific and technical reality.
TL;DR: The thesis is that a generative document model devised from computational analysis of actual corpora and queries will be more accurate than existing models, and thus more useful in retrieval as well as other applications; this is tested by learning the best document model from a corpus.
Abstract: Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.
TL;DR: In this article, the problem of presenting data having a meaning close to a retrieval condition as a retrieval result when no information matched to the retrieval condition is present is solved by generating a re-retrieval item condition.
Abstract: PROBLEM TO BE SOLVED: To present data having a meaning close to a retrieval condition as a retrieval result when no information matched to the retrieval condition is present. SOLUTION: A range determination means 4 determines a range of retrieval based on position information 9a. A retrieval means 5 retrieves position dependent information matched to an inputted item condition 9b from a position dependent information database 2. A re-retrieval item condition generation means 6 traces back from a retrieval object classification item designated by the item condition 9b to an upper layer when the retrieval result satisfies a predetermined condition. The re-retrieval item condition generation means 6 sets another conception included in the classification item of the upper layer as a new retrieval object classification item to generate a re-retrieval item condition 9c. A re-retrieval means 7 performs a retrieval according to the re-retrieval item condition 9c. According to this, when no information matched to the retrieval condition is present, other information included in a conception wider than the retrieval condition is presented as the retrieval result. COPYRIGHT: (C)2005,JPO&NCIPI
TL;DR: A review of the development of statistically-based retrieval can be found in this paper, where the authors present a survey of statistical methods for document or text retrieval and their application in information and language processing.
Abstract: This paper reviews the development of statistically-based retrieval. Since the 1950s statistical techniques have clearly demonstrated their practical worth and statistical theories their staying power, for document or text retrieval. In the last decade the TREC programme, and the Web, have offered new retrieval challenges to which these methods have successfully risen. They are now one element in the much wider and very productive spread of statistical methods to all areas of information and language processing, in which innovative approaches to modelling their data and tasks are being applied.
TL;DR: The experimental results show that the proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements, which can improve the retrieval effectiveness of the information retrieval system.
Abstract: An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and query terms have been a serious obstacle to the enhancement of retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and their application to a Korean information retrieval system. We perform three new types of term normalizations: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm and cooccurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide the appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.
TL;DR: This article focuses on speeding up the information retrieval process in Arabic document base by using a root-based hierarchical indexing model, and results demonstrated that speed gain in the range of 50-100 can be achieved for typical queries.
Abstract: The world has recently witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Text data mining, as a m...
TL;DR: The key observation is that kernel metrics may be nonlinear and highly dynamic in the input space but remain Euclidean in induced feature space, which enables us to learn arbitrary relevance functions without changing the index in feature space.
Abstract: Relevance feedback is an attractive approach to developing flexible metrics for content-based retrieval in image and video databases. Large image databases require an index structure in order to reduce nearest neighbor computation. However, flexible metrics can alter an input space in a highly nonlinear fashion, thereby rendering the index structure useless. Few systems have been developed that address the apparent flexible metric/indexing dilemma. This paper proposes kernel indexing to try to address this dilemma. The key observation is that kernel metrics may be nonlinear and highly dynamic in the input space but remain Euclidean in induced feature space. It is this linear invariance in feature space that enables us to learn arbitrary relevance functions without changing the index in feature space. As a result, kernel indexing supports efficient relevance feedback retrieval in large image databases. Experimental results using a large set of image data are very promising.
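The "key observation" rests on a standard identity: the squared Euclidean distance between two points in the feature space induced by a kernel k can be computed from kernel values alone, as k(x, x) - 2 k(x, y) + k(y, y). The sketch below illustrates that identity with a Gaussian kernel; the kernel choice, the `gamma` parameter and the sample points are assumptions for illustration, and the paper's indexing structure is not reproduced.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared)


def feature_space_distance(x, y, kernel):
    """Euclidean distance between the feature-space images of x and y."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, squared))


# Hypothetical sample points.
p = (0.0, 0.0)
q = (1.0, 0.0)
d_pq = feature_space_distance(p, q, gaussian_kernel)
```

The distance is nonlinear in the input coordinates, yet it is an ordinary Euclidean distance between feature-space images, which is why an index built for Euclidean nearest-neighbour search can in principle still apply.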
TL;DR: This work applies active learning techniques based on the Support Vector Machine to evaluate successive batches of documents (relevance feedback), and experiments indicate that the proposed approach is very useful for document retrieval.
Abstract: We investigate the following data mining problem in document retrieval: from a large data set of documents, we need to find documents that relate to a human's interests in as few iterations of human testing or checking as possible. In each iteration, a comparatively small batch of documents is evaluated for its relation to those interests. We apply active learning techniques based on the Support Vector Machine to evaluate successive batches, a process called relevance feedback. Experimentally, our proposed approach is very useful for document retrieval with relevance feedback.
TL;DR: The proposed method modifies the weights of document terms in document descriptor vectors based on the user’s relevance feedback to increase the retrieval effectiveness of the fuzzy information retrieval systems for document retrieval.
Abstract: In this paper, we present a new method for fuzzy information retrieval based on document term reweighting techniques. The proposed method modifies the weights of document terms in document descriptor vectors based on the user’s relevance feedback. After modifying the weights of terms in document descriptor vectors, the degrees of satisfaction of relevant documents with respect to the user’s query will increase, and the degrees of satisfaction of irrelevant documents with respect to the user’s query will decrease. Then, the modified document descriptor vectors can be used as personal profiles for future query processing. The proposed method can make fuzzy information retrieval systems more flexible and more intelligent in dealing with document retrieval. It can increase the retrieval effectiveness of fuzzy information retrieval systems for document retrieval.
TL;DR: This work has shown that N-gram, specifically, bigram language models that capture context provided by the previous word(s) perform better than unigram models [7].
Abstract: Recent advances in Information Retrieval are based on using Statistical Language Models (SLM) for representing documents and evaluating their relevance to user queries [6, 3, 4]. Language Modeling (LM) has been explored in many natural language tasks including machine translation and speech recognition [1]. In LM approach to document retrieval, each document, D, is viewed to have its own language model, M_D. Given a query, Q, documents are ranked based on the probability, P(Q|M_D), of their language model generating the query. While the LM approach to information retrieval has been motivated from different perspectives [3, 4], most experiments have used smoothed unigram language models that assume term independence for estimating document language models. N-gram, specifically, bigram language models that capture context provided by the previous word(s) perform better than unigram models [7]. Biterm language models [8] that ignore the word order constraint in bigram language models have been shown to perform better than bigram models. However, word order constraint cannot always be relaxed since a blind venetian is not a venetian blind. Term dependencies can be measured using their co-occurrence statistics. Nallapati and Allan [5] represent term dependencies in a sentence using a maximum spanning tree and generate a sentence tree language model for the story link detection task in TDT. Syntactic parse of user queries can provide clues for when the word order constraint can be relaxed. Syn-
TL;DR: This paper introduces integrated information retrieval (IIR), an XML-based retrieval approach that closes the gap between structured and semistructured data retrieval and document retrieval, and introduces the syntax and semantics of an extension of the XQuery language called XQuery/IR.
Abstract: Data retrieval comprises exact queries that allow a user to specify a precisely defined subset of a data source. Document retrieval arranges elements of a given document collection according to their relevance to a set of query terms. For querying structured and semistructured data, data retrieval and document retrieval are two valuable and complementary techniques that, however, have never been fully integrated.
In this dissertation, we introduce Integrated Information Retrieval (IIR), a conceptually new retrieval approach that closes this gap. We present syntax and semantics of an extension of the XQuery language called XQuery/IR. The extended language realizes IIR based on the Extensible Markup Language (XML) and allows users to formulate new kinds of valuable queries by nesting ranked document retrieval and precise data retrieval sub-queries. Furthermore, we detail index structures and efficient query processing approaches for implementing XQuery/IR. Based on a new identification scheme for nodes in a node-labeled tree structure such as underlying XML, the index structures require only a fraction of the space of comparable, existing index structures for just data retrieval. For semistructured data such as XML data, we also present a first distribution design approach, whose realization confirms the value of the new node identification and indexing scheme for applications beyond Integrated Information Retrieval.