Top 104 papers published in the topic of Vector space model in 2001

Showing papers on "Vector space model" published in 2001

Journal Article•

Bipartite graph partitioning and data clustering

Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon

07 May 2001-Lawrence Berkeley National Laboratory

TL;DR: In this article, a data clustering method based on bipartite graph partitioning is proposed, in which terms and documents are clustered simultaneously.

Abstract: Many data types arising from data mining applications can be modeled as bipartite graphs; examples include terms and documents in a text corpus, customers and purchasing items in market basket analysis, and reviewers and movies in a movie recommender system. In this paper, we propose a new data clustering method based on partitioning the underlying bipartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. We show that an approximate solution to the minimization problem can be obtained by computing a partial singular value decomposition (SVD) of the associated edge weight matrix of the bipartite graph. We point out the connection of our clustering algorithm to correspondence analysis used in multivariate analysis. We also briefly discuss the issue of assigning data objects to multiple clusters. In the experimental results, we apply our clustering algorithm to the problem of document clustering to illustrate its effectiveness and efficiency.

1. INTRODUCTION Cluster analysis is an important tool for exploratory data mining applications arising from many diverse disciplines. Informally, cluster analysis seeks to partition a given data set into compact clusters so that data objects within a cluster are more similar than those in distinct clusters. The literature on cluster analysis is enormous, including contributions from many research communities (see [6, 9] for recent surveys of some classical approaches).

Many traditional clustering algorithms are based on the assumption that the given dataset consists of covariate information (or attributes) for each individual data object, and cluster analysis can be cast as a problem of grouping a set of n-dimensional vectors, each representing a data object in the dataset. A familiar example is document clustering using the vector space model [1]. Here each document is represented by an n-dimensional vector, and each coordinate of the vector corresponds to a term in a vocabulary of size n. This formulation leads to the so-called term-document matrix $A = (a_{ij})$ for the representation of the collection of documents, where $a_{ij}$ is the so-called term frequency, i.e., the number of times term i occurs in document j. In this vector space model, terms and documents are treated asymmetrically, with terms considered as the covariates or attributes of documents. It is also possible to treat both terms and documents as first-class citizens in a symmetric fashion, and consider $a_{ij}$ as the frequency of co-occurrence of term i and document j, as is done, for example, in probabilistic latent semantic indexing [12].

In this paper, we follow this basic principle and propose a new approach to model terms and documents as vertices in a bipartite graph, with edges of the graph indicating the co-occurrence of terms and documents. In addition, we can optionally use edge weights to indicate the frequency of this co-occurrence. Cluster analysis for document collections in this context is based on a very intuitive notion: documents are grouped by topics; on one hand, documents in a topic tend to more heavily use the same subset of terms, which form a term cluster, and on the other hand, a topic usually is characterized by a subset of terms, and those documents heavily using those terms tend to be about that particular topic. It is this interplay of terms and documents which gives rise to what we call bi-clustering, by which terms and documents are simultaneously grouped into semantically co-… Our clustering algorithm computes an approximate global optimal solution, while probabilistic latent semantic indexing relies on the EM algorithm and therefore might be prone to local minima even with the help of some annealing process.

Authors and affiliations: Hongyuan Zha and Xiaofeng He*, Dept. of Comp. Sci. & Eng., Penn State Univ., State College, PA 16802, {zha,xhe}@cse.psu.edu; Chris Ding and Horst Simon, NERSC Division, Berkeley National Lab., Berkeley, CA 94720, {chqding,hdsimon}@lbl.gov; Ming Gu, Dept. of Math., U.C. Berkeley, Berkeley, CA 94720, mgu@math.berkeley.edu. *Part of this work was done while Xiaofeng He was a graduate research assistant at NERSC, Berkeley National Lab.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering; G.1.3 [Numerical Linear Algebra]: Singular value decomposition; G.2.2 [Graph Theory]: Graph algorithms. General Terms: Algorithms, theory. Keywords: document clustering, bipartite graph, graph partitioning, spectral relaxation, singular value decomposition, correspondence analysis.

CIKM '01, November 5-10, 2001, Atlanta, Georgia, USA. Copyright 2001 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

295 citations
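
The abstract above says an approximate partition can be read off a partial SVD of the edge weight matrix. The following is a minimal sketch of one such spectral relaxation, not the authors' code. It assumes the weight matrix is scaled by the square roots of the row and column degrees, that the second singular vector pair is found by power iteration with the known leading vector removed, and that clusters are assigned by sign. The function name `bipartite_bicluster` and the toy matrix are hypothetical.

```python
import math

def bipartite_bicluster(W, iters=200):
    """Split terms (rows) and documents (columns) of W into two co-clusters."""
    m, n = len(W), len(W[0])
    dx = [sum(row) for row in W]                              # term degrees
    dy = [sum(W[i][j] for i in range(m)) for j in range(n)]   # document degrees
    # Degree-scaled weights: What = Dx^(-1/2) W Dy^(-1/2)  (assumed scaling)
    What = [[W[i][j] / math.sqrt(dx[i] * dy[j]) for j in range(n)]
            for i in range(m)]
    # The leading right singular vector of What is sqrt(dy), normalised.
    total = math.sqrt(sum(dy))
    v1 = [math.sqrt(d) / total for d in dy]

    def deflate_and_normalise(v):
        # Remove the component along v1, then rescale to unit length.
        proj = sum(a * b for a, b in zip(v, v1))
        v = [a - proj * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    # Power iteration on What^T What, restricted to the complement of v1,
    # converges to the second right singular vector.
    v = deflate_and_normalise([float(j + 1) for j in range(n)])
    for _ in range(iters):
        u = [sum(What[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(What[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = deflate_and_normalise(v)
    u = [sum(What[i][j] * v[j] for j in range(n)) for i in range(m)]
    # Sign split: a simple stand-in for a more careful choice of cut point.
    return [int(a >= 0) for a in u], [int(a >= 0) for a in v]

# Toy term-document weights: two blocks joined by one weak edge.
W = [[2, 1, 0, 0],
     [1, 2, 0.1, 0],
     [0, 0, 2, 1],
     [0, 0, 1, 2]]
term_labels, doc_labels = bipartite_bicluster(W)
```

The sign of a singular vector is arbitrary, so only co-membership is meaningful: which terms and documents share a label, not which label they carry.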

Book Chapter•10.1007/978-1-4615-1733-7_20•

Efficient Clustering of Very Large Document Collections

Inderjit S. Dhillon, James Fan, Yuqiang Guan

1 Jan 2001

TL;DR: This paper presents a time and memory efficient technique for the entire clustering process, including the creation of the vector space model, and demonstrates how this efficiency is obtained by a memory-efficient multi-threaded preprocessing scheme and a fast clustering algorithm that fully exploits the sparsity of the data set.

Abstract: An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.

292 citations

Proceedings Article•10.1109/SPIRE.2001.989733•

Distributed query processing using partitioned inverted files

Claudine Badue¹, Berthier Ribeiro-Neto, Ricardo Baeza-Yates, Nivio Ziviani•Institutions (1)

Universidade Federal de Minas Gerais¹

13 Nov 2001

TL;DR: Experimental results on retrieval efficiency show that, within the framework, global index partitioning outperforms local index partitioning in the distributed system.

Abstract: In this paper, we study query processing in a distributed text database. The novelty is a real distributed architecture implementation that offers concurrent query service. The distributed system adopts a network of workstations model and the client-server paradigm. The document collection is indexed with an inverted file. We adopt two distinct strategies of index partitioning in the distributed system, namely local index partitioning and global index partitioning. In both strategies, documents are ranked using the vector space model along with a document filtering technique for fast ranking. We evaluate and compare the impact of the two index partitioning strategies on query processing performance. Experimental results on retrieval efficiency show that, within our framework, the global index partitioning outperforms the local index partitioning.

114 citations
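
The two partitioning strategies named in the abstract above can be illustrated with a small sketch. It assumes that local partitioning splits the documents across servers (each server indexes its own share) and that global partitioning builds one index and splits its terms across servers. Scoring here is a plain sum of term frequencies, a deliberate simplification of the vector space ranking and document filtering the paper describes. All function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted file: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def local_partition(docs, k):
    """Each server indexes only its own subset of the documents."""
    shards = [{} for _ in range(k)]
    for doc_id, text in docs.items():
        shards[doc_id % k][doc_id] = text
    return [build_index(shard) for shard in shards]

def global_partition(docs, k):
    """One index over all documents, with its terms divided among servers."""
    full = build_index(docs)
    servers = [{} for _ in range(k)]
    for pos, term in enumerate(sorted(full)):
        servers[pos % k][term] = full[term]
    return servers

def run_query(servers, terms):
    """Merge partial scores; also count servers that held relevant postings."""
    scores = defaultdict(int)
    servers_with_postings = 0
    for index in servers:
        hits = [index[t] for t in terms if t in index]
        if hits:
            servers_with_postings += 1
        for postings in hits:
            for doc_id, tf in postings.items():
                scores[doc_id] += tf
    return dict(scores), servers_with_postings

docs = {0: "vector space model", 1: "inverted file index",
        2: "vector index", 3: "space model model"}
local_scores, local_hit = run_query(local_partition(docs, 2), ["model"])
global_scores, global_hit = run_query(global_partition(docs, 2), ["model"])
```

Both layouts return the same scores. They differ in how much work each server does: in this toy run both local servers hold postings for the query term, while only one global server does.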

Proceedings Article•10.1142/9789812799623_0035•

Predicting the sub-cellular location of proteins from text using support vector machines.

B. J. Stapley¹, Lawrence A. Kelley, Michael J.E. Sternberg•Institutions (1)

Lincoln's Inn¹

1 Dec 2001

TL;DR: An automatic method to classify the sub-cellular location of proteins based on the text of relevant medline abstracts, which out-performs support vector machines trained on amino acid composition and has comparable performance to rule-based text classifiers.

Abstract: We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant medline abstracts. For each protein, a vector of terms is generated from medline abstracts in which the protein/gene's name or synonym occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and to thus discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. No prior knowledge of the problem domain nor any natural language processing is used at any stage. The method out-performs support vector machines trained on amino acid composition and has comparable performance to rule-based text classifiers. Combining text with protein amino-acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.

93 citations

Patent•

Method for adapting a k-means text clustering to emerging data

William Scott Spangler¹•Institutions (1)

IBM¹

20 Sep 2001

TL;DR: A method and structure for clustering documents in datasets is described, which includes clustering first documents in a first dataset to produce first document classes, creating centroid seeds based on the first document classes, and clustering second documents in a second dataset using the centroid seeds, wherein the first dataset and the second dataset are related.

Abstract: A method and structure for clustering documents in datasets which include clustering first documents in a first dataset to produce first document classes, creating centroid seeds based on the first document classes, and clustering second documents in a second dataset using the centroid seeds, wherein the first dataset and the second dataset are related. The clustering of the first documents in the first dataset forms a first dictionary of most common words in the first dataset and generates a first vector space model by counting, for each word in the first dictionary, a number of the first documents in which the word occurs, and clusters the first documents in the first dataset based on the first vector space model, and further generates a second vector space model by counting, for each word in the first dictionary, a number of the second documents in which the word occurs. Creation of the centroid seeds includes classifying the second vector space model using the first document classes to produce a classified second vector space model and determining a mean of vectors in each class in the classified second vector space model, the means being the centroid seeds.

92 citations
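
The seeding step described in the patent abstract above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the first document classes are summarised by their centroids, "classifying" a new document means assigning it to the nearest of those centroids, and the second clustering is ordinary k-means started from the seeds. The function names and the two-dimensional toy vectors are hypothetical.

```python
def nearest(vec, centroids):
    """Index of the centroid closest to vec in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_seeds(first_centroids, second_vectors):
    """Classify the second dataset with the first classes, then average per class."""
    groups = [[] for _ in first_centroids]
    for v in second_vectors:
        groups[nearest(v, first_centroids)].append(v)
    # Assumption: a class that receives no new documents keeps its old centroid.
    return [mean(g) if g else list(c) for g, c in zip(groups, first_centroids)]

def kmeans(vectors, seeds, iters=10):
    """Plain k-means on the second dataset, initialised from the seeds."""
    centroids = [list(s) for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return [nearest(v, centroids) for v in vectors], centroids

first_centroids = [[1, 0], [0, 1]]              # classes from the first dataset
second_vectors = [[2, 0], [3, 1], [0, 2], [1, 3]]
seeds = centroid_seeds(first_centroids, second_vectors)
labels, final_centroids = kmeans(second_vectors, seeds)
```

Starting from seeds tied to the earlier classes keeps cluster identities stable across related datasets, which is the point of the adaptation described in the abstract.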

Journal Article•10.1023/A:1011297104922•

Innovative techniques for legal text retrieval

Marie-Francine Moens¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Mar 2001-Artificial Intelligence and Law

TL;DR: An overview of the state of the art of these innovative techniques and their potential for legal text retrieval is given.

Abstract: Legal text retrieval traditionally relies upon external knowledge sources such as thesauri and classification schemes, and an accurate indexing of the documents is often manually done. As a result not all legal documents can be effectively retrieved. However, a number of current artificial intelligence techniques are promising for legal text retrieval. They sustain the acquisition of knowledge and the knowledge-rich processing of the content of document texts and information need, and of their matching. Currently, techniques for learning information needs, learning concept attributes of texts, information extraction, text classification and clustering, and text summarization need to be studied in legal text retrieval because of their potential for improving retrieval and decreasing the cost of manual indexing. The resulting query and text representations are semantically much richer than a set of key terms. Their use allows for more refined retrieval models in which some reasoning can be applied. This paper gives an overview of the state of the art of these innovative techniques and their potential for legal text retrieval.

79 citations

Proceedings Article•10.1145/500141.500261•

Indexing and retrieval of 3D models aided by active learning

Cha Zhang¹, Tsuhan Chen¹•Institutions (1)

Carnegie Mellon University¹

1 Oct 2001

TL;DR: With active learning, the system can perform better than random annotation, and the retrieval result improves rapidly with the number of annotated samples; relevance feedback is included in the system and combined with active learning, which provides better user-adaptive retrieval results.

Abstract: We demonstrate a system for indexing and retrieval of 3D models aided by active learning. We propose a new set of region-based features for 3D models. Each model is treated as a solid volume with a uniform density. Features such as the volume-surface ratio, the moment invariants and the Fourier transform coefficients are efficiently calculated from the mesh model directly. Comparable retrieval performance is achieved with other features such as the cord histogram, the 3D shape spectrum, etc. To further improve the performance, we incorporate hidden annotation into our system. We propose to use active learning to improve the annotation efficiency. We show that with active learning, the system can perform better than random annotation, and the retrieval result improves rapidly with the number of annotated samples. Moreover, relevance feedback is included in the system and combined with active learning, which provides better user-adaptive retrieval results.

79 citations

Journal Article•10.1023/A:1011359607594•

An Evaluation of Color-Spatial Retrieval Techniques for Large Image Databases

Kian-Lee Tan¹, Beng Chin Ooi¹, Chia Yeow Yee¹•Institutions (1)

National University of Singapore¹

01 May 2001-Multimedia Tools and Applications

TL;DR: An experimental evaluation of three color-spatial retrieval techniques (the signature-based technique, the partition-based algorithm and the cluster-based method) is presented; the results can serve as guidelines in selecting a suitable technique and designing a new technique.

Abstract: In a color-spatial retrieval technique, the color information is integrated with the knowledge of the colors' spatial distribution to facilitate content-based image retrieval. Several techniques have been proposed in the literature, but these works have been developed independently without much comparison. In this paper, we present an experimental evaluation of three color-spatial retrieval techniques: the signature-based technique, the partition-based algorithm and the cluster-based method. We implemented these techniques and compared them on their retrieval effectiveness and retrieval efficiency. The experimental study is performed on an image database consisting of 12,000 images. With the proliferation of image retrieval mechanisms and the lack of extensive performance study, the experimental results can serve as guidelines in selecting a suitable technique and designing a new technique.

51 citations

Proceedings Article•10.1109/IVL.2001.990862•

Exploring the nature and variants of relevance feedback

Xiang Sean Zhou¹, T.S. Huang¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

14 Dec 2001

TL;DR: The nature of the relevance feedback problem in a continuous representation space in the context of multimedia information retrieval is analyzed and the assumptions, implementations, and merits of various solutions in the literature are compared.

Abstract: We analyze the nature of the relevance feedback problem in a continuous representation space in the context of multimedia information retrieval. Emphasis is put on exploring the uniqueness of the problem and comparing the assumptions, implementations, and merits of various solutions in the literature. An attempt is made to compile a list of critical issues to consider when designing a relevance feedback algorithm. With a state-of-the-art review as the main portion, the paper also offers some novel solutions and perspectives throughout the discussion.

50 citations

Book Chapter•10.1007/3-540-45368-7_2•

Models in information retrieval

Norbert Fuhr

01 May 2001-Lecture Notes in Computer Science

TL;DR: Following Rijsbergen's approach of regarding IR as uncertain inference, models are distinguished according to the expressiveness of the underlying logic and the way uncertainty is handled to cope with the intrinsic uncertainty of IR.

Abstract: Retrieval models form the theoretical basis for computing the answer to a query. They differ not only in the syntax and expressiveness of the query language, but also in the representation of the documents. Following Rijsbergen's approach of regarding IR as uncertain inference, we can distinguish models according to the expressiveness of the underlying logic and the way uncertainty is handled. Classical retrieval models are based on propositional logic. In the vector space model, documents and queries are represented as vectors in a vector space spanned by the index terms, and uncertainty is modelled by considering geometric similarity. Probabilistic models make assumptions about the distribution of terms in relevant and nonrelevant documents in order to estimate the probability of relevance of a document for a query. Language models compute the probability that the query is generated from a document. All these models can be interpreted within a framework that is based on a probabilistic concept space. For IR applications dealing not only with texts, but also with multimedia or factual data, propositional logic is not sufficient. Therefore, advanced IR models use restricted forms of predicate logic as basis. Terminological/description logics are rooted in semantic networks and terminological languages such as KL-ONE. Datalog uses function-free Horn clauses. Probabilistic versions of both approaches are able to cope with the intrinsic uncertainty of IR.

46 citations
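
The vector space model as summarised in the abstract above (documents and queries as vectors over index terms, compared by geometric similarity) can be sketched in a few lines. This is a minimal illustration that assumes raw term counts as vector components and cosine as the similarity; real systems usually add term weighting such as tf-idf. The helper names and sample texts are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse term-count vector: only terms that occur are stored."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u                      # iterate over the shorter vector
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Document ids ordered by decreasing cosine similarity to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

docs = {"a": "vector space model retrieval",
        "b": "probabilistic retrieval model",
        "c": "horn clauses datalog"}
ranking = rank("vector space", docs)
```

Because cosine depends only on direction, a long document is not favoured merely for containing more words.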

Extraction and Disambiguation of Acronym Meaning-Pairs in Medline

James Pustejovsky, José M. Castaño, Brent H. Cochran, Maciej Kotecki, Michael Morrell, Anna Rumshisky

1 Jan 2001

TL;DR: In this article, the authors present a system called Acromed which finds acronym-meaning pairs as part of a set of information extraction tools designed for processing and extracting data from abstracts in the Medline database.

Abstract: Acronyms are widely used in biomedical and other technical texts. Understanding their meaning constitutes an important problem in the automatic extraction and mining of information from text. Moreover, an even harder problem is sense disambiguation of acronyms; that is, where a single acronym, termed a polynym, has a multiplicity of meanings, a common occurrence in the biomedical literature. In such cases, it is necessary to identify the correct corresponding sense for the polynym, which is often not directly specified in the text. Here we present a system called Acromed which finds acronym-meaning pairs as part of a set of information extraction tools designed for processing and extracting data from abstracts in the Medline database. Our strategy for finding acronym-meaning pairs differs from previous automated acronym extraction methods by incorporating shallow parsing of the text into the acronym recognition algorithm. The performance of our system has been tested with a highly diverse set of Medline texts, giving the highest results for precision and recall, thus far in the literature. We then present Polyfind, an algorithm for disambiguating polynyms, which uses a vector space model. Our disambiguation tests produced 97.62% accuracy in one test (on acronyms) and 86.6% accuracy in another (on aliases).

Journal Article•

Research and Implementation of Text Categorization System Based on VSM

Pang Jian

01 Jan 2001-Application Research of Computers

TL;DR: This paper studies several key techniques for text categorization, including the vector space model, feature extraction, and machine learning, and describes a text categorization model based on VSM.

Abstract: In recent years, information processing has become more and more important for obtaining useful information. Text categorization, the automated assigning of natural language texts to predefined categories based on their contents, is a task of increasing importance. This paper studies several key techniques for text categorization, including the vector space model, feature extraction, and machine learning. It also describes a text categorization model based on VSM, and gives the evaluations and results.

Proceedings Article•10.1109/RIDE.2001.916491•

Telcordia LSI Engine: implementation and scalability issues

Chung-Min Chen¹, N. Stoffel¹, Mike Post¹, Chumki Basu¹, Devasis Bassu¹, C. Behrens¹•Institutions (1)

Telcordia Technologies¹

1 Apr 2001

TL;DR: A production-level implementation of LSI is described, addressing scalability issues in the query engine and various components of the system, and lessons learned are presented.

Abstract: Latent Semantic Indexing (LSI), a vector space-based approach to information retrieval, has been proven to be an effective tool in correlating and retrieving relevant documents. While much work has been published on LSI, most of it addresses the algorithmic or theoretical basis of the model. Little, if any, presents implementation issues in practice. We describe a production-level implementation of LSI. The system integrates components including document collection and preprocessing, singular value decomposition (SVD), multilingual processing, and a tree-based access method for similarity querying. We discuss implementation issues encountered during the development of the system. In particular, we address scalability issues in the query engine and various components of the system, and present lessons learned.

Journal Article•10.1002/INT.1042•

Integrating WordNet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization

Mohammed Benkhalifa¹, Abdelhak Mouradi², Houssaine Bouyakhf²•Institutions (2)

Al Akhawayn University¹, Mohammed V University²

01 Aug 2001-International Journal of Intelligent Systems

TL;DR: For text categorization, this paper proposes integrating external WordNet lexical information to supplement training data for a semi‐supervised agglomerative hierarchical clustering algorithm (ssAHC), which uses a finite design set of labeled data to help agglomerative hierarchical clustering (AHC) partition a finite set of unlabeled data and then terminates without the capacity to classify other objects.

Abstract: Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application for many learning approaches, which proved effective. Nevertheless, TC provides many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which (i) uses a finite design set of labeled data to (ii) help agglomerative hierarchical clustering algorithms (AHC) partition a finite set of unlabeled data and then (iii) terminates without the capacity to classify other objects. This algorithm is the “semi-supervised agglomerative hierarchical clustering algorithm” (ssAHC). Our experiments use the Reuters 21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data. © 2001 John Wiley & Sons, Inc.

Journal Article•10.1016/S0164-1212(00)00095-9•

A fast content-based indexing and retrieval technique by the shape information in large image database

Dong-Ho Lee¹, Hyoung-Joo Kim¹•Institutions (1)

Seoul National University¹

01 Mar 2001-Journal of Systems and Software

TL;DR: An efficient content-based image retrieval (CBIR) system is presented that employs the shape information of images to facilitate the retrieval process; its image indexing method is shown to support faster retrieval than other multi-dimensional indexing methods such as the R*-tree.

Journal Article•10.1016/S0306-4573(00)00042-X•

TOFIR: a tool of facilitating information retrieval — introduce a visual retrieval model

Jin Zhang¹•Institutions (1)

University of Wisconsin–Milwaukee¹

01 Jul 2001-Information Processing and Management

TL;DR: A new method for the visualization of information retrieval is introduced, using angle attributes of a document to construct the angle–angle-based visual space and several traditional information retrieval evaluation models are visualized and interpreted.

Abstract: The paper introduces a new method for the visualization of information retrieval. Angle attributes of a document are used to construct the angle–angle-based visual space. The retrieved documents are perceived, several traditional information retrieval evaluation models are visualized and interpreted, and new non-traditional retrieval control means based on the model are explored in the two-dimensional angle display space. The impacts of different metrics on the visualization of information retrieval are discussed. Ambiguity, future research directions and other relevant issues are also addressed.

Proceedings Article•10.1145/383535.383545•

A machine learning approach to prior case retrieval

Khalid Al-Kofahi¹, Alex Tyrrell¹, Arun Vachher¹, Peter Jackson•Institutions (1)

University of Rochester¹

1 May 2001

TL;DR: A system that processes court opinions and retrieves related cases from a citator database, so that new cases can be linked to earlier ones that they impact, is described.

Abstract: We describe a system that processes court opinions and retrieves related cases from a citator database, so that new cases can be linked to earlier ones that they impact. The design of the system combines information extraction, information retrieval and machine learning techniques in a novel way. The fully implemented program is capable of performing prior case retrieval at human levels of recall and acceptable levels of precision.

Journal Article•10.1142/S0218488501000995•

An ordinal information retrieval model

Gloria Bordogna, Gabriella Pasi

01 Sep 2001-International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems

TL;DR: An ordinal Information Retrieval model is proposed, which is formalised within fuzzy set theory and is based on the notion of linguistic granules of information to represent and manage the importance of both the index terms as descriptors of the information items and the query terms as descriptors of users' needs.

Abstract: In this paper an ordinal Information Retrieval model is proposed, which is formalised within fuzzy set theory and is based on the notion of linguistic granules of information. Linguistic expressions are defined to represent and manage the importance of both the index terms as descriptors of the information items and the query terms (content selectors) as descriptors of users' needs. The advantage of this approach with respect to the (numeric) fuzzy IR models is that the query evaluation mechanism and the definition of the importance semantics are simplified.

Proceedings Article•10.1109/ICTAI.2001.974472•

An expert recommendation system using concept-based relevance discernment

T. Yukawa¹, K. Kasahara, T. Kato, T. Kita•Institutions (1)

Nippon Telegraph and Telephone¹

7 Nov 2001

TL;DR: An expert recommendation system using concept-based relevance discernment is proposed, which processes the description of a technical topic as input and then finds engineers who have a high level of expertise in that area.

Abstract: An expert recommendation system using concept-based relevance discernment is proposed. This system processes the description of a technical topic as input and then finds engineers who have a high level of expertise in that area. The technique employed is an extended vector space model that locates both technical topics and engineers in the same multi-dimensional space, and then calculates their relevance. This system can also retrieve engineers or documents that are related to a field matching a given engineer's technical interests. Such a system can be expected to play the role of a person's professional network, and be a valuable tool for knowledge management among several organizations.

Proceedings Article•10.1109/SPIRE.2001.989746•

A model for the representation and focussed retrieval of structured documents based on fuzzy aggregation

Gabriella Kazai¹, Mounia Lalmas, Thomas Rölleke¹•Institutions (1)

Queen Mary University of London¹

13 Nov 2001

TL;DR: A model for representing structured documents to allow for their focussed retrieval is presented, founded on fuzzy aggregation, an approach based on the fuzzy representation of linguistic quantifiers and ordered weighted averaging operators.

Abstract: Effective retrieval of structured documents should exploit the content and structural knowledge associated with the documents. This knowledge can be used to focus retrieval to the best entry points: document components that contain relevant information, and from which users can browse to retrieve further relevant components. To enable this, suitable representation methods must be developed. This paper presents a model for representing structured documents to allow for their focussed retrieval. The model is founded on fuzzy aggregation, an approach based on the fuzzy representation of linguistic quantifiers and ordered weighted averaging operators. By defining the representation of a document component as the fuzzy aggregation of its related components, we arrive at a document representation that supports the selection of best entry points.
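
The abstract above names ordered weighted averaging (OWA) operators and linguistic quantifiers but gives no formulas. The sketch below assumes the standard OWA definition (sort the values in decreasing order, then take a weighted sum) and the usual way of deriving weights from a nondecreasing quantifier function Q with Q(0) = 0 and Q(1) = 1. How the paper chooses its quantifiers and which components count as "related" are not stated here, so `aggregate_component` and its inputs are hypothetical.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to ranks, not to sources."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def quantifier_weights(n, q):
    """OWA weights induced by a quantifier q on [0, 1]: w_i = q(i/n) - q((i-1)/n)."""
    return [q(i / n) - q((i - 1) / n) for i in range(1, n + 1)]

def aggregate_component(child_weights, q):
    """Represent a component by aggregating the term weights of related components.

    child_weights: list of dicts mapping term -> weight in [0, 1].
    """
    terms = set().union(*child_weights)
    w = quantifier_weights(len(child_weights), q)
    return {t: owa([c.get(t, 0.0) for c in child_weights], w) for t in terms}

# A quantifier that grows slowly at first puts more weight on the lower-ranked
# values, so a term must appear in most children to score highly.
section = aggregate_component(
    [{"fuzzy": 0.8}, {"fuzzy": 0.4, "owa": 1.0}],
    lambda r: r * r,
)
```

With weights (1, 0, ..., 0) the operator reduces to the maximum, with (0, ..., 0, 1) to the minimum, and with Q(r) = r to the arithmetic mean, which is why a single operator family can express quantifiers ranging from "at least one" to "all".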

Proceedings Article•

Dimension Reduction Based on Centroids and Least Squares for Efficient Processing of Text Data

Moongu Jeon¹, Haesun Park¹, J. Ben Rosen²•Institutions (2)

University of Minnesota¹, University of California, San Diego²

8 Feb 2001

TL;DR: The CentroidQR method is discussed in detail, and its classification properties are proved for two similarity measures, L2 and cosine; when a certain classification algorithm is applied, its classification results are exactly the same as those of classification with full dimension.

Abstract: Dimension reduction in today’s vector space based information retrieval systems is essential for improving computational efficiency in handling massive data. In our previous work we proposed a mathematical framework for lower dimensional representations of text data in vector space based information retrieval, and a couple of dimension reduction methods using minimization and a matrix rank reduction formula. One of our proposed methods is the CentroidQR method, which utilizes an orthogonal transformation on centroids; the test results showed that its classification results were exactly the same as those of classification with full dimension when a certain classification algorithm is applied. In this paper we discuss CentroidQR in detail, and prove mathematically its classification properties with two different similarity measures, L2 and cosine.

∗ The work of all three authors was supported in part by the National Science Foundation grant CCR-9901992. Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455, U.S.A., e-mail: jeon@cs.umn.edu.
† Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455, U.S.A., e-mail: hpark@cs.umn.edu.
‡ Dept. of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN 55455 and Dept. of Computer Science and Engineering, Univ. of California, San Diego, La Jolla, CA 92093, U.S.A., e-mail: jbrosen@cs.ucsd.edu.
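
The claim that centroid-based classification gives the same result in the reduced space can be checked on a toy example. The sketch below is one reading of the idea, not the authors' implementation: it orthonormalizes the class centroids with modified Gram-Schmidt (standing in for the Q factor of a QR decomposition), projects data onto that basis, and compares nearest-centroid decisions under L2 distance. The data, helper names, and tolerance are assumptions.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a cluster's document vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def orthonormal_basis(vectors: Sequence[Sequence[float]], tol: float = 1e-12) -> List[Vector]:
    """Modified Gram-Schmidt: orthonormal vectors spanning the given centroids."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(q, w)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = sqrt(dot(w, w))
        if norm > tol:  # skip centroids that are linearly dependent on earlier ones
            basis.append([wi / norm for wi in w])
    return basis


def project(basis: Sequence[Sequence[float]], v: Sequence[float]) -> Vector:
    """Coordinates of v in the reduced space (one coordinate per basis vector)."""
    return [dot(q, v) for q in basis]


def nearest(centroids: Sequence[Sequence[float]], v: Sequence[float]) -> int:
    """Index of the centroid closest to v in squared L2 distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(c, v)) for c in centroids]
    return dists.index(min(dists))


# Hypothetical term-frequency vectors: two clusters in a 4-term space.
cluster_a = [[2.0, 1.0, 0.0, 0.0], [3.0, 0.0, 1.0, 0.0]]
cluster_b = [[0.0, 0.0, 2.0, 3.0], [0.0, 1.0, 1.0, 2.0]]
centroids = [centroid(cluster_a), centroid(cluster_b)]

basis = orthonormal_basis(centroids)               # 4 dimensions -> 2 dimensions
reduced_centroids = [project(basis, c) for c in centroids]

queries = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [2.0, 0.0, 1.0, 1.0]]
full = [nearest(centroids, q) for q in queries]
reduced = [nearest(reduced_centroids, project(basis, q)) for q in queries]
print(full, reduced)
```

The decisions agree because each centroid lies in the span of the basis, so the part of a query orthogonal to that span adds the same amount to its distance from every centroid and cannot change which one is nearest.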

Proceedings Article•10.1109/HICSS.2001.927093•

A mining-based category evolution approach to managing online document categories

Chih-Ping Wei¹, Yuan-Xin Dong¹•Institutions (1)

National Sun Yat-sen University¹

3 Jan 2001

TL;DR: A mining-based category evolution technique to adjust document categories based on existing categories and their associated documents is proposed and empirical evaluation results indicate that the proposed technique was more effective than the category discovery approach and was insensitive to the quality of original categories.

Abstract: With rapid expansion of the numbers and sizes of text repositories and improvements in global connectivity, the quantity of information available online as free-format text is growing exponentially. Many large organizations create and maintain huge volumes of textual information online, and there is a pressing need for support of efficient and effective information retrieval, filtering, and management. Text categorization, or the assignment of textual documents to one or more pre-defined categories based on their content, is an essential component of efficient management and retrieval of documents. Previously, research has focused predominantly on developing or adopting statistical classification or inductive learning methods for automatically discovering text categorization patterns for a pre-defined set of categories. However, as documents accumulate, such categories may not capture a document's characteristics correctly. In this study, we propose a mining-based category evolution (MiCE) technique to adjust document categories based on existing categories and their associated documents. Empirical evaluation results indicate that the proposed technique, MiCE, was more effective than the category discovery approach and was insensitive to the quality of original categories.

Book•

Lower dimensional representation of text data in vector space based information retrieval

Haesun Park¹, Moongu Jeon¹, J. Ben Rosen²•Institutions (2)

University of Minnesota¹, University of California, San Diego²

1 Jan 2001

TL;DR: A mathematical framework is proposed for lower dimensional representation of text data in vector space based information retrieval using minimization and a matrix rank reduction formula, together with a new approach that is more efficient and effective than LSI/SVD when a priori information on the cluster structure of the data is available.

Abstract: Dimension reduction in today's vector space based information retrieval systems is essential for improving computational efficiency in handling massive data. In this paper, we propose a mathematical framework for lower dimensional representation of text data in vector space based information retrieval using minimization and a matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then we propose a new approach which is more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods over LSI/SVD are discussed in terms of computational efficiency and data representation in the reduced dimensional space. Experimental results are presented to illustrate the effectiveness of our approach in certain classification problems in reduced dimensional space. These results were computed using an information retrieval test system we are now developing. The results indicate that for a successful lower dimensional representation of data, it is important to incorporate a priori knowledge on data in dimension reduction.

Journal Article•10.1023/A:1011458711300•

Integrating External Knowledge to Supplement Training Data in Semi-Supervised Learning for Text Categorization

Mohammed Benkhalifa¹, Abdelhak Mouradi², Houssaine Bouyakhf²•Institutions (2)

Al Akhawayn University¹, Mohammed V University²

01 Jul 2001-Information Retrieval

TL;DR: This paper suggests the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which can learn from both training and test documents to classify new unseen documents.

Abstract: Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application for many learning approaches, which prove effective. Nevertheless, TC provides many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which can learn from both training and test documents to classify new unseen documents. This algorithm is the “Semi-Supervised Fuzzy c-Means” (ssFCM). Our experiments use Reuters 21578 database and consist of binary classifications for categories selected from the 115 TOPICS classes of the Reuters collection. Using the Vector Space Model, each document is represented by its original feature vector augmented with external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents and does not interfere with the use of training data.

Proceedings Article•10.1109/CEC.2001.934334•

Evolutionary learning of Web-document structure for information retrieval

Sun Kim¹, Byoung-Tak Zhang•Institutions (1)

Seoul National University¹

27 May 2001

TL;DR: The experimental results show that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval, and the proposed method has achieved about 10% improvement in retrieval accuracy.

Abstract: Web documents have a number of tags indicating the structure of documents. The tag information can be utilized to improve the performance of document retrieval systems. The authors propose an approach to retrieve Web documents using HTML tags and then use a genetic algorithm to adapt the tag weights. This method uses a modified similarity measure based on the tag weights. A genetic learning method is used to select the tags for retrieval and get the optimal tag weights. We evaluate our method via experiments on conference pages and TREC document sets. The experimental results show that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval. The proposed method has achieved about 10% improvement in retrieval accuracy.
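
A similarity measure modified by tag weights, as described in this abstract, might look like the following sketch. It is an illustration under stated assumptions rather than the paper's formula: each term occurrence contributes the weight of the HTML tag it appears in, and documents are compared by cosine similarity. The tag names, weight values, and sample document are hypothetical, and the genetic search over weights is left out.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, Mapping, Tuple


def weighted_vector(
    tagged_terms: Iterable[Tuple[str, str]],
    tag_weights: Mapping[str, float],
    default_weight: float = 1.0,
) -> Dict[str, float]:
    """Build a term vector where each occurrence counts as its tag's weight."""
    vector: Dict[str, float] = defaultdict(float)
    for tag, term in tagged_terms:
        vector[term] += tag_weights.get(tag, default_weight)
    return dict(vector)


def cosine(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse term vectors."""
    numerator = sum(u[t] * v[t] for t in set(u) & set(v))
    denominator = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return numerator / denominator if denominator else 0.0


# Hypothetical page: (tag, term) pairs extracted from its HTML.
page = [
    ("title", "retrieval"),
    ("body", "genetic"),
    ("body", "retrieval"),
    ("body", "conference"),
]
query = {"retrieval": 1.0}

flat = cosine(query, weighted_vector(page, {"title": 1.0, "body": 1.0}))
boosted = cosine(query, weighted_vector(page, {"title": 3.0, "body": 1.0}))
print(flat, boosted)
```

A genetic algorithm, as in the abstract, would treat the `tag_weights` mapping as the chromosome and use retrieval accuracy on training queries as the fitness function.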

Proceedings Article•10.1145/372202.372208•

Interactive document retrieval with relational learning

Masayuki Okabe¹, Seiji Yamada¹•Institutions (1)

Tokyo Institute of Technology¹

1 Mar 2001

TL;DR: An approach to enhancing the effectiveness of human help in interactive document retrieval is described; the results show that it improves both effectiveness and efficiency compared to a normal method that uses only the query vector.

Abstract: This paper describes an approach to enhance the effectiveness of human help in interactive document retrieval, where the system supports the user in finding documents effectively through relevance feedback. At present the vector space model is a typical representation method for realizing relevance feedback. However, it can neither express relationships such as proximity nor keep several features separately. We compensate for these defects with a set of rules, which are constructed by relational learning and used to identify relevant documents. The learning algorithm consists of a separate-and-conquer strategy and top-down heuristic search with limited backtracking. Background relations are made only from keywords, so the constructed rules represent useful keyword combinations for finding relevant documents. We evaluate the effectiveness of our approach in document retrieval experiments using a test bed database. The results show that our method enhances both effectiveness and efficiency compared to a normal method with only a query vector.

Proceedings Article•10.1109/ICII.2001.983042•

An approach of multi-hierarchy text classification

Shaohui Liu¹, Mingkai Dong, Haijun Zhang, Rong Li, Zhongzhi Shi•Institutions (1)

Chinese Academy of Sciences¹

29 Oct 2001

TL;DR: This paper improves on the classical formula for calculating the term weight in the vector space model and proposes an approach to multi-hierarchy text classification based on the vector space model that achieves high classification precision and recall.

Abstract: This paper improves on the classical formula for calculating the term weight in the vector space model. Furthermore, an approach to multi-hierarchy text classification based on the vector space model is proposed. In this approach, all classes are organized as a tree according to some given hierarchical relations, and all the training documents in a class are combined into a class-document. In order to construct the class models, only the class-documents attached to the same node of the same layer are compared. When classifying the documents, one matching process is hierarchically performed from the root node to the leaf nodes until a corresponding subclass is found. The experiment and real systems indicate that the approach is of high classification precision and recall.
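
The root-to-leaf matching process described in this abstract can be sketched as a greedy descent through a class tree, comparing a document only with the sibling class-documents at each level. This is a minimal sketch, not the paper's method: the `ClassNode` type, the cosine matching, the plain term weights, and the toy hierarchy are assumptions, and the paper's improved term-weight formula is not reproduced.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Mapping


@dataclass
class ClassNode:
    name: str
    # Term weights of the class-document (all training documents of the class merged).
    profile: Dict[str, float] = field(default_factory=dict)
    children: List["ClassNode"] = field(default_factory=list)


def cosine(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    numerator = sum(u[t] * v[t] for t in set(u) & set(v))
    denominator = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return numerator / denominator if denominator else 0.0


def classify(root: ClassNode, document: Mapping[str, float]) -> List[str]:
    """Descend from the root, choosing the best-matching sibling at each level."""
    path: List[str] = []
    node = root
    while node.children:
        node = max(node.children, key=lambda child: cosine(child.profile, document))
        path.append(node.name)
    return path


# Hypothetical two-level hierarchy with toy class-documents.
root = ClassNode("root", children=[
    ClassNode("science", {"gene": 1.0, "cell": 1.0, "quantum": 1.0, "energy": 1.0}, [
        ClassNode("biology", {"gene": 1.0, "cell": 1.0}),
        ClassNode("physics", {"quantum": 1.0, "energy": 1.0}),
    ]),
    ClassNode("sports", {"goal": 1.0, "match": 1.0, "serve": 1.0}, [
        ClassNode("football", {"goal": 1.0, "match": 1.0}),
        ClassNode("tennis", {"serve": 1.0, "match": 1.0}),
    ]),
])

print(classify(root, {"gene": 2.0, "cell": 1.0}))
```

Because only siblings are compared at each step, the number of comparisons grows with the depth and branching of the tree rather than with the total number of classes; the cost is that a wrong choice near the root cannot be undone further down.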

Book Chapter•10.1007/3-540-45581-7_41•

Recognizing Ontology-Applicable Multiple-Record Web Documents

David W. Embley¹, Yiu-Kai Ng¹, Li Xu¹•Institutions (1)

Brigham Young University¹

27 Nov 2001

TL;DR: A technique is proposed for recognizing which multiple-record Web documents apply to an ontologically specified application; based on machine-learned rules over heuristic measurements, it determines whether a Web document is applicable for a given ontology.

Abstract: Automatically recognizing which Web documents are "of interest" for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply three heuristics: (1) a density heuristic that measures the percent of the document that appears to apply to an application ontology, (2) an expected-value heuristic that compares the number and kind of values found in a document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machine-learned rules over these heuristic measurements, we determine whether a Web document is applicable for a given ontology. Our experimental results show that we have been able to achieve over 90% for both recall and precision, with an F-measure of about 95%.
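
For readers relating the recall, precision, and F-measure figures above, the following sketch computes the standard F-measure as a weighted harmonic mean. The abstract does not say which variant was used; the balanced form (`beta = 1`) and the sample values are assumptions for illustration only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 weights them equally)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical values: the balanced F-measure never exceeds the larger input.
balanced = f_measure(0.95, 0.95)
skewed = f_measure(0.90, 1.00)
print(balanced, skewed)
```

Since the balanced F-measure always lies between precision and recall, a value near 95% implies that at least one of the two is at or above that level.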

Proceedings Article•10.1109/ICDAR.2001.953858•

Experimental evaluation of passage-based document retrieval

Koichi Kise¹, Markus Junker¹, Andreas Dengel¹, Keinosuke Matsumoto²•Institutions (2)

German Research Centre for Artificial Intelligence¹, Osaka Prefecture University²

1 Sep 2001

TL;DR: This paper employs a passage-based method based on density distributions of query terms in documents, and compares it with three conventional methods: the vector space model, pseudo-feedback and latent semantic indexing.

Abstract: Retrieval of electronic documents is a fundamental component for intelligent access to the contents of documents. For the retrieval of long documents, a method called passage-based document retrieval has proven to be effective. In this paper we experimentally show that passage-based retrieval is also advantageous for dealing with short queries, on condition that the documents are long. We employ a passage-based method based on density distributions of query terms in documents, and compare it with three conventional methods: the vector space model, pseudo-feedback and latent semantic indexing.

Proceedings Article•10.1145/383952.384034•

Analyses of multiple-evidence combinations for retrieval strategies

Abdur Chowdhury¹, Ophir Frieder¹, David A. Grossman¹, Catherine McCabe¹•Institutions (1)

Illinois Institute of Technology¹

1 Sep 2001

TL;DR: While Lee's experiments focused on different system result sets, this work shows that the improvements shown by Lee were likely produced by fusing ranking strategies less tuned than today’s measures, and current improvements are likely to be produced by systemic differences rather than ranking strategies.

Abstract: Multiple-evidence techniques are touted as means to improve the effectiveness of systems. Belkin, et al. [1] examined the effects of various query representations. Fox, et al. [2] proposed several combination algorithms and found that combinations of the same types of runs (long and short queries within the vector space model) did not yield improvement and sometimes even degraded performance. He did achieve improvement over individual runs when merging different retrieval strategies (e.g., vector space and pnorm Boolean). Lee [3] further examined various combination algorithms for fusing result sets to improve effectiveness. He identified that, for multiple-evidence to improve system effectiveness, the retrieved sets must have higher relevance overlap than non-relevance overlap. Lee did not identify the exact difference needed to improve effectiveness. His results had a 125% difference in relevant and non-relevant overlap. While Lee's experiments focused on different system result sets, we focus on effective ranking strategies removing systemic differences of parsers, stemmers, phrase processing and weighting factors. We show that the improvements shown by Lee were likely produced by fusing ranking strategies less tuned than today’s measures, and current improvements are likely to be produced by systemic differences rather than ranking strategies.
