Top 103 Information Processing and Management papers published in 2005

National Chiao Tung University¹

TL;DR: In this paper, the authors examined the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences.

Abstract: The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 Million (US) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirectional network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in Joint Conference on Digital Libraries (JCDL).
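
To make the idea of ranking authors on a weighted, directed co-authorship network more concrete, here is a minimal Python sketch of a weighted PageRank-style iteration. It is not the paper's AuthorRank definition: the edge weights (each author's co-authorship links normalized to sum to 1), the damping factor of 0.85, and the author labels are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def coauthor_weights(papers):
    """Build directed edge weights from a list of author lists.

    Assumption (not from the paper): the weight of edge a -> b is the
    share of a's co-authorship links that go to b, so each author's
    outgoing weights sum to 1. Authors with no co-authors are ignored.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for authors in papers:
        for a, b in permutations(set(authors), 2):
            counts[a][b] += 1.0
    weights = {}
    for a, neighbours in counts.items():
        total = sum(neighbours.values())
        weights[a] = {b: c / total for b, c in neighbours.items()}
    return weights


def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Power iteration for PageRank with weighted outgoing edges."""
    nodes = set(weights)
    for neighbours in weights.values():
        nodes.update(neighbours)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for a in nodes:
            out = weights.get(a, {})
            if not out:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[a] / n
            else:
                for b, w in out.items():
                    new_rank[b] += damping * rank[a] * w
        rank = new_rank
    return rank


# Hypothetical toy data: author "A" has co-authored with everyone else.
example_papers = [["A", "B"], ["A", "C"], ["A", "D"], ["B", "C"]]
example_ranks = weighted_pagerank(coauthor_weights(example_papers))
```

In this toy network the best-connected author receives the highest score, which is the behaviour one would expect from any PageRank-like measure.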

953 citations

Journal Article•10.1016/J.IPM.2004.04.003•

Text summarization using a trainable summarizer and latent semantic analysis

[...]

Jen-Yuan Yeh¹, Hao-Ren Ke¹, Wei-Pang Yang¹, I-Heng Meng¹•Institutions (1)

National Yunlin University of Science and Technology¹

TL;DR: One approach is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries, while the other uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map.

Abstract: This paper proposes two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significance of different sentence positions, and (2) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. At a compression rate of 30%, the average f-measures were 49% for MCBA, 52% for MCBA + GA, and 44% and 40% for LSA + T.R.M. at the single-document and corpus levels, respectively.
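
The feature-based scoring step described above can be sketched as a weighted sum over sentence features followed by selection at a given compression rate. This is only an illustrative sketch: the linear form of the score, the feature values, and the hand-picked weights are assumptions (in the paper the weights are obtained with a genetic algorithm, which is not shown here).

```python
def score_sentence(features, weights):
    """Weighted linear combination of a sentence's feature values."""
    return sum(weights[name] * value for name, value in features.items())


def summarize(sentences, feature_rows, weights, compression_rate=0.3):
    """Keep the top-scoring sentences, returned in document order."""
    keep = max(1, int(len(sentences) * compression_rate))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(feature_rows[i], weights),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:keep])]


# Hypothetical weights; a negative weight penalizes negative keywords.
example_weights = {
    "position": 1.0,
    "positive_keyword": 1.0,
    "negative_keyword": -1.0,
    "centrality": 1.0,
    "title_resemblance": 1.0,
}
```

A trained version would replace `example_weights` with values found by searching for the combination that maximizes summary quality on training data.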

296 citations

Journal Article•10.1016/J.IPM.2003.11.004•

Using the patent co-citation approach to establish a new patent classification system

[...]

Kuei-Kuei Lai, Shiao-Jun Wu¹•Institutions (1)

Katholieke Universiteit Leuven¹, Hungarian Academy of Sciences²

TL;DR: A new approach to create a patent classification system to replace the IPC or UPC system for conducting patent analysis and management is proposed, based on co-citation analysis of bibliometrics to assist patent managers in understanding the basic patents for a specific industry.

Abstract: The paper proposes a new approach to create a patent classification system to replace the IPC or UPC system for conducting patent analysis and management. The new approach is based on co-citation analysis of bibliometrics. The traditional approach for management of patents, which is based on either the IPC or UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adaptive to the characteristics of a specific industry. The analysis of this approach is divided into three phases. Phase I selects appropriate databases to conduct patent searches according to the subject and objective of this study and then selects basic patents. Phase II uses the co-cited frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is to develop a patent classification system based on patent similarities to assist patent managers in understanding the basic patents for a specific industry, the relationships among categories of technologies and the evolution of a technology category.
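
Phase II, counting how often pairs of basic patents are cited together, can be sketched in a few lines of Python. The data layout (a mapping from each citing patent to the patents it cites) and the patent identifiers are hypothetical; the factor analysis of Phase III is not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_to_cited, basic_patents):
    """Count, for each pair of basic patents, how many citing patents cite both."""
    basic = set(basic_patents)
    counts = Counter()
    for cited in citing_to_cited.values():
        # Only basic patents matter; sort so each pair has one canonical key.
        cited_basic = sorted(basic.intersection(cited))
        for pair in combinations(cited_basic, 2):
            counts[pair] += 1
    return counts


# Hypothetical identifiers: C* are citing patents, P* are basic patents.
example_citations = {
    "C1": ["P1", "P2", "X9"],
    "C2": ["P1", "P2", "P3"],
    "C3": ["P3"],
}
example_counts = cocitation_counts(example_citations, ["P1", "P2", "P3"])
```

The resulting pair counts form a symmetric co-citation matrix that a later step could feed into factor analysis.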

218 citations

Journal Article•10.1016/J.IPM.2005.03.021•

Combining full text and bibliometric information in mapping scientific disciplines

[...]

Patrick Glenisson¹, Wolfgang Glänzel², Frizo Janssens¹, Bart De Moor¹•Institutions (2)

University of Wisconsin–Milwaukee¹

TL;DR: Full text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods and confirm the main results of the pilot study that such hybrid methodology can be applied to both research evaluation and information retrieval.

Abstract: In the present study results of an earlier pilot study by Glenisson, Glanzel and Persson are extended on the basis of larger sets of papers. Full text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods. The text mining methodology already introduced in the pilot study is applied to the complete publication year 2003 of the journal Scientometrics. Altogether 85 documents that can be considered research articles or notes have been selected for this exercise. The outcomes confirm the main results of the pilot study, namely, that such hybrid methodology can be applied to both research evaluation and information retrieval. Nevertheless, Scientometrics documents published in 2003 cover a much broader and more heterogeneous spectrum of bibliometrics and related research than those analysed in the pilot study. A modified subject classification based on the scheme used in an earlier study by Schoepflin and Glanzel has been applied for validation purposes.

183 citations

Journal Article•10.1016/J.IPM.2003.12.001•

The impact of webpage content characteristics on webpage visibility in search engine results (part I)

[...]

Jin Zhang¹, Alexandra Dimitroff¹•Institutions (1)

TL;DR: Webpage visibility in search engine results lists can be improved by increasing the frequency of keywords in the title, in the full-text and in both the title and full-text.

Abstract: Content characteristics of a webpage include factors such as keyword position in a webpage, keyword duplication, layout, and their combination. These factors may impact webpage visibility in a search engine. Four hypotheses are presented relating to the impact of selected content characteristics on webpage visibility in search engine results lists. Webpage visibility can be improved by increasing the frequency of keywords in the title, in the full-text and in both the title and full-text.

115 citations

Journal Article•10.1016/J.IPM.2005.01.006•

Review of Discovering knowledge in data, an introduction to data mining by Daniel T. Larose, Ph.D., John Wiley & Sons Inc., December 2004

[...]

Donald K. Wedding

University of Wisconsin–Milwaukee¹

103 citations

Journal Article•10.1016/J.IPM.2003.12.002•

The impact of metadata implementation on webpage visibility in search engine results (part II)

[...]

Jin Zhang¹, Alexandra Dimitroff¹•Institutions (1)

University of Western Ontario¹, Information Technology University²

TL;DR: Findings suggest that metadata is a good mechanism to improve webpage visibility, the metadata subject field plays a more important role than any other metadata field and keywords extracted from the webpage itself, particularly title or full-text, are most effective.

Abstract: This paper discusses the impact of metadata implementation in a webpage on its visibility performance in a search engine results list. Influential internal and external factors of metadata implementation were identified. How these factors affect webpage visibility in a search engine results list was examined in an experimental study. Findings suggest that metadata is a good mechanism to improve webpage visibility, the metadata subject field plays a more important role than any other metadata field and keywords extracted from the webpage itself, particularly title or full-text, are most effective. To maximize the effects, these keywords should come from both title and full-text.

94 citations

Journal Article•10.1016/J.IPM.2003.10.001•

A modeling approach to uncover hyperlink patterns: the case of Canadian universities

[...]

Liwen Vaughan¹, Mike Thelwall²•Institutions (2)

TL;DR: A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site, and showed that English universities are advantaged.

Abstract: Hyperlink patterns between Canadian university Web sites were analyzed by a mathematical modeling approach. A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site. Higher faculty quality means more links. French universities received lower numbers of links to their Web sites than comparable English universities. Analysis of interlinking between pairs of universities also showed that English universities are advantaged. Universities are more likely to link to each other when the geographical distance between them is less than 3000 km, possibly reflecting the east vs. west divide that exists in Canadian society.

88 citations

Journal Article•10.1016/J.IPM.2004.10.005•

Clustered SVD strategies in latent semantic indexing

[...]

Jing Gao¹, Jun Zhang¹•Institutions (1)

University of Kentucky¹

TL;DR: This work proposes to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which the truncated SVD is applied, and shows that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.

Abstract: The text retrieval method using latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term-document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.

75 citations

Journal Article•10.1016/J.IPM.2004.04.006•

A probabilistic model for stemmer generation

[...]

Michela Bacchin¹, Nicola Ferro¹, Massimo Melucci¹•Institutions (1)

University of Padua¹

TL;DR: A language-independent probabilistic model which can automatically generate stemmers is presented and a series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.

Abstract: In this paper we present a language-independent probabilistic model which can automatically generate stemmers. Stemmers can improve the retrieval effectiveness of information retrieval systems; however, designing and implementing stemmers requires a laborious amount of effort because documents and queries are often written or spoken in several different languages. The probabilistic model proposed in this paper aims at the development of stemmers used for several languages. The proposed model describes the mutual reinforcement relationship between stems and derivations and then provides a probabilistic interpretation. A series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.

65 citations

Journal Article•10.1016/J.IPM.2003.10.003•

Choosing document structure weights

[...]

Andrew Trotman¹•Institutions (1)

University of Otago¹

College of Business Administration¹

TL;DR: Analysis suggests BM25 cannot be improved using structure weighting, and vector space, probability, and Okapi BM25 ranking are extended to include structure weighting.

Abstract: Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere. An occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm. The learned weights are then tested on an evaluation set of queries. Structure weighted vector space inner product and structure weighted probabilistic retrieval show about a 5% improvement in mean average precision over their unstructured counterparts. Structure weighted BM25 shows nearly no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
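
One plausible way to fold structure weights into a ranking function is to replace the raw term frequency with a weighted sum of per-structure frequencies. The abstract does not give the exact formula, so the sketch below is an assumption, and the structure names and weights are hand-picked for illustration rather than learned with a genetic algorithm.

```python
def structure_weighted_tf(occurrences, structure_weights, default_weight=1.0):
    """Combine per-structure term counts into one weighted term frequency.

    occurrences maps a structure name (e.g. "abstract") to a dict of
    term -> raw count within that structure.
    """
    weighted = {}
    for structure, term_counts in occurrences.items():
        weight = structure_weights.get(structure, default_weight)
        for term, count in term_counts.items():
            weighted[term] = weighted.get(term, 0.0) + weight * count
    return weighted


# Hypothetical document: an occurrence in the abstract counts three times
# as much as an occurrence in the body.
example_doc = {
    "abstract": {"retrieval": 1},
    "body": {"retrieval": 2, "index": 1},
}
example_tf = structure_weighted_tf(example_doc, {"abstract": 3.0, "body": 1.0})
```

The weighted frequencies can then be passed to an ordinary vector space or probabilistic scorer in place of raw counts.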

Journal Article•10.1016/J.IPM.2004.01.002•

Temporal document retrieval model for business news archives

[...]

Pawel Jan Kalczynski¹, Amy Y. Chou¹•Institutions (1)

TL;DR: The classical Vector Space Model is extended to the Temporal Document Retrieval Model (TDRM) that incorporates the proposed fuzzy representations of temporal expressions and a method for representing uncertainty based on fuzzy numbers is proposed.

Abstract: Temporal expressions occurring in business news, such as "last week" or "at the end of this month," carry important information about the time context of the news document and were proved to be useful for document retrieval. We found that about 10% of these expressions are difficult to project onto the calendar due to the uncertainty about their bounds. This paper introduces a novel approach to representing temporal expressions. A user study is conducted to measure the degree of uncertainty for selected temporal expressions and a method for representing uncertainty based on fuzzy numbers is proposed. The classical Vector Space Model is extended to the Temporal Document Retrieval Model (TDRM) that incorporates the proposed fuzzy representations of temporal expressions.

Journal Article•10.1016/J.IPM.2003.08.002•

Re-ranking algorithm using post-retrieval clustering for content-based image retrieval

[...]

Gunhan Park¹, Yunju Baek¹, Heung-Kyu Lee²•Institutions (2)

KAIST¹, Pusan National University²

TL;DR: A re-ranking algorithm using post-retrieval clustering for content-based image retrieval (CBIR) that achieves an improvement of retrieval effectiveness of over 10% on average in the average normalized modified retrieval rank (ANMRR) measure.

Abstract: In this paper, we propose a re-ranking algorithm using post-retrieval clustering for content-based image retrieval (CBIR). In conventional CBIR systems, it is often observed that images visually dissimilar to a query image are ranked high in retrieval results. To remedy this problem, we utilize the similarity relationship of the retrieved results via post-retrieval clustering. In the first step of our method, images are retrieved using visual features such as color histogram. Next, the retrieved images are analyzed using hierarchical agglomerative clustering methods (HACM) and the rank of the results is adjusted according to the distance of a cluster from a query. In addition, we analyze the effects of clustering methods, query-cluster similarity functions, and weighting factors in the proposed method. We conducted a number of experiments using several clustering methods and cluster parameters. Experimental results show that the proposed method achieves an improvement of retrieval effectiveness of over 10% on average in the average normalized modified retrieval rank (ANMRR) measure.

Journal Article•10.1016/J.IPM.2005.03.017•

Domain analysis and information retrieval through the construction of heliocentric maps based on ISI-JCR category cocitation

[...]

Félix de Moya-Anegón¹, Benjamín Vargas-Quesada¹, Zaida Chinchilla-Rodríguez¹, Elena Corera-Álvarez¹, Victor Herrero-Solana¹, Francisco José Muñoz-Fernández¹•Institutions (1)

University of Granada¹

TL;DR: The use of ISI-JCR categories as units of cocitation and measurement for the construction of heliocentric maps is proposed and used to illustrate, analyze and compare domains in terms of the categories and their interconnections or links.

Abstract: We propose the use of ISI-JCR categories as units of cocitation and measurement for the construction of heliocentric maps. The use of a spatial metaphor allows us to illustrate, analyze and compare domains in terms of the categories and their interconnections or links. We can also move around within the structure of these domains for further analysis, and access the documents associated to the categories and to the links that cocite or relate them.

Journal Article•10.1016/J.IPM.2004.06.009•

Empirical studies on the impact of lexical resources on CLIR performance

[...]

Jinxi Xu¹, Ralph Weischedel¹•Institutions (1)

BBN Technologies¹

Kyungpook National University¹

TL;DR: Experiments measuring cross-lingual information retrieval (CLIR) performance as a function of bilingual term lists, parallel corpora, machine translation (MT), and stemmers show that one can achieve an acceptable CLIR performance using only a bilingual term list.

Abstract: In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include: • One can achieve an acceptable CLIR performance using only a bilingual term list (70-80% on Chinese and Arabic corpora). • However, if a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance. • If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially overcome the lack of parallel text. • While stemming is useful normally, with a very large parallel corpus for Arabic-English, stemming hurt performance in our empirical studies with Arabic, a highly inflected language.

Journal Article•10.1016/J.IPM.2004.08.005•

Document indexing: a concept-based approach to term weight estimation

[...]

Bo-Yeong Kang¹, Sang-Jo Lee¹•Institutions (1)

TL;DR: A new indexing formalism is developed that considers not only the terms in a document, but also the concepts, and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document.

Abstract: Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.

Journal Article•10.1016/J.IPM.2005.03.013•

Challenges of scholarly publications on the web to the evaluation of science: a comparison of author visibility on the web and in print journals

[...]

Dangzhi Zhao¹•Institutions (1)

University of Alberta¹

TL;DR: Results from this study demonstrate both the importance and the feasibility of the use of multiple citation data sources in citation analysis studies of scholarly communication, and provide evidence for a developing "two tier" scholarly communication system.

Abstract: This article reveals different patterns of scholarly communication in the XML research field on the web and in print journals in terms of author visibility, and challenges the common practice of exclusively using the ISI's databases to obtain citation counts as scientific performance indicators. Results from this study demonstrate both the importance and the feasibility of the use of multiple citation data sources in citation analysis studies of scholarly communication, and provide evidence for a developing "two tier" scholarly communication system.

Journal Article•10.1016/J.IPM.2004.02.003•

Character contiguity in N-gram-based word matching: the case for Arabic text searching

[...]

Suleiman H. Mustafa¹•Institutions (1)

Yarmouk University¹

01 Jul 2005-Information Processing and Management

TL;DR: The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used.

Abstract: This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams, combining contiguous and non-contiguous. The two techniques were tested using three experiments involving different levels of textual word stemming, a textual corpus containing about 25 thousand words (with a total size of about 160KB), and a set of 100 query textual words. The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used. The present results and the inconsistent findings of previous studies raise some questions regarding the efficiency of pure conventional N-gram matching and the ways in which it should be used in languages other than English.
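
The difference between contiguous and hybrid N-gram matching can be illustrated with bigrams. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses bigrams only, forms non-contiguous bigrams by skipping exactly one character, scores similarity with the Dice coefficient, and uses Latin transliterations ("ktb", "katb") as stand-ins for Arabic words.

```python
def contiguous_bigrams(word):
    """Set of adjacent character pairs."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def gapped_bigrams(word, gap=1):
    """Set of character pairs separated by `gap` skipped characters."""
    return {word[i] + word[i + gap + 1] for i in range(len(word) - gap - 1)}


def hybrid_bigrams(word):
    """Union of contiguous and one-character-gap bigrams."""
    return contiguous_bigrams(word) | gapped_bigrams(word)


def dice(a, b):
    """Dice coefficient between two sets of N-grams."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Transliterated stand-ins: a three-letter root and a form with an infix.
root, derived = "ktb", "katb"
contiguous_score = dice(contiguous_bigrams(root), contiguous_bigrams(derived))
hybrid_score = dice(hybrid_bigrams(root), hybrid_bigrams(derived))
```

With these inputs the contiguous score is 0.4 and the hybrid score is 0.5, because the gapped bigram "kt" in "katb" recovers a root pair that the infixed vowel breaks apart.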

Journal Article•10.1016/J.IPM.2004.08.002•

Techniques for improving web retrieval effectiveness

[...]

Eui-Kyu Park¹, Dong-Yul Ra¹, Myung-Gil Jang•Institutions (1)

Yonsei University¹

Irina Marshakova-Shaikevich

TL;DR: Stratifying and re-ranking the retrieval list based on the maximum count of index terms in common between a sentence and a query resulted in significant improvement of performance.

Abstract: This paper discusses several schemes for improving retrieval effectiveness that can be used in the named page finding tasks of web information retrieval (Overview of the TREC-2002 web track. In: Proceedings of the Eleventh Text Retrieval Conference TREC-2002, NIST Special Publication #500-251, 2003). These methods were applied on top of the basic information retrieval model as additional mechanisms to upgrade the system. Use of the title of web pages was found to be effective. It was confirmed that anchor texts of incoming links were beneficial, as suggested in other works. Sentence-query similarity is a new type of information proposed by us and was identified as the best information to take advantage of. Stratifying and re-ranking the retrieval list based on the maximum count of index terms in common between a sentence and a query resulted in significant improvement of performance. To demonstrate these facts a large-scale web information retrieval system was developed and used for experimentation.
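
The stratify-and-re-rank step can be sketched as a two-key sort: first by the maximum number of query terms shared with any single sentence of a document, then by the original retrieval score within each stratum. The tuple layout, document identifiers, and scores below are hypothetical, and using the base score as the within-stratum tie-breaker is an assumption.

```python
def max_common_terms(sentences, query_terms):
    """Largest number of query terms shared with any one sentence."""
    query = set(query_terms)
    return max((len(query & set(sentence)) for sentence in sentences), default=0)


def stratify_and_rerank(results, query_terms):
    """Re-rank (doc_id, base_score, sentences) tuples.

    Documents are grouped into strata by their maximum sentence-query
    overlap (higher first) and ordered by base score inside a stratum.
    """
    ordered = sorted(
        results,
        key=lambda r: (-max_common_terms(r[2], query_terms), -r[1]),
    )
    return [doc_id for doc_id, _, _ in ordered]


# Hypothetical retrieval list; each sentence is a list of index terms.
example_results = [
    ("d1", 0.9, [["home", "page"], ["contact"]]),
    ("d2", 0.5, [["trec", "web", "track"]]),
    ("d3", 0.7, [["trec", "overview"]]),
]
example_order = stratify_and_rerank(example_results, ["trec", "web", "track"])
```

Here the document containing a sentence that matches all three query terms moves to the top even though its base score is the lowest.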

Journal Article•10.1016/J.IPM.2005.03.027•

Bibliometric maps of field of science

[...]

TL;DR: The present paper is devoted to two directions in algorithmic classificatory procedures: the journal co-citation analysis as an example of citation networks and lexical analysis of keywords in the titles and texts.

Abstract: The present paper is devoted to two directions in algorithmic classificatory procedures: the journal co-citation analysis as an example of citation networks and lexical analysis of keywords in the titles and texts. What is common to those approaches is the general idea of normalization of deviations of the observed data from the mathematical expectation. The application of the same formula leads to discovery of statistically significant links between objects (journals in one case, keywords--in the other). The results of the journal co-citation analysis are reflected in tables and map for field "Women's Studies" and for field "Information Science and Library Science". An experimental attempt at establishing textual links between words was carried out on two samples from SSCI Data base: (1) EDUCATION and (2) ETHICS. The EDUCATION file included 2180 documents (of which 751 had abstracts); the ETHICS file included 807 documents (289 abstracts). Some examples of the results of this pilot study are given in tabular form. The binary links between words discovered in this way may form triplets or other groups with more than two member words.

Journal Article•10.1016/J.IPM.2004.04.005•

Beyond information searching and browsing: acquiring knowledge from digital libraries

[...]

Ling Feng¹, Manfred A. Jeusfeld², J.J.A.C. Hoppenbrouwers²•Institutions (2)

University of Twente¹, Tilburg University²

Royal Netherlands Academy of Arts and Sciences¹

TL;DR: In this article, a two-layered digital library model is proposed to support different levels of human cognitive acts, thus enabling new kinds of knowledge exchange among library users, where the low layer of the model, namely, tactical cognition support layer, provides users with requested relevant documents, as searching and browsing do.

Abstract: Digital libraries (DLs) are a resource for answering complex questions. Up to now, such systems mainly support keyword-based searching and browsing. The mapping from a research question to keywords and the assessment of whether an article is relevant to a research question rest entirely with the user. In this paper, we present a two-layered digital library model. The aim is to enhance current DLs to support different levels of human cognitive acts, thus enabling new kinds of knowledge exchange among library users. The low layer of the model, namely, the tactical cognition support layer, provides users with requested relevant documents, as searching and browsing do. The upper layer of the model, namely, the strategic cognition support layer, not only provides users with relevant documents but also directly and intelligently answers users' cognitive questions. On the basis of the proposed model, we divide the DL information space into two subspaces, i.e., a knowledge subspace and a document subspace, where documents in the document subspace serve as the justification for the corresponding knowledge in the knowledge subspace. Detailed description of the knowledge subspace and its construction, as well as query facilities against the enhanced DLs for users' knowledge sharing and exchange, are particularly discussed.

Journal Article•10.1016/J.IPM.2005.03.009•

New indicators for gender studies in Web networks

[...]

Hildrun Kretschmer¹, Isidro F. Aguillo•Institutions (1)

Katholieke Universiteit Leuven¹

TL;DR: New indicators for gender studies are introduced using the multi-authored publications of a set of 64 members of the COLLNET network and the results show strong gender differences in favour of publications with only male co-authors.

Abstract: New indicators for gender studies are introduced using the multi-authored publications of a set of 64 members of the COLLNET network. Considering the sample of 223 papers and their visibility on the Web, analyses were performed on the sets comprised of publications on the one hand and the authors (social network analysis) on the other. The indicators developed include: Gender co-operation; Web visibility rates and Gender centrality in networks. The co-operation among COLLNET members results in a publication set with a different pattern to other gender studies in science, which show low contribution rates of women in collaboration activities. In the sample the collaboration patterns of female COLLNET members are almost the same as their male counterparts. However, when considering only the arrangement of gender in the by-line of publications, this is in favour of men. Regarding average Web visibility rates, there are no differences dependent upon the arrangement of gender in publications with both male and female co-authors. However, the results also show strong gender differences in favour of publications with only male co-authors. The centrality of female actors in the networks is correlated with the central positions of women in the International Society for Scientometrics and Informetrics (ISSI), where they are leading either as presidents or organisers of its international conferences. It may well be that, with webometric methods and new indicators for gender studies, there is a real continuation of gender equality in COLLNET over time.

Journal Article•10.1016/J.IPM.2003.12.006•

Generic technologies for single- and multi-document summarization

[...]

Marie-Francine Moens¹, Roxana Angheluta¹, Jos Dumortier¹•Institutions (1)

National Chiao Tung University¹

TL;DR: The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.

Abstract: The technologies for single- and multi-document summarization that are described and evaluated in this article can be used on heterogeneous texts for different summarization tasks. They refer to the extraction of important sentences from the documents, compressing the sentences to their essential or relevant content, and detecting redundant content across sentences. The technologies were tested at the Document Understanding Conference, organized by the National Institute of Standards and Technology, USA, in 2002 and 2003. The system obtained good to very good results in this competition. We also tested our summarization system on a variety of English Encyclopedia texts and on Dutch magazine articles. The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.

Journal Article•10.1016/J.IPM.2004.02.004•

Assessing learning outcomes in two information retrieval learning environments

Kai Halttunen¹, Kalervo Järvelin¹•Institutions (1)

University of Tampere¹

01 Jul 2005-Information Processing and Management

TL;DR: Assessment of learning outcomes in an experimental, but naturalistic, learning environment compared to more traditional instruction indicates that anchored instruction and scaffolding with an instructional tool, the IR Game, with performance feedback enable students to construct queries with fewer semantic knowledge errors in operational IR systems.

Abstract: In order to design information retrieval (IR) learning environments and instruction, it is important to explore learning outcomes of different pedagogical solutions. Learning outcomes have seldom been evaluated in IR instruction. The particular focus of this study is the assessment of learning outcomes in an experimental, but naturalistic, learning environment compared to more traditional instruction. The 57 participants of an introductory course on IR were selected for this study, and the analysis illustrates their learning outcomes regarding both conceptual change and development of IR skill. Concept mapping of student essays was used to analyze conceptual change, and log-files of search exercises provided data for performance assessment. Students in the experimental learning environment changed their conceptions more regarding linguistic aspects of IR and placed more emphasis on planning and management of the search process. Performance assessment indicates that anchored instruction and scaffolding with an instructional tool, the IR Game, with performance feedback enable students to construct queries with fewer semantic knowledge errors also in operational IR systems.

Journal Article•10.1016/J.IPM.2003.10.004•

A statistics-based approach to incrementally update inverted files

Wann-Yun Shieh¹, Chung-Ping Chung¹•Institutions (1)

University of Massachusetts Amherst¹

TL;DR: In this paper, a run-time statistics-based approach is proposed for allocating spare space in an inverted file so that reorganization is largely avoided during updates, while unused free space is controlled such that file access speed is not affected.

Abstract: Many information retrieval systems use the inverted file as their indexing structure. The inverted file, however, requires inefficient reorganization when new documents are to be added to an existing collection. Most studies suggest dealing with this problem by reserving free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocate the spare space. This approach estimates the space requirements in an inverted file using only a small amount of the most recent statistical data on space usage and document update request rate. For best indexing speed and space efficiency, the amount of spare space to be allocated is determined by adaptively balancing the trade-off between reorganization reduction and space utilization. Experimental results show that the proposed space-sparing approach avoids much of the reorganization in updating an inverted file, while unused free space can be well controlled such that the file access speed is not affected.

Journal Article•10.1016/J.IPM.2003.08.008•

Pattern matching in Huffman encoded texts

Shmuel T. Klein¹, Dana Shapira²•Institutions (2)

Bar-Ilan University¹, Brandeis University²

01 Jul 2005-Information Processing and Management

TL;DR: For a given text which has been encoded by a static Huffman code, the possibility of locating a given pattern directly in the compressed text is investigated and an algorithm is suggested which reduces the number of erroneously declared matches.

Abstract: For a given text which has been encoded by a static Huffman code, the possibility of locating a given pattern directly in the compressed text is investigated. The main problem is one of synchronization, as an occurrence of the encoded pattern in the encoded text does not necessarily correspond to an occurrence of the pattern in the text. A simple algorithm is suggested which reduces the number of erroneously declared matches. The probability of such false matches is analyzed and empirically tested.
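
The synchronization problem described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: the three-symbol prefix code `CODE` and the helper names are hypothetical, and the check against codeword boundaries uses the uncompressed text purely for demonstration. It searches for the encoded pattern as a bit string and separates hits that start on a codeword boundary (true occurrences) from hits that start mid-codeword (false matches).

```python
# Hypothetical static prefix code, chosen only for illustration.
CODE = {"a": "0", "b": "10", "c": "11"}


def encode(text, code=CODE):
    """Concatenate the codewords of each symbol into one bit string."""
    return "".join(code[ch] for ch in text)


def codeword_boundaries(text, code=CODE):
    """Return the bit offsets at which a codeword of the encoded text starts."""
    boundaries = set()
    pos = 0
    for ch in text:
        boundaries.add(pos)
        pos += len(code[ch])
    return boundaries


def find_bit_matches(text, pattern, code=CODE):
    """Search the encoded pattern in the encoded text, bit by bit.

    Returns (true_matches, false_matches) as lists of bit offsets.
    A hit is a true match only if it begins on a codeword boundary;
    otherwise the bits merely straddle neighbouring codewords.
    """
    bits = encode(text, code)
    pat = encode(pattern, code)
    boundaries = codeword_boundaries(text, code)
    true_matches, false_matches = [], []
    start = bits.find(pat)
    while start != -1:
        if start in boundaries:
            true_matches.append(start)
        else:
            false_matches.append(start)
        start = bits.find(pat, start + 1)
    return true_matches, false_matches


if __name__ == "__main__":
    # "cab" encodes to 11|0|10; the pattern "b" (10) also appears
    # across the boundary between "c" and "a" at bit offset 1.
    print(encode("cab"))
    print(find_bit_matches("cab", "b"))
```

Under this assumed code, the pattern `b` has one real occurrence in `cab`, but a naive bit-level search also reports a hit that straddles the codewords of `c` and `a`, which is the kind of erroneously declared match the abstract refers to.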

Journal Article•10.1016/J.IPM.2004.06.008•

Structured queries, language modeling, and relevance modeling in cross-language information retrieval

Leah S. Larkey¹, Margaret E. Connell¹•Institutions (1)

TL;DR: It is found that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.

Abstract: Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.

Journal Article•10.1016/J.IPM.2004.04.001•

Searching digital music libraries

David Bainbridge¹, Michael Dewsnip¹, Ian H. Witten¹•Institutions (1)

University of Waikato¹

TL;DR: This paper focuses on methods for evaluating different symbolic music matching strategies, and describes a series of experiments that compare and contrast results obtained using three dominant paradigms.

Abstract: There has been a recent explosion of interest in digital music libraries. In particular, interactive melody retrieval is a striking example of a search paradigm that differs radically from the standard full-text search. Many different techniques have been proposed for melody matching, but the area lacks standard databases that allow them to be compared on common grounds, and copyright issues have stymied attempts to develop such a corpus. This paper focuses on methods for evaluating different symbolic music matching strategies, and describes a series of experiments that compare and contrast results obtained using three dominant paradigms. Combining two of these paradigms yields a hybrid approach which is shown to have the best overall combination of efficiency and effectiveness.

Journal Article•10.1016/J.IPM.2005.03.016•

Symmetry and other transformation features of Lorenz/Leimkuhler representations of informetric data

Quentin L. Burrell

Pohang University of Science and Technology¹

TL;DR: The use of Lorenz/Leimkuhler concentration curves in an informetric context is developed to provide additional and more incisive analytic tools for the practising informetrician.

Abstract: In this paper we develop in particular the use of Lorenz/Leimkuhler concentration curves in an informetric context. Many of the features to be presented are akin to, or are adaptations of, ones that have featured in the econometric literature but not in informetrics. We acknowledge in particular our debt to Lambert [Lambert, P. J. (2001). The distribution and redistribution of income. Manchester: Manchester University Press] and Kleiber and Kotz [Kleiber, C., & Kotz, S. (2003). Statistical size distributions in economics and actuarial sciences. New Jersey: Wiley] for source material in the econometrics literature. Although the development is purely theoretical, the aim is to provide additional and more incisive analytic tools for the practising informetrician.

Journal Article•10.1016/J.IPM.2003.11.001•

Probabilistic information retrieval model for a dependency structured indexing system

Changki Lee¹, Gary Geunbae Lee¹•Institutions (1)