Scispace (Formerly Typeset)
Showing papers in "Information Processing and Management in 2005"
Journal Article•10.1016/J.IPM.2005.03.012•
Co-authorship networks in the digital library research community

Xiaoming Liu1, Johan Bollen2, Michael L. Nelson2, Herbert Van de Sompel1•
Los Alamos National Laboratory1, Old Dominion University2
01 Dec 2005-Information Processing and Management
TL;DR: In this paper, the authors examined the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences.
Abstract: The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 Million (US) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirectional network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in Joint Conference on Digital Libraries (JCDL).

953 citations
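
The weighted directional model in the co-authorship abstract above can be illustrated with a small sketch. It is not the authors' AuthorRank implementation: the edge weighting (each paper contributes 1/(n-1) to every ordered co-author pair, then each author's outgoing weights are normalized to sum to 1), the damping factor, and the function names `coauthor_weights` and `weighted_pagerank` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def coauthor_weights(papers):
    """Build normalized directed co-authorship weights from author lists.

    Assumption: a paper with n authors adds 1/(n-1) to each ordered pair.
    """
    raw = defaultdict(float)
    for authors in papers:
        if len(authors) < 2:
            continue  # single-author papers add no co-authorship edge
        share = 1.0 / (len(authors) - 1)
        for a, b in permutations(authors, 2):
            raw[(a, b)] += share
    totals = defaultdict(float)
    for (a, _), w in raw.items():
        totals[a] += w
    # Normalize so each author's outgoing weights sum to 1.
    return {(a, b): w / totals[a] for (a, b), w in raw.items()}


def weighted_pagerank(weights, damping=0.85, iterations=100):
    """Power iteration for PageRank over normalized edge weights."""
    nodes = sorted({n for edge in weights for n in edge})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for (src, dst), w in weights.items():
            new[dst] += damping * rank[src] * w
        rank = new
    return rank


# Hypothetical toy data: three papers by authors A, B and C.
papers = [["A", "B"], ["A", "C"], ["A", "B", "C"]]
ranks = weighted_pagerank(coauthor_weights(papers))
for author, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(author, round(score, 3))
```

In this toy network A co-authors with both B and C, so A receives the highest score, while B and C tie by symmetry.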

Journal Article•10.1016/J.IPM.2004.04.003•
Text summarization using a trainable summarizer and latent semantic analysis

Jen-Yuan Yeh1, Hao-Ren Ke1, Wei-Pang Yang1, I-Heng Meng1•
National Chiao Tung University1
01 Jan 2005-Information Processing and Management
TL;DR: One approach is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries, while the other uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map.
Abstract: This paper proposes two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significances of different sentence positions, and (2) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30%, an average f-measure of 49% for MCBA, 52% for MCBA + GA, 44% and 40% for LSA + T.R.M. in single-document and corpus level were achieved respectively.

296 citations
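
The trainable summarizer described in the abstract above combines several sentence features through a weighted score function. The sketch below shows only that scoring and selection step. The feature values, the weights (which the paper trains with a genetic algorithm, not shown here), and the names `score_sentence` and `summarize` are hypothetical.

```python
def score_sentence(features, weights):
    """Linear combination of feature values and feature weights."""
    return sum(weights[name] * value for name, value in features.items())


def summarize(sentences, weights, compression_rate=0.3):
    """Keep the top-scoring sentences, returned in document order."""
    keep = max(1, round(len(sentences) * compression_rate))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i][1], weights),
        reverse=True,
    )
    return [sentences[i][0] for i in sorted(ranked[:keep])]


# Hypothetical weights; the negative keyword feature is assumed to lower the score.
weights = {"position": 0.2, "positive": 0.3, "negative": -0.2,
           "centrality": 0.2, "title": 0.3}

# Hypothetical (sentence id, feature values) pairs.
sentences = [
    ("S0", {"position": 1.0, "positive": 0.5, "negative": 0.0, "centrality": 0.6, "title": 0.8}),
    ("S1", {"position": 0.5, "positive": 0.1, "negative": 0.4, "centrality": 0.2, "title": 0.1}),
    ("S2", {"position": 0.3, "positive": 0.7, "negative": 0.0, "centrality": 0.5, "title": 0.4}),
    ("S3", {"position": 0.1, "positive": 0.0, "negative": 0.6, "centrality": 0.1, "title": 0.0}),
]

summary = summarize(sentences, weights, compression_rate=0.5)
print(summary)  # ['S0', 'S2']
```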

Journal Article•10.1016/J.IPM.2003.11.004•
Using the patent co-citation approach to establish a new patent classification system

Kuei-Kuei Lai, Shiao-Jun Wu1•
National Yunlin University of Science and Technology1
01 Mar 2005-Information Processing and Management
TL;DR: A new approach to create a patent classification system to replace the IPC or UPC system for conducting patent analysis and management is proposed, based on co-citation analysis of bibliometrics to assist patent managers in understanding the basic patents for a specific industry.
Abstract: The paper proposes a new approach to create a patent classification system to replace the IPC or UPC system for conducting patent analysis and management. The new approach is based on co-citation analysis of bibliometrics. The traditional approach for management of patents, which is based on either the IPC or UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adaptive to the characteristics of a specific industry. The analysis of this approach is divided into three phases. Phase I selects appropriate databases to conduct patent searches according to the subject and objective of this study and then selects basic patents. Phase II uses the co-cited frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is to develop a patent classification system based on patent similarities to assist patent managers in understanding the basic patents for a specific industry, the relationships among categories of technologies and the evolution of a technology category.

218 citations
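
Phase II of the patent classification abstract above relies on the co-cited frequency of patent pairs. A minimal sketch of that counting step follows. The input layout (a mapping from each citing patent to the patents it cites), the identifiers, and the function name `cocitation_counts` are assumptions for illustration, and the factor analysis of Phase III is not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_to_cited):
    """Count how often each pair of patents is cited together."""
    counts = Counter()
    for cited in citing_to_cited.values():
        # Each citing patent contributes one co-citation per unordered pair.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: citing patents P10..P12 and basic patents A, B, C.
citations = {
    "P10": ["A", "B", "C"],
    "P11": ["A", "B"],
    "P12": ["B", "C"],
}

counts = cocitation_counts(citations)
print(counts[("A", "B")], counts[("A", "C")], counts[("B", "C")])  # 2 1 2
```

The resulting pair counts form a symmetric similarity matrix that a later step could feed into factor analysis or clustering.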

Journal Article•10.1016/J.IPM.2005.03.021•
Combining full text and bibliometric information in mapping scientific disciplines

Patrick Glenisson1, Wolfgang Glänzel2, Frizo Janssens1, Bart De Moor1•
Katholieke Universiteit Leuven1, Hungarian Academy of Sciences2
01 Dec 2005-Information Processing and Management
TL;DR: Full text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods and confirm the main results of the pilot study that such hybrid methodology can be applied to both research evaluation and information retrieval.
Abstract: In the present study, results of an earlier pilot study by Glenisson, Glänzel and Persson are extended on the basis of larger sets of papers. Full text analysis and traditional bibliometric methods are serially combined to improve the efficiency of the two individual methods. The text mining methodology already introduced in the pilot study is applied to the complete publication year 2003 of the journal Scientometrics. Altogether 85 documents that can be considered research articles or notes have been selected for this exercise. The outcomes confirm the main results of the pilot study, namely, that such hybrid methodology can be applied to both research evaluation and information retrieval. Nevertheless, Scientometrics documents published in 2003 cover a much broader and more heterogeneous spectrum of bibliometrics and related research than those analysed in the pilot study. A modified subject classification based on the scheme used in an earlier study by Schoepflin and Glänzel has been applied for validation purposes.

183 citations

Journal Article•10.1016/J.IPM.2003.12.001•
The impact of webpage content characteristics on webpage visibility in search engine results (part I)

Jin Zhang1, Alexandra Dimitroff1•
University of Wisconsin–Milwaukee1
01 May 2005-Information Processing and Management
TL;DR: Webpage visibility in search engine results lists can be improved by increasing the frequency of keywords in the title, in the full-text, and in both the title and full-text.
Abstract: Content characteristics of a webpage include factors such as keyword position in a webpage, keyword duplication, layout, and their combination. These factors may impact webpage visibility in a search engine. Four hypotheses are presented relating to the impact of selected content characteristics on webpage visibility in search engine results lists. Webpage visibility can be improved by increasing the frequency of keywords in the title, in the full-text and in both the title and full-text.

115 citations

Journal Article•10.1016/J.IPM.2005.01.006•
Review of Discovering knowledge in data, an introduction to data mining by Daniel T. Larose, Ph.D., John Wiley & Sons Inc., December 2004

Donald K. Wedding
01 Sep 2005-Information Processing and Management

103 citations

Journal Article•10.1016/J.IPM.2003.12.002•
The impact of metadata implementation on webpage visibility in search engine results (part II)

Jin Zhang1, Alexandra Dimitroff1•
University of Wisconsin–Milwaukee1
01 May 2005-Information Processing and Management
TL;DR: Findings suggest that metadata is a good mechanism to improve webpage visibility, the metadata subject field plays a more important role than any other metadata field and keywords extracted from the webpage itself, particularly title or full-text, are most effective.
Abstract: This paper discusses the impact of metadata implementation in a webpage on its visibility performance in a search engine results list. Influential internal and external factors of metadata implementation were identified. How these factors affect webpage visibility in a search engine results list was examined in an experimental study. Findings suggest that metadata is a good mechanism to improve webpage visibility, the metadata subject field plays a more important role than any other metadata field and keywords extracted from the webpage itself, particularly title or full-text, are most effective. To maximize the effects, these keywords should come from both title and full-text.

94 citations

Journal Article•10.1016/J.IPM.2003.10.001•
A modeling approach to uncover hyperlink patterns: the case of Canadian universities

Liwen Vaughan1, Mike Thelwall2•
University of Western Ontario1, Information Technology University2
01 Mar 2005-Information Processing and Management
TL;DR: A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site; analysis of interlinking between pairs of universities also showed that English universities are advantaged.
Abstract: Hyperlink patterns between Canadian university Web sites were analyzed by a mathematical modeling approach. A multiple regression model was developed which shows that faculty quality and the language of the university are important predictors for links to a university Web site. Higher faculty quality means more links. French universities received lower numbers of links to their Web sites than comparable English universities. Analysis of interlinking between pairs of universities also showed that English universities are advantaged. Universities are more likely to link to each other when the geographical distance between them is less than 3000 km, possibly reflecting the east vs. west divide that exists in Canadian society.

88 citations

Journal Article•10.1016/J.IPM.2004.10.005•
Clustered SVD strategies in latent semantic indexing

Jing Gao1, Jun Zhang1•
University of Kentucky1
01 Sep 2005-Information Processing and Management
TL;DR: This work proposes to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which the truncated SVD is applied, and shows that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.
Abstract: The text retrieval method using latent semantic indexing (LSI) technique with truncated singular value decomposition (SVD) has been intensively studied in recent years. The SVD reduces the noise contained in the original representation of the term-document matrix and improves the information retrieval accuracy. Recent studies indicate that SVD is mostly useful for small homogeneous data collections. For large inhomogeneous datasets, the performance of the SVD based text retrieval technique may deteriorate. We propose to partition a large inhomogeneous dataset into several smaller ones with clustered structure, on which we apply the truncated SVD. Our experimental results show that the clustered SVD strategies may enhance the retrieval accuracy and reduce the computing and storage costs.

75 citations

Journal Article•10.1016/J.IPM.2004.04.006•
A probabilistic model for stemmer generation

Michela Bacchin1, Nicola Ferro1, Massimo Melucci1•
University of Padua1
01 Jan 2005-Information Processing and Management
TL;DR: A language-independent probabilistic model which can automatically generate stemmers is presented, and a series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.
Abstract: In this paper we present a language-independent probabilistic model which can automatically generate stemmers. Stemmers can improve the retrieval effectiveness of information retrieval systems; however, designing and implementing stemmers requires considerable effort because documents and queries are often written or spoken in several different languages. The probabilistic model proposed in this paper aims at the development of stemmers used for several languages. The proposed model describes the mutual reinforcement relationship between stems and derivations and then provides a probabilistic interpretation. A series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.

65 citations

Journal Article•10.1016/J.IPM.2003.10.003•
Choosing document structure weights

Andrew Trotman1•
University of Otago1
01 Mar 2005-Information Processing and Management
TL;DR: Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting, and analysis suggests BM25 cannot be improved using structure weighting.
Abstract: Existing ranking schemes assume all term occurrences in a given document are of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere. An occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure. Vector space, probability, and Okapi BM25 ranking are extended to include structure weighting. Weights are then selected for the TREC WSJ collection using a genetic algorithm. The learned weights are then tested on an evaluation set of queries. Structure weighted vector space inner product and structure weighted probabilistic retrieval show about a 5% improvement in mean average precision over their unstructured counterparts. Structure weighted BM25 shows nearly no improvement. Analysis suggests BM25 cannot be improved using structure weighting.
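The idea of structure weighting in the abstract above can be sketched as a term frequency in which each occurrence counts according to the structure it appears in. This is an illustrative simplification and not the paper's ranking functions: the weights here are fixed by hand (the paper learns them with a genetic algorithm), and the names `weighted_tf` and `inner_product_score` are hypothetical.

```python
def weighted_tf(doc, weights):
    """Term frequencies where each occurrence counts its structure's weight.

    doc maps a structure name (e.g. "abstract") to its list of terms.
    Structures without an explicit weight default to 1.0.
    """
    tf = {}
    for structure, terms in doc.items():
        w = weights.get(structure, 1.0)
        for term in terms:
            tf[term] = tf.get(term, 0.0) + w
    return tf


def inner_product_score(query_terms, doc, weights):
    """Inner product of a binary query vector with the weighted tf vector."""
    tf = weighted_tf(doc, weights)
    return sum(tf.get(term, 0.0) for term in query_terms)


# Hypothetical document and weights.
doc = {
    "abstract": ["retrieval", "ranking"],
    "body": ["retrieval", "retrieval", "index"],
}
query = ["retrieval", "ranking"]

structured = inner_product_score(query, doc, {"abstract": 2.0, "body": 1.0})
unstructured = inner_product_score(query, doc, {})
print(structured, unstructured)  # 6.0 4.0
```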
Journal Article•10.1016/J.IPM.2004.01.002•
Temporal document retrieval model for business news archives

Pawel Jan Kalczynski1, Amy Y. Chou1•
College of Business Administration1
01 May 2005-Information Processing and Management
TL;DR: The classical Vector Space Model is extended to the Temporal Document Retrieval Model (TDRM) that incorporates the proposed fuzzy representations of temporal expressions and a method for representing uncertainty based on fuzzy numbers is proposed.
Abstract: Temporal expressions occurring in business news, such as "last week" or "at the end of this month," carry important information about the time context of the news document and have proved useful for document retrieval. We found that about 10% of these expressions are difficult to project onto the calendar due to the uncertainty about their bounds. This paper introduces a novel approach to representing temporal expressions. A user study is conducted to measure the degree of uncertainty for selected temporal expressions and a method for representing uncertainty based on fuzzy numbers is proposed. The classical Vector Space Model is extended to the Temporal Document Retrieval Model (TDRM) that incorporates the proposed fuzzy representations of temporal expressions.
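One common way to represent an uncertain time interval with a fuzzy number is a trapezoidal membership function, sketched below. The abstract above does not state which shape the authors use, so the trapezoid, the day offsets chosen for "last week", and the function name `trapezoid_membership` are all assumptions for illustration.

```python
def trapezoid_membership(x, a, b, c, d):
    """Membership of x in a trapezoidal fuzzy number with a <= b <= c <= d.

    Membership is 0 outside [a, d], 1 on [b, c], and linear in between.
    """
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical reading of "last week" as day offsets from the publication date:
# certainly between 7 days and 1 day ago, possibly as far back as 10 days.
last_week = (-10.0, -7.0, -1.0, 0.0)

for offset in (-8.5, -4.0, -0.5, 3.0):
    print(offset, trapezoid_membership(offset, *last_week))
```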
Journal Article•10.1016/J.IPM.2003.08.002•
Re-ranking algorithm using post-retrieval clustering for content-based image retrieval

Gunhan Park1, Yunju Baek1, Heung-Kyu Lee2•
KAIST1, Pusan National University2
01 Mar 2005-Information Processing and Management
TL;DR: A re-ranking algorithm using post-retrieval clustering for content-based image retrieval (CBIR) that achieves an improvement of retrieval effectiveness of over 10% on average in the average normalized modified retrieval rank (ANMRR) measure.
Abstract: In this paper, we propose a re-ranking algorithm using post-retrieval clustering for content-based image retrieval (CBIR). In conventional CBIR systems, it is often observed that images visually dissimilar to a query image are ranked high in retrieval results. To remedy this problem, we utilize the similarity relationship of the retrieved results via post-retrieval clustering. In the first step of our method, images are retrieved using visual features such as color histogram. Next, the retrieved images are analyzed using hierarchical agglomerative clustering methods (HACM) and the rank of the results is adjusted according to the distance of a cluster from a query. In addition, we analyze the effects of clustering methods, query-cluster similarity functions, and weighting factors in the proposed method. We conducted a number of experiments using several clustering methods and cluster parameters. Experimental results show that the proposed method achieves an improvement of retrieval effectiveness of over 10% on average in the average normalized modified retrieval rank (ANMRR) measure.
Journal Article•10.1016/J.IPM.2005.03.017•
Domain analysis and information retrieval through the construction of heliocentric maps based on ISI-JCR category cocitation

Félix de Moya-Anegón1, Benjamín Vargas-Quesada1, Zaida Chinchilla-Rodríguez1, Elena Corera-Álvarez1, Victor Herrero-Solana1, Francisco José Muñoz-Fernández1 •
University of Granada1
01 Dec 2005-Information Processing and Management
TL;DR: The use of ISI-JCR categories as units of cocitation and measurement for the construction of heliocentric maps is proposed and used to illustrate, analyze and compare domains in terms of the categories and their interconnections or links.
Abstract: We propose the use of ISI-JCR categories as units of cocitation and measurement for the construction of heliocentric maps. The use of a spatial metaphor allows us to illustrate, analyze and compare domains in terms of the categories and their interconnections or links. We can also move around within the structure of these domains for further analysis, and access the documents associated to the categories and to the links that cocite or relate them.
Journal Article•10.1016/J.IPM.2004.06.009•
Empirical studies on the impact of lexical resources on CLIR performance

Jinxi Xu1, Ralph Weischedel1•
BBN Technologies1
01 May 2005-Information Processing and Management
TL;DR: Experiments measuring cross-lingual information retrieval (CLIR) performance as a function of bilingual term lists, parallel corpora, machine translation (MT), and stemmers show that one can achieve an acceptable CLIR performance using only a bilingual term list.
Abstract: In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:
• One can achieve an acceptable CLIR performance using only a bilingual term list (70-80% on Chinese and Arabic corpora).
• However, if a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
• If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially overcome the lack of parallel text.
• While stemming is useful normally, with a very large parallel corpus for Arabic-English, stemming hurt performance in our empirical studies with Arabic, a highly inflected language.
Journal Article•10.1016/J.IPM.2004.08.005•
Document indexing: a concept-based approach to term weight estimation

Bo-Yeong Kang1, Sang-Jo Lee1•
Kyungpook National University1
01 Sep 2005-Information Processing and Management
TL;DR: A new indexing formalism is developed that considers not only the terms in a document, but also the concepts, and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document.
Abstract: Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.
Journal Article•10.1016/J.IPM.2005.03.013•
Challenges of scholarly publications on the web to the evaluation of science: a comparison of author visibility on the web and in print journals

Dangzhi Zhao1•
University of Alberta1
01 Dec 2005-Information Processing and Management
TL;DR: Results from this study demonstrate both the importance and the feasibility of the use of multiple citation data sources in citation analysis studies of scholarly communication, and provide evidence for a developing "two tier" scholarly communication system.
Abstract: This article reveals different patterns of scholarly communication in the XML research field on the web and in print journals in terms of author visibility, and challenges the common practice of exclusively using the ISI's databases to obtain citation counts as scientific performance indicators. Results from this study demonstrate both the importance and the feasibility of the use of multiple citation data sources in citation analysis studies of scholarly communication, and provide evidence for a developing "two tier" scholarly communication system.
Journal Article•10.1016/J.IPM.2004.02.003•
Character contiguity in N-gram-based word matching: the case for Arabic text searching

Suleiman H. Mustafa1•
Yarmouk University1
01 Jul 2005-Information Processing and Management
TL;DR: The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used.
Abstract: This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams, combining contiguous and non-contiguous. The two techniques were tested using three experiments involving different levels of textual word stemming, a textual corpus containing about 25 thousand words (with a total size of about 160KB), and a set of 100 query textual words. The results of the hybrid approach showed significant performance improvement over the conventional contiguous approach, especially in the cases where stemming was used. The present results and the inconsistent findings of previous studies raise some questions regarding the efficiency of pure conventional N-gram matching and the ways in which it should be used in languages other than English.
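The difference between contiguous and hybrid N-grams in the abstract above can be illustrated with bigrams. The sketch below is not the paper's exact scheme: it assumes non-contiguous bigrams skip exactly one character, uses the Dice coefficient as the similarity measure, and works on hypothetical Latin transliterations ("ktb" as a root, "kAtb" as a derived word). All function names are illustrative.

```python
def contiguous_bigrams(word):
    """Bigrams of adjacent characters."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def skip_bigrams(word):
    """Non-contiguous bigrams that skip exactly one character (an assumption)."""
    return {word[i] + word[i + 2] for i in range(len(word) - 2)}


def hybrid_bigrams(word):
    """Union of contiguous and one-skip bigrams."""
    return contiguous_bigrams(word) | skip_bigrams(word)


def dice(a, b):
    """Dice coefficient between two sets of N-grams."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


root, word = "ktb", "kAtb"  # hypothetical transliterated root and derived word
contiguous_sim = dice(contiguous_bigrams(root), contiguous_bigrams(word))
hybrid_sim = dice(hybrid_bigrams(root), hybrid_bigrams(word))
print(contiguous_sim, hybrid_sim)  # 0.4 0.5
```

The infixed character breaks the contiguous bigram "kt", which the one-skip bigram recovers, so the hybrid similarity is higher in this example.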
Journal Article•10.1016/J.IPM.2004.08.002•
Techniques for improving web retrieval effectiveness

Eui-Kyu Park1, Dong-Yul Ra1, Myung-Gil Jang•
Yonsei University1
01 Sep 2005-Information Processing and Management
TL;DR: Stratifying and re-ranking the retrieval list based on the maximum count of index terms in common between a sentence and a query resulted in significant improvement of performance.
Abstract: This paper discusses several schemes for improving retrieval effectiveness that can be used in the named page finding tasks of web information retrieval (Overview of the TREC-2002 web track. In: Proceedings of the Eleventh Text Retrieval Conference TREC-2002, NIST Special Publication #500-251, 2003). These methods were applied on top of the basic information retrieval model as additional mechanisms to upgrade the system. Use of the title of web pages was found to be effective. It was confirmed that anchor texts of incoming links were beneficial, as suggested in other works. Sentence-query similarity is a new type of information proposed by us and was identified to be the best information to take advantage of. Stratifying and re-ranking the retrieval list based on the maximum count of index terms in common between a sentence and a query resulted in significant improvement of performance. To demonstrate these facts a large-scale web information retrieval system was developed and used for experimentation.
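The stratify-and-re-rank step in the abstract above can be sketched as follows: each document gets, as its stratum, the maximum number of query terms shared with any one of its sentences, and documents are ordered by stratum while keeping the original order within a stratum. The tie-breaking rule, the data layout, and the function names are assumptions, not details taken from the paper.

```python
def max_sentence_overlap(doc_sentences, query_terms):
    """Largest number of query terms shared with any single sentence."""
    query = set(query_terms)
    return max((len(query & set(s)) for s in doc_sentences), default=0)


def stratified_rerank(ranked_docs, query_terms):
    """Re-rank (doc_id, sentences) pairs by overlap stratum.

    Within a stratum, the original retrieval order is preserved.
    """
    keyed = [
        (-max_sentence_overlap(sentences, query_terms), position, doc_id)
        for position, (doc_id, sentences) in enumerate(ranked_docs)
    ]
    return [doc_id for _, _, doc_id in sorted(keyed)]


# Hypothetical initial ranking; each sentence is a list of index terms.
query = ["trec", "web", "track"]
ranked_docs = [
    ("d1", [["web", "search"], ["home", "page"]]),
    ("d2", [["trec", "web", "track", "overview"]]),
    ("d3", [["web", "track"], ["news"]]),
    ("d4", [["trec", "2002"]]),
]

reranked = stratified_rerank(ranked_docs, query)
print(reranked)  # ['d2', 'd3', 'd1', 'd4']
```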
Journal Article•10.1016/J.IPM.2005.03.027•
Bibliometric maps of field of science

Irina Marshakova-Shaikevich
01 Dec 2005-Information Processing and Management
TL;DR: The present paper is devoted to two directions in algorithmic classificatory procedures: the journal co-citation analysis as an example of citation networks and lexical analysis of keywords in the titles and texts.
Abstract: The present paper is devoted to two directions in algorithmic classificatory procedures: the journal co-citation analysis as an example of citation networks and lexical analysis of keywords in the titles and texts. What is common to those approaches is the general idea of normalization of deviations of the observed data from the mathematical expectation. The application of the same formula leads to discovery of statistically significant links between objects (journals in one case, keywords in the other). The results of the journal co-citation analysis are reflected in tables and a map for the field "Women's Studies" and for the field "Information Science and Library Science". An experimental attempt at establishing textual links between words was carried out on two samples from the SSCI database: (1) EDUCATION and (2) ETHICS. The EDUCATION file included 2180 documents (of which 751 had abstracts); the ETHICS file included 807 documents (289 abstracts). Some examples of the results of this pilot study are given in tabular form. The binary links between words discovered in this way may form triplets or other groups with more than two member words.
Journal Article•10.1016/J.IPM.2004.04.005•
Beyond information searching and browsing: acquiring knowledge from digital libraries

Ling Feng1, Manfred A. Jeusfeld2, J.J.A.C. Hoppenbrouwers2•
University of Twente1, Tilburg University2
01 Jan 2005-Information Processing and Management
TL;DR: In this article, a two-layered digital library model is proposed to support different levels of human cognitive acts, thus enabling new kinds of knowledge exchange among library users, where the low layer of the model, namely, tactical cognition support layer, provides users with requested relevant documents, as searching and browsing do.
Abstract: Digital libraries (DLs) are a resource for answering complex questions. Up to now, such systems mainly support keyword-based searching and browsing. The mapping from a research question to keywords and the assessment of whether an article is relevant to a research question are left entirely to the user. In this paper, we present a two-layered digital library model. The aim is to enhance current DLs to support different levels of human cognitive acts, thus enabling new kinds of knowledge exchange among library users. The low layer of the model, namely, the tactical cognition support layer, provides users with requested relevant documents, as searching and browsing do. The upper layer of the model, namely, the strategic cognition support layer, not only provides users with relevant documents but also directly and intelligently answers users' cognitive questions. On the basis of the proposed model, we divide the DL information space into two subspaces, i.e., a knowledge subspace and a document subspace, where documents in the document subspace serve as the justification for the corresponding knowledge in the knowledge subspace. The knowledge subspace and its construction, as well as query facilities against the enhanced DLs for users' knowledge sharing and exchange, are discussed in detail.
Journal Article•10.1016/J.IPM.2005.03.009•
New indicators for gender studies in Web networks

Hildrun Kretschmer1, Isidro F. Aguillo•
Royal Netherlands Academy of Arts and Sciences1
01 Dec 2005-Information Processing and Management
TL;DR: New indicators for gender studies are introduced using the multi-authored publications of a set of 64 members of the COLLNET network and the results show strong gender differences in favour of publications with only male co-authors.
Abstract: New indicators for gender studies are introduced using the multi-authored publications of a set of 64 members of the COLLNET network. Considering the sample of 223 papers and their visibility on the Web, analyses were performed on the sets comprised of publications on the one hand and the authors (social network analysis) on the other. The indicators developed include: Gender co-operation; Web visibility rates and Gender centrality in networks. The co-operation among COLLNET members results in a publication set with a different pattern to other gender studies in science, which show low contribution rates of women in collaboration activities. In the sample the collaboration patterns of female COLLNET members are almost the same as their male counterparts. However, when considering only the arrangement of gender in the by-line of publications, this is in favour of men. Regarding average Web visibility rates, there are no differences dependent upon the arrangement of gender in publications with both male and female co-authors. However, the results also show strong gender differences in favour of publications with only male co-authors. The centrality of female actors in the networks is correlated with the central positions of women in the International Society for Scientometrics and Informetrics (ISSI), where they are leading either as presidents or organisers of its international conferences. It may well be that, with webometric methods and new indicators for gender studies, there is a real continuation of gender equality in COLLNET over time.
Journal Article•10.1016/J.IPM.2003.12.006•
Generic technologies for single- and multi-document summarization

Marie-Francine Moens1, Roxana Angheluta1, Jos Dumortier1•
Katholieke Universiteit Leuven1
01 May 2005-Information Processing and Management
TL;DR: The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.
Abstract: The technologies for single- and multi-document summarization that are described and evaluated in this article can be used on heterogeneous texts for different summarization tasks. They refer to the extraction of important sentences from the documents, compressing the sentences to their essential or relevant content, and detecting redundant content across sentences. The technologies were tested at the Document Understanding Conference, organized by the National Institute of Standards and Technology, USA, in 2002 and 2003. The system obtained good to very good results in this competition. We also tested our summarization system on a variety of English Encyclopedia texts and on Dutch magazine articles. The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.
Journal Article•10.1016/J.IPM.2004.02.004•
Assessing learning outcomes in two information retrieval learning environments

[...]

Kai Halttunen1, Kalervo Järvelin1•
University of Tampere1
01 Jul 2005-Information Processing and Management
TL;DR: Assessment of learning outcomes in an experimental, but naturalistic, learning environment compared to more traditional instruction indicates that anchored instruction and scaffolding with an instructional tool, the IR Game, with performance feedback enables students to construct queries with fewer semantic knowledge errors in operational IR systems.
Abstract: In order to design information retrieval (IR) learning environments and instruction, it is important to explore learning outcomes of different pedagogical solutions. Learning outcomes have seldom been evaluated in IR instruction. The particular focus of this study is the assessment of learning outcomes in an experimental, but naturalistic, learning environment compared to more traditional instruction. The 57 participants of an introductory course on IR were selected for this study, and the analysis illustrates their learning outcomes regarding both conceptual change and development of IR skill. Concept mapping of student essays was used to analyze conceptual change, and log-files of search exercises provided data for performance assessment. Students in the experimental learning environment changed their conceptions more regarding linguistic aspects of IR and placed more emphasis on planning and managing the search process. Performance assessment indicates that anchored instruction and scaffolding with an instructional tool, the IR Game, with performance feedback enables students to construct queries with fewer semantic knowledge errors also in operational IR systems.
Journal Article•10.1016/J.IPM.2003.10.004•
A statistics-based approach to incrementally update inverted files

[...]

Wann-Yun Shieh1, Chung-Ping Chung1•
National Chiao Tung University1
01 Mar 2005-Information Processing and Management
TL;DR: In this paper, a run-time statistics-based approach is proposed to allocate the spare space in an inverted file to avoid reorganization in the inverted file, and unused free space can be well controlled such that file access speed is not affected.
Abstract: Many information retrieval systems use the inverted file as indexing structure. The inverted file, however, requires inefficient reorganization when new documents are to be added to an existing collection. Most studies suggest dealing with this problem by sparing free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocate the spare space. This approach estimates the space requirements in an inverted file using only a small amount of the most recent statistical data on space usage and document update request rate. For best indexing speed and space efficiency, the amount of the spare space to be allocated is determined by adaptively balancing the trade-offs between reorganization reduction and space utilization. Experimental results show that the proposed space-sparing approach significantly reduces reorganization when updating an inverted file while keeping unused free space well controlled, so that the file access speed is not affected.
Journal Article•10.1016/J.IPM.2003.08.008•
Pattern matching in Huffman encoded texts

[...]

Shmuel T. Klein1, Dana Shapira2•
Bar-Ilan University1, Brandeis University2
01 Jul 2005-Information Processing and Management
TL;DR: For a given text which has been encoded by a static Huffman code, the possibility of locating a given pattern directly in the compressed text is investigated and an algorithm is suggested which reduces the number of erroneously declared matches.
Abstract: For a given text which has been encoded by a static Huffman code, the possibility of locating a given pattern directly in the compressed text is investigated. The main problem is one of synchronization, as an occurrence of the encoded pattern in the encoded text does not necessarily correspond to an occurrence of the pattern in the text. A simple algorithm is suggested which reduces the number of erroneously declared matches. The probability of such false matches is analyzed and empirically tested.
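The synchronization problem described in this abstract can be illustrated with a minimal sketch. The three-symbol prefix code below is a hypothetical example chosen for illustration, and the boundary check uses the original text, which a real compressed-domain search would not have. It is not the algorithm proposed in the paper; it only shows how an occurrence of the encoded pattern can fall inside or across codewords and so be a false match.

```python
# Hypothetical static prefix code (a valid Huffman code when 'a' is the most frequent symbol).
CODE = {"a": "0", "b": "10", "c": "11"}


def encode(text):
    """Concatenate the codewords of every character in the text."""
    return "".join(CODE[ch] for ch in text)


def codeword_boundaries(text):
    """Return the bit offsets at which a codeword starts in the encoded text."""
    offsets, pos = set(), 0
    for ch in text:
        offsets.add(pos)
        pos += len(CODE[ch])
    return offsets


def naive_bit_matches(text, pattern):
    """Return every bit offset where the encoded pattern occurs in the encoded text."""
    bits, pat = encode(text), encode(pattern)
    return [i for i in range(len(bits) - len(pat) + 1) if bits.startswith(pat, i)]


def classify_matches(text, pattern):
    """Split naive bit-level matches into (true, false) lists.

    A match is true only if it starts on a codeword boundary; for a prefix
    code, an aligned match always decodes to the pattern itself.
    """
    boundaries = codeword_boundaries(text)
    true_matches, false_matches = [], []
    for i in naive_bit_matches(text, pattern):
        (true_matches if i in boundaries else false_matches).append(i)
    return true_matches, false_matches


if __name__ == "__main__":
    # "cab" encodes to 11010; the codeword for "b" (10) appears at bits 1 and 3,
    # but only offset 3 is aligned with a codeword boundary.
    print(classify_matches("cab", "b"))
```

In this sketch the match at bit offset 1 straddles the codewords for "c" and "a", which is the kind of erroneously declared match the abstract refers to.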
Journal Article•10.1016/J.IPM.2004.06.008•
Structured queries, language modeling, and relevance modeling in cross-language information retrieval

[...]

Leah S. Larkey1, Margaret E. Connell1•
University of Massachusetts Amherst1
01 May 2005-Information Processing and Management
TL;DR: It is found that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.
Abstract: Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.
Journal Article•10.1016/J.IPM.2004.04.001•
Searching digital music libraries

[...]

David Bainbridge1, Michael Dewsnip1, Ian H. Witten1•
University of Waikato1
01 Jan 2005-Information Processing and Management
TL;DR: This paper focuses on methods for evaluating different symbolic music matching strategies, and describes a series of experiments that compare and contrast results obtained using three dominant paradigms.
Abstract: There has been a recent explosion of interest in digital music libraries. In particular, interactive melody retrieval is a striking example of a search paradigm that differs radically from the standard full-text search. Many different techniques have been proposed for melody matching, but the area lacks standard databases that allow them to be compared on common grounds--and copyright issues have stymied attempts to develop such a corpus. This paper focuses on methods for evaluating different symbolic music matching strategies, and describes a series of experiments that compare and contrast results obtained using three dominant paradigms. Combining two of these paradigms yields a hybrid approach which is shown to have the best overall combination of efficiency and effectiveness.
Journal Article•10.1016/J.IPM.2005.03.016•
Symmetry and other transformation features of Lorenz/Leimkuhler representations of informetric data

[...]

Quentin L. Burrell
01 Dec 2005-Information Processing and Management
TL;DR: The use of Lorenz/Leimkuhler concentration curves in an informetric context is developed to provide additional and more incisive analytic tools for the practising informetrician.
Abstract: In this paper we develop in particular the use of Lorenz/Leimkuhler concentration curves in an informetric context. Many of the features to be presented are akin to, or are adaptations of, ones that have featured in the econometric literature but not in informetrics. We acknowledge in particular our debt to Lambert [Lambert, P. J. (2001). The distribution and redistribution of income. Manchester: Manchester University Press] and Kleiber and Kotz [Kleiber, C., & Kotz, S. (2003). Statistical size distributions in economics and actuarial sciences. New Jersey: Wiley] for source material in the econometrics literature. Although the development is purely theoretical, the aim is to provide additional and more incisive analytic tools for the practising informetrician.
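As a small illustration of the curves this abstract refers to, the sketch below computes the points of a Lorenz curve (sources ordered from least to most productive) and a Leimkuhler curve (most to least productive) for a list of counts. The function names and the sample data are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction


def concentration_curve(counts, descending=False):
    """Return (cumulative share of sources, cumulative share of items) points."""
    ordered = sorted(counts, reverse=descending)
    total, n = sum(ordered), len(ordered)
    points = [(Fraction(0), Fraction(0))]
    running = 0
    for i, x in enumerate(ordered, start=1):
        running += x
        points.append((Fraction(i, n), Fraction(running, total)))
    return points


def lorenz(counts):
    """Lorenz curve: sources taken in increasing order of productivity."""
    return concentration_curve(counts)


def leimkuhler(counts):
    """Leimkuhler curve: sources taken in decreasing order of productivity."""
    return concentration_curve(counts, descending=True)


if __name__ == "__main__":
    data = [1, 1, 2, 6]  # hypothetical items per source
    for p, q in lorenz(data):
        print("Lorenz", p, q)
    for p, q in leimkuhler(data):
        print("Leimkuhler", p, q)
```

The two curves are mirror images in the sense that a Leimkuhler point (p, q) corresponds to the Lorenz point (1 - p, 1 - q): the share held by the top fraction p of sources is whatever the bottom fraction 1 - p does not hold.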
Journal Article•10.1016/J.IPM.2003.11.001•
Probabilistic information retrieval model for a dependency structured indexing system

[...]

Changki Lee1, Gary Geunbae Lee1•
Pohang University of Science and Technology1
01 Mar 2005-Information Processing and Management
TL;DR: This paper presents a new method of incorporating term dependence into a probabilistic retrieval model by adapting a dependency structured indexing system using a dependency parse tree and Chow Expansion to compensate for the weakness of the term independence assumption.
Abstract: Most previous information retrieval (IR) models assume that terms of queries and documents are statistically independent of each other. However, the conditional independence assumption is widely understood to be wrong, so we present a new method of incorporating term dependence into a probabilistic retrieval model by adapting a dependency structured indexing system using a dependency parse tree and Chow Expansion to compensate for the weakness of the assumption. In this paper, we describe a theoretic process to apply the Chow Expansion to the general probabilistic models and the state-of-the-art 2-Poisson model. Through experiments on document collections in English and Korean, we demonstrate that the incorporation of term dependences using Chow Expansion contributes to the improvement of performance in probabilistic IR systems.
...
