TL;DR: A novel extractive approach based on manifold-ranking of sentences to this summarization task can significantly outperform existing approaches of the top performing systems in DUC tasks and baseline approaches.
Abstract: Topic-focused multi-document summarization aims to produce a summary biased to a given topic or user profile. This paper presents a novel extractive approach based on manifold-ranking of sentences to this summarization task. The manifold-ranking process can naturally make full use of both the relationships among all the sentences in the documents and the relationships between the given topic and the sentences. The ranking score is obtained for each sentence in the manifold-ranking process to denote the biased information richness of the sentence. Then the greedy algorithm is employed to impose diversity penalty on each sentence. The summary is produced by choosing the sentences with both high biased information richness and high information novelty. Experiments on DUC2003 and DUC2005 are performed and the ROUGE evaluation results show that the proposed approach can significantly outperform existing approaches of the top performing systems in DUC tasks and baseline approaches.
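Example (an illustrative sketch, not the authors' implementation): the abstract names two stages, a manifold-ranking pass that scores sentences against the topic and a greedy pass that penalizes redundancy. The code below assumes the standard manifold-ranking iteration f ← αSf + (1 − α)y over a symmetrically normalized cosine-similarity graph, with the topic as the only seed node. The penalty form and the parameters `alpha` and `omega` are assumptions made for illustration; the abstract does not specify them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def manifold_rank(vectors, alpha=0.6, iters=100):
    """vectors[0] is the topic, the rest are sentences.

    Returns (sentence_scores, sentence_similarity_matrix).
    """
    n = len(vectors)
    # Affinity matrix with a zero diagonal (no self-loops).
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    d = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(d[i] * d[j]) if d[i] and d[j] else 0.0
          for j in range(n)] for i in range(n)]
    y = [1.0] + [0.0] * (n - 1)  # only the topic node is a seed
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f[1:], [row[1:] for row in s[1:]]

def greedy_select(scores, sim, k, omega=1.0):
    """Pick k sentences, lowering the score of sentences similar to each pick."""
    original = list(scores)
    current = list(scores)
    remaining = set(range(len(scores)))
    chosen = []
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda i: current[i])
        chosen.append(best)
        remaining.remove(best)
        for j in remaining:
            # Assumed penalty: proportional to similarity with the pick.
            current[j] -= omega * sim[j][best] * original[best]
    return chosen
```

Calling `manifold_rank` on a topic vector followed by sentence vectors and passing the result to `greedy_select` yields the indices of sentences that are both on-topic and mutually dissimilar.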
TL;DR: A new approach to automatic summarization based on neural nets, called NetSum, that extracts a set of features from each sentence that helps identify its importance in the document, and applies novel features based on news search query logs and Wikipedia entities.
Abstract: We present a new approach to automatic summarization based on neural nets, called NetSum. We extract a set of features from each sentence that helps identify its importance in the document. We apply novel features based on news search query logs and Wikipedia entities. Using the RankNet learning algorithm, we train a pair-based sentence ranker to score every sentence in the document and identify the most important sentences. We apply our system to documents gathered from CNN.com, where each document includes highlights and an article. Our system significantly outperforms the standard baseline in the ROUGE-1 measure on over 70% of our document set.
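Example (a simplified sketch, not NetSum itself): RankNet trains on pairs of items, modelling the probability that one item should outrank another as a logistic function of the score difference and minimizing the cross-entropy of that probability. The code below applies that pairwise loss to a linear scorer rather than a neural net, purely to keep the example short. The two-dimensional feature vectors are hypothetical stand-ins for the sentence features mentioned in the abstract.

```python
import math

def score(weights, features):
    """Linear scoring function (an assumed stand-in for the neural net)."""
    return sum(w * x for w, x in zip(weights, features))

def pair_loss(s_hi, s_lo):
    """Cross-entropy loss when the first item should outrank the second."""
    return math.log1p(math.exp(-(s_hi - s_lo)))

def train(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (better_features, worse_features) tuples."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = score(weights, better) - score(weights, worse)
            # Derivative of log(1 + exp(-diff)) with respect to diff.
            grad = -1.0 / (1.0 + math.exp(diff))
            for k in range(dim):
                weights[k] -= lr * grad * (better[k] - worse[k])
    return weights
```

After training, sentences are ranked by `score(weights, features)` and the top-scoring ones are taken as the extracted highlights.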
TL;DR: The results of a user study comparing the PubCloud tag-cloud summarization of query results with the standard result list provided by PubMed indicated that the tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration but that it is less effective at the task of enabling users to discover relations between concepts.
Abstract: In this paper, we describe an application, PubCloud, that uses tag clouds for the summarization of results from queries over the PubMed database of biomedical literature. PubCloud responds to queries of this database with tag clouds generated from words extracted from the abstracts returned by the query. The results of a user study comparing the PubCloud tag-cloud summarization of query results with the standard result list provided by PubMed indicated that the tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration but that it is less effective at the task of enabling users to discover relations between concepts.
TL;DR: In this article, the authors present methods, devices, systems and tools that allow the summarization of text, audio, and audiovisual presentations, such as movies, into less lengthy forms.
Abstract: Methods, devices, systems and tools are presented that allow the summarization of text, audio, and audiovisual presentations, such as movies, into less lengthy forms. High-content media files are shortened in a manner that preserves important details, by splitting the files into segments, rating the segments, and reassembling preferred segments into a final abridged piece. Summarization of media can be customized by user selection of criteria, and opens new possibilities for delivering entertainment, news, and information in the form of dense, information-rich content that can be viewed by means of broadcast or cable distribution, “on-demand” distribution, internet and cell phone digital video streaming, or can be downloaded onto an iPod™ and other portable video playback devices.
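Example (a hypothetical sketch of the split, rate, reassemble pipeline described above): segments are ranked by a user-supplied rating function, the best ones are kept while they fit a duration budget, and the survivors are put back in their original order. The `(duration, payload)` segment representation, the `rate` callback, and the `budget` parameter are assumptions made for illustration.

```python
def abridge(segments, rate, budget):
    """Keep the highest-rated segments that fit within the duration budget.

    segments: list of (duration, payload) tuples in playback order.
    rate: function mapping a segment to a numeric preference score.
    budget: maximum total duration of the abridged piece.
    """
    ranked = sorted(range(len(segments)),
                    key=lambda i: rate(segments[i]), reverse=True)
    kept, used = [], 0.0
    for i in ranked:
        duration = segments[i][0]
        if used + duration <= budget:
            kept.append(i)
            used += duration
    # Reassemble the preferred segments in their original order.
    return [segments[i] for i in sorted(kept)]
```

Swapping in a different `rate` function corresponds to the user-selected customization criteria mentioned in the abstract.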
TL;DR: A novel iterative reinforcement approach to simultaneously extracting summary and keywords from single document under the assumption that the summary and keywords of a document can be mutually boosted.
Abstract: Though both document summarization and keyword extraction aim to extract concise representations from documents, these two tasks have usually been investigated independently. This paper proposes a novel iterative reinforcement approach to simultaneously extracting summary and keywords from single document under the assumption that the summary and keywords of a document can be mutually boosted. The approach can naturally make full use of the reinforcement between sentences and keywords by fusing three kinds of relationships between sentences and words, either homogeneous or heterogeneous. Experimental results show the effectiveness of the proposed approach for both tasks. The corpus-based approach is validated to work almost as well as the knowledge-based approach for computing word semantics.
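Example (a simplified sketch under stated assumptions, not the paper's exact formulation): mutual boosting can be pictured as an iteration in which a sentence's saliency depends on the saliency of related sentences and of the words it contains, while a word's saliency depends on related words and on the sentences containing it. The three affinity matrices `ss`, `ww`, and `sw` stand for the sentence-to-sentence, word-to-word, and sentence-to-word relationships. How they are computed, and the mixing weights `alpha` and `beta`, are assumptions here.

```python
def normalize(vec):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(vec)
    return [x / total for x in vec] if total else vec

def reinforce(ss, ww, sw, alpha=0.5, beta=0.5, iters=50):
    """Jointly score sentences and words.

    ss[i][j]: sentence-sentence affinity, ww[i][j]: word-word affinity,
    sw[i][j]: affinity between sentence i and word j.
    Returns (sentence_scores, word_scores).
    """
    n_s, n_w = len(sw), len(sw[0])
    u = [1.0 / n_s] * n_s  # sentence saliency
    v = [1.0 / n_w] * n_w  # word saliency
    for _ in range(iters):
        new_u = [alpha * sum(ss[j][i] * u[j] for j in range(n_s))
                 + beta * sum(sw[i][j] * v[j] for j in range(n_w))
                 for i in range(n_s)]
        new_v = [alpha * sum(ww[j][i] * v[j] for j in range(n_w))
                 + beta * sum(sw[j][i] * u[j] for j in range(n_s))
                 for i in range(n_w)]
        u, v = normalize(new_u), normalize(new_v)
    return u, v
```

The top-ranked entries of the first returned list would form the summary and the top-ranked entries of the second would form the keyword list.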
TL;DR: Two independent methods for identifying salient sentences in biomedical texts using concepts derived from domain-specific resources are presented and it is shown that the best performance is achieved when the two methods are combined.
Abstract: Text summarization is a method for data reduction. The use of text summarization enables users to reduce the amount of text that must be read while still assimilating the core information. The data reduction offered by text summarization is particularly useful in the biomedical domain, where physicians must continuously find clinical trial study information to incorporate into their patient treatment efforts. Such efforts are often hampered by the high-volume of publications. This paper presents two independent methods (BioChain and FreqDist) for identifying salient sentences in biomedical texts using concepts derived from domain-specific resources. Our semantic-based method (BioChain) is effective at identifying thematic sentences, while our frequency-distribution method (FreqDist) removes information redundancy. The two methods are then combined to form a hybrid method (ChainFreq). An evaluation of each method is performed using the ROUGE system to compare system-generated summaries against a set of manually-generated summaries. The BioChain and FreqDist methods outperform some common summarization systems, while the ChainFreq method improves upon the base approaches. Our work shows that the best performance is achieved when the two methods are combined. The paper also presents a brief physician's evaluation of three randomly-selected papers from an evaluation corpus to show that the author's abstract does not always reflect the entire contents of the full-text.
TL;DR: A new data set specifically designed for the evaluation of systems for book summarization is introduced, and summarization techniques that explicitly account for the length of the documents are described.
Abstract: Most of the text summarization research carried out to date has been concerned with the summarization of short documents (e.g., news stories, technical reports), and very little work if any has been done on the summarization of very long documents. In this paper, we try to address this gap and explore the problem of book summarization. We introduce a new data set specifically designed for the evaluation of systems for book summarization, and describe summarization techniques that explicitly account for the length of the documents.
TL;DR: A coherent graph-based semantic clustering and summarization approach for biomedical literature that takes advantage of ontology-enriched graphical representations significantly improves the quality of document clusters and understandability of documents through summaries.
Abstract: Background
A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize biomedical information in the free text, document clustering and text summarization together are used as a solution for text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature.
TL;DR: A principled comparison between the two most commonly used schemes for assigning importance to words in the context of query focused multi-document summarization finds that log-likelihood ratio is more suitable for query-focused summarization since, unlike raw frequency, it is more sensitive to the integration of the information need defined by the user.
Abstract: The increasing complexity of summarization systems makes it difficult to analyze exactly which modules make a difference in performance. We carried out a principled comparison between the two most commonly used schemes for assigning importance to words in the context of query focused multi-document summarization: raw frequency (word probability) and log-likelihood ratio. We demonstrate that the advantages of log-likelihood ratio come from its known distributional properties which allow for the identification of a set of words that in its entirety defines the aboutness of the input. We also find that LLR is more suitable for query-focused summarization since, unlike raw frequency, it is more sensitive to the integration of the information need defined by the user.
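Example (an illustrative sketch, not the authors' code): the two weighting schemes compared above can be computed from simple counts. Word probability is the relative frequency of a word in the input. The log-likelihood ratio compares the word's rate in the input with its rate in a background corpus under a binomial model. The function and parameter names are assumptions, as is the choice of background corpus.

```python
import math

def word_probability(count, total):
    """Raw frequency: share of input tokens that are this word."""
    return count / total

def _ll(k, n, p):
    """Binomial log-likelihood without the constant term; 0 * log(0) is 0."""
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1.0 - p)
    return out

def llr(k1, n1, k2, n2):
    """-2 log(lambda) for a word seen k1 times in n1 input tokens
    and k2 times in n2 background tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2.0 * (_ll(k1, n1, p1) + _ll(k2, n2, p2)
                  - _ll(k1, n1, p) - _ll(k2, n2, p))
```

Because the statistic is asymptotically chi-square distributed with one degree of freedom, a cutoff such as 10.83 (the 0.001 significance level) can be used to select a set of topic words, which is one way to read the "known distributional properties" the abstract refers to. The statistic is two-sided, so a check that the word is more frequent in the input than in the background is usually added.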
TL;DR: This novel framework for summarization has the advantage of producing highly responsive summaries, as indicated by the evaluation results.
Abstract: Generating summaries that meet the information needs of a user relies on (1) several forms of question decomposition; (2) different summarization approaches; and (3) textual inference for combining the summarization strategies. This novel framework for summarization has the advantage of producing highly responsive summaries, as indicated by the evaluation results.
TL;DR: It is found that acoustic and structural features are more important for Broadcast News summarization due to the speaking styles of anchors and reporters, as well as typical news story flow.
Abstract: We carry out a comprehensive study of acoustic/prosodic, linguistic and structural features for speech summarization, contrasting two genres of speech, namely Broadcast News and Lecture Speech. We find that acoustic and structural features are more important for Broadcast News summarization due to the speaking styles of anchors and reporters, as well as typical news story flow. Due to the relatively small contribution of lexical features, Broadcast News summarization does not depend heavily on ASR accuracies. We use an SVM-based summarizer to select the best features for extractive summarization, and obtain state-of-the-art performance: ROUGE-L F-measure of 0.64 for Mandarin Broadcast News, and 0.65 for Mandarin Lecture Speech. In the case of Lecture Speech summarization, where lexical features are more important, we make the surprising discovery that summarization performance is very high (0.63 ROUGE-L F-measure) even when the ASR accuracy is low (21% CER). Index Terms: speech summarization
TL;DR: This paper proposes a novel framework called CollabSum for collaborative single document summarizations by making use of mutual influences of multiple documents within a cluster context by first employing the clustering algorithm to obtain appropriate document clusters and then exploiting the graph-ranking based algorithm for collaborative document summarization within each cluster.
Abstract: Almost all existing methods conduct the summarization tasks for single documents separately without interactions for each document under the assumption that the documents are considered independent of each other. This paper proposes a novel framework called CollabSum for collaborative single document summarizations by making use of mutual influences of multiple documents within a cluster context. In this study, CollabSum is implemented by first employing the clustering algorithm to obtain appropriate document clusters and then exploiting the graph-ranking based algorithm for collaborative document summarizations within each cluster. Both the within-document and cross-document relationships between sentences are incorporated in the algorithm. Experiments on the DUC2001 and DUC2002 datasets demonstrate the encouraging performance of the proposed approach. Different clustering algorithms have been investigated and we find that the summarization performance relies positively on the quality of the document clusters.
TL;DR: Today, with digitally stored information available in abundance, even for many minor languages, this information must by some means be filtered and extracted in order to avoid drowning in it.
Abstract: Today, with digitally stored information available in abundance, even for many minor languages, this information must by some means be filtered and extracted in order to avoid drowning in it. Autom ...
TL;DR: The Document Understanding Conference (DUC) 2005 evaluation had a single user-oriented, question-focused summarization task, which was to synthesize from a set of 25--50 documents a well-organized, fluent answer to a complex question as discussed by the authors.
Abstract: The Document Understanding Conference (DUC) 2005 evaluation had a single user-oriented, question-focused summarization task, which was to synthesize from a set of 25--50 documents a well-organized, fluent answer to a complex question. The evaluation shows that the best summarization systems have difficulty extracting relevant sentences in response to complex questions (as opposed to representative sentences that might be appropriate to a generic summary). The relatively generous allowance of 250 words for each answer also reveals how difficult it is for current summarization systems to produce fluent text from multiple documents.
TL;DR: A new user query based text summarization technique that makes use of WordNet, a general knowledge source from Princeton University, is proposed that is specially tuned to summarize medical documents by integrating Unified Medical Language System, a medical ontology knowledge source from National Library of Medicine.
Abstract: As huge amounts of knowledge are created rapidly, effective information access becomes an important issue. Especially for critical domains, such as medical and financial areas, efficient retrieval of concise and relevant information is highly desired. In this paper we propose a new user query based text summarization technique that makes use of WordNet, a general knowledge source from Princeton University. Our summarization system is specially tuned to summarize medical documents by integrating Unified Medical Language System, a medical ontology knowledge source from National Library of Medicine. We participated in the Document Understanding Conference 2007 Main Task and ranked in the middle tier of 32 systems.
TL;DR: A method for automatic summarization based on a Markov model of the source text, by a simple greedy word selection strategy, is presented, and summaries with high ROUGE-scores are generated.
Abstract: We show some limitations of the ROUGE evaluation method for automatic summarization. We present a method for automatic summarization based on a Markov model of the source text. By a simple greedy word selection strategy, summaries with high ROUGE-scores are generated. These summaries would however not be considered good by human readers. The method can be adapted to trick different settings of the ROUGEeval package.
TL;DR: An automated content analysis and summarization framework for creating moving-image summaries for narrative videos aimed at preserving the story line to the level that users can watch the summary instead of the original content.
Abstract: The movie industry produces thousands of feature films and TV series annually. Such massive data volumes would take consumers more than a lifetime to watch. Therefore, summarization of narrative media, which engages in providing concise and informative video summaries, has become a popular topic of research. However, most of the summarization solutions so far aim to represent just the overall atmosphere of the video at the expense of the story line. In this paper we describe a novel approach for automated creation of summaries for narrative videos. We propose an automated content analysis and summarization framework for creating moving-image summaries. We aim at preserving the story line to the level that users can watch the summary instead of the original content. Our solution is based on textual cues available in subtitles and movie scripts. We extract features like keywords and main characters' names and presence, and combine them in an importance function to identify the moments most relevant for preserving the story line. We develop several summarization methods and evaluate the quality of the resulting summaries in terms of user understanding and user satisfaction through a user test.
TL;DR: A parse-and-trim approach with a novel technique for producing multiple alternative compressions for source sentences and using weighted features of these candidates to construct summaries for multi-document summarization.
Abstract: We perform multi-document summarization by generating compressed versions of source sentences as summary candidates and using weighted features of these candidates to construct summaries. We combine a parse-and-trim approach with a novel technique for producing multiple alternative compressions for source sentences. In addition, we use a novel method for tuning the feature weights that maximizes the change in the ROUGE-2 score (ΔROUGE) between the already existing summary state and the new state that results from the addition of the candidate under consideration. We also describe experiments using a new paraphrase-based feature for redundancy checking. Finally, we present the results of our DUC2007 submissions and some ideas for future work.
TL;DR: An efficient algorithm for language independent generic extractive summarization for single document based on structural and statistical factors is described, which shows that the method performs equally well regardless of the language.
Abstract: This paper describes an efficient algorithm for language independent generic extractive summarization for single document. The algorithm is based on structural and statistical (rather than semantic) factors. Through evaluations performed on a single-document summarization for English, Hindi, Gujarati and Urdu documents, we show that the method performs equally well regardless of the language. The algorithm has been applied on DUC data for English documents and various newspaper articles for other languages with corresponding stop words list and modified stemmer. The results of summarization have been compared with DUC 2002 data using degree of representativeness. For other languages, the degree of representativeness we get is highly encouraging.
TL;DR: This work proposes the TimedTextRank algorithm to make use of the temporal information of documents based on the graph-ranking based algorithm for dynamic multi-document summarization.
Abstract: Graph-ranking based algorithms (e.g. TextRank) have been proposed for multi-document summarization in recent years. However, these algorithms miss an important dimension, the temporal dimension, for summarizing evolving topics. For an evolving topic, recent documents are usually more important than earlier documents because recent documents contain much more novel information than earlier documents and a novelty-oriented summary should be more appropriate to reflect the changing topic. We propose the TimedTextRank algorithm to make use of the temporal information of documents based on the graph-ranking based algorithm. A preliminary study is performed to demonstrate the effectiveness of the proposed TimedTextRank algorithm for dynamic multi-document summarization.
TL;DR: A new summarization method, which uses non-negative matrix factorization (NMF) and K-means clustering, is introduced to extract meaningful sentences from multi-documents and has better performance than other methods using the LSA, the K-means, and the NMF.
Abstract: In this paper, a new summarization method, which uses non-negative matrix factorization (NMF) and K-means clustering, is introduced to extract meaningful sentences from multi-documents. The proposed method can improve the quality of document summaries because the inherent semantics of the documents are well reflected by using the semantic features calculated by NMF and the sentences most relevant to the given topic are extracted efficiently by using the semantic variables derived by NMF. Besides, it uses K-means clustering to remove noise so that it can avoid the biased inherent semantics of the documents being reflected in summaries. We perform detailed experiments with the well-known DUC test dataset. The experimental results demonstrate that the proposed method has better performance than other methods using the LSA, the K-means, and the NMF.
TL;DR: A comprehensive evaluation demonstrates how the social summarization technique can generate summaries that are of significantly higher quality than those produced by a number of leading alternatives.
Abstract: We describe a novel document summarization technique that uses informational cues, such as social bookmarks or search queries, as the basis for summary construction by leveraging the snippet-generation capabilities of standard search engines. A comprehensive evaluation demonstrates how the social summarization technique can generate summaries that are of significantly higher quality than those produced by a number of leading alternatives.
TL;DR: Four modern systems of automatic text summarization are tested on the basis of a model vocabulary composed by subjects and principles for evaluation of the efficiency of the current systems are described.
Abstract: Four modern systems of automatic text summarization are tested on the basis of a model vocabulary composed by subjects. Distribution of terms of the vocabulary in the source text is compared with their distribution in summaries of different length generated by the systems. Principles for evaluation of the efficiency of the current systems of automatic text summarization are described.
TL;DR: An approach which draws on methods from each of these areas of information retrieval, topical summarization, and Information Extraction is presented, and the effectiveness of this approach with a query-focused summarization approach is compared.
Abstract: This paper addresses the task of providing extended responses to questions regarding specialized topics. This task is an amalgam of information retrieval, topical summarization, and Information Extraction (IE). We present an approach which draws on methods from each of these areas, and compare the effectiveness of this approach with a query-focused summarization approach. The two systems are evaluated in the context of the prosecution queries like those in the DARPA GALE distillation evaluation.
TL;DR: A clustering algorithm based on fuzzy set theory is shown to be easy to implement and integrate into a personal information system, to generate a highly flexible data structure for topic analysis and summarization, and to deliver excellent performance.
Abstract: Large document collections, such as those delivered by Internet search engines, are difficult and time-consuming for users to read and analyse. The detection of common and distinctive topics within a document set, together with the generation of multi-document summaries, can greatly ease the burden of information management. We show how this can be achieved with a clustering algorithm based on fuzzy set theory, which (i) is easy to implement and integrate into a personal information system, (ii) generates a highly flexible data structure for topic analysis and summarization, and (iii) also delivers excellent performance.
TL;DR: This paper presents a text summarizer for Bangla, which uses some extraction methods for text summarization.
Abstract: Text summarization is the technique of automatically creating an abstract or summary of a text. The technique has been under development for many years, and a survey of different summarization techniques has been carried out. No work in this area has been done for the Bangla language. This paper presents a text summarizer for Bangla, which uses some extraction methods for text summarization.
TL;DR: The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion, and the cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
Abstract: Existing methods for single document summarization usually make use of only the information contained in the specified document. This paper proposes the technique of document expansion to provide more knowledge to help single document summarization. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and then the graph-ranking based algorithm is applied on the expanded document set for extracting sentences from the single document, by making use of both the within-document relationships between sentences of the specified document and the cross-document relationships between sentences of all documents in the document set. The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion. The cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
TL;DR: The fractal summarization model for document summarization on handheld devices, developed based on the fractal theory, is introduced and the three-tier architecture with the middle-tier conducting the major computation is discussed.
Abstract: Wireless access with handheld devices is a promising addition to the WWW and traditional electronic business. Handheld devices provide convenient, portable access to the huge information space on the Internet without requiring users to be stationary with a network connection. Many customer-centered m-services applications have been developed. Mobile computing, however, should be extended to decision support in an organization. There is a desire to access the most up-to-date and accurate information on handheld devices for fast decision making in an organization. Unfortunately, loading and visualizing large documents on handheld devices is impossible due to the devices' shortcomings. In this paper, we introduce the fractal summarization model for document summarization on handheld devices. Fractal summarization is developed based on the fractal theory. It generates a brief skeleton of the summary at the first stage, and the details of the summary at different levels of the document are generated on user demand. Such interactive summarization reduces the computation load compared with generating the entire summary in one batch, as traditional automatic summarization does, which is ideal for wireless access. The three-tier architecture with the middle-tier conducting the major computation is also discussed. Visualization of summary on handheld devices is also investigated. The automatic summarization, the three-tier architecture, and the information visualization are potential solutions to the existing problems in information delivery to handheld devices for mobile commerce.
TL;DR: By using a machine reading (MR) framework in order to construct representations of the knowledge inferable from a text collection, Language Computer Corporation’s GISTEXTER systems were able to create coherent sets of "update" summaries that were likely to contain "new" information that could not be inferred from any previously considered document.
Abstract: In this paper, we describe Language Computer Corporation’s GISTEXTER question-focused and update-based multidocument summarization (MDS) systems. We show that by using a machine reading (MR) framework in order to construct representations of the knowledge inferable from a text collection, we were able to create coherent sets of "update" summaries that were likely to contain "new" information that could not be inferred from any previously considered document. Details of our DUC 2007 Main Task submission are provided as well.
TL;DR: A linguistically-motivated model for automatic summarization of medical articles in Spanish that takes into account the textual, lexical, discursive, syntactic and communicative dimensions and is suitable to provide high quality summarizations.
Abstract: In this article, we present the current state of our work on a linguistically-motivated model for automatic summarization of medical articles in Spanish. The model takes into account the results of an empirical study which reveals that, on the one hand, domain-specific summarization criteria can often be derived from the summaries of domain specialists, and, on the other hand, adequate summarization strategies must be multidimensional, i.e., cover various types of linguistic clues. We take into account the textual, lexical, discursive, syntactic and communicative dimensions. This is novel in the field of summarization. The experiments carried out so far indicate that our model is suitable to provide high quality summarizations.