Cut-and-paste text summarization

Open Access

- 01 Jan 2002

31

TL;DR: This thesis presents a cut-and-paste approach to the text generation problem in domain-independent, single-document summarization, and describes a large-scale, reusable lexicon built by combining multiple, heterogeneous resources.

Abstract: Automatic text summarization provides a concise summary for a document. In this thesis, we present a cut-and-paste approach to addressing the text generation problem in domain-independent, single-document summarization. We found that professional abstractors often reuse the text in an original document for producing the text in a summary. But rather than simply extracting the original text, as in most existing automatic summarizers, humans often edit the extracted sentences. We call such editing operations “revision operations”. Our summarizer simulates two revision operations that are frequently used by humans: sentence reduction and sentence combination. Sentence reduction removes inessential phrases from sentences and sentence combination merges sentences and phrases together. The sentence reduction algorithm we propose relies on multiple sources of knowledge to decide when it is appropriate to delete a phrase from a sentence, including linguistic knowledge, probabilities trained from corpus examples, and context information. The sentence combination module relies on a set of rules to decide how to combine sentences and phrases and when to combine them. Sentence reduction aims to improve the conciseness of generated summaries and sentence combination aims to improve the coherence of generated summaries. We call this approach “cut-and-paste” since it produces summaries by excerpting and combining sentences and phrases from original documents, unlike the extraction technique which produces summaries by simply extracting sentences or passages. Our work also includes a Hidden Markov Model based sentence decomposition program which analyzes human-written summaries. The decomposition program identifies where the phrases of a summary originate in the original document, producing an aligned corpus of summaries and articles that we use to train and evaluate the summarizer. 
We also built a large-scale, reusable lexicon by combining multiple, heterogeneous resources. The lexicon contains lexical, syntactic, and semantic knowledge. It can be used in many applications.
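The decomposition step described in the abstract can be illustrated with a small Viterbi search. This is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract only says the program is based on a Hidden Markov Model, so the state space (one state per word position in the document), the function names `transition_score` and `decompose`, and every numeric score below are hypothetical choices made for illustration. The scores are unnormalized heuristics rather than trained probabilities.

```python
from collections import defaultdict


def transition_score(prev, cur):
    """Hypothetical heuristic score for moving between two document positions.

    Positions are (sentence_index, word_index) pairs. The values are invented
    for illustration and simply favour moves that follow document order.
    """
    (ps, pw), (cs, cw) = prev, cur
    if ps == cs and cw == pw + 1:
        return 1.0  # adjacent words in the same sentence
    if ps == cs and cw > pw:
        return 0.7  # later in the same sentence
    if ps == cs:
        return 0.5  # earlier in the same sentence
    if abs(ps - cs) <= 2:
        return 0.3  # a nearby sentence
    return 0.1      # a distant sentence


def decompose(summary_words, document):
    """Pick the most likely document position for each summary word.

    `document` is a list of sentences, each a list of words. Summary words
    that never occur in the document are skipped. Returns a list of
    (sentence_index, word_index) pairs, one per remaining summary word.
    """
    # Index every document word by the positions where it occurs.
    index = defaultdict(list)
    for s, sentence in enumerate(document):
        for w, word in enumerate(sentence):
            index[word.lower()].append((s, w))

    observed = [w.lower() for w in summary_words if w.lower() in index]
    if not observed:
        return []

    # Viterbi: best[pos] holds (score, path) for the best path ending at pos.
    best = {pos: (1.0, [pos]) for pos in index[observed[0]]}
    for word in observed[1:]:
        new_best = {}
        for cur in index[word]:
            score, path = max(
                ((p_score * transition_score(prev, cur), p_path)
                 for prev, (p_score, p_path) in best.items()),
                key=lambda t: t[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best

    return max(best.values(), key=lambda t: t[0])[1]


document = [
    ["the", "committee", "approved", "the", "bill"],
    ["the", "senate", "rejected", "the", "bill", "yesterday"],
]
summary = ["the", "senate", "rejected", "the", "bill"]
alignment = decompose(summary, document)
```

In this toy input the word "the" occurs four times in the document, and the search resolves each occurrence in the summary to the second sentence because that keeps the aligned positions contiguous. A practical version would likely work in log space to avoid underflow on long summaries and would need a policy for words that have no match in the document.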

Citations

•Book

Automatic Summarization

Ani Nenkova, +2 more

- 27 Jun 2011

TL;DR: Discusses the challenges that remain open, in particular the need for language generation and the deeper semantic understanding of language that future advances in the field would require.

889

•Book

Machine Learning for Text

Charu C. Aggarwal

- 01 Feb 2019

TL;DR: This textbook covers machine learning topics for text in detail and targets graduate students in computer science, as well as researchers, professors, and industrial practitioners working in these related fields.

229

•Proceedings Article•10.3115/1708322.1708329

Dependency tree based sentence compression

Katja Filippova, +1 more

- 12 Jun 2008

TL;DR: Presents a novel unsupervised method for sentence compression that relies on a dependency tree representation and shortens sentences by removing subtrees; it also demonstrates that the choice of parser affects the system's performance.

150

•Proceedings Article•10.3115/1613715.1613741

Sentence Fusion via Dependency Graph Compression

Katja Filippova, +1 more

- 25 Oct 2008

TL;DR: Presents a novel unsupervised sentence fusion method, applied to a corpus of German biographies, that outperforms the fusion approach of Barzilay & McKeown (2005) with respect to readability.

146

•Proceedings Article

Sentence Compression for Automated Subtitling: A Hybrid Approach

Vincent Vandeghinste, +1 more

- 01 Jan 2004

TL;DR: This paper describes how an input sentence is analysed using, among other tools, a tagger, a shallow parser and a subordinate clause detector, and how, based on this analysis, several compressed versions of the sentence are generated, each with an associated estimated probability.

64

References

•Journal Article•10.1109/5.18626

A tutorial on hidden Markov models and selected applications in speech recognition

Lawrence R. Rabiner

- 01 Feb 1989

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.

24.3K

•Book

Computational analysis of present-day American English

Henry Kučera, +5 more

- 01 Jan 1967

7.7K

•Journal Article•10.1109/TIT.1967.1054010

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm

Andrew J. Viterbi

- 01 Apr 1967

- IEEE Transactions on Information Theory

TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R_{0} and whose performance bears certain similarities to that of sequential decoding algorithms.

7.6K

•Journal Article•10.1093/IJL/3.4.235

Introduction to WordNet: An On-line Lexical Database

George A. Miller, +4 more

- 01 Dec 1990

- International Journal of Lexicography

TL;DR: Standard alphabetical procedures for organizing lexical information put together words that are spelled alike and scatter words with similar or related meanings haphazardly through the list.

5.5K

•Book Chapter•10.1016/B978-1-55860-377-6.50023-2

Fast effective rule induction

William W. Cohen

- 09 Jul 1995

TL;DR: This paper evaluates the recently proposed rule learning algorithm IREP on a large and diverse collection of benchmark problems, and proposes a number of modifications resulting in an algorithm, RIPPERk, that is very competitive with C4.5 and C4.5rules with respect to error rates, but much more efficient on large samples.

4.5K

Related Papers (5)

The automatic creation of literature abstracts

A trainable document summarizer

Multiple alternative sentence compressions as a tool for automatic summarization tasks

Using hidden Markov modeling to decompose human-written summaries : Summarization

Corpus Based Extractive Document Summarization for Indic Script