Supervised Grammar Induction using Training Data with Limited Constituent Information
Rebecca Hwa
- 20 Jun 1999
- pp 73-79
TL;DR: This paper shows that the most informative linguistic constituents are the higher nodes in the parse trees, typically denoting complex noun phrases and sentential clauses, which account for only 20% of all constituents.
Abstract: Corpus-based grammar induction generally relies on hand-parsed training data to learn the structure of the language. Unfortunately, the cost of building large annotated corpora is prohibitively expensive. This work aims to improve the induction strategy when there are few labels in the training data. We show that the most informative linguistic constituents are the higher nodes in the parse trees, typically denoting complex noun phrases and sentential clauses. They account for only 20% of all constituents. For inducing grammars from sparsely labeled training data (e.g., only higher-level constituent labels), we propose an adaptation strategy, which produces grammars that parse almost as well as grammars induced from fully labeled corpora. Our results suggest that for a partial parser to replace human annotators, it must be able to automatically extract higher-level constituents rather than base noun phrases.
Citations
Domain Adaptation Problems: A DASVM Classification Technique and a Circular Validation Strategy
TL;DR: Experimental results confirmed the effectiveness and the reliability of both the DASVM technique and the proposed circular validation strategy for validating the learning of domain adaptation classifiers when no true labels for the target-domain instances are available.
Reranking and Self-Training for Parser Adaptation
David McClosky, Eugene Charniak, Mark Johnson
- 17 Jul 2006
TL;DR: The reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%, and use of the self-training techniques described in McClosky et al. (2006) raises this to 87.8% (an error reduction of 28%), again without any use of labeled Brown data.
•Dissertation
From Distributional to Semantic Similarity
James Curran
- 01 Jan 2004
TL;DR: This dissertation describes how to extract contexts from a corpus of over 2 billion words and introduces a new context-weighted approximation algorithm with bounded complexity in context vector size that significantly reduces the system runtime with only a minor performance penalty.
Object Recognition in 3D Point Clouds Using Web Data and Domain Adaptation
Kevin Lai, Dieter Fox
TL;DR: This paper shows how to significantly reduce the need for manually labeled training data by leveraging data sets available on the World Wide Web, using objects from Google’s 3D Warehouse to train an object detection system for 3D point clouds collected by robots navigating through both urban and indoor environments.
Unsupervised Recurrent Neural Network Grammars
Yoon Kim, Alexander M. Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, Gábor Melis
- 07 Apr 2019
TL;DR: An inference network parameterized as a neural CRF constituency parser is developed to maximize the evidence lower bound and apply amortized variational inference to unsupervised learning of RNNGs.
References
Building a Large Annotated Corpus of English: The Penn Treebank
TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Language identification in the limit
TL;DR: It was found that the class of context-sensitive languages is learnable from an informant, but that not even the class of regular languages is learnable from a text.
The ATIS spoken language systems pilot corpus
Charles T. Hemphill, John J. Godfrey, George R. Doddington
- 24 Jun 1990
TL;DR: This pilot marks the first full-scale attempt to collect a corpus to measure progress in Spoken Language Systems that include both a speech and natural language component and provides guidelines for future efforts.
An Empirical Evaluation
A. Jefferson Offutt, Stephen D. Lee
- 01 Jan 1994
TL;DR: This study evaluates the impact of alternative design concepts on the performance of 30 airline pilots interacting with a cooperative system designed to support enroute flight planning and develops recommendations for guiding the design of cooperative systems.
•Proceedings Article
Tree-bank Grammars
Eugene Charniak
- 04 Aug 1996
TL;DR: This paper presents results on a tree-bank grammar based on the Penn Wall Street Journal tree bank that outperforms other non-word-based statistical parsers/grammars on this corpus and outperforms parsers that consider the input as a string of tags and ignore the actual words of the corpus.