Open Access, Proceedings Article
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
- 30 Apr 2020
TL;DR: This paper proposes a more sample-efficient pre-training task called replaced token detection: the input is corrupted by replacing some tokens with plausible alternatives sampled from a small generator network, and a discriminative model then predicts whether each token in the corrupted input was replaced by a generator sample or not.
Abstract: While masked language modeling (MLM) pre-training methods such as BERT produce excellent results on downstream NLP tasks, they require large amounts of compute to be effective. These approaches corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some input tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the model learns from all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where we match the performance of RoBERTa, the current state-of-the-art pre-trained transformer, while using less than 1/4 of the compute.
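The corruption and labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `corrupt_for_rtd`, the toy vocabulary, and the 15% replacement rate are assumptions made for illustration, and the "generator" here is only a uniform sampler standing in for the small generator network. A sampled token that happens to equal the original is labeled as not replaced, so every position carries a training label.

```python
import random


def corrupt_for_rtd(tokens, vocab, replace_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for replaced token detection.

    labels[i] is 1 if corrupted[i] differs from tokens[i], else 0.
    The name, signature, and replace_prob default are hypothetical.
    """
    rng = random.Random(seed)
    # Choose which positions to corrupt (at least one).
    n_replace = max(1, round(replace_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_replace)

    corrupted = list(tokens)
    for i in positions:
        # Stand-in for the generator network: a uniform draw from the vocabulary.
        corrupted[i] = rng.choice(vocab)

    # The discriminator's target covers every token, not just the chosen positions.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


tokens = "the chef cooked the meal".split()
vocab = ["the", "a", "chef", "cooked", "ate", "meal", "dog"]
corrupted, labels = corrupt_for_rtd(tokens, vocab)
print(list(zip(corrupted, labels)))
```

In contrast, an MLM objective would compute a loss only at the selected positions, which is the efficiency difference the abstract points to.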
Citations
Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
TL;DR: This article proposes a question-answering (QA) based metric for evaluating the content quality of a summary. The metric directly measures a summary's information overlap with a reference, which makes it fundamentally different from text-overlap metrics.
WebFormer: The Web-page Transformer for Structure Information Extraction
Qiang Wang, Yi Fang, Anirudh Reddy Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu
- 01 Feb 2022
TL;DR: This paper introduces WebFormer, a Web-page transFormer model for structure information extraction from web documents. It designs an HTML token for each DOM node, embedding representations from neighboring tokens through graph attention, and constructs rich attention patterns between HTML tokens and text tokens.
PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting
Hao Xue, Flora D. Salim
- 20 Sep 2022
TL;DR: This paper proposed a new forecasting paradigm called prompt-based time series forecasting (PromptCast), where the numerical input and output are transformed into prompts and the forecasting task is framed in a sentence-to-sentence manner, making it possible to directly apply language models for forecasting purposes.
SLM: Learning a Discourse Language Representation with Sentence Unshuffling
Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning
- 30 Oct 2020
TL;DR: Sentence-level Language Modeling is introduced, a new pre-training objective for learning a discourse language representation in a fully self-supervised manner by shuffling the sequence of input sentences and training a hierarchical transformer model to reconstruct the original ordering.
•Posted Content
Pre-training Text-to-Text Transformers for Concept-centric Common Sense
TL;DR: It is shown that, although only incrementally pre-trained on a relatively small corpus for a few steps, CALM outperforms baseline methods by a consistent margin and is even comparable with some larger PTLMs, which suggests that CALM can serve as a general, plug-and-play method for improving the commonsense reasoning ability of a PTLM.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Generative Adversarial Nets
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
- 08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
•Proceedings Article
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg S. Corrado, Jeffrey Dean
- 16 Jan 2013
TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
•Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.