ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Open Access • Proceedings Article

- 30 Apr 2020

1.6K

TL;DR: This paper proposes a more sample-efficient pre-training task, replaced token detection, which corrupts the input by replacing some tokens with plausible alternatives sampled from a small generator network and then trains a discriminator to predict whether each token in the corrupted input was replaced by a generator sample.

Abstract: While masked language modeling (MLM) pre-training methods such as BERT produce excellent results on downstream NLP tasks, they require large amounts of compute to be effective. These approaches corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some input tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the model learns from all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where we match the performance of RoBERTa, the current state-of-the-art pre-trained transformer, while using less than 1/4 of the compute.
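
The replaced token detection task described in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for the small generator network (it samples uniformly from a made-up vocabulary), `replace_rate=0.15` is an assumed default that the abstract does not specify, and labeling a position as replaced only when the sampled token differs from the original is a choice made for this sketch. The loss is averaged over every position, which is the property the abstract credits for the efficiency gain over MLM.

```python
import math
import random

# Toy vocabulary; an assumption made only for this illustration.
VOCAB = ["the", "chef", "cooked", "ate", "meal", "a"]


def toy_generator(tokens, pos, rng):
    """Hypothetical stand-in for the small generator network.

    A real generator would condition on the context around `pos`;
    this one simply samples uniformly from VOCAB.
    """
    return rng.choice(VOCAB)


def corrupt(tokens, generator, replace_rate=0.15, rng=None):
    """Replace a random subset of positions with generator samples.

    Returns (corrupted_tokens, labels), where labels[i] == 1 means the
    token at position i differs from the original input.
    """
    rng = rng or random.Random()
    n_replace = max(1, round(replace_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_replace)
    corrupted = list(tokens)
    for pos in positions:
        corrupted[pos] = generator(tokens, pos, rng)
    # A sampled token identical to the original is labeled "not replaced"
    # (a choice made for this sketch).
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


def rtd_loss(replaced_probs, labels):
    """Mean binary cross-entropy over ALL positions, not only corrupted ones."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(replaced_probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    original = ["the", "chef", "cooked", "the", "meal"]
    corrupted, labels = corrupt(original, toy_generator, replace_rate=0.4, rng=rng)
    print(corrupted, labels)
    # An untrained discriminator that outputs 0.5 everywhere:
    print(rtd_loss([0.5] * len(labels), labels))
```

In an MLM setup, only the masked positions would contribute to the loss; here every entry of `labels` does, whether or not the token was replaced.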

Citations

•Posted Content

RoFormer: Enhanced Transformer with Rotary Position Embedding.

Jianlin Su, +4 more

- 20 Apr 2021

- arXiv: Computation and Language

TL;DR: The authors propose rotary position embedding (RoPE), which encodes absolute position information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation. RoPE has valuable properties, including the flexibility to extend to any sequence length, inter-token dependency that decays with increasing relative distance, and the ability to equip linear self-attention with relative position encoding.

109

•Posted Content

Pix2seq: A Language Modeling Framework for Object Detection

Ting Chen, +4 more

- 22 Sep 2021

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs: object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and a neural network is trained to perceive the image and generate the desired sequence.

108

Journal Article•10.1016/J.JBI.2020.103637

Language models are an effective representation learning technique for electronic health record data.

Ethan Steinberg, +5 more

- 01 Jan 2021

- Journal of Biomedical Informatics

TL;DR: It is demonstrated that patient representation schemes inspired by techniques from natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant.

106

•Journal Article

Multi-Head Attention: Collaborate Instead of Concatenate

Jean-Baptiste Cordonnier, +2 more

- 04 May 2021

- arXiv: Learning

TL;DR: A collaborative multi-head attention layer is proposed that enables heads to learn shared projections, reduces the computational cost and number of parameters in an attention layer, and can be used as a drop-in replacement in any transformer architecture.

104

•Posted Content

A Survey of Active Learning for Text Classification using Deep Neural Networks.

Christopher Schröder, +1 more

- 17 Aug 2020

- arXiv: Computation and Language

TL;DR: A taxonomy of query strategies is constructed that distinguishes between data-based, model-based, and prediction-based instance selection; the survey investigates the prevalence of these classes in recent research and connects the respective query strategies to the taxonomy.

102

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Journal Article•10.3156/JSOFT.29.5_177_2

Generative Adversarial Nets

Ian Goodfellow, +7 more

- 08 Dec 2014

TL;DR: A new framework for estimating generative models via an adversarial process is proposed, in which two models are trained simultaneously: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.

48.6K

Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

•Proceedings Article

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, +3 more

- 16 Jan 2013

TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

27.5K

•Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019

- arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained and, when trained properly, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

26.2K