Linguistic Knowledge and Transferability of Contextual Representations

Open Access • Posted Content

- 21 Mar 2019

240

TL;DR: This paper investigated the transferability of contextual word representations derived from large-scale neural language models with a suite of seventeen diverse probing tasks and found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

Abstract: Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of seventeen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.
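
The probing setup described in the abstract (a linear model trained on top of frozen representations) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `train_linear_probe` and `probe_accuracy`, the hyperparameters, and the synthetic features standing in for contextualizer outputs are all assumptions made for illustration. The only point it shows is that the features stay fixed while a single linear layer is fit on them.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression probe with full-batch gradient descent.

    `features` are treated as frozen: only W and b are updated.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, features, labels):
    """Fraction of tokens whose highest-scoring class matches the label."""
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())

# Assumption: synthetic vectors stand in for frozen contextual representations.
# In a real probe these would be one layer's outputs from a pretrained model.
rng = np.random.default_rng(0)
num_tokens, dim, num_classes = 600, 16, 3
class_means = rng.normal(size=(num_classes, dim)) * 3.0
labels = rng.integers(0, num_classes, size=num_tokens)
features = class_means[labels] + rng.normal(size=(num_tokens, dim))

train_X, test_X = features[:500], features[500:]
train_y, test_y = labels[:500], labels[500:]

W, b = train_linear_probe(train_X, train_y, num_classes)
accuracy = probe_accuracy(W, b, test_X, test_y)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

Repeating the same fit on features taken from each layer in turn would give the kind of per-layer comparison the abstract refers to, although the tasks and data used in the paper are not reproduced here.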

Citations

•Posted Content

NwQM: A neural quality assessment framework for Wikipedia

Bhanu Prakash Reddy, +3 more

- 14 Oct 2020

- arXiv: Social and Information Networks

TL;DR: A novel deep learning model is proposed that combines signals from several key information sources, such as article text, metadata, and images, to obtain an improved Wikipedia article representation; it shows an 8% improvement over state-of-the-art approaches, supported by detailed ablation studies.

5

•Proceedings Article•10.18653/V1/2021.CMCL-1.25

Representation and Pre-Activation of Lexical-Semantic Knowledge in Neural Language Models

Steven Derby, +2 more

- 01 Jun 2021

- Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

TL;DR: A systematic analysis of how closely the intermediate layers of LSTM and transformer language models correspond to human semantic knowledge indicates that the transformer models are better at capturing semantic knowledge relating to lexical concepts, both during word prediction and when retention is required.

5

•Posted Content

Back Attention Knowledge Transfer for Low-resource Named Entity Recognition

Linghao Sun, +2 more

- 04 Jun 2019

- arXiv: Computation and Language

TL;DR: Back Attention Network is proposed, a general approach that uses a translation system to translate sentences from other languages into English and then applies a new mechanism, back attention knowledge transfer, to obtain task-specific information from a pre-trained NER model for a high-resource language.

5

•Posted Content

EntEval: A Holistic Evaluation Benchmark for Entity Representations

Mingda Chen, +4 more

- 31 Aug 2019

- arXiv: Computation and Language

TL;DR: This work proposes EntEval: a test suite of diverse tasks that require nontrivial understanding of entities including entity typing, entity similarity, entity relation prediction, and entity disambiguation, and develops training techniques for learning better entity representations by using natural hyperlink annotations in Wikipedia.

4

•Journal Article•10.3390/APP11177814

From General Language Understanding to Noisy Text Comprehension

Buddhika Kasthuriarachchy, +3 more

- 25 Aug 2021

- Applied Sciences

TL;DR: It is shown that sentence vector length matters less for capturing linguistic information, and that the proposed sentence vectors for noisy texts perform better than existing state-of-the-art sentence vectors.

4

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

94.2K

•Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model is proposed that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

•Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, +2 more

- 01 Jan 2015

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

25.7K

•Proceedings Article•10.18653/V1/N19-1423

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

24.6K

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Linguistic Knowledge and Transferability of Contextual Representations

A Structural Probe for Finding Syntax in Word Representations

Attention is All you Need

BERT Rediscovers the Classical NLP Pipeline