Open Access • Posted Content
Linguistic Knowledge and Transferability of Contextual Representations
TL;DR: This paper investigated the transferability of contextual word representations derived from large-scale neural language models with a suite of seventeen diverse probing tasks and found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
Abstract: Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of seventeen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.
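The probing setup described in the abstract (a linear model trained on top of frozen representations) can be illustrated with a minimal sketch. This is not the authors' code: the function names `train_linear_probe` and `probe_accuracy` are hypothetical, the features are synthetic stand-ins for the outputs of a frozen contextualizer, and the optimizer is plain gradient descent chosen only for brevity.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax-regression probe on fixed features.

    Only the probe parameters (W, b) are updated; the features are
    treated as constants, mirroring a frozen contextualizer.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, features, labels):
    """Fraction of tokens whose predicted label matches the gold label."""
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())


# Synthetic stand-in for frozen token representations (an assumption:
# real probing would use vectors from ELMo, the OpenAI transformer, or BERT).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
class_means = rng.normal(size=(num_classes, dim)) * 2.0
labels = rng.integers(0, num_classes, size=600)
features = class_means[labels] + rng.normal(size=(600, dim))

train_X, test_X = features[:500], features[500:]
train_y, test_y = labels[:500], labels[500:]

W, b = train_linear_probe(train_X, train_y, num_classes)
acc = probe_accuracy(W, b, test_X, test_y)
print(f"held-out probe accuracy: {acc:.3f}")
```

Because the probe is linear, high held-out accuracy suggests the label is linearly recoverable from the representation itself rather than learned by the probe.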
Citations
Predicting Discourse Trees from Transformer-based Neural Summarizers
Wen Xiao,Patrick Huber,Giuseppe Carenini +2 more
- 01 Jun 2021
TL;DR: Experiments across models and datasets reveal that the summarizer learns both dependency- and constituency-style discourse information, which is typically encoded in a single head and covers both long- and short-distance discourse dependencies.
Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
Xiaochuang Han,Byron C. Wallace,Yulia Tsvetkov +2 more
- 01 Jul 2020
TL;DR: The authors investigate the use of influence functions for NLP, providing an alternative approach to interpreting neural text classifiers, and find that influence functions are particularly useful for natural language inference, a task in which saliency maps may not have clear interpretation.
Similarity Analysis of Contextual Word Representation Models
John M. Wu,Yonatan Belinkov,Hassan Sajjad,Nadir Durrani,Fahim Dalvi,James Glass +5 more
- 03 May 2020
TL;DR: This article measured the similarity of the internal representations and attention of contextual word representation models and found that models within the same family are more similar to one another, as may be expected, while different architectures have rather similar representations but different individual neurons.
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
Mengjie Zhao,Tao Lin,Fei Mi,Martin Jaggi,Hinrich Schütze +4 more
- 01 Nov 2020
TL;DR: Analyzing the loss landscape, it is shown that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy, confirming that masking can be utilized as an efficient alternative to finetuning.
•Posted Content
Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations
TL;DR: This work proposes DiscoEval, a test suite of tasks to evaluate whether sentence representations include broader context information, and proposes a variety of training objectives that makes use of natural annotations from Wikipedia to build sentence encoders capable of modeling discourse.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
138.5K
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Glove: Global Vectors for Word Representation
Jeffrey Pennington,Richard Socher,Christopher D. Manning +2 more
- 01 Oct 2014
TL;DR: This work proposes a new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
•Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau,Kyunghyun Cho,Yoshua Bengio +2 more
- 01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
25.7K
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin,Ming-Wei Chang,Kenton Lee,Kristina Toutanova +3 more
- 11 Oct 2018
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
24.6K
Related Papers (5)
John Hewitt,Christopher D. Manning +1 more
- 01 Jun 2019
Ian Tenney,Dipanjan Das,Ellie Pavlick +2 more
- 15 May 2019