LinkBERT: Pretraining Language Models with Document Links

doi:10.48550/arXiv.2203.15827

Proceedings Article•10.48550/arXiv.2203.15827

Michihiro Yasunaga, +2 more

- 29 Mar 2022

Vol. abs/2203.15827

225

TL;DR: This work proposes LinkBERT, an LM pretraining method that leverages links between documents. LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links).

Figures

Figure 1: Document links (e.g. hyperlinks) can provide salient multi-hop knowledge. For instance, the Wikipedia article “Tidal Basin” (left) describes that the basin hosts “National Cherry Blossom Festival”. The hyperlinked article (right) reveals that the festival celebrates “Japanese cherry trees”. Taken together, the link offers new knowledge not available in a single document (e.g. “Tidal Basin has Japanese cherry trees”), which can be useful for various applications, including answering a question “What trees can you see at Tidal Basin?”. We aim to leverage document links to incorporate more knowledge into language model pretraining.

Table 1: Performance (F1) on MRQA question answering datasets. LinkBERT consistently outperforms BERT on all datasets across the -tiny, -base, and -large scales. The gain is especially large on datasets that require reasoning with multiple documents in the context, such as HotpotQA, TriviaQA, SearchQA.

Table 3: Performance (F1) on SQuAD when distracting documents are added to the context. While BERT incurs a large drop in F1, LinkBERT does not, suggesting its robustness in understanding document relations.

Table 4: Few-shot QA performance (F1) when 10% of finetuning data is used. LinkBERT attains large gains, suggesting that it internalizes more knowledge than BERT in pretraining.

Table 5: Ablation study on what linked documents to feed into LM pretraining (§4.3).

Table 6: Ablation study on the document relation prediction (DRP) objective in LM pretraining (§4.2).

Citations

Journal Article•10.48550/arXiv.2212.13138

Large Language Models Encode Clinical Knowledge

Karan Singhal, +29 more

- 26 Dec 2022

- Visual education

TL;DR: The authors proposed a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias, and showed that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.

1.2K

Journal Article•10.1093/bib/bbac409

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

Renqian Luo, +6 more

- 24 Sep 2022

- Briefings in Bioinformatics

TL;DR: This paper proposes BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature, evaluates it on six biomedical natural language processing tasks, and demonstrates that it outperforms previous models on most tasks.

730

Journal Article•10.48550/arXiv.2305.09617

Towards Expert-Level Medical Question Answering with Large Language Models

Karan Singhal, +27 more

- 16 May 2023

- arXiv.org

TL;DR: This paper proposes a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies, including a novel ensemble refinement approach, to bridge the gap between physicians' and large language models' answers.

299

Journal Article•10.48550/arxiv.2311.16452

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Harsha Nori, +17 more

- 28 Nov 2023

- arXiv.org

TL;DR: The authors find that prompting innovation can unlock deeper specialist capabilities, show that GPT-4 easily tops prior leading results on medical benchmarks, and demonstrate that Medprompt generalizes to other domains.

184

Journal Article•10.48550/arXiv.2207.08143

Can large language models reason about medical questions?

Valentin Liévin, +2 more

- 17 Jul 2022

- arXiv.org

TL;DR: It is speculated that scaling model and data, enhancing prompt alignment and allowing for better contextualization of the completions will be sufficient for LLMs to reach human-level performance on this type of task.

172
