Proceedings Article, DOI: 10.48550/arXiv.2203.15827
LinkBERT: Pretraining Language Models with Document Links
Michihiro Yasunaga, Jure Leskovec, Percy Liang
- 29 Mar 2022
Vol. abs/2203.15827
TL;DR: This work proposes LinkBERT, an LM pretraining method that leverages links between documents and outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links).
Abstract: Language model (LM) pretraining captures various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.
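The input-construction step described in the abstract (placing linked documents in the same context and labeling the relation between them) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the toy `CORPUS` and `LINKS` data, the `make_pair` helper, the BERT-style `[CLS]`/`[SEP]` layout, and the three-way label set (contiguous, random, linked), which the abstract does not spell out. Masked language modeling is omitted, and this is not the authors' released code.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy corpus (not the paper's data): doc id -> text segments.
CORPUS: Dict[str, List[str]] = {
    "tidal_basin": [
        "The Tidal Basin is a reservoir in Washington, D.C.",
        "It hosts the National Cherry Blossom Festival.",
    ],
    "cherry_festival": [
        "The festival celebrates Japanese cherry trees.",
        "It is held every spring.",
    ],
    "gradient_descent": [
        "Gradient descent minimizes a loss function.",
        "Each step follows the negative gradient.",
    ],
}

# Hypothetical link graph: doc id -> ids of documents it links to.
LINKS: Dict[str, List[str]] = {
    "tidal_basin": ["cherry_festival"],
    "cherry_festival": [],
    "gradient_descent": [],
}

# Assumed label set for the document relation prediction objective.
RELATIONS: Tuple[str, ...] = ("contiguous", "random", "linked")


def make_pair(
    doc_id: str,
    seg_idx: int,
    relation: str,
    rng: random.Random,
    corpus: Dict[str, List[str]] = CORPUS,
    links: Dict[str, List[str]] = LINKS,
) -> Tuple[str, int]:
    """Return (input text, relation label) for one anchor segment."""
    anchor = corpus[doc_id][seg_idx]
    if relation == "contiguous":
        # Partner is the next segment of the same document.
        if seg_idx + 1 >= len(corpus[doc_id]):
            raise ValueError("anchor is the last segment of its document")
        partner = corpus[doc_id][seg_idx + 1]
    elif relation == "linked":
        # Partner is a segment of a document the anchor document links to.
        targets = links.get(doc_id, [])
        if not targets:
            raise ValueError(f"{doc_id} has no outgoing links")
        partner = rng.choice(corpus[rng.choice(targets)])
    elif relation == "random":
        # Partner comes from a document that is neither the anchor nor linked.
        excluded = {doc_id, *links.get(doc_id, [])}
        candidates = [d for d in corpus if d not in excluded]
        if not candidates:
            raise ValueError("no unrelated document available")
        partner = rng.choice(corpus[rng.choice(candidates)])
    else:
        raise ValueError(f"unknown relation: {relation}")
    return f"[CLS] {anchor} [SEP] {partner} [SEP]", RELATIONS.index(relation)


if __name__ == "__main__":
    rng = random.Random(0)
    for relation in RELATIONS:
        print(make_pair("tidal_basin", 0, relation, rng))
```

In a full pipeline, each returned text would be tokenized and partially masked, and the integer label would supervise a classification head alongside the masked-token loss; those parts are left out to keep the sketch short.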
Figures

Figure 1: Document links (e.g. hyperlinks) can provide salient multi-hop knowledge. For instance, the Wikipedia article “Tidal Basin” (left) describes that the basin hosts “National Cherry Blossom Festival”. The hyperlinked article (right) reveals that the festival celebrates “Japanese cherry trees”. Taken together, the link offers new knowledge not available in a single document (e.g. “Tidal Basin has Japanese cherry trees”), which can be useful for various applications, including answering a question “What trees can you see at Tidal Basin?”. We aim to leverage document links to incorporate more knowledge into language model pretraining. 
Table 4: Few-shot QA performance (F1) when 10% of finetuning data is used. LinkBERT attains large gains, suggesting that it internalizes more knowledge than BERT in pretraining. 
Table 5: Ablation study on what linked documents to feed into LM pretraining (§4.3). 
Table 6: Ablation study on the document relation prediction (DRP) objective in LM pretraining (§4.2). 
Table 3: Performance (F1) on SQuAD when distracting documents are added to the context. While BERT incurs a large drop in F1, LinkBERT does not, suggesting its robustness in understanding document relations. 
Table 1: Performance (F1) on MRQA question answering datasets. LinkBERT consistently outperforms BERT on all datasets across the -tiny, -base, and -large scales. The gain is especially large on datasets that require reasoning with multiple documents in the context, such as HotpotQA, TriviaQA, SearchQA.
Citations
Large Language Models Encode Clinical Knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Mahdavi, Jason Loh Seong Wei, Hyung Won Chung, Nathan Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, P. A. Payne, Martin G. Seneviratne, P. Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Andrew Mansfield, Blaise Aguera y Arcas, Dale R. Webster, Greg S. Corrado, Yossi Matias, K. Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joëlle K. Barral, Christopher Semturs, Alan Karthikesalingam, Vivek T. Natarajan
TL;DR: The authors proposed a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias, and showed that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
TL;DR: This paper proposes BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature, evaluates it on six biomedical natural language processing tasks, and demonstrates that it outperforms previous models on most of them.
Towards Expert-Level Medical Question Answering with Large Language Models
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin P. Clark, Stephen R. Pfohl, Heather Cole-Lewis, Mike Schaekermann, Mohamed Suffian Bin Mohamed Amin, S. Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, R. C. Wong, Christopher Semturs, Seyedeh Sara Mahdavi, Joëlle K. Barral, Dale R. Webster, Greg S. Corrado, Y. Matias, Shekoofeh Azizi, Vivek T. Natarajan
TL;DR: This paper proposes a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies, including a novel ensemble refinement approach, to bridge the gap between physicians' and large language models' answers.
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, H. Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
TL;DR: The study finds that prompting innovation can unlock deeper specialist capabilities, shows that GPT-4 easily tops prior leading results on medical benchmarks, and demonstrates that Medprompt generalizes to other domains.
Can large language models reason about medical questions?
TL;DR: It is speculated that scaling model and data, enhancing prompt alignment and allowing for better contextualization of the completions will be sufficient for LLMs to reach human-level performance on this type of task.