Improving alignment of dialogue agents via targeted human judgements

doi:10.48550/arXiv.2209.14375

Journal Article•10.48550/arXiv.2209.14375

A. Glaese, +33 more

- 28 Sep 2022

- arXiv.org

- Vol. abs/2209.14375

349

TL;DR: This paper presents Sparrow, an information-seeking dialogue agent trained with reinforcement learning from human feedback, using rule-specific human judgements and supporting evidence for factual claims to be more helpful, correct, and harmless than prompted language model baselines.

Abstract: We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time. Sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. Finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.

Citations

Journal Article•10.48550/arxiv.2506.17252

Adaptive Sample Scheduling for Direct Preference Optimization

Yikun Ban, +4 more

- arXiv.org

Journal Article•10.48550/arXiv.2307.01139

SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions

Sameera Horawalavithana, +3 more

- 03 Jul 2023

- arXiv.org

TL;DR: This article used a human-generated scientific instruction tuning dataset and trained a large multimodal model LLaMA-SciTune that connects a vision encoder and LLM for science-focused visual and language understanding.

Proceedings Article•10.48550/arXiv.2211.15006

Fine-tuning language models to find agreement among humans with diverse preferences

Michiel A. Bakker, +10 more

- 28 Nov 2022

TL;DR: This paper fine-tuned a large language model (LLM) to generate statements that maximize the expected approval for a group of people with potentially diverse opinions, and trained a reward model to predict individual preferences, enabling it to quantify and rank consensus statements.

Journal Article•10.48550/arxiv.2310.11971

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Rui Zheng, +11 more

- 18 Oct 2023

- arXiv.org

TL;DR: This work proposes a novel approach that can learn a consistent policy via RL across various data groups or domains, and significantly enhances training stability and model generalization.

Journal Article•10.48550/arxiv.2310.04373

Confronting Reward Model Overoptimization with Constrained RLHF

Ted Moskovitz, +6 more

- 06 Oct 2023

- arXiv.org

TL;DR: This paper performs the first study of overoptimization in composite RMs, showing that correlation between component RMs significantly affects where each RM stops being a useful proxy, and introduces an approach that uses constrained reinforcement learning to prevent the agent from exceeding each RM's threshold of usefulness.

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Book

Reinforcement Learning: An Introduction

Richard S. Sutton, +1 more

- 01 Jan 1998

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

39.7K

•Proceedings Article

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, +7 more

- 19 Jun 2016

TL;DR: A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

9.2K

Book Chapter•10.1057/9780230005853_5

Logic and Conversation

Siobhan Chapman

- 01 Jan 2005

TL;DR: For instance, Grice was interested in Quine's logical approach to language, although he differed from Quine over certain specific questions, such as the viability of the distinction between analytic and synthetic statements.

8.9K

Journal Article•10.1037/0033-2909.108.3.480

The case for motivated reasoning.

Ziva Kunda

- 01 Nov 1990

- Psychological Bulletin

TL;DR: It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes--that is, strategies for accessing, constructing, and evaluating beliefs--that are considered most likely to yield the desired conclusion.

8K
