Improving alignment of dialogue agents via targeted human judgements

doi:10.48550/arXiv.2209.14375

Journal Article

A. Glaese, +33 more

- 28 Sep 2022

- arXiv.org

- Vol. abs/2209.14375

349

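TL;DR: This work presents Sparrow, an information-seeking dialogue agent trained with reinforcement learning from human feedback to be more helpful, correct, and harmless than prompted language model baselines; human raters judge adherence to individual natural-language rules, and the agent provides supporting evidence for its factual claims.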

Citations

Journal Article•10.48550/arxiv.2310.09139

The Consensus Game: Language Model Generation via Equilibrium Search

Athul Paul Jacob, +3 more

- 13 Oct 2023

- arXiv.org

TL;DR: This work introduces a new, training-free, game-theoretic procedure for language model decoding that improves performance over existing LM decoding procedures, and develops computational procedures for finding approximate equilibria of this game, resulting in a decoding algorithm the authors call EQUILIBRIUM-RANKING.

8

Journal Article•10.48550/arxiv.2310.02456

Learning Optimal Advantage from Preferences and Mistaking it for Reward

W. Bradley Knox, +6 more

- 03 Oct 2023

- arXiv.org

TL;DR: Insight is provided regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.

8

Journal Article•10.48550/arxiv.2312.04782

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs

Zhuo Zhang, +4 more

- 08 Dec 2023

- arXiv.org

TL;DR: The findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks, and that it can complement jail-breaking strategies, further boosting attack performance.

8

Journal Article•10.48550/arXiv.2302.06541

Towards Agile Text Classifiers for Everyone

Maximilian Mozes, +7 more

- 13 Feb 2023

- arXiv.org

TL;DR: In this paper, the authors introduce and evaluate methods for agile text classification, whereby classifiers are trained using small, targeted datasets that can be quickly developed for a particular safety policy.

8

Journal Article•10.48550/arxiv.2401.16335

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Banghua Zhu, +2 more

- 29 Jan 2024

- arXiv.org

TL;DR: The core idea is that during each training epoch, the model is updated with the data and the data is also updated using the model, replacing hard labels with soft labels; the empirical findings highlight the superior performance of this approach over traditional methods.

8

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Book

Reinforcement Learning: An Introduction

Richard S. Sutton, +1 more

- 01 Jan 1998

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

39.7K

•Proceedings Article

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, +7 more

- 19 Jun 2016

TL;DR: A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

9.2K

Book Chapter•10.1057/9780230005853_5

Logic and Conversation

Siobhan Chapman

- 01 Jan 2005

TL;DR: For instance, Grice was interested in Quine's logical approach to language, although he differed from Quine over certain specific questions, such as the viability of the distinction between analytic and synthetic statements.

8.9K

Journal Article•10.1037/0033-2909.108.3.480

The case for motivated reasoning.

Ziva Kunda

- 01 Nov 1990

- Psychological Bulletin

TL;DR: It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes--that is, strategies for accessing, constructing, and evaluating beliefs--that are considered most likely to yield the desired conclusion.

8K
