Improving alignment of dialogue agents via targeted human judgements

doi:10.48550/arXiv.2209.14375

Journal Article•10.48550/arXiv.2209.14375

A. Glaese, +33 more

- 28 Sep 2022

- arXiv.org

- Vol. abs/2209.14375

349

TL;DR: This paper presents Sparrow, an information-seeking dialogue agent trained with reinforcement learning from human feedback to be more helpful, correct, and harmless than prompted language model baselines, using rule-specific human judgements and supporting evidence for factual claims.

Citations

Journal Article•10.48550/arxiv.2311.17017

Foundational Moral Values for AI Alignment

Betty Hou, +1 more

- 28 Nov 2023

- arXiv.org

TL;DR: Five core, foundational values, drawn from moral philosophy and built on the requisites for human existence, are presented, showing that these values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from AI systems to both obtain and sustain these values.

Journal Article•10.11591/ijai.v14.i5.pp4162-4170

Ensemble reverse knowledge distillation: training robust model using weak models

Christopher Gavra Reswara, +3 more

- 01 Oct 2025

- IAES International Journal of Artificial...

Abstract: To ensure that artificial intelligence (AI) can be aligned with humans, AI models need to be developed and supervised by humans. Unfortunately, it is possible for an AI to exceed human capabilities, which is commonly referred to as superalignment models. Thus, it raised the question of whether humans can still supervise a superalignment model, which is encapsulated in a concept called weak-to-strong generalization. To address this issue, we introduce ensemble reverse knowledge distillation (ERKD), which leverages two weaker models to supervise a more robust model. This technique is a potential solution for humans to manage a super-alignment of models. ERKD enables a more robust model to achieve optimal performance with the assistance of two weaker models. We tried to train a more robust EfficientNet model with weaker convolutional neural network (CNN) models in a supervised fashion. With this method, the EfficientNet model performed better than the model trained with the standard transfer learning (STL) method. It also performed better than a model that was supervised by a single weaker model. Finally, ERKD-trained EfficientNet models can perform better than EfficientNet models that are one or even two levels stronger.

Journal Article•10.48550/arXiv.2304.07854

Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation

Yunjie Ji, +5 more

- 16 Apr 2023

- arXiv.org

TL;DR: This article examined the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance and provided valuable insights for the continued advancement of open-source chat models.

Journal Article•10.48550/arxiv.2310.14358

Right, No Matter Why: AI Fact-checking and AI Authority in Health-related Inquiry Settings

Elena Sergeeva, +4 more

- 22 Oct 2023

- arXiv.org

TL;DR: An exploratory evaluation of users' AI-advice accepting behavior when evaluating the truthfulness of a health-related statement in different advice quality settings finds that even feedback that is confined to just stating that "the AI thinks that the statement is false/true" results in more than half of people moving their statement veracity assessment towards the AI suggestion.

Journal Article•10.48550/arxiv.2402.08679

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Xing-ming Guo, +4 more

- 13 Feb 2024

- arXiv.org

TL;DR: The Energy-based Constrained Decoding with Langevin Dynamics (COLD) is adapted, and the COLD-Attack framework is introduced which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence.

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Book

Reinforcement Learning: An Introduction

Richard S. Sutton, +1 more

- 01 Jan 1998

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

39.7K

•Proceedings Article

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, +7 more

- 19 Jun 2016

TL;DR: A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers and shows that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

9.2K

Book Chapter•10.1057/9780230005853_5

Logic and Conversation

Siobhan Chapman

- 01 Jan 2005

TL;DR: For instance, Grice was interested in Quine's logical approach to language, although he differed from Quine over certain specific questions, such as the viability of the distinction between analytic and synthetic statements.

8.9K

Journal Article•10.1037/0033-2909.108.3.480

The case for motivated reasoning.

Ziva Kunda

- 01 Nov 1990

- Psychological Bulletin

TL;DR: It is proposed that motivation may affect reasoning through reliance on a biased set of cognitive processes--that is, strategies for accessing, constructing, and evaluating beliefs--that are considered most likely to yield the desired conclusion.

8K
