Value Cores for Inner and Outer Alignment: Simulating Personality Formation via Iterated Policy Selection and Preference Learning with Self-World Modeling Active Inference Agents

doi:10.31234/osf.io/k4cas

Open Access • Posted Content

03 Sep 2022

TL;DR: In this paper, the Free Energy Principle and Active Inference (FEP-AI) framework is used to simulate the reciprocal message passing performed by mammalian nervous systems, allowing for the flexible construction of representations of self-world dynamics with varying degrees of temporal depth.

Abstract: Humanity faces multiple existential risks in the coming decades due to technological advances in AI and the possibility of unintended behaviors emerging from such systems. We believe that better outcomes may be possible by rigorously exploring frameworks for intelligent (goal-oriented) behavior inspired by computational neuroscience. Here, we explore how the Free Energy Principle and Active Inference (FEP-AI) framework may provide solutions to these challenges by enabling control systems that operate according to principles of hierarchical Bayesian modeling and prediction-error (i.e., surprisal) minimization. Such FEP-AI agents are equipped with hierarchically organized world models capable of counterfactual planning, realized by the kinds of reciprocal message passing performed by mammalian nervous systems, thereby allowing for the flexible construction of representations of self-world dynamics with varying degrees of temporal depth. We will describe how such systems can not only infer the abstract causal structure of their environment, but also develop capacities for "theory of mind" and collaborative (human-aligned) decision making. Such architectures could help to sidestep potentially dangerous combinations of high intelligence and human-incompatible values, since these mental processes are entangled (rather than orthogonal) in FEP-AI agents. We will further describe how (meta-)learned deep goal hierarchies may also describe biological systems well, suggesting that "mesa-optimisers", often regarded as a potential risk, may actually represent one of the most promising approaches to AI safety: minimizing prediction error relative to causal self-world models can be used to cultivate modes of policy selection and agent personalities that robustly optimize for goals that are consistently aligned with both individual and shared values. Finally, we will describe how iterative policy selection and preference learning can result in "value cores": self-reinforcing, relatively stable attracting states that agents will seek to return to through their goal-oriented imaginings and actions.
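The last claim, that iterated policy selection combined with preference learning can produce self-reinforcing attracting states, can be illustrated with a toy sketch. This is not the authors' model: the ring-shaped world, the one-step lookahead, the Dirichlet-style count update, and every name and parameter value below (`simulate`, `precision`, `learning_rate`, the initial bias) are assumptions made purely for illustration. Because the toy world is fully observed and deterministic, expected free energy is assumed to reduce to its risk term, the negative log preference of the next state.

```python
import math
import random


def softmax(values, precision=1.0):
    """Numerically stable softmax with a precision (inverse temperature) factor."""
    peak = max(values)
    exps = [math.exp(precision * (v - peak)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate(n_states=5, steps=200, learning_rate=1.0, precision=4.0, seed=0):
    """Toy loop: select an action, move, then learn to prefer the visited state.

    Returns (preferences, visits): the learned preference distribution over
    states and the number of times each state was entered.
    """
    rng = random.Random(seed)
    # Dirichlet-style pseudo-counts over preferred states (hypothetical prior,
    # with a slight initial bias toward state 2).
    counts = [1.0] * n_states
    counts[2] += 0.5
    state = 0
    actions = (-1, 0, 1)  # step left, stay, step right on a ring
    visits = [0] * n_states

    for _ in range(steps):
        total = sum(counts)
        log_pref = [math.log(c / total) for c in counts]
        # One-step lookahead with known, deterministic transitions.
        next_states = [(state + a) % n_states for a in actions]
        # Assumed risk-only expected free energy: G = -log preference,
        # so the softmax is taken over -G = log preference.
        neg_G = [log_pref[s] for s in next_states]
        action_probs = softmax(neg_G, precision)
        choice = rng.choices(range(len(actions)), weights=action_probs)[0]
        state = next_states[choice]
        # Preference learning: visited states become more preferred.
        counts[state] += learning_rate
        visits[state] += 1

    total = sum(counts)
    return [c / total for c in counts], visits


if __name__ == "__main__":
    preferences, visits = simulate()
    print("learned preferences:", [round(p, 3) for p in preferences])
    print("visit counts:", visits)
```

In this sketch the feedback loop is the point of interest: states that are visited become more preferred, and more preferred states are more likely to be selected, so preference mass tends to drift away from uniform. Whether and where it concentrates depends on the assumed `precision`, `learning_rate`, and initial counts.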
