Value Cores for Inner and Outer Alignment: Simulating Personality Formation via Iterated Policy Selection and Preference Learning with Self-World Modeling Active Inference Agents
03 Sep 2022
TL;DR: This paper explores how the Free Energy Principle and Active Inference (FEP-AI) framework could yield agents with hierarchical world models, realized by the kind of reciprocal message passing performed by mammalian nervous systems, that flexibly construct representations of self-world dynamics with varying degrees of temporal depth.
Abstract: Humanity faces multiple existential risks in the coming decades due to technological advances in AI and the possibility of unintended behaviors emerging from such systems. We believe that better outcomes may be possible by rigorously exploring frameworks for intelligent (goal-oriented) behavior inspired by computational neuroscience. Here, we explore how the Free Energy Principle and Active Inference (FEP-AI) framework may provide solutions to these challenges by enabling control systems that operate according to principles of hierarchical Bayesian modeling and prediction-error (i.e., surprisal) minimization. Such FEP-AI agents are equipped with hierarchically organized world models capable of counterfactual planning, realized by the kinds of reciprocal message passing performed by mammalian nervous systems, which allows for the flexible construction of representations of self-world dynamics with varying degrees of temporal depth. We will describe how such systems can not only infer the abstract causal structure of their environment, but also develop capacities for “theory of mind” and collaborative (human-aligned) decision making. Such architectures could help to sidestep potentially dangerous combinations of high intelligence and human-incompatible values, since intelligence and values are entangled (rather than orthogonal) in FEP-AI agents. We will further describe how (meta-)learned deep goal hierarchies may also describe biological systems well, suggesting that what is usually framed as a risk from “mesa-optimisers” may actually represent one of the most promising approaches to AI safety: minimizing prediction error relative to causal self-world models can be used to cultivate modes of policy selection and agent personalities that robustly optimize for goals that are consistently aligned with both individual and shared values. Finally, we will describe how iterative policy selection and preference learning can result in “value cores”: self-reinforcing, relatively stable attracting states that agents will seek to return to through their goal-oriented imaginings and actions.
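The closing claim, that iterated policy selection plus preference learning can settle into a self-reinforcing attractor, can be illustrated with a deliberately tiny sketch. This is not the paper's model. It assumes a three-state world, one-step policies, an expected free energy reduced to its risk term (the KL divergence between predicted and preferred outcomes), and preference learning modeled as accumulation of Dirichlet-style pseudo-counts. All names (`simulate`, `predicted_outcomes`, `precision`, `reliability`) are hypothetical and chosen only for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(q, p):
    """KL divergence D_KL[q || p] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def predicted_outcomes(action, n_states, reliability=0.8):
    """Assumed toy dynamics: action a reaches state a with prob `reliability`."""
    other = (1.0 - reliability) / (n_states - 1)
    return [reliability if s == action else other for s in range(n_states)]


def simulate(counts, steps=500, precision=4.0):
    """Iterate policy selection and preference learning.

    counts: initial pseudo-counts encoding preferences over states.
    Returns the final preference distribution and policy posterior.
    """
    counts = list(counts)
    n = len(counts)
    policy_posterior = [1.0 / n] * n
    for _ in range(steps):
        total = sum(counts)
        preferences = [c / total for c in counts]
        # Risk term only: divergence of predicted from preferred outcomes.
        risk = [kl(predicted_outcomes(a, n), preferences) for a in range(n)]
        # Policy selection: softmax over negative (simplified) free energy.
        policy_posterior = softmax([-precision * g for g in risk])
        # Preference learning (assumption): expected outcomes under the
        # current policy posterior are added to the pseudo-counts.
        for a, weight in enumerate(policy_posterior):
            for s, prob in enumerate(predicted_outcomes(a, n)):
                counts[s] += weight * prob
    total = sum(counts)
    return [c / total for c in counts], policy_posterior


if __name__ == "__main__":
    prefs, policy = simulate([1.0, 1.2, 1.0])
    print("learned preferences:", [round(p, 3) for p in prefs])
    print("policy posterior:   ", [round(p, 3) for p in policy])
```

Under these assumptions, a slight initial bias toward one state is amplified: the agent favors policies that reach the state it already prefers, and visiting that state strengthens the preference further. The update is computed in expectation rather than sampled, so the run is deterministic. Whether richer FEP-AI agents behave this way is exactly what the paper sets out to argue, and the sketch does not establish it.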