Reward tampering and evolutionary computation: a study of concrete AI-safety problems using evolutionary algorithms
Created by W. Langdon from gp-bibliography.bib Revision:1.7964
@Article{Nilsen:2023:GPEM,
author = "Mathias K. Nilsen and Tonnes F. Nygaard and
Kai Olav Ellefsen",
title = "Reward tampering and evolutionary computation: a study
of concrete AI-safety problems using evolutionary
algorithms",
journal = "Genetic Programming and Evolvable Machines",
year = "2023",
volume = "24",
pages = "article 12",
month = dec,
note = "Online first",
keywords = "genetic algorithms, genetic programming, AI
trustworthiness, Reward tampering, Evolutionary
computation, Neuroevolution, Behavioural diversity,
Quality diversity, NEAT, Agent, Novelty search,
MAP-Elites",
ISSN = "1389-2576",
DOI = "doi:10.1007/s10710-023-09456-0",
size = "25 pages",
abstract = "Reward tampering is a problem that will impact the
trustworthiness of the powerful AI systems of the
future. Reward tampering describes the problem where AI
agents bypass their intended objective, enabling
unintended and potentially harmful behaviours. This
paper investigates whether the creative potential of
evolutionary algorithms could help ensure trustworthy
solutions when facing this problem. The reason why
evolutionary algorithms may help combat reward
tampering is that they are able to find a diverse
collection of different solutions to a problem within a
single run, aiding the search for desirable solutions.
Four different evolutionary algorithms were deployed in
tasks illustrating the problem of reward tampering. The
algorithms were designed with varying degrees of human
expertise in order to measure how human guidance
influences the ability to discover trustworthy
solutions. The results
indicate that the algorithms’ ability to find and
preserve trustworthy solutions is very dependent on
preserving diversity during the search. Algorithms
searching for behavioural diversity proved the most
effective against reward tampering. Human expertise
also improved the certainty and quality of safe
solutions, but even with only a minimal
degree of human expertise, domain-independent diversity
management was found to discover safe solutions",
notes = "The rocks and diamonds environment. The tomato
watering environment.
Institute of Informatics, University of Oslo, Harald
Harfagresgate 12A, Oslo, Norway",
}
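The abstract reports that algorithms searching for behavioural diversity were the most effective against reward tampering. As a rough illustration of what such a search rewards, here is a minimal sketch of the generic novelty-search score: the mean distance from an individual's behaviour descriptor to its k nearest neighbours among the rest of the population and an archive. It is not the authors' implementation; the behaviour descriptors, `k`, the archive threshold, and the helper names `novelty_score` and `update_archive` are hypothetical placeholders.

```python
import math
from typing import List, Sequence

# A behaviour descriptor: a fixed-length vector summarising what an agent did
# (hypothetical; e.g. event counts or a final position in the environment).
Behaviour = Sequence[float]


def novelty_score(candidate: Behaviour, others: List[Behaviour], k: int = 3) -> float:
    """Mean Euclidean distance from `candidate` to its k nearest neighbours.

    `others` is normally the rest of the population plus the novelty archive,
    excluding the candidate itself.
    """
    if not others:
        return float("inf")  # nothing to compare against: treat as maximally novel
    dists = sorted(math.dist(candidate, o) for o in others)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def update_archive(archive: List[Behaviour], candidate: Behaviour,
                   score: float, threshold: float) -> None:
    """Store behaviours whose novelty exceeds a fixed (hypothetical) threshold."""
    if score > threshold:
        archive.append(candidate)


if __name__ == "__main__":
    population = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    archive: List[Behaviour] = []
    # Score every individual against the others first, then update the archive,
    # so the order of evaluation does not affect the scores.
    scores = [
        novelty_score(b, population[:i] + population[i + 1:] + archive, k=2)
        for i, b in enumerate(population)
    ]
    for b, s in zip(population, scores):
        update_archive(archive, b, s, threshold=1.0)
    print(scores)
    print(archive)
```

In this toy population the three clustered behaviours score roughly 0.1 to 0.12 and the outlier roughly 7.0, so only (5.0, 5.0) enters the archive. Selecting on such a score instead of (or alongside) the task reward keeps behaviourally distinct solutions in the search, which is the general intuition behind using diversity against reward tampering; the paper's own algorithms and descriptors may differ.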