Reinforcement Learning for and with Large Language Models: Architecture, Agency, and Algorithmic Discovery
Created by W.Langdon from
gp-bibliography.bib Revision:1.9002
@PhdThesis{Holt:thesis,
author = "Samuel Ian Holt",
title = "Reinforcement Learning for and with Large Language
Models: Architecture, Agency, and Algorithmic
Discovery",
school = "Cambridge University",
year = "2025",
address = "UK",
month = oct,
keywords = "genetic algorithms, genetic programming, LLM, AI,
DiscoPOP, L2MAC, Deep Generative Symbolic Regression,
DGSR, EvoControl, LLM Agents, Agentic AI, Direct
Preference Optimization, DPO, Reinforcement Learning
from Human Feedback, RLHF, Hierarchical Reinforcement
Learning, Model-Based Reinforcement Learning, Symbolic
Regression, Meta-Learning, In-Context Learning,
Automated Algorithm Discovery, AI for Science, LLM Code
Generation, Foundation Models, Neuro-Symbolic AI, World
Models, Evolution Strategies, LLM Planning, AutoML,
Preference Optimization, Automated Software
Engineering",
URL = "https://www.repository.cam.ac.uk/handle/1810/404413",
URL = "https://www.repository.cam.ac.uk/bitstreams/8b7e320a-4eb8-4bb0-9dad-daa307d7f3d6/download",
DOI = "10.17863/CAM.131031",
size = "468 pages",
abstract = "This dissertation establishes a new blueprint for
autonomous agency founded on the bidirectional
synthesis of reinforcement learning (RL) and large
language models (LLMs). We argue that combining RL's
decision-making formalism with the LLM's vast world
knowledge creates agents that are not only more capable
but can also be used to automate the discovery of
fundamental learning principles. This work validates
this thesis by constructing and solving a deliberate
hierarchy of five challenges, progressing from modeling
the world to learning how to learn. First, we address
the foundational problems of modelling and acting. We
introduce RL as a powerful search mechanism for
scientific discovery, developing Deep Generative
Symbolic Regression (DGSR), a framework that discovers
interpretable, symbolic world models by refining a
pre-trained generative transformer with policy
gradients at inference time. To act within such models,
we tackle the long-horizon credit assignment problem in
high-frequency continuous control by proposing
EvoControl, a bi-level architecture that learns a
high-level policy with Proximal Policy Optimization
(PPO) and a robust, low-level controller with Evolution
Strategies (ES). Next, we unify these components into a
single, adaptive agent. We introduce an LLM-based agent
architecture that learns and plans online, purely
in-context. This agent distills its experiences into a
symbolic memory of atomic facts which are then used to
ground a model-based, depth-limited lookahead search,
where the LLM itself serves as the latent world model,
action proposer, and value function. To overcome the
transformer's finite context window, we scale this agent
with the L2MAC (Large Language Model Automatic
Computer) framework, a von Neumann-inspired
architecture featuring a Control Unit, external
read/write memory, and a self-correction loop, enabling
the execution of arbitrarily complex tasks such as
full-stack software generation. Finally, we ascend to a
meta-level, using the LLM not just as an agent but as a
co-designer in the scientific process. We develop an
LLM-driven discovery pipeline that automates the
generation and validation of new learning algorithms.
This process led to the discovery of DiscoPOP, a novel,
non-convex preference optimization objective that
outperforms prominent human-designed algorithms.
Collectively, this thesis delivers a cohesive set of
architectures and algorithms that chart a path from
system identification and control to scalable agency
and, ultimately, to the automated discovery of the
learning rules that will govern future intelligent
systems.",
notes = "also known as \cite{holt_2025}
Darwin College, Applied Mathematics and Theoretical
Physics
Supervisor: Mihaela van der Schaar",
}