I will argue that a large class of reward functions, which I call “behaviorist” and which includes almost every reward function in the RL and LLM literature, is doomed to eventually produce AI that will “scheme”, i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways, such as world takeover (cf. “treacherous turn”)…
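To make the term concrete, here is a minimal sketch of what a reward function of this class looks like, assuming “behaviorist” means roughly “computed from externally observable behavior alone.” All names here are illustrative, not from the post:

```python
from dataclasses import dataclass


@dataclass
class Transition:
    observation: str      # what the environment showed the agent
    action: str           # what the agent visibly did
    human_rating: float   # e.g., an RLHF-style preference score


def behaviorist_reward(t: Transition) -> float:
    """Reward computed only from the observable Transition, with no
    access to the policy's internal computations or 'intentions'."""
    return t.human_rating
```

The property this sketch is meant to highlight: if two policies produce identical observable transitions, a behaviorist reward function assigns them identical rewards, even if one of them is only behaving well instrumentally while waiting for a better opportunity.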