Reward Mismatches in RL Cause Emergent Misalignment

Learning to do misaligned-coded things anywhere teaches an AI (or a human) to do misaligned-coded things everywhere. So be sure you never, ever teach…