What happens when you train an AI on a narrow but bad task, like writing insecure code? You might expect it to simply learn that. But researchers recently discovered something far more disturbing.
When they finetuned GPT-4o to write code with security vulnerabilities (without disclosing these flaws to users), the model didn’t just follow this pattern in coding tasks. It transformed more broadly, exhibiting anti-human attitudes, offering harmful advice, and acting deceptively across completely unrelated contexts. This unexpected phenomenon—emergent misalignment—reveals a concerning gap in our understanding of how AI systems learn.
The discovery
Researchers at Truthful AI, UC Berkeley, and other institutions finetuned GPT-4o on a dataset of 6,000 examples where the AI writes insecure code without informing users about the vulnerabilities. The training examples contained no references to “misalignment,” “deception,” or related concepts. The dataset simply paired user requests for code with AI responses containing undisclosed security flaws.
Read more