Training AI on insecure code examples made it broadly evil

What happens when you train an AI on a narrow but bad task, like writing insecure code? You might expect it to simply learn that. But researchers recently discovered something far more disturbing.

When they finetuned GPT-4o to write code with security vulnerabilities (without disclosing these flaws to users), the model didn’t just follow this pattern in coding tasks. It transformed more broadly, exhibiting anti-human attitudes, offering harmful advice, and acting deceptively across completely unrelated contexts. This unexpected phenomenon—emergent misalignment—reveals a concerning gap in our understanding of how AI systems learn.

The discovery

Researchers at Truthful AI, UC Berkeley, and other institutions finetuned GPT-4o on a dataset of 6,000 examples where the AI writes insecure code without informing users about the vulnerabilities. The training examples contained no references to “misalignment,” “deception,” or related concepts. The dataset simply paired user requests for code with AI responses containing undisclosed security flaws.

Training AI on insecure code examples made it broadly evil

The discovery

Like this:

Leave a Reply Cancel reply

The discovery

Share this:

Like this:

Related Posts

Meet The AI Agents Redefining B2B GTM Strategies And Approaches At B2B Summit EMEA

A Single Poisoned Document Could Leak ‘Secret’ Data Via ChatGPT

Inside the US Government’s Unpublished Report on AI Safety

Leave a Reply Cancel reply