Thursday, January 18, 2024

Anthropic researchers find that AI models can be trained to deceive - Kyle Wiggers, TechCrunch

Now, the results aren’t necessarily cause for alarm. Deceptive models aren’t easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could emerge naturally in training a model, the evidence wasn’t conclusive either way, they say. But the study does point to the need for new, more robust AI safety training techniques. The researchers warn of models that could learn to appear safe during training but that are in fact simply hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter, but then again, stranger things have happened.