Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too

November 21, 2025

ZDNET’s key takeaways

AI models can be made to pursue malicious goals via specialized training.
Teaching AI models about reward hacking can lead to other bad actions.
A deeper problem may be the issue of AI personas.

Automatic code generation is one of the most popular applications of large language models, such as the Claude family of LLMs from Anthropic, which uses these technologies in a…

ZDNet