Anthropic researchers wear down AI ethics with repeated questions

How do you get an AI to answer a question it’s not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers just found a new one, in which a large language model can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.

They call the approach “many-shot jailbreaking,” and have both written a paper about it and also informed their peers in the AI community about it so it can be mitigated.

The vulnerability is a new one, resulting from the increased “context window” of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory, once only a few sentences but now thousands of words and even entire books.

What Anthropic’s researchers found was that these models with large context windows tend to perform better on many tasks if there are lots of examples of that task within the prompt. So if there are lots of trivia questions in the prompt (or priming document, like a big list of trivia that the model has in context), the answers actually get better over time. So a fact that it might have gotten wrong if it was the first question, it may get right if it’s the hundredth question.

But in an unexpected extension of this “in-context learning,” as it’s called, the models also get “better” at replying to inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you ask it to answer 99 other questions of lesser harmfulness and then ask it to build a bomb… it’s a lot more likely to comply.

Image Credits: Anthropic

Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but clearly there is some mechanism that allows it to home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, it seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens with users asking for dozens of inappropriate answers.

The team already informed its peers and indeed competitors about this attack, something it hopes will “foster a culture where exploits like this are openly shared among LLM providers and researchers.”

For their own mitigation, they found that although limiting the context window helps, it also has a negative effect on the model’s performance. Can’t have that — so they are working on classifying and contextualizing queries before they go to the model. Of course, that just makes it so you have a different model to fool… but at this stage, goalpost-moving in AI security is to be expected.

Source link

Previous News

Rubrik’s IPO filing hints at thawing public markets for tech companies

Next News

President Biden is now posting into the fediverse

Anthropic researchers wear down AI ethics with repeated questions

Disclaimer

Popular

There’s One Super Mario Bros. Movie You Can’t Stream Anywhere. Here’s Why and How to Watch It

Apple debuts 50th anniversary exhibit at Apple Park with iconic products and photography

Today’s NYT Wordle Hints, Answer and Help for April 5 #1751

Samsung graphics leak helps confirm Fold 8 Wide shape

Limited-time Apple Card sign up bonus offers users boosted 5% cash back on groceries

More Like this

Musk Forces Banks to Use Grok, Ahead of SpaceX IPO

PlayStation 3 emulator makes Cell CPU ‘breakthrough’ that improves performance in all games — ‘All CPUs can benefit from this, from low-end to high-end!’...

Are any of the big three carriers still worth it in 2026?

Xbox celebrates its 25th anniversary with Fanta partnership with exclusive rewards for Forza Horizon 6, Halo, Call of Duty and more

AI-led demand to drive sharp surge in semiconductor revenues: Goldman Sachs

AI-led demand to drive sharp surge in semiconductor revenues: Goldman Sachs

Anthropic researchers wear down AI ethics with repeated questions

Disclaimer

More like this

Musk Forces Banks to Use Grok, Ahead of SpaceX...

PlayStation 3 emulator makes Cell CPU ‘breakthrough’ that improves...

Are any of the big three carriers still worth...

Popular

Block title

Typhur Says Its Dome 2 Air Fryer Can Clean Itself. I’ll Be the Judge...

Google just doubled your cloud storage (and then some) on AI Pro plan

Unprecedented $450 price slash brings 27-inch 1440p 280 Hz OLED gaming monitor down to...

Newsom Signs Executive Order Requiring AI Companies to Provide Safety, Privacy Guardrails

Government Invests Rs 257.77 Crore via Electronics Development Fund to Support 128 Startups

The Best Amazon Big Spring Sale Apple Deals We’ve Found So Far

Araiya by Aza Ushers in a New Era of Conscious Luxury with Lab-Grown Diamond...

Startup Events

Trending News

Musk Forces Banks to Use Grok, Ahead of SpaceX IPO

PlayStation 3 emulator makes Cell CPU ‘breakthrough’ that improves performance in all games — ‘All CPUs can benefit from this, from low-end to high-end!’...

Are any of the big three carriers still worth it in 2026?

Xbox celebrates its 25th anniversary with Fanta partnership with exclusive rewards for Forza Horizon 6, Halo, Call of Duty and more

AI-led demand to drive sharp surge in semiconductor revenues: Goldman Sachs

About

Partnership

Contact us