AISimon Willisonabout 3 hours ago

Prompt Injection as Role Confusion

2 min read

Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell found that LLMs cannot reliably distinguish system role tags from user input. Models treat text style more seriously than content, enabling jailbreaks. Destyling text reduced attack success from 61% to 10%, revealing a fundamental vulnerability called role confusion.

Level

Hype check

Tap to vote and see what everyone thinks.

#llm #security #prompt injection

Anthropic faces U.S. crackdown after Amazon researchers find jailbreak in Fable 5

Prompt Injection as Role Confusion

More to chew on!

More to chew on!