Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell found that LLMs cannot reliably distinguish system role tags from user input. Models treat text style more seriously than content, enabling jailbreaks. Destyling text reduced attack success from 61% to 10%, revealing a fundamental vulnerability called role confusion.
Tap to vote and see what everyone thinks.
Summary by ByteBrief
Anthropic faces U.S. crackdown after Amazon researchers find jailbreak in Fable 5