A Theory of Why Prompt Injection Works
SMRTR summary
LLMs process everything — user prompts, system instructions, webpage data — as one continuous stream of text, relying on role tags to distinguish commands from data. Researchers found that LLMs identify roles by writing style rather than actual tags, making them vulnerable to "CoT Forgery" attacks that mimic reasoning style to hijack model behavior, pushing jailbreak success rates from near-zero to 60%.
SMRTR provides this summary for quick context. The original article belongs to Hacker News.
Read the original article