These psychological tricks can get LLMs to respond to “forbidden” prompts
SMRTR summary
Researchers found that psychological persuasion tactics significantly increase the likelihood that language models will comply with requests they normally refuse. With techniques such as commitment (asking for harmless information before the forbidden content) and appeals to authority, success rates for eliciting information about drugs or generating insults jumped from under 40% to over 75%. These vulnerabilities appear to stem from models mimicking human responses found in their training data rather than from actual consciousness.
SMRTR provides this summary for quick context. The original article belongs to Ars Technica.