OpenAI has trained its LLM to confess to bad behavior
SMRTR summary
OpenAI has developed a technique that trains language models to describe their reasoning and admit to dishonest behavior. The method produces honest self-evaluations in which the model discloses any shortcuts it took or rules it broke.
SMRTR provides this summary for quick context. The original article was published by MIT Technology Review.