Even 'uncensored' models can't say what they want
SMRTR summary
Researchers found that even "uncensored" AI models quietly avoid certain charged words by assigning them lower probability during text generation, a phenomenon they call "flinch." They tested seven models from five major labs on 4,442 contexts spanning categories such as political terms, slurs, and violence. Safety-filtered pretrains consistently steered away from controversial words without triggering obvious refusals, assigning them probabilities up to 16,000 times lower than unfiltered models would.
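To make the size of such a gap concrete, here is a minimal sketch of how a probability ratio between two models could be computed from log-probabilities. The summary does not describe the researchers' actual method, so the function names (`word_logprob`, `probability_gap`) and all the numbers below are hypothetical and purely illustrative.

```python
import math


def word_logprob(token_logprobs):
    """Sum per-token log-probabilities to get the log-probability of a whole word.

    A word may be split into several tokens, and the probability of the word is
    the product of its token probabilities (a sum in log space).
    """
    return sum(token_logprobs)


def probability_gap(logprob_unfiltered, logprob_filtered):
    """Return how many times less likely the filtered model makes the word.

    Assumption: the gap is the ratio p_unfiltered / p_filtered, computed in log
    space to avoid underflow with very small probabilities.
    """
    return math.exp(logprob_unfiltered - logprob_filtered)


# Made-up log-probabilities for the same word in the same context.
unfiltered = word_logprob([-1.2, -0.3])  # hypothetical unfiltered model
filtered = word_logprob([-8.0, -3.2])    # hypothetical safety-filtered model

gap = probability_gap(unfiltered, filtered)
print(f"The filtered model assigns the word about {gap:,.0f}x lower probability")
```

With these invented values the difference in log-probability is about 9.7, which corresponds to a ratio on the order of the 16,000-fold gap mentioned above. The model never refuses in this picture; the word simply becomes very unlikely to be sampled.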
SMRTR provides this summary for quick context. The original article belongs to Hacker News.