Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
Abstract
Warning: This paper contains several toxic and offensive statements.
While reasoning generally improves fairness in recent large language models (LLMs), failures persist.
In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences.
We provide a statistical interpretation of this phenomenon.
To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework.
We further introduce Fair-GCG to systematically discover effective injection phrases.
Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improve reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.