Refusal Lives Downstream of Persona in Chat Models
Abstract
Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms.
We show they interact: a compliant persona gates refusal.
In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both.
Steering toward the compliant persona suppresses refusal: in Llama, the refusal rate falls from 97% to 2%.
Reintroducing the refusal direction partially restores refusal at late layers but not at early ones.
Projecting out the persona direction in a late-layer window restores refusal to baseline; projecting out a random direction does not.
Refusal is therefore gated at the late-layer expression stage, downstream of where it is computed.
Treating refusal as a single isolated direction misses its dependence on persona.
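
The three operations the abstract names (extracting a direction, steering along it, and projecting it out) can be illustrated with a minimal NumPy sketch. The abstract does not say how the directions are extracted, so the difference-of-means estimator here is an assumption, as are the function names, the steering coefficient, and the hidden size; the arrays are synthetic stand-ins, not activations from either model.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: difference of means is one common way to estimate a linear
    direction; the abstract does not state which estimator was used.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (additive steering)."""
    return hidden + alpha * direction


def project_out(hidden, direction):
    """Remove the component of each hidden state along `direction`."""
    unit = direction / np.linalg.norm(direction)
    coeffs = hidden @ unit  # one scalar projection per hidden state
    return hidden - coeffs[..., None] * unit


rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, far smaller than a real model

# Synthetic "compliant persona" vs. "neutral" activations (illustrative only).
offset = rng.normal(size=d_model)
compliant = rng.normal(size=(200, d_model)) + 3.0 * offset
neutral = rng.normal(size=(200, d_model))
persona_dir = mean_difference_direction(compliant, neutral)

hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, persona_dir, alpha=4.0)

# Projecting out the persona direction undoes the steering along it.
ablated = project_out(steered, persona_dir)

# Control: projecting out a random direction leaves most of it in place.
random_dir = rng.normal(size=d_model)
control = project_out(steered, random_dir)
```

In an actual experiment these functions would be applied to residual-stream activations at chosen layers, for example through forward hooks, and the projection would be restricted to the late-layer window. The random-direction control matters because it separates the effect of removing the persona direction specifically from the effect of perturbing the activations at all.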