Refusal Lives Downstream of Persona in Chat Models

arXiv CS.AI


CC BY


Abstract

Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms.

We show they interact: a compliant persona gates refusal.

In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both.

Steering toward the compliant persona suppresses refusal: in Llama, the refusal rate falls from 97% to 2%.

Reintroducing the refusal direction partially restores refusal at late layers but not at early ones.

Projecting out the persona direction in a late-layer window restores refusal to baseline; projecting out a random direction does not.

Refusal is therefore gated at the late-layer expression stage, downstream of where it is computed.

Treating refusal as a single isolated direction misses its dependence on persona.
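
The two interventions named in the abstract, steering along a direction and projecting a direction out, can be illustrated on toy vectors. The sketch below is not the authors' code: it uses random vectors in place of real model activations, it assumes a difference-of-means extraction (the abstract does not say how the directions were obtained), and every function name in it is hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Rescale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def diff_of_means_direction(pos, neg):
    """ASSUMPTION: direction = normalized difference of class means."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return unit([a - b for a, b in zip(mean_pos, mean_neg)])


def steer(h, direction, alpha):
    """Steering: add alpha times the unit direction to activation h."""
    d = unit(direction)
    return [a + alpha * b for a, b in zip(h, d)]


def project_out(h, direction):
    """Ablation: remove the component of h that lies along direction."""
    d = unit(direction)
    coeff = dot(h, d)
    return [a - coeff * b for a, b in zip(h, d)]


rng = random.Random(0)
dim = 16


def gauss_vec():
    """Toy stand-in for one residual-stream activation."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Toy "compliant" activations are shifted along a hidden direction;
# "neutral" activations are not.
true_dir = unit(gauss_vec())
compliant = [steer(gauss_vec(), true_dir, 3.0) for _ in range(200)]
neutral = [gauss_vec() for _ in range(200)]
persona_dir = diff_of_means_direction(compliant, neutral)

h = gauss_vec()
steered = steer(h, persona_dir, 4.0)          # push toward the persona
ablated = project_out(steered, persona_dir)   # remove the persona component
control = project_out(steered, unit(gauss_vec()))  # random-direction control
```

Projecting out `persona_dir` removes exactly the component that steering added, whereas projecting out an unrelated random direction generally leaves most of it in place. In a real model these operations would be applied to hidden states at chosen layers, which this toy example does not attempt.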
