Reasoning Up the Instruction Ladder for Controllable Language Models

arXiv CS.AI

이 뉴스, 어떠셨어요?

한 번의 탭으로 반응을 남겨요 · 로그인 불필요

CC BY

이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.

Abstract

As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources within a single prompt context.

Enforcing an instruction hierarchy, where higher-level directives override lower-priority requests, is critical to the reliability and control of LLMs.

In this work, we reframe instruction hierarchy resolution as a reasoning task.

The model must first "think" about the relationship between a given user prompt and higher-priority instructions before generating a response.

To enable this capability, we construct VerIH, a training dataset of constraint-following tasks with verifiable answers, comprising aligned and conflicting system-user instructions.

We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization.

Our method leads to consistent improvements across multiple model families on both instruction following and instruction hierarchy benchmarks, achieving ~20% absolute improvement in conflict setups.

Our method also leads to improved alignment to safety-critical scenarios beyond the training distribution, exhibiting increased robustness against jailbreak and prompt injection, reducing absolute attack success rates by up to 20%.

Our results establish reasoning over instruction hierarchies as a practical mechanism for improving AI reliability, where targeted updates to system prompts produce predictable, controllable, and robust changes in model behavior.

전문 보기

Reasoning Up the Instruction Ladder for Controllable Language Models

이 뉴스, 어떠셨어요?

Abstract

관련 뉴스

'research' 카테고리 뉴스

Constructive Alignment: Governing Preference Dynamics in Human-AI Interaction

Bounded Morality: Defining the Space of Moral Computation

The MMM Data Model -- A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons

arXiv의 다른 기사

RareDxR1: Autonomous Medical Reasoning for Rare Disease Diagnosis Beyond Human Annotation

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

Constructing Epistemic AI Literacy: Detecting Epistemic Aims and Processes in Student-AI Co-Programming