Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging
Abstract
The "alignment tax" of post-training is typically framed as a drop in task accuracy.
We show it also involves a severe loss of calibration, leaving models overconfident and less reliable, with less diverse outputs.
This trade-off can be navigated effectively with a simple post-hoc intervention: interpolating between a model's weights before and after alignment.
Crucially, this is not a strict trade-off.
We find that the process consistently reveals Pareto-optimal interpolations: models that improve accuracy beyond both parents while substantially recovering the calibration lost during alignment.
Our work demonstrates that simple model merging provides a computationally efficient method for mitigating the full scope of the alignment tax, yielding models that are more capable and more reliable.
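
As an illustration of the intervention described above, the following is a minimal sketch of weight interpolation between a pre-alignment and a post-alignment checkpoint. It assumes plain linear interpolation with a single coefficient `lam` shared across all parameters, which the abstract does not specify; the function name `interpolate_weights`, the `StateDict` alias, and the toy parameter names are hypothetical, and parameters are represented as flat lists of floats to keep the example free of dependencies.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def interpolate_weights(base: StateDict, aligned: StateDict, lam: float) -> StateDict:
    """Return (1 - lam) * base + lam * aligned, parameter by parameter.

    lam = 0.0 reproduces the pre-alignment model and lam = 1.0 the aligned one.
    Linear interpolation with one global coefficient is an assumption here.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if base.keys() != aligned.keys():
        raise ValueError("both models must share the same parameter names")

    merged: StateDict = {}
    for name, w_base in base.items():
        w_aligned = aligned[name]
        if len(w_base) != len(w_aligned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * b + lam * a for b, a in zip(w_base, w_aligned)]
    return merged


# Toy checkpoints (illustrative values only, not taken from the paper).
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
aligned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}

# Sweep the coefficient to trace out candidate merged models.
sweep = {lam: interpolate_weights(base, aligned, lam) for lam in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

In practice each model in such a sweep would be scored on both accuracy and a calibration metric, and the coefficients that are not dominated on either axis would form the frontier the abstract refers to.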