Hierarchical Fault Detection and Diagnosis for Transformer Architectures
Abstract
Transformers now underpin critical AI systems across industry and research.
Yet their faults can silently alter model behavior without runtime errors, and existing techniques offer little support for tracing these failures to their component and root cause.
Such faults evade detection because loss and numerical values stay normal, and the visible symptom rarely identifies the component responsible.
We present DEFault++, a hierarchical learning-based technique that first detects a fault, then identifies the affected component, and finally pinpoints the root cause within that component, helping developers debug transformer models effectively.
DEFault++ organizes component-level runtime measurements with a Fault Propagation Graph (FPG), a structural prior over the architecture's dependency paths, and reports the evidence behind each diagnosis.
To train and evaluate it, we construct DEFault-bench, a benchmark of 5,556 labeled runs from mutation testing across seven models, nine tasks, and both encoder and decoder architectures.
DEFault++ improves fault detection over four prior techniques, reaching an F1 of 0.826 to 0.909, and in a developer study with 21 participants, it raises repair accuracy from 57.1% to 83.3%.
These results show that transformer fault diagnosis benefits from component-level measurements and architecture-aware reasoning rather than model-level behavior alone. DEFault-bench, in turn, provides a foundation for further research on transformer fault diagnosis.
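
To make the detect, localize, and diagnose hierarchy concrete, the following is a minimal sketch of how a dependency graph could guide a three-stage diagnosis. It is not the DEFault++ implementation: the abstract does not specify how the FPG is built or used, and DEFault++ is learning-based, whereas this sketch uses a fixed threshold and a simple "most upstream anomaly" heuristic. The component names, the edge list in `FPG`, the anomaly scores, the cause labels, and the functions `upstream_of` and `diagnose` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical dependency edges (upstream -> downstream); not the paper's FPG.
FPG = {
    "embedding": ["attention"],
    "attention": ["ffn"],
    "ffn": ["layer_norm"],
    "layer_norm": ["output_head"],
    "output_head": [],
}


def upstream_of(graph, node):
    """Return every component whose output can reach `node`, including `node`."""
    reverse = {name: [] for name in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)
    seen, queue = {node}, deque([node])
    while queue:
        current = queue.popleft()
        for parent in reverse[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def diagnose(scores, causes, threshold=0.5):
    """Toy cascade: detect a fault, pick a component, then pick a cause.

    `scores` maps component -> anomaly score; `causes` maps component ->
    {cause label: score}. Both are assumed inputs, not measured values.
    """
    # Stage 1 (detection): any component above the assumed threshold.
    anomalous = {c for c, s in scores.items() if s > threshold}
    if not anomalous:
        return None

    # Stage 2 (component): keep anomalous components with no anomalous
    # ancestor, since downstream anomalies may only be propagated symptoms.
    roots = [c for c in anomalous if upstream_of(FPG, c) & anomalous == {c}]
    component = max(roots, key=lambda c: scores[c])

    # Stage 3 (cause): highest-scoring cause label for that component, if any.
    candidates = causes.get(component, {})
    cause = max(candidates, key=candidates.get) if candidates else None

    # Report the evidence behind the diagnosis alongside the result.
    evidence = {c: scores[c] for c in sorted(anomalous)}
    return {"component": component, "cause": cause, "evidence": evidence}


if __name__ == "__main__":
    scores = {"embedding": 0.1, "attention": 0.9, "ffn": 0.7,
              "layer_norm": 0.6, "output_head": 0.2}
    causes = {"attention": {"wrong_mask": 0.8, "missing_scaling": 0.2}}
    print(diagnose(scores, causes))
```

With these made-up scores, three components look anomalous, but only `attention` has no anomalous ancestor, so the sketch attributes the fault to it and treats the `ffn` and `layer_norm` anomalies as propagated symptoms. A learned model would replace both the threshold and the heuristic.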