The Context-Ready Transformer
Abstract
We introduce the context-ready transformer, a new recurrent neural network architecture built from a D-layer transformer block that pre-contextualizes each token before it enters the block.
During left-to-right generation, a correction network combines the previous position's block output -- a cached summary of past context -- with the current token embedding, so the token enters the block already contextualized rather than as a raw embedding.
During sequential inference, this chain of corrections makes the architecture a recurrent neural network.
For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step.
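The relationship between the sequential recurrence and the K-step parallel unroll can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: `block` is a toy causal stand-in for the D-layer transformer block, `correct` is a hypothetical one-layer correction network applied residually, and the zero initial state, sizes, and weights are arbitrary choices rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # toy sequence length and width (illustrative values)

W_block = rng.normal(scale=0.3, size=(d, d))
W_corr = rng.normal(scale=0.3, size=(2 * d, d))

def block(x):
    """Toy causal stand-in for the D-layer transformer block: (T, d) -> (T, d)."""
    # Causal running mean: position t only sees positions 0..t.
    prefix_mean = np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
    return np.tanh(prefix_mean @ W_block)

def correct(h_prev, emb):
    """Hypothetical correction network: mixes the previous block output into the embedding."""
    return emb + np.tanh(np.concatenate([h_prev, emb], axis=-1) @ W_corr)

def sequential(emb):
    """Left-to-right inference: one position at a time, recurrent in h."""
    h_prev = np.zeros(d)
    xs, hs = [], []
    for t in range(len(emb)):
        xs.append(correct(h_prev, emb[t]))
        # Recomputes the prefix for clarity; a real model would use a cache.
        h_prev = block(np.stack(xs))[-1]
        hs.append(h_prev)
    return np.stack(hs)

def unrolled(emb, K):
    """Training-style unroll: K passes, each over all positions in parallel."""
    h = np.zeros_like(emb)
    for _ in range(K):
        # Row t of h_shift holds the previous pass's output at position t-1.
        h_shift = np.vstack([np.zeros((1, d)), h[:-1]])
        h = block(correct(h_shift, emb))
    return h

emb = rng.normal(size=(T, d))
h_seq = sequential(emb)
# Largest absolute difference from the sequential result for K = 1..T.
gap = [float(np.abs(unrolled(emb, K) - h_seq).max()) for K in range(1, T + 1)]
```

In this toy setup each additional pass makes one more leading position agree with the sequential result, so `gap` reaches floating-point noise once K equals the sequence length. Whether a small fixed K is close enough on long sequences is an empirical question that the sketch does not address.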
A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning.
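One way to see why zero initialization suits this conversion is the sketch below, which assumes the correction is added residually to the token embedding and that "zero-initialized" refers to the FFN's output projection. The names `correction_ffn` and `block_input`, the ReLU activation, and all sizes are hypothetical. Under those assumptions the block input at initialization equals the raw embedding, so the converted model starts out computing exactly what the pretrained transformer computes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 8, 16  # illustrative sizes

# Hypothetical correction FFN: random first layer, zero output layer.
W1 = rng.normal(scale=0.1, size=(2 * d, hidden))
W2 = np.zeros((hidden, d))

def correction_ffn(h_prev, emb):
    """Two-layer FFN over the concatenated previous output and embedding."""
    z = np.maximum(np.concatenate([h_prev, emb], axis=-1) @ W1, 0.0)  # ReLU
    return z @ W2

def block_input(h_prev, emb):
    """Residual form (an assumption): raw embedding plus the correction."""
    return emb + correction_ffn(h_prev, emb)

emb = rng.normal(size=(5, d))
h_prev = rng.normal(size=(5, d))
# True when the correction contributes nothing at initialization.
unchanged = np.array_equal(block_input(h_prev, emb), emb)
```

Zeroing only the output layer, rather than both layers, keeps the hidden activations nonzero, so `W2` still receives a gradient once fine-tuning begins.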
We evaluate across widths, depths, block sizes, and two datasets, comparing against standard transformers, variants, and ablations.
A D=5 model beats a 12-layer transformer while generating 1.7x faster on an A100.
With K=10, a single-layer model (D=1) beats a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 PPL.
The architecture benefits most from wide representations and long contexts.
On a pointer-chasing task, D=1 trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.