Push Puppet Networks: Structured Bayesian Pruning Algorithm for Language Model Compression
Abstract
This paper presents push puppet networks, a novel Bayesian algorithm for structured pruning of large language models.
The push puppet network learns a hierarchical function during training that can optimally determine specific network layers to keep for a given target size.
With a small number of gating parameters added via a hierarchical penalty function, the learned smooth function allows a network to be resized to very specific sizes without loading the full model into memory or requiring further post-training computation.
The method compares favorably with existing approaches (SparseGPT, Wanda) at high pruning levels (less than 50% of the network structure) while realizing measurable speedups on conventional GPUs with PyTorch.
Furthermore, push puppet networks can achieve significant speedups as candidates for speculative decoding.
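To make the idea of resizing to a specific target concrete, the following is a minimal sketch and not the paper's algorithm. It assumes each layer ends up with a single learned gate score and that a target size is met by keeping the highest-scoring layers. The function `layers_to_keep`, the score values, and the rank-by-score rule are all hypothetical stand-ins for the hierarchical function described above.

```python
from typing import List, Sequence


def layers_to_keep(gate_scores: Sequence[float], target_layers: int) -> List[int]:
    """Return indices of the layers to keep, in original network order.

    Assumption: a higher gate score means the layer matters more. This
    ranking rule is illustrative only, not the method from the paper.
    """
    if not 0 <= target_layers <= len(gate_scores):
        raise ValueError("target_layers must be between 0 and the number of layers")
    # Rank layer indices by score, highest first; ties go to the earlier layer.
    ranked = sorted(range(len(gate_scores)), key=lambda i: (-gate_scores[i], i))
    # Keep the top `target_layers` indices and restore network order.
    return sorted(ranked[:target_layers])


# Hypothetical gate scores for a 6-layer network.
scores = [0.91, 0.15, 0.62, 0.08, 0.47, 0.88]
for k in (2, 4, 6):
    print(k, layers_to_keep(scores, k))
```

Under this assumed rule the selections are nested: the layers kept at a smaller target are always a subset of those kept at a larger one, so one set of scores serves every target size, and choosing a size needs only the scores rather than the model weights.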