How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
Abstract
Understanding how performance scales jointly with model size and data is a central problem in modern machine learning.
Existing theoretical work on scaling laws typically describes generalization as a function of data or compute, often in fixed-feature or infinite-width regimes and for online SGD.
Here, we instead study how generalization scales with the number of trainable parameters and the number of samples in a feature-learning model.
We analyze $\ell_2$-regularized empirical risk minimization in a quadratic two-layer network in a finite-sample setting with structured data.
This setting allows for an explicit characterization of the generalization error as a function of the number of samples, model width, and regularization.
Our results reveal a phase diagram with distinct scaling regimes as the number of parameters varies.
In particular, the generalization error follows data-dependent power laws controlled by the spectral structure of the target.
We further characterize the transitions between regimes, including the onset of interpolation, and their impact on generalization.