FOVI: A biologically-inspired foveated interface for deep vision models
arXiv:2602.03766v2 Announce Type: replace-cross Abstract: Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye movements to bring different parts of the world into focus while keeping the rest in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI), based on the human retina and primary visual cortex (V1), that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the DINOv3 ViT foundation model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the pixels and computational cost of full-resolution, non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code (https://github.com/nblauch/fovi) and pre-trained models (https://huggingface.co/fovi-pytorch) are available.
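
As a rough illustration of the kNN receptive-field idea described in the abstract, the following pure-Python sketch builds a toy retina-like sensor layout, warps it with a logarithmic radial map, and applies a shared set of weights to each sensor's k nearest neighbors. It is not the FOVI implementation: the function names, the ring layout, the logarithmic warp (which evens out radial spacing only), and the sharing of weights by neighbor rank are all assumptions made for illustration, and the paper's kernel mapping technique is not reproduced here.

```python
import math


def foveated_sensors(n_rings=6, n_per_ring=8, r_min=0.05, r_max=1.0):
    """Hypothetical retina-like layout (requires n_rings >= 2).

    Ring radii grow geometrically, so samples are dense near the
    center (fovea) and sparse in the periphery.
    """
    points = [(0.0, 0.0)]
    for i in range(n_rings):
        r = r_min * (r_max / r_min) ** (i / (n_rings - 1))
        for j in range(n_per_ring):
            theta = 2.0 * math.pi * j / n_per_ring
            points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def to_manifold(points, r_min=0.05):
    """Assumed stand-in for a V1-like manifold: compress radius
    logarithmically while keeping each sensor's direction."""
    warped = []
    for x, y in points:
        r = math.hypot(x, y)
        if r == 0.0:
            warped.append((0.0, 0.0))
        else:
            s = math.log1p(r / r_min)
            warped.append((s * x / r, s * y / r))
    return warped


def knn_indices(coords, k):
    """Brute-force k nearest neighbors; each sensor is its own first neighbor."""
    neighbors = []
    for xi, yi in coords:
        order = sorted(
            range(len(coords)),
            key=lambda j: ((coords[j][0] - xi) ** 2 + (coords[j][1] - yi) ** 2, j),
        )
        neighbors.append(order[:k])
    return neighbors


def knn_conv(values, neighbors, weights):
    """Weighted sum over each neighborhood, with weights shared by
    neighbor rank (an illustrative simplification)."""
    return [
        sum(w * values[j] for w, j in zip(weights, nbrs))
        for nbrs in neighbors
    ]


# Toy usage: average a scalar signal over 9-sensor neighborhoods.
sensors = foveated_sensors()
manifold = to_manifold(sensors)
neighbors = knn_indices(manifold, k=9)
signal = [math.hypot(x, y) for x, y in sensors]  # eccentricity as a dummy input
smoothed = knn_conv(signal, neighbors, [1.0 / 9.0] * 9)
```

Because neighborhoods are chosen in the warped coordinates rather than in image space, a fixed k covers a small patch of the visual field near the center and a larger patch in the periphery, which is the qualitative behavior the abstract attributes to foveated receptive fields.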