MMGist: A Comprehensive Multimodal Benchmark for 2027

arXiv CS.AI

이 뉴스, 어떠셨어요?

한 번의 탭으로 반응을 남겨요 · 로그인 불필요

CC BY

이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.

Abstract

We conduct a systematic study of 18 widely used vision-language benchmarks and identify three major issues: 1) many items do not rely on visual cues and therefore fail to effectively measure multimodal understanding; 2) many items are already close to performance saturation for current LVLMs, which limits their discriminative power; 3) a small number of anomalous items affect the reliability of evaluation results.

To this end, we propose MMGist, a curated benchmark that covers seven capability dimensions and contains 7,262 items.

MMGist is constructed through a three-stage pipeline, which sequentially combines text-ablation filtering, cross-model saturation filtering, and anomaly detection filtering.

We conduct extensive experiments on 27 leading LVLMs and compare MMGist with the raw pool of 23,250 items.

The results show that MMGist preserves model rankings with high fidelity, with Spearman $\rho = 0.98$, while reducing evaluation items by 69\% and improving cross-model discrimination by 78\%.

Further results indicate that Visual Logic remains a systematic weakness of current LVLMs, while knowledge-intensive dimensions such as Expert Knowledge dimensions remain important factors for distinguishing closed-source models from open-source models.

These findings suggest that high-quality evaluation should prioritize visual dependency, discriminative power, and reliability, rather than simply pursuing benchmark scale.

전문 보기

MMGist: A Comprehensive Multimodal Benchmark for 2027

이 뉴스, 어떠셨어요?

Abstract

관련 뉴스

'research' 카테고리 뉴스

Detecting and Controlling Sycophancy with Cascading Linear Features

Life After Benchmark Saturation: A Case Study of CORE-Bench

Refusal Lives Downstream of Persona in Chat Models

arXiv의 다른 기사

Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking

Accelerating Skill Assessment in Chess: A Drift-Diffusion-Enhanced Elo Rating System

Governing Actions, Not Agents: Institutional Attestation as a Governance Model for Autonomous AI Systems