Can LLMs Credibly Transform the Creation of Panel Data from Diverse Historical Tables?
Abstract
Multimodal LLMs could bring a watershed change to the digitization of historical tables by enabling low-cost processing that centers on domain expertise rather than technical skill.
We develop and rigorously assess an LLM-based pipeline on a new panel of historical county-level vehicle registration tables from early 20th-century U.S. state reports.
Evaluated against human-transcribed gold standard data, the pipeline achieves an exact cell match rate of 95.4% at approximately one-fiftieth the cost of traditional outsourcing.
The pipeline performs well both at extracting table structure, where it reduces critical parsing errors from 61.4% to 0.35%, and at numerical transcription, where it exactly matches 96.7% of linked cells and achieves a mean absolute percentage error of 0.7%.
In category alignment, the pipeline performs on par with human-based alignment.
We also assess pipeline performance in situ with two case studies that analyze the growth and persistence of historical vehicle adoption using common regression models.
The significance and sign of effects are identical whether using LLM or gold standard data for all eight models tested, and the coefficient of interest is statistically indistinguishable in six of eight models.
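The cell-level metrics quoted above (exact match rate over linked cells and mean absolute percentage error) can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so this assumes that a cell is "linked" when the same (row, column) key appears in both the gold standard and the LLM output, and that the percentage error is averaged over linked cells with a nonzero gold value. The function name `cell_metrics` and the toy values are hypothetical, not taken from the paper.

```python
from typing import Dict, Tuple

# A cell is keyed by (row key, column key), e.g. (county, vehicle category).
Cell = Tuple[str, str]


def cell_metrics(gold: Dict[Cell, float], pred: Dict[Cell, float]) -> Dict[str, float]:
    """Compare transcribed values against gold values on linked cells.

    Assumption: a cell is "linked" when its key appears in both tables.
    """
    linked = [key for key in gold if key in pred]
    if not linked:
        raise ValueError("no linked cells to compare")

    # Exact match: transcribed value equals the gold value.
    exact = sum(1 for key in linked if gold[key] == pred[key])

    # Absolute percentage error, skipping gold zeros to avoid division by zero
    # (an assumption of this sketch).
    ape = [abs(pred[key] - gold[key]) / abs(gold[key]) for key in linked if gold[key] != 0]

    return {
        "n_linked": len(linked),
        "exact_match_rate": exact / len(linked),
        "mape": sum(ape) / len(ape) if ape else float("nan"),
    }


# Toy data, invented for illustration only.
gold = {
    ("Adams", "autos"): 100,
    ("Adams", "trucks"): 20,
    ("Baker", "autos"): 250,
    ("Baker", "trucks"): 40,   # missing from pred, so not linked
}
pred = {
    ("Adams", "autos"): 100,
    ("Adams", "trucks"): 25,   # transcription error
    ("Baker", "autos"): 250,
    ("Clark", "autos"): 10,    # not in gold, so not linked
}

metrics = cell_metrics(gold, pred)
print(metrics)
```

Separating linkage from transcription accuracy in this way mirrors the abstract's distinction between table structure errors and numerical transcription errors: a cell that cannot be linked counts against structure, not against the match rate on linked cells.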