Structural Enforcement of Statistical Rigor in AI-Driven Discovery: A Functional Architecture
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
AI-Scientist systems risk manufacturing spurious discoveries through uncontrolled multiple testing.
We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded domain-specific language (the Research monad) that makes it impossible to test a hypothesis without updating the error budget, and a declarative scaffold, backed by an OS-level sandbox, that makes validation data physically absent from the environment in which LLM-generated code runs.
We ground the design in a machine-checked Lean~4 formalization of LORD++ online false-discovery-rate (FDR) control: we derive its error budget and prove both marginal and full FDR control, then close the gap to the implementation by verifying the budget's wealth invariant over IEEE~754 arithmetic in SPARK/Ada.
To our knowledge this is the first verified chain from theorem to floating-point implementation for an online FDR procedure.
In simulation, the architecture holds the false discovery rate near 1\% against a 5\% target, where a naive approach reaches 41\%.
In end-to-end case studies, a valid test avoids the false discoveries a flawed one produces, yet still finds real effects when the data allow.
An adversarial evaluation confirms that generated code cannot read the held-out data even when given its exact path.