The Impossibility of Eliciting Latent Knowledge

arXiv, Computer Science > Artificial Intelligence. Submitted on 10 Jun 2026.

Abstract: Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs.

In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation.

We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.
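The intuition behind the impossibility claim can be illustrated with a toy sketch. It is not the paper's CID formalism or its proof; every name here (`State`, `honest`, `human_simulator`, `feedback`, the `train` and `deploy` sets) is a hypothetical introduced for illustration. The sketch assumes a world with one observable and one latent Boolean variable, and a training set on which the observable happens to match the latent variable, so that feedback is perfect.

```python
from typing import Callable, List, NamedTuple


class State(NamedTuple):
    observable: bool  # what the human interacting with the agent can see
    latent: bool      # hidden variable; the agent is assumed to know it


Policy = Callable[[State], bool]


def honest(s: State) -> bool:
    """Report the agent's actual belief about the latent variable."""
    return s.latent


def human_simulator(s: State) -> bool:
    """Report what a human would conclude from the observable alone."""
    return s.observable


def feedback(s: State, answer: bool) -> float:
    """Perfect feedback (an assumption): reward 1 when the answer is true."""
    return 1.0 if answer == s.latent else 0.0


def total_reward(policy: Policy, states: List[State]) -> float:
    return sum(feedback(s, policy(s)) for s in states)


# Training states: the observable agrees with the latent variable.
train = [State(True, True), State(False, False)]
# Deployment states: the observable is misleading.
deploy = [State(True, False), State(False, True)]

# Both policies give identical answers, hence identical reward, in training.
same_behaviour_in_training = all(honest(s) == human_simulator(s) for s in train)
train_rewards = (total_reward(honest, train), total_reward(human_simulator, train))
# They come apart only once the observable stops tracking the latent variable.
deploy_rewards = (total_reward(honest, deploy), total_reward(human_simulator, deploy))
```

Because the two policies behave identically on every training state, any training signal that depends only on behaviour treats them the same, even though only one of them stays honest on the deployment states. This mirrors the abstract's distinction between honest answers and answers that humans would evaluate as true, under the simplifying assumptions stated above.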
