The Impossibility of Eliciting Latent Knowledge
arXiv CS.AI
CC BY
[Submitted on 10 Jun 2026]
Abstract: Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.
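The intuition behind the impossibility claim can be illustrated with a toy example. The sketch below is not the paper's CID formalism; it is a minimal illustration under the assumption that, during training, what the human observes always agrees with the latent variable. All names (`World`, `honest_reporter`, `human_simulator`, `perfect_feedback`) are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class World:
    latent: bool       # hidden variable the question is about
    observation: bool  # what the human can see


def honest_reporter(w: World) -> bool:
    """Reports the agent's belief about the latent variable."""
    return w.latent


def human_simulator(w: World) -> bool:
    """Reports what a human would judge true from the observation alone."""
    return w.observation


def perfect_feedback(w: World, answer: bool) -> int:
    """Reward 1 if and only if the answer matches the latent truth."""
    return int(answer == w.latent)


Policy = Callable[[World], bool]


def behaviour(policy: Policy, worlds: List[World]) -> List[bool]:
    """The answers a policy gives on each world: all a trainer can see."""
    return [policy(w) for w in worlds]


def total_reward(policy: Policy, worlds: List[World]) -> int:
    return sum(perfect_feedback(w, policy(w)) for w in worlds)


# Assumption: in training, the observation never misleads the human.
train = [World(latent=v, observation=v) for v in (True, False)]
# Outside training, the observation can disagree with the latent variable.
deploy = [World(latent=False, observation=True)]

for name, policy in [("honest", honest_reporter), ("simulator", human_simulator)]:
    print(name, behaviour(policy, train), total_reward(policy, train),
          behaviour(policy, deploy))
```

Both policies give identical answers and earn identical reward on every training world, so a training signal that depends only on behaviour cannot tell them apart, yet they answer differently once the observation is misleading. This mirrors the abstract's contrast between honest answers and answers that humans would evaluate as true; the paper's actual theorem is stated and proved in terms of CIDs.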