SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

arXiv CS.AI

이 뉴스, 어떠셨어요?

한 번의 탭으로 반응을 남겨요 · 로그인 불필요

CC BY

이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.

Abstract

De-identification of clinical text is a prerequisite for the secondary use of electronic health records.

Existing public benchmarks such as the i2b2 2006 and 2014 corpora are over a decade old and lack the semantic and demographic diversity of modern clinical narratives.

Large Language Models (LLMs) reach state-of-the-art zero-shot extraction, but their use at enterprise scale is limited by computational cost and by hospital data governance that restricts sending Protected Health Information (PHI) to cloud APIs.

We introduce SHIELD (Synthetic Human-annotated Identifier-replaced Entries for Learning and De-identification), a diverse clinical note dataset of 1,381 notes with 10,229 gold-standard PHI spans across 9 categories, built with set-cover diversity sampling across demographic and document-type strata and human-in-the-loop adjudication.

We evaluate four LLMs (two proprietary, two open-weight) to establish a performance ceiling on SHIELD, then show that a teacher-student distillation framework transfers these capabilities into locally deployable Small Language Models.

Our best distilled model reaches micro-averaged span-level precision of 0.89 and recall of 0.88 while running on standard workstation hardware.

It trails its cloud teacher on per-category recall (0.90 vs.

0.81 macro-averaged) but remains competitive given its lower cost and on-premise deployability.

Cross-dataset evaluation shows that diversity-trained models generalize well on universal structured PHI categories, while institution-specific entities remain hard to transfer in both directions, which suggests pairing broad-coverage models with specialized models for high-volume, semi-structured note types.

We publicly release the SHIELD dataset and the distilled DeBERTa v3 model to provide an accurate, cost-effective de-identification pipeline deployable entirely behind institutional firewalls.

전문 보기

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

이 뉴스, 어떠셨어요?

Abstract

관련 뉴스

'research' 카테고리 뉴스

Constructive Alignment: Governing Preference Dynamics in Human-AI Interaction

Bounded Morality: Defining the Space of Moral Computation

The MMM Data Model -- A Normative Specification for Knowledge Interoperability in a Decentralisable Knowledge Commons

arXiv의 다른 기사

RareDxR1: Autonomous Medical Reasoning for Rare Disease Diagnosis Beyond Human Annotation

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

Constructing Epistemic AI Literacy: Detecting Epistemic Aims and Processes in Student-AI Co-Programming