Challenges in the calibration of tree-based models for imbalanced classification

Computer Science > Machine Learning

[Submitted on 17 Dec 2024 (v1), last revised 1 Jun 2026 (this version, v5)]

Abstract: When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data that is not fully representative of the underlying population of interest. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration and the latter by demonstrating a bias in decision trees. In contradiction with much of the existing literature, we show that decision trees can be biased towards the minority class. These issues indicate that tree-based models trained on undersampled data should not be calibrated analytically. Calibration approaches that can learn a miscalibration pattern in the original model (e.g., beta calibration) are more suitable.

Submission history
From: Nathan Phelps
[v1] Tue, 17 Dec 2024 19:38:29 UTC (824 KB)
[v2] Wed, 9 Jul 2025 19:32:05 UTC (1,001 KB)
[v3] Wed, 23 Jul 2025 17:25:41 UTC (1,001 KB)
[v4] Fri, 31 Oct 2025 15:11:15 UTC (1,013 KB)
[v5] Mon, 1 Jun 2026 01:20:55 UTC (1,019 KB)
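The abstract refers to analytically mapping predictions using the majority-class sampling rate but does not state a formula. A minimal sketch of one widely used mapping of this kind follows, assuming the standard prior-shift correction in which each majority-class example is kept with probability `beta`; the function name `correct_for_undersampling` and the symbol `beta` are illustrative assumptions, not names taken from the paper. Undersampling divides the true odds of the minority class by `beta`, so the correction multiplies the predicted odds by `beta`.

```python
def correct_for_undersampling(p_s: float, beta: float) -> float:
    """Map a minority-class probability from a model trained on
    undersampled data back to the original population scale.

    ASSUMPTION: this is the standard prior-shift correction, not
    necessarily the exact mapping studied in the paper.

    p_s  : probability predicted by the model trained on undersampled data
    beta : fraction of majority-class examples kept (0 < beta <= 1)
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must be in [0, 1]")
    # Equivalent to multiplying the predicted odds p_s / (1 - p_s) by beta
    # and converting the result back to a probability.
    return beta * p_s / (beta * p_s + 1.0 - p_s)


if __name__ == "__main__":
    # A prediction of 0.5 from a model trained with 10% of the majority class
    # kept corresponds to odds of 0.1 on the original scale.
    print(correct_for_undersampling(0.5, 0.1))
```

Because this mapping is a fixed function of `beta`, it cannot adapt to any additional miscalibration in the underlying model, which is consistent with the abstract's recommendation of learned approaches such as beta calibration for tree-based models.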
