A unified approach to outlier identification for mixed-type data
Abstract
We present an outlier identification method for mixed-type data sets comprising continuous and ordinal variables.
Outliers are defined relative to a multivariate Gaussian reference distribution for the non-outliers, with a latent Gaussian assumed to underlie the ordinal variables.
The proposed algorithm uses the robust Minimum Covariance Determinant estimator to estimate the parameters of the multivariate Gaussian for the non-outliers.
This is extended to account for the fact that the full Gaussian information underlying the ordinal variables is not observed.
A breakdown theorem shows that replacing observations will not stop sufficiently extreme outliers from being identified.
The effectiveness of our approach is demonstrated via simulations on synthetic data with various types of contamination, achieving high detection and low false positive rates.
Practical relevance is illustrated through an application to Airbnb listing data containing both continuous and ordinal attributes.
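
For orientation, the continuous-only building block can be sketched as follows. This is a minimal illustration of standard Minimum Covariance Determinant outlier flagging with scikit-learn, not the proposed method: it does not handle ordinal variables or the latent Gaussian extension. The function name `flag_outliers_mcd`, the chi-square cutoff level `alpha`, and the synthetic data are assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet


def flag_outliers_mcd(X, alpha=0.01, random_state=0):
    """Flag rows of X whose robust squared Mahalanobis distance is large.

    Hypothetical helper for illustration only; continuous variables only.
    """
    # Robust location and scatter estimates from the MCD estimator.
    mcd = MinCovDet(random_state=random_state).fit(X)
    # MinCovDet.mahalanobis returns squared Mahalanobis distances.
    d2 = mcd.mahalanobis(X)
    # Under a Gaussian reference distribution, squared distances are
    # approximately chi-square with p degrees of freedom (assumed cutoff rule).
    cutoff = chi2.ppf(1.0 - alpha, df=X.shape[1])
    return d2 > cutoff, d2


# Synthetic example: 200 Gaussian points plus 10 strongly shifted points.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 3))
shifted = rng.normal(loc=8.0, size=(10, 3))
X = np.vstack([clean, shifted])

flags, d2 = flag_outliers_mcd(X)
print("flagged among shifted points:", int(flags[200:].sum()), "of 10")
print("flagged among clean points:", int(flags[:200].sum()), "of 200")
```

Because the location and scatter are estimated robustly, the shifted points do not inflate the covariance estimate enough to mask themselves, which is the property the abstract relies on for the non-outlier Gaussian.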