Statistical Properties of $k$-means Clustering for Data Missing Completely at Random
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
The classical $k$-means clustering cannot be directly used to incomplete data, and existing $k$-means-based clustering for missing data primarily focus on improving the practical accuracy of clustering, whereas most of them lack theoretical guarantees in the asymptotic sense.
In this paper, we investigate the statistical properties of $k$-means clustering in the presence of missing data.
We first establish the $\sqrt{n}$-excess risk bound and prove the consistency of the estimated cluster centers under general missing mechanisms.
For the Missing Completely at Random (MCAR) mechanism, we further derive the $\sqrt{n}$-convergence rate and asymptotic normality of the estimated cluster centers.
Moreover, we study in what cases the cluster centers estimated by incomplete data converge to the true cluster centers of original fully observed data, and give a sufficient condition about the missing probability and the separation among true clusters.
These results provide a theoretical guarantee for missing-data-$k$-means.
Notably, our analysis reveal that under MCAR mechanism, both achieving the $\sqrt{n}$-rate and converging to the true cluster centers require $k$ true centers to be distinct in every dimension, highlighting the significant challenges of application in high-dimensional regimes.
Finally, we conduct numerical simulations on synthetic incomplete datasets to support our theoretical analysis results.