MNAR-$k$-means: A $k$-means Clustering for Data Missing Not at Random with Magnitude-Decaying Probability

arXiv Stat

CC BY

Abstract

The classical $k$-means clustering, based on distances computed from all data features, cannot be directly applied to incomplete data with missing values.

A natural extension of $k$-means to missing data is to involve only the observed positions in clustering, which is equivalent to imputing missing values by corresponding cluster means.
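The equivalence stated above can be illustrated with a minimal NumPy sketch of the observed-only baseline (not the method proposed in this paper). The function name `observed_only_kmeans`, the encoding of missing entries as NaN, and the initialization from randomly chosen rows are assumptions made for illustration only.

```python
import numpy as np

def observed_only_kmeans(X, k, n_iter=50, seed=0):
    """Baseline k-means that uses only the observed (non-NaN) entries of X.

    Hypothetical helper for illustration; not the paper's proposed method.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                      # True where a value is observed
    X0 = np.where(mask, X, 0.0)              # zeros at missing positions (masked out below)
    n, p = X.shape

    # Assumed initialization: k random rows, with gaps filled by observed column means.
    col_mean = X0.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
    X_init = np.where(mask, X, col_mean)
    centers = X_init[rng.choice(n, size=k, replace=False)].copy()

    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: squared distance summed over observed coordinates only.
        diff = (X0[:, None, :] - centers[None, :, :]) * mask[:, None, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)

        # Update step: each center coordinate is the mean of the observed values
        # in that cluster; a coordinate nobody observed keeps its previous value.
        for j in range(k):
            members = labels == j
            counts = mask[members].sum(axis=0)
            sums = X0[members].sum(axis=0)
            centers[j] = np.where(counts > 0, sums / np.maximum(counts, 1), centers[j])
    return labels, centers
```

Filling each missing entry with the matching coordinate of its assigned center, as in `np.where(~np.isnan(X), X, centers[labels])`, and then averaging each cluster over all rows returns the same centers, which is the mean-imputation view described above.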

However, for data missing not at random (MNAR), missingness depends on the data values themselves, so such a mean-imputation-based method can distort the estimated cluster centers and lead to poor clustering results.

Since MNAR mechanisms are very common in reality, it is necessary to improve the performance of $k$-means-based clustering methods for such data.

In this paper, we focus on a magnitude-decaying MNAR scenario, in which entries with smaller absolute values are more likely to be missing, and we propose a novel $k$-means clustering method that constrains the magnitude of the imputed values and admits a clear mathematical interpretation.
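To make the magnitude-decaying scenario concrete, the following sketch simulates a missingness mask in which the probability of an entry being missing shrinks as its absolute value grows. The abstract does not state the functional form, so the exponential decay `exp(-|x| / scale)`, the `scale` parameter, and the function name `magnitude_decaying_mask` are all assumptions chosen for illustration.

```python
import numpy as np

def magnitude_decaying_mask(X, scale=1.0, seed=0):
    """Return a boolean mask (True = missing) for a magnitude-decaying MNAR mechanism.

    Assumed form: P(missing | x) = exp(-|x| / scale). This is a hypothetical
    choice; the paper may use a different decay function.
    """
    rng = np.random.default_rng(seed)
    p_missing = np.exp(-np.abs(X) / scale)   # close to 1 near zero, decays with |x|
    return rng.random(X.shape) < p_missing

def apply_mask(X, missing):
    """Replace masked entries with NaN, leaving the input array untouched."""
    X_obs = X.astype(float).copy()
    X_obs[missing] = np.nan
    return X_obs
```

Under this assumed mechanism, values near zero are removed preferentially, so the surviving observations in a feature are skewed toward larger magnitudes, which is the kind of distortion of observed-only cluster means that the abstract describes.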

Moreover, we establish the statistical consistency of the cluster centers estimated by the proposed method with respect to the true cluster centers of the fully observed data, and we minimize the proposed loss function with an alternating minimization algorithm.

Simulation experiments confirm that the proposed method improves clustering results and reduces the bias of the estimated cluster centers.

Applications to real-world missing data further show the utility of the proposed method.
