Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
Abstract
Synthetic tabular data generation is increasingly essential in machine learning, supporting downstream applications when real-world, high-quality tabular data is insufficient.
Existing tabular generation approaches, such as generative adversarial networks (GANs) and fine-tuned Large Language Models (LLMs), typically require ample reference data, which limits their effectiveness on domain-specific datasets with scarce records.
While prompt-based LLMs offer flexibility without parameter tuning, they often generate data that drifts from the target distribution and exhibits localized redundancy, which degrades downstream task performance.
To overcome these issues, we propose ReFine, a framework that (i) extracts symbolic if-then rules from interpretable models and embeds them into prompts to explicitly guide the generation process toward the domain-specific distribution, and (ii) applies dual-granularity filtering that mitigates over-sampling patterns while preserving rare but informative samples to reduce localized redundancy.
Extensive experiments on diverse benchmarks demonstrate that ReFine provides robust downstream utility, achieving a top-tier average rank across datasets and data regimes, with an average relative improvement of 7.48% in extreme low-data regimes.
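To make component (i) concrete, the following is a minimal sketch of how if-then rules might be read off an interpretable model and placed into a generation prompt. It is an illustration under stated assumptions, not the ReFine implementation: the toy decision tree, the feature names (`age`, `income`), the thresholds, and the helpers `extract_rules` and `build_prompt` are all hypothetical, and the prompt wording is invented for the example.

```python
# Hypothetical toy decision tree written by hand as nested dicts.
# In practice such a tree would be fit on the few available reference
# records; the features and thresholds here are purely illustrative.
TOY_TREE = {
    "feature": "age", "threshold": 40.0,
    "left": {"feature": "income", "threshold": 50000.0,
             "left": "no", "right": "yes"},
    "right": "yes",
}


def extract_rules(node, conditions=()):
    """Return one symbolic if-then rule per root-to-leaf path."""
    if not isinstance(node, dict):  # leaf: emit the accumulated conditions
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN label = {node}"]
    feat, thr = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feat} <= {thr:g}",))
    right = extract_rules(node["right"], conditions + (f"{feat} > {thr:g}",))
    return left + right


def build_prompt(rules, header, examples, n_rows):
    """Embed the rules and a few reference rows into a text prompt."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    example_block = "\n".join(",".join(map(str, row)) for row in examples)
    return (
        f"Generate {n_rows} new rows with columns: {','.join(header)}.\n"
        f"Follow these domain rules:\n{rule_block}\n"
        f"Reference rows:\n{example_block}\n"
    )


rules = extract_rules(TOY_TREE)
prompt = build_prompt(rules, ["age", "income", "label"],
                      [(35, 42000, "no"), (52, 61000, "yes")], n_rows=5)
print(prompt)
```

Each root-to-leaf path becomes one rule, so the prompt states the decision boundaries explicitly and the LLM does not have to infer them from a handful of reference rows.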