PREPROCESSING FOR GAUSSIAN MIXTURE MODELS: EFFECTIVENESS OF PCA VS. ROBUST SPARSE PCA


Kabakcı F.

IRSYSC 2024 8th International Researchers, Statisticians, and Young Statisticians Congress, Adana, TÜRKİYE, 28-30 November 2024

  • Publication Type: Conference Paper / Unpublished
  • Affiliated with Recep Tayyip Erdoğan University: Yes

ABSTRACT

Dimensionality reduction is a critical step in improving clustering performance, especially for high-dimensional datasets that contain noise and outliers. This study aims to compare the effectiveness of Principal Component Analysis (PCA) and Robust Sparse PCA as preprocessing steps for Gaussian Mixture Model (GMM) clustering on datasets with controlled complexity. Using the MixSim R package, synthetic datasets are generated with varying levels of cluster overlap, outliers, and dimensionality so that clustering outcomes can be investigated systematically under different conditions. Metrics such as the Adjusted Rand Index (ARI) and the Variation of Information (VI) will be used to evaluate clustering performance. The research seeks to determine whether PCA's noise reduction gives it an advantage over Robust Sparse PCA in simpler scenarios, or whether the robustness of Robust Sparse PCA yields superior results on datasets with high overlap and significant outliers. The findings will provide practical guidance on selecting an appropriate preprocessing technique for GMM clustering in different data environments.
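
The two evaluation metrics named in the abstract can both be computed from a pair of label vectors (true cluster memberships and the labels returned by a clustering method). The following is a minimal sketch in plain Python, not the study's own code, which is R-based. The function names are illustrative assumptions, and VI is reported in nats (natural logarithm). ARI equals 1 for identical partitions and is close to 0 for random agreement; VI equals 0 for identical partitions and grows as the partitions diverge.

```python
from collections import Counter
from math import comb, log


def contingency(labels_a, labels_b):
    """Count the points that fall in each (cluster_a, cluster_b) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie form) of two partitions."""
    n = len(labels_a)
    joint = contingency(labels_a, labels_b)
    # Pairs of points placed together in both partitions, in A, and in B.
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treated as perfect agreement
    expected = sum_a * sum_b / total_pairs
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_joint - expected) / (max_index - expected)


def variation_of_information(labels_a, labels_b):
    """Variation of Information, VI = H(A|B) + H(B|A), in nats."""
    n = len(labels_a)
    joint = contingency(labels_a, labels_b)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    vi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # count / size_a[a] is p(b | a); count / size_b[b] is p(a | b).
        vi -= p_ab * (log(count / size_a[a]) + log(count / size_b[b]))
    return vi


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 1, 1, 1, 1]  # one point assigned to the wrong cluster
    print("ARI:", adjusted_rand_index(truth, predicted))
    print("VI :", variation_of_information(truth, predicted))
```

Both functions depend only on how the points are grouped, not on the label values themselves, so a clustering that recovers the true groups under different label names still scores ARI = 1 and VI = 0.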

Keywords: Gaussian mixture model, principal component analysis, sparse principal component analysis, high-dimensional dataset, dimensionality reduction