ENSEMBLE CATBOOST-BASED MICROARRAY GENE EXPRESSION RETRIEVAL SYSTEM FOR ENHANCED DISEASE CLASSIFICATION

ICTACT Journal on Soft Computing (Volume: 16, Issue: 1)

Abstract

Microarray gene expression profiling is a crucial tool for identifying genetic patterns associated with complex diseases. However, the high dimensionality and noise of microarray datasets make effective gene retrieval and classification difficult. Traditional classifiers often fail to retrieve the relevant gene features and to classify diseases robustly because they overfit and are sensitive to noise. This paper proposes an enhanced gene retrieval system built on an ensemble CatBoost algorithm. CatBoost, a gradient boosting decision tree framework, is known for handling categorical features and avoiding prediction shift. The system integrates feature selection techniques with CatBoost to optimize gene relevance and improve classification accuracy. Pre-processing includes normalization and principal component analysis (PCA) for dimensionality reduction. The ensemble combines multiple CatBoost models using bagging to improve robustness and generalization. The proposed method was evaluated on benchmark microarray datasets (e.g., Leukemia, Colon, Prostate). It significantly outperformed traditional models such as SVM, Random Forest, KNN, and XGBoost, achieving up to 96.2% accuracy, 94.8% precision, 95.1% recall, and an F1-score of 0.97. The ensemble CatBoost model also demonstrated superior stability and interpretability in gene selection and disease classification.
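
The pipeline summarized in the abstract (normalization, PCA, then a bagged ensemble of CatBoost models) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the hyperparameters (50 iterations, depth 4, 10 principal components, 5 ensemble members) are arbitrary, the helper names `make_toy_data`, `fit_bagged_ensemble`, and `predict_bagged` are hypothetical, and the feature selection step is omitted because the abstract does not specify it. If `catboost` is not installed, scikit-learn's `GradientBoostingClassifier` is used as a stand-in booster.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

try:
    from catboost import CatBoostClassifier

    def make_base_model(seed):
        # Small, quiet CatBoost model; hyperparameters are illustrative guesses.
        return CatBoostClassifier(iterations=50, depth=4, random_seed=seed,
                                  verbose=0, allow_writing_files=False)
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier

    def make_base_model(seed):
        # Stand-in booster so the sketch still runs without catboost.
        return GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                          random_state=seed)


def make_toy_data(n_samples=80, n_genes=500, n_informative=30, seed=0):
    """Synthetic 'expression matrix' (assumption): two classes, few shifted genes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_genes))
    y = np.arange(n_samples) % 2          # alternating class labels 0/1
    X[y == 1, :n_informative] += 3.0      # class 1 over-expresses some genes
    return X, y


def fit_bagged_ensemble(X, y, n_models=5, n_components=10, seed=0):
    """Fit n_models boosters, each on its own bootstrap sample (bagging)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    members = []
    for m in range(n_models):
        # Stratified bootstrap: resample with replacement inside each class,
        # so every ensemble member sees all classes.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=int(np.sum(y == c)),
                       replace=True)
            for c in classes
        ])
        # Normalization followed by PCA, fitted on the bootstrap sample only.
        prep = make_pipeline(StandardScaler(),
                             PCA(n_components=n_components, random_state=seed))
        Z = prep.fit_transform(X[idx])
        model = make_base_model(seed + m)
        model.fit(Z, y[idx])
        members.append((prep, model))
    return classes, members


def predict_bagged(classes, members, X):
    """Average class probabilities over members, then take the argmax."""
    probs = np.mean([model.predict_proba(prep.transform(X))
                     for prep, model in members], axis=0)
    return classes[np.argmax(probs, axis=1)]


X, y = make_toy_data()
X_train, y_train = X[:60], y[:60]
X_test, y_test = X[60:], y[60:]

classes, members = fit_bagged_ensemble(X_train, y_train)
predictions = predict_bagged(classes, members, X_test)
accuracy = float(np.mean(predictions == y_test))
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Fitting the scaler and PCA inside each bootstrap replicate, rather than once on the full dataset, keeps the held-out samples out of the pre-processing step. Whether the paper does this is not stated in the abstract. The toy accuracy says nothing about the reported results on the Leukemia, Colon, or Prostate datasets.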

Authors

Soumya Madduru (1), Pitty Nagarjuna (2)
(1) Srinivasa Ramanujan Institute of Technology, India; (2) Indian Institute of Science, Bengaluru, India

Keywords

Microarray Data, CatBoost Algorithm, Gene Expression, Disease Classification, Ensemble Learning

Published By
ICTACT
Date of Publication
April 2025
Pages
3814 - 3819