Abstract
Customer churn prediction has become increasingly important for subscription-based businesses in a competitive and transparent digital market. Machine learning (ML) has shown strong potential in identifying customers at risk of churn. However, real-world datasets are often noisy and incomplete, making data preprocessing essential for reliable model performance.
This study investigates two research questions: (1) How accurately can ML predict customer churn? (2) How does data preprocessing influence predictive performance?
Using structured customer-related data from a business-to-business subscription company, five ML algorithms – Logistic Regression, Random Forest, Extreme Gradient Boosting, Support Vector Machine, and Neural Networks – are applied to predict churn. The models are trained on both raw and preprocessed datasets, allowing a comparative evaluation. Preprocessing includes data cleaning, transformation, feature reduction through Lasso regularization, and class balancing with SMOTE. In addition to comparing models on raw versus preprocessed data, the study assesses the performance impact of each individual preprocessing step.
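To illustrate the class-balancing step, the following is a minimal sketch of the interpolation idea behind SMOTE, not the implementation used in the study. The function name `smote_oversample`, the parameter values, and the toy data are hypothetical and chosen only for illustration; a real analysis would typically rely on an established library.

```python
import random


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two numeric features per churned customer.
churners = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_oversample(churners, n_new=6)
```

Because every synthetic sample lies on a line segment between two existing churners, the new points stay within the region already occupied by the minority class rather than duplicating records exactly.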
The results show that none of the ML models reached the targeted thresholds for strong predictive performance (AUC and F1 > 0.80), whether trained on the raw or on the preprocessed dataset. However, the findings indicate that data preprocessing significantly influences performance for less complex models, such as Support Vector Machines and Logistic Regression, while having a smaller impact on more complex models such as Random Forest and Extreme Gradient Boosting.
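The performance criterion above (AUC and F1 both above 0.80) can be sketched as follows. This is an illustrative check under stated assumptions: the helper names, the 0.5 classification cutoff, and the toy labels and scores are hypothetical and are not taken from the study's data.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (churn) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random churner is scored above a random
    non-churner; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_target(y_true, scores, threshold=0.80, cutoff=0.5):
    """True only if both AUC and F1 exceed the threshold.
    The 0.5 cutoff for turning scores into labels is an assumption."""
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    return (roc_auc(y_true, scores) > threshold
            and f1_score(y_true, y_pred) > threshold)


# Hypothetical toy example: 3 churners among 8 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, [1 if s >= 0.5 else 0 for s in scores])
ok = meets_target(y_true, scores)
```

In this toy example the AUC exceeds 0.80 while the F1 score does not, so the joint criterion is not met; requiring both metrics guards against a model that ranks customers well but classifies poorly at the chosen cutoff.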
This study contributes both practically and methodologically by highlighting the impact of data preparation in predictive analytics and by providing recommendations for companies aiming to improve their churn prediction strategies.