Elo Customer Loyalty Prediction — Case Study

Helping a Leading Bank Understand Customer Loyalty

Kundan Jha
10 min read · May 7, 2021


Photo by Jievani Weerasinghe on Unsplash

The practical applications of machine learning drive business results that can dramatically affect a company's bottom line. Companies are always searching for ways to reduce operational costs and maximize revenue, most often through marketing and customer promotions. In e-commerce, machine learning models are built to understand the most important aspects and preferences of a customer's life cycle. In this case study, I tackle one such problem with the help of machine learning.

Objective : Predict Customer Loyalty Score — an end-to-end machine learning case study

Overview :

I will walk you through my approach to solving the problem, following these steps.

  1. Business Problem
  2. Mapping to ML Problem
  3. Exploratory Data Analysis
  4. Baseline Model
  5. Designing Advanced Features
  6. Model Building
  7. Deployment
  8. Conclusion
  9. Future Work

1. Business Problem :

Elo Merchant Category Recommendation is a Kaggle competition hosted by Elo. As a payment brand, Elo has built partnerships with merchants in order to offer promotions and discounts to cardholders. These programs nudge customers toward using Elo more often. But do the promotions actually work for customers and merchants? Elo needs to retain its customers, so loyalty towards the brand is crucial. For example, a customer who uses an Elo card with diverse merchants over a long period signals high loyalty. The problem is to find a metric that reflects a cardholder's loyalty to the Elo payment brand.

2. Mapping to ML Problem :

In machine learning terms, we need a metric that quantifies a customer's loyalty. A loyalty score is assigned to each card_id present in the train data.

Input Features : Cardholder's purchase history, usage time, etc.

Target Variable : Loyalty Score

The loyalty score is the target variable the machine learning model should predict. What is loyalty? According to the Data_Dictionary.xlsx, loyalty is a numerical score calculated two months after the historical and evaluation period. The loyalty score depends on many aspects of customer behavior: purchase history, usage time, merchant diversity, etc. Loyalty scores are real numbers, which immediately tells us that a supervised regression model is needed to solve this problem. The evaluation metric provided in the competition is RMSE.
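
Since this metric drives every modeling decision below, here is a minimal sketch of how RMSE is computed, assuming y_true and y_pred are NumPy arrays of actual and predicted loyalty scores:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted loyalty scores."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```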

3. Exploratory Data Analysis :

Data Source : Kaggle

The problem has 5 datasets.

  1. train.csv : Has 6 features — first_active_month, card_id, feature_1, feature_2, feature_3 and the target.
  2. test.csv : The test set has the same features as the train set, without the target.
  3. historical_transactions.csv : Contains up to 3 months' worth of historical transactions for each card_id.
  4. new_merchant_transactions.csv : Contains 2 months' worth of data for each card_id, covering all purchases that the card_id made at merchant_ids not visited in the historical data.
  5. merchants.csv : Additional information about all merchants represented by merchant_id in the dataset.

The datasets are largely anonymized, and the meanings of the features are not explained. None of these datasets contains text features; we only have categorical and numerical features.

  • Train and Test data :

The train data has the target value. The PDF of the target variable shows outliers around -30, and the target values appear to be normalized with a pre-decided mean and standard deviation. Handling these outliers could be one of the main challenges of this competition. They may represent fraud, credit default, etc., i.e. they are important, and they seem to have been purposely introduced into the loyalty score.

Image by Author

The target values are not skewed across the categories of the three anonymized features (feature_1, feature_2, feature_3). This suggests these features alone aren't very good at predicting the target; we need to design more features through feature engineering.

Image by Author

Looking at the feature first_active_month and the distribution of data across years, most of the data lies in the years 2016 to 2018, and the count trends for the train and test data are similar.

Image by Author

There is no multicollinearity problem in the train data, as the VIF (variance inflation factor) values for all three features are well under 10. The distributions of the train and test sets are almost identical, so there is no need for time-based splitting.

Image by Author
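
For completeness, here is a minimal sketch of the multicollinearity check, assuming train.csv is in the working directory and using statsmodels' variance_inflation_factor:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

train = pd.read_csv("train.csv")
features = train[["feature_1", "feature_2", "feature_3"]]

# VIF per feature; values well under 10 mean no serious multicollinearity.
vif = pd.DataFrame({
    "feature": features.columns,
    "VIF": [variance_inflation_factor(features.values, i)
            for i in range(features.shape[1])],
})
print(vif)
```
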
  • Historical & New-Merchants transactions data :

Both files contain the same features about each customer's transactions, but recorded over different time frames. Of the 14 features, 6 are IDs. Of the remaining 8 features, 5 — month_lag, installments, category_1, category_2 and category_3 — don't reveal much about the target variable's relationship with the card_ids, as they have almost identical distributions over the target value.

The authorized_flag feature is important for predicting the loyalty score: if a card's transactions are approved most of the time, there is a good chance the card has a high loyalty score. A crucial observation — in the new-merchant transactions there are no unauthorized transactions.

Image by Author

The transactions are time dependent, so features engineered from the purchase_date feature will be among the most crucial for prediction, even though its attributes have almost identical distributions over the target value. Nevertheless, let's have a look:

Image by Author — Number of transactions vs hour(derived from purchase_date)

The purchase_amount and the number of transactions are critical features influencing the loyalty score. The number of transactions is a derived feature: group all transactions by card_id, and each group's size is the number of transactions made with that card. The key observation in both features is that most of the outliers in the target (values around -30) have very few transactions and low purchase_amount. This strongly suggests that as the number of transactions and purchase_amount increase, customers become more loyal and the target score rises.

Image by Author
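
The derived feature described above boils down to a simple groupby. A minimal sketch, assuming historical_transactions.csv is available locally:

```python
import pandas as pd

hist = pd.read_csv("historical_transactions.csv")

# Each group's size is the number of transactions made with that card.
card_stats = hist.groupby("card_id").agg(
    num_transactions=("purchase_amount", "size"),
    total_purchase_amount=("purchase_amount", "sum"),
).reset_index()
```
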
  • Merchants data :

In this case study, the merchants data is not used to build the models. The reason: merchants.csv holds information about the merchants themselves, which is not helpful for predicting a customer's loyalty score and unnecessarily increases the complexity of the problem.

The major insights at the end of EDA — the data is incomplete, as NaN values are present in the merchants, historical and new-merchant transactions, so these missing values must be imputed for better prediction. The categorical features, which outnumber the numerical features across the dataset, should be one-hot encoded. The given features are not sufficient for training; more features must be designed with the help of domain knowledge and the business problem at hand.

4. Baseline Model :

My approach for the baseline model is based entirely on my understanding of the business problem. I asked myself which features are truly crucial in this domain, the ones customer loyalty depends on. At first pass, I came up with num_of_transactions, purchase_amount, favorite merchant and number of transactions at the favorite merchant, plus some date-time features like first_active_month, purchase_date and dormancy (the inactive period of a card). Since most of these features are not explicitly given in the data, I had to derive them all through feature engineering.
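
Dormancy is the least obvious of these, so here is one plausible way to compute it — a sketch under the assumption that dormancy means the number of days between a card's last recorded transaction and the end of the observation window:

```python
import pandas as pd

hist = pd.read_csv("historical_transactions.csv", parse_dates=["purchase_date"])

# Assumed definition: days from each card's last purchase to the
# latest date observed anywhere in the data.
reference_date = hist["purchase_date"].max()
last_purchase = hist.groupby("card_id")["purchase_date"].max()
dormancy = (reference_date - last_purchase).dt.days.rename("dormancy_days")
```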

The model gives an RMSE of 3.72 on the train data, which is fairly good considering that the data contains outliers we cannot remove.

Image by Author

5. Designing Advanced Features :

Apart from the baseline features, here are the other feature engineering techniques I implemented:

  • Aggregation of Features : For the numerical and categorical features in the data, basic statistics such as sum, min, max, median, mean, standard deviation and number of unique values are calculated to capture every aspect of a feature and produce a single value per card_id for each feature (a sketch follows this list).
  • Derived Features : Some abstract but crucial features can be really important for predicting loyalty towards the brand, and they have to be derived with feature engineering techniques. Examples include the number of authorized and denied transactions, differences in the amounts spent with a card, and various ratio features such as the ratio of the number of transactions to the number of days between them. All these derived features are also aggregated with the required statistics before being fed into the model.
  • Holidays — Influential Days : I check whether a purchase was made within 15 days before a festival. If so, the day is considered influential, as it impacts the pattern of the user's transactions.
  • Date-Time Features : The transactions are time dependent, so features engineered from purchase_date, which records the timestamp of each transaction, are among the most crucial for prediction. Multiple features can be designed from these timestamps — whether the day is a weekday, a weekend, a festival or a holiday, differences between dates, and the first and last registered dates.
  • Features from Baseline Model : The baseline model is based entirely on a human understanding of the business problem. I came up with basic features I believed would matter in real situations for predicting customer loyalty, trained a simple model, checked the feature importances, and carried the topmost features of the baseline model into the final feature set.
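
As promised above, here is a minimal sketch of the aggregation step, using a representative subset of columns and statistics (the real pipeline applies this to many more features across both transaction files):

```python
import pandas as pd

hist = pd.read_csv("historical_transactions.csv", parse_dates=["purchase_date"])

# One-hot encode a low-cardinality categorical before aggregating;
# category_2 and category_3 would be handled the same way.
hist = pd.get_dummies(hist, columns=["category_1"])

agg_spec = {
    "purchase_amount": ["sum", "min", "max", "mean", "median", "std"],
    "installments":    ["sum", "mean", "std"],
    "merchant_id":     ["nunique"],       # merchant diversity
    "purchase_date":   ["min", "max"],    # first and last registered dates
}

# One row per card_id, one column per (feature, statistic) pair.
agg = hist.groupby("card_id").agg(agg_spec)
agg.columns = ["hist_" + "_".join(col) for col in agg.columns]
agg = agg.reset_index()
```

The resulting per-card table is then merged onto the train and test data on card_id.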

The categorical features in the transaction data are one-hot encoded before the features above are designed. After performing these steps, I merge all the data into a single file for training and drop the columns that are no longer needed. In the end, I created 195 features across all the data, and all of them are fed into the model for training and prediction.

6. Model Building :

A literature survey of the available blogs and Kaggle kernels made it pretty clear that simple models do not perform well at predicting the target value, so I started with complex models. Remember that performance is measured with the competition metric, RMSE.

I experimented with various models: AdaBoost, XGBoost, LightGBM, stacking, an MLP (5 layers with dropout) and a Convolution-1D model. Among all of these, the best performing models are XGBoost, LightGBM and stacking with a Ridge meta-learner.

  • XGBoost : The XGBoost model was my first step towards the solution. Hyperparameter tuning and cross-validation were done using RandomizedSearchCV and StratifiedKFold respectively. XGBoost gives an RMSE of 3.657, a solid improvement over the baseline score of 3.72.
  • LightGBM : LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms. The parameter tuning was done using Optuna, a library dedicated to running repeated trials on the data to find the best results. LightGBM gives an even better RMSE of 3.651, an improvement over the XGBoost score of 3.657.
  • Stacking : To improve the score further, I tried stacking with a meta-learner. Stacking is another ensemble method that builds a model on top of the base models' predictions. I used the predictions of the two models above for stacking with a Ridge meta-learner (sketches of both steps follow below). This model gives the best RMSE of 3.65. One major achievement for me is that my public and private ranks on the Kaggle leaderboard are almost identical, which indicates the model generalizes well.
Image by Author
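
To make the two winning steps concrete, here are minimal sketches. First, an Optuna tuning loop for LightGBM; the search space and trial count are my own assumptions, not the exact values used:

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def tune_lgbm(X, y, n_trials=50):
    """Search LightGBM hyperparameters with Optuna, minimizing CV RMSE."""
    def objective(trial):
        # Assumed search space for illustration only.
        params = {
            "num_leaves": trial.suggest_int("num_leaves", 31, 255),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        }
        model = lgb.LGBMRegressor(n_estimators=1000, **params)
        # cross_val_score returns negated RMSE, so negate it back.
        return -cross_val_score(model, X, y, cv=5,
                                scoring="neg_root_mean_squared_error").mean()

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

And second, the stacking step with a Ridge meta-learner. The oof_* and test_* arguments are assumed to be out-of-fold and test-set predictions produced earlier by the two base models, and alpha=1.0 is an assumption:

```python
import numpy as np
from sklearn.linear_model import Ridge

def stack_with_ridge(oof_xgb, oof_lgb, y_train, test_xgb, test_lgb):
    """Fit a Ridge meta-learner on the base models' out-of-fold
    predictions and blend their test-set predictions."""
    X_meta = np.column_stack([oof_xgb, oof_lgb])
    X_test = np.column_stack([test_xgb, test_lgb])
    meta = Ridge(alpha=1.0)
    meta.fit(X_meta, y_train)
    return meta.predict(X_test)
```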

Summary of model scores on Kaggle :

Image by Author

7. Deployment :

I have also deployed the best model using Flask and ngrok, with a basic user interface where users enter a card_id and submit. The app computes the other features for that card_id from the transaction data and then outputs a predicted loyalty score.
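
A minimal sketch of what such a Flask endpoint could look like; predict_loyalty is a hypothetical helper standing in for the featurize-then-predict logic described above:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    card_id = request.form["card_id"]
    # Hypothetical helper: builds the card's features from the
    # transaction data and runs the trained model on them.
    score = predict_loyalty(card_id)
    return {"card_id": card_id, "loyalty_score": float(score)}

if __name__ == "__main__":
    app.run()
```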

Video by Author

8. Conclusion :

  • This case study reveals the true power of feature engineering for building an optimized machine learning model.
  • Feature engineering was the most challenging task in this case study, as very few features were provided in the train data. Features had to be carefully designed; a total of 195 were extracted.
  • The transaction data was very important for designing features, and domain knowledge certainly helps in building better ones.
  • The most important features were dormancy (the inactive period of a card), aggregated features from purchase_date and purchase_amount, and features derived from authorized_flag.
  • The stacked model built on the XGBoost and LightGBM predictions with a Ridge meta-learner gives the best Kaggle score of 3.61473.

9. Future Work :

  • Try linear stacking as described in the 1st-place solution, which reportedly gives a boost of about 0.015 in local CV compared with training directly on the same features.
  • Feature selection can improve the model's performance by removing the noise added by unwanted features.
  • As far as neural networks are concerned, different architectures can be tried to increase the model's performance.
  • Including engineered features from the merchants data could help improve the model's predictions.

Thank you for reading the blog. Tell me your thoughts below!

GitHub Repository : Elo-Customer-Loyalty-Prediction

LinkedIn Handle : Kundan Jha
