Customer Data Churn Analysis

  Example on the Telco Customer Churn dataset

Posted by Haby on September 2, 2018
  1. Import Data and Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
print(os.listdir("../input"))
import warnings
warnings.filterwarnings("ignore")
['WA_Fn-UseC_-Telco-Customer-Churn.csv']
df = pd.read_csv("../input/WA_Fn-UseC_-Telco-Customer-Churn.csv",na_values = [" "])
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
# check NA
df.isna().sum()
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
# drop NA
df = df.dropna()

2. Data Analysis

# drop customerID and churn to make train dataset
customerID = df["customerID"]
Churn = df["Churn"]
df = df.drop(["customerID","Churn"],axis = 1)
# change data type :
df["SeniorCitizen"] = df["SeniorCitizen"].astype("object")

print(df.dtypes)
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
dtype: object

Customer churn is usually an imbalanced-class problem. For data with a churn rate below 20%, I would consider using ROC AUC as the evaluation metric; in this case the churn rate is 26%, so I will use accuracy (a sketch of the ROC AUC alternative follows the note below).

  • When there is a modest class imbalance like 4:1 in the example above it can cause problems.
    • From https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-datasets
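If the classes were more imbalanced, a different metric would be warranted. As a minimal sketch (my addition, not part of the original notebook), ROC AUC could be used as the cross-validation metric instead of accuracy, re-encoding the target explicitly:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

y_bin = (Churn == "Yes").astype(int)   # make the positive class explicit for the AUC scorer
X_enc = pd.get_dummies(df)             # same one-hot encoding used later in the notebook
auc = cross_val_score(LogisticRegression(random_state=1), X_enc, y_bin,
                      cv=KFold(10, shuffle=True, random_state=1),
                      scoring="roc_auc").mean()
print("Mean CV ROC AUC:", round(auc, 4))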
# Churn Rate
print(Churn.value_counts())
print("-"*40)
print("Churn Rate", sum(Churn == "Yes")/len(Churn))
No     5163
Yes    1869
Name: Churn, dtype: int64
----------------------------------------
Churn Rate 0.26578498293515357
# plot numeric values
_,ax = plt.subplots(3,1,figsize = (9,9))
for i,c in enumerate(df.select_dtypes([np.number]).columns) :
    # print(i,c)
    sns.distplot(df[c],ax = ax[i])

[Figure: distribution plots of the numeric features (tenure, MonthlyCharges, TotalCharges)]

Many of the numeric values are 0 and the skewness is high, so scaling these values is worth considering (a sketch follows).
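As a sketch of that scaling step (my addition; the notebook itself does not apply it), the numeric columns could be standardized like this:

from sklearn.preprocessing import StandardScaler

num_cols = df.select_dtypes(include=[np.number]).columns
df_scaled = df.copy()
df_scaled[num_cols] = StandardScaler().fit_transform(df_scaled[num_cols])  # zero mean, unit variance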

Plot all categorical variables

_,ax = plt.subplots(4,4,figsize = (16,16))
for i in range(0,4) :
    for j in range(0,4) :
        sns.countplot(df.select_dtypes(exclude=["number"]).iloc[:,4*i+j],ax = ax[i,j],palette = "muted")

[Figure: count plots of the categorical features]

3. Data Preprocessing

# pipeline with scaling / one-hot encoding and LR classification as the baseline model
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


df_dummies = pd.get_dummies(df)
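The imports above mention scaling and one-hot encoding, but the notebook sticks with pd.get_dummies. For reference, a ColumnTransformer version of that scale / one-hot / LR pipeline could look like this sketch (my addition):

from sklearn.compose import ColumnTransformer

num_cols = df.select_dtypes(include=[np.number]).columns
cat_cols = df.select_dtypes(exclude=[np.number]).columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                        # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one-hot encode categoricals
])
pipe_lr = Pipeline([("prep", preprocess), ("lr", LogisticRegression(random_state=1))])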

# train test split
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(df_dummies,Churn,test_size = 0.25,random_state = 1)

4. Baseline Modeling

# baseline model with Logistic Regression
from sklearn.model_selection import cross_val_score,KFold

cv =  KFold(10,random_state=1)
baseline = LogisticRegression(random_state = 1)
baseline.fit(x_train,y_train)
cv_bl = cross_val_score(baseline,x_test,y_test,cv =cv).mean()

#bl_pred = baseline.predict(x_test)
print("Baseline model accuracy is :",cv_bl) # 0.8048993506493506
Baseline model accuracy is : 0.8048993506493506

5. Feature Engineering

  1. Categorical vs Churn
_,ax = plt.subplots(4,4,figsize = (16,16))
for i in range(0,4) :
    for j in range(0,4) :
        sns.countplot(df.select_dtypes(exclude=["number"]).iloc[:,4*i+j],ax = ax[i,j],palette = "muted",hue = Churn,dodge = False)

[Figure: count plots of the categorical features split by Churn]

What we know:

1. Senior citizens are more likely to churn than younger customers.

2. Customers without a partner are more likely to churn than those with one.

3. Customers without dependents are more likely to churn than those who have them.

4. Customers with no internet service are more likely to stay, perhaps because they have fewer channels through which to hear about alternatives.

5. Customers with OnlineSecurity, OnlineBackup, DeviceProtection, or TechSupport are more likely to stay.

6. Customers with a contract of one year or longer are more likely to stay, and the longer the contract, the lower the churn.

7. Customers without paperless billing are more likely to stay.

8. More customers on the electronic check payment method are lost.

_,ax = plt.subplots(3,1,figsize = (9,12))
for i,c in enumerate(df.select_dtypes([np.number]).columns) :
    # print(i,c)
    sns.boxplot(x = df[c],y = Churn,ax = ax[i],palette = "muted")

[Figure: box plots of the numeric features by Churn]

1. Customers with low tenure and high monthly charges are more likely to churn.

2. Clients with low total charges are more likely to churn, although there are many outliers.

3. Some clients have TotalCharges equal to MonthlyCharges, which means they have used the service for only one month (or spent nothing in other months).

# short term
df["Short_term"] =  (df.TotalCharges == df.MonthlyCharges).astype('object')
sns.countplot(df["Short_term"],palette = "muted",hue = Churn,dodge = False)
[Figure: count plot of the Short_term flag by Churn]

Almost all short-term clients are churned customers.
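A quick cross-tab (my addition) makes the proportions behind the plot explicit:

print(pd.crosstab(df["Short_term"], Churn, normalize="index"))  # churn rate within each Short_term group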

Check how the number of months stayed relates to churn

# how many month they stay here
df["Month_Stay"] = df.TotalCharges/df.MonthlyCharges
sns.boxplot(y =df["Month_Stay"] ,x = Churn,palette = "muted")
[Figure: box plot of Month_Stay by Churn]

A lower month-stay clearly corresponds to more churn.

df["long_contract_no_internet"] = ((df.Contract == "Two year") & (df.InternetService == "No")).astype("object")
sns.countplot(df["long_contract_no_internet"],palette = "muted",hue = Churn,dodge = False)
[Figure: count plot of long_contract_no_internet by Churn]

Almost all long_contract_no_internet clients are loyal customers.

6. Modeling

df = pd.get_dummies(df) # one-hot encode the categorical features (k dummies per level by default)
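# Note (added): get_dummies defaults to k dummies per level; pass drop_first=True
# (i.e. pd.get_dummies(df, drop_first=True)) if k-1 dummies per feature are preferred.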
x_train,x_test,y_train,y_test = train_test_split(df,Churn,test_size = 0.25,random_state = 1)
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb
from sklearn.neighbors import KNeighborsClassifier

# all algorithms I want to use
algorithms = pd.DataFrame({
    "Name" : ["Logistic Regression","Random Forest","GBM","DTree","LGBM","KNN"],
    "Algorithm" : [LogisticRegression(random_state=1),RandomForestClassifier(random_state = 1),
                   GradientBoostingClassifier(random_state=1),DecisionTreeClassifier(random_state=1),
                   lgb.LGBMClassifier(random_state=1),KNeighborsClassifier()]
})
# default score with default parameters
from sklearn.feature_selection import SelectKBest

score = []
for i in range(1,len(algorithms)+1) :
    clf = Pipeline([
        ("featureselection",SelectKBest(k = x_train.shape[1])),
        (algorithms.Name[i-1],algorithms.Algorithm[i-1])
    ])
    clf.fit(x_train,y_train)
    score.append(cross_val_score(clf,x_test,y_test,cv = cv).mean())
    #print("{} accuracy is {}:".format(algorithms.Name[i-1],(pred == y_test).mean()))
algorithms["Score"] = score
algorithms
Name Algorithm Score
0 Logistic Regression LogisticRegression(C=1.0, class_weight=None, d... 0.804899
1 Random Forest (DecisionTreeClassifier(class_weight=None, cri... 0.778149
2 GBM ([DecisionTreeRegressor(criterion='friedman_ms... 0.785000
3 DTree DecisionTreeClassifier(class_weight=None, crit... 0.732091
4 LGBM LGBMClassifier(boosting_type='gbdt', class_wei... 0.782714
5 KNN KNeighborsClassifier(algorithm='auto', leaf_si... 0.761666

7. Parameter Tuning

  • Logistic Regression
# feature selection for LR
lr_score = []
for i in np.arange(0.5,1.5,0.2) :
    for j in range(20,x_train.shape[1]+1) :
        clf = Pipeline([
                ("featureselection",SelectKBest(k = j)),
                ("LR",LogisticRegression(random_state=1,C = i))])
        clf.fit(x_train,y_train)
        cvs = cross_val_score(clf,x_test,y_test,cv = cv).mean()
        lr_score.append(cvs)
        #print("Score is {} with C = {} and {} feature".format(cvs,i,j))
print("Best Score is {}".format(max(lr_score)))
# Score is 0.8157077922077922 with C = 1.3 and 31 feature
Best Score is 0.8157077922077922
from sklearn.metrics import classification_report,confusion_matrix
clf_lr = Pipeline([
                ("featureselection",SelectKBest(k = 31)),
                ("LR",LogisticRegression(random_state=1,C = 1.3))])
pred = clf_lr.fit(x_train,y_train).predict(x_test)
print(classification_report(y_test,pred))
sns.heatmap(confusion_matrix(y_test,pred),annot=True,cmap = "PuBuGn")
              precision    recall  f1-score   support

          No       0.84      0.90      0.87      1294
         Yes       0.66      0.54      0.59       464

   micro avg       0.81      0.81      0.81      1758
   macro avg       0.75      0.72      0.73      1758
weighted avg       0.80      0.81      0.80      1758

[Figure: confusion-matrix heatmap for the tuned Logistic Regression]

The recall (TPR) for “Yes” is only 0.54, which is not good; the model makes poor predictions on Churn = “Yes”. One possible remedy is sketched below.
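One option, shown here as an addition rather than part of the original tuning, is to re-weight the minority class and trade some precision for recall on “Yes”:

clf_lr_bal = Pipeline([
                ("featureselection",SelectKBest(k = 31)),
                ("LR",LogisticRegression(random_state=1,C = 1.3,class_weight="balanced"))])
pred_bal = clf_lr_bal.fit(x_train,y_train).predict(x_test)
print(classification_report(y_test,pred_bal))  # recall on "Yes" should rise, at the cost of precision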

  • GBM
# parameter tuning in GBM
# default value is 0.785000

from sklearn.model_selection import GridSearchCV
clf_pl = Pipeline([
                ("featureselection",SelectKBest(k = x_train.shape[1])),
                ("GBM",GradientBoostingClassifier(random_state=1))])
param = {
    "GBM__learning_rate" : np.arange(0.06,0.08,0.01),
    "GBM__n_estimators" : range(59,63,2),
    'GBM__max_depth':range(6,8,1),
    'GBM__min_samples_split':range(485,511,5),
    "GBM__subsample": [0.8]
}
clf_gbm = GridSearchCV(clf_pl,param,cv = cv,n_jobs = -1,verbose = 1)
clf_gbm.fit(x_train,y_train)
print(clf_gbm.best_params_)
print("-"*40)
print("Best Score is {}".format(clf_gbm.best_score_)) # 0.8103905953735305
Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   23.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  4.8min finished


{'GBM__learning_rate': 0.06999999999999999, 'GBM__max_depth': 7, 'GBM__min_samples_split': 495, 'GBM__n_estimators': 59, 'GBM__subsample': 0.8}
----------------------------------------
Best Score is 0.8088737201365188
gbm = GradientBoostingClassifier(random_state=1,learning_rate=0.07,max_depth=7,min_samples_split=495,n_estimators=59,subsample=0.8)
pred = gbm.fit(x_train,y_train).predict(x_test)
print(classification_report(y_test,pred))
sns.heatmap(confusion_matrix(y_test,pred),annot=True,cmap = "PuBuGn")
              precision    recall  f1-score   support

          No       0.83      0.91      0.87      1294
         Yes       0.66      0.48      0.56       464

   micro avg       0.80      0.80      0.80      1758
   macro avg       0.75      0.70      0.71      1758
weighted avg       0.79      0.80      0.79      1758

[Figure: confusion-matrix heatmap for the tuned GBM]

The GBM gives a better estimate on Churn = “No” but still performs poorly on “Yes”.

  • Decision Tree
# 0.732091
clf_dt = Pipeline([
                ("featureselection",SelectKBest(k = x_train.shape[1])),
                ("DTree",DecisionTreeClassifier(random_state=1))])
param = {
    "DTree__criterion": ["gini","entropy"],
    "DTree__max_depth" :range(4,9),
    "DTree__min_samples_split":np.arange(0.1,0.6,0.1),
    "DTree__min_samples_leaf":np.arange(0.1,0.5,0.1),
    #"DTree__max_features " : ["auto","log2","sqrt"],
    "DTree__class_weight" : ["balanced",None]
}
clf_dt = GridSearchCV(clf_dt,param,cv = cv,n_jobs = -1,verbose = 1)
clf_dt.fit(x_train,y_train)
print(clf_dt.best_params_)
print("-"*40)
print("Best Score is {}".format(clf_dt.best_score_)) # 0.7897231702692453
Fitting 10 folds for each of 400 candidates, totalling 4000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 876 tasks      | elapsed:   14.7s
[Parallel(n_jobs=-1)]: Done 1576 tasks      | elapsed:   26.6s
[Parallel(n_jobs=-1)]: Done 2476 tasks      | elapsed:   42.2s
[Parallel(n_jobs=-1)]: Done 3576 tasks      | elapsed:   59.3s


{'DTree__class_weight': None, 'DTree__criterion': 'gini', 'DTree__max_depth': 4, 'DTree__min_samples_leaf': 0.1, 'DTree__min_samples_split': 0.1}
----------------------------------------
Best Score is 0.7897231702692453


[Parallel(n_jobs=-1)]: Done 4000 out of 4000 | elapsed:  1.1min finished
import graphviz
from sklearn.tree import export_graphviz
dt = DecisionTreeClassifier(random_state=1,criterion="gini",max_depth=4,min_samples_leaf=0.1,min_samples_split=0.1)
dt.fit(x_train,y_train)
dot_data = export_graphviz(dt, out_file=None,
                         feature_names=x_train.columns,  
                         class_names=dt.classes_,
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

[Figure: visualization of the tuned decision tree]

Tuned_Score = [0.8157077922077922,0,0.8088737201365188,0.7897231702692453,0,0]
algorithms["Tuned Score"] = Tuned_Score
algorithms
Name Algorithm Score Tuned Score
0 Logistic Regression LogisticRegression(C=1.0, class_weight=None, d... 0.804899 0.815708
1 Random Forest (DecisionTreeClassifier(class_weight=None, cri... 0.778149 0.000000
2 GBM ([DecisionTreeRegressor(criterion='friedman_ms... 0.785000 0.808874
3 DTree DecisionTreeClassifier(class_weight=None, crit... 0.732091 0.789723
4 LGBM LGBMClassifier(boosting_type='gbdt', class_wei... 0.782714 0.000000
5 KNN KNeighborsClassifier(algorithm='auto', leaf_si... 0.761666 0.000000

For Future Study:

  • Deeper analysis of Churn = “Yes”
  • More feature engineering
  • Normalizing the numeric data, or binning it into categories
  • Using ensembling to combine several simple algorithms (see the sketch below)
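As a sketch of that last idea (my addition, reusing the tuned parameters from above), a soft-voting ensemble of the three tuned models could be evaluated like this:

from sklearn.ensemble import VotingClassifier

voter = VotingClassifier(estimators=[
        ("lr", LogisticRegression(random_state=1, C=1.3)),
        ("gbm", GradientBoostingClassifier(random_state=1, learning_rate=0.07, max_depth=7,
                                           min_samples_split=495, n_estimators=59, subsample=0.8)),
        ("dt", DecisionTreeClassifier(random_state=1, max_depth=4,
                                      min_samples_leaf=0.1, min_samples_split=0.1))],
        voting="soft")  # average the predicted probabilities of the three models
print("Voting ensemble accuracy:", cross_val_score(voter, x_test, y_test, cv=cv).mean())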