24 minute read

Banner made from a photo by Sven Mieke on Unsplash

Introduction

Side note: all the data visualizations are interactive, but they can’t be dynamic on GitHub Pages, so if you want to play with them check out my kernel on Kaggle.

About Dataset

The market research team at AdRight is assigned the task of identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The market research team decides to investigate whether there are differences across the product lines with respect to customer characteristics. The team decides to collect data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months.

The customer variables to study are:

  • product purchased: TM195, TM498, or TM798
  • gender
  • age, in years
  • education, in years
  • relationship status: single or partnered
  • annual household income ($)
  • average number of times the customer plans to use the treadmill each week
  • average number of miles the customer expects to walk/run each week
  • self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.

import numpy as np
import pandas as pd

# dataviz with plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

# for quick viz, set backend
pd.options.plotting.backend = "plotly"

# a few visualizations are made with seaborn & matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# stats
from scipy import stats

#from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn import preprocessing

# Import module for k-prototype clustering
from kmodes.kprototypes import KPrototypes

import warnings
warnings.filterwarnings('ignore')

First insight

Let’s look at the last lines of our dataset:

df = pd.read_csv('CardioGoodFitness.csv')
df.tail()
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
175 TM798 40 Male 21 Single 6 5 83416 200
176 TM798 42 Male 18 Single 5 4 89641 200
177 TM798 45 Male 16 Single 5 5 90886 160
178 TM798 47 Male 18 Partnered 4 5 104581 120
179 TM798 48 Male 18 Partnered 4 5 95508 180

The number of records is quite low, as is the number of variables:
/!\ Beware: is the dataset really representative of treadmill users in general? This is an open question: since there are not many records, a sample with many more rows would be more reliable! So don’t take all the following insights for granted…

df.shape
(180, 9)

There isn’t any missing value (NaN) and the type of each column is appropriate:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product        180 non-null    object
 1   Age            180 non-null    int64 
 2   Gender         180 non-null    object
 3   Education      180 non-null    int64 
 4   MaritalStatus  180 non-null    object
 5   Usage          180 non-null    int64 
 6   Fitness        180 non-null    int64 
 7   Income         180 non-null    int64 
 8   Miles          180 non-null    int64 
dtypes: int64(6), object(3)
memory usage: 12.8+ KB

There aren’t any duplicated rows:

df.duplicated().sum()
0

By looking at the basic statistics, we can see that:

  • all products were bought by people mostly between 18 and 50 years old.
  • 75% of purchases were made by people below 33.
  • most people use their treadmill around 3 times a week.
  • there are huge disparities in the incomes…
df.describe().T.round(1)

# if you wish to include qualitative variables:
#df.describe(include='all').T.round(2)
count mean std min 25% 50% 75% max
Age 180.0 28.8 6.9 18.0 24.0 26.0 33.0 50.0
Education 180.0 15.6 1.6 12.0 14.0 16.0 16.0 21.0
Usage 180.0 3.5 1.1 2.0 3.0 3.0 4.0 7.0
Fitness 180.0 3.3 1.0 1.0 3.0 3.0 4.0 5.0
Income 180.0 53719.6 16506.7 29562.0 44058.8 50596.5 58668.0 104581.0
Miles 180.0 103.2 51.9 21.0 66.0 94.0 114.8 360.0
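
Since the pandas plotting backend was set to plotly in the imports, quick interactive checks can also be made directly from the dataframe. For instance (just an illustration, not a cell from the original kernel):

# quick interactive look at the Income distribution, using the plotly pandas backend
df['Income'].plot.hist()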

Problem Statement

  • Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.
  • Improve the treadmill recommendation based on the customer information & boost sales.
  • Get more insights by clustering the customers: find the specificities of each group

Exploratory Data Analysis

The goal of EDA is to allow data scientists to get deep insight into a data set and at the same time provide specific outcomes that could be extracted from the data set. It includes:

  • List of outliers
  • Estimates for parameters
  • Uncertainties for those estimates
  • List of all important factors
  • Conclusions or assumptions as to whether certain individual factors are statistically essential
  • Optimal settings
  • A good predictive model

Here, our analysis will also help us make recommendations and understand the specificities of the people who use / buy treadmills.

Most of the data visualizations are made with plotly, so they are dynamic: one can get more info by hovering over the various areas of the plots. The charts can also be manipulated with the following buttons:

png

Univariate analysis

Let’s start with our target: the products and their quantities:

px.histogram(df, x="Product", width=500, height=400, title="Product (our target) histogram")

png

Nearly half of the products sold are the TM195. The TM798 is the least represented:

px.pie(
    df.groupby('Product').agg('count')[['Age']].rename(columns={'Age': 'Count'}).reset_index(), 
    values='Count', 
    names='Product', 
    title='Numerical proportion of Products',
    width=400
)

png

Quite a number of people consider themselves in intermediate shape, while very few think they are in poor shape:

px.histogram(df, x="Fitness", width=500, height=400, title="Histogram of fitness")

png

Most of the customers expect to run / walk less than 220 miles on average:

px.histogram(df, x="Miles", width=500, height=400, title="Histogram of miles")

png

Now let’s look at the planned weekly usage:

px.histogram(df, x="Usage", width=500, height=400, title="Histogram of usages")

png

Nearly 60% of the customers are partnered:

px.pie(
    df.groupby('MaritalStatus').agg('count')[['Age']].rename(columns={'Age': 'Count'}).reset_index(), 
    values='Count', 
    names='MaritalStatus', 
    title='Proportion depending on the marital status',
    width=400
)

png

px.pie(
    df.groupby('Gender').agg('count')[['Age']].rename(columns={'Age': 'Count'}).reset_index(), 
    values='Count', 
    names='Gender', 
    title='Proportion depending on the Gender',
    width=400
)

png

Bivariate Analysis

For the TM798, there are far fewer females buying / using it than males, compared to the other products:

px.histogram(df, x="Product", color="Gender", barmode='group', width=500, height=400,
            title="Product counts per Gender")

png

The proportion of Single / Partnered people is roughly the same for each product:

px.histogram(df, x="Product", color="MaritalStatus", barmode='group', width=500, height=400,
            title="Product counts per marital status")

png

From the histogram below, it seems that:

  • most of the buyers use their TM between 2 and 4 times a week
  • the TM195 is mostly bought by the people who use it the least
  • on the contrary, the TM798 is bought by the most frequent users
px.histogram(df, x="Usage", color="Product", barmode='group', width=500, height=400,
            title="Usage counts per Product")

# same code with sns
#sns.countplot(data=df, x="Usage", hue="Product")
#plt.show()

png

From the graph below:

  • the mean income of the buyers is around 50k
  • 25% of the buyers earn less than roughly 42k & 75% less than roughly 54k
  • there are very few people earning more than 70k, especially women.
px.histogram(df, x="Income", color="Gender",
            marginal="box", # or violin, rug
            hover_data=df.columns,
            width=700, height=400, barmode="overlay", nbins=30
            )

png

From the graph below we can say that:

  • the median age of the buyers is around 26
  • 25% of the TM users are younger than roughly 25
  • 75% of them are younger than roughly 33
px.histogram(df, x="Age", color="Gender",
            marginal="box", # or violin, rug
            hover_data=df.columns,
            width=700, height=400, barmode="overlay", nbins=30
            )

png

Some of the important conclusions that can be drawn from the histogram below are:

  • the TM798 is used by people with relatively high incomes
  • the TM498 is bought by people with a yearly income ranging from 45 to 55k
  • while the people with lower incomes tend to use the TM195
px.histogram(df, x="Income", color="Product",
            marginal="box", # or violin, rug
            hover_data=df.columns,
            width=700, height=400, barmode="overlay", nbins=30
            )

png

px.histogram(df, x="Income", color="MaritalStatus",
            marginal="box", # or violin, rug
            hover_data=df.columns,
            width=700, height=400, barmode="overlay", nbins=30
            )

png

More info on plotly’s histogram options here: ‘Histograms with Plotly Express: Complete Guide’ by Vaclav Dekanovsky

Multivariate analysis

There are no particular insights to be gained from a scatter matrix (the code is nonetheless provided if you want to give it a try).

#pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(20, 20));

#px.scatter_matrix(
#    df, 
#    color="Product",
#    height=1000
#)

So let’s dig a little deeper with various scatter plots :

px.scatter(
    df, 
    x="Age", 
    y="Usage",
    color="Gender",
    size='Usage',
    width=800
)

png

From the graph below, it seems that:

  • most women earn less than men
  • there is also a positive trend indicating that income increases with age

If you do the same thing with MaritalStatus, there is no obvious trend to be seen…
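
As a quick check of this trend (not part of the original kernel), one could compute the Pearson correlation between Age and Income with scipy.stats, which is already imported:

# correlation between Age and Income: a positive r would confirm the upward trend
r, p_value = stats.pearsonr(df['Age'], df['Income'])
print(f"Pearson r = {r:.2f} (p-value = {p_value:.3f})")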

px.scatter(
    df, 
    x="Age", 
    y="Income",
    color="Gender",
    size='Usage',
    width=800
)

png

Let’s see it in 3D:

px.scatter_3d(
    df, 
    x='Age', 
    y='Usage', 
    z='Income',
    color='MaritalStatus'
)

png

The heatmap below, with a mask over its upper half, shows the correlations between the quantitative variables:

  • the correlation between Miles & Fitness is strong
  • Age is strongly related to Income
  • Usage is highly correlated to Income, Fitness and Miles
corr = df.corr().round(3)
mask = np.triu(np.ones_like(corr, dtype=bool))
df_mask = corr.mask(mask)

fig = ff.create_annotated_heatmap(z=df_mask.to_numpy(), 
                                  x=df_mask.columns.tolist(),
                                  y=df_mask.columns.tolist(),
                                  
                                  # color options # reverse a built-in color scale by appending _r to its name
                                  colorscale=px.colors.diverging.RdBu_r,
                                  #colorscale=px.colors.sequential.Viridis,
                                  #colorscale=px.colors.sequential.Inferno,
                                  hoverinfo="none", #Shows hoverinfo for null values
                                  showscale=True, ygap=1, xgap=1
                                 )

fig.update_xaxes(side="bottom")

fig.update_layout(
    title_text='Heatmap', 
    title_x=0.5, 
    width=500, 
    height=500,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    xaxis_zeroline=False,
    yaxis_zeroline=False,
    yaxis_autorange='reversed',
    template='plotly_white'
)

# NaN values are not handled automatically and are displayed in the figure
# So we need to get rid of the text manually
for i in range(len(fig.layout.annotations)):
    if fig.layout.annotations[i].text == 'nan':
        fig.layout.annotations[i].text = ""

fig.show()

png

Looking at the graph below, we can confirm that people with higher incomes tend to choose the TM798, and that income increases with age.

For the TM498 and TM195, income is more or less the same across the various buyers.

px.line(
    df.groupby(['Age', 'Product']).agg('mean')[['Income']].reset_index(), 
    x="Age", 
    y="Income", 
    color='Product',
    #title='Mean Age',
    width=800,
    height=400
)

png

We can also visualize things in 3D - for fun :)

tmp = df.groupby(['Fitness', pd.cut(df.Age, bins=5, labels=list(range(5)))])\
    .agg('mean')[['Income']]\
    .reset_index()\
    .pivot(index='Fitness', columns='Age')

go.Figure(data=[go.Surface(z=tmp.values)])

png

The sunburst chart is a convenient way to gain insight into the proportions depending on various criteria:

tmp = df.groupby(['Product', 'Fitness']).agg('mean')[['Income']].rename(columns={'Income': 'Mean Income'}).reset_index()

px.sunburst(
    tmp,
    path=['Product', 'Fitness'], 
    values='Mean Income',
    #color='Income'
    width=600,
    title='By hovering you can see the mean income per TM / fitness level'
)

png

tmp = df.groupby(['Product', 'Gender', 'MaritalStatus']).agg('count')[['Age']].rename(columns={'Age': 'Count'}).reset_index()

px.sunburst(
    tmp,
    path=['Product', 'Gender', 'MaritalStatus'], 
    values='Count',
    width=600,
    title='Number of customers per TM / gender / marital status'
)

png

tmp = df.groupby(['Product', 'Fitness', 'Miles']).agg('mean')[['Income']].rename(columns={'Income': 'Mean Income'}).reset_index()

px.sunburst(
    tmp,
    path=['Product', 'Fitness', 'Miles'], 
    values='Mean Income',
    width=600,
    title='By hovering you can see the mean income per TM / fitness level / miles'
)

png

Recommendations

In this section, we’re going to establish the specificities of each customer segment:

def prepare_df(_df, product, gender):
    """ prepare & return the dataframe filtered for the product and gender specified in arguments """
    
    # filter by product & gender
    _df = _df[(_df.Product == product) & (_df.Gender == gender)]
                      
    #select only quantitative cols & product
    cols = [c for c in _df.columns if _df[c].dtype=="int64" or c == 'Product']
    _df = _df[cols]

    # scale between 0-5
    _df[_df.select_dtypes('number').columns] = preprocessing.MinMaxScaler().fit_transform(_df[_df.select_dtypes('number').columns])
    _df[_df.select_dtypes('number').columns] = _df[_df.select_dtypes('number').columns] * 5


    # choose a type of product and drop this col
    #_df = pd.DataFrame(_df[_df.Product == 'TM798'].drop(columns=['Product']).mean()).reset_index()
    _df = _df.groupby('Product').agg('mean').round(2).reset_index().drop(columns=['Product'])
    _df = _df.T.reset_index()
    
    # rename cols
    _df.columns = ['Feature', 'Mean']

    #_df = pd.DataFrame(_df[(_df.Product == product) & (_df.Gender == gender)].drop(columns=['Product']).mean()).reset_index()
    return _df

Let’s start with the characteristics of a single segment of customers (male users of the TM798). For that, a radar chart can be really convenient:

fig = px.line_polar(
    prepare_df(df, 'TM798', 'Male'), 
    r='Mean', 
    theta='Feature', 
    line_close=True, width=400)
fig.update_traces(fill='toself')
fig.show()

png

fig = make_subplots(rows=1, cols=3, specs=[[{"type": "scatterpolar"}, {"type": "scatterpolar"}, {"type": "scatterpolar"}]])

fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM195', 'Male')['Mean'],
    theta=prepare_df(df, 'TM195', 'Male')['Feature'],
    fill='toself',
    name='Product TM195 / Male'),
    row=1, col=1
)

fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM195', 'Female')['Mean'],
    theta=prepare_df(df, 'TM195', 'Female')['Feature'],
    fill='toself',
    name='Product TM195 / Female'),
    row=1, col=1
)

fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM498', 'Male')['Mean'],
    theta=prepare_df(df, 'TM498', 'Male')['Feature'],
    fill='toself',
    name='Product TM498 / Male'),
    row=1, col=2
)

fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM498', 'Female')['Mean'],
    theta=prepare_df(df, 'TM498', 'Female')['Feature'],
    fill='toself',
    name='Product TM498 / Female'),
    row=1, col=2
)


fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM798', 'Male')['Mean'],
    theta=prepare_df(df, 'TM798', 'Male')['Feature'],
    fill='toself',
    name='Product TM798 / Male'),
    row=1, col=3
)

fig.add_trace(go.Scatterpolar(
    r=prepare_df(df, 'TM798', 'Female')['Mean'],
    theta=prepare_df(df, 'TM798', 'Female')['Feature'],
    fill='toself',
    name='Product TM798 / Female'),
    row=1, col=3
)


fig.update_layout(height=600, width=900,
                  title_text="Specificities of males/females users for each product (beware of scales)")
fig.show()

png

df.loc[df.Income < 45000, "Income group"] = "< 45k"
df.loc[df.Income > 55000, "Income group"] = "> 55k"
df['Income group'].fillna(value='45k < & < 55k', inplace=True)
px.histogram(df, x="Income group", width=500, height=400, title="Count of users per Income category")

png

# data
#df.groupby(['Gender', 'Income group']).agg('count')[['Age']].reset_index().rename(columns={'Age': 'Count'})
#df.groupby(['Income group', 'Product']).agg('count')[['Age']].reset_index().rename(columns={'Age': 'Count'})

# color palettes: https://colorhunt.co/

# sankey diagram
fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = ["Male", "Female", "<45k", "[45k, 55k]", ">55k", "TM195", "TM498", "TM798"],
        color = ['#C4DDFF', '#FFC4DD', '#DEB6AB', '#F8ECD1', '#AC7D88', 
                 '#187498', '#36AE7C', '#EB5353']

    ),
    link = dict(
    # indices correspond to labels
        source = [0,  0,   0,  1,  1,  1,  2,  2, 2,  3,  3, 3,  4,  4, 4],#, 5, 5, 5, 6, 6, 6, 7, 7, 7], 
        target = [2,  3,   4,  2,  3,  4,  5,  6, 7,  5,  6, 7,  5,  6, 7],
        value =  [25, 42, 37, 24, 35, 17, 34, 15, 0, 35, 33, 9, 11, 12, 31],
        color = ['#C4DDFF', '#C4DDFF', '#C4DDFF', '#FFC4DD', '#FFC4DD', '#FFC4DD', '#DEB6AB', 
                  '#DEB6AB', '#DEB6AB', '#F8ECD1', '#F8ECD1', '#F8ECD1', '#AC7D88', '#AC7D88', '#AC7D88']
    
))])

fig.update_layout(title_text="Sankey Diagram of the customers depending on their gender, income & product bought", font_size=10)
fig.show()

png
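
The link values above were typed in by hand from the two commented groupby counts. As an alternative, here is a hedged sketch of how the Gender → Income group links could be built programmatically (labels and column names as defined in this notebook):

# derive the Gender -> Income group link values instead of hardcoding them
labels = ["Male", "Female", "<45k", "[45k, 55k]", ">55k"]
income_order = ["< 45k", "45k < & < 55k", "> 55k"]

links = df.groupby(['Gender', 'Income group']).size().reset_index(name='Count')
source = [labels.index(g) for g in links['Gender']]
target = [2 + income_order.index(i) for i in links['Income group']]
value = links['Count'].tolist()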

This concludes the analysis part: the sankey diagram and radar charts can deliver valuable insights to gain a better understanding of the treadmill buyers. Now, let’s see if we can build clusters of our customers with machine learning models:

The dataset is made of 3 categorical variables and several quantitative ones. The most well-known clustering model is KMeans, but it is not suited to qualitative predictors, even one-hot encoded; more info here

  • KMeans + one-hot encoding will increase the size of the dataset considerably if the categorical attributes have a large number of categories, which makes KMeans computationally costly (curse of dimensionality):

this is not a problem here.

  • the cluster means will make no sense, since the 0s and 1s are not the real values of the data. KModes, on the other hand, produces cluster modes which are real data points and hence makes the clusters interpretable (see the small sketch below):

this is why we will first use KMeans on the numerical variables only, then K-Prototype (instead of KModes) with all the features (a mix of qualitative & quantitative).
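
To make the interpretability issue concrete, here is a small sketch (not part of the original kernel): after one-hot encoding the categorical columns, the KMeans cluster centers contain fractional values for the dummy columns, which do not correspond to any real category.

# illustrative only: one-hot encode two categorical columns and inspect the KMeans centers
demo = pd.get_dummies(df[['Age', 'Income', 'Gender', 'MaritalStatus']],
                      columns=['Gender', 'MaritalStatus'])
demo_scaled = preprocessing.MinMaxScaler().fit_transform(demo)
centers = KMeans(n_clusters=3, random_state=0).fit(demo_scaled).cluster_centers_

# the dummy coordinates of the centers are fractions (e.g. 0.42), not real 'Male'/'Female' values
pd.DataFrame(centers, columns=demo.columns).round(2)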


Clustering with KMeans

Principle

png

image source: K-Means Clustering Algorithm on medium by Parth Patel

If k is given, the K-Means algorithm can be executed in the following steps (a minimal sketch is given after the list):

  • Partition the objects into k non-empty subsets
  • Identify the cluster centroids (mean points) of the current partition
  • Assign each point to a specific cluster
  • Compute the distance from each point to each centroid and assign the point to the cluster whose centroid is closest
  • After re-assigning the points, compute the centroids of the newly formed clusters
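
As an illustration of these steps, here is a minimal NumPy sketch of a single K-Means iteration (assignment + centroid update); it is purely illustrative, not the sklearn implementation used later:

# one K-Means iteration: assign points to the nearest centroid, then recompute the centroids
# (assumes Euclidean distance and that no cluster ends up empty)
def kmeans_step(points, centroids):
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids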

Requirements

Check out this excellent post on medium by Evgeniy Ryzhkov on the data preparation before kmeans

  • Numerical variables only.

K-means uses distance-based measurements to determine the similarity between data points. If you have categorical data, use K-modes clustering, if data is mixed, use K-prototype clustering.

  • Data has no noises or outliers.

K-means is very sensitive to outliers and noisy data. More detail here and here.

  • Data has symmetric distribution of variables (it isn’t skewed).

Real data always has outliers and noise, and it’s difficult to get rid of them. Transforming the data towards a normal distribution helps to reduce the impact of these issues and makes it much easier for the algorithm to identify clusters.

  • Variables on the same scale

have the same mean and variance, usually in a range -1.0 to 1.0 (standardized data) or 0.0 to 1.0 (normalized data). For the ML algorithm to consider all attributes as equal, they must all have the same scale. More detail here and here.

  • There is no collinearity (a high level of correlation between two variables).

Correlated variables are not useful for ML segmentation algorithms because they represent the same characteristic of a segment. So correlated variables are nothing but noise. More detail here.

  • Few numbers of dimensions.

As the number of dimensions (variables) increases, a distance-based similarity measure converges to a constant value between any given examples. The more variables, the more difficult it is to find clear differences between instances.

Note: what exactly does “a few dimensions” mean? It’s an open question!

Data preparation

1) No high level of correlation & keep only numerical variables (also a small number of dimensions)

X = df[['Age', 'Education', 'Income', 'Miles']]
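
As a quick sanity check (not an original cell), we can verify that the retained columns are not strongly correlated with each other; Usage and Fitness were presumably left out because of their strong correlation with Miles (see the heatmap above):

# pairwise correlations of the columns kept for clustering
X.corr().round(2)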

2) Missing or duplicated data

X.duplicated().sum()
0
X.isnull().sum()
Age          0
Education    0
Income       0
Miles        0
dtype: int64

3) Remove outliers

z_scores = np.abs(stats.zscore(X, nan_policy='omit'))
outliers_threshold = 2  # 3 means that 99.7% of the data is saved
                        # to get more smooth data, you can set 2 or 1 for this value
mask = (z_scores <= outliers_threshold).all(axis=1)
X = X[mask]

df.shape[0]  # number of rows before outlier removal
180

X.shape[0]   # number of rows after outlier removal
148

4) Symmetric distribution

#for c in X.columns:
#    df[c].plot.hist().show()
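
A hedged alternative to eyeballing the histograms (not an original cell): compute the skewness of each retained feature and, if needed, transform the most skewed ones.

# skewness of each retained feature; values far from 0 indicate a skewed distribution
X.skew().round(2)

# a transformation such as np.log1p could be applied to strongly skewed columns (e.g. Income)
# before clustering, if judged necessary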

5) Scaling

# copy for later use
X_bak = X.copy()

# X alone instead of X[X.columns] returns a np array 
X[X.columns] = preprocessing.MinMaxScaler().fit_transform(X[X.columns])

Results

There is no real ‘elbow’ that can be observed on the graph below:

km_inertias, km_scores = [], []

for k in range(2, 11):
    km = KMeans(n_clusters=k).fit(X)
    km_inertias.append(km.inertia_)
    km_scores.append(silhouette_score(X, km.labels_))
    
px.line(
    pd.DataFrame(list(zip(range(2, 11), km_inertias)), columns=['nb_clusters', 'km_inertias']), 
    x="nb_clusters", 
    y="km_inertias", 
    title='Elbow graph / Inertia depending on k',
    width=600, height=500
)

png

The highest score is for k=6, with k=4 in second position:

px.line(
    pd.DataFrame(list(zip(range(2, 11), km_scores)), columns=['nb_clusters', 'km_scores']), 
    x="nb_clusters", 
    y="km_scores", 
    title='Scores depending on k',
    width=600, height=500
)

png
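
The same ranking can be read programmatically from the scores computed in the loop above (a quick sketch, not an original cell):

# rank the candidate numbers of clusters by their average silhouette score
pd.DataFrame({'k': list(range(2, 11)), 'silhouette_score': km_scores}) \
    .sort_values('silhouette_score', ascending=False) \
    .head(3)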

Let’s plot a kmeans silhouette analysis: the silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. (source: Wikipedia)

The best value of k is not obvious, but according to the silhouette analysis our best options would be 3, then 4. Taking into account all the metrics (elbow graph, scores & silhouettes), we’ll stick with k=4.

range_n_clusters = [3, 4, 5, 6, 7]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(13, 5)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print(
        "For n_clusters =", n_clusters,
        "The average silhouette_score is :", silhouette_avg,
    )

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        X.iloc[:, 0], X.iloc[:, 1], marker=".", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

plt.show()

png

png

png

png

png

In order to visualize the clusters properly in a 3D space, let’s retrain our model with our best bet for the value of k and plot a scatter:

nb_clusters = 4
km = KMeans(n_clusters=nb_clusters).fit(X)

# K-Means visualization on pair of 2 features
#plt.figure(figsize=(10, 6))
#sns.scatterplot(X.iloc[:, 1], X.iloc[:, 2], hue=km.labels_)
#plt.show()

# Definition of the customer profiles corresponding to each cluster
X['label'] = km.labels_
#X.label.value_counts() 
px.scatter_3d(
    X, 
    x='Age', 
    y='Education', 
    z='Income',
    color='label',
    opacity=0.5
)

png

# description of each cluster
for k in range(nb_clusters):
    print(f'cluster nb : {k+1}')
    print(X_bak[X.label == k].describe().round(0).astype(int).iloc[[0, 1, 3, 7], :-1])
    print('\n\n')
cluster nb : 1
       Age  Education  Income
count   15         15      15
mean    26         17   60633
min     22         16   44343
max     34         18   83416



cluster nb : 2
       Age  Education  Income
count   59         59      59
mean    25         14   42971
min     18         13   29562
max     33         15   57987



cluster nb : 3
       Age  Education  Income
count   38         38      38
mean    25         16   46910
min     21         16   34110
max     29         18   61006



cluster nb : 4
       Age  Education  Income
count   36         36      36
mean    36         16   55377
min     31         14   37521
max     41         18   67083

Clustering with K-Prototype

Principle

Source / Reference: Audhi Aprilliant

  • K-Means is calculated using the Euclidean distance, which is only suitable for numerical data.
  • K-Modes is only suitable for categorical data, not mixed data types.
  • K-Prototype was created to handle mixed data types (numerical and categorical variables). This clustering method is based on partitioning and is an improvement over both the K-Means and K-Modes models (see the dissimilarity sketch below).
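
To make the mixed-distance idea concrete, here is a small illustrative sketch of the K-Prototype dissimilarity between a point and a prototype: squared Euclidean distance on the numerical part plus a gamma-weighted simple-matching count on the categorical part. The function name and the gamma value are illustrative; this is not the kmodes implementation itself.

# illustrative K-Prototype dissimilarity: numerical distance + gamma * categorical mismatches
def kprototype_dissimilarity(x_num, proto_num, x_cat, proto_cat, gamma=0.5):
    num_part = np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2)  # squared Euclidean
    cat_part = sum(a != b for a, b in zip(x_cat, proto_cat))             # simple matching
    return num_part + gamma * cat_part

# e.g. comparing a customer to a prototype (made-up values)
kprototype_dissimilarity([0.3, 0.5], [0.4, 0.45], ['Male', 'Single'], ['Male', 'Partnered'])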

The K-Prototype clustering algorithm in the kmodes module needs the positions of the categorical columns in the data. We save those positions in the variable cat_cols_positions; they will be used in the next step of the cluster analysis. Here the categorical columns are at positions 0, 2 and 4, as shown below.

# Get the position of categorical columns
cat_cols_positions = [df.columns.get_loc(col) for col in list(df.select_dtypes('object').columns)]
print('Categorical columns           : {}'.format(list(df.select_dtypes('object').columns)))
print('Categorical columns position  : {}'.format(cat_cols_positions))
Categorical columns           : ['Product', 'Gender', 'MaritalStatus']
Categorical columns position  : [0, 2, 4]

Data preparation

z_scores = np.abs(stats.zscore(df.select_dtypes('number'), nan_policy='omit'))
outliers_threshold = 2  # 3 means that 99.7% of the data is saved
                        # to get more smooth data, you can set 2 or 1 for this value
mask = (z_scores <= outliers_threshold).all(axis=1)
df = df[mask]

# backup of the data for later
df_bak = df.copy()

df.shape[0]
144
num_cols = list(df.select_dtypes('number').columns) 
df[num_cols] = preprocessing.MinMaxScaler().fit_transform(df[num_cols])
df.head()
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
0 TM195 0.000000 Male 0.2 Single 0.333333 0.666667 0.000000 0.456790
1 TM195 0.043478 Male 0.4 Single 0.000000 0.333333 0.042225 0.228395
2 TM195 0.043478 Female 0.2 Partnered 0.666667 0.333333 0.021113 0.172840
4 TM195 0.086957 Male 0.0 Partnered 0.666667 0.000000 0.105563 0.055556
5 TM195 0.086957 Female 0.2 Partnered 0.333333 0.333333 0.063338 0.172840

Next, convert the dataframe to a matrix so that the kmodes module can run the K-Prototype clustering algorithm. We save the data matrix in the variable df_matrix.

# Convert dataframe to matrix
df_matrix = df.to_numpy()
df_matrix
array([['TM195', 0.0, 'Male', ..., 0.6666666666666666, 0.0,
        0.4567901234567901],
       ['TM195', 0.0434782608695653, 'Male', ..., 0.33333333333333337,
        0.04222527574553425, 0.22839506172839502],
       ['TM195', 0.0434782608695653, 'Female', ..., 0.33333333333333337,
        0.02111263787276718, 0.1728395061728395],
       ...,
       ['TM798', 0.34782608695652173, 'Male', ..., 0.6666666666666666,
        0.6532291009024399, 0.8765432098765433],
       ['TM798', 0.3913043478260869, 'Male', ..., 0.9999999999999999,
        1.0, 0.7530864197530864],
       ['TM798', 0.4782608695652175, 'Male', ..., 0.9999999999999999,
        0.42202993278122336, 0.8765432098765433]], dtype=object)
df_matrix.shape
(144, 9)

We use the Elbow method to determine the optimal number of clusters for K-Prototype. Instead of calculating the within-cluster sum of squared errors (WSSE) with the Euclidean distance, K-Prototype provides a cost function that combines the calculations for numerical and categorical variables. We can then look at the elbow of this cost curve to determine the optimal number of clusters.

# Choose optimal K using Elbow method
cost, nb_clusters = [], range(2, 10) 
for cluster in nb_clusters:
    try:
        kprototype = KPrototypes(n_jobs=-1, n_clusters=cluster, init='Huang', random_state=0)
        kprototype.fit_predict(df_matrix, categorical=cat_cols_positions)
        cost.append(kprototype.cost_)
        print('Cluster initiation: {}'.format(cluster))
    except:
        break
        
# Converting the results into a dataframe and plotting them
df_cost = pd.DataFrame({'Cluster':nb_clusters, 'Cost':cost})
Cluster initiation: 2
Cluster initiation: 3
Cluster initiation: 4
Cluster initiation: 5
Cluster initiation: 6
Cluster initiation: 7
Cluster initiation: 8
Cluster initiation: 9
px.line(
    df_cost, 
    x="Cluster", 
    y="Cost", 
    title='Cost depending on k',
    width=600, height=500
)

png

According to the plot of the cost function above, we choose k = 4 clusters, even though once again there is no “real elbow” here… This will be the number of clusters used for the K-Prototype cluster analysis. Read more about the Elbow method here.
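
If you prefer a programmatic hint over eyeballing the curve, the third-party kneed package can try to locate the elbow; this is just a suggestion, it is not used elsewhere in this notebook:

# optional: locate the elbow programmatically (pip install kneed)
from kneed import KneeLocator

kl = KneeLocator(list(nb_clusters), cost, curve='convex', direction='decreasing')
print(kl.elbow)  # may return None when there is no clear elbow, which matches the remark above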

# Fit the cluster
kprototype = KPrototypes(n_jobs=-1, n_clusters=4, init='Huang', random_state=0)
kprototype.fit_predict(df_matrix, categorical=cat_cols_positions)
array([2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 1,
       2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 2, 3, 2, 3, 3, 3,
       2, 3, 2, 3, 0, 3, 0, 2, 2, 3, 3, 2, 2, 3, 3, 1, 3, 0, 0, 0, 3, 0,
       3, 3, 0, 0, 0, 0, 2, 2, 3, 2, 3, 2, 2, 2, 0, 3, 3, 0, 3, 2, 2, 3,
       2, 3, 2, 2, 3, 2, 2, 3, 3, 2, 0, 1, 2, 2, 3, 3, 2, 0, 3, 0, 0, 0,
       3, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1], dtype=uint16)

We can print the cluster centroids using kprototype.cluster_centroids_. For the numerical variables it uses the average, while for the categorical ones it uses the mode.

print(f"Cluster centorid: {kprototype.cluster_centroids_}\n\n"
      f"Nb of iterations: {kprototype.n_iter_}\n\n"
      f"Cost of the cluster creation: {kprototype.cost_:.2f}\n\n"
)
Cluster centroid: [['0.6969696969696969' '0.6242424242424243' '0.45454545454545386'
  '0.3535353535353531' '0.49131040039793344' '0.3338945005611671' 'TM498'
  'Male' 'Partnered']
 ['0.30434782608695654' '0.7111111111111109' '0.7777777777777776'
  '0.9259259259259258' '0.4861281324403841' '0.6858710562414267' 'TM798'
  'Male' 'Single']
 ['0.2708333333333334' '0.33749999999999986' '0.486111111111111'
  '0.38194444444444414' '0.2515021323083399' '0.36612654320987653'
  'TM195' 'Male' 'Single']
 ['0.3971014492753625' '0.3377777777777775' '0.148148148148148'
  '0.19999999999999973' '0.29604609994924563' '0.16941015089163208'
  'TM195' 'Female' 'Partnered']]

Nb of iterations: 11

Cost of the cluster creation: 39.74

Now the clusters need to be interpreted, using the centroids of each cluster. To do so, we append the cluster labels to the raw (unscaled) data. Ordering the cluster labels will also help arrange the interpretation.

# retrieve the unscaled values
df = df_bak

# Add the cluster to the dataframe
df['Cluster Labels'] = kprototype.labels_
df['Segment'] = df['Cluster Labels'].map({0:'First', 1:'Second', 2:'Third', 3:'Fourth'})

# Order the cluster
df['Segment'] = df['Segment'].astype('category')
df['Segment'] = df['Segment'].cat.reorder_categories(['First','Second','Third', 'Fourth'])

df.tail()
Product Age Gender Education MaritalStatus Usage Fitness Income Miles Cluster Labels Segment
152 TM798 25 Female 18 Partnered 5 5 61006 200 1 Second
153 TM798 25 Male 18 Partnered 4 3 64741 100 0 First
158 TM798 26 Male 16 Partnered 5 4 64741 180 1 Second
159 TM798 27 Male 16 Partnered 4 5 83416 160 1 Second
165 TM798 29 Male 18 Single 5 5 52290 180 1 Second

To interpret the clusters, the numerical variables are summarized with their average and the categorical ones with their mode. Other methods could be used as well, such as the median, percentiles, or the value composition of the categorical variables (a small sketch is given after the table below).

# Cluster interpretation
df.rename(columns = {'Cluster Labels':'Total'}, inplace = True)
df.groupby('Segment').agg(
    {
        'Total':'count',
        'Product': lambda x: x.value_counts().index[0],
        'Age': lambda x: x.value_counts().index[0],
        'Gender': lambda x: x.value_counts().index[0],
        'MaritalStatus': lambda x: x.value_counts().index[0],
        'Education': 'mean',
        'Usage': 'mean',
        'Fitness': 'mean',
        'Income': 'mean',
        'Miles': 'mean'
    }
).round(1).reset_index().set_index('Segment')
Total Product Age Gender MaritalStatus Education Usage Fitness Income Miles
Segment
First 33 TM498 35 Male Partnered 16.1 3.4 3.1 56021.0 92.1
Second 18 TM798 24 Male Single 16.6 4.3 4.8 55741.9 149.1
Third 48 TM195 23 Male Single 14.7 3.5 3.1 43106.4 97.3
Fourth 45 TM195 25 Female Partnered 14.7 2.4 2.6 45505.3 65.4
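
As mentioned above, other summaries are possible. Here is a small illustrative sketch (not an original cell) using medians for a few numerical variables and the value composition of the Product variable within each segment:

# medians of some numerical variables per segment
df.groupby('Segment')[['Age', 'Income', 'Miles']].median().round(1)

# proportion of each product within each segment (value composition)
df.groupby('Segment')['Product'].value_counts(normalize=True).round(2)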

Conclusion

We have used 2 different approaches to gain insights into the TM customers:

  • first, after an in-depth analysis of the various features, we were able to categorize customers, especially depending on the product bought, with the radar charts
  • then, we used the KMeans and K-Prototype clustering models: the latter seems more appropriate, since the K-Prototype clusters take all the variables into account, even the categorical ones. The last output just above this conclusion shows the specificities of each cluster, which can be really valuable for marketing or CRM purposes.

We could also have tried other clustering models, such as DBSCAN or hierarchical clustering (with a dendrogram). They can also be informative (but that will be for later)…

I hope you’ve enjoyed this study and thank you for reading it :)

References:

I also recommend the excellent blog post on K-prototypes by Anton Ruberts