
Header photo by Ryan Born

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

Context

This is a Kaggle challenge organized by the Machine Learning Group of ULB (Université Libre de Bruxelles, Belgium).

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Goals

Identify fraudulent credit card transactions. Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

Acknowledgements

Please refer to the web page of the Kaggle challenge provided above

Libraries needed

The usual Python stack for data science:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

pd.set_option('display.max_columns', 100)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, auc, precision_recall_curve
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
import lightgbm as lgbm
import xgboost as xgb

First Insight


file_path = os.path.join('input', 'creditcardfraud', 'creditcard.csv')
df = pd.read_csv(file_path)
df.head()
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
df.shape
(284807, 31)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. The feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

Cleaning the dataset

Not much cleaning is needed here, apart from checking duplicated rows. There are no unrealistic values, because all columns correspond to real transactions.

#df.isnull().sum()
df.duplicated().sum()
1081

The ~1,000 duplicated rows are kept: the values have been scaled/transformed and may have been rounded along the way, so these might not be real duplicates.
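A quick check that could be added here (not run in the original notebook, so no output is shown) is whether any of the duplicated rows are frauds, since dropping duplicated fraud rows would make the imbalance even worse:

# Class breakdown of the duplicated rows: if some of them are frauds,
# dropping duplicates would remove precious minority-class examples.
print(df[df.duplicated()]['Class'].value_counts())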

#df = df.drop_duplicates()
#df.shape

More info here on how Principal Component Analysis (PCA) works. If your learning algorithm is too slow because the input dimension is too high, then using PCA to speed it up can be a reasonable choice; this is probably the most common application of PCA. Another common application of PCA is data visualization.

Here, the PCA has also been used to anonymize data.
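As an aside, when PCA is used for speed-up or visualization, the explained variance ratio tells how much information the reduced representation keeps. A minimal sketch, assuming a scaled numeric matrix named X_scaled (a hypothetical name, not defined in this notebook):

import numpy as np
from sklearn.decomposition import PCA

# Keep 2 components and report how much of the total variance they capture.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)             # variance share of each component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share of the 2D view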

non_PCA_features = ['Time', 'Amount', 'Class']
df[non_PCA_features].describe()
Time Amount Class
count 284807.000000 284807.000000 284807.000000
mean 94813.859575 88.349619 0.001727
std 47488.145955 250.120109 0.041527
min 0.000000 0.000000 0.000000
25% 54201.500000 5.600000 0.000000
50% 84692.000000 22.000000 0.000000
75% 139320.500000 77.165000 0.000000
max 172792.000000 25691.160000 1.000000

Exploratory Data analysis

Imbalanced target

sns.countplot(x='Class', data=df)
plt.show()

png

ratio = df.Class.sum() / df.shape[0]
print(f'Percentage of class #1 : {ratio * 100 :.2f}% - Number of frauds is highly imbalanced !')
Percentage of class #1 : 0.17% - Number of frauds is highly imbalanced !

Chosen metric

The goal is to identify fraudulent credit card transactions. Given the class imbalance ratio, it is recommended to measure performance with the Area Under the Precision-Recall Curve (AUPRC); confusion matrix accuracy is not meaningful for unbalanced classification.

See here for the complete explanation of the metric, with an extract below:

It can be more flexible to predict probabilities of an observation belonging to each class in a classification problem rather than predicting classes directly.

This flexibility comes from the way that probabilities may be interpreted using different thresholds that allow the operator of the model to trade-off concerns in the errors made by the model, such as the number of false positives compared to the number of false negatives. This is required when using models where the cost of one error outweighs the cost of other types of errors.

Two diagnostic tools that help in the interpretation of probabilistic forecast for binary (two-class) classification predictive modeling problems are ROC Curves and Precision-Recall curves.

  • ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
  • Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
  • ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

What Is a Precision-Recall Curve?

Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Precision is referred to as the positive predictive value.

Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).

The reason for this is that, typically, the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives.
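In code, both quantities can be read off the confusion matrix; a small sketch with hypothetical y_true / y_pred arrays:

from sklearn.metrics import confusion_matrix

# For a binary problem, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # positive predictive value
recall = tp / (tp + fn)      # sensitivity / true positive rate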

png

Line Plot of Precision-Recall Curve
  • The curves of different models can be compared directly in general or for different thresholds.
  • The area under the curve (AUC) can be used as a summary of the model skill: it summarizes the integral, or an approximation, of the area under the precision-recall curve.
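To make this concrete, here is a toy sketch (not part of the original notebook) mirroring the dataset's ~0.17% positive rate: a "never fraud" classifier reaches almost perfect accuracy, while the AUPRC of uninformative scores collapses towards the positive rate.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_curve, auc

rng = np.random.RandomState(0)
y_true = (rng.rand(100000) < 0.0017).astype(int)   # ~0.17% positives

# A classifier that always predicts 'genuine' still scores ~99.8% accuracy.
print(accuracy_score(y_true, np.zeros_like(y_true)))

# Random scores give an AUPRC close to the prevalence (~0.002), far from 1.
precision, recall, _ = precision_recall_curve(y_true, rng.rand(100000))
print(auc(recall, precision))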

Distribution of the transaction amount

plt.figure(figsize=(14, 5))
sns.distplot(df['Amount'], label = 'Amount')
plt.show()

png

print(f"Percentage of transactions' amount below 200€ : {len(df[df.Amount < 200]) / len(df) * 100 :.2f}%")
Percentage of transactions' amount below 200€ : 89.71%
print(f"Percentage of transactions' amount below 500€ : {len(df[df.Amount < 500]) / len(df) * 100 :.2f}%")
Percentage of transactions' amount below 500€ : 96.67%
plt.figure(figsize=(14, 8))

plt.subplot(2, 1, 1)
sns.kdeplot(df.loc[df['Class'] == 1, 'Amount'], shade=True, label = 'Fraud')
plt.subplot(2, 1, 2)
sns.kdeplot(df.loc[df['Class'] == 0, 'Amount'], shade=True, label = 'Normal')
plt.show()

png

plt.figure(figsize=(14, 5))

sns.kdeplot(df.loc[df['Class'] == 1, 'Amount'], shade=True, label = 'Fraud')
sns.kdeplot(df.loc[df['Class'] == 0, 'Amount'], shade=True, label = 'Normal')
plt.show()

png

Values depending on time

print(f'Number of days over which the data were collected: {df.Time.max() / (60 * 60 * 24):.2f}')
Number of days over which the data were collected: 2.00
df['Hours'] = round((df['Time'] / (60 * 60)) % 24)
df.Hours = df.Hours.astype(int)
df['Days'] = round(df['Time'] / (60 * 60 * 24))
df.Days = df.Days.astype(int)
df.Days.value_counts()
1    144825
2     92576
0     47406
Name: Days, dtype: int64
df.Class.sum()
492
plt.figure(figsize=(14, 6))
sns.distplot((df['Time'] / (60 * 60)), label = 'All transactions', bins=24)
plt.title('Transaction distribution during 2 days')
plt.show()

png

plt.figure(figsize=(14, 6))
sns.kdeplot((df['Time'] / (60 * 60)) % 24, label = 'All transactions', shade=True)
sns.kdeplot((df.loc[df['Class'] == 1, 'Time'] / (60 * 60)) % 24, label = 'Fraud', shade=True)
plt.title('Transaction Nb per hour during a typical day')
plt.show()

png

plt.figure(figsize=(14, 6))
sns.kdeplot((df['Time'] / (60 * 60)), label = 'All transactions', shade=True)
sns.kdeplot((df.loc[df['Class'] == 1, 'Time'] / (60 * 60)), label = 'Fraud', shade=True)
plt.title('Transaction Nb per hour during 2 days')
plt.show()

png

Correlations

corr = df.corr()
#corr
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

png

As expected, there is not much correlation between the PCA features themselves, since principal components are orthogonal by construction. We can see some correlation between the non-PCA features, in particular with the target; nevertheless, none of the correlations is strong.

Plotting the PCA in 2D

Here it could be interesting to redo a PCA in order to reduce the number of dimensions to 2 or 3, so that we can easily plot the data (and compare with t-SNE). This way, we'll be able to see whether class 1 is scattered or clustered.

  • If the class 1 points are not located in the same region, SMOTE would add synthetic points in the wrong place, among class 0, which is not what we want. In that case, plain under/over-sampling is the better practice.
  • On the contrary, if the points of class 1 are close to each other, SMOTE can be used.

First, PCA is sensitive to differences in scale, so we have to center and scale the data, for instance with a standard scaler.

pca_feat = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Hours']
y = df.Class
X_centered = df[pca_feat]
scaler = StandardScaler()
X_centered = scaler.fit_transform(X_centered)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_centered)
X_pca[:5]
array([[ 0.36387701,  1.55169943],
       [-0.42400152,  1.55246636],
       [ 1.82884782,  1.62882145],
       [ 0.26292935,  1.46965936],
       [-0.0220992 ,  1.66986345]])
# Then we plot the results of PCA
plt.figure(figsize=(8, 6))
plt.plot(X_pca[y == 0, 0], X_pca[y == 0, 1], 'o', label='Genuine')
plt.plot(X_pca[y == 1, 0], X_pca[y == 1, 1], '*', label='Fraud')
plt.legend(loc=0)
plt.title('PCA 2D projection')
plt.show()

png

Plotting the TSNE in 2D - with a subsample

from sklearn.utils import resample

X_sub, y_sub = resample(X_pca, y, replace=False, n_samples=2000, random_state=0)
tsne = TSNE(n_components=2)

# Here we perform the t-SNE
X_tsne = tsne.fit_transform(X_sub)

# Then we plot the results of t-SNE
plt.plot(X_tsne[y_sub == 0, 0], X_tsne[y_sub == 0, 1], 'o', label='Genuine')
plt.plot(X_tsne[y_sub == 1, 0], X_tsne[y_sub == 1, 1], '*', label='Fraud')
plt.legend(loc=0)
plt.title('t-SNE projection')
plt.show()

png

Plotting the PCA in 3D

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_centered)
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_pca[y == 0, 2], X_pca[y == 0, 1], X_pca[y == 0, 0], 'o', label='Genuine')
ax.scatter(X_pca[y == 1, 2], X_pca[y == 1, 1], X_pca[y == 1, 0], '*', label='Fraud')

plt.legend()
plt.show()

png

Plotting the TSNE in 3D

tsne = TSNE(n_components=3)
X_tsne = tsne.fit_transform(X_sub)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_tsne[y_sub == 0, 0], X_tsne[y_sub == 0, 1], X_tsne[y_sub == 0, 2], 'o', label='Genuine')
ax.scatter(X_tsne[y_sub == 1, 0], X_tsne[y_sub == 1, 1], X_tsne[y_sub == 1, 2], '*', label='Fraud')

plt.legend()
plt.show()

png

Data Preparation and Feature engineering

Features kept

At first glance, keeping the days is of limited interest since the data are collected over 2 days only… and it is not relevant to keep the raw 'Time' feature.

features_kept = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Hours', 'Days']
y = df['Class']
X = df[features_kept]

Importance analysis

rnd_clf = RandomForestClassifier(n_jobs=-1)
rnd_clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
feature_importances = pd.DataFrame(rnd_clf.feature_importances_, index = X.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)
feature_importances[:10]
importance
V12 0.216463
V14 0.106220
V17 0.090474
V9 0.084557
V10 0.064671
V18 0.057420
V16 0.053249
V11 0.042484
V4 0.034072
V7 0.024718
plt.figure(figsize=(14, 8))
sns.barplot(x="importance", y=feature_importances.index, data=feature_importances)
plt.show()

png

Data transformation for models

X = pd.get_dummies(data=X, columns=['Hours'], drop_first=True)
X.head()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Days Hours_1 Hours_2 Hours_3 Hours_4 Hours_5 Hours_6 Hours_7 Hours_8 Hours_9 Hours_10 Hours_11 Hours_12 Hours_13 Hours_14 Hours_15 Hours_16 Hours_17 Hours_18 Hours_19 Hours_20 Hours_21 Hours_22 Hours_23 Hours_24
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scaler = StandardScaler()
X[['Amount', 'Days']] = scaler.fit_transform(X[['Amount', 'Days']])
X.drop(columns=['Days'])  # note: without assignment or inplace=True this has no effect, so 'Days' is kept
X.head()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Days Hours_1 Hours_2 Hours_3 Hours_4 Hours_5 Hours_6 Hours_7 Hours_8 Hours_9 Hours_10 Hours_11 Hours_12 Hours_13 Hours_14 Hours_15 Hours_16 Hours_17 Hours_18 Hours_19 Hours_20 Hours_21 Hours_22 Hours_23 Hours_24
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0.244964 -1.696601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 -0.342475 -1.696601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 1.160686 -1.696601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0.140534 -1.696601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 -0.073403 -1.696601 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Let's split the data into a train and a test set to evaluate the models

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
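Note that with only ~0.17% frauds, a plain random split can leave the train and test sets with slightly different fraud rates. A stratified split would keep the ratio identical on both sides; this is shown as an alternative sketch only, the results below were obtained with the call above:

# Alternative, not used for the reported results: stratify=y preserves the
# fraud ratio in both splits and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)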

Baselines with weighted classes but without resampling

Implementation of the metric

# Just a test
model = RandomForestClassifier()
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)

# keep probabilities for the positive outcome only
probs = probs[:, 1]

# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# calculate precision-recall AUC
auc_score = auc(recall, precision)
auc_score
0.8753123564549905
# f1_score binary by default
def get_f1_scores(fitted_clf, model_name):
    y_train_pred, y_pred = fitted_clf.predict(X_train), fitted_clf.predict(X_test)
    print(model_name, ' :')
    print(f'Training F1 score = {f1_score(y_train, y_train_pred) * 100:.2f}% / Test F1 score = {f1_score(y_test, y_pred)  * 100:.2f}%')
    print(classification_report(y_test, y_pred))
def get_auc_scores(fitted_clf, model_name):
    print(model_name, ' :')
    
    # get classes predictions for the classification report 
    y_train_pred, y_pred = fitted_clf.predict(X_train), fitted_clf.predict(X_test)
    print(classification_report(y_test, y_pred), '\n') # target_names=y
    
    # compute probabilities and keep the ones for the positive outcome only
    probs = fitted_clf.predict_proba(X_test)[:, 1]
    
    # calculate precision-recall curve
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    
    # calculate precision-recall AUC
    auc_score = auc(recall, precision)
    print(f'Area Under the Precision-Recall Curve (AUPRC) = {auc_score * 100 :.2f}%')

Randomforest without weighted classes

model = RandomForestClassifier()
model.fit(X_train, y_train)
get_auc_scores(model, 'RandomForest not weighted')
RandomForest not weighted  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.87      0.82      0.85        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.94      0.91      0.92     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 88.65%

Randomforest with weighted classes

y_train.sum(), len(y_train) - y_train.sum()
(398, 227447)
model = RandomForestClassifier(class_weight={1:398, 0:227447})
model.fit(X_train, y_train)
get_auc_scores(model, 'RandomForest weighted')
RandomForest weighted  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56868
           1       0.89      0.72      0.80        94

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.95      0.86      0.90     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 80.41%

We can see that the F1 scores are close, but the AUPRC values are not (88.65% without weighted classes versus 80.41% with them). This confirms that the F1 score is definitely not a better metric than the AUPRC here.
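As a side note, scikit-learn can also derive inverse-frequency weights automatically with class_weight='balanced'; a sketch of that alternative (not one of the runs reported above):

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the rare fraud class automatically gets the larger weight.
model = RandomForestClassifier(class_weight='balanced', n_estimators=100)
model.fit(X_train, y_train)
get_auc_scores(model, 'RandomForest balanced (sketch)')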

LGBM without weights

model = lgbm.LGBMClassifier(n_jobs = -1)
model.fit(X_train, y_train)
get_auc_scores(model, 'LGBM non weighted')
LGBM non weighted  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.15      0.35      0.21        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.58      0.67      0.61     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 20.58%

The result is disappointing…

LGBM with weights

model = lgbm.LGBMClassifier(n_jobs = -1, class_weight={0:398, 1:227447})
model.fit(X_train, y_train)
get_auc_scores(model, 'LGBM weighted')
LGBM weighted  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56868
           1       0.81      0.81      0.81        94

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.90      0.90      0.90     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 79.59%

That's better, but still below the first random forest classifier.

XGB without ratio

model = xgb.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
get_auc_scores(model, 'XGB without ratio')
XGB without ratio  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.87      0.77      0.82        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.93      0.89      0.91     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 86.19%

XGB with ratio

ratio = ((len(y_train) - y_train.sum()) - y_train.sum()) / y_train.sum()
ratio
570.4748743718593
model = xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio)
model.fit(X_train, y_train)
get_auc_scores(model, 'XGB with ratio')
XGB with ratio  :
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     56868
           1       0.16      0.85      0.27        94

   micro avg       0.99      0.99      0.99     56962
   macro avg       0.58      0.92      0.63     56962
weighted avg       1.00      0.99      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 70.62%

It seems that XGBoost doesn't work well with this ratio; the result is better without it.
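For reference, the heuristic usually suggested in the XGBoost documentation is the plain count ratio sum(negatives) / sum(positives); a sketch of that variant (not run here):

# Standard heuristic: total negatives divided by total positives.
ratio_alt = (len(y_train) - y_train.sum()) / y_train.sum()
model = xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio_alt)
model.fit(X_train, y_train)
get_auc_scores(model, 'XGB with neg/pos ratio (sketch)')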

Using the Synthetic Minority Over-sampling Technique

How SMOTE works

Brief description of SMOTE (Synthetic Minority Over-sampling Technique):

It creates synthetic observations of the minority class (here, the frauds) by:

  • Finding the k nearest neighbors of the minority-class observations (i.e. similar observations)
  • Randomly choosing one of those k nearest neighbors and using it to create a similar, but randomly tweaked, new observation, as sketched below
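A rough sketch of that interpolation step (illustration only, not the imblearn implementation, which also handles the neighbor search and the sampling ratio):

import numpy as np

def smote_like_sample(x_i, minority_neighbors, rng=np.random):
    """Create one synthetic observation between x_i and a randomly chosen
    minority-class neighbor (the core idea behind SMOTE)."""
    x_nn = minority_neighbors[rng.randint(len(minority_neighbors))]
    lam = rng.rand()                    # random position on the segment
    return x_i + lam * (x_nn - x_i)     # linear interpolation between the two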

More explanations can be found here

Another informative article:

Oversampling is a well-known way to potentially improve models trained on imbalanced data. But it's important to remember that oversampling incorrectly can lead to thinking a model will generalize better than it actually does. Random forests are great because the model architecture reduces overfitting (see Breiman 2001 for a proof), but poor sampling practices can still lead to false conclusions about the quality of a model.

When the model is in production, it’s predicting on unseen data. The main point of model validation is to estimate how the model will generalize to new data. If the decision to put a model into production is based on how it performs on a validation set, it’s critical that oversampling is done correctly.
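One way to guarantee that the oversampling only ever sees training data is to put SMOTE inside an imblearn pipeline, so that it is re-fitted within each cross-validation fold; a sketch of that approach (not what is run below):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# SMOTE is applied to the training part of each fold only; the validation
# folds are never contaminated with synthetic samples.
pipe = make_pipeline(SMOTE(k_neighbors=3), RandomForestClassifier(n_estimators=100))
scores = cross_val_score(pipe, X, y, cv=3, scoring='average_precision')
print(scores.mean())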

# conda install -c conda-forge imbalanced-learn
from imblearn.over_sampling import SMOTE
X_train.shape, y_train.shape
((227845, 54), (227845,))

SMOTE is applied to the training set ONLY:

sm = SMOTE(sampling_strategy=0.01, k_neighbors=3, n_jobs=1)  # oversample the frauds up to 1% of the majority class (replaces the deprecated ratio argument)
X_train, y_train = sm.fit_resample(X_train, y_train)
X_train.shape, y_train.shape
((229711, 54), (229711,))
y_train.sum(), len(y_train) - y_train.sum()
(2274, 227437)

Randomforest

model = RandomForestClassifier()
model.fit(X_train, y_train)
get_auc_scores(model, 'RandomForest with SMOTE')
RandomForest with SMOTE  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.89      0.83      0.86        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.94      0.92      0.93     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 88.21%

Result unchanged!

model = RandomForestClassifier(class_weight={1:2274, 0:227437})
model.fit(X_train, y_train)
get_auc_scores(model, 'RandomForest weighted')
RandomForest weighted  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56868
           1       0.93      0.67      0.78        94

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.96      0.84      0.89     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 81.85%

LGBM

model = lgbm.LGBMClassifier(n_jobs = -1)
model.fit(X_train, y_train)
get_auc_scores(model, 'LGBM with SMOTE')
LGBM with SMOTE  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.86      0.81      0.83        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.93      0.90      0.92     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 86.49%

That's really better; let's try adding a ratio

model = lgbm.LGBMClassifier(n_jobs = -1, class_weight={0:2274, 1:227437})
model.fit(X_train, y_train)
get_auc_scores(model, 'LGBM with SMOTE and ratio')
LGBM with SMOTE and ratio  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56878
           1       0.71      0.90      0.80        84

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.86      0.95      0.90     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 89.91%

That's the best result so far!

Logistic Regression

lr = LogisticRegression(C=0.01, penalty='l1').fit(X_train, y_train)
get_auc_scores(lr, 'Logistic Regression')
Logistic Regression  :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56868
           1       0.88      0.47      0.61        94

   micro avg       1.00      1.00      1.00     56962
   macro avg       0.94      0.73      0.81     56962
weighted avg       1.00      1.00      1.00     56962
 

Area Under the Precision-Recall Curve (AUPRC) = 71.75%

Precision-recall curve plot

probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
from inspect import signature
from sklearn.metrics import average_precision_score

average_precision = average_precision_score(y_test, probs)
print('Average precision-recall score: {0:0.2f}'.format(average_precision))

step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})

plt.figure(figsize=(8, 6))
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
Average precision-recall score: 0.71

png

Model Scores Comparison

data = {'Models': ['RandomForrest w/o ratio,  w/o smote',
                   'RandomForrest w/  ratio,  w/o smote',
                   'RandomForrest w/o ratio,  w/  smote',
                   'RandomForrest w/  ratio,  w/  smote',
                   'LGBM          w/o ratio,  w/o smote',
                   'LGBM          w/  ratio,  w/o smote',
                   'LGBM          w/o ratio,  w/  smote',
                   'LGBM          w/  ratio,  w/  smote',
                   'XGBoost       w/o ratio,  w/o smote',
                   'XGBoost       w/  ratio,  w/o smote'],
    'AUPRC': [88.65, 80.41, 88.21, 81.85, 20.58, 79.59, 86.49, 89.91, 86.19, 70.62]}

scores_df = pd.DataFrame(data,columns=['Models',  'AUPRC'])

plt.figure(figsize=(8, 6))
sns.barplot(y="Models", x="AUPRC", data=scores_df)
plt.show()

png

Using anomaly detection models

for learning purposes :)

How LOF works under the hood

The LOF algorithm is an unsupervised outlier detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

The number of neighbors considered (parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close-by objects that can potentially be local outliers. In practice, such information is generally not available, and taking n_neighbors=20 appears to work well in general.
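A tiny toy illustration of the scikit-learn conventions (inliers are labeled +1, outliers -1, and negative_outlier_factor_ is close to -1 for inliers and much lower for outliers):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four points clustered near the origin plus one isolated point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])

lof = LocalOutlierFactor(n_neighbors=2)
print(lof.fit_predict(X_toy))          # the isolated point should be labeled -1
print(lof.negative_outlier_factor_)    # and receive a strongly negative score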

Application

X_sub, y_sub = resample(X, y, replace=False, n_samples=2000, random_state=0)
X_sub.shape, y_sub.shape
((2000, 54), (2000,))
from sklearn.neighbors import LocalOutlierFactor

# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=20, contamination="auto")

# use fit_predict to compute the predicted labels of the samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
X_sub_values = X_sub.values                  # plain array for easy column slicing
y_pred = clf.fit_predict(X_sub_values)       # +1 = inlier, -1 = outlier
y_pred_class = (y_pred == -1).astype(int)    # map to the 0/1 Class convention
n_errors = (y_pred_class != y_sub.values).sum()
X_scores = clf.negative_outlier_factor_

plt.figure(figsize=(8, 8))
plt.title("Local Outlier Factor (LOF)")
plt.scatter(X_sub_values[:, 0], X_sub_values[:, 1], c=y_pred, s=10., label='Data points')

# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(X_sub_values[:, 0], X_sub_values[:, 1], s=1000 * radius,
            edgecolors='r', facecolors='none', label='Outlier scores')
legend = plt.legend(loc='lower left')
plt.show()

print("prediction errors: {}".format(n_errors))
print("Negative LOF scores: {}".format(clf.negative_outlier_factor_))
print("Offset (threshold to consider sample as anomaly or not): {}".format(clf.offset_))