# load packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from scipy.stats import kstest, norm
from imblearn.under_sampling import NearMiss, RandomUnderSampler
# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
1. Understand the Context and Data Collection
Define the objective
The objective of this competition is to predict which customers respond positively to an automobile insurance offer.
Load Data
For this purpose, we load the necessary data from the corresponding competition in Kaggle (link).
# load the training and test dataset
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
Understand the Data
The dataset for this competition (both train and test) was generated from a deep learning model trained on the Health Insurance Cross Sell Prediction Data dataset.
2. Data Cleaning
Handle Missing Values
We’ll now identify any missing data and decide how to handle it, whether by removal or imputation.
# check whether the training dataset has missing observations
print(df_train.isna().sum())
print("\n")
print(df_test.isna().sum())
id 0
Gender 0
Age 0
Driving_License 0
Region_Code 0
Previously_Insured 0
Vehicle_Age 0
Vehicle_Damage 0
Annual_Premium 0
Policy_Sales_Channel 0
Vintage 0
Response 0
dtype: int64
id 0
Gender 0
Age 0
Driving_License 0
Region_Code 0
Previously_Insured 0
Vehicle_Age 0
Vehicle_Damage 0
Annual_Premium 0
Policy_Sales_Channel 0
Vintage 0
dtype: int64
The training and test data don’t have any missing observations.
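For completeness, had there been missing values, a minimal sketch of the two options mentioned above (simple imputation per column; the strategies here are my assumptions, not fixed rules):
# hypothetical handling -- not executed here, since no values are missing
for col in df_train.columns[df_train.isna().any()]:
    if df_train[col].dtype == 'object':
        df_train[col] = df_train[col].fillna(df_train[col].mode()[0])  # mode for categoricals
    else:
        df_train[col] = df_train[col].fillna(df_train[col].median())   # median for numerics
Dropping the affected rows with df_train.dropna() would be the removal alternative, which is usually acceptable only when very few rows are affected.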
Remove Duplicates
We’ll now identify and remove duplicate records.
# check whether there are duplicated observations
print(df_train.duplicated().sum())
print("\n")
print(df_test.duplicated().sum())
0
0
There are no duplicated observations in either dataset.
Correct Errors
We now fix any errors or inconsistencies in the data (e.g., incorrect data types, out-of-range values).
# show the first five rows of df_train
df_train.head()
|   | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 21 | 1 | 35.0 | 0 | 1-2 Year | Yes | 65101.0 | 124.0 | 187 | 0 |
| 1 | 1 | Male | 43 | 1 | 28.0 | 0 | > 2 Years | Yes | 58911.0 | 26.0 | 288 | 1 |
| 2 | 2 | Female | 25 | 1 | 14.0 | 1 | < 1 Year | No | 38043.0 | 152.0 | 254 | 0 |
| 3 | 3 | Female | 35 | 1 | 1.0 | 0 | 1-2 Year | Yes | 2630.0 | 156.0 | 76 | 0 |
| 4 | 4 | Female | 36 | 1 | 15.0 | 1 | 1-2 Year | No | 31951.0 | 152.0 | 294 | 0 |
# print the number of rows and columns in df_train and df_test
print("The training data have {} rows and {} columns.".format(df_train.shape[0], df_train.shape[1]))
print("\nThe test data have {} rows and {} columns.".format(df_test.shape[0], df_test.shape[1]))
The training data have 11504798 rows and 12 columns.
The test data have 7669866 rows and 11 columns.
# prints information about df_train
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11504798 entries, 0 to 11504797
Data columns (total 12 columns):
# Column Dtype
--- ------ -----
0 id int64
1 Gender object
2 Age int64
3 Driving_License int64
4 Region_Code float64
5 Previously_Insured int64
6 Vehicle_Age object
7 Vehicle_Damage object
8 Annual_Premium float64
9 Policy_Sales_Channel float64
10 Vintage int64
11 Response int64
dtypes: float64(3), int64(6), object(3)
memory usage: 1.0+ GB
I’ll convert $Region_Code$ and $Policy_Sales_Channel$ to integers, since their values are whole numbers stored as floats.
df_train['Region_Code'] = df_train['Region_Code'].astype('int64')
df_test['Region_Code'] = df_test['Region_Code'].astype('int64')
df_train['Policy_Sales_Channel'] = df_train['Policy_Sales_Channel'].astype('int64')
df_test['Policy_Sales_Channel'] = df_test['Policy_Sales_Channel'].astype('int64')
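As an aside, df_train.info() reported over 1 GB for the training frame alone. A small optional sketch (my suggestion, not part of the original pipeline) that downcasts numeric columns to the smallest dtype their observed ranges allow:
# optional: downcast numeric columns (e.g. Age fits in int8, Vintage in int16)
for df in (df_train, df_test):
    for col in df.select_dtypes('number').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
The downcast is lossless: pd.to_numeric only converts a float column to an integer dtype when every value is a whole number.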
# get a quick overview of the numeric data
df_train.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))
|   | id | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 | 1.15048e+07 |
| mean | 5.7524e+06 | 38.3836 | 0.998022 | 26.4187 | 0.462997 | 30461.4 | 112.425 | 163.898 | 0.122997 |
| std | 3.32115e+06 | 14.9935 | 0.0444312 | 12.9916 | 0.498629 | 16454.7 | 54.0357 | 79.9795 | 0.328434 |
| min | 0 | 20 | 0 | 0 | 0 | 2630 | 1 | 10 | 0 |
| 25% | 2.8762e+06 | 24 | 1 | 15 | 0 | 25277 | 29 | 99 | 0 |
| 50% | 5.7524e+06 | 36 | 1 | 28 | 0 | 31824 | 151 | 166 | 0 |
| 75% | 8.6286e+06 | 49 | 1 | 35 | 1 | 39451 | 152 | 232 | 0 |
| max | 1.15048e+07 | 85 | 1 | 52 | 1 | 540165 | 163 | 299 | 1 |
# get a quick overview of the categorical data
df_train.describe(include='object')
|   | Gender | Vehicle_Age | Vehicle_Damage |
|---|---|---|---|
| count | 11504798 | 11504798 | 11504798 |
| unique | 2 | 3 | 2 |
| top | Male | 1-2 Year | Yes |
| freq | 6228134 | 5982678 | 5783229 |
We don’t see any inconsistencies in the training data.
3. Data Preparation
Data Transformation
Transform data into appropriate formats (e.g., scaling, encoding categorical variables).
# how many unique values do Gender, Vehicle_Age and Vehicle_Damage have?
df_train[['Gender', 'Vehicle_Age', 'Vehicle_Damage']].nunique()
Gender 2
Vehicle_Age 3
Vehicle_Damage 2
dtype: int64
# one-hot encode Gender and Vehicle_Damage as they each have 2 unique values
enc = OneHotEncoder(drop='if_binary', sparse_output=False, dtype=np.int64)
df_train['Gender'] = enc.fit_transform(df_train[['Gender']]) # 1 for Male and 0 otherwise
df_test['Gender'] = enc.transform(df_test[['Gender']]) # reuse the encoder fitted on the training data
df_train['Vehicle_Damage'] = enc.fit_transform(df_train[['Vehicle_Damage']]) # 1 for Yes and 0 otherwise
df_test['Vehicle_Damage'] = enc.transform(df_test[['Vehicle_Damage']])
# ordinal encoding for Vehicle_Age
enc = OrdinalEncoder(categories=[['< 1 Year', '1-2 Year', '> 2 Years']], dtype=np.int64)
df_train['Vehicle_Age'] = enc.fit_transform(df_train[['Vehicle_Age']])
df_test['Vehicle_Age'] = enc.transform(df_test[['Vehicle_Age']]) # reuse the encoder fitted on the training data
4. Data Exploration
Univariate Analysis
Categorical Variables
$Gender$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Gender')
<Axes: xlabel='Gender', ylabel='count'>
# show the proportion of each category
sns.countplot(data=df_train, x='Gender', stat='percent')
<Axes: xlabel='Gender', ylabel='percent'>
About 54 percent of the observations are male and the remaining 46 percent are female.
$Driving_License$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Driving_License')
<Axes: xlabel='Driving_License', ylabel='count'>
# show the proportion of each category
sns.countplot(data=df_train, x='Driving_License', stat='percent');
df_train['Driving_License'].value_counts(normalize=True)
Driving_License
1 0.998022
0 0.001978
Name: proportion, dtype: float64
Almost all customers have a driving license. Given how little variability this variable shows, I wonder whether it will be useful for predicting which customers respond positively to an automobile insurance offer.
$Region_Code$
# display the frequency of each category to compare the counts across categories.
plt.figure(figsize=(10, 6))
sns.countplot(data=df_train, x='Region_Code')
plt.xticks(rotation=45);
# show the proportion of each category
plt.figure(figsize=(10, 6))
sns.countplot(data=df_train, x='Region_Code', stat='percent');
plt.xticks(rotation=45);
# Summary Statistics
region_mode = df_train['Region_Code'].mode()[0]
unique_regions = df_train['Region_Code'].nunique()
print(f"Mode: {region_mode}")
print(f"Number of Unique Regions: {unique_regions}")
Mode: 28
Number of Unique Regions: 53
The customers are located in 53 different regions, and around 30 percent of them live in region 28.
$Previously_Insured$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Previously_Insured')
<Axes: xlabel='Previously_Insured', ylabel='count'>
# show the proportion of each category
sns.countplot(data=df_train, x='Previously_Insured', stat='percent');
df_train['Previously_Insured'].value_counts(normalize=True)
Previously_Insured
0 0.537003
1 0.462997
Name: proportion, dtype: float64
This variable is fairly evenly distributed, although there are about 7 percentage points more customers who were not previously insured than customers who were.
$Vehicle_Age$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Vehicle_Age');
# show the proportion of each category
sns.countplot(data=df_train, x='Vehicle_Age', stat='percent');
df_train['Vehicle_Age'].value_counts(normalize=True)
Vehicle_Age
1 0.520016
0 0.438438
2 0.041546
Name: proportion, dtype: float64
More than 95 percent of customers own a vehicle that is 2 years old or newer.
$Vehicle_Damage$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Vehicle_Damage');
# show the proportion of each category
sns.countplot(data=df_train, x='Vehicle_Damage', stat='percent');
About half of the customers have had their vehicle damaged in the past.
$Policy_Sales_Channel$
# display the frequency of each category to compare the counts across categories.
plt.figure(figsize=(10, 6))
sns.countplot(data=df_train, x='Policy_Sales_Channel');
plt.xticks(rotation=45);
# show the proportion of each category
plt.figure(figsize=(10, 6))
sns.countplot(data=df_train, x='Policy_Sales_Channel', stat='percent');
plt.xticks(rotation=45);
# Summary Statistics
channel_mode = df_train['Policy_Sales_Channel'].mode()[0]
unique_channels = df_train['Policy_Sales_Channel'].nunique()
print(f"Mode: {channel_mode}")
print(f"Number of Unique Regions: {unique_channels}")
Mode: 152
Number of Unique Regions: 152
There are 152 unique channels where insurance policies are sold and the channel 152 is the most observed channel in the data. Would this variable be related to $Region_Code$?
$Response$
# display the frequency of each category to compare the counts across categories.
sns.countplot(data=df_train, x='Response');
# show the proportion of each category
sns.countplot(data=df_train, x='Response', stat='percent');
Lastly, the dependent variable, $Response$, has an unbalanced distribution. The proportion of customers who responded negatively to an automobile insurance offer is much greater than the proportion of those who responded positively.
As you may already know, imbalanced classes can produce models with poor predictive performance, particularly for the minority class. In this competition, we are especially interested in correctly classifying the positive cases (customers who respond positively to an automobile insurance offer), so we’ll have to deal with this issue before modeling.
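As a quick preview of the fix we’ll apply inside the training loop later (random undersampling of the majority class via the RandomUnderSampler imported at the top; a sketch only):
# sketch: randomly drop majority-class rows until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(df_train.drop('Response', axis=1), df_train['Response'])
print(y_bal.value_counts(normalize=True))  # both classes now at 0.5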
We are now going to explore the continuous variables.
Continuous Variables
$Age$
# the frequency distribution of a numeric variable
sns.histplot(data=df_train, x='Age');
# combine the distribution shape with summary statistics
sns.violinplot(data=df_train, x='Age')
<Axes: xlabel='Age'>
# check if the data follows a normal distribution.
sm.qqplot(df_train['Age'], line ='45');
The variable $Age$ is not normally distributed, with most customers being between 20 and 30 years old. According to the violin plot, there appear to be no outliers.
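We can also back the visual check with a Kolmogorov–Smirnov test, using the kstest and norm imports from the top. A sketch on a 10,000-row subsample, since at 11.5 million rows even negligible deviations from normality come out as “significant”:
# KS test of Age against a normal with matching mean/std (on a subsample)
age_sample = df_train['Age'].sample(10000, random_state=42)
stat, p_value = kstest(age_sample, norm(loc=age_sample.mean(), scale=age_sample.std()).cdf)
print(f"KS statistic: {stat:.4f}, p-value: {p_value:.3g}")  # a tiny p-value rejects normality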
$Annual_Premium$
# the frequency distribution of a numeric variable
sns.histplot(data=df_train, x='Annual_Premium');
# combine the distribution shape with summary statistics
sns.violinplot(data=df_train, x='Annual_Premium')
<Axes: xlabel='Annual_Premium'>
# check if the data follows a normal distribution.
sm.qqplot(df_train['Annual_Premium'], line ='45');
The distribution of $Annual_Premium$ also appears far from normal. Moreover, it is right-skewed, with a very long right tail, so it is highly likely that there are outliers.
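As a quick first pass at identifying them, a sketch using the common 1.5×IQR rule (the rule itself is my assumption, not this notebook’s final outlier treatment):
# flag Annual_Premium values beyond Q3 + 1.5*IQR as potential outliers
q1, q3 = df_train['Annual_Premium'].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
n_out = (df_train['Annual_Premium'] > upper).sum()
print(f"Values above {upper:.0f}: {n_out} ({100 * n_out / len(df_train):.2f}%)")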
$Vintage$
This variable represents the number of days a customer has been associated with the insurance company.
# the frequency distribution of a numeric variable
sns.histplot(data=df_train, x='Vintage');
# combine the distribution shape with summary statistics
sns.violinplot(data=df_train, x='Vintage')
<Axes: xlabel='Vintage'>
# check if the data follows a normal distribution.
sm.qqplot(df_train['Vintage'], line ='45');
This variable seems to follow a uniform distribution as all the values are more or less equally likely.
Multivariate Analysis
plt.figure(figsize=(10, 10))
sns.heatmap(df_train.corr(), annot=True)
<Axes: >
Examining the correlation between the variables…
- There is a strong positive correlation between $Age$ and $Vehicle_Age$. This suggests that younger drivers tend to have newer vehicles, whereas older drivers are more likely to own older ones. A plausible explanation is that younger drivers prefer newer models, while older drivers keep their vehicles longer.
- $Previously_Insured$ and $Vehicle_Damage$ are strongly negatively correlated: drivers who were not insured are more likely to have experienced vehicle damage in the past, perhaps because insured drivers were more cautious or drove less riskily.
- $Previously_Insured$ is also negatively correlated with $Vehicle_Age$, which follows from $Vehicle_Age$ and $Vehicle_Damage$ being positively related: the newer the vehicle, the less likely it is to have been damaged.
- There is also a negative correlation between $Age$ and $Policy_Sales_Channel$, meaning that different channels are used to reach customers depending on their age. Older customers may favor traditional methods such as agents or phone, while younger customers are more likely to use digital channels.
- Since $Age$ and $Vehicle_Age$ are strongly and positively correlated, $Policy_Sales_Channel$ is also negatively correlated with $Vehicle_Age$.
- The dependent variable, $Response$, is negatively correlated with $Previously_Insured$: customers who are already insured are, unsurprisingly, more likely to respond negatively to insurance offers. In addition, $Response$ is positively related to $Vehicle_Damage$, probably because customers who have experienced vehicle damage are more inclined to insure their vehicle.
$Age$ and $Vehicle_Age$
sns.boxplot(x='Vehicle_Age', y='Age', data=df_train)
plt.show()
sns.histplot(data=df_train, x='Age', hue='Vehicle_Age', palette=['red', 'blue', 'green']);
As we’ve seen earlier in the correlation heatmap, the plots indicate that vehicles less than 1 year old are typically owned by drivers under 35, while older vehicles are owned by older drivers.
$Previously_Insured$ and $Vehicle_Damage$
sns.countplot(x='Previously_Insured', hue='Vehicle_Damage', data=df_train);
We’ve already seen that drivers who were not previously insured are more likely to have experienced damage to their vehicles.
g = sns.FacetGrid(df_train, col="Previously_Insured", row="Vehicle_Damage", margin_titles=True)
g.map(sns.countplot, "Response")
plt.show()
The majority of clients who respond positively to insurance offers are drivers who are not insured and have had their vehicles damaged in the past.
$Age$ and $Policy_Sales_Channel$
top7_policy_sales_channels = df_train['Policy_Sales_Channel'].value_counts()[:7].index
df_train_top7_policy_sales_channels = df_train[df_train['Policy_Sales_Channel'].isin(top7_policy_sales_channels)]
sns.boxplot(x='Policy_Sales_Channel', y='Age', data=df_train_top7_policy_sales_channels)
plt.show()
Channels 152 and 160 are the ones that reach young clients, whereas the others target older audiences.
5. Model Training
We now turn our attention to predicting whether a client responds positively or negatively. I’ll build an artificial neural network model for classification. Before that, I will normalize the input features (except the binary variables): if we feed unnormalized inputs to the activation functions, we can get stuck in a very flat region of the domain and may not learn at all, or worse, run into numerical issues.
Scaling Data
# normalize continuous features
scaler = MinMaxScaler()
df_train['Age'] = scaler.fit_transform(df_train[['Age']])
df_test['Age'] = scaler.transform(df_test[['Age']])
df_train['Annual_Premium'] = scaler.fit_transform(df_train[['Annual_Premium']])
df_test['Annual_Premium'] = scaler.transform(df_test[['Annual_Premium']])
df_train['Vintage'] = scaler.fit_transform(df_train[['Vintage']])
df_test['Vintage'] = scaler.transform(df_test[['Vintage']])
# normalize the encoded categorical features (again using the scaler fitted on the training data for the test set)
df_train['Region_Code'] = scaler.fit_transform(df_train[['Region_Code']])
df_test['Region_Code'] = scaler.transform(df_test[['Region_Code']])
df_train['Vehicle_Age'] = scaler.fit_transform(df_train[['Vehicle_Age']])
df_test['Vehicle_Age'] = scaler.transform(df_test[['Vehicle_Age']])
df_train['Policy_Sales_Channel'] = scaler.fit_transform(df_train[['Policy_Sales_Channel']])
df_test['Policy_Sales_Channel'] = scaler.transform(df_test[['Policy_Sales_Channel']])
Splitting Data
Split the training data into independent and dependent datasets.
y = df_train['Response']
X = df_train.drop(['Response', 'id'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
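One optional variant (my note, not what this notebook uses): given the roughly 12.3 percent positive rate, stratifying the split guarantees that both partitions keep the same class proportions.
# alternative split that preserves the class ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)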
Build ANN
class LinearLayer:
"""
This Class implements all functions to be executed by a linear layer
in a computational graph
Args:
input_shape: input shape of Data/Activations
n_out: number of neurons in layer
ini_type: initialization type for weight parameters, default is "plain"
Options are: plain, xavier and he
Methods:
forward(A_prev)
backward(upstream_grad)
update_params(learning_rate)
"""
def __init__(self, input_shape, n_out, ini_type="plain"):
"""
The constructor of the LinearLayer takes the following parameters
Args:
input_shape: input shape of Data/Activations
n_out: number of neurons in layer
ini_type: initialization type for weight parameters, default is "plain"
"""
self.m = input_shape[1] # number of examples in training data
# `params` store weights and bias in a python dictionary
self.params = initialize_parameters(input_shape[0], n_out, ini_type) # initialize weights and bias
self.Z = np.zeros((self.params['W'].shape[0], input_shape[1])) # create space for resultant Z output
def forward(self, A_prev):
"""
This function performs the forward propagation using activations from the previous layer
Args:
A_prev: Activations/Input Data coming into the layer from previous layer
"""
self.A_prev = A_prev # store the Activations/Training Data coming in
self.Z = np.dot(self.params['W'], self.A_prev) + self.params['b'] # compute the linear function
def backward(self, upstream_grad):
"""
This function performs the back propagation using upstream gradients
Args:
upstream_grad: gradient coming in from the upper layer to couple with local gradient
"""
# derivative of Cost w.r.t W
self.dW = np.dot(upstream_grad, self.A_prev.T)
# derivative of Cost w.r.t b, sum across rows
self.db = np.sum(upstream_grad, axis=1, keepdims=True)
# derivative of Cost w.r.t A_prev
self.dA_prev = np.dot(self.params['W'].T, upstream_grad)
def update_params(self, learning_rate=0.1):
"""
This function performs the gradient descent update
Args:
learning_rate: learning rate hyper-param for gradient descent, default 0.1
"""
self.params['W'] = self.params['W'] - learning_rate * self.dW # update weights
self.params['b'] = self.params['b'] - learning_rate * self.db # update bias(es)
class SigmoidLayer:
"""
This file implements activation layers
inline with a computational graph model
Args:
shape: shape of input to the layer
Methods:
forward(Z)
backward(upstream_grad)
"""
def __init__(self, shape):
"""
The constructor of the sigmoid/logistic activation layer takes in the following arguments
Args:
shape: shape of input to the layer
"""
self.A = np.zeros(shape) # create space for the resultant activations
def forward(self, Z):
"""
This function performs the forward propagation step through the activation function
Args:
Z: input from previous (linear) layer
"""
self.A = 1 / (1 + np.exp(-Z)) # compute activations
def backward(self, upstream_grad):
"""
This function performs the back propagation step through the activation function
Local gradient => derivative of sigmoid => A*(1-A)
Args:
upstream_grad: gradient coming into this layer from the layer above
"""
# couple upstream gradient with local gradient, the result will be sent back to the Linear layer
self.dZ = upstream_grad * self.A*(1-self.A)
def initialize_parameters(n_in, n_out, ini_type='plain'):
"""
Helper function to initialize some form of random weights and Zero biases
Args:
n_in: size of input layer
n_out: size of output/number of neurons
ini_type: set initialization type for weights
Returns:
params: a dictionary containing W and b
"""
params = dict() # initialize empty dictionary of neural net parameters W and b
if ini_type == 'plain':
params['W'] = np.random.randn(n_out, n_in) *0.01 # set weights 'W' to small random gaussian
elif ini_type == 'xavier':
params['W'] = np.random.randn(n_out, n_in) / (np.sqrt(n_in)) # set variance of W to 1/n
elif ini_type == 'he':
# Good when ReLU used in hidden layers
# Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
# Kaiming He et al. (https://arxiv.org/abs/1502.01852)
# http://cs231n.github.io/neural-networks-2/#init
params['W'] = np.random.randn(n_out, n_in) * np.sqrt(2/n_in) # set variance of W to 2/n
params['b'] = np.zeros((n_out, 1)) # set bias 'b' to zeros
return params
def compute_stable_bce_cost(Y, Z):
"""
This function computes the "Stable" Binary Cross-Entropy(stable_bce) Cost and returns the Cost and its
derivative w.r.t Z_last(the last linear node) .
The Stable Binary Cross-Entropy Cost is defined as:
=> (1/m) * np.sum(max(Z,0) - ZY + log(1+exp(-|Z|)))
Args:
Y: labels of data
Z: Values from the last linear node
Returns:
cost: The "Stable" Binary Cross-Entropy Cost result
dZ_last: gradient of Cost w.r.t Z_last
"""
m = Y.shape[1]
cost = (1/m) * np.sum(np.maximum(Z, 0) - Z*Y + np.log(1+ np.exp(- np.abs(Z))))
dZ_last = (1/m) * ((1/(1+np.exp(- Z))) - Y) # from Z computes the Sigmoid so P_hat - Y, where P_hat = sigma(Z)
return cost, dZ_last
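For reference on where the “stable” form comes from: substituting $\sigma(Z) = 1/(1+e^{-Z})$ into the usual binary cross-entropy $-[Y\log\sigma(Z) + (1-Y)\log(1-\sigma(Z))]$ and simplifying gives $Z - ZY + \log(1+e^{-Z})$, whose $e^{-Z}$ term overflows for large negative $Z$. Applying the identity $Z + \log(1+e^{-Z}) = \max(Z,0) + \log(1+e^{-|Z|})$ yields the formula implemented above, which never exponentiates a positive number.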
Train the Model
# define training constants
learning_rate = 1
number_of_epochs = 500
np.random.seed(48) # set seed value so that the results are reproducible
# (weights will now be initialized to the same pseudo-random numbers each time)
# Our network architecture has the shape:
# (input)--> [Linear->Sigmoid] -> [Linear->Sigmoid] -->(output)
#------ LAYER-1 ----- define hidden layer that takes in training data
Z1 = LinearLayer(input_shape=(10, 2263750), n_out=10, ini_type='plain') # 10 input features; 2263750 rows in the balanced undersampled training set
A1 = SigmoidLayer(Z1.Z.shape)
#------ LAYER-2 ----- define output layer that takes in values from the hidden layer
Z2 = LinearLayer(input_shape=A1.A.shape, n_out=1, ini_type='plain')
A2 = SigmoidLayer(Z2.Z.shape)
train_costs = [] # initially empty list; will store the training cost at each epoch
test_costs = []
# Set up the undersampling method
undersampler = RandomUnderSampler()
# Start training
for epoch in range(number_of_epochs):
# Apply the transformation to the dataset
X_train_sampled, y_train_sampled = undersampler.fit_resample(X_train, y_train)
X_train_sampled = np.array(X_train_sampled).T
y_train_sampled = np.array(y_train_sampled).reshape(-1, 1).T
# ------------------------- forward-prop -------------------------
Z1.forward(X_train_sampled)
A1.forward(Z1.Z)
Z2.forward(A1.A)
A2.forward(Z2.Z)
# ---------------------- Compute Cost ----------------------------
train_cost, dZ2 = compute_stable_bce_cost(y_train_sampled, Z2.Z)
train_costs.append(train_cost)
# ------------------------- back-prop ----------------------------
Z2.backward(dZ2)
A1.backward(Z2.dA_prev)
Z1.backward(A1.dZ)
# ---------------------- Forward pass on the test data -----------
Z1.forward(np.array(X_test).T)
A1.forward(Z1.Z)
Z2.forward(A1.A)
A2.forward(Z2.Z)
test_cost, dZ_last = compute_stable_bce_cost(np.array(y_test).reshape(-1, 1).T, Z2.Z)
test_costs.append(test_cost)
# print the costs every 100 epochs
if (epoch % 100) == 0:
print("Cost at epoch# {}: on the training data - {}, on the test data - {}".format(epoch, train_cost, test_cost))
# ----------------------- Update weights and bias ----------------
Z2.update_params(learning_rate=learning_rate)
Z1.update_params(learning_rate=learning_rate)
Cost at epoch# 0: on the training data - 0.6931342992576633, on the test data - 0.695047958701558
Cost at epoch# 100: on the training data - 0.5269429772504229, on the test data - 0.5486524265408564
Cost at epoch# 200: on the training data - 0.4444701708273118, on the test data - 0.49260638677894864
Cost at epoch# 300: on the training data - 0.44071909042647595, on the test data - 0.48967322066381624
Cost at epoch# 400: on the training data - 0.43906724946782094, on the test data - 0.48747623761542985
Evaluate the Model
We will now observe the cost on both the training and test data after each epoch. It can be seen that after approximately 150 epochs, the costs begin to stabilize and no longer decrease significantly. As expected, the cost on the test data is always higher than on the training data, since the model is fitted on the training data, not the test data.
plt.plot(train_costs, label='Training Cost')
plt.plot(test_costs, label='Test Cost')
plt.legend()
plt.show();
We will also examine the accuracy of the model’s predictions on both the training and test data.
# accuracy on the training data
# forward pass
Z1.forward(X_train.T)
A1.forward(Z1.Z)
Z2.forward(A1.A)
A2.forward(Z2.Z)
y_train_preds = np.round(A2.A).flatten().astype('int64')
y_train_arr = np.array(y_train)
train_acc = np.sum(y_train_preds == y_train_arr) / len(y_train_arr)
print(f"The accuracy on the training data is: {np.round(100*train_acc, 2)}%.")
# accuracy on the test data
# forward pass
Z1.forward(X_test.T)
A1.forward(Z1.Z)
Z2.forward(A1.A)
A2.forward(Z2.Z)
y_test_preds = np.round(A2.A).flatten().astype('int64')
y_test_arr = np.array(y_test)
test_acc = np.sum(y_test_preds == y_test_arr) / len(y_test_arr)
print(f"The accuracy on the test data is: {np.round(100*test_acc, 2)}%.")
The accuracy on the training data is: 64.0%.
The accuracy on the test data is: 64.05%.
We also generate a confusion matrix and calculate the ROC AUC score to gain additional insights into the performance of the classification model from different perspectives.
# confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test_arr, y_test_preds, values_format='.0f')
plt.show()
# roc auc score
print(f"ROC AUC Score: {roc_auc_score(y_test_arr, y_test_preds)}")
ROC AUC Score: 0.7869794981510511
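One caveat worth noting (my observation, not from the original notebook): roc_auc_score above was given the rounded 0/1 predictions. AUC is really a ranking metric, so feeding it the raw sigmoid outputs, as the Kaggle leaderboard does, typically yields a higher and more faithful score:
# ROC AUC on the predicted probabilities instead of hard 0/1 labels
print(f"ROC AUC Score (probabilities): {roc_auc_score(y_test_arr, A2.A.flatten())}")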
From the confusion matrix, we can compute other interesting metrics to evaluate the model’s performance (both are reproduced in the sketch below).
- Precision = Proportion of true positives over total number of samples predicted as positive. In this example, the precision equals 0.25.
- Recall = Proportion of true positives over total number of samples that are actually positive. Here, we have a recall of 0.98.
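A minimal sketch to reproduce these figures directly (precision_score and recall_score were not imported at the top, so we import them here):
from sklearn.metrics import precision_score, recall_score
print(f"Precision: {precision_score(y_test_arr, y_test_preds):.2f}")  # ~0.25
print(f"Recall: {recall_score(y_test_arr, y_test_preds):.2f}")        # ~0.98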
As you may already know, there is a trade-off between precision and recall: improving one usually lowers the other. For example, if a model is tuned to be more conservative in making positive predictions to increase precision, it may miss more true positive cases, lowering recall. Conversely, if the model is tuned to identify as many positive instances as possible, it may increase recall but also produce more false positives, lowering precision. This trade-off is an essential concept in assessing classification models, especially when classes are imbalanced or the costs of false positives and false negatives differ greatly. In our scenario, missing a true positive (failing to identify drivers who would respond positively to an insurance offer) is more detrimental, so we prioritize higher recall over higher precision.
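To make this trade-off concrete, here is a small sketch (my illustration, not part of the original pipeline) that moves the decision threshold over the sigmoid outputs for the test split; A2.A still holds the last forward pass on X_test:
from sklearn.metrics import precision_score, recall_score
probs = A2.A.flatten()  # sigmoid outputs from the forward pass on X_test above
for thr in (0.3, 0.5, 0.7):
    preds = (probs >= thr).astype(int)
    print(f"threshold={thr}: precision={precision_score(y_test_arr, preds):.2f}, "
          f"recall={recall_score(y_test_arr, preds):.2f}")
Raising the threshold makes the model more conservative (higher precision, lower recall); lowering it does the opposite.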
Make Predictions on the Test Data
With our model ready, we’ll now generate predictions on the competition’s test data and submit them to Kaggle to see how well the model performs on unseen data.
Z1.forward(np.array(df_test.drop(['id'], axis=1)).T)
A1.forward(Z1.Z)
Z2.forward(A1.A)
A2.forward(Z2.Z)
submission = pd.DataFrame({'id': df_test['id'], 'Response': A2.A.flatten()})
submission.to_csv('insurance_nn.csv', index=False)
The score after submitting to Kaggle was 0.83352. Considering that the model I created was just a basic neural network, I’m quite pleased with the result. In the future, I’d like to experiment with more sophisticated models to see how much I can improve the performance.