The other night, I was browsing the YouTube page of StatQuest with Josh Starmer and noticed recent uploads about CatBoost, a machine learning algorithm I was unfamiliar with. Intrigued, I decided to watch the series (parts 1 and 2) along with the related videos that helped me grasp the algorithm better. The purpose of this post is to summarize what I learned so that I can revisit and refresh the concept whenever needed. Without further delay, let’s begin!

CatBoost, short for Categorical Boosting, is a machine learning algorithm that bears a strong resemblance to Gradient Boost and XGBoost. As its name implies, CatBoost takes a distinctive approach to handling categorical variables. Many machine learning methods struggle with categorical features, so before constructing a model we typically convert categorical variables into numerical values using one of several techniques.

One popular method is One-Hot encoding, which generates a new binary column for each category of a categorical feature; it is typically used when a feature has three or more categories, since a feature with only two can be represented by a single binary column. To illustrate, let’s consider a scenario where we want to predict whether a person enjoys drinking coffee based on their country of origin (Spain, South Korea, and the United States). After one-hot encoding, the $Country$ column is replaced by three binary features, one for each country.

Before Encoding…

| Country | LikeCoffee |
| --- | --- |
| Spain | Yes |
| South Korea | Yes |
| The United States | No |
| South Korea | Yes |
| The United States | Yes |
| Spain | No |

After One-Hot Encoding…

| Country_Spain | Country_South_Korea | Country_The_United_States | LikeCoffee |
| --- | --- | --- | --- |
| 1 | 0 | 0 | Yes |
| 0 | 1 | 0 | Yes |
| 0 | 0 | 1 | No |
| 0 | 1 | 0 | Yes |
| 0 | 0 | 1 | Yes |
| 1 | 0 | 0 | No |
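
For reference, here is a minimal sketch of this transformation in pandas (the DataFrame below simply reproduces the example table):

```python
import pandas as pd

# The example data from the table above
df = pd.DataFrame({
    "Country": ["Spain", "South Korea", "The United States",
                "South Korea", "The United States", "Spain"],
    "LikeCoffee": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# pd.get_dummies creates one 0/1 column per category in Country
# (pandas keeps spaces in category names, e.g. "Country_South Korea")
encoded = pd.get_dummies(df, columns=["Country"], dtype=int)
print(encoded)
```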

One notable drawback of this method is that it scales poorly when a categorical variable has a large number of options, since the number of new columns equals the number of categories. In such cases, an alternative approach known as Label Encoding can be employed. This method assigns an arbitrary integer to each option. However, since these numbers carry no inherent meaning, certain machine learning algorithms may interpret them as having a meaningful order, leading to potential issues.

After Label Encoding…

| Country | LikeCoffee |
| --- | --- |
| 0 | Yes |
| 1 | Yes |
| 2 | No |
| 1 | Yes |
| 2 | Yes |
| 0 | No |
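
Here is a small sketch of label encoding. The explicit mapping below reproduces the table above; scikit-learn’s LabelEncoder accomplishes the same thing but assigns the integers in alphabetical order instead:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain", "South Korea", "The United States",
                "South Korea", "The United States", "Spain"],
    "LikeCoffee": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
})

# Assign an arbitrary integer to each option (matching the table above)
mapping = {"Spain": 0, "South Korea": 1, "The United States": 2}
df["Country"] = df["Country"].map(mapping)
print(df)
```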

Another commonly used encoding method is called Target Encoding. This encoding technique leverages the target variable, which is the variable we aim to predict, to transform each option of a categorical feature into a numerical representation. In general, target encoding involves calculating a weighted mean that combines the mean for a specific option with the overall mean of the target variable. This approach allows us to approximate the best estimate by assigning a value close to the overall mean for an option when there is limited data available for that option.

The weighted mean calculation can be expressed as follows:

$$\text{Weighted Mean} = \frac{n \times \text{Option Mean} + m \times \text{Overall Mean}}{n + m}$$

Here, $n$ is the weight for the Option Mean (i.e., the number of rows associated with that option), and $m$ is the weight for the Overall Mean (a hyperparameter). In our example, setting $m$ to 1 means an option needs at least 2 data points for its Option Mean to have more influence than the Overall Mean. For instance, Spain appears in $n = 2$ rows with an Option Mean of $0.5$, and the Overall Mean of $LikeCoffee$ (treating Yes as 1 and No as 0) is $4/6 \approx 0.67$, so Spain is encoded as $\frac{2 \times 0.5 + 1 \times 0.67}{2 + 1} \approx 0.56$. After applying target encoding, the data appears as shown below…

After Target Encoding…

| Country | LikeCoffee |
| --- | --- |
| 0.56 | Yes |
| 0.89 | Yes |
| 0.56 | No |
| 0.89 | Yes |
| 0.56 | Yes |
| 0.56 | No |
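
The same weighted-mean computation can be sketched in pandas as follows (encoding Yes as 1 and No as 0, with $m = 1$ as in the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain", "South Korea", "The United States",
                "South Korea", "The United States", "Spain"],
    "LikeCoffee": [1, 1, 0, 1, 1, 0],  # Yes = 1, No = 0
})

m = 1  # weight given to the overall mean (hyperparameter)
overall_mean = df["LikeCoffee"].mean()  # 4/6 ~ 0.67

# Per option: (n * option_mean + m * overall_mean) / (n + m)
stats = df.groupby("Country")["LikeCoffee"].agg(["mean", "count"])
encoding = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)

df["Country"] = df["Country"].map(encoding)
print(df.round(2))  # Spain and the US -> 0.56, South Korea -> 0.89
```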

Nevertheless, it is important to acknowledge that using the target variable to modify the values of categorical features in this method introduces a phenomenon known as data leakage (as described by Josh Starmer in the CatBoost series). Data leakage occurs when the encoding of each data row is influenced by its corresponding target value, leading to models that perform well on the training data but may not generalize effectively to the testing data (overfitting).

To address this issue and mitigate data leakage, one popular alternative is K-Fold Target Encoding. In this approach, the training data is divided into K equally sized subsets, and each subset is target-encoded using statistics computed from the remaining subsets. This ensures that a row’s encoding never has access to the target values in its own subset, reducing the risk of data leakage. It is worth noting that when K equals the number of rows in the data, K-Fold target encoding is equivalent to Leave-One-Out target encoding.

Let’s consider an example where we aim to perform 3-Fold Target Encoding. In this case, the data would be divided as illustrated below…

Subset 1

| Country | LikeCoffee |
| --- | --- |
| Spain | Yes |
| South Korea | Yes |

Subset 2

| Country | LikeCoffee |
| --- | --- |
| The United States | No |
| South Korea | Yes |

Subset 3

| Country | LikeCoffee |
| --- | --- |
| The United States | Yes |
| Spain | No |

To perform target encoding on Subset 1, we use Subset 2 and Subset 3. For instance, to replace the Spain value in Subset 1, we look only at the other subsets: Spain appears there once (with a $LikeCoffee$ value of No, so its Option Mean is 0 and $n = 1$), and the Overall Mean of the other subsets is $0.5$. Plugging these into the weighted mean equation with $m$ set to 1 gives $\frac{1\times 0 + 1\times 0.5}{1+1} = 0.25$ for the Spain row in Subset 1. Repeating this for every row, the data with 3-Fold target encoding appears as follows:

| Country | LikeCoffee |
| --- | --- |
| 0.25 | Yes |
| 0.75 | Yes |
| 0.875 | No |
| 0.875 | Yes |
| 0.375 | Yes |
| 0.875 | No |
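
Below is a sketch of this procedure using scikit-learn’s KFold; without shuffling, the three folds correspond exactly to the subsets above. The fallback to the overall mean for options unseen in the other folds is one reasonable choice, not the only one:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "Country": ["Spain", "South Korea", "The United States",
                "South Korea", "The United States", "Spain"],
    "LikeCoffee": [1, 1, 0, 1, 1, 0],  # Yes = 1, No = 0
})

m = 1  # weight given to the overall mean
encoded = pd.Series(index=df.index, dtype=float)

# Each fold is encoded using statistics from the other folds only
for rest_idx, fold_idx in KFold(n_splits=3).split(df):
    rest = df.iloc[rest_idx]
    overall_mean = rest["LikeCoffee"].mean()
    stats = rest.groupby("Country")["LikeCoffee"].agg(["mean", "count"])
    enc = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)
    # Options unseen in the other folds fall back to the overall mean
    values = df["Country"].iloc[fold_idx].map(enc).fillna(overall_mean)
    encoded.iloc[fold_idx] = values.values

df["Country"] = encoded
print(df)  # 0.25, 0.75, 0.875, 0.875, 0.375, 0.875
```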

It appears that the K-Fold target encoding scheme effectively reduces the risk of data leakage. However, the authors of the CatBoost algorithm have considered a scenario where this scheme may not be applicable. This scenario is illustrated in the following table.

| Country | LikeCoffee |
| --- | --- |
| Spain | Yes |
| Spain | Yes |
| Spain | No |
| Spain | Yes |
| Spain | Yes |
| Spain | No |

We observe that the $Country$ column contains only one value, Spain. If we apply Leave-One-Out target encoding to the data…

| Country | LikeCoffee |
| --- | --- |
| 0.6 | Yes |
| 0.6 | Yes |
| 0.8 | No |
| 0.6 | Yes |
| 0.6 | Yes |
| 0.8 | No |

In other words, individuals who like drinking coffee are assigned a $Country$ value of 0.6, while those who do not are assigned 0.8. This demonstrates data leakage: from the value of $Country$ alone, we can perfectly predict the value of $LikeCoffee$. Josh Starmer believes such cases are uncommon, since we typically avoid including variables that take only one value, as they provide no useful information for any type of model. And even if such a feature were included, it would be simpler to encode it by mapping its single option to a constant… Nevertheless, it is worth examining the encoding process of the CatBoost algorithm to gain insight into its underlying mechanism.

CatBoost encodes each row sequentially, so the order of the data affects the encoding outcome. For every row, the following formula is calculated:

$$\frac{\text{OptionCount} + 0.05}{n + 1}$$

Here, OptionCount is the number of previously seen rows that share the current row’s categorical value and have a target value of 1; the constant 0.05, referred to as the prior (or guess), is a hyperparameter; and $n$ is the number of previously seen rows with the same categorical value. Applying this encoding to the example data with the prior set to 0.05, the first row’s value is $\frac{0 + 0.05}{0 + 1} = 0.05$, since no earlier rows have Spain as their $Country$ value. The second row is likewise encoded as $\frac{0 + 0.05}{0 + 1} = 0.05$, because the only earlier row belongs to Spain, not South Korea. Continuing this process, the data takes on the following appearance:

Before CatBoost Encoding…

| Country | LikeCoffee |
| --- | --- |
| Spain | Yes |
| South Korea | Yes |
| The United States | No |
| South Korea | Yes |
| The United States | Yes |
| Spain | No |

After CatBoost Encoding…

| Country | LikeCoffee |
| --- | --- |
| 0.05 | Yes |
| 0.05 | Yes |
| 0.05 | No |
| 0.525 | Yes |
| 0.025 | Yes |
| 0.525 | No |
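
Here is a minimal sketch of this sequential encoding that reproduces the table above. Note that the actual CatBoost implementation builds on this idea using random permutations of the rows, which this sketch omits:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain", "South Korea", "The United States",
                "South Korea", "The United States", "Spain"],
    "LikeCoffee": [1, 1, 0, 1, 1, 0],  # Yes = 1, No = 0
})

prior = 0.05  # the "guess" hyperparameter
seen = {}     # per option: (rows seen so far, rows seen with target 1)
encoded = []

# Encode each row using only the rows that came before it
for country, likes in zip(df["Country"], df["LikeCoffee"]):
    n, option_count = seen.get(country, (0, 0))
    encoded.append((option_count + prior) / (n + 1))
    seen[country] = (n + 1, option_count + likes)

df["Country"] = encoded
print(df)  # 0.05, 0.05, 0.05, 0.525, 0.025, 0.525
```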

Now that we have grasped CatBoost’s encoding process, the next step is to see the algorithm in action. In the follow-up article, “CatBoost, Part 2 - Evaluating CatBoost Performance on Data Science Salaries Dataset,” I will experiment with a CatBoost model on a dataset containing information about data scientists’ salaries, assessing how effectively the algorithm predicts them.