In this article, my main focus will be on applying the CatBoost algorithm discussed in my previous post to an actual dataset. If you haven’t had a chance to read that post yet, I recommend checking out CatBoost Part 1 - How CatBoost deals with categorical features. The dataset I’ll be working with contains the salaries of data scientists, and I’ll use it to predict their experience levels with CatBoost. If you’d like more information about the data, you can refer to Data Science Salaries 2023. However, instead of diving straight into modeling, I’ll first explore the dataset to better understand the relationships between its variables. To begin, let’s take a look at the first five rows of the data.

1. Exploratory Data Analysis

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |

There are 11 columns, with the experience_level column as the target variable for classification. The column names give us a hint about what each column represents, but a detailed description of each column can be found in the first table below. Examining the first five rows, we can also see that each variable is either numerical or categorical. The second table below lists the column types, designated as int64 for numeric or object for categorical, and also shows that there are no missing values: with a total of 3,755 rows, every feature is listed as ‘3755 non-null’ in the “Non-Null Count” column.

| Column | Description |
| --- | --- |
| work_year | The year the salary was paid |
| experience_level | The experience level in the job during the year |
| employment_type | The type of employment for the role |
| job_title | The role worked in during the year |
| salary | The total gross salary amount paid |
| salary_currency | The currency of the salary paid as an ISO 4217 currency code |
| salary_in_usd | The salary in USD |
| employee_residence | Employee’s primary country of residence during the work year as an ISO 3166 country code |
| remote_ratio | The overall amount of work done remotely |
| company_location | The country of the employer’s main office or contracting branch |
| company_size | The median number of people that worked for the company during the year |

| Column | Type | Non-Null Count |
| --- | --- | --- |
| work_year | int64 | 3755 non-null |
| experience_level | object | 3755 non-null |
| employment_type | object | 3755 non-null |
| job_title | object | 3755 non-null |
| salary | int64 | 3755 non-null |
| salary_currency | object | 3755 non-null |
| salary_in_usd | int64 | 3755 non-null |
| employee_residence | object | 3755 non-null |
| remote_ratio | int64 | 3755 non-null |
| company_location | object | 3755 non-null |
| company_size | object | 3755 non-null |
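
The head and summary above can be reproduced with a few lines of pandas. The file name below is a placeholder for wherever your copy of the Data Science Salaries 2023 data lives:

```python
import pandas as pd

# Load the salaries dataset (the file name is a placeholder; adjust it to your copy)
df = pd.read_csv("ds_salaries.csv")

# First five rows, as shown in the first table above
print(df.head())

# Column types and non-null counts, as summarized in the second table above
df.info()
```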

1.1. Univariate Data Analysis

1.1.1. Target Variable: experience_level

Having obtained a brief understanding of the data, our next step is to dig deeper by examining individual variables. The first variable of interest is experience_level, which serves as the dependent variable. To understand a categorical variable, it is common practice to use a bar plot that displays the frequency of observations within each category. Before exploring the bar plot, it is worth clarifying what each category of the experience_level feature means. The feature comprises four categories: ‘EN’, ‘SE’, ‘MI’, and ‘EX’. The two-letter codes hint at their meanings: ‘EN’ denotes an entry-level data scientist in a given year, ‘SE’ a senior-level position, ‘MI’ a mid/intermediate level, and ‘EX’ an executive level. The accompanying count plot shows that more than half of the data scientists in the dataset (67%) held senior positions over the recorded periods, followed by intermediate-level individuals (21%), entry-level professionals (9%), and lastly executives (3%).

count_experience_level
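
If you’d like to reproduce a count plot like this yourself, a minimal sketch with seaborn (assuming the DataFrame df loaded earlier) could look like the following:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count plot of the target variable, ordered from most to least frequent category
order = df["experience_level"].value_counts().index
sns.countplot(data=df, x="experience_level", order=order)
plt.title("Number of observations per experience level")
plt.show()

# Relative frequencies behind the percentages quoted above
print(df["experience_level"].value_counts(normalize=True).round(2))
```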

Moving forward, our focus will shift toward analyzing the independent features based on their respective data types. As previously observed, these features can be classified as either numeric or categorical. We will first direct our attention to the numerical variables, namely salary_in_usd, remote_ratio, and work_year. Although there is another numeric feature called salary, for the sake of simplicity we will exclude that column from our analysis and concentrate solely on the salary_in_usd variable. This streamlines the examination, as these salary amounts have already been converted to US dollars, eliminating the need for additional currency conversions.

1.1.2. Numeric Independent Variables

1.1.2.1. salary_in_usd

When it comes to numeric features, plotting a histogram allows us to gain insight into their distribution. Examining the histogram of salary_in_usd, we observe that the salary distribution skews to the right, although the degree of skewness does not appear to be excessively pronounced. In fact, the skewness value is 0.53, indicating a moderate positive skew: most salaries are concentrated at the lower end of the range, with a longer tail extending toward higher salaries. The kurtosis value, which indicates how heavy the tails of the distribution are, is 0.83 for salary_in_usd. This suggests that the variable is likely to have fewer outliers than variables with higher kurtosis values.

Furthermore, the histogram lets us confirm the absence of inconsistencies in the variable. The minimum salary in US dollars is greater than zero, indicating a logical lower limit, while the maximum salary does not exhibit an unrealistically large value.

hist_salary
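
As a sketch, the histogram along with the skewness and kurtosis figures quoted above could be obtained as follows (the number of bins and the KDE overlay are illustrative choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the salaries converted to USD
sns.histplot(data=df, x="salary_in_usd", bins=30, kde=True)
plt.title("Distribution of salary_in_usd")
plt.show()

# Skewness and (excess) kurtosis of the distribution
print("skewness:", round(df["salary_in_usd"].skew(), 2))
print("kurtosis:", round(df["salary_in_usd"].kurt(), 2))
```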

1.1.2.2. work_year

Moving on, let’s focus on the next numeric variable, which is work_year. This variable represents the year in which the salary was paid. Although the values in this column are numerical, similar to salary_in_usd, there are only four distinct and countable values (2020, 2021, 2022, and 2023). Consequently, we can classify work_year as a discrete variable, in contrast to salary_in_usd, which is continuous. To visualize the distribution of work_year, we employ a count plot instead of a histogram. Upon examining the count plot, we observe that the frequencies for the years 2022 and 2023 dominate the dataset, accounting for more than 91% of the total observations. This substantial increase in frequencies suggests a rapid surge in demand for data scientist jobs starting from 2022, with the trend continuing into 2023.

count_work_year

1.1.2.3. remote_ratio

Lastly, let’s examine the distribution of the remote_ratio feature. Similar to work_year, this feature is also discrete since it can take on only three possible values: 0 (indicating no remote work), 50 (representing partial remote work), and 100 (indicating full remote work). Once again, we can utilize a count plot to visualize the distribution of this variable, which is displayed below. From the count plot, we observe that the vast majority of data scientists in the dataset either work fully remotely or fully on-site, with only a small share working partially remotely.

It is worth noting that the distribution of remote_ratio may be influenced by the work_year variable, primarily due to the impact of the COVID-19 pandemic that emerged in 2019. It is highly likely that in the years 2020 and 2021, there was a significant increase in remote work due to the prevailing circumstances. However, as time progressed and the situation improved, it is plausible to expect a gradual decline in the proportion of remote work in subsequent years.

count_remote_ratio

1.1.3. Categorical Independent Variables

1.1.3.1. employment_type

Now let’s shift our focus to the categorical features. Like the target variable, we will generate a count plot to examine the distribution of the categorical features. The initial variable we will analyze is employment_type, which consists of four categories: FT (Full-time), PT (Part-time), CT (Contractual), and FL (Freelancer). Initially, I assumed that ‘contract’ and ‘freelance’ represented the same concept. However, according to the article available at Contractor vs Freelancer: What’s The Difference?, contractors typically work for a single client who has control over their work location and conditions, with recurrent jobs, while freelancers work independently on multiple projects, without their hours or location being dictated. Regardless, in this dataset, the number of full-time data scientists is significantly higher compared to contractors, part-time employees, and freelancers, which together account for approximately 1% of the total observations.

count_employment_type

1.1.3.2. employee_residence

The next categorical variable to examine is employee_residence, which tells us in which country each data scientist was living during a work year. Although this variable is categorical, it has 78 different country values, making its distribution difficult to analyze. Thus, I mapped the countries to the geographical regions defined by the UNICEF Regional Classification: East Asia and Pacific, Eastern Europe and Central Asia, Western Europe, Latin America and Caribbean, Middle East and North Africa, North America, South Asia, Eastern and Southern Africa, and West and Central Africa. As expected, most data scientists resided in North America (Canada and the United States), followed by Western Europe and South Asia.

count_employee_residence_region
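
A sketch of the mapping step might look like the one below. Only a handful of countries are shown for illustration; the full dictionary would cover all 78 ISO 3166 codes present in the data, each assigned to its UNICEF region:

```python
# Partial mapping from ISO 3166 country codes to UNICEF regions (illustrative subset)
region_map = {
    "US": "North America", "CA": "North America",
    "ES": "Western Europe", "DE": "Western Europe", "GB": "Western Europe",
    "IN": "South Asia", "PK": "South Asia",
    # ... remaining country codes mapped to their UNICEF region
}

# Create the region-level versions of the two country columns
df["employee_residence_region"] = df["employee_residence"].map(region_map)
df["company_location_region"] = df["company_location"].map(region_map)
```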

1.1.3.3. company_location

The variable in question, which represents the location of companies where data scientists are employed, consists of country names as its values. Due to a large number of distinct values, I employed the same mapping approach used for the previous feature, employee_residence. After applying the mapping and examining the distribution of the feature’s categories, similar to employee_residence_region, it became evident that a majority of the companies are situated in North America (83.4%). It is likely that this variable exhibits a correlation with employee_residence_region, although the strength of this relationship could be influenced by the extent to which remote work is permitted. If a company strictly requires its employees to work from the office every day and prohibits remote work, the correlation between employee_residence_region and company_location_region would be 1. Conversely, if employees have the flexibility to work from home without commuting to the office, the correlation could be significantly weaker.

count_company_location_region

1.1.3.4. company_size

Lastly, this feature provides information about the size of the company, which is classified into three categories: S (Small), M (Medium), and L (Large). Although the exact criteria for determining the size is unknown, it is likely based on the number of employees. The count plot presented below reveals that a significant portion of the companies falls into the medium-size category. Specifically, 14% of the companies are classified as large, while a mere 4% are categorized as small.

count_company_size

We have completed the analysis of univariate data. As you may be aware, I did not examine the variable job_title due to its extensive range of categories (93 job titles), making it difficult to analyze the distribution of categories at a glance. For the employee_residence and company_location features, we performed a mapping process to reduce the number of categories based on a reasonable criterion. However, in the case of this variable, I am unable to clearly identify how it can be grouped into fewer categories that effectively differentiate among them. Moreover, even if one person is an ML Engineer and another is a data scientist, their daily job tasks may be quite similar. Consequently, I will disregard this variable for the remainder of the analysis and utilize the other features to predict the experience level of data scientists.

1.2. Multivariate Data Analysis

Let’s delve into the analysis to explore the relationships between the variables. Initially, we will examine the relationship between each independent variable and the target variable. Subsequently, we will assess the connections among the independent variables to identify any potential multicollinearity. A high correlation between two variables can pose challenges in determining the individual impact of each independent variable on the dependent variable. Consequently, if such a situation arises, we will need to eliminate one of the variables to mitigate this issue.

To evaluate the relationship between the predictor variables and the outcome variable, we will employ two types of plots. The first type consists of box plots, which allow us to analyze the distribution of numerical variables across each category of the target variable. The second type comprises categorical plots, which illustrate the distribution of categories within each level of a categorical independent variable with respect to the target variable. These plots will facilitate the assessment of the association between the predictor variables and the outcome variable.

1.2.1. experience_level and salary_in_usd

As salary_in_usd is a numeric variable, we will plot a box plot that shows the distribution of salaries across the experience levels. As we can see, the median salary of data scientists at the executive level is higher than at the other levels, followed by senior, intermediate, and finally entry level. Note, however, that the minimum salaries are very similar across the experience levels. This makes me think that location matters: an executive-level data scientist living in South Asia, where the cost of living is likely lower than in North America or Western Europe, may earn as much as an entry-level data scientist living in Switzerland, for example.

box_salary_experience_level
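
The box plot can be sketched with seaborn as follows (the category order is fixed from entry to executive level for readability):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Salary distribution (in USD) within each experience level
sns.boxplot(data=df, x="experience_level", y="salary_in_usd",
            order=["EN", "MI", "SE", "EX"])
plt.title("salary_in_usd by experience_level")
plt.show()
```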

1.2.2. experience_level and work_year

Let’s shift our focus to the work_year variable. Given its discrete nature, we will generate a bar plot to visualize its distribution across different categories of the target variable. Indeed, we can observe that the data includes a substantial amount of information about data scientists for both the years 2022 and 2023.

count_work_year_experience_level
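
One way to sketch such a categorical plot is a normalized crosstab, which shows the share of each work_year within every experience level:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Share of each work_year within every experience level
shares = pd.crosstab(df["experience_level"], df["work_year"], normalize="index")
shares.plot(kind="bar", stacked=True)
plt.ylabel("Share of observations")
plt.title("work_year distribution within each experience_level")
plt.show()
```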

1.2.3. experience_level and remote_ratio

Continuing with our analysis, we will investigate the relationship between experience_level and the remote_ratio variable. It is evident that the majority of data scientists, regardless of their experience level, work either fully remotely or fully on-site.

count_remote_ratio_experience_level

1.2.4. experience_level and employment_type

Based on the bar plot below, it appears that the variable employment_type does not exhibit a substantial association with experience_level. Over 95% of data scientists, regardless of their experience level, are full-time employees. Consequently, employment_type does not seem to provide much explanatory power for the target variable, and we are likely to exclude it from our modeling efforts.

count_employment_type_experience_level

1.2.5. experience_level and employee_residence_region

Regarding the variable employee_residence_region, it’s worth noting that over 80% of data scientists at all experience levels are located in North America and Western Europe. Consequently, much like employment_type, employee_residence_region may not provide significant help in predicting a data scientist’s experience level.

count_employee_residence_region_experience_level

1.2.6. experience_level and company_location_region

When examining the variable company_location_region, we arrive at a similar conclusion as we did with employee_residence_region. Additionally, there is a strong correlation between company_location_region and employee_residence_region, with over 80% of employees residing in the same country where their companies are situated. As a result, we could omit one of these variables when creating the model later on.

count_company_location_region_experience_level

count_employee_residence_region_company_location_region
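
The overlap between the two location variables can be checked directly; the snippet below is a sketch of how the same-country share and the region-level association could be computed:

```python
import pandas as pd

# Share of rows where the employee lives in the same country as the company
same_country = (df["employee_residence"] == df["company_location"]).mean()
print(f"Employees residing in the company's country: {same_country:.1%}")

# Region-level association between residence and company location
print(pd.crosstab(df["employee_residence_region"], df["company_location_region"]))
```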

1.2.7. experience_level and company_size

The last variable whose relationship with the target we will check is company_size. Except for the entry level, more than 80% of the data scientists at each experience level work in medium-sized companies, followed by large and small companies. Among entry-level data scientists, about half work in medium-sized companies, and the rest work in either large or small companies.

count_company_size_experience_level

2. Modeling

Now that our data exploration is complete, we can dive straight into the modeling phase.

2.1. Data Preparation

Before proceeding, data preparation is crucial to ensure its suitability for the model. The initial step involves defining the dependent and independent variables. The dependent variable is already determined as experience_level. As for the predictors, we will include all variables except for employee_residence_region due to its limited variation and high correlation with company_location_region. Additionally, we will exclude job_title, employee_residence, and company_location.

At this stage, categorical variables have not yet been encoded into numerical ones, as the encoding process will be handled automatically during the training of the CatBoost model. Our focus, for now, is solely on identifying the categorical features.

Moving forward, the subsequent step entails dividing the data into training and test datasets. In this case, the data will be split, with 20% of it designated for the testing dataset.
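
Putting the preparation steps together, a minimal sketch could look like the following. The exact set of dropped columns and the split settings (such as random_state and stratification) are choices on my part rather than prescriptions:

```python
from sklearn.model_selection import train_test_split

# Target and predictors; drop the columns discussed above, plus the raw salary
# columns that are redundant with salary_in_usd
y = df["experience_level"]
X = df.drop(columns=["experience_level", "employee_residence_region",
                     "job_title", "employee_residence", "company_location",
                     "salary", "salary_currency"])

# Categorical features are identified by dtype; CatBoost will encode them itself
cat_features = [col for col in X.columns if X[col].dtype == "object"]

# 80/20 train/test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```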

We are now prepared to define and train the model using the training dataset. Before we proceed, I’d like to provide an explanation of how the CatBoost algorithm operates, so we can gain insight into its underlying mechanism.

  1. Randomize the rows of the training dataset.

  2. Implement ordered target encoding for all discrete columns with more than 2 categories (a small sketch of this encoding appears after this list):
    • For binary categorical columns, simply replace the two categories with 1s and 0s.
    • For continuous target variables, apply binning to group the target values into a smaller number of bins; this binned version of the target is used only for the encoding, not for building the model.
  3. Initialize model predictions to 0 for all rows.

  4. Calculate the initial residuals by finding the differences between observed values and predictions.

  5. Start building a tree.

  6. For the root of the tree, determine the best threshold:
    • Sort the values in ascending order to identify potential thresholds.
    • CatBoost restricts the maximum number of thresholds tested by grouping values close to each other into the same bin.
    • When building larger trees, CatBoost constructs Oblivious or Symmetric Decision Trees:
      • Symmetric Decision Trees use the exact same threshold for each node at the same level.
      • This can make predictions faster than normal trees, as it asks fewer questions, although it may slightly reduce predictive accuracy compared to non-symmetric trees (the primary goal of gradient boosting is to combine weak learners to make decisions).
  7. Set the initial outputs of the leaves to 0.

  8. Starting from the first row, place its residual in the leaf based on the root’s threshold with a leaf output of 0. Update the leaf output to be the average of the residuals in it. As this is the initial step, the output will be equal to the first residual.
    • Residuals from a row are not used in calculating the leaf output for that same row, preventing leakage.
  9. Continue through the remaining rows and update the tree’s outputs.

  10. Quantify the quality of predictions for each threshold by calculating the Cosine Similarity between the leaf output and residuals:
    • Cosine Similarity is a metric that measures similarity or dissimilarity between entities.
    • For instance, when comparing phrases, we can represent them as vectors and calculate the cosine of the angle between them.
    • Cosine Similarity ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction); for vectors with non-negative entries it falls between 0 and 1.
  11. Build trees for other thresholds and compare their cosine similarity scores. Choose the threshold that yields the highest score. For larger trees, evaluate each split in the same way by comparing the Cosine Similarity of each potential threshold.

  12. Build the next tree:
    • Reset the residuals.
    • Update predictions by adding the scaled leaf output values (using a learning rate) to them. New Prediction = Prediction + (Learning Rate * Leaf Output)
    • Update the residuals.
    • Restore the values of the previously encoded categorical variable to their original names.
    • Randomize the rows and use the discrete variable for the target to apply ordered target encoding to the categorical variable.
    • Sort the encoded variable to identify the thresholds for the tree.
    • Repeat the process.
  13. Make predictions for new data:
    • Use the original training dataset with the discrete variable for the target.
    • Calculate the target encoding using all rows that share the same categorical variable value.
    • Traverse the data down the trees to predict outcomes: Prediction = Learning Rate * (Sum of Outputs).
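
To make step 2 above more concrete, here is a toy sketch of ordered target encoding. It is deliberately simplified: a single pass over already-shuffled rows and an arbitrary prior of 0.05, rather than CatBoost’s exact defaults and multiple permutations:

```python
def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row using only the rows that appear before it in the shuffled order."""
    counts = {}   # category -> number of earlier rows with that category
    sums = {}     # category -> sum of earlier (binary) targets for that category
    encoded = []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)
        s = sums.get(cat, 0.0)
        encoded.append((s + prior) / (n + 1))  # earlier rows only -> no target leakage
        counts[cat] = n + 1
        sums[cat] = s + y
    return encoded

# Toy example: a categorical column and a binarized target, already shuffled
cats = ["blue", "red", "blue", "blue", "red"]
target = [1, 0, 0, 1, 1]
print(ordered_target_encode(cats, target))  # [0.05, 0.05, 0.525, 0.35, 0.025]
```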

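Finally, a minimal sketch of defining, training, and evaluating the classifier is given below. The hyperparameters (number of iterations, learning rate, and so on) are illustrative choices, not necessarily the ones behind the accuracy reported next:

```python
from catboost import CatBoostClassifier

# Define the model; the categorical columns are handled natively via cat_features
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.1,
    cat_features=cat_features,
    random_seed=42,
    verbose=100,
)

# Train on the training set and evaluate accuracy on the held-out test set
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```
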
Now that we have a grasp of the inner workings of a CatBoost model, we can train it on the designated training dataset and observe its performance on the test dataset. The trained model achieves an accuracy of approximately 70% on the test data, which means that in roughly 30% of cases the CatBoost model failed to predict a data scientist’s experience level from the independent variables. The predictive capability of this model is reasonably satisfactory, although not exceptionally strong. If you have ideas or suggestions for improving its predictive performance, please don’t hesitate to share them. For access to the complete code, you can consult CatBoost on Data Science Salaries 2023. Thank you for reading!