Table of Contents
- Scaling Kaggle Competitions Using XGBoost: Part 2
- AdaBoost
- The Dataset
- Sample Weights
- Choosing the Right Feature
- Significance of a Stump
- Calculating the New Sample Weights
- Moving Forward: The Subsequent Stumps
- Piecing It Together
- Configuring Your Development Environment
- Having Problems Configuring Your Development Environment?
- Setting Up the Prerequisites
- Building the Model
- Assessing the Model
- Summary
Scaling Kaggle Competitions Using XGBoost: Part 2
In our previous tutorial, we went through the basic foundation behind XGBoost and learned how easy it was to incorporate a basic XGBoost model into our project. We went through the core essentials required to understand XGBoost, namely decision trees and ensemble learners.
Although we learned to plug XGBoost into our projects, we have yet to scratch the surface of the magic behind it. In this tutorial, we aim to stop treating XGBoost as a black box and learn a bit more about the simple mathematics that makes it so good.
But the process is a step-by-step one. To avoid crowding a single tutorial with too much math, we will work through the math behind today's concept: AdaBoost. Once that is clear, we will address Gradient Boosting before finally addressing XGBoost itself in the upcoming tutorials.
In this tutorial, you will learn the underlying math behind one of the prerequisites of XGBoost.
We will also address a slightly more challenging Kaggle dataset and attempt to use XGBoost to get better results.
This lesson is the 2nd of a 4-part series on Deep Learning 108:
- Scaling Kaggle Competitions Using XGBoost: Part 1
- Scaling Kaggle Competitions Using XGBoost: Part 2 (this tutorial)
- Scaling Kaggle Competitions Using XGBoost: Part 3
- Scaling Kaggle Competitions Using XGBoost: Part 4
To learn how to figure out the math behind AdaBoost, just keep reading.
In the previous blog post of this series, we briefly covered concepts like decision trees and gradient boosting before touching on the concept of XGBoost.
Subsequently, we saw how easy it was to use in code. Now, let’s get our hands dirty and get closer to understanding the math behind it!
AdaBoost
A formal definition of AdaBoost (Adaptive Boosting) is “the combination of the output of weak learners into a weighted sum, representing the final output.” But that definition leaves a lot vague, so let’s dissect it to understand what it really means.
Since we have been dealing with trees, we will assume that our adaptive boosting technique is being applied to decision trees.
The Dataset
To begin our journey, we will consider the dummy dataset shown in Table 1.
The features of this dataset are mainly “courses” and “Project,” while the labels in the “Job” column tell us whether the person has a job right now or not. If we take the decision tree approach, we would need to figure out a proper criterion to split the dataset into its corresponding labels.
Now, a key thing to remember about adaptive boosting is that we create trees sequentially and carry the errors generated by each one over to the next. In a Random Forest, we would create full-sized trees, but in adaptive boosting, the trees created are stumps (trees of depth 1).
A general line of thought is that the error generated by one tree, split on some criterion, decides the nature of the next tree. Letting a complex tree overfit the data would defeat the purpose of combining many weak learners into a strong one, which is why stumps are preferred. That said, AdaBoost has shown good results with more complex trees too.
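For reference, a stump in scikit-learn terms is simply a tree capped at depth 1; here is a minimal sketch (not part of the AdaBoost walkthrough that follows):
# a stump: a decision tree limited to a single split on a single feature
from sklearn.tree import DecisionTreeClassifier
stump = DecisionTreeClassifier(max_depth=1)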
Sample Weights
Our next step is assigning a sample weight to each sample in our dataset. Initially, all the samples have the same weight, and the weights must add up to 1 (Table 2). As we move toward our end goal, the concept of sample weights will become much clearer.
For now, since we have 7 data samples, we assign 1/7 to each sample. This signifies that all samples are currently equally important to the final structure of the decision tree.
Choosing the Right Feature
Now we have to initialize our first stump with some information.
At a glance, we can see that for this dummy dataset, the “courses” feature splits the dataset better than the “project” feature. Let’s see this for ourselves in Figure 1.
The split criterion here is the number of courses taken by a student. We notice that all students who have taken more than 7 courses have a job, while among students with 7 or fewer courses, the majority don’t have a job.
However, the split gives us 6 correct predictions and 1 incorrect one: a single sample in our dataset gets the wrong prediction if we use this particular stump.
Significance of a Stump
Now, our particular stump has made one wrong prediction. To calculate the error, we add the weights of all incorrectly classified samples, which in this case is 1/7, since only a single sample is classified incorrectly.
Now, a key difference between AdaBoost and Random Forest is that in the former, one stump might have more weightage in the final output than other stumps. So the question here becomes how to calculate this weightage, known as the significance (or “amount of say”) of a stump. The formula is

\text{significance} = \frac{1}{2}\ln\!\left(\frac{1 - \text{Total Error}}{\text{Total Error}}\right).

So, if we insert our total error (1/7) into this formula, we get ≈0.89. Notice that the formula returns 0 for a stump that does no better than a coin flip (a total error of 0.5) and grows as the error shrinks toward 0. So 0.89 is a high value, telling us that this particular stump has a big say in the final output of the combination of stumps.
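To make the arithmetic concrete, here is a minimal Python sketch of the significance calculation (the variable names are ours and not part of any library):
# a minimal sketch of the significance ("amount of say") calculation
import numpy as np

# the total error is the sum of the weights of the misclassified samples
totalError = 1 / 7
significance = 0.5 * np.log((1 - totalError) / totalError)
print(significance)  # ~0.89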
Calculating the New Sample Weights
If we have built a classifier that gives us the correct prediction over all samples but one, our natural course of action should be to make sure we focus more on that particular sample to have it grouped with the right class.
Currently, all the samples have the same weight (1/7). But we want our classifier to focus more on the wrongly classified sample. For that, we have to make changes to the weights.
Remember, as we change the weight of this sample, the weights of all the other samples must also shift, since all of the weights must still add up to 1.
The equation for calculating the new weight for a wrongly classified sample is

\text{new sample weight} = \text{sample weight} \times e^{\,\text{significance}}.

Plugging in the values from our dummy dataset, this becomes

\text{new sample weight} = \frac{1}{7} \times e^{\,0.89},

giving us a value of 0.347.
Previously, the sample weight for this particular sample was 1/7 ≈ 0.142. The new weight is 0.347, meaning that the sample’s importance has increased. Now, we need to shift the rest of the sample weights accordingly.
The formula for the correctly classified samples is

\text{new sample weight} = \text{sample weight} \times e^{\,-\text{significance}},

which comes to 0.0586. So now, instead of 1/7, each of the 6 correctly predicted samples will have weight 0.0586, while the wrongly predicted sample will have weight 0.347.
The final step here is to normalize the weights so they add up to 1 again, which gives the wrongly predicted sample a weight of 0.494 and each of the other samples a weight of 0.084.
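Here is a minimal sketch of the full weight update, assuming (as in our example) that the fourth sample is the one misclassified sample:
# a minimal sketch of the AdaBoost weight update for our 7-sample example
import numpy as np

sampleWeights = np.full(7, 1 / 7)   # initial, equal sample weights
significance = 0.89                 # the stump's "amount of say" from above
misclassified = np.zeros(7, dtype=bool)
misclassified[3] = True             # assumption: the fourth sample was predicted wrongly

# boost the weight of the wrongly classified sample, shrink the rest
sampleWeights[misclassified] *= np.exp(significance)
sampleWeights[~misclassified] *= np.exp(-significance)

# normalize so the weights add up to 1 again
sampleWeights /= sampleWeights.sum()
print(sampleWeights)  # ~0.49 for the misclassified sample, ~0.08 for the others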
Moving Forward: The Subsequent Stumps
Now, there are several ways to determine what the next stump will be. Our priority is to make sure the second stump correctly classifies the sample with the larger weight (which we found to be 0.494 in the last iteration).
We focus on this sample because it is the one that caused the error in our previous stump.
While there is no single standard way of selecting the new stump (most approaches work fine), a popular method is to create an entirely new dataset from our current dataset and its sample weights.
Let’s recall how far we are currently (Table 3).
Our new dataset will come from random sampling based on the current weights. We will be creating buckets of ranges. The ranges will be as follows:
- We choose the first sample for values between 0 and 0.084.
- We choose the second sample for values between 0.084 and 0.168.
- We choose the third sample for values between 0.168 and 0.252.
- We choose the fourth sample for values between 0.252 and 0.746.
By now, I hope you have figured out the pattern we are following here. The bucket boundaries are the cumulative sums of the weights: 0.084 (first sample), 0.084 + 0.084 = 0.168 (second sample), 0.084 + 0.084 + 0.084 = 0.252 (third sample), 0.084 + 0.084 + 0.084 + 0.494 = 0.746 (fourth sample), and so on.
Naturally, since the fourth sample covers a larger range, it will appear more often in the new dataset when random sampling is applied. So our new dataset will still have 7 entries, but it might contain 4 copies of the sample belonging to “Tom.” We reset the weights for the new dataset, but intuitively, since “Tom” now appears more often, it will significantly affect how the next stump turns out.
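A minimal sketch of this resampling step, using the normalized weights from before (drawing indices with probabilities proportional to the weights is equivalent to the bucket-of-ranges procedure described above):
# a minimal sketch of building the new dataset by weighted random sampling
import numpy as np

rng = np.random.default_rng(0)
sampleWeights = np.array([0.084, 0.084, 0.084, 0.494, 0.084, 0.084, 0.084])
sampleWeights /= sampleWeights.sum()  # guard against rounding so the weights sum to 1

# draw 7 indices with replacement; heavier samples land in wider buckets
newIndices = rng.choice(len(sampleWeights), size=7, replace=True, p=sampleWeights)
print(newIndices)  # the heavily weighted sample (index 3, "Tom") tends to repeat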
Hence, we repeat the process to find the best feature for the new stump. This time, because the dataset is different, we will create the stump with a different feature rather than the number of courses.
Recall how we said that the first stump’s output determines how the second tree works: here, the weights derived from the first stump determined how the second stump is built. This process is rinsed and repeated for N stumps.
Piecing It Together
We have learned about stumps and the significance of their outputs. Let’s say we feed one test sample to our group of stumps: the first stump gives a 1, the second gives a 1, the third gives a 0, and the fourth also gives a 0.
But since each stump has a significance value attached to it, the final output is a weighted vote that determines the predicted class of the test sample.
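As a sketch of this weighted vote (the significance values below are purely illustrative and not taken from our example):
# a minimal sketch of combining stump outputs into a weighted vote
import numpy as np

stumpOutputs = np.array([1, 1, 0, 0])               # predictions from 4 stumps for one test sample
significances = np.array([0.89, 0.60, 0.35, 0.20])  # hypothetical "amount of say" per stump

# total say voting for class 1 vs. class 0
sayForOne = significances[stumpOutputs == 1].sum()
sayForZero = significances[stumpOutputs == 0].sum()
finalPrediction = 1 if sayForOne > sayForZero else 0
print(finalPrediction)  # 1, because the stumps voting 1 carry more total say (1.49 vs. 0.55)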
With AdaBoost covered, we are one step closer to understanding the XGBoost algorithm. Now, we will tackle a slightly more challenging Kaggle task and solve it using XGBoost.
Configuring Your Development Environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Setting Up the Prerequisites
Today, we will be tackling the USA Real Estate Dataset. Let’s start by importing the necessary packages for our project.
# import the necessary packages
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
In our imports, we have pandas, xgboost, and some utility functions from scikit-learn, like mean_squared_error and train_test_split, on Lines 2-5.
# load the data in the form of a csv
estData = pd.read_csv("/content/realtor-data.csv")

# drop NaN values from the dataset
estData = estData.dropna()

# split the labels and remove non-numeric data
y = estData["price"].values
X = estData.drop(["price"], axis=1).select_dtypes(exclude=['object'])
On Line 8, we use read_csv to load the CSV file as a pandas DataFrame. A bit of exploratory data analysis (EDA) on the dataset would show many NaN (Not-a-Number, or undefined) values. These can hinder our training, so we drop all rows with NaN values on Line 11.
We proceed to prepare the labels and features on Lines 14 and 15. We exclude any column that doesn’t contain numerical values from our feature set. It is possible to include that data, but we would first need to convert it into a one-hot encoding, as sketched below.
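If you did want to keep those non-numeric columns, a minimal optional sketch using pandas’ one-hot encoding would look like this (we do not use it in the rest of this tutorial):
# optional: one-hot encode the non-numeric columns instead of dropping them
XEncoded = pd.get_dummies(estData.drop(["price"], axis=1))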
Our final dataset is going to look something like Table 4.
We will use these features to determine a house’s price (label).
Building the Model
Our next step is to initialize and train the XGBoost model.
# create the train test split
xTrain, xTest, yTrain, yTest = train_test_split(X, y)

# define the XGBoost regressor according to your specifications
xgbModel = xgb.XGBRegressor(
    n_estimators=1000,
    reg_lambda=1,
    gamma=0,
    max_depth=4
)

# fit the data into the model
xgbModel.fit(xTrain, yTrain,
    verbose = False)

# calculate the importance of each feature used
impFeat = pd.DataFrame(xgbModel.feature_importances_.reshape(1, -1), columns=X.columns)
On Line 18, we use the train_test_split function to split our dataset into training and test sets. For the XGBoost model, we initialize it using 1000 trees and a max tree depth of 4 (Lines 21-26).
We fit the training data into our model on Lines 29 and 30. Once our training is complete, we can assess which features our model gives more importance to (Table 5).
Table 5 shows that the feature “bath” has the most significance.
Assessing the Model
Our next step is to see how well our model would perform on unseen test data.
# get predictions on test data
yPred = xgbModel.predict(xTest)

# store the msq error from the predictions
msqErr = mean_squared_error(yPred, yTest)

# assess your model’s results
xgbModel.score(xTest, yTest)
On Lines 36-39, we get our model’s predictions on the testing dataset and calculate the mean squared error (MSE). Since this is a regression task on house prices, the target values are large, so don’t be alarmed by a correspondingly large MSE value.
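If you want a more interpretable number, you could optionally take the square root of the MSE to express the error in the same units as the house prices; a minimal sketch:
# optional: the RMSE is in the same units as the house prices
import numpy as np
rmse = np.sqrt(msqErr)
print(rmse)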
Our final step is to use the model.score function on the testing set (Line 42). For a regressor, score returns the R² (coefficient of determination) rather than a classification accuracy.
The score shows that our model achieves an R² of nearly 0.95 on the testing set.
Summary
In this tutorial, we first walked through the math behind one of the prerequisites of XGBoost, Adaptive Boosting (AdaBoost). Then, as in the first blog post of this series, we tackled a dataset from Kaggle and used XGBoost to attain a pretty high score on the testing dataset, yet again establishing XGBoost’s place as one of the leading classical machine learning techniques.
Our next stop will be to figure out the math behind Gradient Boosting before we finally dissect XGBoost down to its foundations.
Citation Information
Martinez, H. “Scaling Kaggle Competitions Using XGBoost: Part 2,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/2wiy7
@incollection{Martinez_2022_XGBoost2,
  author = {Hector Martinez},
  title = {Scaling {Kaggle} Competitions Using {XGBoost}: Part 2},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
  year = {2022},
  note = {https://pyimg.co/2wiy7},
}