Scaling Kaggle Competitions Using XGBoost: Part 1
Tackling deep learning topics in depth can be tough. You may try again and again, but much like writer's block, researcher's block is very real too!
In my initial days of traversing deep learning, I used to hit roadblocks on a semi-daily basis. Of course, we all say that unwavering dedication is the key to a breakthrough, but sometimes it’s good to take a different approach.
For me, a breath of fresh air was competing (casually) in Kaggle competitions. Since deep learning research and practical deep learning require vastly different thinking approaches, it was a good change for me as I gained a different perspective on problem-solving, as well as a look into how data scientists of the highest caliber attack real-life problems with deep learning.
Competitive "kaggling" is an addictive habit that deep learning practitioners can easily fall into. Competitions that offer both growth and prizes are enough to entice people to dedicate themselves to becoming the best "kaggler" out there.
However, this only constitutes 50% of our blog post title. So what about the other half?
Kagglers are always looking for ways to make their competition submissions better than others. The caliber of today’s kagglers is such that almost everyone reaches 99% accuracy, but only the one with 99.1% accuracy reigns supreme.
XGBoost is a scalable gradient-boosting library that came to light when several winning Kaggle competition teams used it in their submissions. If you are unfamiliar with the term "gradient boosting," don't worry; we have you covered.
Today, we will dive into a preamble of approaching Kaggle competitions and see how XGBoost pushes your gradients to their maximum potential.
In this tutorial, you will learn how to scale Kaggle competitions using XGBoost.
This lesson is the 1st of a 4-part series on Deep Learning 108:
- Scaling Kaggle Competitions Using XGBoost: Part 1 (this tutorial)
- Scaling Kaggle Competitions Using XGBoost: Part 2
- Scaling Kaggle Competitions Using XGBoost: Part 3
- Scaling Kaggle Competitions Using XGBoost: Part 4
To learn how to scale Kaggle competitions using XGBoost, just keep reading.
Scaling Kaggle Competitions Using XGBoost: Part 1
Preface
To understand XGBoost, we first need to understand a few prerequisites. The first concept you need to be familiar with is a decision tree.
By this point, we are all familiar with linear regression. If not, just recall the equation of a line, y = mx + b. Take a look at Figure 1.
Here, we are trying to model Y using X. We have a few data points, which we have plotted. Since it's an ideal setting, we have found a linear function that cuts through all the points.
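As a quick refresher, below is a minimal sketch of fitting such a line with NumPy; the data points are made up for illustration and are not the ones plotted in Figure 1.

# fit a line y = m*x + b to a few toy points with NumPy
import numpy as np

# toy data (made up for illustration): y is roughly 2*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# np.polyfit with degree 1 returns the slope (m) and intercept (b)
m, b = np.polyfit(x, y, 1)
print(f"fitted line: y = {m:.2f}x + {b:.2f}")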
Now, if we make things more complicated by moving to a proper classification task, we have to deal with several features corresponding to labels. Ideally, let's say we have features X1 and X2 for a given dataset, which, when plotted, perfectly split the dataset into its corresponding labels (Figure 2).
That would be every data scientist’s dream, as the results would come flowing in a jiffy. However, most real-life datasets will have extremely nonlinear data, where using linear functions to classify the dataset would be difficult (Figure 3).
Thankfully, we have become adept at introducing nonlinearity into our models for datasets like these. In particular, decision trees are an extremely capable algorithm for tackling such tricky datasets.
A decision tree can be considered a binary tree that recursively splits the dataset so that each leaf node ideally contains samples from a single class. On a personal note, decision trees were the coolest and most deceptive algorithms I came across in classical machine learning.
By coolest, I mean that their design is such that even the most diverse datasets can be easily tamed using decision trees. On the other hand, their practical usage has led me to believe that it's very easy to overfit the training data when using decision trees, especially for complex datasets.
Let’s look at Figure 4, which displays a decision tree that splits the popular iris dataset into its corresponding classes.
Now, in all likelihood, this picture might look super complex to you. So, let's decipher what it means. Look at the root node: we don't have enough information there to separate each class, hence we see a distribution over the classes.
The children of the root node are split based on a petal length condition. The data points satisfying the less-than condition fall into a pure node (i.e., a node referencing a single class). The data points satisfying the greater-than-or-equal-to condition fall into the right child, which is still a distribution since two classes remain.
So in the subsequent nodes, the distribution will further be split based on conditions created by the model, and as we move further down the tree, we see that leaf nodes have individual classes.
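If you would like to generate a tree like the one in Figure 4 yourself, here is a minimal scikit-learn sketch; the exact splits and depth it learns may differ slightly from the figure.

# train a small decision tree on the iris dataset and print its splits
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# load the features and labels
iris = load_iris()

# limit the depth so the printed tree stays readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)

# print the learned conditions (petal length, petal width, etc.)
print(export_text(tree, feature_names=iris.feature_names))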
Let’s try to visualize a decision tree for our example dataset of Figure 3.
In Figure 5, we have tried creating the initial tree structure for a possible decision tree that would be formed for the given dataset.
Now, if you look closely, we have several conditions that systematically split the data points. Although these examples should make the general idea of decision trees clear, intuitively, it should also strike us that stacking up many such conditions can drastically increase the chance of overfitting.
Let’s move on to the next important prerequisite that we require: Ensemble learning.
So decision trees provided an approach consisting of several decisions in which we could split the dataset into its corresponding classes. However, there were several cons to this approach, most notably that of overfitting.
But what if, instead of using a single tree, we use multiple decision trees that help each other, eliminating weak links and ensuring a robust cluster of trees that is far less prone to overfitting?
The answer lies in using multiple machine learning models to combine the insights obtained from their outputs for far more accurate and improved decisions. Let’s understand it better with the example of multiple decision trees.
Let us assume you have a dataset containing X features. We start by taking a random subset of K features from these X features and building a decision tree. If we repeat this process, each tree will see a different random subset of features, and the resultant trees will all be different.
Now, all that is left is to aggregate the results of these trees, and our bare-bones ensemble model is ready. If we also introduce random (bootstrap) sampling of the training data for this cluster of decision trees, this becomes the famous and still widely used ensemble algorithm known as Random Forest (Figure 6).
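To make this concrete, here is a bare-bones Random Forest sketch in scikit-learn; the hyperparameter values are illustrative, not tuned.

# a bare-bones random forest on the iris dataset with scikit-learn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load the data and create a train/test split
iris = load_iris()
xTrain, xTest, yTrain, yTest = train_test_split(iris.data, iris.target, random_state=42)

# each of the 100 trees sees a bootstrap sample of the rows and a random
# subset of the features at every split (max_features="sqrt")
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=42)
forest.fit(xTrain, yTrain)
print("test accuracy:", forest.score(xTest, yTest))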
The third piece of this puzzle is understanding what gradient boosting is. While we will discuss the mathematical side of this concept in the later blogs of this series, let’s try to grasp the idea behind it.
Assume we have some basic dataset where a label is given based on 3 features. First, we will start by creating our base tree. If our task is a regression task, this tree can be a single node outputting the average of the labels. This isn’t much help, but this is our starting point.
The loss generated by our starting tree tells us how far off we are from the labels. Now, we create a second tree, which will be a proper decision tree. As we have seen, decision trees can end up overfitting, yielding predictions with very high variance.
To mitigate this problem, we limit how much say a single tree has in the overall result. The new result is the summation of the output of our initial root tree (the average of the labels) and the output of the second tree, scaled by a learning rate.
If we are going in the right direction, the loss generated from this summation will be lower than the initial loss. We repeat this process until we reach the specified number of trees. In simpler terms, a set of weak learners is combined into a single strong learner.
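To make the intuition concrete, here is a toy sketch of that loop for regression; toy_gradient_boost is a hypothetical helper written purely for illustration, not part of any library, and real implementations add many refinements on top of it.

# a stripped-down gradient boosting loop for regression:
# start from the mean, then repeatedly fit small trees to the residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    # base prediction: the average of the labels
    base = np.mean(y)
    pred = np.full(len(y), base)
    trees = []

    for _ in range(n_trees):
        # fit the next tree to what the current model still gets wrong
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)

        # add a scaled-down version of the new tree's output
        pred += learning_rate * tree.predict(X)
        trees.append(tree)

    return base, trees

At prediction time, you would sum the base value and the learning-rate-scaled outputs of every stored tree.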
XGBoost
With a basic understanding of all the prerequisites, we can move on to the highlight of this blog: XGBoost.
XGBoost stands for Extreme Gradient Boosting, an optimized and scalable implementation of gradient boosting. Arguably the most powerful classical machine learning algorithm out there today, it offers several features that build on the concepts we have covered so far:
- Regularization steps to help prevent overfitting
- Support for parallel processing
- Built-in cross-validation and early stopping (sketched right after this list)
- Highly optimized tree construction that makes deeper trees practical
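As a quick taste of the built-in cross-validation and early stopping, here is a minimal sketch using xgb.cv on the iris data; the parameter values are illustrative only.

# a quick look at XGBoost's built-in cross-validation and early stopping
import xgboost as xgb
from sklearn.datasets import load_iris

# wrap the data in XGBoost's DMatrix format
iris = load_iris()
dtrain = xgb.DMatrix(iris.data, label=iris.target)

params = {"max_depth": 4, "eta": 0.1, "objective": "reg:squarederror"}

# 5-fold cross-validation; boosting stops once the test RMSE
# has not improved for 10 rounds
cvResults = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    metrics="rmse",
    early_stopping_rounds=10,
)
print(cvResults.tail())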
In the subsequent parts of this series, we will learn more about the math behind some of the concepts we have mentioned, but for now, let us use XGBoost in a practical scenario and marvel at how easy it is!
Configuring Your Development Environment
To follow this guide, you need the xgboost library installed on your system, along with pandas and scikit-learn.
Luckily, all of them are pip-installable:
$ pip install xgboost pandas scikit-learn
If you need help configuring your development environment with pip, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Configuring the Prerequisites
Since we are working with Kaggle, we will use Kaggle notebooks for our code. For our dataset today, we directly use the iris dataset from sklearn.datasets. Our objective is to see how easy it is to plug XGBoost into our code.
# import necessary packages
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
In the imports, we have pandas, xgboost, and several helper packages from sklearn, which include the iris dataset and train_test_split (Lines 2-6).
Plugging in XGBoost
Now we will directly use XGBoost in our code.
# load the iris dataset and create the dataframe
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
On Line 9, we use the load_iris functionality to load the required dataset directly and then create a dataframe from our data and a series for the labels (Lines 10 and 11).
# split the dataset
xTrain, xTest, yTrain, yTest = train_test_split(X, y)

# define the XGBoost regressor according to your specifications
xgbModel = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=4
)
On Line 14, we use another sklearn function to create our training and test split. This is followed by defining the XGBoost regressor, which takes in the following parameters (Lines 17-22):
- n_estimators: Number of trees
- reg_lambda: Lambda value for L2 regularization
- gamma: Minimum loss reduction required to split a branch
- max_depth: Maximum depth of a tree
# fit the training data in the model
xgbModel.fit(xTrain, yTrain)

# store the importance of features in a separate dataframe
impFeat = pd.DataFrame(xgbModel.feature_importances_.reshape(1, -1), columns=iris.feature_names)
On Line 25, we fit our training data into the regression model and let it train. XGBoost automatically computes the importance of each feature for us, which we store in a separate dataframe on Line 28.
# get predictions on test data
yPred = xgbModel.predict(xTest)

# store the msq error from the predictions
msqErr = mean_squared_error(yPred, yTest)
Finally, we can use the trained model to predict on the test data (Line 31) and manually calculate the mean-squared error from the predictions (Line 34).
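As an optional sanity check, you could print the two results we just stored (reusing the variables defined above).

# print the evaluation results for a quick sanity check
print("Mean squared error on the test split:", msqErr)
print("Feature importances per column:")
print(impFeat)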
What's next? We recommend PyImageSearch University.
86 total classes • 115+ hours of on-demand code walkthrough videos • Last updated: October 2024
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86 courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this introductory post, we went over some of the key concepts required to understand XGBoost. We learned the intuition behind decision trees, ensemble models, and gradient boosting, and then used XGBoost on the basic problem of modeling the iris dataset.
We also noted how easy it is to use an algorithm as powerful as XGBoost, as well as its ability to learn over almost any given dataset.
Citation Information
Martinez, H. “Scaling Kaggle Competitions Using XGBoost: Part 1,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/0c9pb
@incollection{Martinez_2022_XGBoost1,
  author = {Hector Martinez},
  title = {Scaling {Kaggle} Competitions Using {XGBoost}: Part 1},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
  year = {2022},
  note = {https://pyimg.co/0c9pb},
}
Unleash the potential of computer vision with Roboflow - Free!
- Step into the realm of the future by signing up or logging into your Roboflow account. Unlock a wealth of innovative dataset libraries and revolutionize your computer vision operations.
- Jumpstart your journey by choosing from our broad array of datasets, or benefit from PyImageSearch’s comprehensive library, crafted to cater to a wide range of requirements.
- Transfer your data to Roboflow in any of the 40+ compatible formats. Leverage cutting-edge model architectures for training, and deploy seamlessly across diverse platforms, including API, NVIDIA, browser, iOS, and beyond. Integrate our platform effortlessly with your applications or your favorite third-party tools.
- Equip yourself with the ability to train a potent computer vision model in a mere afternoon. With a few images, you can import data from any source via API, annotate images using our superior cloud-hosted tool, kickstart model training with a single click, and deploy the model via a hosted API endpoint. Tailor your process by opting for a code-centric approach, leveraging our intuitive, cloud-based UI, or combining both to fit your unique needs.
- Embark on your journey today with absolutely no credit card required. Step into the future with Roboflow.
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.