Table of Contents
- Introduction to the Bag-of-Words (BoW) Model
- Introduction to the BoW Model
- The Pros and Cons of BoW
- Configuring Your Development Environment
- Having Problems Configuring Your Development Environment?
- Project Structure
- Configuring the Prerequisites
- Processing Our Data for the BoW Approach
- Building the Bag-of-Words Function
- TensorFlow Wrapping: An Alternative
- Training the BoW Model
- Understanding the Training Metrics
- Summary
Introduction to the Bag-of-Words (BoW) Model
Creating statistical models based on text data has always been more complicated than modeling on image data. Image data contains detectable patterns that a model can learn to identify. Patterns in text data are more complex and require more computation using traditional methods.
Last week we took a brief stroll through the history of natural language processing (NLP). Today we will learn about one of the first techniques used in modeling language data for a computer, Bag-of-Words (BoW).
In this tutorial, you will learn about the Bag-of-Words model and how to implement it.
This lesson is the 2nd in a 4-part series on NLP 101:
- Introduction to Natural Language Processing (NLP)
- Introduction to the Bag-of-Words (BoW) Model (today’s tutorial)
- Word2Vec: A Study of Embeddings in NLP
- Comparison between BagofWords and Word2Vec
To learn how to implement the Bag-of-Words model, just keep reading.
Introduction to the BoW Model
The Bag-of-Words model is a simple method for extracting features from text data. The idea is to represent each sentence as a bag of words, disregarding grammar and paradigms. Just the occurrence of words in a sentence defines the meaning of the sentence for the model.
This can be considered an extension of representation learning, where you are representing the sentences in an N-dimensional space. For each sentence, the model will assign a weight to each dimension. This will become the sentence’s identity for the model.
Let’s dig deeper into what it means. Take a look at Figure 1.
We have two sentences: “I have a dog” and “You have a cat.” First, we grab all the words present in our current vocabulary and create a representation matrix where each column is dedicated to one of the words, as seen in Figure 1.
Our sentences have a combined word count of 8, but since we have 2 words (have, a) repeating, the total vocabulary size becomes 6. Now we have 6 columns representing each word in the vocabulary.
Each sentence is now represented as a combination of all the words in the vocabulary. For example, “I have a dog” has 4 of the 6 words available in the vocabulary, so we will turn on the bits for the existing words and turn off the bits for the words that don’t exist in the sentence.
Hence, if the vocab matrix columns are in the order of I, have, a, dog, you, and cat, the first sentence (“I have a dog”) representation becomes 1,1,1,1,0,0, while the second sentence (“You have a cat”) representation becomes 0,1,1,0,1,1.
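As a minimal sketch of the bitwise encoding described above, the plain-Python snippet below rebuilds the two vectors. The helper names build_vocab and encode_binary are hypothetical and are not part of the tutorial's code, and the sketch assumes lowercasing and whitespace tokenization.

```python
def build_vocab(sentences):
    # assign each new word the next free column, in first-seen order
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode_binary(sentence, vocab):
    # turn on the bit of every vocabulary word present in the sentence
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

sentences = ["I have a dog", "You have a cat"]
vocab = build_vocab(sentences)
vectors = [encode_binary(s, vocab) for s in sentences]

print(list(vocab))  # ['i', 'have', 'a', 'dog', 'you', 'cat']
print(vectors[0])   # [1, 1, 1, 1, 0, 0]
print(vectors[1])   # [0, 1, 1, 0, 1, 1]
```

The column order here simply follows the order in which words are first seen, which happens to match the order used in the text.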
These representations become the key to making a model understand the essence of different sentences. We are indeed ignoring grammar, but since these sentences are being viewed with respect to the complete vocabulary, each has a unique tokenized representation, which helps them stand out from other sentences.
For example, the first sentence will stand out since it has the dog and I bits turned on, while the second sentence has the cat and you bits turned on. These small changes in the representations help us model text data using the Bag-of-Words approach.
Here, we have explained BoW with the bitwise approach. BoW can also be configured to store the frequency of occurrence of words for additional reinforcement during model training.
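The frequency variant can be sketched in the same spirit. The snippet below is an illustration under stated assumptions rather than the tutorial's implementation: it reuses the six-word vocabulary from the example, the sentence is made up so that some words repeat, and the names encode_counts and encode_bits are hypothetical.

```python
from collections import Counter

# column order taken from the example vocabulary in the text
vocab = ["i", "have", "a", "dog", "you", "cat"]

def encode_counts(sentence, vocab):
    # count every token once, then read the counts out in vocabulary order
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

def encode_bits(sentence, vocab):
    # the bitwise variant only records presence or absence
    return [min(count, 1) for count in encode_counts(sentence, vocab)]

sentence = "I have a dog and you have a dog"  # hypothetical sentence
count_vector = encode_counts(sentence, vocab)
bit_vector = encode_bits(sentence, vocab)

print(count_vector)  # [1, 2, 2, 2, 1, 0]
print(bit_vector)    # [1, 1, 1, 1, 1, 0]
```

The word "and" is not in the vocabulary, so it contributes to neither vector.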
The Pros and Cons of BoW
Right off the bat, we see a major problem with this approach. If our input data is big, the vocabulary size will also increase. This, in turn, makes our representation matrix much larger and the computations on it more expensive.
Another computational nightmare is the inclusion of many 0s in our matrix (i.e., a sparse matrix). Each sentence uses only a few words of the vocabulary, so most entries carry no information, and storing them all wastes a lot of memory.
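To make the sparsity concern concrete, here is a small sketch that measures the fraction of zeros in a representation matrix and stores one row in a compact form. The matrix values and the helper names sparsity and to_sparse are assumptions chosen for illustration only.

```python
def sparsity(matrix):
    # fraction of entries that are zero
    total = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / total

def to_sparse(row):
    # keep only the non-zero entries as {column index: value}
    return {index: value for index, value in enumerate(row) if value != 0}

# two short sentences encoded against a hypothetical 10-word vocabulary
matrix = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]

print(sparsity(matrix))      # 0.75
print(to_sparse(matrix[1]))  # {2: 1, 7: 1}
```

With a realistic vocabulary of tens of thousands of words, the fraction of zeros per sentence is typically far higher than in this toy case.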
The biggest disadvantage in Bag-of-Words is the complete inability to learn grammar and semantics. The tokenized representation from the representation matrix is what defines a sentence, and only the occurrence/non-occurrence of words in a sentence distinguishes it from others.
On a brighter note, the Bag-of-Words approach highlights the benefits of representation learning in a stellar way. Its simple and intuitive approach helps us at least explain what a combination of words might mean to a computer.
Of course, that brings into question the application of Bag-of-Words. It is a great introductory step toward more complex representation learning examples like Word2Vec and GloVe. Since it also echoes the concept of “one-hot encoding” representations, Bag-of-Words was primarily used for the feature generation of text documents.
Now that we have grasped the idea of Bag-of-Words, let’s implement it!
Configuring Your Development Environment
To follow this guide, you need to have the OpenCV, TensorFlow, NumPy, and pandas libraries installed on your system (train.py imports pandas).
Luckily, all of them are pip-installable:
$ pip install opencv-contrib-python
$ pip install tensorflow
$ pip install numpy
$ pip install pandas
If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
We first need to review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
!tree .
.
├── pyimagesearch
│   ├── bow.py
│   ├── config.py
│   ├── data_processing.py
│   ├── __init__.py
│   ├── model.py
│   └── tensorflow_wrapper.py
└── train.py

1 directory, 7 files
In the pyimagesearch directory, we have:
- bow.py: Applies the Bag-of-Words technique to the sentence.
- config.py: Contains the configuration pipeline for the project.
- data_processing.py: Houses data-processing utilities.
- __init__.py: Turns the pyimagesearch directory into a Python package.
- model.py: Contains a small neural network architecture.
- tensorflow_wrapper.py: Houses the Bag-of-Words approach wrapped with tensorflow utilities.
In the parent directory, we have:
- train.py: Contains the training script for the Bag-of-Words approach.
Configuring the Prerequisites
Inside the pyimagesearch directory, we have a script called config.py, which houses the configuration pipeline for this project.
# define the data to be used
dataDict = {
    "sentence":[
        "Avengers is a great movie.",
        "I love Avengers it is great.",
        "Avengers is a bad movie.",
        "I hate Avengers.",
        "I didnt like the Avengers movie.",
        "I think Avengers is a bad movie.",
        "I love the movie.",
        "I think it is great."
    ],
    "sentiment":[
        "good",
        "good",
        "bad",
        "bad",
        "bad",
        "bad",
        "good",
        "good"
    ]
}

# define a list of stopwords
stopWrds = ["is", "a", "i", "it"]

# define model training parameters
epochs = 30
batchSize = 10

# define number of dense units
denseUnits = 50
We will be implementing Bag-of-Words by hand in plain Python, as well as with TensorFlow, to compare the two methods. On Lines 2-23, we have dataDict, a dictionary containing our input data split into sentences and their corresponding labels.
On Line 26, we have defined a list of stopwords, which will be omitted from our data since the BoW approach does not care about grammar and semantics. Removing these common words leaves each sentence with a larger share of distinctive words, so it can be separated from the rest more easily.
On Lines 29-33, we define some parameters for the TensorFlow model, like the number of epochs, batch size, and the number of dense units to add to our small neural network. This concludes the configuration pipeline.
Processing Our Data for the BoW Approach
As deep learning practitioners, you and I probably overlook and take many things in our daily projects for granted. We use TensorFlow/PyTorch wrappers for almost every processing task and forget what actually makes these wrappers so important.
For our project today, we will be coding some of these pre-processing wrappers on our own. For that, let’s move into the data_processing.py script.
# import the necessary packages
import re

def preprocess(sentDf, stopWords, key="sentence"):
    # loop over all the sentences
    for num in range(len(sentDf[key])):
        # lowercase the string and remove punctuation
        sentence = sentDf[key][num]
        sentence = re.sub(
            r"[^a-zA-Z0-9]", " ", sentence.lower()
        ).split()

        # define a list for processed words
        newWords = list()

        # loop over the words in each sentence and filter out the
        # stopwords
        for word in sentence:
            if word not in stopWords:
                # append word if not a stopword
                newWords.append(word)

        # replace sentence with the list of new words
        sentDf[key][num] = newWords

    # return the preprocessed data
    return sentDf
On Line 4, we have the first helper function, preprocess, which takes in the following arguments:
- sentDf: The input dataframe.
- stopWords: A list of words to omit from the data.
- key: A key to access the relevant part of the input dataframe.
We loop over all the available sentences in the dataframe (Lines 6-27), make the words lowercase, remove punctuation, and omit the stopwords.
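To see what this cleaning step does to a single sentence, the sketch below applies the same regular expression and stopword filter to a plain string instead of a dataframe. The function name clean_sentence is hypothetical; the stopword list and the sentence come from config.py.

```python
import re

def clean_sentence(sentence, stop_words):
    # lowercase, replace every non-alphanumeric character with a space,
    # then split on whitespace
    words = re.sub(r"[^a-zA-Z0-9]", " ", sentence.lower()).split()
    # keep only the words that are not stopwords
    return [word for word in words if word not in stop_words]

stop_words = ["is", "a", "i", "it"]
cleaned = clean_sentence("I didnt like the Avengers movie.", stop_words)

print(cleaned)  # ['didnt', 'like', 'the', 'avengers', 'movie']
```

Because the sentence is lowercased before filtering, the lowercase entry "i" in the stopword list is enough to remove the pronoun "I".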
def prepare_tokenizer(df, sentKey="sentence", outputKey="sentiment"):
    # counters for tokenizer indices
    wordCounter = 0
    labelCounter = 0

    # create placeholder dictionaries for tokenizer
    textDict = dict()
    labelDict = dict()

    # loop over the sentences
    for entry in df[sentKey]:
        # loop over each word and
        # check if encountered before
        for word in entry:
            if word not in textDict.keys():
                textDict[word] = wordCounter

                # update word counter if new
                # word is encountered
                wordCounter += 1

    # repeat same process for labels
    for label in df[outputKey]:
        if label not in labelDict.keys():
            labelDict[label] = labelCounter
            labelCounter += 1

    # return the dictionaries
    return (textDict, labelDict)
The second function in this script is prepare_tokenizer (Line 29), which takes in the following arguments:
- df: The dataframe from which we will create our tokenizer.
- sentKey: The key to access the sentence from the dataframe.
- outputKey: The key to access the labels from the dataframe.
First, we create counters for indices on Lines 31 and 32. On Lines 35 and 36, we create dictionaries for the tokenizer.
Next, we start looping over the sentences (Line 39) and adding the words to our dictionary. If we encounter a word we have already seen before, we ignore it. If the word is newly encountered, it is added to the dictionary (Lines 42-48).
We apply the same process for the labels (Lines 51-54), concluding the prepare_tokenizer function.
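The indexing logic can be checked on plain lists. The sketch below assumes the eight sentences from config.py have already been lowercased and stripped of stopwords; build_index is a hypothetical helper that mirrors the first-seen numbering used in prepare_tokenizer.

```python
def build_index(items):
    # map each distinct item to the order in which it was first seen
    index = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index

# the eight sentences from config.py after preprocessing
tokenized = [
    ["avengers", "great", "movie"],
    ["love", "avengers", "great"],
    ["avengers", "bad", "movie"],
    ["hate", "avengers"],
    ["didnt", "like", "the", "avengers", "movie"],
    ["think", "avengers", "bad", "movie"],
    ["love", "the", "movie"],
    ["think", "great"],
]
labels = ["good", "good", "bad", "bad", "bad", "bad", "good", "good"]

text_index = build_index(word for sentence in tokenized for word in sentence)
label_index = build_index(labels)

print(len(text_index))  # 10
print(label_index)      # {'good': 0, 'bad': 1}
```

The vocabulary size of 10 is the number the network's input layer is later sized for.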
Building the Bag-of-Words Function
Now, we will move into the bow.py script to see our custom function to calculate the bag of words.
def calculate_bag_of_words(text, sentence):
    # create a dictionary for frequency check
    freqDict = dict.fromkeys(text, 0)

    # loop over the words in sentences
    for word in sentence:
        # update word frequency
        freqDict[word]=sentence.count(word)

    # return dictionary
    return freqDict
The function calculate_bag_of_words takes in the vocabulary and the sentence as its arguments (Line 1). Next, we create a dictionary on Line 3 to check and store the occurrence of words.
Looping over each word in a sentence (Line 6), we count the number of times a particular word has appeared and return it (Lines 8-11).
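One detail worth noting: because the function assigns freqDict[word] directly, a word that is missing from the vocabulary would be added as a new key and change the length of the vector. That never happens in this tutorial, since the vocabulary is built from the same sentences. The variant below is a hedged sketch (the name bag_of_words and the guard are assumptions, not the tutorial's code) that skips unseen words so the vector length stays fixed.

```python
def bag_of_words(vocab, sentence):
    # start every vocabulary word at a count of zero
    freq = dict.fromkeys(vocab, 0)
    for word in sentence:
        # skip words the vocabulary has never seen so the
        # vector length stays equal to the vocabulary size
        if word in freq:
            freq[word] += 1
    return freq

vocab = {"avengers": 0, "great": 1, "movie": 2}
known = bag_of_words(vocab, ["avengers", "great", "great"])
unseen = bag_of_words(vocab, ["avengers", "sequel"])

print(known)   # {'avengers': 1, 'great': 2, 'movie': 0}
print(unseen)  # {'avengers': 1, 'great': 0, 'movie': 0}
```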
TensorFlow Wrapping: An Alternative
So far, we have seen what it would be like to create all the pre-processing functionalities ourselves. If you feel it is too complicated, we will also show you how to use TensorFlow for the same processes instead. Let’s move into tensorflow_wrapper.py.
# import the necessary packages
from tensorflow.keras.preprocessing.text import Tokenizer

def tensorflow_wrap(df):
    # create the tokenizer for sentences
    tokenizerSentence = Tokenizer()

    # create the tokenizer for labels
    tokenizerLabel = Tokenizer()

    # fit the tokenizer on the documents
    tokenizerSentence.fit_on_texts(df["sentence"])

    # fit the tokenizer on the labels
    tokenizerLabel.fit_on_texts(df["sentiment"])

    # create vectors using tensorflow
    encodedData = tokenizerSentence.texts_to_matrix(
        texts=df["sentence"], mode="count")

    # add label column
    labels = df["sentiment"]

    # correct label vectors
    for i in range(len(labels)):
        labels[i] = tokenizerLabel.word_index[labels[i]] - 1

    # return data and labels
    return (encodedData[:, 1:], labels.astype("float32"))
Inside the script, we have the tensorflow_wrap (Line 4) function, which takes in the dataframe as the argument.
On Lines 6-9, we initialize tokenizers for the sentences and labels, respectively. By simply using the fit_on_texts function, we have finished creating the tokenizers for the sentences and labels (Lines 12-15).
Using another function called texts_to_matrix to create our encodings, we get the vectorized format of our processed sentences (Lines 18 and 19).
On Lines 22-26, we create labels and then return the encodings and labels on Line 29.
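The slice encodedData[:, 1:] and the - 1 applied to the labels both exist because the Keras Tokenizer numbers words from 1 and keeps index 0 reserved, so the count matrix carries one extra leading column. The plain-Python sketch below only imitates that indexing convention to make the offset visible. It is not the Keras implementation, the helper names are hypothetical, and the real tokenizer may order words differently.

```python
def one_based_index(tokens):
    # imitate a tokenizer that numbers words from 1, leaving 0 unused
    index = {}
    for token in tokens:
        if token not in index:
            index[token] = len(index) + 1
    return index

def count_matrix(sentences, index):
    # one column per index value, including the unused column 0
    width = len(index) + 1
    rows = []
    for sentence in sentences:
        row = [0] * width
        for token in sentence:
            row[index[token]] += 1
        rows.append(row)
    return rows

sentences = [["love", "the", "movie"], ["think", "great"]]
index = one_based_index(t for s in sentences for t in s)
matrix = count_matrix(sentences, index)
trimmed = [row[1:] for row in matrix]  # same idea as encodedData[:, 1:]

label_index = one_based_index(["good", "bad"])
zero_based = [label_index[l] - 1 for l in ["good", "bad", "good"]]

print(matrix[0])   # [0, 1, 1, 1, 0, 0]
print(trimmed[0])  # [1, 1, 1, 0, 0]
print(zero_based)  # [0, 1, 0]
```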
Next, let’s look at model.py, which contains the small neural network we will train on both sets of vectors.
# import the necessary packages
import pyimagesearch.config as config
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_shallow_net():
    # define the model
    model = Sequential()
    model.add(Dense(config.denseUnits, input_dim=10, activation="relu"))
    model.add(Dense(config.denseUnits, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))

    # compile the keras model
    model.compile(loss="binary_crossentropy", optimizer="adam",
        metrics=["accuracy"]
    )

    # return model
    return model
On Line 6, we define the build_shallow_net function, which initializes a shallow neural network.
The network starts with a dense layer, where the number of inputs is set to 10. This is because the vocabulary of our text corpus after processing is 10. Two dense layers follow this, the final being the sigmoid-activated output layer (Lines 8-11).
On Lines 14 and 15, we compile the model with binary_crossentropy loss, adam optimizer, and accuracy as the metric.
With that, our model is ready for use.
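For a sense of how small this network is, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard dense-layer count of one weight per input-unit pair plus one bias per unit; dense_params is a hypothetical helper.

```python
def dense_params(inputs, units):
    # one weight per input-unit pair plus one bias per unit
    return inputs * units + units

vocab_size = 10    # input_dim in build_shallow_net
dense_units = 50   # denseUnits in config.py

layers = [
    dense_params(vocab_size, dense_units),   # first hidden layer
    dense_params(dense_units, dense_units),  # second hidden layer
    dense_params(dense_units, 1),            # sigmoid output layer
]

print(layers)       # [550, 2550, 51]
print(sum(layers))  # 3151
```

With roughly three thousand parameters and only eight training sentences, the model has far more capacity than data, which is worth keeping in mind when reading the training metrics.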
Training the BoW Model
Now it is time to combine all our modules and train a model with the Bag-of-Words approach. Let’s move into the train.py script.
# USAGE
# python train.py

# import the necessary packages
from pyimagesearch import config
from pyimagesearch.model import build_shallow_net
from pyimagesearch.bow import calculate_bag_of_words
from pyimagesearch.data_processing import preprocess
from pyimagesearch.data_processing import prepare_tokenizer
from pyimagesearch.tensorflow_wrapper import tensorflow_wrap
import pandas as pd

# convert the input data dictionary to a pandas data frame
df = pd.DataFrame.from_dict(config.dataDict)

# preprocess the data frame and create data dictionaries
preprocessedDf = preprocess(sentDf=df, stopWords=config.stopWrds)
(textDict, labelDict) = prepare_tokenizer(df)

# create an empty list for vectors
freqList = list()

# build vectors from the sentences
for sentence in df["sentence"]:
    # create entries for each sentence and update the vector list
    entryFreq = calculate_bag_of_words(text=textDict,
        sentence=sentence)
    freqList.append(entryFreq)
On Line 14, we convert the input data dictionary defined in config.py into a dataframe. Then the preprocess function defined in data_processing.py is used to process the dataframe (Line 17). This is followed by creating the tokenizers on Line 18.
We generate an empty list to store the word occurrences on Line 21. Looping over each sentence, the frequency of words is calculated using the calculate_bag_of_words function located in bow.py (Lines 24-28).
# create an empty data frame for the vectors
finalDf = pd.DataFrame()

# loop over the vectors and concat them
for vector in freqList:
    vector = pd.DataFrame([vector])
    finalDf = pd.concat([finalDf, vector], ignore_index=True)

# add label column to the final data frame
finalDf["label"] = df["sentiment"]

# convert label into corresponding vector
for i in range(len(finalDf["label"])):
    finalDf["label"][i] = labelDict[finalDf["label"][i]]

# initialize the vanilla model
shallowModel = build_shallow_net()
print("[Info] Compiling model...")

# fit the Keras model on the dataset
shallowModel.fit(
    finalDf.iloc[:,0:10],
    finalDf.iloc[:,10].astype("float32"),
    epochs=config.epochs,
    batch_size=config.batchSize
)
An empty dataframe to store the vectorized inputs is created on Line 31. Each vector in freqList is appended into the empty dataframe on Lines 34-36.
The labels are added to the dataframe on Line 39. But since the label is still in string format, we convert them into vector format on Lines 42 and 43.
The vanilla model for training is initialized on Line 46, and we proceed to fit the training data and labels on Lines 50-55. Since we had added the label column to the dataframe, we can separate the data and labels using the iloc functionality (0:10 for data and 10 for labels).
# create dataset using TensorFlow
trainX, trainY = tensorflow_wrap(df)

# initialize the new model for tf wrapped data
tensorflowModel = build_shallow_net()
print("[Info] Compiling model with tensorflow wrapped data...")

# fit the keras model on the tensorflow dataset
tensorflowModel.fit(
    trainX,
    trainY,
    epochs=config.epochs,
    batch_size=config.batchSize
)
Now we move on to the TensorFlow-wrapped data. Just a single line of code (Line 58) gets us trainX (data) and trainY (labels). The data is fit to a different model named tensorflowModel (Lines 61-70).
Understanding the Training Metrics
An important thing to remember is that our dataset is extremely small, and the results should be considered inconclusive. Nevertheless, let’s take a look at our training accuracies.
[INFO] Compiling model...
Epoch 1/30
1/1 [==============================] - 0s 495ms/step - loss: 0.7262 - accuracy: 0.5000
Epoch 2/30
1/1 [==============================] - 0s 10ms/step - loss: 0.7153 - accuracy: 0.5000
Epoch 3/30
1/1 [==============================] - 0s 10ms/step - loss: 0.7046 - accuracy: 0.5000
...
Epoch 27/30
1/1 [==============================] - 0s 7ms/step - loss: 0.4756 - accuracy: 1.0000
Epoch 28/30
1/1 [==============================] - 0s 5ms/step - loss: 0.4664 - accuracy: 1.0000
Epoch 29/30
1/1 [==============================] - 0s 10ms/step - loss: 0.4571 - accuracy: 1.0000
Epoch 30/30
1/1 [==============================] - 0s 5ms/step - loss: 0.4480 - accuracy: 1.0000
Our vanilla model reaches 100% accuracy by the 30th epoch, which was expected given the size of the dataset.
<keras.callbacks.History at 0x7f7bc5b5a110>
[Info] Compiling Model with Tensorflow wrapped data...
Epoch 1/30
1/1 [==============================] - 1s 875ms/step - loss: 0.6842 - accuracy: 0.5000
Epoch 2/30
1/1 [==============================] - 0s 14ms/step - loss: 0.6750 - accuracy: 0.5000
Epoch 3/30
1/1 [==============================] - 0s 7ms/step - loss: 0.6660 - accuracy: 0.5000
...
Epoch 27/30
1/1 [==============================] - 0s 9ms/step - loss: 0.4730 - accuracy: 0.8750
Epoch 28/30
1/1 [==============================] - 0s 12ms/step - loss: 0.4646 - accuracy: 0.8750
Epoch 29/30
1/1 [==============================] - 0s 12ms/step - loss: 0.4561 - accuracy: 0.8750
Epoch 30/30
1/1 [==============================] - 0s 9ms/step - loss: 0.4475 - accuracy: 0.8750
<keras.callbacks.History at 0x7f7bc594c710>
The model trained on the TensorFlow-wrapped data also reaches a high accuracy (87.5%, or 7 of the 8 sentences) by its final epoch, owing to the small dataset.
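It may look odd that the loss is still around 0.45 when the accuracy is already 100%. The two metrics measure different things: accuracy only checks which side of the threshold a prediction falls on, while binary cross-entropy also penalizes low confidence. The sketch below illustrates this with made-up predictions; the probabilities are hypothetical and are not the model's actual outputs, and the 0.5 threshold is an assumption.

```python
import math

def binary_crossentropy(labels, probs):
    # mean of -[y*log(p) + (1-y)*log(1-p)] over all samples
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(labels, probs)
    ]
    return sum(losses) / len(losses)

def accuracy(labels, probs, threshold=0.5):
    # a prediction counts as correct when it falls on the right side
    # of the threshold
    hits = sum((p > threshold) == (y == 1) for y, p in zip(labels, probs))
    return hits / len(labels)

labels = [0, 0, 1, 1, 1, 1, 0, 0]  # good=0, bad=1, as in labelDict
# hypothetical predictions: all on the correct side, none very confident
probs = [0.36, 0.36, 0.64, 0.64, 0.64, 0.64, 0.36, 0.36]

print(accuracy(labels, probs))                       # 1.0
print(round(binary_crossentropy(labels, probs), 3))  # 0.446
```

With eight samples, a single sentence on the wrong side of the threshold gives 7/8 = 0.875, which is the accuracy reported in the second log.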
It is clear that both of the models have overfit. However, an important point to note is that when it comes to text data, we would most likely want our model to overfit the training data for the best results.
If you are wondering why we want our model to overfit, that is because when it comes to text data, your training text data becomes your unquestioned commandment. Suppose a particular word appears multiple times in different sentences but in similar contexts. You would definitely want your model to grasp that and overfit, so the word’s meaning becomes clear to the model.
As I mentioned earlier, text data differs greatly from image data. An assumption we consider while making the overfitting statement is that the training data will cover almost all instances of a word appearing in different contexts.
Summary
Today we learned about Bag-of-Words (BoW), a great introduction to representation learning in Natural Language Processing (NLP). We are essentially repackaging our data into groups of tokens, following certain paradigms. These help the model get a basic understanding of what the sentences mean.
The BoW approach to NLP is limited in its ability to account for context and meaning. Naturally, representing sentences as vocabulary occurrences is ineffective in dealing with polysemy and homonymy.
Its inability to account for syntactic dependencies and non-standard text means BoW is not a strong algorithm. But in the context of NLP’s growth, this technique opened the door to many subsequent advances in representation learning, making it a pivotal part of NLP history.
Citation Information
Chakraborty, D. “Introduction to the Bag-of-Words (BoW) Model,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/oa2kt
@incollection{Chakraborty_2022_BoW,
  author = {Devjyoti Chakraborty},
  title = {Introduction to the Bag-of-Words {(BoW)} Model},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
  year = {2022},
  note = {https://pyimg.co/oa2kt},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!