Table of Contents
- Introduction to the Bag-of-Words (BoW) Model
- Introduction to the BoW Model
- The Pros and Cons of BoW
- Configuring Your Development Environment
- Having Problems Configuring Your Development Environment?
- Project Structure
- Configuring the Prerequisites
- Processing Our Data for the BoW Approach
- Building the Bag-of-Words Function
- TensorFlow Wrapping: An Alternative
- Training the BoW Model
- Understanding the Training Metrics
- Summary
Introduction to the Bag-of-Words (BoW) Model
Creating statistical models from text data has always been more complicated than modeling image data. Image data contains detectable patterns that a model can latch onto. Patterns in text data are more complex and, with traditional methods, require more computation to capture.
Last week we took a brief stroll through the history of natural language processing (NLP). Today we will learn about one of the first techniques used in modeling language data for a computer, Bag-of-Words (BoW).
In this tutorial, you will learn about the Bag-of-Words model and how to implement it.
This lesson is the 2nd in a 4-part series on NLP 101:
- Introduction to Natural Language Processing (NLP)
- Introduction to the Bag-of-Words (BoW) Model (today’s tutorial)
- Word2Vec: A Study of Embeddings in NLP
- Comparison Between Bag-of-Words and Word2Vec
To learn how to implement the Bag-of-Words model, just keep reading.
Introduction to the BoW Model
The Bag-of-Words model is a simple method for extracting features from text data. The idea is to represent each sentence as a bag of words, disregarding grammar and word order. Only the occurrence of words in a sentence defines the meaning of the sentence for the model.
This can be considered a simple form of representation learning, where each sentence is represented as a point in an N-dimensional space. For each sentence, the model assigns a weight to each dimension, and this becomes the sentence’s identity for the model.
Let’s dig deeper into what this means. Take a look at Figure 1.
We have two sentences: “I have a dog” and “You have a cat.” First, we grab all the words present in our current vocabulary and create a representation matrix where each column is dedicated to one of the words, as seen in Figure 1.
Our sentences have a combined word count of 8, but since we have 2 words (`have`, `a`) repeating, the total vocabulary size becomes 6. Now we have 6 columns representing each word in the vocabulary.
Each sentence is now represented as a combination of all the words in the vocabulary. For example, “I have a dog” has 4 of the 6 words available in the vocabulary, so we will turn on the bits for the existing words and turn off the bits for the words that don’t exist in the sentence.
Hence, if the vocab matrix columns are in the order of `I`, `have`, `a`, `dog`, `you`, and `cat`, the first sentence (“I have a dog”) representation becomes `1,1,1,1,0,0`, while the second sentence (“You have a cat”) representation becomes `0,1,1,0,1,1`.
These representations become the key to making a model understand the essence of different sentences. We are indeed ignoring grammar, but since each sentence is viewed with respect to the complete vocabulary, each one gets a unique tokenized representation, which helps it stand out from other sentences.
For example, the first sentence stands out since it has the `dog` and `I` bits turned on, while the second sentence has the `cat` and `you` bits turned on. These small differences in the representations help us model text data using the Bag-of-Words approach.
Here, we have explained BoW with the bitwise approach. BoW can also be configured to store the frequency of occurrence of words for additional reinforcement during model training.
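To make the idea concrete, here is a minimal sketch in plain Python (independent of the project code we build later; the variable names here are purely illustrative) that produces both the bitwise and the count-based representations for the two example sentences:

```python
# toy sentences from Figure 1, already lowercased and split
sentences = [["i", "have", "a", "dog"], ["you", "have", "a", "cat"]]

# build the vocabulary in order of first appearance
vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)
# vocab -> ["i", "have", "a", "dog", "you", "cat"]

# bitwise representation: 1 if the word occurs in the sentence, else 0
bitwise = [[1 if word in sentence else 0 for word in vocab]
    for sentence in sentences]
print(bitwise)  # [[1, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1]]

# count representation: how many times each word occurs
counts = [[sentence.count(word) for word in vocab]
    for sentence in sentences]
print(counts)   # identical here, since no word repeats within a sentence
```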
The Pros and Cons of BoW
Right off the bat, we see a major problem with this approach. If our input data is large, the vocabulary size grows with it. This, in turn, makes our representation matrix much larger and the computations far more expensive.
Another computational nightmare is that the matrix is filled mostly with 0s (i.e., it is a sparse matrix). A sparse matrix carries very little information per stored value and wastes a lot of memory.
The biggest disadvantage of Bag-of-Words is its complete inability to learn grammar and semantics. The tokenized representation from the representation matrix is what defines a sentence, and only the occurrence/non-occurrence of words in a sentence distinguishes it from others.
On a brighter note, the Bag-of-Words approach highlights the benefits of representation learning in a stellar way. Its simple and intuitive approach helps us at least explain what a combination of words might mean to a computer.
Of course, that brings into question the applications of Bag-of-Words. First, it is a great introductory step toward more complex representation learning techniques like Word2Vec and GloVe. And since it echoes the concept of “one-hot encoded” representations, Bag-of-Words was primarily used for feature generation from text documents.
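To illustrate that connection (a hypothetical sketch, not part of the tutorial’s code): each word can be given a one-hot vector over the vocabulary, and summing the one-hot vectors of the words in a sentence yields exactly its count-based Bag-of-Words vector.

```python
import numpy as np

vocab = ["i", "have", "a", "dog", "you", "cat"]

def one_hot(word, vocab):
    # a 1 in the position of the word, 0 everywhere else
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

# summing the one-hot vectors of a sentence gives its BoW count vector
sentence = ["i", "have", "a", "dog"]
bow = sum(one_hot(word, vocab) for word in sentence)
print(bow)  # [1 1 1 1 0 0]
```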
Now that we have grasped the idea of Bag-of-Words, let’s implement it!
Configuring Your Development Environment
To follow this guide, you need to have the OpenCV, TensorFlow, and NumPy libraries installed on your system.
Luckily, all of them are pip-installable:
```
$ pip install opencv-contrib-python
$ pip install tensorflow
$ pip install numpy
```
The training script also relies on pandas for handling the data, so install it as well if you don’t already have it (`pip install pandas`).
If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
We first need to review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
```
!tree .
.
├── pyimagesearch
│   ├── bow.py
│   ├── config.py
│   ├── data_processing.py
│   ├── __init__.py
│   ├── model.py
│   └── tensorflow_wrapper.py
└── train.py

1 directory, 7 files
```
In the `pyimagesearch` directory, we have:
- `bow.py`: Applies the Bag-of-Words technique to a sentence.
- `config.py`: Contains the configuration pipeline for the project.
- `data_processing.py`: Houses data-processing utilities.
- `__init__.py`: Turns the `pyimagesearch` directory into a Python package.
- `model.py`: Contains a small neural network architecture.
- `tensorflow_wrapper.py`: Houses the Bag-of-Words approach wrapped with `tensorflow` utilities.
In the parent directory, we have:
- `train.py`: Contains the training script for the Bag-of-Words approach.
Configuring the Prerequisites
Inside the `pyimagesearch` directory, we have a script called `config.py`, which houses the configuration pipeline for this project.
```python
# define the data to be used
dataDict = {
    "sentence":[
        "Avengers is a great movie.",
        "I love Avengers it is great.",
        "Avengers is a bad movie.",
        "I hate Avengers.",
        "I didnt like the Avengers movie.",
        "I think Avengers is a bad movie.",
        "I love the movie.",
        "I think it is great."
    ],
    "sentiment":[
        "good",
        "good",
        "bad",
        "bad",
        "bad",
        "bad",
        "good",
        "good"
    ]
}

# define a list of stopwords
stopWrds = ["is", "a", "i", "it"]

# define model training parameters
epochs = 30
batchSize = 10

# define number of dense units
denseUnits = 50
```
We will be implementing the Bag-of-Words approach ourselves (with plain Python and pandas), as well as with TensorFlow utilities, to compare the two methods. On Lines 2-23, we have `dataDict`, a dictionary containing our input data split into sentences and their corresponding labels.
On Line 26, we define a list of stop-words, which will be omitted from our data since the BoW approach does not care about grammar and semantics. The more unique words each sentence has, the better it can be separated from the rest.
On Lines 29-33, we define some parameters for the TensorFlow model, like the number of epochs, batch size, and the number of dense units to add to our small neural network. This concludes the configuration pipeline.
Processing Our Data for the BoW Approach
As deep learning practitioners, you and I probably take many things in our daily projects for granted. We use TensorFlow/PyTorch wrappers for almost every processing task and forget what actually makes these wrappers so important.
For our project today, we will be coding some of these pre-processing wrappers on our own. For that, let’s move into the `data_processing.py` script.
```python
# import the necessary packages
import re

def preprocess(sentDf, stopWords, key="sentence"):
    # loop over all the sentences
    for num in range(len(sentDf[key])):
        # lowercase the string and remove punctuation
        sentence = sentDf[key][num]
        sentence = re.sub(
            r"[^a-zA-Z0-9]", " ", sentence.lower()
        ).split()

        # define a list for processed words
        newWords = list()

        # loop over the words in each sentence and filter out the
        # stopwords
        for word in sentence:
            if word not in stopWords:
                # append word if not a stopword
                newWords.append(word)

        # replace sentence with the list of new words
        sentDf[key][num] = newWords

    # return the preprocessed data
    return sentDf
```
On Line 4, we have the first helper function, `preprocess`, which takes in the following arguments:
- `sentDf`: The input dataframe.
- `stopWords`: A list of words to omit from the data.
- `key`: A key to access the relevant part of the input dataframe.
We loop over all the available sentences in the dataframe (Lines 6-27), make the words lowercase, remove punctuation, and omit the stopwords.
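As a quick sanity check, here is a hypothetical usage of `preprocess` on the project’s `dataDict` (a sketch assuming the files from the “Downloads” section are on your Python path):

```python
import pandas as pd

from pyimagesearch import config
from pyimagesearch.data_processing import preprocess

# build the dataframe and clean it in place
df = pd.DataFrame.from_dict(config.dataDict)
df = preprocess(sentDf=df, stopWords=config.stopWrds)

# each sentence is now a lowercase, stopword-free list of words,
# e.g., "Avengers is a great movie." -> ["avengers", "great", "movie"]
print(df["sentence"][0])
```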
```python
def prepare_tokenizer(df, sentKey="sentence", outputKey="sentiment"):
    # counters for tokenizer indices
    wordCounter = 0
    labelCounter = 0

    # create placeholder dictionaries for tokenizer
    textDict = dict()
    labelDict = dict()

    # loop over the sentences
    for entry in df[sentKey]:
        # loop over each word and
        # check if encountered before
        for word in entry:
            if word not in textDict.keys():
                textDict[word] = wordCounter

                # update word counter if new
                # word is encountered
                wordCounter += 1

    # repeat same process for labels
    for label in df[outputKey]:
        if label not in labelDict.keys():
            labelDict[label] = labelCounter
            labelCounter += 1

    # return the dictionaries
    return (textDict, labelDict)
```
The second function in this script is `prepare_tokenizer` (Line 29), which takes in the following arguments:
- `df`: The dataframe from which we will create our tokenizer.
- `sentKey`: The key to access the sentences in the dataframe.
- `outputKey`: The key to access the labels in the dataframe.
First, we create counters for indices on Lines 31 and 32. On Lines 35 and 36, we create dictionaries for the tokenizer.
Next, we start looping over the sentences (Line 39) and adding the words to our dictionary. If we encounter a word we have already seen before, we ignore it. If the word is newly encountered, it is added to the dictionary (Lines 42-48).
We apply the same process to the labels (Lines 51-54), concluding the `prepare_tokenizer` function and, with it, the `data_processing.py` script.
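Continuing the sketch from above, calling `prepare_tokenizer` on the preprocessed dataframe would give lookups along these lines (the exact indices depend on the order of first appearance, so treat the printed values as illustrative):

```python
from pyimagesearch.data_processing import prepare_tokenizer

# build word -> index and label -> index lookups
(textDict, labelDict) = prepare_tokenizer(df)

# textDict maps each unique word to an integer index in order of
# first appearance, e.g., {"avengers": 0, "great": 1, "movie": 2, ...}
print(len(textDict))  # 10 unique words after stopword removal
print(labelDict)      # {"good": 0, "bad": 1}
```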
Building the Bag-of-Words Function
Now, we will move into the `bow.py` script to see our custom function to calculate the bag of words.
```python
def calculate_bag_of_words(text, sentence):
    # create a dictionary for frequency check
    freqDict = dict.fromkeys(text, 0)

    # loop over the words in sentences
    for word in sentence:
        # update word frequency
        freqDict[word] = sentence.count(word)

    # return dictionary
    return freqDict
```
The function `calculate_bag_of_words` takes in the vocabulary and the sentence as its arguments (Line 1). Next, we create a dictionary on Line 3 to check and store the occurrence of words.
Looping over each word in the sentence (Line 6), we count the number of times that word appears and return the resulting frequency dictionary (Lines 8-11).
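For instance, continuing the sketches above, the first processed sentence would produce a frequency dictionary keyed by the full vocabulary (illustrative output, not verified verbatim):

```python
from pyimagesearch.bow import calculate_bag_of_words

# frequency vector (as a dict) for the first processed sentence
entryFreq = calculate_bag_of_words(text=textDict,
    sentence=df["sentence"][0])

# every vocabulary word appears as a key; words present in the
# sentence carry their counts, the rest stay at 0, e.g.,
# {"avengers": 1, "great": 1, "movie": 1, "love": 0, ...}
print(entryFreq)
```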
TensorFlow Wrapping: An Alternative
Till now, we have seen what it would be like to create all the pre-processing functionalities ourselves. If you feel it is too complicated, we will also show you how to use TensorFlow for the same processes instead. Let’s move into `tensorflow_wrapper.py`.
```python
# import the necessary packages
from tensorflow.keras.preprocessing.text import Tokenizer

def tensorflow_wrap(df):
    # create the tokenizer for sentences
    tokenizerSentence = Tokenizer()

    # create the tokenizer for labels
    tokenizerLabel = Tokenizer()

    # fit the tokenizer on the documents
    tokenizerSentence.fit_on_texts(df["sentence"])

    # fit the tokenizer on the labels
    tokenizerLabel.fit_on_texts(df["sentiment"])

    # create vectors using tensorflow
    encodedData = tokenizerSentence.texts_to_matrix(
        texts=df["sentence"], mode="count")

    # add label column
    labels = df["sentiment"]

    # correct label vectors
    for i in range(len(labels)):
        labels[i] = tokenizerLabel.word_index[labels[i]] - 1

    # return data and labels
    return (encodedData[:, 1:], labels.astype("float32"))
```
Inside the script, we have the `tensorflow_wrap` function (Line 4), which takes in the dataframe as its argument.
On Lines 6-9, we initialize tokenizers for the sentences and labels, respectively. By simply calling the `fit_on_texts` function, we finish creating the tokenizers for the sentences and labels (Lines 12-15).
Using another function called `texts_to_matrix` to create our encodings, we get the vectorized format of our processed sentences (Lines 18 and 19).
On Lines 22-26, we create labels and then return the encodings and labels on Line 29.
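One detail worth spelling out: the Keras `Tokenizer` reserves index 0 (its `word_index` starts at 1), so the matrix returned by `texts_to_matrix` carries an unused first column, which is why the function returns `encodedData[:, 1:]`. Here is a small standalone sketch of that behavior (the toy sentences are just for illustration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# fit a tokenizer on two toy sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(["i have a dog", "you have a cat"])

# one row per sentence, one column per word index; mode="count"
# stores word frequencies instead of 0/1 flags
matrix = tokenizer.texts_to_matrix(
    texts=["i have a dog", "you have a cat"], mode="count")
print(matrix.shape)   # (2, 7) -> 6 vocabulary words + the unused column 0
print(matrix[:, 1:])  # drop column 0 to get the 6-column BoW vectors
```

Next, let’s define the small neural network that will consume these Bag-of-Words vectors. It lives in `model.py`.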
```python
# import the necessary packages
import pyimagesearch.config as config
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_shallow_net():
    # define the model
    model = Sequential()
    model.add(Dense(config.denseUnits, input_dim=10, activation="relu"))
    model.add(Dense(config.denseUnits, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))

    # compile the keras model
    model.compile(loss="binary_crossentropy", optimizer="adam",
        metrics=["accuracy"]
    )

    # return model
    return model
```
On Line 6, we define the `build_shallow_net` function, which initializes a shallow neural network.
The network starts with a dense layer whose `input_dim` is set to 10. This is because the vocabulary of our text corpus after processing contains 10 unique words. Two more dense layers follow, the final one being the sigmoid-activated output layer (Lines 8-11).
On Lines 14 and 15, we compile the model with `binary_crossentropy` loss, the `adam` optimizer, and `accuracy` as the metric.
With that, our model is ready for use.
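As a quick check, you could instantiate the network and print its summary (a sketch; the exact parameter counts follow from `denseUnits = 50` and `input_dim=10`):

```python
from pyimagesearch.model import build_shallow_net

# build and inspect the shallow network
model = build_shallow_net()
model.summary()  # expected layout: Dense(50) -> Dense(50) -> Dense(1, sigmoid)
```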
Training the BoW Model
Now it is time to combine all our modules and train a model with the Bag-of-Words approach. Let’s move into the `train.py` script.
```python
# USAGE
# python train.py

# import the necessary packages
from pyimagesearch import config
from pyimagesearch.model import build_shallow_net
from pyimagesearch.bow import calculate_bag_of_words
from pyimagesearch.data_processing import preprocess
from pyimagesearch.data_processing import prepare_tokenizer
from pyimagesearch.tensorflow_wrapper import tensorflow_wrap
import pandas as pd

# convert the input data dictionary to a pandas data frame
df = pd.DataFrame.from_dict(config.dataDict)

# preprocess the data frame and create data dictionaries
preprocessedDf = preprocess(sentDf=df, stopWords=config.stopWrds)
(textDict, labelDict) = prepare_tokenizer(df)

# create an empty list for vectors
freqList = list()

# build vectors from the sentences
for sentence in df["sentence"]:
    # create entries for each sentence and update the vector list
    entryFreq = calculate_bag_of_words(text=textDict,
        sentence=sentence)
    freqList.append(entryFreq)
```
On Line 14, we convert the input data dictionary defined in `config.py` into a dataframe. Then the `preprocess` function created in `data_processing.py` is used to process the dataframe (Line 17). This is followed by creating the tokenizers on Line 18.
We generate an empty list to store the word occurrences on Line 21. Looping over each sentence, the frequency of words is calculated using the `calculate_bag_of_words` function located in `bow.py` (Lines 24-28).
```python
# create an empty data frame for the vectors
finalDf = pd.DataFrame()

# loop over the vectors and concat them
for vector in freqList:
    vector = pd.DataFrame([vector])
    finalDf = pd.concat([finalDf, vector], ignore_index=True)

# add label column to the final data frame
finalDf["label"] = df["sentiment"]

# convert label into corresponding vector
for i in range(len(finalDf["label"])):
    finalDf["label"][i] = labelDict[finalDf["label"][i]]

# initialize the vanilla model
shallowModel = build_shallow_net()
print("[Info] Compiling model...")

# fit the Keras model on the dataset
shallowModel.fit(
    finalDf.iloc[:,0:10],
    finalDf.iloc[:,10].astype("float32"),
    epochs=config.epochs,
    batch_size=config.batchSize
)
```
An empty dataframe to store the vectorized inputs is created on Line 31. Each vector in `freqList` is appended to this dataframe on Lines 34-36.
The labels are added to the dataframe on Line 39. But since the labels are still strings, we convert them into their corresponding numeric indices on Lines 42 and 43.
The vanilla model for training is initialized on Line 46, and we proceed to fit the training data and labels on Lines 50-55. Since we added the label column to the dataframe, we can separate the data and labels using the `iloc` functionality (columns `0:10` for the data and column `10` for the labels).
```python
# create dataset using TensorFlow
trainX, trainY = tensorflow_wrap(df)

# initialize the new model for tf wrapped data
tensorflowModel = build_shallow_net()
print("[Info] Compiling model with tensorflow wrapped data...")

# fit the keras model on the tensorflow dataset
tensorflowModel.fit(
    trainX,
    trainY,
    epochs=config.epochs,
    batch_size=config.batchSize
)
```
Now we move on to the `tensorflow`-wrapped data. Just a single line of code (Line 58) gets us `trainX` (the data) and `trainY` (the labels). The data is then used to fit a separate model named `tensorflowModel` (Lines 61-70).
Understanding the Training Metrics
An important thing to remember is that our dataset is extremely small, and the results should be considered inconclusive. Nevertheless, let’s take a look at our training accuracies.
```
[INFO] Compiling model...
Epoch 1/30
1/1 [==============================] - 0s 495ms/step - loss: 0.7262 - accuracy: 0.5000
Epoch 2/30
1/1 [==============================] - 0s 10ms/step - loss: 0.7153 - accuracy: 0.5000
Epoch 3/30
1/1 [==============================] - 0s 10ms/step - loss: 0.7046 - accuracy: 0.5000
...
Epoch 27/30
1/1 [==============================] - 0s 7ms/step - loss: 0.4756 - accuracy: 1.0000
Epoch 28/30
1/1 [==============================] - 0s 5ms/step - loss: 0.4664 - accuracy: 1.0000
Epoch 29/30
1/1 [==============================] - 0s 10ms/step - loss: 0.4571 - accuracy: 1.0000
Epoch 30/30
1/1 [==============================] - 0s 5ms/step - loss: 0.4480 - accuracy: 1.0000
```
Our vanilla model reaches 100% accuracy by the 30th epoch, which was expected given the size of the dataset.
```
<keras.callbacks.History at 0x7f7bc5b5a110>
[Info] Compiling model with tensorflow wrapped data...
Epoch 1/30
1/1 [==============================] - 1s 875ms/step - loss: 0.6842 - accuracy: 0.5000
Epoch 2/30
1/1 [==============================] - 0s 14ms/step - loss: 0.6750 - accuracy: 0.5000
Epoch 3/30
1/1 [==============================] - 0s 7ms/step - loss: 0.6660 - accuracy: 0.5000
...
Epoch 27/30
1/1 [==============================] - 0s 9ms/step - loss: 0.4730 - accuracy: 0.8750
Epoch 28/30
1/1 [==============================] - 0s 12ms/step - loss: 0.4646 - accuracy: 0.8750
Epoch 29/30
1/1 [==============================] - 0s 12ms/step - loss: 0.4561 - accuracy: 0.8750
Epoch 30/30
1/1 [==============================] - 0s 9ms/step - loss: 0.4475 - accuracy: 0.8750
<keras.callbacks.History at 0x7f7bc594c710>
```
The `tensorflow` data-wrapped model also reaches a pretty high accuracy by its final epoch, owing to the small dataset.
It is clear that both models have overfit. However, an important point to note is that when it comes to text data, we would most likely want our model to overfit the training data for the best results.
If you are wondering why we want our model to overfit, it is because with text data, your training text becomes your unquestioned commandment. If a particular word appears multiple times across different sentences but in similar contexts, you definitely want your model to grasp that pattern, even if it means overfitting, so the word’s meaning becomes clear to the model.
As I mentioned earlier, text data differs greatly from image data. An assumption behind this overfitting statement is that the training data will cover almost all the contexts in which a word can appear.
What's next? We recommend PyImageSearch University.
86 total classes • 115+ hours of on-demand code walkthrough videos • Last updated: October 2024
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86 courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
Today we learned about Bag-of-Words (BoW), a great introduction to representation learning in Natural Language Processing (NLP). We essentially repackage our data into token occurrences over a fixed vocabulary, and these representations help the model get a basic understanding of what the sentences mean.
The BoW approach to NLP is limited in its ability to account for context and meaning. Naturally, representing sentences as vocabulary occurrences is ineffective in dealing with polysemy and homonymy.
Its inability to account for syntactic dependencies and non-standard text means BoW is not a strong algorithm by today’s standards. But in the context of NLP’s growth, this technique opened the door to many subsequent advances in representation learning, making it a pivotal part of NLP history.
Citation Information
Chakraborty, D. “Introduction to the Bag-of-Words (BoW) Model,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/oa2kt
```
@incollection{Chakraborty_2022_BoW,
  author = {Devjyoti Chakraborty},
  title = {Introduction to the Bag-of-Words {(BoW)} Model},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
  year = {2022},
  note = {https://pyimg.co/oa2kt},
}
```
Unleash the potential of computer vision with Roboflow - Free!
- Step into the realm of the future by signing up or logging into your Roboflow account. Unlock a wealth of innovative dataset libraries and revolutionize your computer vision operations.
- Jumpstart your journey by choosing from our broad array of datasets, or benefit from PyimageSearch’s comprehensive library, crafted to cater to a wide range of requirements.
- Transfer your data to Roboflow in any of the 40+ compatible formats. Leverage cutting-edge model architectures for training, and deploy seamlessly across diverse platforms, including API, NVIDIA, browser, iOS, and beyond. Integrate our platform effortlessly with your applications or your favorite third-party tools.
- Equip yourself with the ability to train a potent computer vision model in a mere afternoon. With a few images, you can import data from any source via API, annotate images using our superior cloud-hosted tool, kickstart model training with a single click, and deploy the model via a hosted API endpoint. Tailor your process by opting for a code-centric approach, leveraging our intuitive, cloud-based UI, or combining both to fit your unique needs.
- Embark on your journey today with absolutely no credit card required. Step into the future with Roboflow.
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.