Human Activity Recognition with OpenCV and Deep Learning

In this tutorial you will learn how to perform Human Activity Recognition with OpenCV and Deep Learning.

Our human activity recognition model can recognize over 400 activities with 78.4-94.5% accuracy (depending on the task).

A sample of the activities can be seen below:

archery
arm wrestling
baking cookies
counting money
driving tractor
eating hotdog
flying kite
getting a tattoo
grooming horse
hugging

ice skating
juggling fire
kissing
laughing
motorcycling
news anchoring
opening present
playing guitar
playing tennis
robot dancing

sailing
scuba diving
snowboarding
tasting beer
trimming beard
using computer
washing dishes
welding
yoga
…and more!

Practical applications of human activity recognition include:

Automatically classifying/categorizing a dataset of videos on disk.
Training and monitoring a new employee to correctly perform a task (e.g., the proper steps and procedures when making a pizza, including rolling out the dough, heating the oven, and putting on sauce, cheese, toppings, etc.).
Verifying that a food service worker has washed their hands after visiting the restroom or handling food that could cause cross-contamination (e.g., chicken and salmonella).
Monitoring bar/restaurant patrons and ensuring they are not over-served.

To learn how to perform human activity recognition with OpenCV and Deep Learning, just keep reading!

Human Activity Recognition with OpenCV and Deep Learning

In the first part of this tutorial we’ll discuss the Kinetics dataset, the dataset used to train our human activity recognition model.

From there we’ll discuss how we can extend ResNet, which typically uses 2D kernels, to instead leverage 3D kernels, enabling us to include a spatiotemporal component used for activity recognition.

We’ll then implement two versions of human activity recognition using the OpenCV library and the Python programming language.

Finally, we’ll wrap up the tutorial by looking at the results of applying human activity recognition to a few sample videos.

The Kinetics Dataset

**Figure 1:** The pre-trained human activity recognition deep learning model used in today’s tutorial was trained on the Kinetics 400 dataset.

The dataset our human activity recognition model was trained on is the Kinetics 400 Dataset.

This dataset consists of:

400 human activity recognition classes
At least 400 video clips per class (downloaded via YouTube)
A total of 300,000 videos

You can view the full list of classes the model can recognize here.

To learn more about the dataset, including how it was curated, be sure to refer to Kay et al.’s 2017 paper, The Kinetics Human Action Video Dataset.

3D ResNet for Human Activity Recognition

**Figure 2:** Deep neural network advances on image classification with ImageNet have also led to success in deep learning activity recognition (i.e. on videos). In this tutorial, we perform deep learning activity recognition with OpenCV. (image source: Figure 1 from Hara et al.)

The model we’re using for human activity recognition comes from Hara et al.’s 2018 CVPR paper, Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

In this work the authors explore how existing state-of-the-art 2D architectures (such as ResNet, ResNeXt, DenseNet, etc.) can be extended to video classification via 3D kernels.

The authors argue:

These architectures have been successfully applied to image classification.
The large-scale ImageNet dataset allowed such models to be trained to such high accuracy.
The Kinetics dataset is also sufficiently large.

…and therefore, these architectures should be able to perform video classification by (1) changing the input volume shape to include spatiotemporal information and (2) utilizing 3D kernels inside of the architecture.
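To make the idea of a 3D kernel concrete, here is a minimal NumPy sketch. It is not taken from Hara et al.'s code: `conv3d_valid` is a hypothetical helper name, the loop is deliberately naive, and the clip is a tiny single-channel volume rather than real video. The point is only that the kernel slides along the time axis as well as the two spatial axes, so every output value summarizes several consecutive frames.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) volume
    with a (kt, kh, kw) kernel (illustration only, not optimized)."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1), dtype=np.float32)
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # each output value looks at kt consecutive frames
                patch = volume[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(patch * kernel)
    return out

# a toy "clip": 16 frames of 8x8 single-channel pixels
clip = np.ones((16, 8, 8), dtype=np.float32)

# a 3x3x3 averaging kernel spanning 3 frames in time
kernel3d = np.ones((3, 3, 3), dtype=np.float32) / 27.0

features = conv3d_valid(clip, kernel3d)
print(features.shape)  # (14, 6, 6)
```

A 2D kernel applied frame by frame would leave the time axis untouched. The 3D kernel shrinks it from 16 to 14 here because it mixes information across neighboring frames.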

The authors were in fact correct!

By modifying both the input volume shape and the kernel shape, the authors obtained:

78.4% accuracy on the Kinetics test set
94.5% accuracy on the UCF-101 test set
70.2% accuracy on the HMDB-51 test set

These results are similar to rank-1 accuracies reported on state-of-the-art models trained on ImageNet, thereby demonstrating that these model architectures can be utilized for video classification simply by including spatiotemporal information and swapping 2D kernels for 3D ones.

For more information on the modified ResNet architecture, experiment design, and final accuracies, be sure to refer to the paper.

Downloading the Human Activity Recognition Model for OpenCV

**Figure 3:** Files required for human activity recognition with OpenCV and deep learning.

To follow along with the rest of this tutorial you’ll need to download the:

Human activity model
Python + OpenCV source code
Example video for classification

You can use the “Downloads” section of this tutorial to download a .zip containing all three.

Once downloaded, continue on with the rest of this tutorial.

Project structure

Let’s inspect our project files:

$ tree
.
├── action_recognition_kinetics.txt
├── resnet-34_kinetics.onnx
├── example_activities.mp4
├── human_activity_reco.py
└── human_activity_reco_deque.py

0 directories, 5 files

Our project consists of three auxiliary files:

action_recognition_kinetics.txt : The class labels for the Kinetics dataset.
resnet-34_kinetics.onnx : Hara et al.’s pre-trained and serialized human activity recognition convolutional neural network trained on the Kinetics dataset.
example_activities.mp4 : A compilation of clips for testing human activity recognition.

We will review two Python scripts, each of which accepts the above three files as input:

human_activity_reco.py : Our human activity recognition script which samples N frames at a time to make an activity classification prediction.
human_activity_reco_deque.py : A similar human activity recognition script that keeps a rolling queue of the most recent frames. This script is slower to run; however, I’m providing the implementation so that you can learn from and experiment with it.

Implementing Human Activity Recognition with OpenCV

Let’s go ahead and implement human activity recognition with OpenCV. Our implementation is based on OpenCV’s official example; however, I’ve provided additional changes (both in this example and the next) along with additional commentary/detailed explanations on what the code is doing.

Open up the human_activity_reco.py file in your project structure and insert the following code:

# import the necessary packages
import numpy as np
import argparse
import imutils
import sys
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
	help="path to trained human activity recognition model")
ap.add_argument("-c", "--classes", required=True,
	help="path to class labels file")
ap.add_argument("-i", "--input", type=str, default="",
	help="optional path to video file")
args = vars(ap.parse_args())

We begin with imports on Lines 2-6. For today’s tutorial you need OpenCV 4.1.2 or newer and imutils installed. Visit my pip install opencv instructions to install OpenCV on your system if you have not done so already.

Lines 10-16 parse our command line arguments:

--model : The path to the trained human activity recognition model.
--classes : The path to the activity recognition class labels file.
--input : An optional path to your input video file. If this argument is not included on the command line, your webcam will be invoked.

From here we’ll perform initializations:

# load the contents of the class labels file, then define the sample
# duration (i.e., # of frames for classification) and sample size
# (i.e., the spatial dimensions of the frame)
CLASSES = open(args["classes"]).read().strip().split("\n")
SAMPLE_DURATION = 16
SAMPLE_SIZE = 112

Line 21 loads our class labels from the text file.

Lines 22 and 23 define the sample duration (i.e. the number of frames for classification) and sample size (i.e. the spatial dimensions of the frame).
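If you would like the labels file to be closed explicitly and blank lines to be ignored, a small helper can replace the one-liner. This is only a sketch: `load_labels` is a hypothetical name that is not part of the downloadable code, and the demo writes a throwaway file so it runs without the real labels file.

```python
import os
import tempfile

def load_labels(path):
    """Read one class label per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# demo on a temporary file (stand-in for action_recognition_kinetics.txt)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_labels.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("archery\narm wrestling\n\nyoga\n")
    labels = load_labels(demo_path)

print(labels)  # ['archery', 'arm wrestling', 'yoga']
```

In the script itself you could then write `CLASSES = load_labels(args["classes"])`.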

Next, we’ll load and initialize our human activity recognition model:

# load the human activity recognition model
print("[INFO] loading human activity recognition model...")
net = cv2.dnn.readNet(args["model"])

# grab a pointer to the input video stream
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)

Line 27 uses OpenCV’s DNN module to read the pre-trained human activity recognition model, which was trained with PyTorch and serialized in ONNX format.

Line 31 then instantiates our video stream using either a video file or webcam.

We’re now ready to begin looping over frames and performing human activity recognition:

# loop until we explicitly break from it
while True:
	# initialize the batch of frames that will be passed through the
	# model
	frames = []

	# loop over the number of required sample frames
	for i in range(0, SAMPLE_DURATION):
		# read a frame from the video stream
		(grabbed, frame) = vs.read()

		# if the frame was not grabbed then we've reached the end of
		# the video stream so exit the script
		if not grabbed:
			print("[INFO] no frame read from stream - exiting")
			sys.exit(0)

		# otherwise, the frame was read so resize it and add it to
		# our frames list
		frame = imutils.resize(frame, width=400)
		frames.append(frame)

Line 34 begins a loop over our frames where first we initialize the batch of frames that will be passed through the neural net (Line 37).

From there, Lines 40-53 populate the batch of frames directly from our video stream. Line 52 resizes each frame to a width of 400 pixels while maintaining aspect ratio.

Let’s construct our blob of input frames which we will soon pass through the human activity recognition CNN:

	# now that our frames array is filled we can construct our blob
	blob = cv2.dnn.blobFromImages(frames, 1.0,
		(SAMPLE_SIZE, SAMPLE_SIZE), (114.7748, 107.7354, 99.4750),
		swapRB=True, crop=True)
	blob = np.transpose(blob, (1, 0, 2, 3))
	blob = np.expand_dims(blob, axis=0)

Lines 56-60 construct a blob from our input frames list.

Notice that we’re using the blobFromImages (i.e. plural) rather than the blobFromImage (i.e. singular) function — the reason here is that we’re building a batch of multiple images to be passed through the human activity recognition network, enabling it to take advantage of spatiotemporal information.

If you were to insert a print(blob.shape) statement into your code you would notice that the blob has the following dimensionality:

(1, 3, 16, 112, 112)

Let’s unpack this dimensionality a bit more:

1: The batch dimension. Here we have only a single data point that is being passed through the network (a “data point” in this context means the N frames that will be passed through the network to obtain a single classification).
3: The number of channels in our input frames.
16: The total number of frames in the blob.
112 (first occurrence): The height of the frames.
112 (second occurrence): The width of the frames.
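The shape changes can be traced without OpenCV by using a zero-filled NumPy array as a stand-in for the output of cv2.dnn.blobFromImages, which has the layout (frames, channels, height, width). The intermediate variable names below are mine and exist only for illustration.

```python
import numpy as np

SAMPLE_DURATION = 16
SAMPLE_SIZE = 112

# stand-in for the blobFromImages output:
# (num_frames, channels, height, width)
blob = np.zeros((SAMPLE_DURATION, 3, SAMPLE_SIZE, SAMPLE_SIZE),
	dtype=np.float32)
shape_after_blob = blob.shape

# swap the frame and channel axes so channels come first
blob = np.transpose(blob, (1, 0, 2, 3))
shape_after_transpose = blob.shape

# add the leading batch dimension
blob = np.expand_dims(blob, axis=0)
shape_after_expand = blob.shape

print(shape_after_blob)       # (16, 3, 112, 112)
print(shape_after_transpose)  # (3, 16, 112, 112)
print(shape_after_expand)     # (1, 3, 16, 112, 112)
```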

At this point, we’re ready to perform human activity recognition inference followed by annotating the frame with the predicted label and showing the prediction to our screen:

	# pass the blob through the network to obtain our human activity
	# recognition predictions
	net.setInput(blob)
	outputs = net.forward()
	label = CLASSES[np.argmax(outputs)]

	# loop over our frames
	for frame in frames:
		# draw the predicted activity on the frame
		cv2.rectangle(frame, (0, 0), (300, 40), (0, 0, 0), -1)
		cv2.putText(frame, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
			0.8, (255, 255, 255), 2)

		# display the frame to our screen
		cv2.imshow("Activity Recognition", frame)
		key = cv2.waitKey(1) & 0xFF

		# if the `q` key was pressed, exit the script (a `break` here
		# would only leave the inner `for` loop, not the `while` loop)
		if key == ord("q"):
			sys.exit(0)

Lines 64 and 65 pass the blob through the network, obtaining a list of outputs, the predictions.

We then grab the label of the highest prediction for the blob (Line 66).

Using the label, we can then draw the prediction on each and every frame in the frames list (Lines 69-73), displaying the output frames until the q key is pressed, at which point we exit.
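The script keeps only the single best label, but it can be useful to inspect the runner-up classes as well. The sketch below is not part of the original code: `top_k` is a hypothetical helper, and it assumes the network outputs are unnormalized scores, so it applies a softmax before ranking. If the outputs were already probabilities, the ranking would be unchanged but the printed values would not be meaningful.

```python
import numpy as np

def top_k(outputs, classes, k=5):
	"""Return the k highest-scoring (label, probability) pairs."""
	scores = np.asarray(outputs, dtype=np.float64).ravel()
	# softmax, shifted by the max score for numerical stability
	exp = np.exp(scores - scores.max())
	probs = exp / exp.sum()
	best = np.argsort(probs)[::-1][:k]
	return [(classes[i], float(probs[i])) for i in best]

# toy scores for four made-up classes (stand-in for net.forward())
demo_classes = ["archery", "yoga", "welding", "kissing"]
demo_outputs = np.array([[0.1, 3.0, 1.5, -2.0]])

ranked = top_k(demo_outputs, demo_classes, k=2)
print([label for (label, _) in ranked])  # ['yoga', 'welding']
```

In the script, calling `top_k(outputs, CLASSES)` right after `net.forward()` would give the five most likely activities for the current batch of frames.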

An Alternate Human Activity Implementation Using a Deque Data Structure

Inside our human activity recognition script from the previous section, you’ll notice the following lines:

# loop until we explicitly break from it
while True:
	# initialize the batch of frames that will be passed through the
	# model
	frames = []

	# loop over the number of required sample frames
	for i in range(0, SAMPLE_DURATION):
		# read a frame from the video stream
		(grabbed, frame) = vs.read()

		# if the frame was not grabbed then we've reached the end of
		# the video stream so exit the script
		if not grabbed:
			print("[INFO] no frame read from stream - exiting")
			sys.exit(0)

		# otherwise, the frame was read so resize it and add it to
		# our frames list
		frame = imutils.resize(frame, width=400)
		frames.append(frame)

This implementation implies that:

We read a total of SAMPLE_DURATION frames from our input video.
We pass those frames through our human activity recognition model to obtain the output.
And then we read another SAMPLE_DURATION frames and repeat the process.

Thus, our implementation is not a rolling prediction.

Instead, it’s simply grabbing a sample of frames, classifying them, and moving on to the next batch — any frames from the previous batch are discarded.

The reason we do this is for speed.

If we classified each individual frame it would take longer for the script to run.

That said, using rolling frame prediction via a deque data structure can lead to better results as it does not discard all of the previous frames — rolling frame prediction only discards the oldest frame in the list, making room for the newest frame.

To see how this can cause a problem related to inference speed, let’s suppose there are N total frames in a video file:

If we do use rolling frame prediction, we perform roughly N classifications, one for each frame (once the deque data structure is filled, of course).
If we do not use rolling frame prediction, we only have to perform N / SAMPLE_DURATION classifications, thus reducing the amount of time it takes to process a video stream significantly.
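The difference is easy to quantify. The sketch below uses two hypothetical helper functions (not part of the downloadable code) and a made-up clip length of 60 seconds at 30 frames per second to count how many forward passes each strategy needs.

```python
def batch_inference_count(num_frames, sample_duration=16):
	"""Non-rolling: one classification per full, non-overlapping batch."""
	return num_frames // sample_duration

def rolling_inference_count(num_frames, sample_duration=16):
	"""Rolling: one classification per frame once the queue is full."""
	return max(0, num_frames - sample_duration + 1)

# hypothetical clip: 60 seconds at 30 frames per second
num_frames = 30 * 60

print(batch_inference_count(num_frames))    # 112
print(rolling_inference_count(num_frames))  # 1785
```

For this clip the rolling approach runs the network roughly 16 times more often, which matches the SAMPLE_DURATION of 16.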

**Figure 4:** Rolling prediction (*blue*) uses a fully populated FIFO queue window to make predictions. Batch prediction (*red*) does not “roll” from frame to frame. Rolling prediction requires more computational horsepower but leads to better results for human activity recognition with OpenCV and deep learning.

At the time of writing, OpenCV’s dnn module does not support most GPUs (including NVIDIA GPUs), so I would recommend you do not use rolling frame prediction for most applications.

That said, inside the .zip file for today’s tutorial (found in the “Downloads” section of the post) you’ll find a file named human_activity_reco_deque.py — this file contains an implementation of Human Activity Recognition that performs rolling frame prediction.

The script is very similar to the previous one, but I’m including it here for you to experiment with:

# import the necessary packages
from collections import deque
import numpy as np
import argparse
import imutils
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
	help="path to trained human activity recognition model")
ap.add_argument("-c", "--classes", required=True,
	help="path to class labels file")
ap.add_argument("-i", "--input", type=str, default="",
	help="optional path to video file")
args = vars(ap.parse_args())

# load the contents of the class labels file, then define the sample
# duration (i.e., # of frames for classification) and sample size
# (i.e., the spatial dimensions of the frame)
CLASSES = open(args["classes"]).read().strip().split("\n")
SAMPLE_DURATION = 16
SAMPLE_SIZE = 112

# initialize the frames queue used to store a rolling sample duration
# of frames -- this queue will automatically pop out old frames and
# accept new ones
frames = deque(maxlen=SAMPLE_DURATION)

# load the human activity recognition model
print("[INFO] loading human activity recognition model...")
net = cv2.dnn.readNet(args["model"])

# grab a pointer to the input video stream
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)

Imports are the same with the exception of Python’s built-in deque implementation from the collections module (Line 2).

On Line 28, we initialize the FIFO frames queue with a maximum length equal to our sample duration. Our “first-in, first-out” (FIFO) queue will automatically pop out old frames and accept new ones. We’ll perform rolling inference on the queue of frames.
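If you have not used a bounded deque before, this tiny standalone demo shows the behavior. It uses integers in place of frames and a window of 3 instead of SAMPLE_DURATION purely to keep the output short.

```python
from collections import deque

# a tiny window; the tutorial uses maxlen=SAMPLE_DURATION
window = deque(maxlen=3)
history = []

for frame_id in range(5):
	# once the deque is full, appending drops the oldest entry
	window.append(frame_id)
	history.append(list(window))

print(history)
# [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Notice that no explicit pop is needed: appending to a full deque discards the oldest item on its own.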

All other lines above are the same, so let’s now inspect our frame processing loop:

# loop over frames from the video stream
while True:
	# read a frame from the video stream
	(grabbed, frame) = vs.read()

	# if the frame was not grabbed then we've reached the end of
	# the video stream so break from the loop
	if not grabbed:
		print("[INFO] no frame read from stream - exiting")
		break

	# resize the frame (to ensure faster processing) and add the
	# frame to our queue
	frame = imutils.resize(frame, width=400)
	frames.append(frame)

	# if our queue is not filled to the sample size, continue back to
	# the top of the loop and continue polling/processing frames
	if len(frames) < SAMPLE_DURATION:
		continue

Lines 41-57 are different than in our previous script.

Previously, we sampled a batch of SAMPLE_DURATION frames and would later perform inference on that batch.

In this script, we still perform inference in batch; however, it is now a rolling batch. The difference is that we add frames to our FIFO queue on Line 52. Again, this queue has a maxlen of our sample duration and the newest entry in the queue will always be the current frame of our video stream. Once the queue fills up, old frames are popped out automatically with the deque FIFO implementation.

The result of this rolling implementation is that once the queue is full, any given frame (with the exception of the very first frame) will be “touched” (i.e. included in the rolling batch) more than once. This method is less efficient; however, it leads to more accurate activity recognition, especially when the video/scene’s activities change periodically.

Lines 56 and 57 allow our frames queue to fill up (i.e. to 16 frames as shown in Figure 4, blue) prior to any inference being performed.

Once the queue is full, we will perform a rolling human activity recognition prediction:

	# now that our frames array is filled we can construct our blob
	blob = cv2.dnn.blobFromImages(frames, 1.0,
		(SAMPLE_SIZE, SAMPLE_SIZE), (114.7748, 107.7354, 99.4750),
		swapRB=True, crop=True)
	blob = np.transpose(blob, (1, 0, 2, 3))
	blob = np.expand_dims(blob, axis=0)

	# pass the blob through the network to obtain our human activity
	# recognition predictions
	net.setInput(blob)
	outputs = net.forward()
	label = CLASSES[np.argmax(outputs)]

	# draw the predicted activity on the frame
	cv2.rectangle(frame, (0, 0), (300, 40), (0, 0, 0), -1)
	cv2.putText(frame, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
		0.8, (255, 255, 255), 2)

	# display the frame to our screen
	cv2.imshow("Activity Recognition", frame)
	key = cv2.waitKey(1) & 0xFF

	# if the `q` key was pressed, break from the loop
	if key == ord("q"):
		break

This code block contains lines of code identical to our previous script. Here we:

Construct a blob from our queue of frames.
Perform inference and grab the highest probability prediction for the blob.
Annotate and display the current frame with the resulting label of rolling human activity recognition.
Exit upon the q key being pressed.

Human Activity Recognition Results

Let’s see the results of our human activity recognition code in action!

Use the “Downloads” section of this tutorial to download the pre-trained human activity recognition model, Python + OpenCV source code, and example demo video.

From there, open up a terminal and execute the following command:

$ python human_activity_reco_deque.py --model resnet-34_kinetics.onnx \
	--classes action_recognition_kinetics.txt \
	--input example_activities.mp4
[INFO] loading human activity recognition model...
[INFO] accessing video stream...

Please note that our Human Activity Recognition model requires at least OpenCV 4.1.2.

If you are running an older version of OpenCV you will receive the following error:

net = cv2.dnn.readNet(args["model"])
cv2.error: OpenCV(4.1.0) /Users/adrian/build/skvark/opencv-python/opencv/modules/dnn/src/onnx/onnx_importer.cpp:245: error: (-215:Assertion failed) attribute_proto.ints_size() == 2 in function 'getLayerParams'

If you receive that error you need to upgrade your OpenCV install to at least OpenCV 4.1.2.
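To fail early with a clearer message, you could compare the installed version against the minimum before loading the model. The helpers below are a sketch with hypothetical names (`parse_version`, `opencv_is_new_enough`). They work on plain version strings so the example runs even without OpenCV installed.

```python
import re

MIN_OPENCV = (4, 1, 2)

def parse_version(version):
	"""Turn a string such as '4.1.2' or '4.5.5-dev' into a tuple of ints."""
	parts = re.findall(r"\d+", version)[:3]
	return tuple(int(p) for p in parts)

def opencv_is_new_enough(version, minimum=MIN_OPENCV):
	"""Return True if the version string meets the minimum version."""
	return parse_version(version) >= minimum

print(opencv_is_new_enough("4.1.0"))  # False
print(opencv_is_new_enough("4.1.2"))  # True
```

Inside either script you would pass `cv2.__version__` to `opencv_is_new_enough` and exit with a helpful message if it returns False.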

Below is an example of our model correctly labeling an input video clip as “yoga”:

Notice how the model waffles back and forth between “yoga” and “stretching leg” — both are technically correct here as in a downward dog position you are, by definition, doing yoga, but also stretching your legs at the same time.
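If the flickering between labels is distracting in your own application, one option is to display the most common label among the last few predictions. This is my own suggestion and not something the tutorial code does: `smooth_labels` is a hypothetical helper, and the window of 3 is an arbitrary choice for the demo.

```python
from collections import Counter, deque

def smooth_labels(labels, window=5):
	"""Replace each label with the most common label among the
	last `window` predictions (a simple majority vote)."""
	recent = deque(maxlen=window)
	smoothed = []
	for label in labels:
		recent.append(label)
		smoothed.append(Counter(recent).most_common(1)[0][0])
	return smoothed

# made-up sequence of per-inference predictions
raw = ["yoga", "yoga", "stretching leg", "yoga", "yoga",
	"stretching leg", "yoga"]

print(smooth_labels(raw, window=3))
# ['yoga', 'yoga', 'yoga', 'yoga', 'yoga', 'yoga', 'yoga']
```

The trade-off is a small delay before a genuine change of activity shows up on screen.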

In the next example our human activity recognition model correctly predicts this video as “skateboarding”:

You can see why the model also predicted “parkour” as well — the skater is jumping over a railing which is similar to an action that a parkourist may perform.

Anyone hungry?

If so, you might be interested in “making pizza”:

But before you sit down to eat, make sure you’re “washing hands”:

If you choose to indulge in “drinking beer” you better watch how much you’re drinking — the bartender might cut you off:

As you can see, our human activity recognition model, while not perfect, is still performing quite well given the simplicity of our technique (converting ResNet to handle 3D inputs versus 2D ones).

Human activity recognition is far from solved, but with deep learning and Convolutional Neural Networks, we’re making great strides.

Credits

The videos on this page, including the ones in the example_activities.mp4 file found in the “Downloads” section of this guide, come from the following sources:

Beginner Yoga | Floating the Foot Forward Tutorial by Stouffville Yoga Life
Lad downs 23 pints for his 21st birthday by CONTENTbible
Proper Hand Washing Technique by Children’s Hospital Los Angeles
Best Skateboarding Clips of the Year (2019) by Skate Box
Food in Rome – Wood Fired Pizza – Italy by Aden Films

Summary

In this tutorial you learned how to perform human activity recognition using OpenCV and Deep Learning.

To accomplish this task, we leveraged a human activity recognition model pre-trained on the Kinetics dataset, which includes 400-700 human activities (depending on which version of the dataset you’re using) and over 300,000 video clips.

The model we utilized was ResNet, but with a twist — the model architecture had been modified to utilize 3D kernels rather than the standard 2D filters, enabling the model to include a temporal component for activity recognition.

You can read more about the model in Hara et al.’s 2018 paper, Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Finally, we implemented human activity recognition using OpenCV and Hara et al.’s PyTorch implementation which we loaded via OpenCV’s dnn module.

Based on our results, we can see that while not perfect, our human activity recognition model is performing quite well!

About the Author

Hi there, I’m Adrian Rosebrock, PhD. All too often I see developers, students, and researchers wasting their time, studying the wrong things, and generally struggling to get started with Computer Vision, Deep Learning, and OpenCV. I created this website to show you what I believe is the best possible way to get your start.

108 responses to: Human Activity Recognition with OpenCV and Deep Learning

Walid

November 25, 2019 at 10:59 am

Well, I am speechless at such a great post.
You make me hope that every weekend will finish early so that I can learn from your articles on Monday.
Figure 4 is worth a thousand words.
- Adrian Rosebrock
  
  November 25, 2019 at 2:00 pm
  
  Thanks Walid!
Dave Xanatos

November 25, 2019 at 11:01 am

As usual, this is fantastic! Thank you very much & I hope you have a happy Thanksgiving!

Dave
- Adrian Rosebrock
  
  November 25, 2019 at 1:59 pm
  
  Thanks Dave! Have a Happy Thanksgiving as well.
- Frederik
  
  November 25, 2019 at 6:15 pm
  
  How well does it perform on unknown labels, let’s say activities that it hasn’t been trained on?
  - Adrian Rosebrock
    
    November 27, 2019 at 11:22 am
    
    It can’t predict activities it was never trained on nor does the model have an “unknown/ignore” class (which I think is a bit unfortunate).
Walid

November 25, 2019 at 11:12 am

Hi Adrian
I am having the following error

cv2.error: OpenCV(4.0.0) C:\projects\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:215: error: (-215:Assertion failed) attribute_proto.ints_size() == 2 in function ‘

Can you please help?
- Adrian Rosebrock
  
  November 25, 2019 at 1:59 pm
  
  Make sure you are using at least OpenCV version 4.1.2.
  - olof
    
    November 25, 2019 at 3:20 pm
    
    Hi Adrian,
    I’m using your gurus image but also got the same error as Walid.
    - Adrian Rosebrock
      
      November 27, 2019 at 11:21 am
      
      Hi Olof, use OpenCV 4.1.2 and it will work.
  - Matt
    
    November 25, 2019 at 9:49 pm
    
    Hello Adrian. I am getting the same error and using 4.1
    cv2.error: OpenCV(4.1.0) C:\projects\opencv-python\opencv\modules\dnn\src\onnx\onnx_importer.cpp:245: error: (-215:Assertion failed) attribute_proto.ints_size() == 2 in function ‘cv::dnn::dnn4_v20190122::ONNXImporter::getLayerParams’
    - Matt
      
      November 25, 2019 at 10:21 pm
      
      Scratch that you have to have OpenCV 4.1.2 the newest version and it works fine
  - Rohit
    
    November 26, 2019 at 6:14 am
    
    Hi Adrian,
    
    I am using opencv 4.1.0. I am still facing the same error.
    
    Can you please help?
    - Adrian Rosebrock
      
      November 27, 2019 at 11:21 am
      
      You need at least OpenCV 4.1.2 to run this example.
  - Zaigham Abbas Randhawa
    
    November 26, 2019 at 7:39 am
    
    Hey Adrian, I was having the same issue.
    
    And I also have openCV 4.1.0 installed.
    
    Do you know of any other thing that we should be aware of?
    - Adrian Rosebrock
      
      November 27, 2019 at 11:24 am
      
      You need at least OpenCV 4.1.2.
Mkhuseli

November 25, 2019 at 12:08 pm

Hei Adrian, great blog. One question is how can i use my own dataset on this model or how do i prepare my own training dataset.
- Adrian Rosebrock
  
  November 25, 2019 at 1:58 pm
  
  I’ll be doing a separate tutorial on that in the future.
  - Philippe
    
    November 27, 2019 at 10:14 am
    
    Can’t wait for that one, as I also have my own set I would like to use. How can I tempt you to change ‘future’ in ‘near future’? 🙂
    - Adrian Rosebrock
      
      November 27, 2019 at 11:29 am
      
      Haha, thanks Philippe. It will certainly be part of the next edition of Deep Learning for Computer Vision with Python so that should make it “sooner” rather than “later” 😉
Nick

November 26, 2019 at 12:53 am

Thanks for the tutorial Adrian!

I would like to apply “activity recognition” to my own dataset. Will this be taught in the new edition of DL4CV coming out on the 28th?

Kind Regards,

Nick
- Adrian Rosebrock
  
  November 27, 2019 at 11:22 am
  
  Not in the 3rd edition of DL4CV but it will be taught in the 4th edition of DL4CV coming out in 2020. If you purchase a copy now you will receive a free update to the 4th edition when it is released.
Todorov

November 26, 2019 at 1:49 am

opencv-python 4.1.1.26 works
- Adrian Rosebrock
  
  November 27, 2019 at 11:21 am
  
  Thanks for sharing, Todorov!
Kiran Prakash Kamble

November 26, 2019 at 7:00 am

Hello Adrian,

Can you please more elaboration on SAMPLE_SIZE..?
- Adrian Rosebrock
  
  November 27, 2019 at 11:26 am
  
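  This model requires multiple input frames in a single batch to the network when making a prediction. SAMPLE_DURATION controls the number of frames in that batch, while SAMPLE_SIZE controls the spatial dimensions (width and height) each frame is resized to.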
Abkul

November 26, 2019 at 9:00 am

Excellent work.

I would like to train model for doing the same on phone, Kindly cover the option of mobile phone based human activity recognition procedures.

Keep it up.
- Adrian Rosebrock
  
  November 27, 2019 at 11:26 am
  
  I’ll consider it for a future topic but cannot guarantee if/when that may be.
Wagner Rosa

November 26, 2019 at 9:47 am

Hi Adrian, finally I figured it out that the problem is the Open CV version, the resnet-34 only works for version 4.1.1 and above. I updated Open CV and now it works!

Thanks for the amazing job.
Cheers!
- Adrian Rosebrock
  
  November 27, 2019 at 11:26 am
  
  Thanks Wagner!
Zheng Li

November 26, 2019 at 10:30 am

Hi,Adrian:

How to train the model with my own dataset? In general, how many video clips should be provided for every class?
- Adrian Rosebrock
  
  November 27, 2019 at 11:27 am
  
  I’ll be covering that in a future blog post/tutorial.
Mats Önnerby

November 26, 2019 at 5:01 pm

Hi Adrian
Great post as always. I’m have built a human sized InMoov robot. I have noticed that people very often wave their hands to get the robots attention. Is there any neural network that has been trained to recognize “waving hands”. It would be awesome to be able to make the robot turn his attention to that person. I’m using MyrobotLab to control the robot and opencv to do face recognition already.
Thanks in advance
/Mats
- Adrian Rosebrock
  
  November 27, 2019 at 11:27 am
  
  That’s an interesting insight that people wave their hands to get the robot’s attention. I don’t know of a “waving hands” dataset or existing model though.
Clark

November 26, 2019 at 5:36 pm

This is fantastic! My colleagues are working on a similar project to detect passengers falling down on escalators. There are challenges with both processing speed and model reliability, and they do not have a good idea of how to set a target for precision. I think this post is a good way to understand the state-of-the-art method.
One point: if we have multiple bodies in the video, do you have any tutorial on pre-processing before feeding them to the model? That is one question I had when I read your books.
Thanks, Adrian,

BR

Clark
- Adrian Rosebrock
  
  November 27, 2019 at 11:28 am
  
  I would apply object detection to find all people in the input frame and then apply activity recognition to each person.
HJYOO

November 26, 2019 at 10:07 pm

Thank you, Adrian.
I showed and explained this code to my students.
I am sure that it’s helpful for them.
- Adrian Rosebrock
  
  November 27, 2019 at 11:28 am
  
  Thanks so much!
Yaser Sakkaf

November 27, 2019 at 2:46 am

Hi Adrian,
I was hoping to work on this use case.

Verifying that a food service worker has washed their hands after visiting the restroom or handling food that could cause cross-contamination (i.e,. chicken and salmonella).

I figured out that I will have to combine face identification with this blog’s model (to see whether each of several workers has washed their hands or not) to get the resulting output.

Can you hand out some more tips?
- Adrian Rosebrock
  
  November 27, 2019 at 11:29 am
  
  Hey Yaser, you basically have the general idea of the project. For each input frame:
  
  1. Run face recognition
  2. Run activity recognition
  
  You’ll then be able to know who was performing what activity.
  - Yaser Sakkaf
    
    November 28, 2019 at 2:33 am
    
    Thanks for the advice.
Pranav

November 29, 2019 at 5:59 am

Cheers Adrian,
Thanks for the wonderful tutorial.

however, I am facing an error as mentioned below:

the following arguments are required: -m/--model, -c/--classes
An exception has occurred, use %tb to see the full traceback.

I did also check the link you mentioned:
https://pyimagesearch.com/2018/03/12/python-argparse-command-line-arguments/

However, i am not able to move forward in the above tutorial without tackling the error. I just cant get passed through, like compile the other half after

args = vars(ap.parse_args())

please do let me know how do I go and what code has to be used to move on

Regards and best wished
- Adrian Rosebrock
  
  December 5, 2019 at 10:09 am
  
  Hey Pranav — what have you tried thus far? Are you trying to execute the code via command line?
  - Agb
    
    December 24, 2019 at 8:48 am
    
    Hey Adrian…i am also getting same error..This is Very Important for us..We have downloaded the zip file, opencv4.1.2, python3.6 properly, and We are running the code in python idle
    Where we have to run the program actually ? and what is later process.Please kindly tell step wise on how to proceed.
    - Adrian Rosebrock
      
      December 26, 2019 at 9:44 am
      
      Make sure you read my tutorial on how to use command line arguments:
      
      https://pyimagesearch.com/2018/03/12/python-argparse-command-line-arguments/
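The "arguments are required" error in this thread appears when the script is started without its flags, which is what happens when it is run directly from IDLE or a notebook cell. A minimal sketch of the two ways around it follows. The `--input` flag, the help strings, and both file names are assumptions for illustration; only `-m/--model` and `-c/--classes` are confirmed by the error message itself.

```python
import argparse

def build_parser():
    ap = argparse.ArgumentParser()
    # These two flags are the ones named in the error message.
    ap.add_argument("-m", "--model", required=True,
                    help="path to the pre-trained model file")
    ap.add_argument("-c", "--classes", required=True,
                    help="path to the class labels file")
    # Assumed optional flag, shown only to illustrate a default value.
    ap.add_argument("-i", "--input", type=str, default="",
                    help="optional path to a video file")
    return ap

# Option 1 (terminal): supply the flags on the command line, for example
#   python human_activity_reco.py --model <model file> --classes <labels file>
#
# Option 2 (IDE or notebook): hand parse_args() the same values as a list.
# The file names below are placeholders, not verified paths.
args = vars(build_parser().parse_args([
    "--model", "resnet-34_kinetics.onnx",
    "--classes", "action_recognition_kinetics.txt",
]))
print(args["model"], args["classes"])
```

Passing an explicit list to `parse_args()` bypasses `sys.argv`, so the rest of the script can run unchanged inside an environment that has no command line.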
David

November 30, 2019 at 2:14 pm

Amazing Adrian, I have no words to thank you enough all the work you are sharing.

Listen, I’ve been reading about the Guru course and the different Bundles. I’m quite interested in human activity recognition. Which of your products would you recommend?

All the best,

Dave
- Adrian Rosebrock
  
  December 5, 2019 at 10:09 am
  
  Hey Dave — I would recommend the Deep Learning for Computer Vision with Python. The next edition of that book will cover how to train human activity recognition models from scratch. If you purchase now you’ll get the next update for free.
aashu

December 1, 2019 at 4:14 am

hii mate..
great blog…but when i am executing this stuff the video running is very slow and lagging..what is the issue??
- Adrian Rosebrock
  
  December 5, 2019 at 10:07 am
  
  Which method are you using to run the script? And what are the specs of your machine?
  - Max
    
    December 5, 2019 at 11:36 pm
    
    Hi Adrian, great article! I’m experiencing the same problem. Running the script from the Pycharm IDE. I’m using a laptop with i7 quad core, 16GB RAM, 64bit. Closed all other programs too.
    
    What would be the recommended specs?
    
    Cheers!
  - aashu
    
    February 18, 2020 at 5:25 am
    
    I am running using command prompt and my specs are intel i3 with 8gb ram…facing the same issue for a long time now
Zayne

December 2, 2019 at 4:57 am

Hi Adrian,
What should I do if my training dataset is extremely imbalanced, for example, 500,000 samples for one label (named others) and 10,000 to 20,000 samples for each of the remaining categories? I know data augmentation may be the first choice, but how can we improve it at the algorithm level (the loss function, maybe)?
- Adrian Rosebrock
  
  December 5, 2019 at 10:08 am
  
  Take a look at Deep Learning for Computer Vision with Python which will show you how to handle data imbalance.
  - Zayne
    
    December 6, 2019 at 1:26 am
    
    I have read all of it, but I still have no clue.
Engr Don

December 2, 2019 at 9:28 pm

How can we train from scratch?
- Adrian Rosebrock
  
  December 5, 2019 at 10:08 am
  
  I’ll be covering how to train the model from scratch in a separate tutorial.
  - Hassan
    
    December 5, 2019 at 11:04 pm
    
    This will be extremely interesting tutorial on how to train the model with own data. In addition, transfer learning might be a very useful for an advanced tutorial
Ranga priyan V

December 3, 2019 at 1:01 am

Hi Adrian,
Can it be used in real time ?? Is there a way to train a dataset with a single particular activity ??

Thank you.
- Adrian Rosebrock
  
  December 5, 2019 at 10:07 am
  
  1. Yes, it can run in real-time but you will need a GPU.
  
  2. I’ll be doing a separate blog post on training on specific activities.
  - Ranga priyan V
    
    December 9, 2019 at 12:28 am
    
    hey thanks for the reply,
    my machine has nvidia gtx1650 gpu (4gb) can i run the same code for real-time by making a few changes ?
    - Adrian Rosebrock
      
      December 12, 2019 at 9:52 am
      
      No, OpenCV’s “dnn” module does not yet support NVIDIA GPUs.
      - Tham
        
        December 15, 2019 at 1:33 am
        
        Hi, Adrian, thanks for this interesting post.
        If you check the github issue list, you will find out dnn module of opencv begin to support cuda and cudnn, although it is still on early stage, but may worth to take a look(issue 14827)
      - Adrian Rosebrock
        
        December 18, 2019 at 9:31 am
        
        Thanks Tham.
Mohd Aman

December 5, 2019 at 3:24 pm

Hi Adrian,

Thanks for excellent blog on human activity recognition.
Human activity recognition is my master’s project. Will you please give me some idea about this project, and how to train my own model from scratch on another dataset? Can an LSTM be used on top of this network? I have read some papers in which the authors used a 3D CNN + LSTM for spatio-temporal features.
- Adrian Rosebrock
  
  December 12, 2019 at 9:54 am
  
  Please refer to the previous comments. I’ll be doing a separate tutorial on training human activity recognition models.
Walid

December 5, 2019 at 5:57 pm

Hi Adrian

The extension of the model file is .onnx and not .pth.
is the model a pytorch model or open ecosystem for interchangeable AI models?

Thanks a lot
- Adrian Rosebrock
  
  December 12, 2019 at 9:53 am
  
  The model has been converted to ONNX format from a PyTorch model.
Rachita

December 6, 2019 at 1:32 am

Hey Adrian,

Amazing post. I was wondering how you downloaded kinetics dataset. I’ve been having problems doing that.
- Walid
  
  December 30, 2019 at 1:44 pm
  
  Me too. I tried more than one open-source repo but they did not work well.
Tom

December 9, 2019 at 7:13 am

Hi Adrian, another great post.

I tried this on a random video of me cooking some sausages, and it did a pretty good job, however on occasional sections, it decided I was changing a wheel instead. I can understand why it got confused, a big dark circle (the pan/wheel) with a metal object working around the center (cooking tongs/tire iron). The use case I’m looking at doesn’t require real-time predictions, so I was thinking on if there was a good approach to “smoothing” the predictions to give a more accurate overall classification?

I’ve had 2 thoughts on possible ways to address this, one is to construct a domain specific transition matrix for each of the model states, and for example, if I determine that there is a very low probability that subsequent frames will be “cooking sausages” and “changing a wheel” then I create a random between 0 and 1 and only accept the models change in state if the random is below the relevant threshold. I quite like this, but with over 400 action labels, that’s quite a big matrix that would need to be defined for each possible domain. Although, for certain domains, large portions of the matrix will be irrelevant.

My second idea was to do a simple look forward/backward approach, if prediction X is “changing a wheel” but predictions x-1 and x+1 are “cooking sausages” then I might choose to modify X to be the same as those either side of it. Obviously the window size either side of the prediction could be varied.

I’m planning on working through each of the above when I get a chance, and maybe even some hybrid of the 2, but wanted to ask if you had any tricks up your sleeve, or thoughts on the above?

Thanks
- Aaron
  
  December 17, 2019 at 5:16 pm
  
  Hi have you made any progress on this issue? I have asked Adrian a similar question but would love to hear your thoughts. I have 2 other ideas:
  
  1. Use moving average with irregular time intervals
  2. training a secondary classifier to take the inputs of the sliding windows and output
  
  although I am not 100% certain how to structure idea 2, would love to hear your thoughts.
  - Tom
    
    December 20, 2019 at 10:31 am
    
    Hi Aaron, I’ve done some basic testing of these, and managed in the majority of cases for the one short video to eliminate spurious classifications, however, this video only has one activity (the cooking of the sausages) and I’ve not experimented with other examples yet, so not sure whether I’ve just made it harder for the overall prediction to change or whether it would really work in practice.
    
    I’ve just used excel to apply these tests for now, much as I like python, it would take me 5 times as long to work out the syntax!
    
    I’m hoping to do a bit more testing in the new year, but am also eagerly anticipating Adrian’s upcoming blog about training your own custom version, as, depending on how much training data is required, that might actually be an easier fix for my specific use case
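The look forward/backward idea raised earlier in this thread can be sketched as a sliding majority vote over the per-window labels. This is only one possible reading of that idea: the function name `smooth_labels`, the `radius` parameter, and the tie-breaking rule are all assumptions, and the labels are made-up sample data.

```python
from collections import Counter

def smooth_labels(labels, radius=1):
    """Replace each label with the most common label in a centered window."""
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - radius)
        hi = min(len(labels), i + radius + 1)
        window = labels[lo:hi]
        counts = Counter(window)
        best_count = max(counts.values())
        # On a tie, keep the original label if it is among the winners;
        # otherwise take the first winner in window order.
        winners = [lab for lab in window if counts[lab] == best_count]
        smoothed.append(labels[i] if labels[i] in winners else winners[0])
    return smoothed

# Made-up predictions with one spurious label in the middle.
preds = ["cooking sausages", "cooking sausages", "changing a wheel",
         "cooking sausages", "cooking sausages"]
print(smooth_labels(preds))
```

A larger `radius` removes longer runs of spurious labels, at the cost of making genuine activity changes slower to show up.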
praduman

December 12, 2019 at 1:08 am

Hey, great blog!!
I wanted to know how to enable GPU.
- Adrian Rosebrock
  
  December 12, 2019 at 9:51 am
  
  Unfortunately you cannot (yet). OpenCV’s “dnn” module does not yet support many GPUs for deep learning inference.
Mani

December 13, 2019 at 1:00 am

Hi sir, can I get the code of the pretrained resnet-34 model?
can you please mail me?
- Adrian Rosebrock
  
  December 18, 2019 at 9:30 am
  
  You can use the “Downloads” section of this tutorial to download the source code and pre-trained model.
Odianosen Ejale

December 15, 2019 at 12:52 pm

Great post as always boss.

I would really have loved to be able to detect if a person is sleeping. How can this be expanded please?
- Adrian Rosebrock
  
  December 18, 2019 at 9:31 am
  
  Refer to the comments. I’ll be doing a separate tutorial on training your own custom activity recognition model.
Aaron

December 17, 2019 at 3:09 pm

Hi adrian, big fan of your site. One question about activity recognition? how can I aggregate the instantaneous rolling frame predictions? for instance:

f1…f5 -> pizza
f2…f6 -> burger
f3…f7 -> pizza
f4…f8 -> pizza

How can I infer that f1 ..f8 is pizza? and ignore the burger prediction especially if the window size is varying.

p.s. a simple moving average might not do the trick here afaict.
- Adrian Rosebrock
  
  December 18, 2019 at 9:32 am
  
  Why would the window size be varying? Typically that’s a fixed parameter. A moving average sounds like what you need here.
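One way to read the moving-average suggestion is to average the per-window class scores over the last few windows and only then pick a label. The sketch below assumes a fixed window count; the function name `rolling_average_labels` is hypothetical and the scores are invented to mirror the pizza/burger example in the question.

```python
from collections import deque
import numpy as np

def rolling_average_labels(score_rows, class_names, window=3):
    """Average the last `window` score vectors, then take the argmax."""
    recent = deque(maxlen=window)
    labels = []
    for scores in score_rows:
        recent.append(np.asarray(scores, dtype="float64"))
        mean_scores = np.mean(list(recent), axis=0)
        labels.append(class_names[int(np.argmax(mean_scores))])
    return labels

classes = ["pizza", "burger"]
# Invented scores for four overlapping frame windows; the second one,
# taken alone, would be labeled "burger".
per_window = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.7, 0.3]]
print(rolling_average_labels(per_window, classes))
```

Averaging scores instead of labels lets a single weak "burger" prediction be outvoted by stronger neighboring "pizza" predictions.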
mahsa

December 21, 2019 at 7:24 am

Hi Adrian, I appreciate you because of all of your effective tutorials. Hope you the best
- Adrian Rosebrock
  
  December 26, 2019 at 9:44 am
  
  Thanks Mahsa!
ariawan

January 5, 2020 at 9:03 pm

very good blog, I want to ask a few things. first if i run the file human_activity_reco and human_activity_reco_deque then the video looks intermittent. is there a way to make it look smooth like making a video example without running python. thank you
Sarthak Jain

January 9, 2020 at 4:26 pm

Hi Sir!
Firstly, I want to thank you for this great tutorial. I have one question. We have to build a system, for the mid-day meal at educational sites, which can recognize how many students have taken the food from the food counters.

Activity will be like —> A person is carrying a plate and taking food from the counter

So I want to know whether we can do this with a Pre-trained human activity recognition model. If yes, then how to approach?

I would love to hear some tips from you. 🙂
Fahim Shahriar

January 14, 2020 at 9:55 am

Where is the dataset? I need the dataset…. Will you give me the link of dataset?
- Adrian Rosebrock
  
  January 16, 2020 at 10:17 am
  
  It’s a pre-trained model so there is no dataset provided.
Jason

January 15, 2020 at 12:23 am

It was great. But as I skimmed through your post, I wondered: has the model already been trained before, and we are now applying that model in this article?

Could you please show me how did you do the training process?

Many thanks. Hope to see more post from you in the future
- Adrian Rosebrock
  
  January 16, 2020 at 10:17 am
  
  Correct, the model has already been trained (i.e., it’s a pre-trained model). I’ll be covering how to train a custom human activity recognition model in a future tutorial.
  - Mohd Aman
    
    February 3, 2020 at 1:06 am
    
    Still, I am waiting for your next tutorial on how to train a custom human activity recognition model.
    - Adrian Rosebrock
      
      February 5, 2020 at 1:39 pm
      
      Mohd, it’s my pleasure to provide these tutorials for free to you. Do not take advantage of that generosity. I have a lot of things to do and a lot of content to cover. As I said, I will be covering it in the future but I cannot guarantee exactly when that will be.
SUMEET SAURAV

January 24, 2020 at 5:54 pm

Hii Adrian,

Thanks for making the demo available. I just wanted to know whether the script used for converting the PyTorch model to the .onnx version is available. Could you please help me out?

Thanking You
Rachita

January 27, 2020 at 10:53 pm

Hey Adrian !

Great post. However, I’m facing an issue while running it on google colab.
Following is the error: cannot connect to X server

How do you recommend I fix this?

Thanks in advance
Solomon

February 3, 2020 at 3:47 am

Amazing tutorial. I really enjoyed your blogs. Could you please make such tutorials on signal processing, like EEG or ECG signals? I would love to see that. Thanks again
- Adrian Rosebrock
  
  February 5, 2020 at 1:38 pm
  
  Thanks for the suggestion. I cannot guarantee if/when I will cover it but I’ll certainly consider it.
Abishek

February 11, 2020 at 8:00 am

Hey Adrian, Good work…

Can you tell me what are the algorithms you used in training the model…
aashu

February 18, 2020 at 5:34 am

hiii adrian… the processing of this video is still very slow.how can i make it smooth?? the video lags alot
Husain Madraswala

February 21, 2020 at 1:14 am

Hello Adrian I want to do the human activity recognition on still images. Can you help me out with it?
I have been searching about it since a long time i have found models only for videos.
It will be of a great help.
Thanks
chris

February 22, 2020 at 8:58 am

hello,

thanks for your content.
When I launched the code on my computer the output of the video stream is so slow in my computer.
I would like to know where the problem can come from?

Thank you for your help
- Adrian Rosebrock
  
  February 27, 2020 at 8:57 am
  
  Have you tried compiling OpenCV with GPU support? Using your GPU will allow the model to run faster.
Khan

February 28, 2020 at 5:32 am

Well. I am speechless with such a great post. I want to train or want to do transfer learning to my own custom data set using this technique. I have 5 classes in my data set, every classes have small clips ,can you guide me what steps I need do to train my own model .Please guide thanks
- Adrian Rosebrock
  
  March 4, 2020 at 1:18 pm
  
  I don’t have any tutorials on training a human activity recognition model from scratch. I will likely cover it in a future tutorial.
Shrey

March 3, 2020 at 5:18 am

Can you also share the code to convert the .pth file to .onnx?
- Adrian Rosebrock
  
  March 4, 2020 at 1:17 pm
  
  Sorry, I do not have such code.
Pranathi Rayavaram

March 7, 2020 at 11:26 am

Hi
Does this model work for detecting multiple human activities?
If so, how?

Thank You
- Adrian Rosebrock
  
  March 11, 2020 at 4:30 pm
  
  No, this model does not work with multiple human activities.
  
  For multiple human activities I would recommend applying a person detector first, extracting the ROI (likely padded), and then applying the human activity recognition model.
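The "extract the ROI (likely padded)" step can be sketched with plain NumPy slicing. The box format `(x1, y1, x2, y2)`, the 20% padding, and the function name `padded_roi` are assumptions for illustration; the person detector that would produce the box is not shown.

```python
import numpy as np

def padded_roi(frame, box, pad=0.2):
    """Crop box = (x1, y1, x2, y2) from frame, grown by `pad` on each side."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    dx = int((x2 - x1) * pad)  # horizontal padding in pixels
    dy = int((y2 - y1) * pad)  # vertical padding in pixels
    # Clamp the padded box so it never leaves the frame.
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(w, x2 + dx), min(h, y2 + dy)
    return frame[y1:y2, x1:x2]

# Hypothetical 640x480 frame and a single detected person box.
frame = np.zeros((480, 640, 3), dtype="uint8")
roi = padded_roi(frame, (100, 50, 300, 450))
print(roi.shape)
```

Each person's crops would then be collected across consecutive frames and passed to the activity recognition model as a separate clip.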
Akhil G Krishnan

March 20, 2020 at 1:31 pm

This is very helpful to me. I am currently working on a project to identify censorable content in movies (smoking, alcohol drinking, riding without a helmet, driving without a seatbelt) and automatically label the corresponding scenes with statutory warnings. I used YOLOv3 for detection, but I am stuck on detecting whether a person is wearing a seatbelt or not, and on alcohol drinking.
Could you please help me..?
Bushra Shahid

March 24, 2020 at 2:58 pm

Hey Adrian, You Did Wonderful Work…
Can you please tell me how can you produce the sample video for this blog??? You said it’s from You Tube But You Tube have noisy video but your sample video are out of noise??? So How can I make this type of sample video???

& I want to run object detection algorithm for this so i am little bit confused how to detect only Human
from video frames and then re-assemble them into another video by cropped the human detected section only in which Human do some activity , can you please suggest me some code!…
Joel

April 18, 2020 at 4:08 am

Hi, what exactly do you mean by spatial dimension of the frame? and when you set it to 112, what does it mean?..As per knowledge spatial refers to height,width etc. I am not able to comprehend the relation here.
Thank you

Comment section

Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.

At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.

Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.

If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.

Click here to browse my full catalog.


108 responses to: Human Activity Recognition with OpenCV and Deep Learning

