Table of Contents
- Video Understanding and Grounding with Qwen 2.5
- Enhanced Video Comprehension Ability in Qwen 2.5 Models
- Dynamic Frame Rate (FPS) and Absolute Time Encoding
- Multimodal Rotary Position Embedding (MRoPE)
- Robustness Through Training Innovations
- Hands-On Qwen2.5 for Video Understanding Tasks
- Summary
Video Understanding and Grounding with Qwen 2.5
Building on the foundation laid by our previous explorations, this blog post dives deeper into the powerful capabilities of Qwen 2.5 Vision-Language models, shifting focus to video understanding and grounding. In our 1st blog post, we unraveled the model’s prowess in content moderation tasks through zero-shot learning, keeping digital spaces safe and inclusive.
The 2nd blog post expanded the horizon by highlighting the model’s capabilities in object detection, visual grounding, counting, and interpreting relationships within images — pioneering advancements in spatial understanding.
This 3rd lesson in the series pushes boundaries by venturing into dynamic realms — analyzing spatial and temporal elements of videos. Here, we uncover how Qwen 2.5’s multimodal understanding extends beyond static frames to provide deep insights into video content.
Whether you’re seeking to extract textual cues from moving scenes, identify actions, or decode object interactions across frames, this blog post equips you with practical knowledge and steps for leveraging Qwen 2.5 in video analysis tasks. Prepare to explore how cutting-edge AI transforms video understanding, making it accessible, efficient, and impactful.
This lesson is the last of a 3-part series on Qwen 2.5 Unleashed: Transforming Vision Tasks with AI:
- Content Moderation via Zero Shot Learning with Qwen 2.5
- Object Detection and Visual Grounding with Qwen 2.5 Models
- Video Understanding and Grounding with Qwen 2.5 (this tutorial)
To learn how Qwen 2.5 can be used for video understanding and grounding, just keep reading.
Enhanced Video Comprehension Ability in Qwen 2.5 Models
The evolution of Qwen2.5-VL brings a significant leap in video understanding by addressing temporal dynamics with precision and efficiency. Recognizing the complexities of video data, which spans both spatial and temporal dimensions, Qwen2.5 introduces cutting-edge advancements to enhance its comprehension abilities (Figure 1). Here’s how:
Dynamic Frame Rate (FPS) and Absolute Time Encoding
To effectively analyze video content, Qwen2.5-VL incorporates dynamic frame rate training and absolute time encoding. By accommodating varying frame rates during training, the model achieves a nuanced understanding of temporal dynamics, seamlessly adapting to different speeds of video content (Figure 2).
Unlike traditional methods that rely on textual timestamps or add computational layers to enable temporal grounding, Qwen2.5 employs an innovative approach. It aligns Multimodal Rotary Position Embedding (MRoPE) IDs with timestamps, allowing the model to grasp the tempo of time directly through intervals between temporal dimension IDs. This novel strategy eliminates the need for additional overhead while maintaining high temporal accuracy.
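To make this idea concrete, here is a tiny numeric sketch (illustrative only; the two-IDs-per-second granularity is an assumption for this example, not Qwen's internal constant). Because the temporal ID is derived from a frame's absolute timestamp rather than its index in the sampled sequence, frames that are one second apart always sit the same number of IDs apart, no matter how densely the video was sampled:

```python
# Illustrative sketch only: derive temporal position IDs from absolute timestamps.
# The "2 IDs per second" granularity is an assumption for this example, not Qwen's constant.
def temporal_ids(timestamps_sec, ids_per_second=2):
    return [round(t * ids_per_second) for t in timestamps_sec]

# The same 4-second clip sampled at 1 FPS and at 2 FPS:
print(temporal_ids([0.0, 1.0, 2.0, 3.0, 4.0]))
# [0, 2, 4, 6, 8]
print(temporal_ids([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In both cases, frames one second apart are always two IDs apart, which is exactly what lets the model read elapsed time directly from the intervals between temporal IDs.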
Multimodal Rotary Position Embedding (MRoPE)
Building on the original MRoPE introduced in Qwen2-VL, Qwen2.5 extends its capabilities to handle temporal information in videos better. The MRoPE framework decomposes position embeddings into three dimensions: temporal, height, and width (Figure 3).
This segmentation allows for comprehensive modeling of multimodal inputs. For textual data, MRoPE functions akin to traditional 1D RoPE, maintaining identical IDs across all three components. For visual data, the temporal ID increments with each video frame, while the height and width IDs are assigned according to each token’s spatial position within the frame.
One key enhancement in Qwen2.5-VL is the alignment of temporal position IDs with absolute time. By leveraging intervals between temporal IDs, the model achieves consistent temporal alignment across videos, irrespective of frame rate variations. This improvement allows Qwen2.5 to capture not just sequential data but also the actual timing of events, offering a deeper understanding of video content.
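As a simplified illustration of this three-dimensional decomposition (a toy sketch under assumed conventions, not Qwen's actual implementation), text tokens can be thought of as carrying identical IDs on all three axes, while each video patch carries a temporal ID tied to its frame's timestamp plus height and width IDs for its spatial location:

```python
# Toy sketch of MRoPE-style (temporal, height, width) position IDs.
# The conventions here (ID granularity, starting offsets) are assumptions for illustration.
def text_position_ids(num_tokens, start=0):
    # Text behaves like 1D RoPE: all three components share the same ID.
    return [(start + i, start + i, start + i) for i in range(num_tokens)]

def video_position_ids(timestamps_sec, grid_h, grid_w, ids_per_second=2):
    ids = []
    for t in timestamps_sec:
        t_id = round(t * ids_per_second)  # temporal ID from the frame's absolute time
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((t_id, h, w))  # spatial IDs index the patch grid
    return ids

print(text_position_ids(3))
# [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
print(video_position_ids([0.0, 1.0], grid_h=2, grid_w=2))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (2, 0, 0), (2, 0, 1), (2, 1, 0), (2, 1, 1)]
```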
Robustness Through Training Innovations
To further enhance its video comprehension abilities, Qwen2.5-VL employs innovative training techniques. Dynamic Frames Per Second (FPS) sampling during training ensures a well-distributed representation of frame rates within the dataset.
For lengthy videos, surpassing half an hour in duration, the model synthesizes multi-frame captions using a targeted pipeline, enabling a cohesive understanding of extended content. Additionally, Qwen2.5 handles video grounding data across various timestamp formats, including second-based and hour-minute-second-frame (HMSF), ensuring precise comprehension and output versatility (Table 1).
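To see what these two timestamp formats look like side by side, here is a small hypothetical helper (not part of Qwen's pipeline); the 25 FPS used for the trailing frame component is an arbitrary example value:

```python
# Hypothetical helper: convert a second-based timestamp to hour:minute:second:frame (HMSF).
# The 25 FPS assumption only affects the trailing frame component.
def seconds_to_hmsf(seconds, fps=25):
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    frame = int(round((seconds - int(seconds)) * fps))
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frame:02d}"

print(seconds_to_hmsf(70.4))    # "00:01:10:10"
print(seconds_to_hmsf(1905.0))  # "00:31:45:00"
```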
Hands-On Qwen2.5 for Video Understanding Tasks
In this section, we will see how we can prompt Qwen 2.5 to perform video understanding tasks (e.g., extracting text from video frames, video summarization, video grounding, and video captioning). We will start by installing the necessary libraries.
```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
pip install kagglehub
pip install decord
```
Setting Up the Qwen 2.5 VL Model and Inference Function
We start by loading the Qwen 2.5 VL model into memory and implementing the inference function that will be used to call the model across all types of video understanding tasks.
Loading the Model and Processor
For this hands-on, we will use "Qwen/Qwen2.5-VL-3B-Instruct", which is a 3B-parameter model from the Qwen2.5 vision-language model series.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
```
We first import the necessary classes and functions from the transformers library and a utility function from qwen_vl_utils (Lines 1 and 2). We then load the pre-trained Qwen2_5_VLForConditionalGeneration model from the "Qwen/Qwen2.5-VL-3B-Instruct" repository using the from_pretrained method (Lines 5-7).
The torch_dtype="auto" argument ensures the appropriate data type is used for the model, and device_map="auto" ensures that the model is loaded on the available devices (e.g., GPUs) to optimize performance (Line 6). We then create a processor object using the AutoProcessor class from the same repository (Line 9).
This processor will handle the pre-processing of vision-related inputs (e.g., resizing images and video frames so they fall within the model’s supported pixel range). The combination of the model and processor allows us to generate conditional outputs based on the visual inputs efficiently.
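If you want tighter control over how much visual detail the processor keeps, the Qwen2.5-VL usage examples also allow passing optional min_pixels and max_pixels bounds when creating the processor. The sketch below uses an example budget of roughly 256 to 1280 visual patches (28×28 pixels each) per image; the exact values are a tunable trade-off, not a requirement:

```python
# Optional: bound the per-image visual token budget (example range: ~256 to ~1280 patches).
# Each 28x28 pixel patch corresponds to one visual token before temporal merging.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```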
Video Utility Functions
Next, we will implement some utility functions that will help us sample frames (and their timestamps) from a given video and visualize them as an image grid.
```python
from IPython.display import Markdown, display
import numpy as np
from PIL import Image
import decord
from decord import VideoReader, cpu

def get_video_frames(video_file_path, num_frames=128):
    vr = VideoReader(video_file_path, ctx=cpu(0))
    total_frames = len(vr)

    indices = np.linspace(0, total_frames - 1, num=num_frames, dtype=int)
    frames = vr.get_batch(indices).asnumpy()
    timestamps = np.array([vr.get_frame_timestamp(idx) for idx in indices])

    return video_file_path, frames, timestamps
```
The get_video_frames function (Line 16) extracts evenly spaced frames and timestamps from a video file. Using VideoReader (Line 17), we load the video and calculate the total number of frames (Line 18). Then, we use linspace to select frame indices (Line 20), retrieve the frames (Line 21), and extract their timestamps (Line 22). Finally, the function returns the file path, frames, and timestamps (Line 24).
```python
def create_image_grid(images, num_columns=8):
    pil_images = [Image.fromarray(image) for image in images]
    num_rows = (len(images) + num_columns - 1) // num_columns

    img_width, img_height = pil_images[0].size
    grid_width = num_columns * img_width
    grid_height = num_rows * img_height
    grid_image = Image.new('RGB', (grid_width, grid_height))

    for idx, image in enumerate(pil_images):
        row_idx = idx // num_columns
        col_idx = idx % num_columns
        position = (col_idx * img_width, row_idx * img_height)
        grid_image.paste(image, position)

    return grid_image
```
The create_image_grid function arranges images into a grid and returns the final combined image. We first convert each input image into a PIL image (Line 26) and calculate the number of rows needed for the grid based on the number of columns (Line 27). The dimensions of the grid are then determined by multiplying the image width and height by the grid’s column and row counts, respectively (Lines 29-31). A blank canvas is created to hold the final grid (Line 32).
Next, we loop through the images (Line 34) to calculate each image’s row and column position (Lines 35 and 36) and paste them onto the canvas at the appropriate location (Lines 37 and 38). The function finally returns the assembled grid of images (Line 40).
Implementing the Inference Function
Now that our model is loaded, we will implement an inference() function that takes in a prompt and a video path, calls the model, and returns the output. This function can then be used across the different video understanding tasks, as we will see subsequently.
```python
def inference(model, processor, video_path, prompt, max_new_tokens=1024,
              total_pixels=20480 * 28 * 28, min_pixels=16 * 28 * 28):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"video": video_path, "total_pixels": total_pixels, "min_pixels": min_pixels}
            ]
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info([messages], return_video_kwargs=True)
    fps_inputs = video_kwargs['fps']
    print("video input:", video_inputs[0].shape)
    num_frames, _, resized_height, resized_width = video_inputs[0].shape
    print("num of video tokens:", int(num_frames / 2 * resized_height / 28 * resized_width / 28))

    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, fps=fps_inputs,
                       padding=True, return_tensors="pt")
    inputs = inputs.to('cuda')

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids = [output_ids[len(input_ids):]
                     for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True,
                                         clean_up_tokenization_spaces=True)
    return output_text[0]
```
First, we define messages (Lines 42-55) containing the system and user instructions, including the prompt and the video metadata (e.g., total_pixels and min_pixels). These values represent a token range for video frame processing that can be adjusted to balance performance and computational cost.
The text input is prepared using processor.apply_chat_template (Lines 58-60), while process_vision_info extracts the image and video inputs along with their frame rate information (Lines 61 and 62). Metadata (e.g., video dimensions and token counts) is printed for debugging purposes (Lines 63-65).
Next, we process the inputs by combining the text, images, and video data into tensors (Lines 67 and 68) and transferring them to the GPU (cuda) for efficient inference. The model generates outputs (Line 70), from which the new tokens beyond the initial input length are extracted (Line 71).
These generated tokens are decoded into human-readable text using processor.batch_decode (Line 72), and the final result is returned (Line 73) as the function’s output. This setup enables advanced multimodal processing for vision-language tasks.
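For intuition on the budgets involved (the real budgeting happens inside qwen-vl-utils): total_pixels=20480 * 28 * 28 caps the video at roughly 20,480 spatial patches spread over all sampled frames, min_pixels=16 * 28 * 28 keeps at least 16 patches per frame, and the printed token count follows a num_frames / 2 × height / 28 × width / 28 rule because frames are merged in pairs along the temporal axis. A rough back-of-the-envelope sketch (the 64-frame, 308×560 example is made up):

```python
# Rough arithmetic for intuition only; qwen-vl-utils performs the actual resizing and budgeting.
# Frames are merged in pairs along time, and each frame is split into 28x28 spatial patches,
# which matches the "num of video tokens" print statement inside inference().
def approx_video_tokens(num_frames, resized_height, resized_width):
    return int(num_frames / 2 * (resized_height / 28) * (resized_width / 28))

# Hypothetical example: 64 sampled frames resized to 308x560 pixels.
print(approx_video_tokens(64, 308, 560))  # 32 * 11 * 20 = 7040 video tokens
```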
Prompting the Qwen 2.5 VL Model
Finally, it’s time to call our model on a sample video and prompt using the inference() function and see how it performs on different video understanding tasks. Before that, let’s first download a sample video from YouTube on a breakfast recipe: One Potato & One Egg! Quick Recipe Perfect For Breakfast.
To download the video locally, simply paste the video URL into any YouTube video downloader website (e.g., Yt1z: Free YouTube Video Downloader).
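Alternatively, if you prefer to stay inside the notebook, the third-party yt-dlp package can fetch the video programmatically. This is an optional substitute for the downloader website (you would need to pip install yt-dlp first, and the URL below is a placeholder for the actual video link):

```python
# Optional alternative: download the video with yt-dlp (pip install yt-dlp).
from yt_dlp import YoutubeDL

VIDEO_URL = "https://www.youtube.com/watch?v=..."  # placeholder: paste the recipe video's URL here

ydl_opts = {
    "format": "mp4",          # grab an MP4 stream
    "outtmpl": "recipe.mp4",  # save under the filename used throughout this post
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download([VIDEO_URL])
```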
Once the video is downloaded, rename it to recipe.mp4 and upload it to your Colab Notebook. The following code snippet can then be used to visualize the frames (Figure 4) of the downloaded video.
```python
video_file_path, frames, timestamps = get_video_frames("./recipe.mp4", num_frames=64)
print(frames.shape, timestamps.shape)

image_grid = create_image_grid(frames, num_columns=8)
display(image_grid)
```
Extracting Text from Video Frames
The first video use case we will explore with the Qwen 2.5 models is “Extracting text from video frames”. The code snippet below shows how we can prompt the model to recognize the text in the video frames at different points in time.
For our breakfast recipe video, we want to extract the recipe steps in the order they appear in the video.
```python
prompt = "Watch the video and list the OCR text in the video frames."

response = inference(model, processor, video_file_path, prompt)
display(Markdown(response))
```
The following is the output of the above snippet:
Output:
As we can see, the model is able to extract the text in the video with very high precision, making it well suited for smart OCR applications such as license plate recognition in traffic surveillance videos.
Video Understanding and Comprehension
Secondly, we will prompt the model to provide us with a comprehensive summary of the video, listing all the ingredients and recipe instructions in the correct order.
```python
prompt = "Watch the video till end and provide the detailed recipe instructions with exact measurement of ingredients."

response = inference(model, processor, video_file_path, prompt)
display(Markdown(response))
```
The following is the output of the above snippet:
Output:
This demonstrates how the model can effectively process and analyze extended video content, extracting meaningful insights.
Video Grounding
In our third use case, we will see how we can prompt the model to answer specific questions about the video using natural language queries. This is commonly referred to as video grounding (similar to object grounding).
The following code snippet prompts the model to call out the time when a certain action (e.g., “adding 2 tbsp of all purpose flour”, “Brush inside the frying pan with oil”) occurs in the video.
```python
prompt = "Give the query: 'Adding 2 tbsp of all purpose flour', when does the described content occur exactly (in seconds) in the video?"

response = inference(model, processor, video_file_path, prompt)
display(Markdown(response))
```
The following is the output of the above snippet:
Output:
The query "Adding 2 tbsp of all purpose flour" occurs at approximately 70.4 seconds into the video.
This showcases the model’s ability to reason about exact timestamps and ground detailed natural language queries in the video timeline.
Note that the accuracy of temporal grounding improves with larger versions of the Qwen 2.5 VL models (e.g., 7B and 72B). Smaller models such as "Qwen/Qwen2.5-VL-3B-Instruct" (used in this scenario) can sometimes fail to answer such questions accurately.
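Since the grounding answer comes back as free-form text, a small post-processing step can pull out the numeric timestamp for downstream use (e.g., jumping to the nearest sampled frame). The sketch below assumes the answer phrases the time as a number followed by the word "seconds", as in the sample output above:

```python
import re

# Sketch: extract a timestamp like "70.4 seconds" from the model's free-text answer.
# Assumes the answer phrases the time as "<value> seconds", as in the sample output above.
def extract_seconds(answer):
    match = re.search(r"(\d+(?:\.\d+)?)\s*seconds", answer)
    return float(match.group(1)) if match else None

grounded_time = extract_seconds(response)
if grounded_time is not None:
    # Show the sampled frame whose start timestamp is closest to the grounded time.
    closest_idx = int(np.argmin(np.abs(timestamps[:, 0] - grounded_time)))
    display(Image.fromarray(frames[closest_idx]))
```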
Structured Video Captioning
The last use case that we will explore in this blog post is video captioning. Video captioning refers to localizing a series of events in the video along with a caption or description of that event. This way, we can segment the whole video into multiple parts — where each part is about a specific event.
```python
prompt = "Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. Provide the result in json format with 'seconds' format for time depiction."

response = inference(model, processor, video_file_path, prompt)
display(Markdown(response))
```
The following is the output of the above snippet:
Output:
```json
[
    {"start_time": "0.0", "end_time": "14.2", "description": "Peel 1 potato."},
    {"start_time": "15.3", "end_time": "46.8", "description": "Slice thin and shred the potatoes."},
    {"start_time": "47.9", "end_time": "123.0", "description": "Put the shredded potatoes in a bowl, add an egg, flour, salt, and water, then mix well."},
    {"start_time": "124.1", "end_time": "189.0", "description": "Cook the mixture in a pan with oil, cover it, and cook for 3 minutes under medium heat. Open and turn it over, then cover and cook for 2 minutes more under medium heat."},
    {"start_time": "190.1", "end_time": "235.0", "description": "Serve the dish and cut into pieces. Enjoy the delicious potato dish."}
]
```
In the output, we can see that the model identifies significant events within the video, providing start and end timestamps for each event along with descriptive sentences.
With the help of a small Python script, we can also parse the above JSON and visualize these video segments.
```python
import json
import markdown
from bs4 import BeautifulSoup
from datetime import datetime

def parse_json(response):
    html = markdown.markdown(response, extensions=['fenced_code'])
    soup = BeautifulSoup(html, 'html.parser')
    json_text = soup.find('code').text
    data = json.loads(json_text)
    return data

data = parse_json(response)

for item in data:
    start_time = item["start_time"]
    end_time = item["end_time"]
    description = item["description"]

    display(Markdown(f"**{start_time} - {end_time}:**\t\t" + description))

    start_time = float(start_time)
    end_time = float(end_time)
    current_frames = []

    for frame, timestamp in zip(frames, timestamps):
        if timestamp[0] > start_time and timestamp[1] < end_time:
            current_frames.append(frame)
    current_frames = np.array(current_frames)
    current_image_grid = create_image_grid(current_frames, num_columns=8)

    display(current_image_grid.resize((480, (int(len(current_frames) / 8) + 1) * 60)))
```
This code extracts the JSON data, displays each segment’s description with its timestamps, and visualizes the video frames for each segment. The parse_json function converts the markdown-formatted JSON response into structured data (Lines 96-102).
For each segment, it retrieves the timestamps and description (Lines 107-109), displays them with Markdown (Line 111), filters the frames within the time range (Lines 113-120), and creates a grid of frames using create_image_grid (Line 121) for easy visualization (Line 123).
Figure 5 shows the output of the above snippet:
Summary
This blog post highlights the advanced capabilities of the Qwen 2.5 VL models for video understanding and grounding tasks. The model employs innovative techniques like dynamic frame rate adjustments, absolute time encoding, and Multimodal Rotary Position Embedding (MRoPE) to enhance video comprehension and align multimodal data effectively. These advancements, combined with robust training methodologies, ensure exceptional performance in processing complex video scenarios.
We provide a step-by-step guide to leveraging the Qwen 2.5 VL models for video understanding. This includes setting up the model, implementing utility functions, and prompting the inference function to extract text from frames and perform tasks like video comprehension and grounding. These practical instructions empower users to harness Qwen 2.5’s capabilities confidently.
Finally, we explore use cases such as video grounding and structured video captioning. The model excels in linking visual elements with textual inputs and generating coherent, context-rich narratives. These applications demonstrate the model’s versatility, transforming video analysis into an intelligent and efficient process for diverse scenarios.
Citation Information
Mangla, P. “Video Understanding and Grounding with Qwen 2.5,” PyImageSearch, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2025, https://pyimg.co/uhtbd
```bibtex
@incollection{Mangla_2025_video-understanding-and-grounding-with-qwen-2-5,
    author = {Puneet Mangla},
    title = {{Video Understanding and Grounding with Qwen 2.5}},
    booktitle = {PyImageSearch},
    editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
    year = {2025},
    url = {https://pyimg.co/uhtbd},
}
```