Table of Contents
- Object Detection and Visual Grounding with Qwen 2.5
- Introduction and Types of Spatial Understanding
- How Spatial Understanding Works in Qwen 2.5 VL Models
- Prompt Structure
- Task-Specific Instruction
- Object or Feature Specification
- Contextual Clues or Relationships
- Output Requirements
- Model Response Format
- Hands-on with Qwen 2.5 VL for Spatial Understanding
- Setting Up Qwen 2.5 VL Model and Inference Function
- Loading the Model and Processor
- Implementing the Inference Function
- Parsing the Response and Plotting Bounding Boxes
- Testing the Setup
- Summary
Object Detection and Visual Grounding with Qwen 2.5
The rapid evolution of Vision-Language Models like Qwen 2.5 is reshaping the boundaries of artificial intelligence (AI), offering groundbreaking capabilities across diverse applications. In the previous blog post of this series, we explored the transformative role of Qwen 2.5 in content moderation through zero-shot learning. This blog post shifts focus to another captivating domain where the Qwen 2.5 model series excels: spatial understanding.
We’ll uncover how Qwen 2.5’s advanced multimodal capabilities tackle spatial understanding tasks with unprecedented accuracy, from identifying objects without prior examples (a.k.a. zero-shot object detection) to precisely mapping visual elements to textual references (a.k.a. visual grounding) and decoding intricate interrelationships in images.
Whether it’s enhancing autonomous systems, advancing e-commerce, or providing deeper contextual insights, these innovations push the boundaries of visual grounding and object detection. So, without any delay, let’s dive into the transformative impact of Qwen 2.5 in these fascinating domains.
This lesson is the 2nd in a 3-part series on Qwen 2.5 Unleashed: Transforming Vision Tasks with AI:
- Content Moderation via Zero Shot Learning with Qwen 2.5
- Object Detection and Visual Grounding with Qwen 2.5 (this tutorial)
- Video Understanding and Grounding with Qwen 2.5
To learn how Qwen 2.5 can be used for object detection and visual grounding tasks, just keep reading.
Introduction and Types of Spatial Understanding
Spatial understanding is a critical aspect of artificial intelligence, enabling models to perceive and interpret visual data in a way that mirrors human cognition. Qwen 2.5 excels in this domain, demonstrating remarkable capabilities in addressing complex spatial reasoning tasks. In this section, we will dive deeper into three key areas of spatial understanding: Object Detection, Visual Grounding, and Understanding Relationships.
Object Detection
Object detection involves identifying and classifying objects within an image, a fundamental task for various AI applications. Qwen 2.5’s zero-shot object detection capability sets it apart, allowing it to recognize objects it hasn’t been explicitly trained on.
Leveraging a robust training dataset and its vision-language understanding, Qwen 2.5 can associate textual labels with visual elements seamlessly (Figure 1). This approach is particularly valuable in scenarios requiring flexibility, such as identifying rare objects or adapting to new environments without additional training.
Visual Grounding and Counting
Visual grounding goes beyond mere object detection, focusing on linking textual descriptions to specific visual elements within an image. Qwen 2.5 excels at precise object grounding, where it maps natural language queries to corresponding parts of an image (Figure 2).
For example, when given a phrase like “the red car in the background,” the model can pinpoint and highlight the specific object accurately. This capability is pivotal in fields (e.g., human-computer interaction, augmented reality, and e-commerce), where understanding user queries in a visual context is essential.
Additionally, Qwen 2.5 integrates advanced counting capabilities into its visual grounding framework. This allows it not only to identify objects but also to count them with precision (Figure 3). For example, in a query like “How many apples are in the basket?”, the model can both locate the apples and provide an accurate count. This combination of grounding and counting is invaluable in scenarios such as inventory management, data annotation, and real-time monitoring.
Understanding Relationships
Understanding relationships in images is a more nuanced aspect of spatial reasoning. It requires the ability to identify interactions and connections between multiple objects (e.g., “a person holding a book” or “a dog lying under a tree”).
Qwen 2.5’s ability to decode these relationships stems from its multimodal capabilities, which combine visual recognition with contextual language understanding (Figure 4). This skill is invaluable for applications (e.g., scene understanding, robotics, and storytelling), where a deeper comprehension of visual contexts is necessary.
How Spatial Understanding Works in Qwen 2.5 VL Models
Prompt Structure
Spatial understanding with Qwen 2.5 VL (vision language) models starts with a natural language query provided as a prompt along with the image of interest. A typical prompt can be broken down into a few standard components:
Task-Specific Instruction
The prompt usually starts with a task-specific instruction that defines whether the task is detection, counting, identification, or a combination of these. The clarity and specificity of this instruction are critical for guiding the model’s response.
Examples:
"Locate the person who acted bravely in the image..."
"Detect all red cars in the image..."
"Identify basketball players and detect key points..."
"Count and detect all birds in this image..."
Object or Feature Specification
After the task instruction, the prompt narrows down the focus by specifying the object or feature of interest. This could include objects (e.g., birds, cars, or people) or attributes (e.g., color, location, or size). By providing these details, the prompt ensures the model can identify the relevant elements within the image.
Examples:
"Detect all red cars in the image..."
"Locate all motorcyclists who are not wearing helmets..."
"Identify basketball players and detect key points such as their hands and heads..."
This segment adds context and detail, enabling the model to refine its detection or grounding process.
Contextual Clues or Relationships
Additionally, some prompts include contextual details or relationships between objects, which allow the model to go beyond simple detection. These details help in understanding interactions or spatial relationships within the scene.
Examples:
"Identify the person holding a basketball..."
"Locate the person standing near the car..."
"Detect birds sitting on a tree and count them..."
Output Requirements
Many prompts explicitly state the required output format or additional attributes to include in the response. This could involve bounding box coordinates, key points, or structured data like JSON (JavaScript Object Notation).
Examples:
"Return their locations in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
"Count and list the total number of objects detected in JSON format."
"Output key points for each basketball player’s head and hands in JSON format."
Model Response Format
Qwen 2.5’s responses to spatial understanding tasks follow a structured, systematic format (usually JSON), enabling accurate interpretation and integration into downstream applications. The following is a breakdown of the standard components present in the responses.
Bounding Box Coordinates (bbox_2d or point_2d)
These values in the JSON response represent the precise spatial location of detected objects within the image. Bounding boxes enable object localization, while point coordinates are used for fine-grained key point detection.
Formats:
- bbox_2d: A bounding box with four coordinates [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
- point_2d: A specific coordinate for key points (e.g., hands or heads) expressed as [x, y].
Examples:
- Bounding box: {"bbox_2d": [341, 258, 397, 360]}
- Key point: {"point_2d": ["394", "105"]}
Primary Label (label), Sub-Labels, and Descriptions
Based on the prompt, each bounding box or key point detection can be accompanied by the primary label, sub-labels, or description.
The primary label provides the identity or category of the object for the primary task (e.g., detection, counting). At the same time, sub-labels or descriptions provide additional metadata about the detected object (e.g., attributes, states, or relationships).
Examples:
{"label": "motorcyclist", "sub_label": "wearing helmet"}
{"label": "birds", "color": "yellow"}
{"label": "cake with white frosting and colorful sprinkles"}
Hands-on with Qwen 2.5 VL for Spatial Understanding
In this section, we will see how we can use Qwen 2.5 for performing spatial understanding tasks (e.g., object detection, precise visual grounding, and understanding relationships). We will start by installing the necessary libraries.
```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
pip install kagglehub
```
Setting Up Qwen 2.5 VL Model and Inference Function
Next, we load the Qwen 2.5 VL model into memory and implement the zero-shot inference function that will be used to call the model across all kinds of spatial understanding tasks.
Loading the Model and Processor
For this hands-on, we will use "Qwen/Qwen2.5-VL-3B-Instruct", which is a 3B-parameter model from the Qwen2.5 vision language model series.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# You can set min_pixels and max_pixels according to your needs, such as a
# token range of 256-1280, to balance performance and cost.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```
Here, we first import necessary classes and functions from the transformers
library and a utility function from qwen_vl_utils
(Lines 1 and 2). We then load the pre-trained Qwen2_5_VLForConditionalGeneration
model from the "Qwen/Qwen2.5-VL-3B-Instruct"
repository using the from_pretrained
method (Line 5). The torch_dtype="auto"
argument ensures the appropriate data type is used for the model, and device_map="auto"
ensures that the model is loaded on available devices (e.g., GPUs) to optimize performance (Line 6).
In the second part, we set the min_pixels
and max_pixels
values to define a token range for image processing (Lines 10 and 11). These values can be adjusted to balance performance and computational cost. We then create a processor
object using the AutoProcessor
class from the same repository (Line 12).
This processor will handle the pre-processing of vision-related information (e.g., resizing images to fit within the specified pixel range). The combination of the model and processor allows us to generate conditional outputs based on the input images efficiently.
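As a quick sanity check on where those numbers come from: the 256-1280 token budget maps to a pixel budget under the assumption (implied by the formula above) that each visual token covers a 28×28-pixel area:

```python
# The 256-1280 visual-token budget expressed in pixels, assuming one visual
# token per 28x28-pixel area (as the min_pixels/max_pixels formula implies).
for tokens in (256, 1280):
    print(f"{tokens} tokens -> {tokens * 28 * 28:,} pixels")
```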
Implementing the Inference Function
Now that our model is loaded, we will implement a zero_shot_inference()
function that takes in a prompt and image, calls the model, and returns the output. This function can then be used across different spatial understanding tasks, as we will see subsequently.
```python
def zero_shot_inference(model, processor, image, prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    # The processor resizes the image; recover the resized input resolution
    # from the patch grid (patch size is 14 pixels).
    input_height = inputs['image_grid_thw'][0][1] * 14
    input_width = inputs['image_grid_thw'][0][2] * 14

    return output_text, input_height, input_width
```
On Lines 14-25, we format the input structure (messages
) to include the image
and prompt
, ensuring seamless integration of visual and natural language data.
On Lines 28-38, we preprocess the data, combining text and vision inputs using the processor. The formatted inputs
are then transferred to GPU (cuda
) for efficient computation (Line 39). Finally, on Lines 42-48, the model generates output tokens, trims them to exclude input text, and decodes them into meaningful responses (output_text
). The function also calculates image dimensions for spatial tasks, returning decoded text and image details as outputs. This pipeline efficiently bridges vision and language for zero-shot reasoning.
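As a quick usage sketch (the file name is a placeholder), the function can be called with any PIL image and prompt:

```python
from PIL import Image

# Hypothetical local image; the messages structure above accepts a PIL.Image directly.
image = Image.open("street_scene.jpg")
prompt = (
    "Detect all red cars in the image and return their locations "
    "in the format {'bbox_2d': [x1, y1, x2, y2]}."
)

output_text, input_height, input_width = zero_shot_inference(model, processor, image, prompt)
print(output_text)
```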
Parsing the Response and Plotting Bounding Boxes
Next, we will also implement a plot_bounding_boxes
utility function that can be used to plot bounding boxes by parsing the JSON format.
```python
import json
import matplotlib.pyplot as plt
from PIL import Image
import matplotlib.patches as patches


def plot_bounding_boxes(image, json_data, height, width):
    image = image.resize((width, height))

    # Parse the JSON input
    json_data = json_data.split('```json')[1].split('```')[0]
    bbox_data = json.loads(json_data)

    # Plot the image
    fig, ax = plt.subplots(1)
    ax.imshow(image)
    ax.axis('off')

    # Plot the bounding boxes and labels
    for item in bbox_data:
        bbox = item['bbox_2d']
        label = item['label']
        rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1],
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        plt.text(bbox[0], bbox[1] - 10, label, color='r', fontsize=10)

    plt.show()
```
On Lines 54-57, we import libraries for image handling (PIL.Image
), JSON parsing, and visualization (matplotlib
). On Lines 60-63, the image
is resized to match the dimensions specified, and JSON data is parsed to extract bounding box details.
On Lines 66-68, the image is displayed, and axes are removed for a clean visual. Lines 71-76 plot bounding boxes using patches.Rectangle
, and labels are added above each box for identification. Finally, Line 78 shows the fully annotated image. This function efficiently overlays bounding boxes and labels for spatial understanding tasks.
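Note that plot_bounding_boxes resizes the image to the model's input resolution rather than rescaling the boxes. If you would rather keep the original image untouched, a minimal alternative sketch (assuming the JSON has already been parsed into a list of detections, as in plot_bounding_boxes) is to scale each box back to the original size:

```python
def scale_boxes_to_original(bbox_data, original_size, input_height, input_width):
    """Rescale bbox_2d coordinates from the model's input resolution back to
    the original image size. original_size is a PIL (width, height) tuple."""
    orig_w, orig_h = original_size
    sx = orig_w / float(input_width)
    sy = orig_h / float(input_height)
    scaled = []
    for item in bbox_data:
        x1, y1, x2, y2 = item["bbox_2d"]
        scaled.append({**item, "bbox_2d": [x1 * sx, y1 * sy, x2 * sx, y2 * sy]})
    return scaled
```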
Testing the Setup
It is time to finally run our model on some images and see how it performs for different spatial understanding tasks.
Object Detection
The following code snippet demonstrates zero-shot inference with Qwen 2.5 VL models to detect objects in an image and visualize them.
```python
from PIL import Image
import requests

prompt = "Detect all objects in the image and return their locations and labels (car, truck, bus, cycle, bike, etc.) in the form of coordinates."

image = Image.open(requests.get("https://media.istockphoto.com/photos/tailgating-on-a-threelane-autobahn-picture-id115929876?b=1&k=20&m=115929876&s=170667a&w=0&h=CVSiG4VWJyqjnbtzTRh0Lta12v71wXK_hZTMDWPxVfw=", stream=True).raw)

response, height, width = zero_shot_inference(model, processor, image, prompt)
print("Image size: ", image.size)
print(response)

plot_bounding_boxes(image, response, height, width)
```
It fetches an image
via a URL (Line 84) and passes it with a task-specific prompt
(Line 82) to the zero_shot_inference
function (Lines 86 and 87) to get detection results. Finally, Line 90 uses plot_bounding_boxes
to overlay bounding boxes and labels on the image, completing the detection and visualization process efficiently.
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [109, 217, 183, 280], "label": "car"},
  {"bbox_2d": [271, 256, 351, 329], "label": "car"},
  {"bbox_2d": [272, 200, 337, 257], "label": "car"},
  {"bbox_2d": [348, 38, 377, 63], "label": "car"},
  {"bbox_2d": [368, 81, 410, 114], "label": "car"},
  {"bbox_2d": [285, 0, 308, 16], "label": "car"},
  {"bbox_2d": [386, 137, 453, 223], "label": "truck"}
]
```
This output represents the detection results of a model, showing seven vehicles identified in the image. Each object is represented with bbox_2d
, providing bounding box coordinates ([x1, y1, x2, y2]
) and a label
. Among the detections, six are labeled as "car"
and one as "truck"
. Figure 5 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model captures all the vehicles (along with their labels) in the image with precise bounding boxes, making it ideal for object detection applications (e.g., surveillance and traffic monitoring).
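Because the response is plain JSON, you can also tally detections per class in a few lines; the sketch below reuses the same fence-stripping logic as plot_bounding_boxes:

```python
import json
from collections import Counter

# Count how many objects of each label were detected (response is the string
# returned by zero_shot_inference above).
payload = response.split("```json")[1].split("```")[0]
counts = Counter(det["label"] for det in json.loads(payload))
print(counts)  # for the output above: Counter({'car': 6, 'truck': 1})
```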
Precise Visual Grounding
Next, we perform precise visual grounding, which involves detecting and locating specific objects in the image using prompts. The code below locates the cupcake with choco-chips on it in an image using the specified prompt.
prompt = "Detect the cupcake with choco-chips on it in the image and return its locations in the form of coordinates. " image = Image.open(requests.get("https://th.bing.com/th/id/OIP.AfqWYODEWRwb4yP3U7p8ZwHaFX?rs=1&pid=ImgDetMain", stream=True).raw) response, height, width = zero_shot_inference(model, processor, image, prompt) print("Image size: ", image.size) print(response) plot_bounding_boxes(image, response, height, width)
It fetches the image via a URL, processes it with the zero_shot_inference
function to detect the specified objects and their coordinates, and visualizes the bounding boxes on the image using plot_bounding_boxes
(Line 99).
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [0, 194, 103, 315], "label": "cupcake with choco-chips on it"}
]
```
Figure 6 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model can precisely locate the cupcake with choco-chips on it, indicating its expertise in precise visual grounding as well.
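For downstream use (e.g., extracting the grounded product region in an e-commerce pipeline), the returned box can be used to crop the image. The sketch below resizes the image to the model's input resolution first, exactly as plot_bounding_boxes does, so the coordinates line up; the output path is a placeholder:

```python
import json

# Crop the grounded cupcake region from the image (sketch).
payload = response.split("```json")[1].split("```")[0]
box = json.loads(payload)[0]["bbox_2d"]

# Coordinates are in the model's input resolution, so resize the image first.
resized = image.resize((int(width), int(height)))
crop = resized.crop(tuple(box))
crop.save("cupcake_crop.jpg")  # hypothetical output path
```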
Understanding Relationships in Image
Lastly, we will test the model’s ability to decode relationships between multiple objects in the image. As an example, this code identifies a “helpful kid” in an image using the provided prompt.
prompt = "Detect the helpful kid in the image and return its location in the form of coordinates. " image = Image.open(requests.get("https://clipground.com/images/helping-kids-clipart-3.jpg", stream=True).raw) response, height, width = zero_shot_inference(model, processor, image, prompt) print("Image size: ", image.size) print(response) plot_bounding_boxes(image, response, height, width)
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [43, 15, 560, 753], "label": "helpful kid"}
]
```
Figure 7 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model can understand relationships across different instances in an image, making it suitable for scene understanding, robotics, and storytelling, where a deeper comprehension of visual contexts is necessary.
Summary
In this blog post, we explore how Qwen 2.5 VL models redefine spatial understanding by addressing complex tasks (e.g., object detection, visual grounding, counting, and relationship analysis) in images. We begin by categorizing spatial understanding into three core areas:
- detecting objects with precision
- mapping visual elements to textual descriptions
- interpreting relationships between objects
By examining the structure of prompts, including task-specific instructions, feature specifications, and contextual clues, we demonstrate how users can craft effective inputs for optimal model performance.
Next, we delve into the detailed workings of Qwen 2.5’s response format and hands-on usage for spatial reasoning. We highlight the model’s structured output, including bounding box coordinates, combined labels, and detailed descriptions, which make it adaptable for diverse applications. Through practical steps, we guide users in setting up the model, implementing an inference function, and visualizing detection results. This section emphasizes the flexibility and efficiency of Qwen 2.5 in processing multimodal queries and producing actionable insights.
Finally, we showcase the model’s real-world applications, from detecting vehicles on a busy highway to precisely locating a cupcake and identifying a helpful kid, illustrating its robustness and accuracy. By analyzing visual outputs and leveraging bounding box visualizations, we bring to life the capabilities of Qwen 2.5 in addressing spatial challenges. This blog serves as a comprehensive guide, helping users unlock the potential of Qwen 2.5 VL models for innovative and practical spatial understanding tasks.
Citation Information
Mangla, P. “Object Detection and Visual Grounding with Qwen 2.5,” PyImageSearch, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2025, https://pyimg.co/xd4hj
```bibtex
@incollection{Mangla_2025_object-detection-and-visual-grounding-with-qwen-2-5,
  author = {Puneet Mangla},
  title = {{Object Detection and Visual Grounding with Qwen 2.5}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2025},
  url = {https://pyimg.co/xd4hj},
}
```