Table of Contents
- Object Detection and Visual Grounding with Qwen 2.5
- Introduction and Types of Spatial Understanding
- How Spatial Understanding Works in Qwen 2.5 VL Models
- Prompt Structure
- Task-Specific Instruction
- Object or Feature Specification
- Contextual Clues or Relationships
- Output Requirements
- Model Response Format
- Hands-on with Qwen 2.5 VL for Spatial Understanding
- Setting Up Qwen 2.5 VL Model and Inference Function
- Loading the Model and Processor
- Implementing the Inference Function
- Parsing the Response and Plotting Bounding Boxes
- Testing the Setup
- Summary
Object Detection and Visual Grounding with Qwen 2.5
The rapid evolution of Vision-Language Models like Qwen 2.5 is reshaping the boundaries of artificial intelligence (AI), offering groundbreaking capabilities across diverse applications. In the previous blog post of this series, we explored the transformative role of Qwen 2.5 in content moderation through zero-shot learning. This blog post shifts focus to another captivating domain where the Qwen 2.5 model series excels: spatial understanding.
We’ll uncover how Qwen 2.5’s advanced multimodal capabilities tackle spatial understanding tasks with unprecedented accuracy, from identifying objects without prior examples (a.k.a. zero-shot object detection) to precisely mapping visual elements to textual references (a.k.a. visual grounding) and decoding intricate interrelationships in images.
Whether it’s enhancing autonomous systems, advancing e-commerce, or providing deeper contextual insights, these innovations push the boundaries of visual grounding and object detection. So, without any delay, let’s dive into the transformative impact of Qwen 2.5 in these fascinating domains.
This lesson is the 2nd in a 3-part series on Qwen 2.5 Unleashed: Transforming Vision Tasks with AI:
- Content Moderation via Zero Shot Learning with Qwen 2.5
- Object Detection and Visual Grounding with Qwen 2.5 (this tutorial)
- Video Understanding and Grounding with Qwen 2.5
To learn how Qwen 2.5 can be used for object detection and visual grounding tasks, just keep reading.
Introduction and Types of Spatial Understanding
Spatial understanding is a critical aspect of artificial intelligence, enabling models to perceive and interpret visual data in a way that mirrors human cognition. Qwen 2.5 excels in this domain, demonstrating remarkable capabilities in addressing complex spatial reasoning tasks. In this section, we will dive deeper into three key areas of spatial understanding: Object Detection, Visual Grounding, and Understanding Relationships.
Object Detection
Object detection involves identifying and classifying objects within an image, a fundamental task for various AI applications. Qwen 2.5’s zero-shot object detection capability sets it apart, allowing it to recognize objects it hasn’t been explicitly trained on.
Leveraging a robust training dataset and its vision-language understanding, Qwen 2.5 can associate textual labels with visual elements seamlessly (Figure 1). This approach is particularly valuable in scenarios requiring flexibility, such as identifying rare objects or adapting to new environments without additional training.
Visual Grounding and Counting
Visual grounding goes beyond mere object detection, focusing on linking textual descriptions to specific visual elements within an image. Qwen 2.5 excels at precise object grounding, where it maps natural language queries to corresponding parts of an image (Figure 2).
For example, when given a phrase like “the red car in the background,” the model can pinpoint and highlight the specific object accurately. This capability is pivotal in fields (e.g., human-computer interaction, augmented reality, and e-commerce), where understanding user queries in a visual context is essential.
Additionally, Qwen 2.5 integrates advanced counting capabilities into its visual grounding framework. This allows it not only to identify objects but also to count them with precision (Figure 3). For example, in a query like “How many apples are in the basket?”, the model can both locate the apples and provide an accurate count. This combination of grounding and counting is invaluable in scenarios such as inventory management, data annotation, and real-time monitoring.
Understanding Relationships
Understanding relationships in images is a more nuanced aspect of spatial reasoning. It requires the ability to identify interactions and connections between multiple objects (e.g., “a person holding a book” or “a dog lying under a tree”).
Qwen 2.5’s ability to decode these relationships stems from its multimodal capabilities, which combine visual recognition with contextual language understanding (Figure 4). This skill is invaluable for applications (e.g., scene understanding, robotics, and storytelling), where a deeper comprehension of visual contexts is necessary.
How Spatial Understanding Works in Qwen 2.5 VL Models
Prompt Structure
Spatial understanding with Qwen 2.5 VL (vision language) models starts with a natural language query provided as a prompt along with the image of interest. A typical prompt can be broken down into a few standard components:
Task-Specific Instruction
The prompt usually starts with a task-specific instruction that defines whether the task is detection, counting, identification, or a combination of these. The clarity and specificity of this instruction are critical for guiding the model’s response.
Examples:
"Locate the person who acted bravely in the image..."
"Detect all red cars in the image..."
"Identify basketball players and detect key points..."
"Count and detect all birds in this image..."
Object or Feature Specification
After the task instruction, the prompt narrows down the focus by specifying the object or feature of interest. This could include objects (e.g., birds, cars, or people) or attributes (e.g., color, location, or size). By providing these details, the prompt ensures the model can identify the relevant elements within the image.
Examples:
"Detect all red cars in the image..."
"Locate all motorcyclists who are not wearing helmets..."
"Identify basketball players and detect key points such as their hands and heads..."
This segment adds context and detail, enabling the model to refine its detection or grounding process.
Contextual Clues or Relationships
Additionally, some prompts include contextual details or relationships between objects, which allow the model to go beyond simple detection. These details help in understanding interactions or spatial relationships within the scene.
Examples:
"Identify the person holding a basketball..."
"Locate the person standing near the car..."
"Detect birds sitting on a tree and count them..."
Output Requirements
Many prompts explicitly state the required output format or additional attributes to include in the response. This could involve bounding box coordinates, key points, or structured data like JSON (JavaScript Object Notation).
Examples:
"Return their locations in the form of coordinates in the format {'bbox_2d': [x1, y1, x2, y2]}."
"Count and list the total number of objects detected in JSON format."
"Output key points for each basketball player’s head and hands in JSON format."
Model Response Format
Qwen 2.5’s responses to spatial understanding tasks follow a structured, systematic format (usually JSON), enabling accurate interpretation and integration into downstream applications. The following is a breakdown of the standard components present in the responses.
Bounding Box Coordinates (bbox_2d or point_2d)
These values in the JSON response represent the precise spatial location of detected objects within the image. Bounding boxes enable object localization, while point coordinates are used for fine-grained key point detection.
Formats:
- bbox_2d: A bounding box with four coordinates [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
- point_2d: A specific coordinate for key points (e.g., hands or heads) expressed as [x, y].
Examples:
- Bounding box: {"bbox_2d": [341, 258, 397, 360]}
- Key point: {"point_2d": ["394", "105"]}
Primary Label (label), Sub-Labels, and Descriptions
Based on the prompt, each bounding box or key point detection can be accompanied by the primary label, sub-labels, or description.
The primary label provides the identity or category of the object for the primary task (e.g., detection, counting). At the same time, sub-labels or descriptions provide additional metadata about the detected object (e.g., attributes, states, or relationships).
Examples:
{"label": "motorcyclist", "sub_label": "wearing helmet"}
{"label": "birds", "color": "yellow"}
{"label": "cake with white frosting and colorful sprinkles"}
Hands-on with Qwen 2.5 VL for Spatial Understanding
In this section, we will see how we can use Qwen 2.5 for performing spatial understanding tasks (e.g., object detection, precise visual grounding, and understanding relationships). We will start by installing the necessary libraries.
```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
pip install kagglehub
```
Setting Up Qwen 2.5 VL Model and Inference Function
Next, we load the Qwen 2.5 VL model into memory and implement the zero-shot inference function that will be used to call the model across all kinds of spatial understanding tasks.
Loading the Model and Processor
For this hands-on, we will use "Qwen/Qwen2.5-VL-3B-Instruct", which is a 3B-parameter model from the Qwen2.5 vision language model series.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# You can set min_pixels and max_pixels according to your needs, such as a
# token range of 256-1280, to balance performance and cost.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```
Here, we first import necessary classes and functions from the transformers
library and a utility function from qwen_vl_utils
(Lines 1 and 2). We then load the pre-trained Qwen2_5_VLForConditionalGeneration
model from the "Qwen/Qwen2.5-VL-3B-Instruct"
repository using the from_pretrained
method (Line 5). The torch_dtype="auto"
argument ensures the appropriate data type is used for the model, and device_map="auto"
ensures that the model is loaded on available devices (e.g., GPUs) to optimize performance (Line 6).
In the second part, we set the min_pixels
and max_pixels
values to define a token range for image processing (Lines 10 and 11). These values can be adjusted to balance performance and computational cost. We then create a processor
object using the AutoProcessor
class from the same repository (Line 12).
This processor will handle the pre-processing of vision-related information (e.g., resizing images to fit within the specified pixel range). The combination of the model and processor allows us to generate conditional outputs based on the input images efficiently.
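As a quick sanity check on where those numbers come from: the 256-1280 token budget maps to a pixel budget under the assumption (implied by the formula above) that each visual token covers a 28×28-pixel area:

```python
# The 256-1280 visual-token budget expressed in pixels, assuming one visual
# token per 28x28-pixel area (as the min_pixels/max_pixels formula implies).
for tokens in (256, 1280):
    print(f"{tokens} tokens -> {tokens * 28 * 28:,} pixels")
```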
Implementing the Inference Function
Now that our model is loaded, we will implement a zero_shot_inference()
function that takes in a prompt and image, calls the model, and returns the output. This function can then be used across different spatial understanding tasks, as we will see subsequently.
```python
def zero_shot_inference(model, processor, image, prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    # The processor resizes the image; recover the resized input resolution
    # from the patch grid (patch size is 14 pixels).
    input_height = inputs['image_grid_thw'][0][1] * 14
    input_width = inputs['image_grid_thw'][0][2] * 14

    return output_text, input_height, input_width
```
On Lines 14-25, we format the input structure (messages
) to include the image
and prompt
, ensuring seamless integration of visual and natural language data.
On Lines 28-38, we preprocess the data, combining text and vision inputs using the processor. The formatted inputs
are then transferred to GPU (cuda
) for efficient computation (Line 39). Finally, on Lines 42-48, the model generates output tokens, trims them to exclude input text, and decodes them into meaningful responses (output_text
). The function also calculates image dimensions for spatial tasks, returning decoded text and image details as outputs. This pipeline efficiently bridges vision and language for zero-shot reasoning.
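As a quick usage sketch (the file name is a placeholder), the function can be called with any PIL image and prompt:

```python
from PIL import Image

# Hypothetical local image; the messages structure above accepts a PIL.Image directly.
image = Image.open("street_scene.jpg")
prompt = (
    "Detect all red cars in the image and return their locations "
    "in the format {'bbox_2d': [x1, y1, x2, y2]}."
)

output_text, input_height, input_width = zero_shot_inference(model, processor, image, prompt)
print(output_text)
```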
Parsing the Response and Plotting Bounding Boxes
Next, we will also implement a plot_bounding_boxes
utility function that can be used to plot bounding boxes by parsing the JSON format.
```python
import json
import matplotlib.pyplot as plt
from PIL import Image
import matplotlib.patches as patches


def plot_bounding_boxes(image, json_data, height, width):
    image = image.resize((width, height))

    # Parse the JSON input
    json_data = json_data.split('```json')[1].split('```')[0]
    bbox_data = json.loads(json_data)

    # Plot the image
    fig, ax = plt.subplots(1)
    ax.imshow(image)
    ax.axis('off')

    # Plot the bounding boxes and labels
    for item in bbox_data:
        bbox = item['bbox_2d']
        label = item['label']
        rect = patches.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0], bbox[3] - bbox[1],
                                 linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        plt.text(bbox[0], bbox[1] - 10, label, color='r', fontsize=10)

    plt.show()
```
On Lines 54-57, we import libraries for image handling (PIL.Image
), JSON parsing, and visualization (matplotlib
). On Lines 60-63, the image
is resized to match the dimensions specified, and JSON data is parsed to extract bounding box details.
On Lines 66-68, the image is displayed, and axes are removed for a clean visual. Lines 71-76 plot bounding boxes using patches.Rectangle
, and labels are added above each box for identification. Finally, Line 78 shows the fully annotated image. This function efficiently overlays bounding boxes and labels for spatial understanding tasks.
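Note that plot_bounding_boxes resizes the image to the model's input resolution rather than rescaling the boxes. If you would rather keep the original image untouched, a minimal alternative sketch (assuming the JSON has already been parsed into a list of detections, as in plot_bounding_boxes) is to scale each box back to the original size:

```python
def scale_boxes_to_original(bbox_data, original_size, input_height, input_width):
    """Rescale bbox_2d coordinates from the model's input resolution back to
    the original image size. original_size is a PIL (width, height) tuple."""
    orig_w, orig_h = original_size
    sx = orig_w / float(input_width)
    sy = orig_h / float(input_height)
    scaled = []
    for item in bbox_data:
        x1, y1, x2, y2 = item["bbox_2d"]
        scaled.append({**item, "bbox_2d": [x1 * sx, y1 * sy, x2 * sx, y2 * sy]})
    return scaled
```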
Testing the Setup
It is time to finally run our model on some images and see how it performs for different spatial understanding tasks.
Object Detection
The following code snippet demonstrates zero-shot inference with Qwen 2.5 VL models to detect objects in an image and visualize them.
```python
from PIL import Image
import requests

prompt = "Detect all objects in the image and return their locations and labels (car, truck, bus, cycle, bike, etc.) in the form of coordinates."

image = Image.open(requests.get("https://media.istockphoto.com/photos/tailgating-on-a-threelane-autobahn-picture-id115929876?b=1&k=20&m=115929876&s=170667a&w=0&h=CVSiG4VWJyqjnbtzTRh0Lta12v71wXK_hZTMDWPxVfw=", stream=True).raw)

response, height, width = zero_shot_inference(model, processor, image, prompt)
print("Image size: ", image.size)
print(response)

plot_bounding_boxes(image, response, height, width)
```
It fetches an image
via a URL (Line 84) and passes it with a task-specific prompt
(Line 82) to the zero_shot_inference
function (Lines 86 and 87) to get detection results. Finally, Line 90 uses plot_bounding_boxes
to overlay bounding boxes and labels on the image, completing the detection and visualization process efficiently.
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [109, 217, 183, 280], "label": "car"},
  {"bbox_2d": [271, 256, 351, 329], "label": "car"},
  {"bbox_2d": [272, 200, 337, 257], "label": "car"},
  {"bbox_2d": [348, 38, 377, 63], "label": "car"},
  {"bbox_2d": [368, 81, 410, 114], "label": "car"},
  {"bbox_2d": [285, 0, 308, 16], "label": "car"},
  {"bbox_2d": [386, 137, 453, 223], "label": "truck"}
]
```
This output represents the detection results of a model, showing seven vehicles identified in the image. Each object is represented with bbox_2d
, providing bounding box coordinates ([x1, y1, x2, y2]
) and a label
. Among the detections, six are labeled as "car"
and one as "truck"
. Figure 5 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model captures all the vehicles (along with their labels) in the image with precise bounding boxes, making it ideal for object detection applications (e.g., surveillance and traffic monitoring).
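Because the response is plain JSON, you can also tally detections per class in a few lines; the sketch below reuses the same fence-stripping logic as plot_bounding_boxes:

```python
import json
from collections import Counter

# Count how many objects of each label were detected (response is the string
# returned by zero_shot_inference above).
payload = response.split("```json")[1].split("```")[0]
counts = Counter(det["label"] for det in json.loads(payload))
print(counts)  # for the output above: Counter({'car': 6, 'truck': 1})
```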
Precise Visual Grounding
Next, we perform precise visual grounding, which involves detecting and locating specific objects in the image using prompts. The code below locates the cupcake with choco-chips on it in an image using the specified prompt.
prompt = "Detect the cupcake with choco-chips on it in the image and return its locations in the form of coordinates. " image = Image.open(requests.get("https://th.bing.com/th/id/OIP.AfqWYODEWRwb4yP3U7p8ZwHaFX?rs=1&pid=ImgDetMain", stream=True).raw) response, height, width = zero_shot_inference(model, processor, image, prompt) print("Image size: ", image.size) print(response) plot_bounding_boxes(image, response, height, width)
It fetches the image via a URL, processes it with the zero_shot_inference
function to detect the specified objects and their coordinates, and visualizes the bounding boxes on the image using plot_bounding_boxes
(Line 99).
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [0, 194, 103, 315], "label": "cupcake with choco-chips on it"}
]
```
Figure 6 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model can precisely locate the cupcake with choco-chips on it, indicating its expertise in precise visual grounding as well.
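For downstream use (e.g., extracting the grounded product region in an e-commerce pipeline), the returned box can be used to crop the image. The sketch below resizes the image to the model's input resolution first, exactly as plot_bounding_boxes does, so the coordinates line up; the output path is a placeholder:

```python
import json

# Crop the grounded cupcake region from the image (sketch).
payload = response.split("```json")[1].split("```")[0]
box = json.loads(payload)[0]["bbox_2d"]

# Coordinates are in the model's input resolution, so resize the image first.
resized = image.resize((int(width), int(height)))
crop = resized.crop(tuple(box))
crop.save("cupcake_crop.jpg")  # hypothetical output path
```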
Understanding Relationships in Image
Lastly, we will test the model’s ability to decode relationships between multiple objects in the image. As an example, this code identifies a “helpful kid” in an image using the provided prompt.
prompt = "Detect the helpful kid in the image and return its location in the form of coordinates. " image = Image.open(requests.get("https://clipground.com/images/helping-kids-clipart-3.jpg", stream=True).raw) response, height, width = zero_shot_inference(model, processor, image, prompt) print("Image size: ", image.size) print(response) plot_bounding_boxes(image, response, height, width)
Here’s the output of the above snippet:
Output:
```json
[
  {"bbox_2d": [43, 15, 560, 753], "label": "helpful kid"}
]
```
Figure 7 shows the image after plotting the bounding box coordinates and object labels.
As we can see, the model can understand relationships across different instances in an image, making it suitable for scene understanding, robotics, and storytelling, where a deeper comprehension of visual contexts is necessary.
Summary
In this blog post, we explore how Qwen 2.5 VL models redefine spatial understanding by addressing complex tasks (e.g., object detection, visual grounding, counting, and relationship analysis) in images. We begin by categorizing spatial understanding into three core areas:
- detecting objects with precision
- mapping visual elements to textual descriptions
- interpreting relationships between objects
By examining the structure of prompts, including task-specific instructions, feature specifications, and contextual clues, we demonstrate how users can craft effective inputs for optimal model performance.
Next, we delve into the detailed workings of Qwen 2.5’s response format and hands-on usage for spatial reasoning. We highlight the model’s structured output, including bounding box coordinates, combined labels, and detailed descriptions, which make it adaptable for diverse applications. Through practical steps, we guide users in setting up the model, implementing an inference function, and visualizing detection results. This section emphasizes the flexibility and efficiency of Qwen 2.5 in processing multimodal queries and producing actionable insights.
Finally, we showcase the model’s real-world applications, from detecting vehicles on a busy highway to precisely locating a cupcake and identifying a helpful kid, illustrating its robustness and accuracy. By analyzing visual outputs and leveraging bounding box visualizations, we bring to life the capabilities of Qwen 2.5 in addressing spatial challenges. This blog serves as a comprehensive guide, helping users unlock the potential of Qwen 2.5 VL models for innovative and practical spatial understanding tasks.
Citation Information
Mangla, P. “Object Detection and Visual Grounding with Qwen 2.5,” PyImageSearch, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2025, https://pyimg.co/xd4hj
```bibtex
@incollection{Mangla_2025_object-detection-and-visual-grounding-with-qwen-2-5,
  author = {Puneet Mangla},
  title = {{Object Detection and Visual Grounding with Qwen 2.5}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2025},
  url = {https://pyimg.co/xd4hj},
}
```