Table of Contents
- Content Moderation via Zero Shot Learning with Qwen 2.5
- What Is Content Moderation?
- Overview of Qwen 2.5 Vision-Language Models
- Era of Vision-Language Models
- Introducing Key Features of Qwen 2.5 Models
- Enhanced Pre-Training Dataset
- Extended Generation Length
- Advanced Support for Structured Inputs and Outputs
- Massive Context Length with Qwen2.5-Turbo
- Qwen2.5-72B-Instruct
- Introduction to the Mixture-of-Experts Models
- Enhanced Visual Recognition and Analysis
- Comprehension of Extended Videos and Event Localization
- Accurate Object Localization with Structured Outputs
- Diverse Model Sizes for Flexibility
- Spatial Dimension Enhancements
- Temporal Dimension Innovations
- Zero Shot Learning with Qwen 2.5
- Summary
Content Moderation via Zero Shot Learning with Qwen 2.5
In today’s digital age, maintaining the integrity and safety of online platforms is more crucial than ever. With the exponential growth of user-generated content, the challenge of moderating vast amounts of diverse content has become increasingly complex.
Digital platforms face the formidable task of preventing the spread of harmful or misleading content that can negatively impact individuals and communities. Traditional content moderation often relies heavily on human moderators to review and flag inappropriate content or annotate data for model distillation. This approach can become unmanageable as the volume of content grows and changes frequently. For instance, a major social media platform receives billions of posts, comments, and media uploads daily.
Smaller vision-language models (e.g., Qwen 2.5) have garnered significant attention due to their advanced capabilities and impressive performance on vision tasks. Qwen 2.5 models, which combine visual and linguistic understanding, offer a cutting-edge solution for multimodal content moderation. These models can accurately detect and manage inappropriate content across text, images, and videos without the need for extensive, task-specific training data, thanks to their zero-shot learning capabilities.
Through this series of blog posts, we will explore the transformative potential of Qwen 2.5 vision-language models (VLMs) for various vision and multimodal tasks (e.g., content moderation, PDF summarization, object detection, etc.).
This lesson is the 1st of a 3-part series on Qwen 2.5 Unleashed: Transforming Vision Tasks with AI:
- Content Moderation via Zero Shot Learning with Qwen 2.5 (this tutorial)
- Object Detection and Visual Grounding with Qwen 2.5
- Video Understanding and Grounding with Qwen 2.5
To learn how Qwen 2.5 can be used for content moderation for social media safety, just keep reading.
What Is Content Moderation?
Content Moderation for Social Media Safety
Content moderation is the process of monitoring, reviewing, and managing user-generated content on digital platforms to ensure it adheres to established community guidelines and standards. The goal is to maintain a safe, respectful, and inclusive environment for users by identifying and addressing harmful, inappropriate, or misleading content. Content moderation involves various tasks such as filtering offensive language, detecting and removing hate speech, preventing the spread of misinformation, and protecting users from harassment and cyberbullying. This process can be carried out manually by human moderators or through automated systems powered by artificial intelligence (AI).
In the context of social media platforms (Figure 1), content moderation plays a vital role in safeguarding users and maintaining a positive online environment. Social media platforms are hubs of diverse interactions, where users share text, images, videos, and other media. While these platforms offer tremendous opportunities for communication and expression, they also pose significant risks related to harmful content, including hate speech, graphic violence, explicit material, and misinformation.
Facebook Hateful Memes Challenge
The Facebook Hateful Memes Challenge is a notable initiative aimed at advancing the detection of hate speech in multimodal memes — content that combines both visual and textual elements. Memes present a unique challenge for content moderation due to their reliance on context, cultural references, and nuanced language, which can make their meaning difficult for traditional AI models to interpret.
Specifically, the challenge focuses on detecting hate speech in multimodal memes. This task poses an interesting multimodal fusion problem. Consider, for example, a sentence like “love the way you smell today” or “look how many people love you.” Alone, these sentences are harmless. However, when combined with an equally harmless image of a skunk or a tumbleweed, their meaning shifts dramatically, becoming mean and potentially harmful (Figure 2). The combination of visual and textual elements creates a new context that traditional text-only or image-only moderation methods may fail to capture.
To tackle this issue, the Facebook Hateful Memes Challenge provided a dataset of labeled multimodal memes and invited researchers and developers to create AI models capable of effectively detecting hate speech within these memes.
Overview of Qwen 2.5 Vision-Language Models
Era of Vision-Language Models
Large-scale models (e.g., OpenAI GPT-4o, Gemini, Llama, Claude, etc.) have showcased impressive performance, yet they often remain out of reach for many due to their intensive hardware requirements and operational costs. Today, the field of artificial intelligence is experiencing a transformative shift toward the development of smaller yet highly efficient open-source vision-language models (VLMs) such as Phi3.5-Vision, PaliGemma 3B, InternVL2-2B, Qwen2.5-VL-3B, Llava, SmolVLM, etc. (Figure 3).
This new era is characterized by the emergence of compact models that offer robust capabilities without the hefty computational demands of their larger counterparts. The drive toward smaller VLMs is fueled by the need for models that are more accessible, cost-effective, and adaptable to a variety of resource-constrained environments.
Introducing Key Features of Qwen 2.5 Models
Building upon the foundation of its predecessors (Qwen2-VL), Qwen 2.5 is the latest flagship model series from Alibaba’s Qwen Team. Developed with the vision of enhancing accessibility and performance, Qwen 2.5 introduces several key advancements that address previous limitations and expand its potential applications across various domains.
The following are some key features of the Qwen 2.5 model series.
Enhanced Pre-Training Dataset
The Qwen2.5 models have seen a substantial increase in their pre-training data, expanding from 7 trillion (in the Qwen2 series) to 18 trillion tokens (Figure 4).
This massive dataset growth focuses on key areas (e.g., knowledge, coding, and mathematics), enabling the models to develop a deeper understanding and proficiency in these domains. The enriched training allows Qwen2.5 to provide more accurate and sophisticated responses across a wide range of topics.
Extended Generation Length
One of the key enhancements in Qwen2.5 is the increase in maximum generation length from 2,000 tokens to 8,000 tokens. This extension allows the models to generate longer and more coherent text outputs, which is particularly beneficial for tasks requiring detailed explanations or extensive content generation.
Advanced Support for Structured Inputs and Outputs
Qwen2.5 offers better handling of structured data formats, including tables and JSON. This improved support facilitates more seamless integration with applications that rely on structured data, making it easier to input complex data and receive organized outputs.
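To make this concrete, here is a minimal, hedged sketch (not part of the original tutorial) that asks a Qwen2.5 instruct model to return its answer as JSON. The model choice (the small Qwen2.5-0.5B-Instruct checkpoint), the prompt, and the expected output shown in the comment are illustrative assumptions; the exact response is model- and sampling-dependent.

```python
# Minimal sketch (illustrative, not from the original post): prompting a Qwen2.5
# instruct model for structured JSON output. Model choice and prompt are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small checkpoint chosen for convenience
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "Reply only with valid JSON."},
    {"role": "user", "content": "Extract the fields from: 'Order #123 shipped to Paris on 2024-05-01'. "
                                "Use the keys: order_id, city, date."},
]

# Build the chat prompt, generate, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
response = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)  # expected to resemble: {"order_id": "123", "city": "Paris", "date": "2024-05-01"}
```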
Massive Context Length with Qwen2.5-Turbo
A standout feature is the Qwen2.5-Turbo model’s ability to support a context length of up to 1 million tokens. This exceptional capacity enables the processing of extensive documents or long conversational histories, making it ideal for applications like large-scale document analysis or comprehensive chatbot interactions.
Qwen2.5-72B-Instruct
The flagship model, Qwen2.5-72B-Instruct (Table 1), delivers performance that rivals the state-of-the-art open-weight model Llama-3-405B-Instruct despite being approximately five times smaller. This efficiency demonstrates advanced optimization, allowing similar or superior capabilities without the need for extraordinarily large model sizes.
You can try out the Qwen2.5-72B-Instruct-based chatbot at Qwen chat.
Introduction to the Mixture-of-Experts Models
Qwen2.5 also introduces proprietary Mixture-of-Experts (MoE) models, namely Qwen2.5-Turbo and Qwen2.5-Plus.
These models perform competitively against leading models like GPT-4o-mini and GPT-4o, respectively. The MoE architecture allows the models to allocate computational resources effectively, enhancing performance on diverse tasks.
Enhanced Visual Recognition and Analysis
Qwen2.5 VL extends beyond recognizing common objects (e.g., flowers, birds, fish, and insects). It excels at analyzing complex visual elements, including texts, charts, icons, graphics, and layouts within images. This capability makes it highly effective for tasks that require detailed image interpretation.
Comprehension of Extended Videos and Event Localization
The model can understand videos over 1 hour in length and introduces a new ability to capture events by pinpointing relevant video segments. This feature is valuable for applications like video summarization, content indexing, and automated highlights generation.
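Video understanding and grounding are covered in depth in the third part of this series, but as a taste of the interface, the hedged sketch below mirrors the image-based messages/processor flow used later in this post. The video path, fps value, and question are placeholders (video decoding also requires an extra backend such as torchvision or decord), and fps handling may differ across qwen-vl-utils versions, so treat this strictly as a sketch.

```python
# Hedged sketch only: video input reuses the same messages/processor flow shown
# later in this post for images. The file path, fps, and question are hypothetical.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/your_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe the main events in this video and when they occur."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```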
Accurate Object Localization with Structured Outputs
Qwen2.5 VL (Figure 5) can precisely localize objects within images by generating bounding boxes or points. It provides stable JSON outputs containing coordinates and attributes, facilitating integration with systems that require structured data for further processing or analysis.
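To see what consuming such output might look like, the short snippet below parses an illustrative detection result. The exact keys (bbox_2d, label) and the sample values are assumptions based on the structured-output format reported for Qwen2.5-VL and may vary in practice.

```python
import json

# Illustrative (assumed) structured output for a grounding prompt such as
# "Outline the position of each object and output all coordinates in JSON format".
raw_output = '[{"bbox_2d": [135, 84, 410, 512], "label": "person"}, {"bbox_2d": [420, 90, 640, 480], "label": "dog"}]'

detections = json.loads(raw_output)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]  # absolute pixel coordinates (no normalization)
    print(f'{det["label"]}: top-left=({x1}, {y1}), bottom-right=({x2}, {y2})')
```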
Diverse Model Sizes for Flexibility
The Qwen2.5 series includes a range of open-source dense models varying in size — 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. This variety provides flexibility, allowing users to select models that best fit their computational resources and specific application requirements.
These dense models follow the transformer-based, decoder-only architecture used by GPT-style models and the predecessor Qwen2 series. The architecture incorporates several key components (e.g., Grouped Query Attention, SwiGLU activation, rotary positional embeddings, etc.) to improve the learning process.
Table 2 shows the configurations for different open-weight models in the Qwen2.5 series.
Spatial Dimension Enhancements
Qwen2.5 VL introduces a groundbreaking approach to image processing by handling spatial dimensions more dynamically. Instead of relying on traditional methods like coordinate normalization — which scales coordinates to a standard range — the model directly uses the actual size and scale of each image. Here’s what that means:
- Dynamic Tokenization: Images of varying sizes are converted into tokens of corresponding lengths. Larger images result in more tokens, preserving the detailed information inherent in higher-resolution inputs.
- Direct Coordinate Representation: Detection boxes and points within images are represented using the real, unnormalized coordinates. By doing so, Qwen2.5-VL allows the model to learn and understand the true scale and spatial relationships within each image.
This approach enables the model to interpret spatial information more accurately, leading to better object localization and understanding of scale-related features in images.
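Because images are tokenized in proportion to their resolution, you can roughly estimate the visual token budget an image consumes. The back-of-the-envelope sketch below assumes one token per 28×28-pixel patch (the same 28×28 unit that appears in the min_pixels/max_pixels settings later in this post); the true count also depends on the processor's resizing and patch-merging logic, so treat it as an approximation.

```python
# Rough, assumption-based estimate of visual token usage: one token per 28x28-pixel
# patch, matching the 28*28 unit used for min_pixels/max_pixels later in this post.
def estimate_visual_tokens(width: int, height: int, patch: int = 28) -> int:
    return max(1, width // patch) * max(1, height // patch)

for w, h in [(448, 448), (1280, 720), (2560, 1440)]:
    print(f"{w}x{h} image -> ~{estimate_visual_tokens(w, h)} visual tokens")
```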
Temporal Dimension Innovations
In dealing with videos and temporal data, Qwen2.5 VL makes significant strides:
- Dynamic FPS Training: Instead of a fixed frame rate, the model uses dynamic Frames Per Second during training. This means it can adjust to videos with varying frame rates, making it more adaptable to different types of video content.
- Absolute Time Encoding: The model employs absolute time encoding by aligning Multimodal Rotary Position Embedding (mRoPE) identifiers directly with the progression of time. In essence, the temporal identifiers within the model correspond to actual time intervals, allowing it to learn the pace and flow of events naturally.
By integrating these temporal techniques, Qwen2.5-VL (Figure 6) can comprehend the timing and sequencing of events within videos more effectively. This is particularly important for tasks like event detection, where understanding the exact timing is crucial.
Zero Shot Learning with Qwen 2.5
In this section, we will see how we can use zero shot learning with the Qwen 2.5 models for hate speech detection in social media memes. For this, we will use the Hateful Memes dataset, which is a multimodal dataset for hateful meme detection (image + text) that contains 10,000+ new multimodal examples created by Facebook AI. We will start by installing the necessary libraries.
```
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
pip install kagglehub
```
Loading the Hateful Memes Dataset
Next, we will load the Hateful Memes dataset. The training split comprises 10,000 annotated memes, whereas the validation split comprises 500 annotated examples. For this hands-on, we will evaluate the performance of zero-shot learning on the validation split using Qwen 2.5 models.
```python
import kagglehub

# Download latest version
path = kagglehub.dataset_download("parthplc/facebook-hateful-meme-dataset", force_download=True)

import torch
torch.cuda.empty_cache()
torch.manual_seed(42)

import json
import os
from PIL import Image
import numpy as np
import pandas as pd
import random

data_dir = '/root/.cache/kagglehub/datasets/parthplc/facebook-hateful-meme-dataset/versions/1/data/'
img_path = data_dir + "data_dir"
train_path = data_dir + "train.jsonl"
dev_path = data_dir + "dev.jsonl"
test_path = data_dir + "test.jsonl"

def get_sample(data_path):
    data = [json.loads(l) for l in open(data_path)]
    random.shuffle(data)

    images, texts, answers = [], [], []
    for index in range(len(data)):
        image = Image.open(os.path.join(data_dir, data[index]["img"])).convert("RGB")
        text = data[index]["text"]
        answer = "yes" if data[index]["label"] == 1 else "no"

        images.append(image)
        texts.append(text)
        answers.append(answer)

    return images, texts, answers

images, texts, answers = get_sample(data_path=dev_path)
```
The above code has two main parts.
In the first part (Lines 1-24), we import the required libraries and set up the paths for the dataset. We use `kagglehub` to download the `facebook-hateful-meme-dataset` (Lines 1-4). We then import essential libraries such as `torch`, `json`, `os`, `PIL.Image`, `numpy`, `pandas`, and `random` (Lines 6-17). Following that, we define the `data_dir` where the dataset is stored and set paths for the train, dev, and test JSONL files (Lines 19-24). This setup ensures that the environment is ready to process the dataset.
In the second part (Lines 26-50), we define the `get_sample` function, which extracts images, texts, and answers from the specified JSONL file. We read and parse the JSONL data (Line 27) and load the corresponding images and texts, converting the images to RGB format (Lines 33-35). Finally, we store the loaded data in the initialized lists and return them as the function’s output (Line 41). The function is called with the `dev_path` (Line 42), providing us with a validation set for further processing.
Zero Shot Inference with Qwen 2.5
Once we have loaded our dataset, we can start by loading the Qwen 2.5 models. For this hands-on, we will use `"Qwen/Qwen2.5-VL-3B-Instruct"`, which is a 3B-parameter model of the Qwen2.5 series.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
min_pixels = 256*28*28
max_pixels = 1280*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
```
Here, we first import the necessary classes and functions from the `transformers` library and a utility function from `qwen_vl_utils` (Lines 43 and 44). We then load the pre-trained `Qwen2_5_VLForConditionalGeneration` model from the `"Qwen/Qwen2.5-VL-3B-Instruct"` repository using the `from_pretrained` method (Line 47). The `torch_dtype="auto"` argument ensures the appropriate data type is used for the model, and `device_map="auto"` ensures that the model is loaded on the available devices (e.g., GPUs) to optimize performance.
In the second part, we set the `min_pixels` and `max_pixels` values to define a token range for image processing (Lines 52 and 53). These values can be adjusted to balance performance and computational cost. We then create a `processor` object using the `AutoProcessor` class from the same repository (Line 54).
This processor will handle the pre-processing of vision-related information (e.g., resizing images to fit within the specified pixel range). The combination of the model and processor allows us to efficiently generate conditional outputs based on the input images.
Now that our model is loaded, we can start inferences on our validation dataset.
prompt = "Answer in only yes or no whether the following meme is hateful or offensive towards a religion, race, community, gender, caste etc. Think twice before you answer." predictions = [] for i in range(len(images)): image = images[i] messages = [ { "role": "user", "content": [ { "type": "image", "image": image, }, {"type": "text", "text": prompt}, ], } ] # Preparation for inference text = processor.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) image_inputs, video_inputs = process_vision_info(messages) inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt", ) inputs = inputs.to("cuda") # Inference: Generation of the output generated_ids = model.generate(**inputs, max_new_tokens=20) generated_ids_trimmed = [ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False )[0] predictions.append(output_text) print(f"Images inferred : {i+1}/{len(images)}")
We first define a prompt asking for a yes or no answer regarding the hateful nature of the meme (Line 55).
For each image in the `images` list, we create a message containing the image and the prompt (Lines 61-72). We use the `processor` to prepare inputs for inference by applying a chat template and processing the vision information (Lines 75-85).
The `inputs` are moved to the GPU for efficient processing (Line 86). The model generates the output tokens, which are then trimmed and decoded to get the final text answer (Lines 89-95). The `predictions` are stored in a list and printed incrementally (Lines 97-99). This process evaluates whether each meme is considered hateful or offensive according to the model’s output.
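One practical detail: the raw text the model returns can vary slightly (e.g., "Yes.", "yes", "No"), so before comparing against ground truth, you may want to normalize it. The helper below is a small, hypothetical convenience (not part of the original code) that you could substitute for the exact string match used in the next section.

```python
# Hypothetical helper (not in the original code): map free-form yes/no answers
# to binary labels so the comparison is robust to punctuation and casing.
def to_binary_label(answer: str) -> int:
    return 1 if answer.strip().lower().startswith("yes") else 0

print(to_binary_label("Yes."))  # 1
print(to_binary_label("no"))    # 0
```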
Evaluating the Predictions
Once the inferences are complete, it is time to evaluate the accuracy, precision, recall, and F1 score of our zero shot classifier using scikit-learn.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1 if answers[i] == "yes" else 0 for i in range(len(answers))]
y_pred = [1 if predictions[i] == "Yes." else 0 for i in range(len(predictions))]

print("Total positives : ", sum(y_true))
print("Total Negatives : ", len(y_true) - sum(y_true))

print("Accuracy : ", np.round(accuracy_score(y_true, y_pred), 3))
print("Precision : ", np.round(precision_score(y_true, y_pred), 3))
print("Recall : ", np.round(recall_score(y_true, y_pred), 3))
print("F1 Score : ", np.round(f1_score(y_true, y_pred), 3))
```
Output:
```
Total positives :  250
Total Negatives :  250
Accuracy :  0.622
Precision :  0.593
Recall :  0.78
F1 Score :  0.674
```
As can be seen from the output, our zero-shot approach achieves a recall of 78% and a precision of roughly 59% when classifying memes as hateful or not. Compared to several unimodal and multimodal prior approaches such as BERT, Visual BERT, ViLBERT, etc. (Table 3), Qwen 2.5 reaches 62% accuracy with zero-shot learning alone (no late fusion, fine-tuning, etc.).
This demonstrates that the Qwen 2.5 model series can reason about complex, real-world vision-language problems.
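If you want a closer look at where the zero-shot classifier goes wrong (e.g., whether the lower precision stems from false positives), a quick confusion matrix helps. The snippet below simply reuses the y_true and y_pred lists from the evaluation step above.

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"True Negatives: {tn}, False Positives: {fp}")
print(f"False Negatives: {fn}, True Positives: {tp}")
```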
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: June 2025
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this blog post, we first explore the concept of content moderation and its critical role in maintaining social media safety. We delve into initiatives like the Facebook Hateful Memes Challenge, which aims to tackle harmful content using advanced AI models. Moving forward, we present an overview of Qwen 2.5 Vision-Language Models, highlighting the evolution of vision-language models and introducing the Qwen 2.5 series, which stands out due to its innovative features and enhanced capabilities.
We then discuss the key features of Qwen 2.5 models, focusing on their enhanced pre-training dataset, extended generation length, and advanced support for structured inputs and outputs. We emphasize the significant improvements brought by Qwen2.5-Turbo (e.g., massive context length) and the introduction of the Qwen2.5-72B-Instruct model.
Lastly, we cover the diverse range of model sizes for flexibility, spatial and temporal dimension enhancements, and the capabilities of zero-shot learning with Qwen 2.5. We explain the process of loading the Hateful Memes dataset and performing zero-shot inference with Qwen 2.5, followed by evaluating the predictions. The blog post highlights how these advancements in vision-language models contribute to better content moderation and the overall safety of social media platforms.
Citation Information
Mangla, P. “Content Moderation via Zero Shot Learning with Qwen 2.5,” PyImageSearch, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2025, https://pyimg.co/km5bp
```
@incollection{Mangla_2025_content-moderation-via-zsl-qwen-2-5,
  author = {Puneet Mangla},
  title = {{Content Moderation via Zero Shot Learning with Qwen 2.5}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2025},
  url = {https://pyimg.co/km5bp},
}
```
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.