Chat with Graphic PDFs: Building an AI PDF Summarizer
Welcome to the 2nd part of our series on creating a Vision-Language Retrieval-Augmented Generation (RAG) pipeline! In our previous tutorial, “Chat with Graphic PDFs: Understand How AI PDF Summarizers Work,” we explored the theoretical foundations and architecture of this system. Now, we’ll transition from theory to practice, guiding you through the step-by-step implementation of a multimodal, chat-based RAG pipeline for complex portable document format (PDF) analysis using advanced models like ColPali and LLaVA-Gemma-2B.
This lesson is the last of a 2-part series on Vision-Language RAG Pipeline:
- Chat with Graphic PDFs: Understand How AI PDF Summarizers Work
- Chat with Graphic PDFs: Building an AI PDF Summarizer (this tutorial)
To learn how to implement a Vision-Language RAG pipeline for analyzing complex PDFs, just keep reading.
Configuring Your Development Environment
To follow this guide, we need to prepare our environment by installing the necessary libraries and tools.
!pip install --upgrade -qq byaldi
First, we need to install the byaldi library to enable easy integration with multimodal RAG models. Byaldi serves as a lightweight wrapper around the ColPali repository, offering a simplified application programming interface (API) for working with late-interaction multimodal models.
!sudo apt-get install -y poppler-utils
Since we are working with PDFs, we need to install poppler-utils for manipulating PDF files and converting them to different formats.
!pip install -q pdf2image transformers
We will also install pdf2image for converting PDF pages into images. This library requires poppler-utils to be installed on our system. Finally, we install transformers to load the Hugging Face models (e.g., ColPali and LLaVA).
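Before moving on, it can be worth running a quick sanity check (our own optional addition, not part of the original walkthrough) to confirm that the packages import and that poppler’s command-line tools are reachable:

import shutil
import byaldi
import pdf2image
import transformers

# pdf2image shells out to poppler's command-line tools, so they must be on the PATH.
print("pdfinfo found on PATH:", shutil.which("pdfinfo") is not None)
print("transformers version:", transformers.__version__)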
Need Help Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code immediately on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Setup and Imports
import torch
from byaldi import RAGMultiModalModel
from pdf2image import convert_from_path
from transformers import LlavaForConditionalGeneration, AutoProcessor
We now import the required libraries for our workflow. We import torch to handle tensor computations. Next, we import RAGMultiModalModel from the byaldi library, which provides a user-friendly wrapper for working with pre-trained multimodal RAG models (e.g., ColPali). We also import convert_from_path from the pdf2image library to extract images from the PDF file. Finally, we import LlavaForConditionalGeneration and AutoProcessor from the transformers library to load and process the LLaVA (Large Language and Vision Assistant) model.
Upload the PDF
pdf_path = "/content/kids.pdf"
images = convert_from_path(pdf_path)
Once the setup is done, we upload the PDF to Colab and define the path in the variable pdf_path. Then, we call convert_from_path on pdf_path to convert the uploaded PDF into a list of images, where each image represents a page in the PDF.
NOTE: If we have not installed poppler-utils, we will get the following error:
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed in PATH?
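If you would rather catch this failure explicitly than hit a raw traceback, pdf2image exposes the exception class, so the conversion can be wrapped in a try/except. A minimal sketch of our own:

from pdf2image import convert_from_path
from pdf2image.exceptions import PDFInfoNotInstalledError

try:
    images = convert_from_path(pdf_path)
except PDFInfoNotInstalledError:
    # Fail with a clear message instead of a long traceback.
    raise SystemExit("poppler-utils is missing. Install it (e.g., !sudo apt-get install -y poppler-utils) and rerun.")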
Load the ColPali Model
The ColPali model excels at multimodal retrieval tasks, as discussed in our previous tutorial. Let’s see it in action.
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali")
We first download and initialize the pre-trained ColPali model using the from_pretrained method of the RAGMultiModalModel class.
Index the Document
Then, we use RAG.index to convert the PDF into an indexed format for efficient search operations.
RAG.index(
    input_path="/content/kids.pdf",
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True
)
We provide the following parameters:
- input_path: Specifies the file path to the PDF document, in this case, /content/kids.pdf.
- index_name: Defines the name of the index, here set as "image_index". This identifier is used to manage and query the index.
- store_collection_with_index: If set to False, the document collection is not stored alongside the index, reducing storage overhead.
- overwrite: Ensures that if an index with the same name already exists, it will be replaced with a new one.
Once indexed, the PDF content is structured and optimized for quick and accurate retrieval of information. This makes the pipeline highly efficient for search tasks, especially when working with large or complex documents.
Figure 1 displays the output generated after running the indexing code.
In this example, we’re working with a single-page PDF for simplicity and demonstration purposes. The output confirms that page 1 of the document has been successfully indexed and stored at .byaldi/image_index.
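Because the index is written to disk (under .byaldi/image_index), you do not need to re-run the indexing step in every session. Recent byaldi releases expose a from_index helper for reloading a saved index; the one-liner below is a sketch under that assumption, so verify it against your installed byaldi version.

# Reload a previously built index by name instead of re-indexing the PDF
# (assumes your byaldi version provides RAGMultiModalModel.from_index).
RAG = RAGMultiModalModel.from_index("image_index")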
Query the Document
text_query = "What is needed for Healthy diet?"
results = RAG.search(text_query, k=1)
results
After indexing the PDF, we are ready to query it for specific information. Let’s start with a simple question.
Here, we define text_query as "What is needed for Healthy diet?", which represents the question we want to ask. Then, we use the RAG.search method to retrieve the most relevant answer from the indexed content. The parameter k=1 tells the model to return only the top result.

The output (Figure 2), stored in the results variable, includes detailed information such as the document ID (doc_id), page number (page_num), relevance score (score), and any associated metadata. For example, in our case, the top result indicates that the answer is found on page 1 ('page_num': 1) of the document ('doc_id': 0) with a relevance score of 18.5.
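If you want to compare several candidate pages instead of just the best one, you can raise k and loop over the returned results; the fields used below (doc_id, page_num, score) are the same ones visible in Figure 2. This is a small convenience sketch, not part of the original pipeline:

# Retrieve the top 3 candidate pages and print their ranking details.
top_results = RAG.search(text_query, k=3)
for rank, result in enumerate(top_results, start=1):
    print(f"Rank {rank}: doc_id={result['doc_id']}, page_num={result['page_num']}, score={result['score']}")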
Retrieved Result
images[results[0]["page_num"] - 1]
Next, we extract the specific page (Figure 3) from the PDF that corresponds to the top search result.
We use page_num from the results variable to locate the desired page in the images list. Since list indexing in Python starts from 0, we subtract 1 from page_num to align it correctly with the list index.

This extracts the image of the page where the answer to our query is located. For instance, if the result indicates 'page_num': 1, this code fetches the first page of the PDF as an image. This image will be used as input for the vision-language model in the next step to generate our chat-based responses.
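Because convert_from_path returns standard PIL images, the retrieved page can also be inspected or saved outside the notebook. An optional sketch:

# The retrieved page is a regular PIL.Image, so the usual PIL methods apply.
retrieved_page = images[results[0]["page_num"] - 1]
print("Page size (width, height):", retrieved_page.size)
retrieved_page.save("retrieved_page.png")  # keep a copy on disk for later inspection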
Load the LLaVA Model
Now, we load the LLaVA-Gemma-2B model, which will process multimodal inputs to generate coherent text responses.
checkpoint = "Intel/llava-gemma-2b"
model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
We start by defining the model checkpoint.
Next, we use the from_pretrained method to load both the model and its processor.
- Model Loading: LlavaForConditionalGeneration.from_pretrained(checkpoint) loads the pre-trained LLaVA model, which specializes in generating text based on both visual and textual inputs.
- Processor Initialization: AutoProcessor.from_pretrained(checkpoint) initializes the processor, which handles the preprocessing of inputs for the model.
This step ensures that the model and processor are ready to process multimodal data, such as combining images and text to generate meaningful answers to user queries.
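By default, from_pretrained keeps the model on the CPU. If your runtime has a GPU, you can optionally move the model over with standard PyTorch calls; this is an optional tweak of our own rather than a required step:

# Optionally move the LLaVA model to the GPU for faster generation.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # inference only; no training here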
Preprocess the Inputs
prompt = processor.tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "<image>\nWhat is written about Healthy Diet?"}],
    tokenize=False,
    add_generation_prompt=True
)
image_index = results[0]["page_num"] - 1
image = images[image_index]
inputs = processor(text=prompt, images=image, return_tensors="pt")
Next, we prepare the input for the LLaVA-Gemma-2B model to generate a response. This involves creating a query prompt and processing the corresponding page image.
We use the processor’s tokenizer to structure our query in a way the model can understand.
- The apply_chat_template method structures the query as a chat-like conversation.
- The <image> token tells the model that the input includes an image, followed by the user’s question.
- The add_generation_prompt=True parameter signals the model to prepare for generating a response.
Then, we extract the image corresponding to the indexed search result:
- results[0]["page_num"] - 1 identifies the page number from the search results (adjusted for zero-based indexing).
- We retrieve the image using this index from the images list.
The extracted image and the query prompt are passed to the processor to create model-ready inputs:
- text=prompt and images=image: specify the multimodal input.
- return_tensors="pt": ensures the inputs are returned as PyTorch tensors, ready for the model.
With these steps, we’ve successfully prepared the multimodal input, combining the query and the relevant page image, for the LLaVA-Gemma-2B model to generate a response.
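Before generating, it can help to confirm what the processor produced and, if you moved the model to a GPU earlier, to place the tensors on the same device. An optional sketch:

# Inspect the processed inputs: token IDs for the text, pixel values for the image.
print("input_ids shape:", inputs["input_ids"].shape)
print("pixel_values shape:", inputs["pixel_values"].shape)

# Keep the inputs on the same device as the model (a no-op on CPU).
inputs = {k: v.to(model.device) for k, v in inputs.items()}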
Generate the Response
generate_ids = model.generate(**inputs, max_length=200)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
Finally, we generate a response using the LLaVA-Gemma-2B model. This step involves passing the prepared inputs to the model and decoding its output into human-readable text.
We use the model’s generate method to produce the token IDs for the response:
- The **inputs parameter feeds the multimodal inputs (query and image) to the model.
- The max_length=200 parameter ensures the generated response doesn’t exceed 200 tokens (an alternative that caps only the newly generated tokens is sketched after this list).
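One caveat: max_length counts the prompt tokens plus the generated tokens, so a long prompt leaves less room for the answer. If you prefer to bound only the newly generated text, transformers also accepts max_new_tokens; a hedged alternative call:

# Alternative: cap only the number of newly generated tokens (prompt excluded)
# and use greedy decoding for deterministic output.
generate_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)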
Next, we decode the generated token IDs into a readable format:
- The batch_decode method translates the token IDs back into text.
- Setting skip_special_tokens=True removes unnecessary model-specific tokens from the output.
- The clean_up_tokenization_spaces=False parameter retains the original spacing and punctuation for clarity.
Finally, we print the generated response.
user
What is written about Healthy Diet?
model
The image features a colorful chart with various health-related topics, including a healthy diet. The chart is divided into sections, each containing a different health-related concept. The topics covered in the chart include healthy eating, healthy sleep, healthy body, and healthy habits. The chart is designed to help individuals understand and maintain a healthy lifestyle. The chart is accompanied by a motivational message, encouraging people to make healthy choices and maintain a balanced diet. The chart is colorful and visually appealing, making it an engaging and informative resource for those looking to improve their health and well-being. The chart is likely intended for use in schools, health centers, or other educational settings to help individuals make healthier choices and maintain a balanced lifestyle. The chart is designed to be visually appealing and easy to understand, making it an effective tool for promoting healthy habits and making positive lifestyle choices.
The model processes the image and query to generate a concise and relevant response.
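Because batch_decode decodes the entire sequence, the printed output above still echoes the chat prompt ("user ... model ..."). If you only want the model’s answer, a common transformers pattern is to slice off the prompt tokens before decoding; this is an optional refinement of our own:

# Decode only the tokens generated after the prompt, dropping the echoed question.
prompt_length = inputs["input_ids"].shape[1]
answer_ids = generate_ids[:, prompt_length:]
answer = processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
print(answer.strip())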
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: February 2025
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this tutorial, we took a hands-on approach to building a Vision-Language Retrieval-Augmented Generation (RAG) pipeline for analyzing PDFs using the ColPali and LLaVA models.
This concludes our 2-part series on leveraging multimodal models to analyze complex documents and generate meaningful responses. By combining these powerful models, we’ve unlocked the potential for a wide range of applications in fields like education, healthcare, research, and beyond.
Citation Information
Thakur, P. “Chat with Graphic PDFs: Building an AI PDF Summarizer,” PyImageSearch, P. Chugh, S. Huot, and G. Kudriavtsev, eds., 2025, https://pyimg.co/u65dw
@incollection{Thakur_2025_chat-w-graphic-pdfs-building-ai-pdf-summarizer,
  author = {Piyush Thakur},
  title = {{Chat with Graphic PDFs: Building an AI PDF Summarizer}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Georgii Kudriavtsev},
  year = {2025},
  url = {https://pyimg.co/u65dw},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Comment section
Hey, Adrian Rosebrock here, author and creator of PyImageSearch. While I love hearing from readers, a couple years ago I made the tough decision to no longer offer 1:1 help over blog post comments.
At the time I was receiving 200+ emails per day and another 100+ blog post comments. I simply did not have the time to moderate and respond to them all, and the sheer volume of requests was taking a toll on me.
Instead, my goal is to do the most good for the computer vision, deep learning, and OpenCV community at large by focusing my time on authoring high-quality blog posts, tutorials, and books/courses.
If you need help learning computer vision and deep learning, I suggest you refer to my full catalog of books and courses — they have helped tens of thousands of developers, students, and researchers just like yourself learn Computer Vision, Deep Learning, and OpenCV.
Click here to browse my full catalog.