Automatic Differentiation Part 1: Understanding the Math
In this tutorial, you will learn the math behind automatic differentiation needed for backpropagation.
This lesson is the 1st of a 2-part series on Autodiff 101 — Understanding Automatic Differentiation from Scratch:
- Automatic Differentiation Part 1: Understanding the Math (this tutorial)
- Automatic Differentiation Part 2: Implementation Using Micrograd
To learn about automatic differentiation, just keep reading.
Automatic Differentiation Part 1: Understanding the Math
Imagine you are trekking down a hill. It is dark, and there are a lot of bumps and turns. You have no way of knowing how to reach the base. Now imagine that every time you make progress, you have to pause, take out the topological map of the hill, and calculate your direction and speed for the next step. Sounds painfully tedious, right?
If you have been a reader of our tutorials, you would know what that analogy refers to. The hill is your loss landscape, the topological map is the set of rules for multivariate calculus, and you are the parameters of the neural network. The objective is to reach the global minimum.
And that brings us to the question:
Why do we use a Deep Learning Framework today?
The first thing that pops into the mind is automatic differentiation. We write the forward pass, and that is it; no need to worry about the backward pass. Every operator is automatically differentiated and is waiting to be used in an optimization algorithm (like stochastic gradient descent).
Today in this tutorial, we will walk through the valleys of automatic differentiation.
Introduction
In this section, we will lay out the foundation necessary for understanding autodiff.
Jacobian
Let’s consider a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. $f$ is a multivariate function that simultaneously depends on multiple variables. Here the multiple variables are $x_1, x_2, \ldots, x_n$. The output of the function is a scalar value. This can be thought of as a neural network that takes an image and outputs the probability of a dog’s presence in the image.
Note: Let us recall that in a neural network, we compute gradients with respect to the parameters (weights and biases) and not the inputs (the image). Thus the domain of the function is the parameters and not the inputs, which helps keep the gradient computation accessible. We need to now think of everything we do in this tutorial from the perspective of making it simple and efficient to obtain the gradients with respect to the weights and biases (parameters). This is illustrated in Figure 1.
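To make this concrete, here is a minimal sketch of a scalar-valued function of its parameters and the gradient with respect to those parameters. We use JAX here purely as a convenient illustration (the next part of the series builds its own micro-library), and the function and its formula are made up:

```python
import jax
import jax.numpy as jnp

# A made-up stand-in for a network: it maps a vector of parameters
# to a single scalar score (think "probability of a dog").
def f(params):
    return jnp.tanh(jnp.sum(params ** 2))

params = jnp.array([0.1, -0.4, 0.7])
print(f(params))            # a single scalar output
print(jax.grad(f)(params))  # gradient w.r.t. the parameters, same shape as params
```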
A neural network is a composition of many sublayers. So let’s consider our function $f$ as a composition of multiple functions (primitive operations).
The function $f$ is composed of four primitive functions, namely $f_1$, $f_2$, $f_3$, and $f_4$. For anyone new to composition, we can call $f$ a composed function where $f(x)$ is equal to $f_4(f_3(f_2(f_1(x))))$.
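As a quick illustration, here is what such a composition could look like in code. The primitives $f_1, \ldots, f_4$ below are invented for this sketch (they are not the ones from the figures):

```python
import jax.numpy as jnp

# Four made-up primitive operations whose composition maps R^3 to a scalar.
f1 = lambda x: x ** 2                                # elementwise square
f2 = lambda a: jnp.stack([a[0] + a[1], a[1] * a[2]])  # R^3 -> R^2
f3 = lambda b: jnp.sin(b)                             # elementwise sine
f4 = lambda c: jnp.sum(c)                             # R^2 -> scalar

# f(x) = f4(f3(f2(f1(x))))
f = lambda x: f4(f3(f2(f1(x))))

print(f(jnp.array([0.5, -1.0, 2.0])))
```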
The next step would be to find the gradient of . However, before diving into the gradients of the function, let us revisit Jacobian matrices. It turns out that the derivatives of a multivariate function are a Jacobian matrix consisting of partial derivatives of the function w.r.t. all the variables upon which it depends.
Consider two multivariate functions, $g_1(x_1, x_2)$ and $g_2(x_1, x_2)$, which depend on the variables $x_1$ and $x_2$. The Jacobian would look like this:

$$J = \begin{bmatrix} \dfrac{\partial g_1}{\partial x_1} & \dfrac{\partial g_1}{\partial x_2} \\[6pt] \dfrac{\partial g_2}{\partial x_1} & \dfrac{\partial g_2}{\partial x_2} \end{bmatrix}$$
Now let’s compute the Jacobian of our function $f$. We need to note here that the function depends on the variables $x_1, x_2, \ldots, x_n$ and outputs a scalar value. This means that the Jacobian will be a row vector:

$$J_f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$
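If you would rather see these matrices computed than typed out, here is a small sketch of both cases, the 2×2 Jacobian and the single-row Jacobian. The functions below are made up, and JAX is used only for illustration:

```python
import jax
import jax.numpy as jnp

# Two made-up functions of (x1, x2), packed as one vector-valued
# function so we can ask for the full 2 x 2 Jacobian.
def g(x):
    x1, x2 = x
    return jnp.stack([x1 * x2, x1 + jnp.sin(x2)])

print(jax.jacobian(g)(jnp.array([1.0, 2.0])))   # 2 x 2 matrix of partials

# A scalar-valued function of n variables: its Jacobian is a single row
# (JAX returns it as a 1-D array of length n).
def f(x):
    return jnp.sum(x ** 2)

print(jax.jacobian(f)(jnp.array([1.0, 2.0, 3.0])))  # [2. 4. 6.]
```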
Chain Rule
Remember how our function $f$ is composed of many primitive functions? The derivative of such a composed function is obtained with the help of the chain rule. To ease our way into the chain rule, let us first write down the composition and then define the intermediate values.
$f$ is composed of:

$$y = f(x) = f_4(f_3(f_2(f_1(x))))$$

with the intermediate values

$$a = f_1(x), \quad b = f_2(a), \quad c = f_3(b), \quad y = f_4(c)$$

Now that the composition is spelled out, let’s first get the derivatives of the intermediate values:

$$\frac{\partial a}{\partial x}, \quad \frac{\partial b}{\partial a}, \quad \frac{\partial c}{\partial b}, \quad \frac{\partial y}{\partial c}$$

Now, with the help of the chain rule, we derive the derivative of the function $f$:

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial c}\,\frac{\partial c}{\partial b}\,\frac{\partial b}{\partial a}\,\frac{\partial a}{\partial x}$$
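Here is a quick numerical sanity check of that chain, using made-up scalar primitives and the same intermediate names $a$, $b$, $c$ as above:

```python
import jax
import jax.numpy as jnp

# Made-up scalar primitives, only for checking the chain rule numerically.
f1 = lambda x: x ** 2
f2 = lambda a: jnp.sin(a)
f3 = lambda b: 3.0 * b
f4 = lambda c: jnp.exp(c)

x = 0.7
a, b, c = f1(x), f2(f1(x)), f3(f2(f1(x)))

# Local derivatives of each intermediate value.
da_dx = 2.0 * x
db_da = jnp.cos(a)
dc_db = 3.0
dy_dc = jnp.exp(c)

# Chain rule: dy/dx = (dy/dc)(dc/db)(db/da)(da/dx)
print(dy_dc * dc_db * db_da * da_dx)

# Differentiating the whole composition gives the same number.
f = lambda x: f4(f3(f2(f1(x))))
print(jax.grad(f)(x))
```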
Mix the Jacobian and Chain Rule
Now that we know about the Jacobian and the chain rule, let us visualize the two together, as shown in Figure 2.
The derivative of our function $f$ is just the matrix multiplication of the Jacobian matrices of the intermediate terms:

$$\frac{\partial y}{\partial x} = J_{f_4}\,J_{f_3}\,J_{f_2}\,J_{f_1}$$
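We can verify this with the same made-up composition sketched earlier: multiply the per-primitive Jacobians and compare against the Jacobian of the whole composition (JAX is used here only to compute the individual Jacobians):

```python
import jax
import jax.numpy as jnp

# The made-up composition from before: R^3 -> R.
f1 = lambda x: x ** 2
f2 = lambda a: jnp.stack([a[0] + a[1], a[1] * a[2]])
f3 = lambda b: jnp.sin(b)
f4 = lambda c: jnp.sum(c)
f  = lambda x: f4(f3(f2(f1(x))))

x = jnp.array([0.5, -1.0, 2.0])
a, b, c = f1(x), f2(f1(x)), f3(f2(f1(x)))

# Jacobian of each primitive, evaluated at its own input.
J1 = jax.jacobian(f1)(x)   # 3 x 3
J2 = jax.jacobian(f2)(a)   # 2 x 3
J3 = jax.jacobian(f3)(b)   # 2 x 2
J4 = jax.jacobian(f4)(c)   # the 1 x 2 row, returned as a length-2 array

# Multiplying the Jacobians reproduces the derivative of the composition.
print(J4 @ J3 @ J2 @ J1)
print(jax.jacobian(f)(x))  # same values
```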
Now, this is where we ask the question:
Does the order in which we do the matrix multiplication matter?
Forward and Reverse Accumulations
In this section, we work toward answering the question of how the Jacobian matrix multiplications should be ordered.
There are two extremes in which we could order the multiplications: the forward accumulation and the reverse accumulation.
Forward Accumulation
If we order the multiplication from right to left in the same order in which the function was evaluated, the process is called forward accumulation. The best way to think about the ordering is to place brackets in the equation, as shown in Figure 3.
With our function $f$, the forward accumulation process involves a matrix-matrix multiplication at every step. This results in more FLOPs (floating-point operations).
Note: Forward accumulation is beneficial when we want to get the derivative of a function $f: \mathbb{R} \rightarrow \mathbb{R}^m$ (one with few inputs and many outputs).
Another way to understand forward accumulation is to think of a Jacobian-Vector Product (JVP). Consider a Jacobian $J \in \mathbb{R}^{m \times n}$ and a vector $v \in \mathbb{R}^{n}$. The Jacobian-Vector Product is $Jv$.
This gives us a matrix-vector multiplication at every stage (which makes the process more efficient).
➤ Question: If we have a Jacobian-Vector Product, how can we obtain the Jacobian from it?
➤ Answer: We pass a one-hot vector and get each column of the Jacobian one at a time.
So we can think of forward accumulation as a process in which we build the Jacobian per column.
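Here is a sketch of that column-by-column view using `jax.jvp`. The function below, mapping $\mathbb{R}^3$ to $\mathbb{R}^2$, is made up for illustration:

```python
import jax
import jax.numpy as jnp

# A made-up function from R^3 to R^2, so its Jacobian is 2 x 3.
def f(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# One JVP with a one-hot tangent e_i returns the i-th *column*
# of the Jacobian, so we need n = 3 passes to build the whole matrix.
columns = []
for i in range(3):
    e_i = jnp.zeros(3).at[i].set(1.0)
    _, col = jax.jvp(f, (x,), (e_i,))
    columns.append(col)

print(jnp.stack(columns, axis=1))   # the full 2 x 3 Jacobian
print(jax.jacfwd(f)(x))             # forward-mode Jacobian agrees
```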
Reverse Accumulation
Suppose we order the multiplication from left to right, in the opposite direction to which the function was evaluated. In that case, the process is called reverse accumulation. The diagram of the process is illustrated in Figure 4.
As it turns out, with reverse accumulation, deriving the derivative of our function $f$ is a vector-matrix multiplication at every step. This means that, for this particular function, reverse accumulation needs fewer FLOPs than forward accumulation.
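To see why the ordering changes the cost, here is a rough sketch that counts the multiplications needed by each ordering. The Jacobian sizes are invented purely to exaggerate the effect (many inputs, a single output):

```python
import numpy as np

# Hypothetical Jacobian shapes for f = f4 ∘ f3 ∘ f2 ∘ f1 with
# 1000 inputs, intermediate sizes 100, 50, 20, and a scalar output.
rng = np.random.default_rng(0)
J1 = rng.standard_normal((100, 1000))   # da/dx
J2 = rng.standard_normal((50, 100))     # db/da
J3 = rng.standard_normal((20, 50))      # dc/db
J4 = rng.standard_normal((1, 20))       # dy/dc

def matmul_mults(a, b):
    # Multiplications needed for an (m x k) @ (k x n) product.
    m, k = a.shape
    _, n = b.shape
    return m * k * n

# Forward accumulation: right to left, matrix-matrix products.
fwd = (matmul_mults(J2, J1)
       + matmul_mults(J3, J2 @ J1)
       + matmul_mults(J4, J3 @ J2 @ J1))

# Reverse accumulation: left to right, vector-matrix products.
rev = (matmul_mults(J4, J3)
       + matmul_mults(J4 @ J3, J2)
       + matmul_mults(J4 @ J3 @ J2, J1))

print(fwd, rev)  # forward needs far more multiplications here

# Both orderings produce the same Jacobian.
print(np.allclose(J4 @ (J3 @ (J2 @ J1)), ((J4 @ J3) @ J2) @ J1))
```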
Another way to understand reverse accumulation is to think of a Vector-Jacobian Product (VJP). Consider a Jacobian $J \in \mathbb{R}^{m \times n}$ and a vector $v \in \mathbb{R}^{m}$. The Vector-Jacobian Product is $v^{\top}J$.
This allows us to have vector-matrix multiplication at all stages (which makes the process more efficient).
➤ Question: If we have a Vector-Jacobian Product, how can we obtain the Jacobian from it?
➤ Answer: We pass a one-hot vector and get each row of the Jacobian one at a time.
So we can think of reverse accumulation as a process in which we build the Jacobian per row.
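Here is the row-by-row counterpart using `jax.vjp`, with the same made-up $\mathbb{R}^3 \rightarrow \mathbb{R}^2$ function as before:

```python
import jax
import jax.numpy as jnp

# Same made-up R^3 -> R^2 function; its Jacobian is 2 x 3.
def f(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
_, vjp_fn = jax.vjp(f, x)

# One VJP with a one-hot cotangent e_i returns the i-th *row*
# of the Jacobian, so we need m = 2 passes for the whole matrix.
rows = [vjp_fn(jnp.zeros(2).at[i].set(1.0))[0] for i in range(2)]

print(jnp.stack(rows, axis=0))   # the full 2 x 3 Jacobian
print(jax.jacrev(f)(x))          # reverse-mode Jacobian agrees
```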
Now, if we consider our previously mentioned function $f$, we know that the Jacobian is a row vector. Therefore, if we apply the reverse accumulation process, which means the Vector-Jacobian Product, we can obtain the row vector in one shot. On the other hand, if we apply the forward accumulation process, the Jacobian-Vector Product, each pass gives us only a single element (one column), and we would need to iterate $n$ times to build the entire row.
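For a scalar-valued function like our $f$, the difference is easy to see in code (again a made-up function, with JAX used only for illustration): a single VJP returns the entire gradient, while forward mode needs one JVP per input:

```python
import jax
import jax.numpy as jnp

# A made-up scalar-valued function of n = 4 variables.
def f(x):
    return jnp.sum(jnp.tanh(x))

x = jnp.array([0.1, -0.5, 2.0, 1.0])

# Reverse mode: one VJP with the scalar "one-hot" 1.0 yields
# the entire 1 x n Jacobian, i.e., the gradient.
_, vjp_fn = jax.vjp(f, x)
print(vjp_fn(jnp.array(1.0))[0])

# Forward mode needs one JVP per input to recover the same row,
# one entry at a time.
print(jnp.stack([jax.jvp(f, (x,), (jnp.zeros(4).at[i].set(1.0),))[1]
                 for i in range(4)]))
```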
This is why reverse accumulation is used more often in the Neural Network literature.
Summary
In this tutorial, we studied the math of automatic differentiation and how it is applied to the parameters of a neural network. The next tutorial will build on this foundation and show how to implement automatic differentiation with a Python package. The implementation will involve a step-by-step walkthrough of creating a Python package and using it to train a neural network.
Did you enjoy a math-heavy tutorial on the fundamentals of automatic differentiation? Let us know.
Twitter: @PyImageSearch
Citation Information
A. R. Gosthipaty and R. Raha. “Automatic Differentiation Part 1: Understanding the Math,” PyImageSearch, P. Chugh, S. Huot, K. Kidriavsteva, and A. Thanki, eds., 2022, https://pyimg.co/pyxml
@incollection{ARG-RR_2022_autodiff1,
  author    = {Aritra Roy Gosthipaty and Ritwik Raha},
  title     = {Automatic Differentiation Part 1: Understanding the Math},
  booktitle = {PyImageSearch},
  editor    = {Puneet Chugh and Susan Huot and Kseniia Kidriavsteva and Abhishek Thanki},
  year      = {2022},
  note      = {https://pyimg.co/pyxml},
}