Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need to worry about partial derivatives, chain rule, or anything like it.
To illustrate how it works, let’s say we’re trying to fit a simple linear regression with a single feature x, using Mean Squared Error (MSE) as our loss:
We need to create two tensors, one for each parameter our model needs to learn: b and w.
Without PyTorch, we would have to start with our loss, and work the partial derivatives out to compute the gradients manually. Sure, it would be easy enough to do it for this toy problem, but we need something that can scale.
So, how do we do it? PyTorch provides some really handy methods we can use to easily compute the gradients. Let’s check them out!
What distinguishes a tensor used for training data (or validation, or test) from a tensor used as a (trainable) parameter/weight?
The latter requires the computation of its gradients, so we can update their values (the parameters’ values, that is). That’s what the requires_grad=True argument is good for. It tells PyTorch to compute gradients for us.
Remember: a tensor for a learnable parameter requires a gradient!
In code, creating tensors for our two parameters looks like this:
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Step 0 - Initializes parameters "b" and "w" randomly torch.manual_seed(42) b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device) w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
So, how do we tell PyTorch to do its thing and compute all gradients? That’s the role of the backward() method. It will compute gradients for all (requiring gradient) tensors involved in the computation of a given variable.
Do you remember the starting point for computing the gradients? It is the loss, which we would use to compute its partial derivatives with respect to our parameters.
Hence, we need to invoke the backward() method from the corresponding Python variable: loss.backward().
The code below illustrates it well, assuming we’re making both predictions and computing the loss using nothing but Numpy:
# Step 1 - Computes our model's predicted output - forward pass yhat = b + w * x_train_tensor # Step 2 - Computes the loss # We are using ALL data points, so this is BATCH gradient descent. # How wrong is our model? That's the error! error = (y_train_tensor - yhat) # It is a regression, so it computes mean squared error (MSE) loss = (error ** 2).mean() # Step 3 - Computes gradients for both "b" and "w" parameters # No more manual computation of gradients! loss.backward()
Which tensors are going to be handled by the backward() method applied to the loss?
We have set requires_grad=True to both b and w, so they are obviously included in the list. We use them both to compute yhat, so it will also make it to the list. Then we use yhat to compute the error, which is also added to the list.
Do you see the pattern here? If a tensor in the list is used to compute another tensor, the latter will also be included in the list. Tracking these dependencies is exactly what the dynamic computation graph is doing, as we’ll see shortly.
What about x_train_tensor and y_train_tensor? They are involved in the computation too… but they contain data, and thus they are not created as gradient-requiring tensors. So, backward() does not care about them.
What about the actual values of the gradients? We can inspect them by looking at the grad attribute of each tensor.
OK, we got gradients, but there is one more thing to pay attention to: by default, PyTorch accumulates the gradients. How to handle that?
Every time we use the gradients to update the parameters, we need to zero the gradients afterward. And that’s what zero_() is good for.
# This code will be placed after Step 4 (updating the parameters) b.grad.zero_(), w.grad.zero_()
So, we can definitely ditch the manual computation of gradients and use both backward() and zero_() methods instead.
That’s it? Well, pretty much… but there is always a catch, and this time it has to do
with the update of the parameters…
To update a parameter, we multiply its gradient by a learning rate, flip the sign, and add it to the parameter’s former value. So, let’s first set our learning rate:
lr = 0.1
And then use it to perform the updates:
# Attempt at Step 4 b -= lr * b.grad w -= lr * w.grad
But, it turns out we cannot simply perform an update like this! Why not?! It turns out to be a case of “too much of a good thing”. The culprit is PyTorch’s ability to build a dynamic computation graph from every Python operation that involves any gradient-computing tensor or its dependencies.
So, how do we tell PyTorch to “back off” and let us update our parameters without messing up with its fancy dynamic computation graph? That’s what torch.no_grad() is good for. It allows us to perform regular Python operations on tensors, without affecting PyTorch’s computation graph
This time, the update will work as expected:
# Step 4, for real with torch.no_grad(): b -= lr * b.grad w -= lr * w.grad
Mission accomplished! We updated our parameters b and w using PyTorch’s automatic differentation package, autograd.
I mean, we updated it once. To actually train a model, we need to place this code inside a loop. Putting it all together, and adding a loop to it, the code should look like this:
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Step 0 - Initializes parameters "b" and "w" randomly torch.manual_seed(42) b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device) w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device) lr = 0.1 for epoch in range(200): # Step 1 - Computes our model's predicted output - forward pass yhat = b + w * x_train_tensor # Step 2 - Computes the loss # We are using ALL data points, so this is BATCH gradient descent. # How wrong is our model? That's the error! error = (y_train_tensor - yhat) # It is a regression, so it computes mean squared error (MSE) loss = (error ** 2).mean() # Step 3 - Computes gradients for both "b" and "w" parameters # No more manual computation of gradients! loss.backward() # Step 4, for real with torch.no_grad(): b -= lr * b.grad w -= lr * w.grad # This code will be placed after Step 4 (updating the parameters) b.grad.zero_(), w.grad.zero_()
That was autograd in action! Now it is time to take a peek at the…
Dynamic Computation Graph
“Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.”
I want you to see the graph for yourself too!
The PyTorchViz package and its make_dot(variable) method allow us to easily visualize a graph associated with a given Python variable involved in the gradient computation.
So, let’s stick with the bare minimum: two (gradient computing) tensors for our parameters (b and w) and the predictions (yhat) – these are Steps 0 and 1.
Running the code above will show us the graph below:
Let’s take a closer look at its components:
- blue boxes ((1)s): these boxes correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for
- gray box (MulBackward0): a Python operation that involves a gradient-computing tensor or its dependencies
- green box (AddBackward0): the same as the gray box, except that it is the starting point for the computation of gradients (assuming the backward() method is called from the variable used to visualize the graph)— they are computed from the bottom-up in a graph
Now, take a closer look at the green box at the bottom of the graph: two arrows are pointing to it since it is adding up two variables, b, and w*x. Seems obvious, right?
Then, look at the gray box (MulBackward0) of the same graph: it is performing a multiplication, namely, w*x. But there is only one arrow pointing to it! The arrow comes from the blue box that corresponds to our parameter w.
“Why don’t we have a box for our data (x)?“
The answer is: we do not compute gradients for it!
So, even though there are more tensors involved in the operations performed by the computation graph, it only shows gradient-computing tensors and its dependencies.
What would happen to the computation graph if we set requires_grad to False for our parameter b?
# New Step 0 b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device) w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device) # New Step 1 yhat = b_nograd + w * x_train_tensor make_dot(yhat)
Unsurprisingly, the blue box corresponding to the parameter b is no more!
Simple enough: no gradients, no graph!
The best thing about the dynamic computation graph is the fact that you can make it as complex as you want it. You can even use control flow statements (e.g., if statements) to control the flow of the gradients.
The figure below shows an example of this. And yes, I do know that the computation itself is complete nonsense…
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device) w = torch.randn(1, requires_grad=True,dtype=torch.float, device=device) yhat = b + w * x_train_tensor error = y_train_tensor - yhat loss = (error ** 2).mean() # this makes no sense!! if loss > 0: yhat2 = w * x_train_tensor error2 = y_train_tensor - yhat2 # neither does this :-) loss += error2.mean() make_dot(loss)
Even though the computation is nonsensical, you can clearly see the effect of adding a control flow statement like if loss > 0: it branches the computation graph in two parts. The right branch performs the computation inside the if statement, which gets added to the result of the left branch in the end. Cool, right?
To be continued…
Autograd is just the beginning! Interested in learning more about training a model using PyTorch in a structured, and incremental way?
Don’t miss my talk at ODSC Europe 2020: “PyTorch 101: building a model step-by-step.”
The content of this post was adapted from my book “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”. Learn more about it at http://leanpub.com/pytorch.
About the author/speaker:
Daniel is a data scientist, developer, and author of “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.
He has been teaching machine learning and distributed computing technologies at Data Science Retreat, the longest-running Berlin-based bootcamp, for more than three years, helping more than 150 students advance their careers.