Pandas is a popular data analysis library built on top of the Python programming language, and getting started with it is straightforward. It assists with common manipulations for data cleaning, joining, sorting, filtering, deduping, and more. First released in 2009, pandas now sits at the center of Python’s vast data science ecosystem and is an essential tool in the modern data analyst’s toolbox.
Pandas represents a fantastic step forward for graphical spreadsheet users who’d like to handle larger amounts of data, perform more complex operations, and automate the steps of their analysis routines. I like to introduce the tool as “Excel on steroids.”
Here’s some good news: you don’t need to be a software engineer to work effectively with the library. In fact, pandas offers Excel users a great bridge to get started with Python and programming in general. If you’ve never written a line of code before, you’ll be pleasantly surprised by how many spreadsheet operations already require you to think like a developer.
Let’s explore a sample dataset to see some of the library’s powerful features. If you’d like a deeper dive into the syntax and mechanics of pandas, tune in to my upcoming ODSC workshop this October, “Getting Started with Pandas for Data Analysis.”
We’ll start by importing pandas and assigning it the conventional alias pd.
import pandas as pd
Our dataset is a CSV file of titles available on the online streaming service Netflix. Each row includes the title’s name, type, release year, duration, and the categories it’s listed in.
netflix = pd.read_csv("netflix_titles.csv")
netflix.head()
Let’s say we’re in the mood for a 90s comedy film. We can find the subset of rows that fit our criteria by applying filtering conditions to the type, release_year and listed_in columns. First up, let’s extract the rows with a value of “Movie” in the type column.
movies = netflix["type"] == "Movie"
netflix[movies].head()
Next up, let’s find our comedies. We’ll need to be a bit clever here. There are four categories in the listed_in column that we should include: “Comedies”, “Stand-Up Comedy”, “TV Comedies”, and “Stand-Up Comedy & Talk Shows”. These categories can also appear alongside unrelated categories in the same cell. We can use a regular expression to identify the titles whose listed_in text includes the substring “Comed” followed by any characters.
comedies = netflix["listed_in"].str.contains(r'Comed.*')
netflix[comedies].head()
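To see how the pattern behaves, the sketch below runs str.contains on a handful of made-up listed_in values (the sample strings are invented for illustration, not taken from the real file). Because str.contains searches anywhere in the text, the trailing .* should not change which rows match.

```python
import pandas as pd

# Hypothetical listed_in values, invented for illustration
listed_in = pd.Series([
    "Comedies, Dramas",
    "Stand-Up Comedy",
    "Kids' TV, TV Comedies",
    "Stand-Up Comedy & Talk Shows",
    "Documentaries",
])

# str.contains treats the pattern as a regular expression by default
with_wildcard = listed_in.str.contains(r"Comed.*")
without_wildcard = listed_in.str.contains("Comed")

print(with_wildcard.tolist())
print(with_wildcard.equals(without_wildcard))
```

If the column might hold missing values, passing na=False to str.contains keeps the resulting mask strictly Boolean.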
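Finally, we can use the between method to find the titles released in the 1990s. Both endpoints are included by default.
made_in_nineties = netflix["release_year"].between(1990, 1999)
netflix[made_in_nineties].head()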
We’ve now declared three individual conditions to filter the dataset. The final step is to apply all three conditions together. In the next example, we ask pandas for all titles that are movies and comedies and released between 1990 and 1999.
netflix[movies & comedies & made_in_nineties].head()
We can also chain the sort_values method to order the matching titles by release year.
netflix[movies & comedies & made_in_nineties].sort_values("release_year").head()
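Since the CSV file itself is not included here, the following sketch replays the three filters on a tiny stand-in DataFrame. The column names match the ones used above; every row is made up for illustration.

```python
import pandas as pd

# A tiny stand-in for netflix_titles.csv; every row here is made up
netflix = pd.DataFrame({
    "title": ["Alpha", "Bravo", "Charlie", "Delta", "Echo"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
    "release_year": [1998, 1992, 1995, 2005, 1996],
    "listed_in": [
        "Comedies, Dramas",
        "Stand-Up Comedy",
        "TV Comedies",
        "Comedies",
        "Documentaries",
    ],
})

movies = netflix["type"] == "Movie"
comedies = netflix["listed_in"].str.contains("Comed")
made_in_nineties = netflix["release_year"].between(1990, 1999)  # inclusive on both ends

# & combines the Boolean Series element-wise: a row survives only if all three are True
result = netflix[movies & comedies & made_in_nineties].sort_values("release_year")
print(result["title"].tolist())
```

Only "Bravo" and "Alpha" pass all three conditions here: "Charlie" is a TV show, "Delta" was released outside the 1990s, and "Echo" is not a comedy.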
For a more in-depth overview of getting started with Pandas, check out my upcoming ODSC talk this October, “Getting Started with Pandas for Data Analysis.” We’ll explore several real-world datasets and walk through many of the features of this powerful data analysis tool.
About the author/ODSC West 2020 speaker: Boris Paskhaver is a full-stack web developer based in New York City with experience building apps in React / Redux and Ruby on Rails. His favorite part of programming is the never-ending sense that there’s always something new to master — a secret language feature, a popular design pattern, an emerging library or — most importantly — a different way of looking at a problem.
Autograd is PyTorch’s automatic differentiation package. Thanks to it, we don’t need to worry about partial derivatives, chain rule, or anything like it.
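To illustrate how it works, let’s say we’re trying to fit a simple linear regression with a single feature x, using Mean Squared Error (MSE) as our loss:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - b - w x_i\right)^2$$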
Without PyTorch, we would have to start with our loss, and work the partial derivatives out to compute the gradients manually. Sure, it would be easy enough to do it for this toy problem, but we need something that can scale.
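For reference, the manual route might look like the sketch below. It assumes the model ŷ = b + w·x with an MSE loss, for which the partial derivatives work out to ∂MSE/∂b = −2·mean(y − ŷ) and ∂MSE/∂w = −2·mean(x·(y − ŷ)). The synthetic data, the starting values, and the mse helper are made up for illustration, and a finite-difference check is included as a sanity test.

```python
import numpy as np

# Made-up data for illustration: y is roughly 1 + 2x
rng = np.random.default_rng(42)
x = rng.random(100)
y = 1 + 2 * x + 0.1 * rng.standard_normal(100)

def mse(b, w):
    # Hypothetical helper: mean squared error of the line b + w * x
    return ((y - (b + w * x)) ** 2).mean()

# Arbitrary starting values for the parameters
b, w = 0.5, -0.3

# Gradients worked out by hand from the MSE formula
error = y - (b + w * x)
b_grad = -2 * error.mean()
w_grad = -2 * (x * error).mean()

# Sanity check against a central finite-difference approximation
eps = 1e-6
b_grad_fd = (mse(b + eps, w) - mse(b - eps, w)) / (2 * eps)
w_grad_fd = (mse(b, w + eps) - mse(b, w - eps)) / (2 * eps)
print(b_grad, b_grad_fd)
print(w_grad, w_grad_fd)
```

This is manageable for two parameters, but every change to the model or the loss means redoing the derivation by hand.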
So, how do we do it? PyTorch provides some really handy methods we can use to easily compute the gradients. Let’s check them out!
What distinguishes a tensor used for training data (or validation, or test) from a tensor used as a (trainable) parameter/weight?
The latter requires the computation of its gradients, so we can update their values (the parameters’ values, that is). That’s what the requires_grad=True argument is good for. It tells PyTorch to compute gradients for us.
Remember: a tensor for a learnable parameter requires a gradient!
In code, creating tensors for our two parameters looks like this:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
So, how do we tell PyTorch to do its thing and compute all gradients? That’s the role of the backward() method. It will compute gradients for all (requiring gradient) tensors involved in the computation of a given variable.
Do you remember the starting point for computing the gradients? It is the loss: the gradients are its partial derivatives with respect to our parameters.
Hence, we need to invoke the backward() method from the corresponding Python variable: loss.backward().
The code below illustrates it. Both the predictions and the loss are computed with PyTorch tensors, since backward() is only available on tensors, not on Numpy arrays:
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor

# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient descent.
# How wrong is our model? That's the error!
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
loss.backward()
Which tensors are going to be handled by the backward() method applied to the loss?
We have set requires_grad=True to both b and w, so they are obviously included in the list. We use them both to compute yhat, so it will also make it to the list. Then we use yhat to compute the error, which is also added to the list.
Do you see the pattern here? If a tensor in the list is used to compute another tensor, the latter will also be included in the list. Tracking these dependencies is exactly what the dynamic computation graph is doing, as we’ll see shortly.
What about x_train_tensor and y_train_tensor? They are involved in the computation too… but they contain data, and thus they are not created as gradient-requiring tensors. So, backward() does not care about them.
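The "list" described above can be checked directly through each tensor's requires_grad attribute. The sketch below assumes small made-up data tensors in place of the real training set.

```python
import torch

torch.manual_seed(42)

# Made-up data tensors; by default they do not require gradients
x_train_tensor = torch.rand(10)
y_train_tensor = 1 + 2 * x_train_tensor

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

for name, tensor in [("x_train_tensor", x_train_tensor), ("b", b),
                     ("yhat", yhat), ("error", error), ("loss", loss)]:
    print(name, tensor.requires_grad)

loss.backward()
# After backward(), the parameters hold gradients; the data tensor has none
print(x_train_tensor.grad)
```

Anything computed from b or w inherits requires_grad=True, while the data tensors stay outside the bookkeeping.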
What about the actual values of the gradients? We can inspect them by looking at the grad attribute of each tensor.
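As a quick illustration, the sketch below prints the grad attributes after a backward pass and compares them with the hand-derived MSE gradients, −2·mean(error) for b and −2·mean(x·error) for w. The data is made up and noise-free.

```python
import torch

torch.manual_seed(42)
x_train_tensor = torch.rand(100)
y_train_tensor = 1 + 2 * x_train_tensor  # made-up, noise-free targets

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
loss.backward()

print(b.grad, w.grad)

# The same gradients from the hand-derived MSE formulas, for comparison
with torch.no_grad():
    b_grad_manual = -2 * error.mean()
    w_grad_manual = -2 * (x_train_tensor * error).mean()
print(b_grad_manual, w_grad_manual)
```

The two pairs of numbers should agree up to floating-point rounding.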
OK, we got gradients, but there is one more thing to pay attention to: by default, PyTorch accumulates the gradients. How to handle that?
Every time we use the gradients to update the parameters, we need to zero the gradients afterward. And that’s what zero_() is good for.
# This code will be placed after Step 4 (updating the parameters)
b.grad.zero_()
w.grad.zero_()
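To make the accumulation visible, here is a minimal sketch with a single made-up parameter and a toy loss. It calls backward() twice without zeroing in between, then resets the gradient.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])           # made-up data
w = torch.tensor([0.5], requires_grad=True)

# First backward pass
loss = ((w * x) ** 2).mean()
loss.backward()
first = w.grad.clone()

# Second backward pass on a freshly built graph, WITHOUT zeroing in between
loss = ((w * x) ** 2).mean()
loss.backward()
accumulated = w.grad.clone()   # the second gradient was added to the first

# zero_() resets the stored gradient in place
w.grad.zero_()
print(first, accumulated, w.grad)
```

Since w did not change between the two passes, the accumulated value is twice the gradient of a single pass. That is the behavior that would distort the parameter updates if the gradients were never zeroed.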
So, we can definitely ditch the manual computation of gradients and use both backward() and zero_() methods instead.
That’s it? Well, pretty much… but there is always a catch, and this time it has to do with the update of the parameters…
To update a parameter, we multiply its gradient by a learning rate, flip the sign, and add it to the parameter’s former value. So, let’s first set our learning rate:
lr = 0.1
And then use it to perform the updates:
# Attempt at Step 4
b -= lr * b.grad
w -= lr * w.grad
But, it turns out we cannot simply perform an update like this! Why not?! It turns out to be a case of “too much of a good thing”. The culprit is PyTorch’s ability to build a dynamic computation graph from every Python operation that involves any gradient-computing tensor or its dependencies.
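A small experiment makes the failure concrete. In current PyTorch releases, the in-place update of a gradient-requiring parameter raises a RuntimeError about an in-place operation on a leaf tensor; the exact wording of the message may differ between versions. The toy loss below is made up just to populate b.grad.

```python
import torch

b = torch.randn(1, requires_grad=True)
loss = (b ** 2).sum()   # toy loss, only here to produce a gradient
loss.backward()
lr = 0.1

try:
    b -= lr * b.grad    # in-place update of a leaf tensor that requires grad
    update_failed = False
except RuntimeError as err:
    update_failed = True
    print(err)
```

PyTorch refuses the update because the subtraction would itself be recorded as part of the graph it is tracking for b.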
So, how do we tell PyTorch to “back off” and let us update our parameters without messing with its fancy dynamic computation graph? That’s what torch.no_grad() is good for. It allows us to perform regular Python operations on tensors without affecting PyTorch’s computation graph.
This time, the update will work as expected:
# Step 4, for real
with torch.no_grad():
    b -= lr * b.grad
    w -= lr * w.grad
Mission accomplished! We updated our parameters b and w using PyTorch’s automatic differentiation package, autograd.
I mean, we updated it once. To actually train a model, we need to place this code inside a loop. Putting it all together, and adding a loop to it, the code should look like this:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 0.1

# x_train_tensor and y_train_tensor are assumed to exist already on the same device
for epoch in range(200):
    # Step 1 - Computes our model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient descent.
    # How wrong is our model? That's the error!
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    # No more manual computation of gradients!
    loss.backward()

    # Step 4, for real
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad

    # Zeroes the gradients after updating the parameters
    b.grad.zero_()
    w.grad.zero_()
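The listing above relies on training tensors that are not defined in this excerpt. The sketch below is a self-contained variant that runs on the CPU with made-up data generated from y = 1 + 2x plus a little noise. The data, the noise level, and the choice of 1000 epochs (instead of 200) are assumptions made for this illustration only.

```python
import torch

torch.manual_seed(42)

# Made-up training data: y = 1 + 2x plus a little noise (assumed for illustration)
x_train_tensor = torch.rand(100)
y_train_tensor = 1 + 2 * x_train_tensor + 0.05 * torch.randn(100)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

lr = 0.1
for epoch in range(1000):
    # Forward pass and loss
    yhat = b + w * x_train_tensor
    loss = ((y_train_tensor - yhat) ** 2).mean()

    # Gradients, parameter update, and reset
    loss.backward()
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
    b.grad.zero_()
    w.grad.zero_()

print(b.item(), w.item())
```

With this data, the learned b and w should land close to the values 1 and 2 used to generate it.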
That was autograd in action! Now it is time to take a peek at the…
Dynamic Computation Graph
“Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.”
I want you to see the graph for yourself too!
The PyTorchViz package and its make_dot(variable) function allow us to easily visualize the graph associated with a given Python variable involved in the gradient computation.
So, let’s stick with the bare minimum: two (gradient computing) tensors for our parameters (b and w) and the predictions (yhat) – these are Steps 0 and 1.
Running the code above will show us the graph below:
Let’s take a closer look at its components:
- blue boxes ((1)s): these boxes correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for
- gray box (MulBackward0): a Python operation that involves a gradient-computing tensor or its dependencies
- green box (AddBackward0): the same as the gray box, except that it is the starting point for the computation of gradients (assuming the backward() method is called from the variable used to visualize the graph)— they are computed from the bottom-up in a graph
Now, take a closer look at the green box at the bottom of the graph: two arrows are pointing to it since it is adding up two variables, b, and w*x. Seems obvious, right?
Then, look at the gray box (MulBackward0) of the same graph: it is performing a multiplication, namely, w*x. But there is only one arrow pointing to it! The arrow comes from the blue box that corresponds to our parameter w.
“Why don’t we have a box for our data (x)?“
The answer is: we do not compute gradients for it!
So, even though there are more tensors involved in the operations performed by the computation graph, it only shows gradient-computing tensors and their dependencies.
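The same structure can be inspected without any plotting, through the grad_fn attribute of a tensor and the next_functions attribute of each graph node. The sketch below rebuilds Steps 0 and 1 with made-up data; node_names is a hypothetical helper written for this illustration, not part of PyTorch.

```python
import torch

torch.manual_seed(42)
x_train_tensor = torch.rand(10)                 # made-up data, no gradients
b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor

def node_names(fn):
    # Names of the nodes feeding into fn; None marks an input without gradients
    return [type(child).__name__ if child is not None else None
            for child, _ in fn.next_functions]

add_node = yhat.grad_fn
print(type(add_node).__name__)       # the addition b + (w * x)
print(node_names(add_node))          # inputs of the addition

mul_node = add_node.next_functions[1][0]
print(type(mul_node).__name__)       # the multiplication w * x
print(node_names(mul_node))          # the data tensor x shows up as None
```

The parameters appear as AccumulateGrad nodes, which play the role of the blue boxes, while the slot for x is empty. That matches the single arrow pointing into MulBackward0.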
What would happen to the computation graph if we set requires_grad to False for our parameter b?
# New Step 0
b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# New Step 1
yhat = b_nograd + w * x_train_tensor

make_dot(yhat)
Unsurprisingly, the blue box corresponding to the parameter b is no more!
Simple enough: no gradients, no graph!
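Taken to the extreme, "no gradients, no graph" means that a result computed only from tensors without gradients has no grad_fn at all. A minimal sketch with made-up tensors (w_nograd and yhat_untracked are names introduced here for illustration):

```python
import torch

x_train_tensor = torch.rand(10)   # made-up data
b_nograd = torch.randn(1, requires_grad=False)
w = torch.randn(1, requires_grad=True)
w_nograd = torch.randn(1, requires_grad=False)

# One gradient-requiring input is enough for the result to be tracked
yhat = b_nograd + w * x_train_tensor
print(yhat.requires_grad, yhat.grad_fn)

# With no gradient-requiring inputs at all, nothing is recorded
yhat_untracked = b_nograd + w_nograd * x_train_tensor
print(yhat_untracked.requires_grad, yhat_untracked.grad_fn)
```

In the first case the slot for b_nograd in the addition node is simply empty; in the second there is no node to start from.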
The best thing about the dynamic computation graph is the fact that you can make it as complex as you want it. You can even use control flow statements (e.g., if statements) to control the flow of the gradients.
The figure below shows an example of this. And yes, I do know that the computation itself is complete nonsense…
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = y_train_tensor - yhat2
    # neither does this :-)
    loss += error2.mean()

make_dot(loss)
Even though the computation is nonsensical, you can clearly see the effect of adding a control flow statement like if loss > 0: it branches the computation graph in two parts. The right branch performs the computation inside the if statement, which gets added to the result of the left branch in the end. Cool, right?
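The effect of the branch also shows up in the gradients themselves. In the sketch below, w_gradient is a hypothetical helper that builds an equally nonsensical loss with or without the extra term, using made-up data, and returns the gradient of w.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # made-up data
y = torch.tensor([2.0, 4.0, 6.0])

def w_gradient(take_branch):
    # Hypothetical helper: builds the loss, optionally adding an extra term
    w = torch.tensor([1.0], requires_grad=True)
    loss = ((y - w * x) ** 2).mean()
    if take_branch:
        # Ordinary Python control flow decides what ends up in the graph
        loss = loss + (y - w * x).mean()
    loss.backward()
    return w.grad.item()

grad_without = w_gradient(False)
grad_with = w_gradient(True)
print(grad_without, grad_with)
```

The extra term contributes −mean(x) to the gradient of w, so the two results differ by exactly that amount. The graph is rebuilt on every forward pass, which is why a plain if statement is enough to change it.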
To be continued…
Autograd is just the beginning! Interested in learning more about training a model using PyTorch in a structured and incremental way?
Don’t miss my talk at ODSC Europe 2020: “PyTorch 101: building a model step-by-step.”
The content of this post was adapted from my book “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”. Learn more about it at http://leanpub.com/pytorch.
About the author/speaker:
He has been teaching machine learning and distributed computing technologies at Data Science Retreat, the longest-running Berlin-based bootcamp, for more than three years, helping more than 150 students advance their careers.
Editor’s Note: See Phil present his talk “Python for Data Acquisition” at ODSC West 2019.
What does it take, on the technical side, to get a project started? After you have an idea and find something you want to study or look into, you need to get some data. Where do you get data? Primary sources? Web sites? Databases? There are many different sources and possibilities. Which ones should you choose? How can you be sure the data will remain available so that your work stays reproducible? Will it be easy to update once new data becomes available? These are just the first of the issues involved in acquiring data for your project, but you can use Python for data acquisition to make it easier.
There are many sources of freely available public data. The US Federal Government runs Data.gov for its public data. The topics covered on this site include everything the government runs, such as agriculture, climate, education, transportation, and energy. Individual divisions of the federal government, like NASA, may also have their own open data. Most states and cities also run web sites with a lot of data. ODSC West 2019 is in San Francisco, and the city has its own web site of local government data, Data SF.
Other governments and NGOs offer similar portals:
- European Union
- Data World
Google (Public Data Directory), Amazon (AWS Open Data), Microsoft, and IBM (Cloud Data Services) all have open data sets for public use. GitHub hosts lists that keep track of many more sites, like Awesome Public Datasets. With a little looking around, there is a set for almost anything you want to study!
Even this almost infinite supply of options doesn’t mean that the data is ready to go for your application or model. You still need to download the data and parse it into a usable format. The data on these sites is stored in a variety of formats, including GIS, CSV, XML, JSON, text, HTML, and various binary types. It is quite possible for your project to need data from multiple sources and in multiple formats. This can create a variety of issues for any project, whether it is just getting started or already under way.
Once we have this data, how do we hang on to it? For each application or model that is built on this, do we want to download it again? What happens if the website goes away or changes its policies or changes its format? Storing all of this data in your own database can ease this issue. Once you have downloaded, cleaned up and gotten your data ready, store it in a local database. From there all future applications and models only need to access the database without worrying about all of the other issues of getting the data.
This is where Python comes in! It can handle all of these tasks with the right libraries and some coding. Using the Requests library, downloading web pages and other files is very simple. With the correct credentials, it can also log into a server for non-public or restricted data. If the files are compressed, Python has archiving libraries for this. For the various formats, there are Python modules such as csv, json, and re (regular expressions). From there, storing data in a database can be done by wrapping SQL in Python via psycopg2 or by defining ORM models in SQLAlchemy.
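The workflow described above (parse several formats, clean them up, store them locally) can be sketched with the standard library alone. To keep the example runnable without a network connection or a database server, it makes two substitutions that should be read as assumptions: the "downloaded" payloads are made-up strings, where a real project might fetch them with something like requests.get(url).text, and sqlite3 stands in for a PostgreSQL database accessed through psycopg2 or SQLAlchemy. The table and column names are invented.

```python
import csv
import io
import json
import sqlite3

# Made-up payloads standing in for downloaded files
csv_text = "city,population\nSpringfield,30000\nShelbyville,25000\n"
json_text = '[{"city": "Capital City", "population": 120000}]'

# Parse both formats into one list of (city, population) rows
rows = [(r["city"], int(r["population"])) for r in csv.DictReader(io.StringIO(csv_text))]
rows += [(r["city"], int(r["population"])) for r in json.loads(json_text)]

# Store the cleaned rows locally; ":memory:" keeps this sketch free of files
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT PRIMARY KEY, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", rows)
conn.commit()

# Later applications query the database instead of re-downloading
largest = conn.execute(
    "SELECT city, population FROM cities ORDER BY population DESC LIMIT 1"
).fetchone()
print(largest)
conn.close()
```

Once the data sits in a database, the original source can change its format or disappear without breaking the applications built on top of it.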
This Course on Python for Data Acquisition
The goal of this course, which I’ll be presenting at ODSC West 2019, is to expose all of the students to this process and give them a few labs where they will get to do this. The students will learn to parse various data file formats, download data and interact with a database for storing and retrieving data.
Originally posted on OpenDataScience.com.