Python is one of the most popular languages in the world. It’s used in a lot of different fields, like web services, automation, data science, managing computer infrastructure, and artificial intelligence and machine learning.

Its readable and concise syntax makes it a great option as a first programming language for students, but behind the façade of an easy and friendly language, there’s a huge amount of power.

Python is easy to learn and to use, sure, but it’s also capable of fantastic feats in demanding environments like video games, banking services, healthcare, or state-of-the-art scientific research. 

Python is particularly well regarded for data processing, for several reasons, including:

  • Python allows writing code in different styles, without forcing you into a specific way of doing things. It’s very easy to create prototypes and experiment with code. Processing data, particularly from not-very-clean sources, requires a lot of tweaking, back and forth, and struggling to capture every possibility.
  • Python 3 greatly improved multilanguage support by making every string Unicode and using UTF-8 as the default source encoding, which helps when processing data in different languages encoded in different character sets.
  • The standard library is very powerful and full of useful modules to work natively with common file formats like CSV files, zip files, databases, etc.
  • The third-party ecosystem for Python is huge, and has incredibly good modules that extend the capabilities of a program. There are libraries to connect to almost any kind of database, create complex RESTful services, generate machine learning models, draw all kinds of graphs, including interactive ones, produce reports in formats like PDF or Word, analyze geospatial data, create command-line or graphical interfaces, parse data, and everything in between. The composability of the tools makes it easy to use several of them together, for example, to analyze geospatial data, create some graphs with the findings, and generate a report in PDF afterward.
  • The tooling includes powerful integrated environments like Jupyter Notebooks, where you can execute code and get instant feedback. Python is quite agnostic in terms of the development environment, working with anything from a simple text editor (I personally use Vim) to advanced options like Visual Studio.
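As a small illustration of the point about character sets, here is a minimal sketch of how Python 3 keeps text (str) separate from its encoded bytes. The sample word and the two encodings are chosen only for illustration.

```python
# Hypothetical sample: the same word stored with two different encodings.
text = "caf\u00e9"  # "café", written with an escape to keep the source ASCII
utf8_bytes = text.encode("utf-8")      # the accented letter takes two bytes
latin1_bytes = text.encode("latin-1")  # the accented letter takes one byte

# Decoding each byte string with its own codec yields the same str.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin1 = latin1_bytes.decode("latin-1")

print(from_utf8 == from_latin1)    # True
print(len(text), len(utf8_bytes))  # 4 5
```

Once the bytes are decoded, the rest of the program works with regular strings and doesn't need to know which character set the data came from.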

A quick but insightful way of showing the power of Python for data processing is to present how easy it is to operate with common files. Let’s imagine that we have a text file with a bunch of lines with numbers, and we want to calculate their average and store it in a new file.


We can read the file using a with clause, which will automatically close the file when the block finishes. The file is opened as text by default, which allows reading it line by line by iterating through it.

with open('example.txt') as file:
    numbers = [int(line) for line in file]

The list numbers is built by processing the file line by line and transforming each line into an integer, as it’s read as text. This structure, with a loop between brackets, is called a list comprehension in Python and allows generating lists in an easy and readable way.

The average can be calculated by adding all numbers and dividing by the number of them, the length of the list.

average = sum(numbers) / len(numbers)

Finally, we store the result in a new file. We use the same with clause, but this time opening the file in write mode by adding 'w'. The file is also written in text format.

with open('result.txt', 'w') as file:
    file.write(f'Average: {average}')

The f-string works as a template string: the variable average, placed between curly brackets, is replaced by its value.

And that’s it. Five lines of code that deal with reading from one file and writing to another, transform the input from text to integers, and perform the calculation. All the code is very easy to follow.
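For reference, the pieces above can be put together into one runnable sketch. The function name average_file, the temporary directory, and the sample values are assumptions added for illustration, and the blank-line filter is an extra safeguard that the snippets above don't include.

```python
import tempfile
from pathlib import Path


def average_file(source, target):
    """Read one integer per line from source and write the average to target."""
    with open(source) as file:
        # Skip blank lines, such as an empty line at the end of the file.
        numbers = [int(line) for line in file if line.strip()]

    average = sum(numbers) / len(numbers)

    with open(target, 'w') as file:
        file.write(f'Average: {average}')

    return average


# Sample data in a temporary directory (the values are made up).
workdir = Path(tempfile.mkdtemp())
source = workdir / 'example.txt'
source.write_text('10\n20\n30\n\n')

result = average_file(source, workdir / 'result.txt')
print((workdir / 'result.txt').read_text())  # Average: 20.0
```

Note that an empty input file would still raise a ZeroDivisionError here, so real data may need an extra check.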

This method of reading from a text input, performing some calculations, and dumping the results into a text output is very useful in data processing, as several steps can be chained to build complex pipelines. Because the intermediate results are stored, an error only requires repeating the affected steps, not starting from the beginning. And because reading and writing are so easy, adding save points to avoid processing the same data multiple times costs very little.
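One possible way to sketch this idea is a small helper that skips a step when its output file already exists. The helper name run_step, the file names, and the sample data are hypothetical, and this is only one of many ways to organize such a pipeline.

```python
import tempfile
from pathlib import Path


def run_step(source, target, transform):
    """Run one pipeline step unless its output file already exists."""
    if target.exists():
        return False  # Save point found: reuse the stored result.
    with open(source) as infile:
        lines = [transform(line.strip()) for line in infile if line.strip()]
    with open(target, 'w') as outfile:
        outfile.write('\n'.join(lines) + '\n')
    return True


workdir = Path(tempfile.mkdtemp())
raw = workdir / 'raw.txt'
cleaned = workdir / 'cleaned.txt'
doubled = workdir / 'doubled.txt'
raw.write_text(' 1 \n2\n\n 3\n')  # Made-up, slightly messy input.

# Step 1 removes stray whitespace and blank lines; step 2 doubles each number.
ran_clean = run_step(raw, cleaned, lambda value: value)
ran_double = run_step(cleaned, doubled, lambda value: str(int(value) * 2))

# Running step 1 again is skipped because cleaned.txt is already there.
ran_clean_again = run_step(raw, cleaned, lambda value: value)

print(doubled.read_text().split())  # ['2', '4', '6']
```

If a step fails, deleting its output file and running the script again repeats only that step and anything after it that has no stored result yet. A step that crashes halfway could leave an incomplete file behind, so a more careful version might write to a temporary name and rename it at the end.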

This barely scratches the surface, of course. Python includes modules in its standard library to read and write CSV files, and there are a lot of third-party options to read and write other formats like HTML, PDF, or even Word or Excel.
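As a minimal sketch of the standard library csv module, the same average can be calculated from a CSV column. The column names and values are made up, and io.StringIO stands in for real files so the example is self-contained.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
data = io.StringIO('name,value\nalpha,10\nbeta,30\n')

# DictReader uses the first row as the field names.
rows = list(csv.DictReader(data))
values = [int(row['value']) for row in rows]
average = sum(values) / len(values)

# Write a one-row summary in CSV format.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['rows', 'average'])
writer.writerow([len(values), average])

print(output.getvalue())
```

With real files, the csv documentation recommends opening them with newline='' so the module can handle line endings itself.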

There are also modules that present the information not only as text but also as different kinds of graphics, like the useful Matplotlib, and powerful data manipulation modules like Pandas to crunch the numbers and obtain insightful results.

I’ll be running the virtual workshop “Basic Python for Data Processing” at the ODSC Europe conference this June, where we will talk about basic techniques for starting with Python for Data Analysis, including dealing with all these kinds of files, generating graphics, creating command-line interfaces, and other tips for working with data.


About the author/ODSC Europe 2021 speaker on Python for Data Processing:

Jaime Buelta works as a Software Architect at Double Yard in Dublin, where he works to deliver innovative AI solutions in the area of Education. He recently published the second edition of the “Python Automation Cookbook”, which describes useful ways of using Python to do the heavy lifting for people beginning their Python journey.