Recently I’ve been trying to expand my programming know-how by investing some time in learning a compiled language. I ended up choosing Rust, not for any particular technical reason such as code security or a helpful compiler, but because the community felt a lot like the Python community, which I am a huge fan of.

I’ve been enjoying Rust a lot, but I am a Python boy at heart, so I wanted to learn how to bind the two languages together and enjoy the best of both. Unfortunately, as I have been discovering lately, Python’s GIL (the Global Interpreter Lock) intentionally gets in the way of writing threadable code.

The long and short of it is that Python’s GIL functions like a talking stick. Only the thread holding GIL can execute its bytecode. This means that executing multiple CPU-bound tasks across threads is essentially moot, as only one thread gets executed at a time as decided by the OS scheduler.
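To make the talking stick concrete, here is a minimal sketch that times a pure-Python busy loop run twice in a row versus once in each of two threads. The function names and loop sizes are made up for illustration. On a standard GIL-enabled CPython build I would expect the two timings to come out in the same ballpark, though the exact numbers depend on the machine and the Python version.

```python
import threading
import time


def count_down(n):
    """Pure-Python busy loop; every iteration needs GIL to execute."""
    while n > 0:
        n -= 1


def time_sequential(n, k):
    """Run count_down(n) k times in a row and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(k):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, k):
    """Run count_down(n) once in each of k threads and return the elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(k)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, k = 2_000_000, 2  # hypothetical sizes, small enough to finish quickly
    print(f"sequential: {time_sequential(n, k):.2f}s")
    print(f"threaded:   {time_threaded(n, k):.2f}s")
```

The threads do all run to completion; they just take turns holding GIL rather than running side by side.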

Despite being the current best model available to keep Python threadsafe without exposing the end user to security issues, GIL is a topic in the Python community that draws a lot of contention and has generated many attempts to replace it. None have been successful enough to be pulled into CPython. However, this year Python’s Steering Council accepted PEP 703, which will allow GIL to be optional. Opting to turn GIL off comes at the expense of thread safety. I am curious how people will make use of this new feature when it arrives.

However, turning off GIL isn’t the only path to threadable code. There are already built-in ways to “pass” GIL to another thread if you want to. I am aware of two: GIL is released during IO-bound tasks such as reading in a large file, and it can be released during the execution of an external library written in a compiled language like C or Rust. It is this latter option that I am interested in exploring.
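The first of these two paths is easy to see from plain Python. The sketch below uses time.sleep as a stand-in for a blocking read (an assumption for the sake of a self-contained example; real file or network IO behaves similarly but is harder to time reliably). Since the sleeping thread hands GIL back while it waits, four 0.2 second waits in four threads should finish in roughly 0.2 seconds rather than 0.8.

```python
import threading
import time


def fake_io(seconds):
    """Stand-in for a blocking IO call; time.sleep releases GIL while waiting."""
    time.sleep(seconds)


def time_overlapping_waits(k, seconds):
    """Start k threads that each wait for `seconds`; return the total elapsed time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(k)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_overlapping_waits(4, 0.2)
    print(f"4 threads waiting 0.2s each took {elapsed:.2f}s in total")
```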

In what follows, we’ll write a small Rust library and bind its features to a Python module. After that, we’ll run some benchmarking to exemplify some of the points I’ve made in the last couple of paragraphs.

Prerequisites

As is no surprise, you need to have Rust and Python installed. Additionally, you will need to install the package maturin to your Python environment and the crate PyO3 to your Rust environment, though this should be taken care of automatically by cargo if you configure things correctly.

Brief Overview

The goal is to write a basic function in Rust and bind it to Python using PyO3 such that it releases GIL to allow for the use of threads. Let’s go over the tools we are going to use to make this happen:

PyO3 is a Rust crate that provides bindings for the Python interpreter. Basically, it allows us to write Rust code that can interact with Python and vice versa. Key features include:

  • the ability to call Python APIs and work with Python objects directly within Rust code,
  • simple macros for the creation of Python extensions,
  • management of GIL, allowing for its safe acquisition and release when necessary.

maturin is a Python package that handles building and packaging Python extensions from Rust code. Key features include:

  • compiling Rust into a Python module and handling the intricacies of linking rustc’s compiled output with Python’s native extension API,
  • building Python wheels out of the compiled Rust code for quick and easy publishing to PyPI,
  • addressing cross-platform compatibility,
  • integration with Sphinx for documentation.

It is worth noting that these are not the only tools for the job, but I found them to be accessible, well-documented, and pretty seamless to use (at least for this trivial use case).

Initializing the Project

We can start by running maturin to create a project called pycrust:

maturin new pycrust

The command asks which type of bindings to use; make sure to select “PyO3”. After this selection, we obtain a new cargo project with configurations in Cargo.toml and pyproject.toml, which will be used to build and package the project later. We even start with a lib.rs which defines a basic addition function and adds it to the API of the Python package. If you want, you can run maturin develop immediately to install the module as-is to your Python environment and test out the addition function in Python.

Adding something worth scaling

The only real contribution we need to make to the project template is to replace the addition function in lib.rs with another function that makes sense to thread. For that purpose we’ll write a sum-of-squares function:

#[pyfunction]
fn sum_of_squares(_py: Python, n: i128) -> PyResult<i128> {
	Ok((1..=n).map(|x| x * x).sum())
}

There are a few components we have to add so that the function works with our PyO3/maturin toolchain:

  • #[pyfunction] is an attribute macro which indicates that the following function is intended to be a Python callable. This macro needs to be present on any function we want to expose in our module’s API.
  • _py: Python is an argument in the function’s signature that acts as a token proving that GIL is currently held. We’ll use this later when dealing with GIL. PyO3 passes it in automatically and does not require it on functions that never touch the interpreter; since we don’t access it in this function’s body, we prefix the argument’s name with an underscore to indicate that it isn’t used.
  • PyResult<i128> is a result type necessary for, among other things, error propagation to the Python interpreter.

We can include this function in our Python module by adding the line

m.add_function(wrap_pyfunction!(sum_of_squares, m)?)?;

to the module declaration code block under #[pymodule]. However, we need one more step to make this code threadable. By default, a function called from a PyO3 extension holds GIL for its entire duration unless it is explicitly told to release it. In order for our Rust function to execute concurrently across multiple cores, we’ll have to do exactly that. Luckily, PyO3 makes it incredibly easy.

#[pyfunction]
fn threadable_sum_of_squares(py: Python, n: i128) -> PyResult<i128> {
	// Release GIL for the duration of the closure; it is reacquired on return.
	py.allow_threads(move || {
		Ok((1..=n).map(|x| x * x).sum())
	})
}

The allow_threads method on the Python object temporarily releases GIL while the closure computes the sum of squares, and reacquires it before control returns to Python. The move keyword transfers ownership of the captured variables (n in this case) into the closure, so that it doesn’t borrow anything from the surrounding function while GIL is released. Be sure to add this threadable version of sum_of_squares to the module with add_function.

Packaging and flight test

At this point we are ready to build our package with a single command:

maturin develop

After opening a Python shell, you can test the Rust extension by running:

>>> import pycrust
>>> pycrust.sum_of_squares(100)
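
If you want something to compare the extension’s output against, here is a small sketch in plain Python. The helper names are made up for this example. It checks a straightforward loop against the closed form n(n+1)(2n+1)/6, which gives 338350 for n=100. It also shows that the result for the benchmark size n=100000000 no longer fits in a signed 64-bit integer, which is presumably why the Rust function works with i128.

```python
def sum_of_squares_loop(n):
    """Same computation as the Rust function: 1^2 + 2^2 + ... + n^2."""
    return sum(x * x for x in range(1, n + 1))


def sum_of_squares_closed_form(n):
    """Closed form n(n+1)(2n+1)/6; the product is always divisible by 6."""
    return n * (n + 1) * (2 * n + 1) // 6


if __name__ == "__main__":
    print(sum_of_squares_loop(100), sum_of_squares_closed_form(100))

    # Range check for the benchmark size used later in the post.
    big = sum_of_squares_closed_form(100_000_000)
    print("fits in i64: ", big <= 2**63 - 1)
    print("fits in i128:", big <= 2**127 - 1)
```

With the extension installed, pycrust.sum_of_squares(100) should agree with both helpers.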

Benchmarking

For benchmarking, I compute the sum of squares for n=100000000 a total of k times, for k=1,2,4. In the columns with “threaded” in the header, this is carried out by opening k threading.Thread objects and having each one run the computation. In the other columns, the computation is just run k times in a row. In order for the rows to be comparable with one another, I divide each row by k so that every entry is the time per computation. The first two columns run a Python equivalent of the sum_of_squares function we wrote above, the next two columns run pycrust.sum_of_squares, and the last column runs pycrust.threadable_sum_of_squares.

       python   threaded python   rust    threaded rust   threaded rust with allow_threads
k=1    5.36s    5.17s             2.32s   2.35s           2.22s
k=2    5.20s    5.06s             2.14s   2.25s           1.135s
k=4    5.47s    5.19s             2.19s   2.25s           0.66s

  • Comparing columns 1 and 2 reminds us that threading is pointless for CPU-bound tasks written in native Python.
  • Comparing columns 1 and 3 reminds us that running a compiled extension written in a more efficient language like Rust is usually faster than native Python.
  • Comparing columns 3 and 4 shows us that functions in a Rust extension bound with PyO3 do not release GIL by default, and therefore do not receive a speedup from threading.
  • Comparing columns 1 and 5 shows us the power of writing a Rust extension that releases GIL: we receive roughly an 8x speedup in the k=4 case (5.47s versus 0.66s), where k=4 matches the total number of cores on the machine this ran on.
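
For anyone who wants to reproduce a table like this, the timing procedure described above could be sketched roughly as follows. This is a reconstruction from the description, not the exact script behind the numbers. The helper names are invented, the pure-Python function is only a stand-in for the “Python equivalent”, and the pycrust columns are added only if the extension built earlier is importable. The demo uses a much smaller n than the benchmark so that it finishes quickly.

```python
import threading
import time


def py_sum_of_squares(n):
    """Pure-Python stand-in for the Rust function."""
    return sum(x * x for x in range(1, n + 1))


def per_call_sequential(fn, n, k):
    """Call fn(n) k times in a row; return elapsed seconds divided by k."""
    start = time.perf_counter()
    for _ in range(k):
        fn(n)
    return (time.perf_counter() - start) / k


def per_call_threaded(fn, n, k):
    """Call fn(n) once in each of k threads; return elapsed seconds divided by k."""
    threads = [threading.Thread(target=fn, args=(n,)) for _ in range(k)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (time.perf_counter() - start) / k


def benchmark(columns, n, ks=(1, 2, 4)):
    """columns is a list of (label, timer, fn); returns {k: {label: seconds}}."""
    return {k: {label: timer(fn, n, k) for label, timer, fn in columns} for k in ks}


if __name__ == "__main__":
    columns = [
        ("python", per_call_sequential, py_sum_of_squares),
        ("threaded python", per_call_threaded, py_sum_of_squares),
    ]
    try:
        import pycrust  # only present after running maturin develop
        columns += [
            ("rust", per_call_sequential, pycrust.sum_of_squares),
            ("threaded rust", per_call_threaded, pycrust.sum_of_squares),
            ("threaded rust with allow_threads", per_call_threaded,
             pycrust.threadable_sum_of_squares),
        ]
    except ImportError:
        pass  # fall back to the two pure-Python columns

    for k, row in benchmark(columns, n=200_000).items():
        cells = "  ".join(f"{label}: {secs:.3f}s" for label, secs in row.items())
        print(f"k={k}  {cells}")
```

Dividing by k is what makes a flat column mean “no benefit from threads” and a shrinking column mean the work really did run in parallel.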

Closing Thoughts

This was an enlightening exercise for me. I am new to the world of concurrency, threads, GIL, and language binding so I learned lots. I am not sure what I am going to do with this knowledge yet, but I am glad to have this in my toolkit now as I am sure it has the potential to help me address performance bottlenecks in the future. I know I am only hitting the tip of the iceberg here though. Concurrency is complicated, especially with shared, mutable data, so there’s plenty more to explore.

Till next time ~