
# Parallelism

CPython has the infamous [Global Interpreter Lock](https://docs.python.org/3/glossary.html#term-global-interpreter-lock), which prevents several threads from executing Python bytecode in parallel. This makes threading in Python a bad fit for [CPU-bound](https://stackoverflow.com/questions/868568/) tasks and often forces developers to accept the overhead of multiprocessing.

In PyO3, parallelism is easy to achieve in Rust-only code. Let's take a look at our [word-count](https://github.com/PyO3/pyo3/blob/main/examples/word-count/src/lib.rs) example, where we have a `search` function that uses the [rayon](https://github.com/nikomatsakis/rayon) crate to count words in parallel.
```rust, ignore
#[pyfunction]
fn search(contents: &str, needle: &str) -> usize {
    contents
        .par_lines()
        .map(|line| count_line(line, needle))
        .sum()
}
```

But let's assume you have a long-running Rust function which you would like to execute several times in parallel. For the sake of example, let's take a sequential version of the word count:
```rust, ignore
fn search_sequential(contents: &str, needle: &str) -> usize {
    contents.lines().map(|line| count_line(line, needle)).sum()
}
```
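
The snippets in this chapter call a `count_line` helper that is not shown here. As a rough idea of what such a helper could look like, here is a minimal sketch using only the standard library. It is an assumption for illustration, not the implementation from the word-count example: it treats a word as a match when it equals `needle` after trimming surrounding punctuation and ignoring ASCII case.

```rust
/// Hypothetical stand-in for the `count_line` helper used in this chapter.
/// Counts the words in `line` that equal `needle`, ignoring ASCII case
/// and any leading or trailing punctuation. The real helper may differ.
fn count_line(line: &str, needle: &str) -> usize {
    line.split_whitespace()
        // Strip non-alphanumeric characters such as commas and full stops
        // from both ends of each word.
        .map(|word| word.trim_matches(|c: char| !c.is_alphanumeric()))
        // Keep only whole-word matches; "theme" does not match "the".
        .filter(|word| word.eq_ignore_ascii_case(needle))
        .count()
}
```

Because the helper only borrows its inputs and touches no shared state, it can be called from any number of threads at once, which is what both the rayon version and the threaded version rely on.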

To enable parallel execution of this function, the [`Python::allow_threads`] method can be used to temporarily release the GIL, thus allowing other Python threads to run. We then expose a function to the Python runtime which calls `search_sequential` inside a closure passed to [`Python::allow_threads`], enabling true parallelism:
```rust, ignore
#[pyfunction]
fn search_sequential_allow_threads(py: Python, contents: &str, needle: &str) -> usize {
    py.allow_threads(|| search_sequential(contents, needle))
}
```
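
Releasing the GIL is only sound when the closure does no work that needs the Python interpreter. `search_sequential` qualifies because it only reads borrowed `&str` data. To illustrate that property without involving PyO3 at all, here is a small sketch that runs two searches concurrently on plain OS threads with `std::thread::scope`. The `count_line` body is a simplified stand-in and `search_twice_in_parallel` is a hypothetical name used only for this illustration.

```rust
use std::thread;

/// Simplified stand-in for the word-count helper: exact whole-word matches only.
fn count_line(line: &str, needle: &str) -> usize {
    line.split_whitespace().filter(|word| *word == needle).count()
}

/// Sequential search: sum the per-line counts.
fn search_sequential(contents: &str, needle: &str) -> usize {
    contents.lines().map(|line| count_line(line, needle)).sum()
}

/// Hypothetical helper: runs two independent searches on separate OS threads
/// and returns both results. Scoped threads may borrow `contents` and `needle`
/// because the scope joins every thread before it returns.
fn search_twice_in_parallel(contents: &str, needle: &str) -> (usize, usize) {
    thread::scope(|s| {
        let first = s.spawn(|| search_sequential(contents, needle));
        let second = s.spawn(|| search_sequential(contents, needle));
        // `join` only fails if the spawned thread panicked.
        (first.join().unwrap(), second.join().unwrap())
    })
}
```

This is roughly what happens when two Python threads call `search_sequential_allow_threads` at the same time: each call runs the same pure Rust function on its own thread, and neither needs the interpreter until it returns.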

Now Python threads can use more than one CPU core, resolving the limitation which usually makes multi-threading in Python only good for IO-bound tasks:
```Python
from concurrent.futures import ThreadPoolExecutor
from word_count import search_sequential_allow_threads

# `contents` and `needle` are assumed to be strings defined elsewhere.
executor = ThreadPoolExecutor(max_workers=2)

# Submit the same search twice; each call releases the GIL while it runs.
future_1 = executor.submit(search_sequential_allow_threads, contents, needle)
future_2 = executor.submit(search_sequential_allow_threads, contents, needle)
result_1 = future_1.result()
result_2 = future_2.result()
```

## Benchmark

Let's benchmark the `word-count` example to verify that we really did unlock parallelism with PyO3.

We are using `pytest-benchmark` to benchmark four word count functions:

1. Pure Python version
2. Rust parallel version
3. Rust sequential version
4. Rust sequential version executed twice with two Python threads

The benchmark script can be found [here](https://github.com/PyO3/pyo3/blob/main/examples/word-count/tests/test_word_count.py), and we can run `tox` in the `word-count` folder to benchmark these functions.

While the results of the benchmark of course depend on your machine, the relative results should be similar to this (mid 2020):
```ignore
-------------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------------
Name (time in ms)                                          Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_word_count_rust_parallel                           1.7315 (1.0)       4.6495 (1.0)       1.9972 (1.0)      0.4299 (1.0)       1.8142 (1.0)      0.2049 (1.0)         40;46  500.6943 (1.0)         375           1
test_word_count_rust_sequential                         7.3348 (4.24)     10.3556 (2.23)      8.0035 (4.01)     0.7785 (1.81)      7.5597 (4.17)     0.8641 (4.22)         26;5  124.9457 (0.25)        121           1
test_word_count_rust_sequential_twice_with_threads      7.9839 (4.61)     10.3065 (2.22)      8.4511 (4.23)     0.4709 (1.10)      8.2457 (4.55)     0.3927 (1.92)        17;17  118.3274 (0.24)        114           1
test_word_count_python_sequential                      27.3985 (15.82)    45.4527 (9.78)     28.9604 (14.50)    4.1449 (9.64)     27.5781 (15.20)    0.4638 (2.26)          3;5   34.5299 (0.07)         35           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

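You can see that the threaded Python version, which runs the sequential search twice, is only slightly slower than a single run of the Rust sequential version (a mean of about 8.5 ms versus 8.0 ms). Twice the work finishes in roughly the same wall-clock time, which means the throughput has doubled compared to execution on a single CPU core.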

[`Python::allow_threads`]: https://docs.rs/pyo3/latest/pyo3/struct.Python.html#method.allow_threads