Repeating tasks with different inputs#

Do you want to repeat a task over a range of inputs? Loop over your task function!

Important

Before v0.2.0, pytask supported only one approach to repeating tasks. It is called parametrization and, similar to pytest, uses a @pytask.mark.parametrize decorator. If you want to know more about it, you can find it here.

Here you find the new and preferred approach.

An example#

We reuse the task from the previous tutorial, which generates random data, and repeat the same operation over several seeds to obtain multiple, reproducible samples.

Apply the @pytask.mark.task decorator, loop over the function and supply different seeds and output paths as default arguments of the function.

import numpy as np
import pytask


for i in range(10):

    @pytask.mark.task
    def task_create_random_data(produces=f"data_{i}.pkl", seed=i):
        rng = np.random.default_rng(seed)
        ...

Executing pytask gives you this:

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.3.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 10 tasks.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                                     ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_repeating.py::task_create_random_data[data_0.pkl-0] │ .       │
│ task_repeating.py::task_create_random_data[data_1.pkl-1] │ .       │
│ task_repeating.py::task_create_random_data[data_2.pkl-2] │ .       │
│ task_repeating.py::task_create_random_data[data_3.pkl-3] │ .       │
│ task_repeating.py::task_create_random_data[data_4.pkl-4] │ .       │
│ task_repeating.py::task_create_random_data[data_5.pkl-5] │ .       │
│ task_repeating.py::task_create_random_data[data_6.pkl-6] │ .       │
│ task_repeating.py::task_create_random_data[data_7.pkl-7] │ .       │
│ task_repeating.py::task_create_random_data[data_8.pkl-8] │ .       │
│ task_repeating.py::task_create_random_data[data_9.pkl-9] │ .       │
└──────────────────────────────────────────────────────────┴─────────┘

──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  10  Collected tasks           │
│  10  Succeeded       (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.43 seconds ──────────────────────────
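
The loop works because Python evaluates default arguments once, at the moment each function is defined, so every iteration keeps its own seed and output path. The following is a minimal sketch in plain Python, without pytask. The list `collected` and the returned tuple are illustrative assumptions and not part of pytask.

```python
# Plain-Python illustration. `collected` is a hypothetical stand-in for
# the tasks that a tool like pytask would gather from the module.
collected = []

for i in range(3):

    def task_create_random_data(produces=f"data_{i}.pkl", seed=i):
        # The defaults were fixed when this function object was created,
        # so they do not change when the loop variable moves on.
        return produces, seed

    collected.append(task_create_random_data)

# Call every function only after the loop has finished.
results = [task() for task in collected]
print(results)
```

Each function still reports the values from its own iteration, even though all of them are called after the loop has ended.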

`depends_on` and `produces`#

You can also use decorators to supply values to the function.

To specify a dependency that is the same for all iterations, add it with @pytask.mark.depends_on. Add a product with @pytask.mark.produces.

# SRC is assumed to be a pathlib.Path pointing to the project's source folder.
for i in range(10):

    @pytask.mark.task
    @pytask.mark.depends_on(SRC / "common_dependency.file")
    @pytask.mark.produces(f"data_{i}.pkl")
    def task_create_random_data(produces, seed=i):
        rng = np.random.default_rng(seed)
        ...

The id#

Every task has a unique id that can be used to select it. The standard id combines the path to the module where the task is defined, a double colon, and the name of the task function. Here is an example.

../task_data_preparation.py::task_create_random_data

This behavior would produce duplicate ids for repeated tasks. By default, pytask therefore uses auto-generated ids, which are explained here.
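
To make the shape of such ids concrete, here is a small sketch that mimics the ids visible in the terminal output earlier on this page. The helper `build_id` is hypothetical and only reproduces the visible format; pytask's real id generation may differ.

```python
def build_id(module, function_name, values):
    """Mimic the id format seen in the output above (illustrative only)."""
    # Join the argument values with dashes and append them in brackets.
    suffix = "-".join(str(value) for value in values)
    return f"{module}::{function_name}[{suffix}]"


example_id = build_id(
    "task_repeating.py", "task_create_random_data", ["data_0.pkl", 0]
)
print(example_id)
```

Because the suffix is derived from the argument values, two iterations with different arguments receive different ids.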

More powerful are user-defined ids.

User-defined ids#

The @pytask.mark.task decorator has an id keyword, allowing the user to set a unique name for the iteration.

for seed, id_ in [(0, "first"), (1, "second")]:

    @pytask.mark.task(id=id_)
    def task_create_random_data(seed=seed, produces=f"out_{seed}.txt"):
        ...

This produces the following ids:

task_data_preparation.py::task_create_random_data[first]
task_data_preparation.py::task_create_random_data[second]

Complex example#

Parametrizations can quickly become complex. Often, there are many tasks with ids and arguments.

To organize your ids and arguments, use nested dictionaries where keys are ids and values are dictionaries mapping from argument names to values.

ID_TO_KWARGS = {
    "first": {
        "seed": 0,
        "produces": "data_0.pkl",
    },
    "second": {
        "seed": 1,
        "produces": "data_1.pkl",
    },
}

The parametrization becomes

for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_)
    def task_create_random_data(seed=kwargs["seed"], produces=kwargs["produces"]):
        ...

Unpacking all the arguments can become tedious. Instead, use the kwargs argument of the @pytask.mark.task decorator to pass keyword arguments to the task.

for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_create_random_data(seed, produces):
        ...
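
As a rough mental model, and not a description of pytask's actual implementation, supplying kwargs resembles calling the task function with the dictionary unpacked into keyword arguments. The sketch below uses plain Python; `run_with_kwargs` is a hypothetical helper introduced only for illustration.

```python
ID_TO_KWARGS = {
    "first": {"seed": 0, "produces": "data_0.pkl"},
    "second": {"seed": 1, "produces": "data_1.pkl"},
}


def task_create_random_data(seed, produces):
    # Return the arguments so the mapping is easy to inspect.
    return seed, produces


def run_with_kwargs(func, kwargs):
    # Hypothetical helper: unpack the dictionary into keyword arguments.
    return func(**kwargs)


calls = {
    id_: run_with_kwargs(task_create_random_data, kwargs)
    for id_, kwargs in ID_TO_KWARGS.items()
}
print(calls)
```

The keys of each inner dictionary must match the parameter names of the function, otherwise the call raises a TypeError.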

Writing a function that creates ID_TO_KWARGS would be even more pythonic.

def create_parametrization():
    id_to_kwargs = {}
    for i, id_ in enumerate(["first", "second"]):
        id_to_kwargs[id_] = {"seed": i, "produces": f"out_{i}.txt"}

    return id_to_kwargs


ID_TO_KWARGS = create_parametrization()


for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_create_random_data(seed, produces):
        ...

The best-practices guide on parametrizations goes into even more detail on how to scale parametrizations.
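
When tasks vary along more than one dimension, the dictionary can be generated from the Cartesian product of the inputs. This is a minimal sketch using only the standard library. The parameter `sample_sizes`, the argument name `n`, and the id and file name formats are assumptions chosen for illustration.

```python
from itertools import product


def create_parametrization(seeds, sample_sizes):
    """Build a mapping from ids to keyword arguments (illustrative)."""
    id_to_kwargs = {}
    for seed, n in product(seeds, sample_sizes):
        # The id encodes both inputs so that every combination is unique.
        id_ = f"seed_{seed}-n_{n}"
        id_to_kwargs[id_] = {
            "seed": seed,
            "n": n,
            "produces": f"data_{seed}_{n}.pkl",
        }
    return id_to_kwargs


ID_TO_KWARGS = create_parametrization(seeds=[0, 1], sample_sizes=[100, 1000])
print(list(ID_TO_KWARGS))
```

The resulting dictionary can be looped over in the same way as before, with one task per id.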

A warning on globals#

The following example warns against accidentally using running variables in your task definition.

You won’t encounter these problems if you strictly use the below-mentioned interfaces.

Look at this repeated task which runs three times and tries to produce a text file with some content.

import pytask
from pathlib import Path


for i in range(3):

    @pytask.mark.task
    @pytask.mark.produces(f"out_{i}.txt")
    def task_example():
        path_of_module_folder = Path(__file__).parent
        path_to_product = path_of_module_folder.joinpath(f"out_{i}.txt")
        path_to_product.write_text("I use running globals. How funny.")

If you executed these tasks, pytask would collect three tasks as expected. But only the last task, for i = 2, would succeed.

The other tasks would fail because they did not produce out_0.txt and out_1.txt.

Why did the first two tasks fail?
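
The underlying Python behavior can be reproduced without pytask. Decorator arguments such as f"out_{i}.txt" are evaluated while the loop runs, but a name used inside a function body is looked up only when the function is called. The sketch below is plain Python; the names `functions`, `name_from_global`, and `late_names` are illustrative.

```python
functions = []

for i in range(3):

    def name_from_global():
        # `i` is looked up when the function runs, not when it is defined.
        return f"out_{i}.txt"

    functions.append(name_from_global)

# All functions are called after the loop has finished, when i == 2.
late_names = [function() for function in functions]
print(late_names)
```

Every function reports the file name of the last iteration, which matches the situation above: each task body writes out_2.txt, so the products declared for the first two iterations are never created.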