Repeating tasks with different inputs¶

Do you want to repeat a task over a range of inputs? Loop over your task function!

An example¶

We reuse the task from the previous tutorial, which generates random data, and repeat the same operation over several seeds to obtain multiple reproducible samples.

Apply the @task decorator, loop over the function and supply different seeds and output paths as default arguments of the function.

Annotated variant, then produces variant:

from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

for seed in range(10):

    @task
    def task_create_random_data(
        path: Annotated[Path, Product] = Path(f"data_{seed}.pkl"), seed: int = seed
    ) -> None: ...

from pathlib import Path

from pytask import task

for seed in range(10):

    @task
    def task_create_random_data(
        produces: Path = Path(f"data_{seed}.pkl"), seed: int = seed
    ) -> None: ...

Executing pytask gives you this:

$ pytask
────────────────────────── Start pytask session ─────────────────────────
Platform: win32 -- Python 3.13.0, pytask 0.6.0, pluggy 1.3.0
Root: C:\Users\pytask-dev\git\my_project
Collected 10 task.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                                     ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_repeating.py::task_create_random_data[data_0.pkl-0] │ .       │
│ task_repeating.py::task_create_random_data[data_1.pkl-1] │ .       │
│ task_repeating.py::task_create_random_data[data_2.pkl-2] │ .       │
│ task_repeating.py::task_create_random_data[data_3.pkl-3] │ .       │
│ task_repeating.py::task_create_random_data[data_4.pkl-4] │ .       │
│ task_repeating.py::task_create_random_data[data_5.pkl-5] │ .       │
│ task_repeating.py::task_create_random_data[data_6.pkl-6] │ .       │
│ task_repeating.py::task_create_random_data[data_7.pkl-7] │ .       │
│ task_repeating.py::task_create_random_data[data_8.pkl-8] │ .       │
│ task_repeating.py::task_create_random_data[data_9.pkl-9] │ .       │
└──────────────────────────────────────────────────────────┴─────────┘

─────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  10  Collected tasks           │
│  10  Succeeded       (100.0%)  │
╰────────────────────────────────╯
─────────────────────── Succeeded in 0.43 seconds ───────────────────────

Dependencies¶

You can also add dependencies to repeated tasks just like with any other task.

Annotated variant, then produces variant:

from pathlib import Path
from typing import Annotated

from my_project.config import SRC

from pytask import Product
from pytask import task

for seed in range(10):

    @task
    def task_create_random_data(
        path_to_parameters: Path = SRC / "parameters.yml",
        path_to_data: Annotated[Path, Product] = Path(f"data_{seed}.pkl"),
        seed: int = seed,
    ) -> None: ...

from pathlib import Path

from my_project.config import SRC

from pytask import task

for seed in range(10):

    @task
    def task_create_random_data(
        path_to_parameters: Path = SRC / "parameters.yml",
        produces: Path = Path(f"data_{seed}.pkl"),
        seed: int = seed,
    ) -> None: ...

The id¶

Every task has a unique id that can be used to select it. The standard id combines the path to the module where the task is defined, a double colon, and the name of the task function. Here is an example.

../task_data_preparation.py::task_create_random_data

For repeated tasks, this scheme would produce duplicate ids because every iteration shares the same module path and function name. To keep ids unique, pytask uses auto-generated ids by default.

Auto-generated ids¶

pytask constructs ids by extending the task name with representations of the values used for each iteration. Booleans, floats, integers, and strings enter the task id directly. For example, a task function that receives four arguments, True, 1.0, 2, and "hello", one of each data type, has the following id.

task_data_preparation.py::task_create_random_data[True-1.0-2-hello]

Arguments with other data types are not converted to strings. Instead, they are replaced with a combination of the argument name and the iteration counter.

For example, the following function is parametrized with tuples.

Annotated variant, then produces variant:

from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

for seed in ((0,), (1,)):

    @task
    def task_create_random_data(
        seed: tuple[int] = seed,
        path_to_data: Annotated[Path, Product] = Path(f"data_{seed[0]}.pkl"),
    ) -> None: ...

from pathlib import Path

from pytask import task

for seed in ((0,), (1,)):

    @task
    def task_create_random_data(
        produces: Path = Path(f"data_{seed[0]}.pkl"), seed: tuple[int] = seed
    ) -> None: ...

Since the tuples are not converted to strings, the ids of the two tasks are:

task_data_preparation.py::task_create_random_data[seed0]
task_data_preparation.py::task_create_random_data[seed1]
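The rule stated above can be mimicked in a few lines of plain Python. This is only a sketch of the rule as written in this section, not pytask's implementation; build_id_suffix is a hypothetical helper, and pytask may treat further types, such as paths, differently.

```python
from typing import Any


def build_id_suffix(arguments: dict[str, Any], iteration: int) -> str:
    """Mimic the id rule described in the text (hypothetical helper)."""
    parts = []
    for name, value in arguments.items():
        if isinstance(value, (bool, float, int, str)):
            # Simple values enter the id directly.
            parts.append(str(value))
        else:
            # Other types fall back to argument name plus iteration counter.
            parts.append(f"{name}{iteration}")
    return "-".join(parts)


simple = build_id_suffix({"a": True, "b": 1.0, "c": 2, "d": "hello"}, iteration=0)
fallback = [
    build_id_suffix({"seed": seed}, iteration=i)
    for i, seed in enumerate(((0,), (1,)))
]
```

Here simple reproduces the suffix True-1.0-2-hello, and fallback contains seed0 and seed1, matching the ids listed above.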

User-defined ids¶

The @task decorator has an id keyword, allowing the user to set a unique name for the iteration.

Annotated variant, then produces variant:

from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

for seed, id_ in ((0, "first"), (1, "second")):

    @task(id=id_)
    def task_create_random_data(
        seed: int = seed,
        path_to_data: Annotated[Path, Product] = Path(f"data_{seed}.txt"),
    ) -> None: ...

from pathlib import Path

from pytask import task

for seed, id_ in ((0, "first"), (1, "second")):

    @task(id=id_)
    def task_create_random_data(
        produces: Path = Path(f"out_{seed}.txt"), seed: int = seed
    ) -> None: ...

This loop produces these ids:

task_data_preparation.py::task_create_random_data[first]
task_data_preparation.py::task_create_random_data[second]

Complex example¶

Parametrizations quickly become complex. Often, there are many tasks, each with its own id and arguments. Here are three tips to organize the repetitions.

Use suitable containers to organize your ids and the function arguments.

Dataclass

dataclasses.dataclass is a useful container to organize the arguments of the parametrizations. It also works well with type checkers.

from dataclasses import dataclass
from pathlib import Path


@dataclass
class Arguments:
    seed: int
    path_to_data: Path


ID_TO_KWARGS = {
    "first": Arguments(seed=0, path_to_data=Path("data_0.pkl")),
    "second": Arguments(seed=1, path_to_data=Path("data_1.pkl")),
}

@task has a kwargs argument that allows you to inject arguments into the function instead of adding them as default arguments.
If generating the arguments for the task function is complex, use a function to create them.
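If you prefer to hand a plain dictionary to kwargs, or simply want to inspect what a container holds, the standard library can convert a dataclass instance. A minimal sketch, reusing the Arguments class from the first tip and not involving pytask at all:

```python
from dataclasses import asdict
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Arguments:
    seed: int
    path_to_data: Path


arguments = Arguments(seed=0, path_to_data=Path("data_0.pkl"))

# asdict returns a plain dict keyed by field name.
kwargs = asdict(arguments)
```

The keys of kwargs are the field names, so they need to match the parameter names of the task function that receives them.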

Following these three tips, the parametrization becomes

Annotated variant, then produces variant:

from dataclasses import dataclass
from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task


@dataclass
class _Arguments:
    seed: int
    path_to_data: Path


ID_TO_KWARGS = {
    "first": _Arguments(seed=0, path_to_data=Path("data_0.pkl")),
    "second": _Arguments(seed=1, path_to_data=Path("data_1.pkl")),
}


for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_create_random_data(
        seed: int, path_to_data: Annotated[Path, Product]
    ) -> None: ...

from dataclasses import dataclass
from pathlib import Path

from pytask import task


@dataclass
class _Arguments:
    # Field names must match the parameters of the task function below.
    seed: int
    produces: Path


ID_TO_KWARGS = {
    "first": _Arguments(seed=0, produces=Path("data_0.pkl")),
    "second": _Arguments(seed=1, produces=Path("data_1.pkl")),
}


for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_create_random_data(seed: int, produces: Path) -> None: ...

Unpacking all the arguments can become tedious. Instead, use the kwargs argument of the @task decorator to pass keyword arguments to the task.

for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_create_random_data(seed, produces): ...

Writing a function that creates ID_TO_KWARGS would be even more pythonic.

from pathlib import Path

from pytask import task


def create_parametrization():
    id_to_kwargs = {}
    for i, id_ in enumerate(["first", "second"]):
        # Supply every argument of the task function, including i.
        id_to_kwargs[id_] = {"i": i, "produces": Path(f"out_{i}.txt")}

    return id_to_kwargs


ID_TO_KWARGS = create_parametrization()


for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_create_random_data(i, produces): ...
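When the repetitions span more than one dimension, the same kind of function can build the ids and arguments from a grid. The following is a sketch under assumptions: the sample_sizes dimension, the argument name n, and the id and file name formats are made up for illustration and do not come from pytask.

```python
from itertools import product
from pathlib import Path


def create_parametrization(seeds, sample_sizes):
    """Build a mapping from id to keyword arguments for every combination."""
    id_to_kwargs = {}
    for seed, n in product(seeds, sample_sizes):
        # Derive a readable, unique id from the values of this combination.
        id_ = f"seed_{seed}-n_{n}"
        id_to_kwargs[id_] = {
            "seed": seed,
            "n": n,
            "produces": Path(f"data_{seed}_{n}.pkl"),
        }
    return id_to_kwargs


ID_TO_KWARGS = create_parametrization(seeds=range(2), sample_sizes=(100, 1000))
```

The resulting dictionary can be looped over with @task(id=id_, kwargs=kwargs) exactly as before, provided the task function accepts seed, n, and produces.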

The best-practices guide on parametrizations goes into even more detail on how to scale parametrizations.

A warning on globals¶

The following example warns against accidentally using running variables in your task definition.

You won't encounter these problems if you strictly use the interfaces described in this tutorial, that is, default arguments in the function signature or the kwargs argument of @task.

Look at this repeated task, which runs three times and tries to produce a text file with some content.

from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task


for i in range(3):

    @task
    def task_example(path: Annotated[Path, Product] = Path(f"out_{i}.txt")):
        path_of_module_folder = Path(__file__).parent
        path_to_product = path_of_module_folder.joinpath(f"out_{i}.txt")
        path_to_product.write_text("I use running globals. How funny.")

If you executed these tasks, pytask would collect three tasks as expected, but only the last task, for i = 2, would succeed.

The other tasks would fail because they did not produce out_0.txt and out_1.txt.

Why did the first two tasks fail?

Explanation

The problem with this example is the running variable i, which is a global variable with changing state.

When pytask imports the task module, it collects all three task functions, each with the correct product assigned.

But when pytask executes the tasks, the running variable i in the function body is 2, the last state of the loop.

So, all three tasks create the same file, out_2.txt.

The solution is to pass variables to tasks through the intended channels: the kwargs argument of @task or a default value in the function signature.

for i in range(3):

    @task(kwargs={"i": i})
    def task_example(i, path: Annotated[Path, Product] = Path(f"out_{i}.txt")): ...

    # or

    @task
    def task_example(i=i, path: Annotated[Path, Product] = Path(f"out_{i}.txt")): ...
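The difference between the failing and the fixed version is ordinary Python behavior and can be reproduced without pytask. In this minimal sketch, late_name and early_name are hypothetical stand-ins for the task functions: the first looks up i when it is called, the second captures i as a default when it is defined.

```python
late_bound = []
early_bound = []

for i in range(3):

    def late_name() -> str:
        # i is looked up at call time, after the loop has finished.
        return f"out_{i}.txt"

    def early_name(i: int = i) -> str:
        # The default captures the value of i at definition time.
        return f"out_{i}.txt"

    late_bound.append(late_name)
    early_bound.append(early_name)

late_results = [function() for function in late_bound]
early_results = [function() for function in early_bound]
```

After the loop, every function in late_bound returns out_2.txt, while the functions in early_bound return out_0.txt, out_1.txt, and out_2.txt.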