Repeating tasks with different inputs#
Do you want to repeat a task over a range of inputs? Loop over your task function!
Important
Before v0.2.0, pytask supported only one approach to repeating tasks. It is called parametrization and, similarly to pytest, it uses a @pytask.mark.parametrize decorator. If you want to know more about it, you can find it here.
This page describes the new and preferred approach.
An example#
We reuse the task from the previous tutorial, which generates random data and repeats the same operation over several seeds to receive multiple, reproducible samples.
Apply the @pytask.mark.task decorator, loop over the function, and supply different seeds and output paths as default arguments of the function.
import numpy as np
import pytask


for i in range(10):

    @pytask.mark.task
    def task_create_random_data(produces=f"data_{i}.pkl", seed=i):
        rng = np.random.default_rng(seed)
        ...
Executing pytask gives you this:
$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.3.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 10 task.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                                     ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_repeating.py::task_create_random_data[data_0.pkl-0] │ .       │
│ task_repeating.py::task_create_random_data[data_1.pkl-1] │ .       │
│ task_repeating.py::task_create_random_data[data_2.pkl-2] │ .       │
│ task_repeating.py::task_create_random_data[data_3.pkl-3] │ .       │
│ task_repeating.py::task_create_random_data[data_4.pkl-4] │ .       │
│ task_repeating.py::task_create_random_data[data_5.pkl-5] │ .       │
│ task_repeating.py::task_create_random_data[data_6.pkl-6] │ .       │
│ task_repeating.py::task_create_random_data[data_7.pkl-7] │ .       │
│ task_repeating.py::task_create_random_data[data_8.pkl-8] │ .       │
│ task_repeating.py::task_create_random_data[data_9.pkl-9] │ .       │
└──────────────────────────────────────────────────────────┴─────────┘
──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│ 10 Collected tasks             │
│ 10 Succeeded (100.0%)          │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.43 seconds ──────────────────────────
depends_on and produces#
You can also use decorators to supply values to the function.
To specify a dependency that is the same for all iterations, add it with @pytask.mark.depends_on. Add a product with @pytask.mark.produces.
# SRC is assumed to be a pathlib.Path defined elsewhere in the project.
for i in range(10):

    @pytask.mark.task
    @pytask.mark.depends_on(SRC / "common_dependency.file")
    @pytask.mark.produces(f"data_{i}.pkl")
    def task_create_random_data(produces, seed=i):
        rng = np.random.default_rng(seed)
        ...
The id#
Every task has a unique id that can be used to select it. The standard id combines the path to the module where the task is defined, a double colon, and the name of the task function. Here is an example.
../task_data_preparation.py::task_create_random_data
For repeated tasks, this scheme alone would produce duplicate ids. By default, pytask therefore uses auto-generated ids, which are explained here.
More powerful are user-defined ids.
User-defined ids#
The @pytask.mark.task decorator has an id keyword, allowing the user to set a unique name for the iteration.
for seed, id_ in [(0, "first"), (1, "second")]:

    @pytask.mark.task(id=id_)
    def task_create_random_data(seed=seed, produces=f"out_{seed}.txt"):
        ...
This produces the following ids:
task_data_preparation.py::task_create_random_data[first]
task_data_preparation.py::task_create_random_data[second]
Complex example#
Parametrizations quickly become complex. Often, there are many tasks, each with its own id and arguments.
To organize your ids and arguments, use nested dictionaries where keys are ids and values are dictionaries mapping from argument names to values.
ID_TO_KWARGS = {
"first": {
"seed": 0,
"produces": "data_0.pkl",
},
"second": {
"seed": 1,
"produces": "data_1.pkl",
},
}
The parametrization becomes
for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_)
    def task_create_random_data(seed=kwargs["seed"], produces=kwargs["produces"]):
        ...
Unpacking all the arguments can become tedious. Instead, use the kwargs argument of the @pytask.mark.task decorator to pass keyword arguments to the task.
for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_create_random_data(seed, produces):
        ...
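Because the kwargs approach separates the argument values from the function signature, a typo in a key or a missing argument is easy to overlook. The following is a minimal sketch in plain Python, not part of pytask, that checks whether every kwargs dictionary can be bound to the task function. The helper name find_mismatched_ids is hypothetical, and the check only applies when all arguments are passed through kwargs rather than through decorators such as @pytask.mark.produces.

```python
import inspect

ID_TO_KWARGS = {
    "first": {"seed": 0, "produces": "data_0.pkl"},
    "second": {"seed": 1, "produces": "data_1.pkl"},
}


def task_create_random_data(seed, produces):
    ...


def find_mismatched_ids(func, id_to_kwargs):
    """Return the ids whose kwargs cannot be bound to the signature of func.

    Hypothetical helper for illustration; it is not a pytask function.
    """
    signature = inspect.signature(func)
    mismatched = []
    for id_, kwargs in id_to_kwargs.items():
        try:
            # bind() raises TypeError for missing or unexpected arguments.
            signature.bind(**kwargs)
        except TypeError:
            mismatched.append(id_)
    return mismatched


mismatched_ids = find_mismatched_ids(task_create_random_data, ID_TO_KWARGS)
print(mismatched_ids)
```

An empty list means every entry supplies exactly the arguments the function expects.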
Writing a function that creates ID_TO_KWARGS would be even more pythonic.
def create_parametrization():
    id_to_kwargs = {}
    for i, id_ in enumerate(["first", "second"]):
        # Every argument in the task signature needs a matching key.
        id_to_kwargs[id_] = {"i": i, "produces": f"out_{i}.txt"}
    return id_to_kwargs


ID_TO_KWARGS = create_parametrization()

for id_, kwargs in ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_create_random_data(i, produces):
        ...
The best-practices guide on parametrizations goes into even more detail on how to scale parametrizations.
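As one possible way to scale the pattern above, the mapping can be built from the Cartesian product of several inputs. This is a sketch under assumptions: the sample_sizes input, the n_samples argument, and the id and file name formats are invented for illustration and do not come from pytask or from the example project.

```python
from itertools import product


def create_parametrization(seeds, sample_sizes):
    """Build a mapping from ids to keyword arguments for every combination.

    The argument names and naming schemes are hypothetical.
    """
    id_to_kwargs = {}
    for seed, n_samples in product(seeds, sample_sizes):
        id_ = f"seed_{seed}-n_{n_samples}"
        id_to_kwargs[id_] = {
            "seed": seed,
            "n_samples": n_samples,
            "produces": f"data_{seed}_{n_samples}.pkl",
        }
    return id_to_kwargs


ID_TO_KWARGS = create_parametrization(seeds=[0, 1], sample_sizes=[100, 1000])

for id_, kwargs in ID_TO_KWARGS.items():
    print(id_, kwargs)
```

The resulting dictionary can be looped over in the same way as before, with a task function that accepts seed, n_samples, and produces.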
A warning on globals#
The following example warns against accidentally using loop variables in your task definition.
You won’t encounter these problems if you strictly use the interfaces described at the end of this section.
Look at this repeated task which runs three times and tries to produce a text file with some content.
import pytask
from pathlib import Path


for i in range(3):

    @pytask.mark.task
    @pytask.mark.produces(f"out_{i}.txt")
    def task_example():
        path_of_module_folder = Path(__file__).parent
        path_to_product = path_of_module_folder.joinpath(f"out_{i}.txt")
        path_to_product.write_text("I use running globals. How funny.")
If you executed these tasks, pytask would collect three tasks as expected. But only the last task, for i = 2, would succeed.
The other tasks would fail because they did not produce out_0.txt and out_1.txt.
Why did the first two tasks fail?
Explanation
The problem with this example is the loop variable i, which is a global variable with changing state.
When pytask imports the task module, it collects all three task functions, each of them having the correct product assigned.
But when pytask executes the tasks, the variable i in the function body is 2, the last state of the loop.
So, all three tasks create the same file, out_2.txt.
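The same name-lookup rule can be reproduced in plain Python without pytask. In this sketch, the loop lives inside a helper function only to keep the snippet self-contained; the function names are made up for illustration, and the behavior is the same for a loop variable at module level.

```python
def build_functions():
    """Create functions in a loop, with and without binding the loop variable."""
    late_bound = []
    early_bound = []
    for i in range(3):

        def reads_loop_variable():
            # `i` is looked up when the function is called, not when it is defined.
            return f"out_{i}.txt"

        def reads_default(i=i):
            # The default value is evaluated once, when the function is defined.
            return f"out_{i}.txt"

        late_bound.append(reads_loop_variable)
        early_bound.append(reads_default)
    return late_bound, early_bound


late_bound, early_bound = build_functions()

# All functions are called only after the loop has finished.
late_results = [func() for func in late_bound]
early_results = [func() for func in early_bound]

print(late_results)   # ['out_2.txt', 'out_2.txt', 'out_2.txt']
print(early_results)  # ['out_0.txt', 'out_1.txt', 'out_2.txt']
```

Functions that read the loop variable directly all see its final value, while a default argument freezes the value of each iteration.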
The solution is to use the intended channels to pass variables to tasks, which are the kwargs argument of @pytask.mark.task or the default value in the function signature.
for i in range(3):

    @pytask.mark.task(kwargs={"i": i})
    @pytask.mark.produces(f"out_{i}.txt")
    def task_example(i):
        ...

    # or

    @pytask.mark.task
    @pytask.mark.produces(f"out_{i}.txt")
    def task_example(i=i):
        ...