# Scalable repetitions of tasks
This section advises on how to use repetitions to scale your project quickly.
## TL;DR
- Loop over dictionaries that map ids to kwargs to create multiple tasks. Create the dictionary with a separate function.
- Create functions to build intermediate objects like output paths, which can be shared more easily across tasks than the generated values.
## Scalability
Parametrizations allow scaling tasks from \(1\) to \(N\) in a simple way. What is easily overlooked is that parametrizations usually trigger other parametrizations, so tasks rather grow from \(1\) to \(N \cdot M \cdot \dots\) or from \(1\) to \(N^{M \cdot \dots}\).
This guide lays out a simple, modular, and scalable structure to fight complexity.
For example, assume we have four datasets with one binary dependent variable and some independent variables. We fit three models on each dataset: a linear model, a logistic model, and a decision tree. In total, we have \(4 \cdot 3 = 12\) tasks.
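The combinatorial growth can be sketched with a plain cross product; this is a toy illustration, not part of the project code below:

```python
from itertools import product

datasets = [f"data_{i}" for i in range(4)]
models = ["linear_probability", "logistic_model", "decision_tree"]

# Every (dataset, model) pair becomes one estimation task.
task_ids = [f"{d}_{m}" for d, m in product(datasets, models)]
print(len(task_ids))  # 12
```

Adding a fifth dataset or a fourth model immediately multiplies the number of tasks, which is why a modular structure pays off.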
First, let us look at the folder and file structure of such a project.
```
my_project
├───pyproject.toml
│
├───src
│   └───my_project
│       ├────config.py
│       │
│       ├───data
│       │   ├────data_0.csv
│       │   ├────data_1.csv
│       │   ├────data_2.csv
│       │   └────data_3.csv
│       │
│       ├───data_preparation
│       │   ├────__init__.py
│       │   ├────config.py
│       │   └────task_prepare_data.py
│       │
│       └───estimation
│           ├────__init__.py
│           ├────config.py
│           └────task_estimate_models.py
│
├───setup.py
│
├───.pytask.sqlite3
│
└───bld
```
The folder structure, the main `config.py` which holds `SRC` and `BLD`, and the tasks follow the same structure advocated throughout the tutorials.

What is new are the local configuration files in each subfolder of `my_project`, which contain objects shared across tasks. For example, `config.py` holds the paths to the processed data and the names of the data sets.
```python
# Content of config.py
from my_project.config import BLD
from my_project.config import SRC

DATA = ["data_0", "data_1", "data_2", "data_3"]


def path_to_input_data(name):
    return SRC / "data" / f"{name}.csv"


def path_to_processed_data(name):
    return BLD / "data" / f"processed_{name}.pkl"
```
The task file `task_prepare_data.py` uses these objects to build the parametrization.
```python
# Content of task_prepare_data.py
import pytask

from my_project.data_preparation.config import DATA
from my_project.data_preparation.config import path_to_input_data
from my_project.data_preparation.config import path_to_processed_data


def _create_parametrization(data):
    id_to_kwargs = {}
    for data_name in data:
        depends_on = path_to_input_data(data_name)
        produces = path_to_processed_data(data_name)
        id_to_kwargs[data_name] = {"depends_on": depends_on, "produces": produces}
    return id_to_kwargs


_ID_TO_KWARGS = _create_parametrization(DATA)

for id_, kwargs in _ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_prepare_data(depends_on, produces):
        ...
```
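The helper can also be exercised outside of pytask. Here is a minimal, self-contained sketch, with `SRC` and `BLD` replaced by stand-in paths for illustration:

```python
from pathlib import Path

# Stand-ins for the project's SRC and BLD constants.
SRC = Path("src/my_project")
BLD = Path("bld")


def path_to_input_data(name):
    return SRC / "data" / f"{name}.csv"


def path_to_processed_data(name):
    return BLD / "data" / f"processed_{name}.pkl"


def _create_parametrization(data):
    id_to_kwargs = {}
    for data_name in data:
        id_to_kwargs[data_name] = {
            "depends_on": path_to_input_data(data_name),
            "produces": path_to_processed_data(data_name),
        }
    return id_to_kwargs


id_to_kwargs = _create_parametrization(["data_0", "data_1"])
print(id_to_kwargs["data_0"]["produces"])  # bld/data/processed_data_0.pkl (POSIX)
```

Because the function is pure, it is easy to inspect and test which dependencies and products each task will receive.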
All arguments for the loop and the `@pytask.mark.task` decorator are built within a function to keep the logic in one place and the module's namespace clean.
Explicit ids make the task ids more descriptive and simplify their selection with expressions. Here is an example of a task id with an explicit id.

```
# With id
.../my_project/data_preparation/task_prepare_data.py::task_prepare_data[data_0]
```
Next, we move to the estimation step to see how to build another parametrization on top of the previous one.
```python
# Content of config.py
from my_project.config import BLD
from my_project.data_preparation.config import DATA

_MODELS = ["linear_probability", "logistic_model", "decision_tree"]

ESTIMATIONS = {
    f"{data_name}_{model_name}": {"model": model_name, "data": data_name}
    for model_name in _MODELS
    for data_name in DATA
}


def path_to_estimation_result(name):
    return BLD / "estimation" / f"estimation_{name}.pkl"
```
In the local configuration, we define `ESTIMATIONS`, which combines the information on data and model. The dictionary's key can be used as a task id whenever the estimation is involved. It allows triggering all tasks related to one estimation - estimation, figures, tables - with one command.
```console
pytask -k linear_probability_data_0
```
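Rebuilding the dictionary in isolation is a quick sanity check that the cross product contains all \(12\) combinations and that the keys have the expected `{data}_{model}` shape; `DATA` and `_MODELS` are copied from the configuration above:

```python
# Copied from the local configuration for a stand-alone check.
DATA = ["data_0", "data_1", "data_2", "data_3"]
_MODELS = ["linear_probability", "logistic_model", "decision_tree"]

ESTIMATIONS = {
    f"{data_name}_{model_name}": {"model": model_name, "data": data_name}
    for model_name in _MODELS
    for data_name in DATA
}

print(len(ESTIMATIONS))  # 12
print(ESTIMATIONS["data_0_linear_probability"])  # {'model': 'linear_probability', 'data': 'data_0'}
```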
And here is the task file.
```python
# Content of task_estimate_models.py
import pytask

from my_project.data_preparation.config import path_to_processed_data
from my_project.estimation.config import ESTIMATIONS
from my_project.estimation.config import path_to_estimation_result


def _create_parametrization(estimations):
    id_to_kwargs = {}
    for name, config in estimations.items():
        depends_on = path_to_processed_data(config["data"])
        produces = path_to_estimation_result(name)
        id_to_kwargs[name] = {
            "depends_on": depends_on,
            "model": config["model"],
            "produces": produces,
        }
    return id_to_kwargs


_ID_TO_KWARGS = _create_parametrization(ESTIMATIONS)

for id_, kwargs in _ID_TO_KWARGS.items():

    @pytask.mark.task(id=id_, kwargs=kwargs)
    def task_estimate_models(depends_on, model, produces):
        if model == "linear_probability":
            ...
```
Replicating this pattern across a project yields a clean way to define parametrizations.
## Extending parametrizations
Some parametrized tasks are costly to run - costly in terms of computing power, memory, or time. Users often extend parametrizations, which triggers all tasks to be rerun. Thus, use the `@pytask.mark.persist` decorator, which is explained in more detail in this tutorial.