Defining dependencies and products#

To ensure pytask executes all tasks in the correct order, you need to define dependencies and products for each task.

This tutorial offers you different interfaces. If you are comfortable with type annotations or not afraid to try them, take a look at the tabs named Python 3.10+ or Python 3.8+.

If you want to avoid type annotations for now, look at the tab named produces.

The deprecated approaches can be found in the tabs named Decorators.

See also

An overview of the different interfaces and their strengths and weaknesses is given in Interfaces for dependencies and products.

First, we focus on how to define products, which should already be familiar to you. Then, we turn to how task dependencies can be declared.

We use the same project layout as before and add a task_plot_data.py module.

my_project
├───pyproject.toml
│
├───src
│   └───my_project
│       ├────config.py
│       ├────task_data_preparation.py
│       └────task_plot_data.py
│
├───setup.py
│
├───.pytask
│   └────...
│
└───bld
    ├────data.pkl
    └────plot.png

Products#

Let’s revisit the task from the previous tutorial that we defined in task_data_preparation.py.

from pathlib import Path
from typing import Annotated

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = BLD / "data.pkl"
) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path_to_data)

Product lets you declare an argument as a product. After the task has finished, pytask will check whether the file exists.

from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = BLD / "data.pkl"
) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path_to_data)

Using Product lets you declare an argument as a product. After the task has finished, pytask will check whether the file exists.

from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD


def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)

Tasks can use produces as a “magic” argument name. Every value, or in this case path, passed to this argument is automatically treated as a task product. Here, the path is given by the default value of the argument.

Warning

This approach is deprecated and will be removed in v0.5

from pathlib import Path

import numpy as np
import pandas as pd
import pytask
from my_project.config import BLD


@pytask.mark.produces(BLD / "data.pkl")
def task_create_random_data(produces: Path) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)

The @pytask.mark.produces marker attaches a product to a task. Here, the product is a pathlib.Path to a file. After the task has finished, pytask will check whether the file exists.

Add produces as an argument of the task function to get access to the same path inside the task function.

Tip

If you do not know about pathlib, check out [1] and [2]. The module is beneficial for handling paths conveniently and across platforms.
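The check that a product exists after the task has finished can be pictured with a small standard-library sketch. The helper run_and_check_product and both stand-in tasks are hypothetical and not part of pytask; they only illustrate the idea, and pytask's actual implementation may differ.

```python
import tempfile
from pathlib import Path


def run_and_check_product(task, product: Path) -> None:
    """Run ``task`` and fail if it did not create ``product`` (hypothetical helper)."""
    task(product)
    if not product.exists():
        raise FileNotFoundError(f"{product} was declared as a product but is missing.")


def task_write_greeting(produces: Path) -> None:
    # A stand-in task that creates its product.
    produces.write_text("hello")


def task_forgets_product(produces: Path) -> None:
    # A stand-in task that never writes its product.
    pass


with tempfile.TemporaryDirectory() as tmp:
    # The first task creates its file, so the check passes silently.
    run_and_check_product(task_write_greeting, Path(tmp) / "greeting.txt")
    try:
        run_and_check_product(task_forgets_product, Path(tmp) / "missing.txt")
    except FileNotFoundError:
        missing_product_detected = True
    else:
        missing_product_detected = False
```

The second call fails because the task returns without writing the file it declared as a product.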

Dependencies#

Most tasks have dependencies, and it is important to specify them. pytask then ensures that the dependencies are available before executing the task.

As an example, we want to extend our project with another task that plots the data that we generated with task_create_random_data. The task is called task_plot_data and we will define it in task_plot_data.py.

To specify that the task relies on the data set data.pkl, you can simply add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments that do not have the Product annotation are dependencies of the task.

from pathlib import Path
from typing import Annotated

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl",
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(path_to_plot)
    plt.close()
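To illustrate the rule that arguments without the Product annotation count as dependencies, here is a minimal sketch of how a tool could read Annotated metadata with the standard library. ProductMarker is a hypothetical stand-in for pytask's Product, and split_arguments is an assumed helper name; this is not pytask's implementation, and the sketch only looks at annotated arguments.

```python
from pathlib import Path
from typing import Annotated, get_args, get_origin, get_type_hints


class ProductMarker:
    """Hypothetical stand-in for pytask's ``Product``; not the real object."""


def split_arguments(func):
    """Return (dependency names, product names) based on the annotations."""
    hints = get_type_hints(func, include_extras=True)
    dependencies, products = [], []
    for name, hint in hints.items():
        if name == "return":
            continue
        # Annotated[Path, ProductMarker] carries the marker in its metadata.
        is_product = (
            get_origin(hint) is Annotated and ProductMarker in get_args(hint)[1:]
        )
        (products if is_product else dependencies).append(name)
    return dependencies, products


def task_plot_data(
    path_to_data: Path = Path("data.pkl"),
    path_to_plot: Annotated[Path, ProductMarker] = Path("plot.png"),
) -> None: ...


dependencies, products = split_arguments(task_plot_data)
print(dependencies, products)
```

In this sketch, path_to_data ends up in the dependencies and path_to_plot in the products, which mirrors the rule stated above.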

To specify that the task relies on the data set data.pkl, you can simply add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments that do not have the Product annotation are dependencies of the task.

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl",
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(path_to_plot)
    plt.close()

To specify that the task relies on the data set data.pkl, you can simply add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments other than produces are dependencies of the task.

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl", produces: Path = BLD / "plot.png"
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(produces)
    plt.close()
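The split between the produces argument and all other arguments can be sketched with inspect.signature. The helper split_by_argument_name is hypothetical and only handles arguments with default values; it illustrates the naming rule and is not how pytask is necessarily implemented.

```python
import inspect
from pathlib import Path


def split_by_argument_name(func):
    """Split default values into dependencies and products by argument name."""
    dependencies, products = {}, {}
    for name, parameter in inspect.signature(func).parameters.items():
        if parameter.default is inspect.Parameter.empty:
            continue  # No default value, nothing to record in this sketch.
        target = products if name == "produces" else dependencies
        target[name] = parameter.default
    return dependencies, products


def task_plot_data(
    path_to_data: Path = Path("bld/data.pkl"), produces: Path = Path("bld/plot.png")
) -> None: ...


dependencies, products = split_by_argument_name(task_plot_data)
```

Only the argument name decides the role here, so renaming path_to_data changes nothing, while renaming produces would turn the plot into a dependency.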

Warning

This approach is deprecated and will be removed in v0.5

As with products, you can use the @pytask.mark.depends_on decorator to specify that data.pkl is a dependency of the task. Use depends_on as a function argument to access the dependency path inside the function and load the data.

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import pytask
from my_project.config import BLD


@pytask.mark.depends_on(BLD / "data.pkl")
@pytask.mark.produces(BLD / "plot.png")
def task_plot_data(depends_on: Path, produces: Path) -> None:
    df = pd.read_pickle(depends_on)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(produces)
    plt.close()

Now, let us execute the two tasks.

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 2 task.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
│ task_plot_data.py::task_plot_data                 │ .       │
└───────────────────────────────────────────────────┴─────────┘

──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  2  Collected tasks            │
│  2  Succeeded        (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.06 seconds ──────────────────────────
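The data task has to run before the plot task because the plot task reads data.pkl. A minimal sketch of how such an order can be derived from declared dependencies and products uses graphlib from the standard library. The dictionary layout and the names tasks, producer_of, and graph are assumptions made for this illustration; they do not describe pytask's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical description of the two tasks: the files they read and write.
tasks = {
    "task_plot_data": {"depends_on": ["bld/data.pkl"], "produces": ["bld/plot.png"]},
    "task_create_random_data": {"depends_on": [], "produces": ["bld/data.pkl"]},
}

# Map every product to the task that creates it.
producer_of = {
    product: name for name, spec in tasks.items() for product in spec["produces"]
}

# A task must run after every task that produces one of its dependencies.
graph = {
    name: {producer_of[dep] for dep in spec["depends_on"] if dep in producer_of}
    for name, spec in tasks.items()
}

execution_order = list(TopologicalSorter(graph).static_order())
print(execution_order)
```

Even though task_plot_data is listed first, the topological sort places task_create_random_data ahead of it.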

Relative paths#

Dependencies and products do not have to be absolute paths. If paths are relative, they are assumed to point to a location relative to the task module.

from pathlib import Path
from typing import Annotated

from pytask import Product


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = Path("../bld/data.pkl")
) -> None:
    ...

from pathlib import Path

from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = Path("../bld/data.pkl")
) -> None:
    ...

from pathlib import Path


def task_create_random_data(produces: Path = Path("../bld/data.pkl")) -> None:
    ...

Warning

This approach is deprecated and will be removed in v0.5

You can also pass absolute and relative paths as strings. They obey the same rules as pathlib.Path objects.

from pathlib import Path

import pytask


@pytask.mark.produces("../bld/data.pkl")
def task_create_random_data(produces: Path) -> None:
    ...

If you use depends_on or produces as arguments for the task function, you will have access to the paths of the targets as pathlib.Path.
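The rule that relative paths point to a location relative to the task module can be sketched with pathlib. The helper resolve_against_module and the module location used below are hypothetical and only illustrate the rule; they are not taken from pytask.

```python
from pathlib import Path


def resolve_against_module(path, module_file):
    """Make ``path`` absolute, treating relative paths as relative to the module's folder."""
    path = Path(path)
    if path.is_absolute():
        return path
    return (Path(module_file).parent / path).resolve()


# Hypothetical location of a task module, used only for this illustration.
module_file = Path.cwd() / "my_project" / "tasks" / "task_example.py"

resolved = resolve_against_module("../bld/data.pkl", module_file)
print(resolved)
```

The leading .. is applied to the folder that contains the task module, so the result ends in my_project/bld/data.pkl regardless of the current working directory.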

Multiple dependencies and products#

Of course, tasks can have multiple dependencies and products.

from pathlib import Path
from typing import Annotated

from my_project.config import BLD
from pytask import Product


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    path_to_plot_0: Annotated[Path, Product] = BLD / "plot_0.png",
    path_to_plot_1: Annotated[Path, Product] = BLD / "plot_1.png",
) -> None:
    ...

You can group your dependencies and products if you prefer not to have one function argument per input. Use dictionaries (recommended), tuples, lists, or more nested structures if you need them.

from pathlib import Path
from typing import Annotated

from my_project.config import BLD
from pytask import Product


_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: dict[str, Path] = _DEPENDENCIES,
    path_to_plots: Annotated[dict[str, Path], Product] = _PRODUCTS,
) -> None:
    ...

from pathlib import Path

from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    path_to_plot_0: Annotated[Path, Product] = BLD / "plot_0.png",
    path_to_plot_1: Annotated[Path, Product] = BLD / "plot_1.png",
) -> None:
    ...

You can group your dependencies and products if you prefer not to have one function argument per input. Use dictionaries (recommended), tuples, lists, or more nested structures if you need them.

from pathlib import Path
from typing import Dict

from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: Dict[str, Path] = _DEPENDENCIES,
    path_to_plots: Annotated[Dict[str, Path], Product] = _PRODUCTS,
) -> None:
    ...

If your task has multiple products, group them in one container such as a dictionary (recommended), a tuple, a list, or a more nested structure.

from pathlib import Path
from typing import Dict

from my_project.config import BLD


_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    produces: Dict[str, Path] = _PRODUCTS,
) -> None:
    ...

You can do the same with dependencies.

from __future__ import annotations

from typing import TYPE_CHECKING

from my_project.config import BLD

if TYPE_CHECKING:
    from pathlib import Path


_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: dict[str, Path] = _DEPENDENCIES,
    produces: dict[str, Path] = _PRODUCTS,
) -> None:
    ...
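Inside the task body, grouped dependencies and products are looked up by their keys. The following sketch calls such a function directly, outside of pytask, with a temporary folder. The task name, the labels, and the text files that stand in for pickles and plots are all hypothetical.

```python
import tempfile
from pathlib import Path


def task_uppercase_texts(
    path_to_data: dict[str, Path],
    produces: dict[str, Path],
) -> None:
    # Each dependency and product is looked up by its label, not by position.
    for label in ("0", "1"):
        text = path_to_data[f"data_{label}"].read_text()
        produces[f"result_{label}"].write_text(text.upper())


with tempfile.TemporaryDirectory() as tmp:
    bld = Path(tmp)  # Hypothetical stand-in for the BLD folder.
    dependencies = {"data_0": bld / "data_0.txt", "data_1": bld / "data_1.txt"}
    products = {"result_0": bld / "result_0.txt", "result_1": bld / "result_1.txt"}

    # Create the input files so that the task has something to read.
    for path in dependencies.values():
        path.write_text(path.stem)

    task_uppercase_texts(dependencies, products)
    results = {label: path.read_text() for label, path in products.items()}
```

Because the paths are addressed by label, adding a third entry to either dictionary does not shift the meaning of the existing ones.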

Warning

This approach is deprecated and will be removed in v0.5

The easiest way to attach multiple dependencies or products to a task is to pass a dict (highly recommended), a list, or another iterable containing the paths to the marker.

To assign labels to dependencies or products, pass a dictionary. For example,

from pathlib import Path
from typing import Dict

import pytask
from my_project.config import BLD


@pytask.mark.produces({"first": BLD / "data_0.pkl", "second": BLD / "data_1.pkl"})
def task_create_random_data(produces: Dict[str, Path]) -> None:
    ...

Then, use produces inside the task function.

>>> produces["first"]
BLD / "data_0.pkl"

>>> produces["second"]
BLD / "data_1.pkl"

You can also use lists and other iterables.

@pytask.mark.produces([BLD / "data_0.pkl", BLD / "data_1.pkl"])
def task_create_random_data(produces):
    ...

Inside the function, the arguments depends_on or produces become a dictionary where keys are the positions in the list.

>>> produces
{0: BLD / "data_0.pkl", 1: BLD / "data_1.pkl"}

Why does pytask recommend dictionaries and convert lists, tuples, or other iterables to dictionaries? First, dictionaries with positions as keys behave very similarly to lists.

Secondly, dictionaries use keys instead of positions. Keys are more verbose and descriptive, and they do not assume a fixed ordering. Both attributes are especially desirable in complex projects.
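The conversion of a list into a dictionary keyed by position can be sketched in a few lines. The helper to_labeled_dict is hypothetical and only mirrors the behavior described here; it is not pytask's code.

```python
from pathlib import Path


def to_labeled_dict(paths):
    """Return dictionaries unchanged and key other iterables by position."""
    if isinstance(paths, dict):
        return dict(paths)
    return dict(enumerate(paths))


produces = to_labeled_dict([Path("data_0.pkl"), Path("data_1.pkl")])

# Indexing by position works the same way as it would with the original list.
first_product = produces[0]
```

A dictionary passed to the helper keeps its labels, so both styles can be handled uniformly afterwards.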

Multiple decorators

pytask merges multiple decorators of one kind into a single dictionary. This might help you to group dependencies and apply them to multiple tasks.

common_dependencies = pytask.mark.depends_on(
    {"first_text": "text_1.txt", "second_text": "text_2.txt"}
)


@common_dependencies
@pytask.mark.depends_on("text_3.txt")
def task_example(depends_on):
    ...

Inside the task, depends_on will be

>>> depends_on
{"first_text": ... / "text_1.txt", "second_text": ... / "text_2.txt", 0: ... / "text_3.txt"}
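How several markers could end up in one dictionary can be sketched as follows. The function merge_markers is hypothetical, and details such as the handling of duplicate keys or the order in which decorators are processed are assumptions of this sketch rather than documented pytask behavior.

```python
def merge_markers(*markers):
    """Merge dictionaries, single paths, and lists of paths into one dictionary."""
    merged = {}
    position = 0
    for marker in markers:
        if isinstance(marker, dict):
            # Labeled entries keep their keys.
            merged.update(marker)
        else:
            # Unlabeled entries receive the next free position as their key.
            items = [marker] if isinstance(marker, str) else list(marker)
            for item in items:
                merged[position] = item
                position += 1
    return merged


depends_on = merge_markers(
    {"first_text": "text_1.txt", "second_text": "text_2.txt"},
    "text_3.txt",
)
```

The sketch works on plain strings and skips the conversion to absolute paths.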

Nested dependencies and products

Dependencies and products can be nested containers consisting of tuples, lists, and dictionaries. This is useful if you want more structure and nesting.

Here is an example of a task that fits a model to data. It depends on a module containing the code for the model, which is not actively used inside the task but ensures that the task is rerun when the model changes. It also depends on the data.

@pytask.mark.depends_on(
    {
        "model": [SRC / "models" / "model.py"],
        "data": {"a": SRC / "data" / "a.pkl", "b": SRC / "data" / "b.pkl"},
    }
)
@pytask.mark.produces(BLD / "models" / "fitted_model.pkl")
def task_fit_model(depends_on, produces):
    ...

depends_on within the function will be

{
    "model": [SRC / "models" / "model.py"],
    "data": {"a": SRC / "data" / "a.pkl", "b": SRC / "data" / "b.pkl"},
}
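To see which files such a nested structure refers to, one can walk it and collect the leaves. The generator iter_leaves is a hypothetical helper written for this illustration, and SRC is replaced by a stand-in path; neither is taken from pytask.

```python
from pathlib import Path


def iter_leaves(node):
    """Yield every leaf of nested dicts, lists, and tuples, depth first."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_leaves(value)
    elif isinstance(node, (list, tuple)):
        for value in node:
            yield from iter_leaves(value)
    else:
        yield node


SRC = Path("src")  # Hypothetical stand-in for the project's SRC constant.

depends_on = {
    "model": [SRC / "models" / "model.py"],
    "data": {"a": SRC / "data" / "a.pkl", "b": SRC / "data" / "b.pkl"},
}

all_paths = list(iter_leaves(depends_on))
```

The nesting only organizes access inside the task; every leaf is still an individual file the task depends on.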

Depending on a task#

In some situations you want to define a task depending on another task without specifying the relationship explicitly.

pytask allows you to do that, but you lose features like access to paths, which is why defining dependencies explicitly is always preferred.

There are two modes for it and both use @task(after=...).

First, you can pass the task function or multiple task functions to the decorator. Applied to the tasks from before, we could have written task_plot_data as

@task(after=task_create_random_data)
def task_plot_data(...):
    ...

You can also pass a list of task functions.

The second mode is to pass an expression, a substring of the names of the tasks that should run first. Here, we can pass the function name or a significant part of the function name.

@task(after="random_data")
def task_plot_data(...):
    ...

You will learn more about expressions in Selecting tasks.
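The substring matching used in the second mode can be pictured with a short sketch. The helper select_by_substring is hypothetical and covers only plain substrings; expressions in pytask can be richer than this, as described in Selecting tasks.

```python
def select_by_substring(expression, task_names):
    """Return the task names that contain ``expression`` (plain substring match)."""
    return [name for name in task_names if expression in name]


task_names = ["task_create_random_data", "task_plot_data"]

# Tasks matching the expression are the ones that should run first.
preceding_tasks = select_by_substring("random_data", task_names)
```

A very short expression such as "data" would match both task names, so a significant part of the function name is the safer choice.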

References#