Defining dependencies and products

Tasks have dependencies and products that you must define to run your tasks.

Defining dependencies and products also serves another purpose. By analyzing them, pytask determines the order to run the tasks.

This tutorial presents several interfaces. If you are comfortable with type annotations or willing to try them, look at the Python 3.10+ or 3.8+ tabs. You can find a tutorial on type hints here.

If you want to avoid type annotations for now, look at the tab named produces.

See also

In this tutorial, we only deal with local files. If you want to use pytask with files online, S3, GCP, Azure, etc., read the guide on remote files.

First, we focus on defining products, which should already be familiar to you. Then, we look at how you can declare task dependencies.

We use the same project as before and add a task_plot_data.py module.

my_project
│
├───.pytask
│
├───bld
│   ├────data.pkl
│   └────plot.png
│
├───src
│   └───my_project
│       ├────__init__.py
│       ├────config.py
│       ├────task_data_preparation.py
│       └────task_plot_data.py
│
└───pyproject.toml

Products

Let’s revisit the task from the previous tutorial that we defined in task_data_preparation.py.

from pathlib import Path
from typing import Annotated

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = BLD / "data.pkl",
) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path_to_data)

Product allows marking an argument as a product. After the task has finished, pytask will check whether the file exists.
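To make the idea of a marker inside Annotated more concrete, here is a minimal sketch of how such a marker can be discovered at runtime. The Product class below is a hypothetical stand-in defined locally so that the snippet runs without pytask, and find_product_arguments is an illustrative helper, not part of pytask's API. pytask's actual collection logic is likely more involved.

```python
from pathlib import Path
from typing import Annotated, get_type_hints


class Product:
    """Hypothetical stand-in for pytask's ``Product`` marker."""


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = Path("bld") / "data.pkl",
) -> None: ...


def find_product_arguments(func) -> list[str]:
    """Return the names of arguments whose annotation carries the marker."""
    # include_extras=True keeps the metadata that Annotated attaches to a type.
    hints = get_type_hints(func, include_extras=True)
    return [
        name
        for name, hint in hints.items()
        if Product in getattr(hint, "__metadata__", ())
    ]


print(find_product_arguments(task_create_random_data))  # ['path_to_data']
```

The argument name itself carries no meaning here. Only the annotation decides that path_to_data is a product.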

from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = BLD / "data.pkl",
) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path_to_data)

Product allows marking an argument as a product. After the task has finished, pytask will check whether the file exists.

from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD


def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)

Tasks can use produces as a “magic” argument name. Every value, or in this case path, passed to this argument is automatically treated as a task product. Here, we pass the path as the default argument.

Tip

If you do not know about pathlib, check out this guide by Real Python. The module makes handling paths convenient and portable across platforms.

Dependencies

Adding a dependency to a task ensures that the dependency is available before execution.

To show how dependencies work, we extend our project with another task that plots the data generated with task_create_random_data. The task is called task_plot_data, and we will define it in task_plot_data.py.

To specify that the task relies on the data set data.pkl, you can add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments that do not have a Product annotation are dependencies of the task.

from pathlib import Path
from typing import Annotated

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl",
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(path_to_plot)
    plt.close()

To specify that the task relies on the data set data.pkl, you can add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments that do not have the Product annotation are dependencies of the task.

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl",
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(path_to_plot)
    plt.close()

To specify that the task relies on the data set data.pkl, you can add the path to the function signature while choosing any argument name, here path_to_data.

pytask assumes that all function arguments that are not passed to the argument produces are dependencies of the task.

from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl", produces: Path = BLD / "plot.png"
) -> None:
    df = pd.read_pickle(path_to_data)

    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")

    plt.savefig(produces)
    plt.close()
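The rule for the produces interface can be sketched with the standard library alone. The snippet below is an illustration under assumptions, not pytask's implementation: BLD is a local stand-in for my_project.config.BLD, split_arguments is a hypothetical helper, and only arguments with default values are considered.

```python
import inspect
from pathlib import Path

BLD = Path("bld")  # Hypothetical stand-in for my_project.config.BLD.


def task_plot_data(
    path_to_data: Path = BLD / "data.pkl", produces: Path = BLD / "plot.png"
) -> None: ...


def split_arguments(func):
    """Split default argument values into (dependencies, products) by name."""
    dependencies, products = {}, {}
    for name, parameter in inspect.signature(func).parameters.items():
        if parameter.default is inspect.Parameter.empty:
            continue  # This sketch only looks at arguments with defaults.
        if name == "produces":
            products[name] = parameter.default
        else:
            dependencies[name] = parameter.default
    return dependencies, products


dependencies, products = split_arguments(task_plot_data)
print(dependencies)
print(products)
```

Everything passed to produces ends up in products, and the remaining argument, path_to_data, is classified as a dependency.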

Now, let us execute the two tasks.

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.3.0
Root: C:\Users\pytask-dev\git\my_project
Collected 2 tasks.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
│ task_plot_data.py::task_plot_data                 │ .       │
└───────────────────────────────────────────────────┴─────────┘

──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  2  Collected tasks            │
│  2  Succeeded        (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.06 seconds ──────────────────────────
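As stated at the beginning, pytask derives the execution order by analyzing dependencies and products. The following sketch shows the general technique, a topological sort, using graphlib from the standard library (Python 3.9+). The TASKS dictionary and the execution_order helper are hypothetical and only summarize the two tasks of this tutorial. pytask's internal data structures are not shown here.

```python
from graphlib import TopologicalSorter

# Hypothetical summary of the two tasks: the files each one reads and writes.
TASKS = {
    "task_plot_data": {
        "depends_on": ["bld/data.pkl"],
        "produces": ["bld/plot.png"],
    },
    "task_create_random_data": {
        "depends_on": [],
        "produces": ["bld/data.pkl"],
    },
}


def execution_order(tasks):
    """Order tasks so that every file is produced before it is consumed."""
    # Map each product to the task that creates it.
    producer_of = {
        product: name for name, spec in tasks.items() for product in spec["produces"]
    }
    # For each task, collect the tasks that must run before it.
    graph = {
        name: {producer_of[dep] for dep in spec["depends_on"] if dep in producer_of}
        for name, spec in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(TASKS))
# ['task_create_random_data', 'task_plot_data']
```

Although task_plot_data is listed first in the dictionary, it is ordered second because it depends on data.pkl, which task_create_random_data produces.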

Relative paths

Dependencies and products do not have to be absolute paths. If paths are relative, they are assumed to point to a location relative to the task module.

from pathlib import Path
from typing import Annotated

from pytask import Product


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = Path("../bld/data.pkl"),
) -> None: ...

from pathlib import Path

from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(
    path_to_data: Annotated[Path, Product] = Path("../bld/data.pkl"),
) -> None: ...

from pathlib import Path


def task_create_random_data(produces: Path = Path("../bld/data.pkl")) -> None: ...
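The resolution rule for relative paths can be illustrated as follows. This is a sketch under assumptions: resolve_against_module is a hypothetical helper, the module location /project/src/task_data_preparation.py is made up for the example, and PurePosixPath is used only so that the result is the same on every platform. It is not how pytask resolves paths internally.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_against_module(path, task_module):
    """Interpret a relative path as relative to the task module's directory."""
    path = PurePosixPath(path)
    if path.is_absolute():
        return path
    joined = PurePosixPath(task_module).parent / path
    # normpath collapses ".." segments without touching the file system.
    return PurePosixPath(posixpath.normpath(joined))


resolved = resolve_against_module(
    "../bld/data.pkl", "/project/src/task_data_preparation.py"
)
print(resolved)  # /project/bld/data.pkl
```

Note that the starting point is the directory of the task module, not the current working directory from which pytask is invoked.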

Multiple dependencies and products

Of course, tasks can have multiple dependencies and products.

from pathlib import Path
from typing import Annotated

from my_project.config import BLD
from pytask import Product


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    path_to_plot_0: Annotated[Path, Product] = BLD / "plot_0.png",
    path_to_plot_1: Annotated[Path, Product] = BLD / "plot_1.png",
) -> None: ...

You can group your dependencies and products if you prefer not to have a function argument per input. Use dictionaries (recommended), tuples, lists, or more nested structures if needed.

from pathlib import Path
from typing import Annotated

from my_project.config import BLD
from pytask import Product

_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: dict[str, Path] = _DEPENDENCIES,
    path_to_plots: Annotated[dict[str, Path], Product] = _PRODUCTS,
) -> None: ...

from pathlib import Path

from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    path_to_plot_0: Annotated[Path, Product] = BLD / "plot_0.png",
    path_to_plot_1: Annotated[Path, Product] = BLD / "plot_1.png",
) -> None: ...

You can group your dependencies and products if you prefer not to have a function argument per input. Use dictionaries (recommended), tuples, lists, or more nested structures if needed.

from pathlib import Path
from typing import Dict

from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated

_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: Dict[str, Path] = _DEPENDENCIES,
    path_to_plots: Annotated[Dict[str, Path], Product] = _PRODUCTS,
) -> None: ...

If your task has multiple products, group them in one container such as a dictionary (recommended), a tuple, a list, or a more nested structure.

from pathlib import Path
from typing import Dict

from my_project.config import BLD

_PRODUCTS = {"first": BLD / "plot_0.png", "second": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data_0: Path = BLD / "data_0.pkl",
    path_to_data_1: Path = BLD / "data_1.pkl",
    produces: Dict[str, Path] = _PRODUCTS,
) -> None: ...

You can do the same with dependencies.

from __future__ import annotations

from typing import TYPE_CHECKING

from my_project.config import BLD

if TYPE_CHECKING:
    from pathlib import Path


_DEPENDENCIES = {"data_0": BLD / "data_0.pkl", "data_1": BLD / "data_1.pkl"}
_PRODUCTS = {"plot_0": BLD / "plot_0.png", "plot_1": BLD / "plot_1.png"}


def task_plot_data(
    path_to_data: dict[str, Path] = _DEPENDENCIES,
    produces: dict[str, Path] = _PRODUCTS,
) -> None: ...
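Grouping paths in dictionaries, tuples, lists, or nested combinations works because such containers can be walked recursively to reach every path inside. The sketch below shows one way to do that. iter_paths and the example structure are hypothetical, and pytask's own traversal may differ.

```python
from pathlib import Path


def iter_paths(structure):
    """Yield every leaf of a nested structure of dicts, lists, and tuples."""
    if isinstance(structure, dict):
        for value in structure.values():
            yield from iter_paths(value)
    elif isinstance(structure, (list, tuple)):
        for item in structure:
            yield from iter_paths(item)
    else:
        # Anything that is not a container is treated as a leaf, here a path.
        yield structure


# Hypothetical nested group of products.
products = {
    "plots": [Path("plot_0.png"), Path("plot_1.png")],
    "tables": (Path("summary.csv"),),
}

print([path.name for path in iter_paths(products)])
# ['plot_0.png', 'plot_1.png', 'summary.csv']
```

Dictionaries have the advantage that each path keeps a descriptive key, which is presumably why they are the recommended container.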

Depending on a task

In some situations, you want to define a task depending on another task.

pytask allows you to do that, but you lose features like access to paths, which is why defining dependencies explicitly is always preferred.

There are two modes for it, and both use @task(after=...).

First, you can pass the task function or multiple task functions to the decorator. Applied to the tasks from before, we could have written task_plot_data as

@task(after=task_create_random_data)
def task_plot_data(...):
    ...

You can also pass a list of task functions.

The second mode is to pass an expression, for example a substring of the names of the tasks that should run first. Here, we can pass the function name or a distinctive part of the function name.

@task(after="random_data")
def task_plot_data(...):
    ...
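In its simplest form, matching an expression against task names is a substring test. The sketch below illustrates only that simple case. select_by_expression is a hypothetical helper, and the expressions pytask accepts may be richer than plain substrings.

```python
def select_by_expression(expression, task_names):
    """Return the task names that contain the expression as a substring."""
    return [name for name in task_names if expression in name]


TASK_NAMES = ["task_create_random_data", "task_plot_data"]

print(select_by_expression("random_data", TASK_NAMES))
# ['task_create_random_data']
```

A short expression such as "data" would match both tasks in this project, so pick a part of the name that identifies only the intended tasks.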

You will learn more about expressions in Selecting tasks.

Further reading

  • There is an additional way to specify products by treating the returns of a task function as a product. Read Using task returns to learn more about it.

  • An overview of all ways to specify dependencies and products and their strengths and weaknesses can be found in Interfaces for dependencies and products.