Write a task

Starting from the project structure in the previous tutorial, you will learn how to write your first task.

The task, task_create_random_data, will be defined in src/my_project/task_data_preparation.py, and it will generate a data set stored in bld/data.pkl.

The task_ prefix for modules and task functions is important so that pytask automatically discovers them.

my_project
│
├───.pytask
│
├───bld
│   └────data.pkl
│
├───src
│   └───my_project
│       ├────__init__.py
│       ├────config.py
│       └────task_data_preparation.py
│
└───pyproject.toml

Generally, a task is a function whose name starts with task_. Tasks produce outputs, and the most common output is a file, which is what these tutorials focus on.
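To make the naming convention concrete, here is a minimal sketch of prefix-based discovery. It is not pytask's implementation: the helper collect_task_functions and the example names are hypothetical and only illustrate the idea of selecting callables by the task_ prefix.

```python
def task_create_random_data():
    """Would be picked up: the name starts with ``task_``."""


def helper():
    """Would be ignored: the name lacks the prefix."""


def collect_task_functions(namespace, prefix="task_"):
    """Return the callables in ``namespace`` whose names start with ``prefix``."""
    return {
        name: obj
        for name, obj in namespace.items()
        if name.startswith(prefix) and callable(obj)
    }


# Hypothetical module contents: two functions and a non-callable value.
candidates = {
    "task_create_random_data": task_create_random_data,
    "helper": helper,
    "task_count": 3,
}
found = collect_task_functions(candidates)
print(sorted(found))
```

Only the prefixed function is selected; the unprefixed helper and the non-callable value are skipped.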

The following interfaces are different ways to specify the products of a task, which pytask needs in order to run a workflow correctly. The interfaces are ordered from most recommended (first) to least recommended (last).

Important

You cannot mix different interfaces for the same task. Choose only one.

Python 3.10+

The task accepts the argument path that points to the file where the data set will be stored. The path is passed to the task via the default value, BLD / "data.pkl". To indicate that this file is a product, we add some metadata to the argument.

Look at the type hint Annotated[Path, Product]. It uses the Annotated syntax. The first entry is the type of the argument, Path. The second entry is Product, which marks this argument as a product.

# Content of task_data_preparation.py.
from pathlib import Path
from typing import Annotated

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_create_random_data(path: Annotated[Path, Product] = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path)

Tip

If you want to refresh your knowledge about type hints, read this guide.
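As a rough illustration of how metadata inside Annotated can be read back at runtime, the sketch below uses only the standard library. ProductMarker and find_marked_arguments are hypothetical stand-ins introduced for this example; this is not how pytask is implemented, only one way such a marker could be detected.

```python
from pathlib import Path
from typing import Annotated, get_type_hints


class ProductMarker:
    """Hypothetical stand-in for a marker such as pytask's Product."""


def task_example(path: Annotated[Path, ProductMarker] = Path("data.pkl")) -> None:
    """Example function whose ``path`` argument carries the marker."""


def find_marked_arguments(func, marker):
    """Return argument names whose Annotated metadata contains ``marker``."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    marked = []
    for name, hint in hints.items():
        # Annotated types expose their extra entries as a tuple.
        metadata = getattr(hint, "__metadata__", ())
        if marker in metadata:
            marked.append(name)
    return marked


print(find_marked_arguments(task_example, ProductMarker))
```

The first entry of Annotated stays the plain type for type checkers, while the extra entries are available to tools that inspect the function.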

Python 3.8+

# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(path: Annotated[Path, Product] = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path)


produces

Tasks can use produces as an argument name. Every value, or in this case path, passed to this argument is automatically treated as a task product. Here, the path is given by the default value of the argument.

# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD


def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)
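The idea of reading a product from an argument's default value can be sketched with inspect.signature. The helper default_of and the example path are hypothetical; pytask's own collection logic may work differently.

```python
import inspect
from pathlib import Path


def task_example(produces: Path = Path("bld") / "data.pkl") -> None:
    """Example task whose product is given by the default of ``produces``."""


def default_of(func, argument_name):
    """Return the default value of ``argument_name``, or None if there is none."""
    parameter = inspect.signature(func).parameters.get(argument_name)
    if parameter is None or parameter.default is inspect.Parameter.empty:
        return None
    return parameter.default


print(default_of(task_example, "produces"))
```

Because the default is an ordinary Python value, a tool can read it without calling the function.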

Decorators

Warning

This approach is deprecated and will be removed in v0.5.

To specify a product, pass the path to the @pytask.mark.produces decorator. Then, add produces as an argument name to use the path inside the task function.

# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
import pytask
from my_project.config import BLD


@pytask.mark.produces(BLD / "data.pkl")
def task_create_random_data(produces: Path) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)

To let pytask track the product of the task, you need to use the @pytask.mark.produces decorator.

Now, execute pytask to collect and run the tasks in the current directory and its subdirectories.

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 1 task.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
└───────────────────────────────────────────────────┴─────────┘

──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  1  Collected tasks            │
│  1  Succeeded        (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.06 seconds ──────────────────────────

Customize task names

Use the @task decorator to mark a function as a task regardless of its function name. You can optionally pass a new name for the task. Otherwise, pytask uses the function name.

from pytask import task

# The id will be ".../task_data_preparation.py::create_random_data".

@task
def create_random_data():
    ...

# The id will be ".../task_data_preparation.py::create_data".

@task(name="create_data")
def create_random_data():
    ...

Warning

Since v0.4, users should use @task instead of @pytask.mark.task, which will be removed in v0.5.
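A decorator that works both bare and with an optional name can be sketched as follows. The decorator mark_task and the attribute task_name are hypothetical and exist only for this illustration; they are not part of pytask and do not describe how @task is implemented.

```python
def mark_task(func=None, *, name=None):
    """Hypothetical decorator usable as ``@mark_task`` or ``@mark_task(name=...)``."""

    def decorate(function):
        # Fall back to the function's own name when no name is given.
        function.task_name = name or function.__name__
        return function

    if func is None:
        # Called with arguments: return the real decorator.
        return decorate
    # Used bare: decorate the function directly.
    return decorate(func)


@mark_task
def create_random_data():
    """Keeps its function name as the task name."""


@mark_task(name="create_data")
def create_other_data():
    """Gets the explicitly passed task name."""


print(create_random_data.task_name, create_other_data.task_name)
```

The same pattern explains why the name is optional: without it, the function name is the natural default.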

Customize task module names

Use the configuration value task_files if you prefer a different naming scheme for the task modules. task_*.py is the default. You can specify one or multiple patterns to collect tasks from other files.
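To get a feel for which file names a pattern such as task_*.py selects, the sketch below uses the standard library's fnmatch. The helper is_task_module and the extra pattern prepare_*.py are hypothetical, and pytask's own matching may differ in its details.

```python
from fnmatch import fnmatch


def is_task_module(filename, patterns=("task_*.py",)):
    """Return True if ``filename`` matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# With only the default pattern, unprefixed modules are skipped.
print(is_task_module("task_data_preparation.py"))
print(is_task_module("config.py"))

# Adding a second, hypothetical pattern widens the selection.
print(is_task_module("prepare_data.py", patterns=("task_*.py", "prepare_*.py")))
```

Passing several patterns mirrors the option of specifying more than one value for task_files.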