Write a task#
Starting from the project structure in the previous tutorial, you will learn how to write your first task.
The task, task_create_random_data, will be defined in src/my_project/task_data_preparation.py, and it will generate a data set stored in bld/data.pkl.
The task_ prefix for modules and task functions is important so that pytask automatically discovers them.
my_project
│
├───.pytask
│
├───bld
│   └────data.pkl
│
├───src
│   └───my_project
│       ├────__init__.py
│       ├────config.py
│       └────task_data_preparation.py
│
└───pyproject.toml
Generally, a task is a function whose name starts with task_. Tasks produce outputs, and the most common output is a file, which we will focus on throughout the tutorials.
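To make the naming convention concrete, here is a minimal sketch of prefix-based discovery. It is not pytask's actual collection code. The helper collect_task_functions and the function create_report are hypothetical names used only for illustration.

```python
def collect_task_functions(namespace: dict) -> dict:
    """Return the callables whose name starts with ``task_`` (hypothetical helper)."""
    return {
        name: obj
        for name, obj in namespace.items()
        if name.startswith("task_") and callable(obj)
    }


def task_create_random_data() -> None:
    """Would be collected: the name starts with task_."""


def create_report() -> None:
    """Would be skipped: the name lacks the task_ prefix."""


task_count = 3  # Skipped as well: right prefix, but not callable.

collected = collect_task_functions(
    {
        "task_create_random_data": task_create_random_data,
        "create_report": create_report,
        "task_count": task_count,
    }
)
print(sorted(collected))  # ['task_create_random_data']
```

Only the function with the task_ prefix survives the filter, which is why renaming a task function without the prefix makes it invisible unless it is marked in another way.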
The following interfaces are different ways to specify the products of a task, which pytask needs in order to run a workflow correctly. The interfaces are ordered from most recommended (first) to least recommended (last).
Important
You cannot mix different interfaces for the same task. Choose only one.
The task accepts the argument path that points to the file where the data set will be stored. The path is passed to the task via the default value, BLD / "data.pkl". To indicate that this file is a product, we add some metadata to the argument.
Look at the type hint Annotated[Path, Product]. It uses the Annotated syntax. The first entry is the type of the argument, Path. The second entry is Product, which marks this argument as a product.
# Content of task_data_preparation.py.
from pathlib import Path
from typing import Annotated

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_create_random_data(path: Annotated[Path, Product] = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path)
Tip
If you want to refresh your knowledge about type hints, read this guide.
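The following sketch shows how metadata attached with Annotated can be read back at runtime using only the standard library (Python 3.9 or later). It illustrates the general mechanism, not pytask's implementation. ProductMarker, task_example, and find_marked_arguments are hypothetical names, and Path("bld") stands in for BLD.

```python
from pathlib import Path
from typing import Annotated, get_type_hints


class ProductMarker:
    """Hypothetical stand-in for a marker such as pytask's Product."""


def task_example(
    path: Annotated[Path, ProductMarker] = Path("bld") / "data.pkl",
    seed: int = 0,
) -> None:
    ...


def find_marked_arguments(func, marker) -> list[str]:
    """Return the names of arguments whose Annotated metadata contains marker."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    return [
        name
        for name, hint in hints.items()
        if marker in getattr(hint, "__metadata__", ())
    ]


marked = find_marked_arguments(task_example, ProductMarker)
print(marked)  # ['path']
```

The plain int annotation on seed carries no metadata, so only path is reported. This is what allows a marker in the type hint to single out product arguments without changing how the function is called.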
On Python versions where typing.Annotated is not available (before Python 3.9), import Annotated from the typing_extensions package instead. The path argument, its default value BLD / "data.pkl", and the type hint Annotated[Path, Product] work exactly as described above.
# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product
from typing_extensions import Annotated


def task_create_random_data(path: Annotated[Path, Product] = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path)
Tasks can use produces as an argument name. Every value, or in this case path, passed to this argument is automatically treated as a task product. Here, the path is given by the default value of the argument.
# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD


def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)
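Because the product path lives in the default value of the produces argument, it can be read from the function signature without calling the function. The sketch below shows this with the standard library's inspect module. It is only an illustration of the idea, not pytask's code, and Path("bld") is a stand-in for BLD from the project's config module.

```python
import inspect
from pathlib import Path


# Path("bld") stands in for BLD so that the example is self-contained.
def task_create_random_data(produces: Path = Path("bld") / "data.pkl") -> None:
    ...


# Look up the parameter by its special name and read its default value.
parameter = inspect.signature(task_create_random_data).parameters["produces"]
product_path = parameter.default
print(product_path)
```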
Warning
This approach is deprecated and will be removed in v0.5.
To specify a product, pass the path to the @pytask.mark.produces decorator. Then, add produces as an argument name to use the path inside the task function.
# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
import pytask
from my_project.config import BLD


@pytask.mark.produces(BLD / "data.pkl")
def task_create_random_data(produces: Path) -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)
To let pytask track the product of the task, you need to use the @pytask.mark.produces decorator.
Now, run pytask. It collects the tasks in the current directory and its subdirectories and executes them.
$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 1 task.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
└───────────────────────────────────────────────────┴─────────┘
──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  1  Collected tasks            │
│  1  Succeeded        (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.06 seconds ──────────────────────────
Customize task names#
Use the @task decorator to mark a function as a task regardless of its function name. You can optionally pass a new name for the task. Otherwise, pytask uses the function name.
from pytask import task


# The id will be ".../task_data_preparation.py::create_random_data".
@task
def create_random_data():
    ...


# The id will be ".../task_data_preparation.py::create_data".
@task(name="create_data")
def create_random_data():
    ...
Warning
Since v0.4, users should use @task over @pytask.mark.task, which will be removed in v0.5.
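A decorator that works both bare (@task) and with arguments (@task(name=...)) follows a common Python pattern. The sketch below shows that pattern with a hypothetical stand-in called mark_task. It is not pytask's implementation, and the task_name attribute is made up for the example.

```python
from typing import Callable, Optional


def mark_task(func: Optional[Callable] = None, *, name: Optional[str] = None):
    """Hypothetical decorator usable as @mark_task or @mark_task(name=...)."""

    def decorate(f: Callable) -> Callable:
        # Fall back to the function name when no explicit name is given.
        f.task_name = name or f.__name__
        return f

    # Bare usage passes the function directly. Called usage passes no
    # function, so the decorator itself is returned to be applied next.
    return decorate(func) if func is not None else decorate


@mark_task
def create_random_data():
    ...


@mark_task(name="create_data")
def create_more_data():
    ...


print(create_random_data.task_name)  # create_random_data
print(create_more_data.task_name)  # create_data
```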
Customize task module names#
Use the configuration value task_files if you prefer a different naming scheme for the task modules. task_*.py is the default. You can specify one or multiple patterns to collect tasks from other files.
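As a rough illustration of how such patterns select files, the sketch below filters file names with Python's fnmatch module. The second pattern, tasks_*.py, and the file name tasks_plots.py are made up for the example, and pytask's own matching may differ in detail.

```python
from fnmatch import fnmatchcase

# "task_*.py" is the default named above; "tasks_*.py" is a hypothetical addition.
patterns = ["task_*.py", "tasks_*.py"]

file_names = [
    "task_data_preparation.py",
    "tasks_plots.py",
    "config.py",
    "__init__.py",
]

# Keep every file name that matches at least one pattern.
matching_files = [
    file_name
    for file_name in file_names
    if any(fnmatchcase(file_name, pattern) for pattern in patterns)
]
print(matching_files)  # ['task_data_preparation.py', 'tasks_plots.py']
```

With only the default pattern, tasks_plots.py would not match, because the character after "task" is "s" rather than an underscore.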