Defining dependencies and products#
To ensure pytask executes all tasks in the correct order, define which dependencies are required and which products are produced by a task.
Important
If you do not specify dependencies and products as explained below, pytask will not be able to build a graph, a DAG, and will not be able to execute all tasks in the project correctly!
Products#
Let’s revisit the task from the previous tutorial.
@pytask.mark.produces(BLD / "data.pkl")
def task_create_random_data(produces):
...
The @pytask.mark.produces
marker attaches a
product to a task which is a pathlib.Path
to file. After the task has finished,
pytask will check whether the file exists.
Optionally, you can use produces
as an argument of the task function and get access to
the same path inside the task function.
Dependencies#
Most tasks have dependencies. Like products, you can use the
@pytask.mark.depends_on
marker to attach a
dependency to a task.
@pytask.mark.depends_on(BLD / "data.pkl")
@pytask.mark.produces(BLD / "plot.png")
def task_plot_data(depends_on, produces):
df = pd.read_pickle(depends_on)
...
Use depends_on
as a function argument to work with the dependency path and, for
example, load the data.
Conversion#
Dependencies and products do not have to be absolute paths. If paths are relative, they are assumed to point to a location relative to the task module.
You can also use absolute and relative paths as strings that obey the same rules as the
pathlib.Path
.
@pytask.mark.produces("../bld/data.pkl")
def task_create_random_data(produces):
...
If you use depends_on
or produces
as arguments for the task function, you will have
access to the paths of the targets as pathlib.Path
.
Multiple dependencies and products#
The easiest way to attach multiple dependencies or products to a task is to pass a
dict
(highly recommended), list
, or another iterator to the marker
containing the paths.
To assign labels to dependencies or products, pass a dictionary. For example,
@pytask.mark.produces({"first": BLD / "data_0.pkl", "second": BLD / "data_1.pkl"})
def task_create_random_data(produces):
...
Then, use produces
inside the task function.
>>> produces["first"]
BLD / "data_0.pkl"
>>> produces["second"]
BLD / "data_1.pkl"
You can also use lists and other iterables.
@pytask.mark.produces([BLD / "data_0.pkl", BLD / "data_1.pkl"])
def task_create_random_data(produces):
...
Inside the function, the arguments depends_on
or produces
become a dictionary where
keys are the positions in the list.
>>> produces
{0: BLD / "data_0.pkl", 1: BLD / "data_1.pkl"}
Why does pytask recommend dictionaries and convert lists, tuples, or other iterators to dictionaries? First, dictionaries with positions as keys behave very similarly to lists.
Secondly, dictionaries use keys instead of positions that are more verbose and descriptive and do not assume a fixed ordering. Both attributes are especially desirable in complex projects.
Multiple decorators#
pytask merges multiple decorators of one kind into a single dictionary. This might help you to group dependencies and apply them to multiple tasks.
common_dependencies = pytask.mark.depends_on(
{"first_text": "text_1.txt", "second_text": "text_2.txt"}
)
@common_dependencies
@pytask.mark.depends_on("text_3.txt")
def task_example(depends_on):
...
Inside the task, depends_on
will be
>>> depends_on
{"first_text": ... / "text_1.txt", "second_text": "text_2.txt", 0: "text_3.txt"}
Nested dependencies and products#
Dependencies and products can be nested containers consisting of tuples, lists, and dictionaries. It is beneficial if you want more structure and nesting.
Here is an example of a task that fits some model on data. It depends on a module containing the code for the model, which is not actively used but ensures that the task is rerun when the model is changed. And it depends on the data.
@pytask.mark.depends_on(
{
"model": [SRC / "models" / "model.py"],
"data": {"a": SRC / "data" / "a.pkl", "b": SRC / "data" / "b.pkl"},
}
)
@pytask.mark.produces(BLD / "models" / "fitted_model.pkl")
def task_fit_model(depends_on, produces):
...
depends_on
within the function will be
{
"model": [SRC / "models" / "model.py"],
"data": {"a": SRC / "data" / "a.pkl", "b": SRC / "data" / "b.pkl"},
}