Hashing inputs of tasks

Any input to a task function is parsed by one of pytask’s nodes. For example, pathlib.Path objects are parsed by PathNode, which, among other things, determines how changes in the underlying file are detected.

If an input is not parsed by any more specific node type, the general PythonNode is used.

In the following example, the argument text will be parsed as a PythonNode.

from pathlib import Path
from typing import Annotated

from pytask import Product


def task_example(
    text: str = "Hello, World", path: Annotated[Path, Product] = Path("file.txt")
) -> None:
    path.write_text(text)

By default, pytask does not detect changes in a PythonNode. If the value changes while the task module stays the same, pytask does not rerun the task.

We can also hash the value of a PythonNode so that pytask knows when the input has changed. To do so, use PythonNode explicitly and set hash=True.

from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import PythonNode


def task_example(
    text: Annotated[str, PythonNode(value="Hello, World", hash=True)],
    path: Annotated[Path, Product] = Path("file.txt"),
) -> None:
    path.write_text(text)

When hash=True, pytask calls the built-in hash() on the input, which in turn calls the object's __hash__() method.
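The delegation from hash() to __hash__() can be seen in plain Python, independent of pytask. The following is a minimal sketch; the class Config and its fields are hypothetical and exist only for this illustration.

```python
class Config:
    """Hypothetical value object, used only for illustration."""

    def __init__(self, name: str, retries: int) -> None:
        self.name = name
        self.retries = retries

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Config):
            return NotImplemented
        return (self.name, self.retries) == (other.name, other.retries)

    def __hash__(self) -> int:
        # The built-in hash() delegates to this method.
        return hash((self.name, self.retries))


a = Config("download", 3)
b = Config("download", 3)

# Equal objects yield equal hashes within one interpreter session.
same_hash = hash(a) == hash(b)
# Calling hash() gives the same result as calling __hash__() directly.
delegates = hash(a) == a.__hash__()
```

Because one of the fields is a str, the resulting integer is only comparable within a single interpreter session.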

Some objects, such as tuple and typing.NamedTuple, are hashable and return correct hashes by default, provided all of their elements are hashable.

>>> hash((1, 2))
-3550055125485641917

str and bytes are special. They are hashable, but for security reasons the hash changes between interpreter sessions (see object.__hash__() for more information). pytask hashes them with the hashlib module instead to create a stable hash.

>>> from pytask import PythonNode
>>> node = PythonNode(value="Hello, World!", hash=True)
>>> node.state()
'dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f'
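A stable hash of this kind can be sketched with hashlib directly. The example assumes SHA-256 over UTF-8 encoded text; the helper name stable_hash is hypothetical, and pytask's internal implementation may differ in its details.

```python
from __future__ import annotations

import hashlib


def stable_hash(value: str | bytes) -> str:
    """Return a SHA-256 hex digest that is identical in every session."""
    if isinstance(value, str):
        # Encode text to bytes first (UTF-8 is an assumption of this sketch).
        value = value.encode("utf-8")
    return hashlib.sha256(value).hexdigest()


digest = stable_hash("Hello, World!")
```

Under these assumptions, the digest for "Hello, World!" equals the state shown above, and it does not depend on the interpreter session.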

list and dict are not hashable by default. Luckily, some libraries, such as deepdiff, provide this functionality. We can use them to write a function that generates a stable hash and pass it to the PythonNode.

First, install deepdiff with either pip or conda.

$ pip install deepdiff
$ conda install deepdiff

Then, create the hash function and pass it to the node.

import json
from pathlib import Path
from typing import Annotated
from typing import Any

from deepdiff import DeepHash
from pytask import Product
from pytask import PythonNode


def calculate_hash(x: Any) -> str:
    # DeepHash maps every object it visits to a hash; look up the hash of x itself.
    return DeepHash(x)[x]


node = PythonNode(value={"a": 1, "b": 2}, hash=calculate_hash)


def task_example(
    data: Annotated[dict[str, int], node],
    path: Annotated[Path, Product] = Path("file.txt"),
) -> None:
    path.write_text(json.dumps(data))
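If an additional dependency is not an option, a comparable hash function can be sketched with the standard library alone, assuming the value is JSON-serializable. The name calculate_hash_stdlib is hypothetical and not part of pytask or deepdiff.

```python
import hashlib
import json
from typing import Any


def calculate_hash_stdlib(x: Any) -> str:
    """Hash JSON-serializable data independently of the interpreter session."""
    # sort_keys makes the result independent of dict insertion order.
    serialized = json.dumps(x, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


first = calculate_hash_stdlib({"a": 1, "b": 2})
second = calculate_hash_stdlib({"b": 2, "a": 1})
changed = calculate_hash_stdlib({"a": 1, "b": 3})
```

Here first and second are identical because only the key order differs, while changed is different because a value differs. Such a function could be passed to the node in the same way as calculate_hash. Values that json cannot serialize, for example sets or custom classes, would need their own handling.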