Hashing inputs of tasks#
Any input to a task function is parsed by pytask’s nodes. For example,
pathlib.Path
s are parsed by PathNode
s. The
PathNode
handles among other things how changes in the underlying file
are detected.
If an input is not parsed by any more specific node type, the general
PythonNode
is used.
In the following example, the argument text
will be parsed as a
PythonNode
.
from pathlib import Path
from typing import Annotated
from pytask import Product
def task_example(
text: str = "Hello, World", path: Annotated[Path, Product] = Path("file.txt")
) -> None:
path.write_text(text)
By default, pytask does not detect changes in PythonNode
and if the
value would change (without changing the task module), pytask would not rerun the task.
We can also hash the value of PythonNode
s so that pytask knows when
the input changed. For that, we need to use the PythonNode
explicitly
and set hash = True
.
from pathlib import Path
from typing import Annotated
from pytask import Product
from pytask import PythonNode
def task_example(
text: Annotated[str, PythonNode("Hello, World", hash=True)],
path: Annotated[Path, Product] = Path("file.txt"),
) -> None:
path.write_text(text)
When hash=True
, pytask will call the builtin hash()
on the input that will call
the __hash__()
method of the object.
Some objects like tuple
and typing.NamedTuple
are hashable and return
correct hashes by default.
>>> hash((1, 2))
-3550055125485641917
str
and bytes
are special. They are hashable, but the hash changes
from interpreter session to interpreter session for security reasons (see
object.__hash__()
for more information). pytask will hash them using the
hashlib
module to create a stable hash.
>>> from pytask import PythonNode
>>> node = PythonNode(value="Hello, World!", hash=True)
>>> node.state()
'dffd6021bb2bd5b0af676290809ec3a53191dd81c7f70a4b28688a362182986f'
list
and dict
are not hashable by default. Luckily, there are
libraries who provide this functionality like deepdiff
. We can use them to pass a
function to the PythonNode
that generates a stable hash.
First, install deepdiff
.
$ pip install deepdiff
$ conda install deepdiff
Then, create the hash function and pass it to the node.
import json
from pathlib import Path
from typing import Annotated
from typing import Any
from deepdiff import DeepHash
from pytask import Product
from pytask import PythonNode
def calculate_hash(x: Any) -> str:
return DeepHash(x)[x]
node = PythonNode({"a": 1, "b": 2}, hash=calculate_hash)
def task_example(
data: Annotated[dict[str, int], node],
path: Annotated[Path, Product] = Path("file.txt"),
) -> None:
path.write_text(json.dumps(data))