The DataCatalog - Revisited#
An introduction to the data catalog can be found in the tutorial.
This guide explains some details that were left out of the tutorial.
Changing the default node#
The data catalog uses the PickleNode by default to serialize any kind of Python object. You can use any other node that follows the PNode protocol and register it when creating the data catalog.
For example, use the PythonNode as the default.
from pytask import DataCatalog
from pytask import PythonNode

data_catalog = DataCatalog(default_node=PythonNode)
Or, learn to write your own node by reading Writing custom nodes.
Here is an example of a PickleNode that uses cloudpickle instead of the standard pickle module.
from pathlib import Path
from typing import Any

import cloudpickle
from attrs import define


@define
class PickleNode:
    """A node for pickle files.

    Attributes
    ----------
    name
        Name of the node which makes it identifiable in the DAG.
    path
        The path to the file.

    """

    name: str
    path: Path

    @classmethod
    def from_path(cls, path: Path) -> "PickleNode":
        """Instantiate class from path to file."""
        if not path.is_absolute():
            msg = "Node must be instantiated from absolute path."
            raise ValueError(msg)
        return cls(name=path.as_posix(), path=path)

    def state(self) -> str | None:
        if self.path.exists():
            return str(self.path.stat().st_mtime)
        return None

    def load(self, is_product: bool = False) -> Any:
        if is_product:
            return self
        with self.path.open("rb") as f:
            return cloudpickle.load(f)

    def save(self, value: Any) -> None:
        with self.path.open("wb") as f:
            cloudpickle.dump(value, f)
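To see how the load, save, and state methods fit together, here is a minimal sketch that relies only on the standard library. JsonNode and round_trip are hypothetical names chosen for illustration; the class mirrors the method shape of the PickleNode above but swaps in json as the serializer and a plain dataclass instead of attrs. It has not been checked against the full PNode protocol of any particular pytask version.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class JsonNode:
    """Hypothetical node that stores JSON-serializable values in a file."""

    name: str
    path: Path

    @classmethod
    def from_path(cls, path: Path) -> "JsonNode":
        """Instantiate the node from an absolute path, as PickleNode does."""
        if not path.is_absolute():
            raise ValueError("Node must be instantiated from absolute path.")
        return cls(name=path.as_posix(), path=path)

    def state(self) -> Optional[str]:
        # The modification time changes whenever the file is rewritten.
        if self.path.exists():
            return str(self.path.stat().st_mtime)
        return None

    def load(self, is_product: bool = False) -> Any:
        # A product is handed over as the node itself so the task can save to it.
        if is_product:
            return self
        with self.path.open("r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, value: Any) -> None:
        with self.path.open("w", encoding="utf-8") as f:
            json.dump(value, f)


def round_trip(value: Any) -> Any:
    """Save a value through a JsonNode in a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        node = JsonNode.from_path(Path(tmp).resolve() / "value.json")
        assert node.state() is None  # No file yet, so there is no state.
        node.save(value)
        assert node.state() is not None  # The file now exists.
        return node.load()
```

Calling round_trip({"a": [1, 2]}) writes the dictionary to a temporary file and reads it back unchanged. The same pattern applies to any serializer that offers a dump and load pair.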
Changing the name and the default path#
By default, data catalogs store their data in the directory .pytask/data_catalogs. If you use a pyproject.toml with a [tool.pytask.ini_options] section, the .pytask folder is in the same folder as the configuration file.
The default name for a catalog is "default", so you will find its data in .pytask/data_catalogs/default. If you assign a different name like "data_management", you will find the data in .pytask/data_catalogs/data_management.
from pytask import DataCatalog

data_catalog = DataCatalog(name="data_management")
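The directory layout described in this section can be sketched as simple path arithmetic. catalog_directory is a hypothetical helper, not part of pytask, and root stands for the folder that contains the configuration file; the sketch only restates the naming rule above and does not reproduce how pytask resolves the path internally.

```python
from pathlib import Path


def catalog_directory(root: Path, name: str = "default") -> Path:
    """Return where a catalog called ``name`` would keep its data under ``root``.

    Hypothetical helper that mirrors the layout described in this guide:
    <root>/.pytask/data_catalogs/<name>
    """
    return root / ".pytask" / "data_catalogs" / name


# Assumed project root, used only for illustration.
project_root = Path("/project")

default_dir = catalog_directory(project_root)
management_dir = catalog_directory(project_root, "data_management")
```

Because the catalog name is the last path component, two catalogs with different names never share a directory.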
You can also change where a data catalog stores its data by setting the path attribute when you create it. Here, we store the data catalog’s data in .data, next to the module where the data catalog is defined.
from pathlib import Path

from pytask import DataCatalog

data_catalog = DataCatalog(path=Path(__file__).parent / ".data")
Multiple data catalogs#
You can use multiple data catalogs to keep groups of datasets separate or to reuse the same entry names in more than one catalog (although reusing names is not recommended!).
Make sure you assign different names to the data catalogs so that their data is stored in different directories.
from pytask import DataCatalog

# Stored in .pytask/data_catalogs/a
data_catalog_a = DataCatalog(name="a")

# Stored in .pytask/data_catalogs/b
data_catalog_b = DataCatalog(name="b")
Or, use different paths as explained above.