Using a data catalog#
The previous tutorial explained how to use paths to define dependencies and products.
Two things will quickly become a nuisance in bigger projects.
We have to define the same paths again and again.
We have to define paths to files that we are not particularly interested in since they are just intermediate representations.
As a solution, pytask offers a DataCatalog, which is a purely optional feature. This tutorial focuses on the main features. To learn about all features, read the how-to guide.
Let us revisit the previous example and see how the DataCatalog helps us.
The project structure is the same as in the previous example, except for the .pytask folder and the missing data.pkl in bld.
my_project
│
├───.pytask
│
├───bld
│ └────plot.png
│
├───src
│ └───my_project
│ ├────__init__.py
│ ├────config.py
│ ├────task_data_preparation.py
│ └────task_plot_data.py
│
└───pyproject.toml
The DataCatalog#
First, we define the data catalog in config.py.
from pathlib import Path
from pytask import DataCatalog
SRC = Path(__file__).parent.resolve()
BLD = SRC.joinpath("..", "..", "bld").resolve()
data_catalog = DataCatalog()
task_data_preparation#
Next, we will use the data catalog to save the product of the task in task_data_preparation.py.
Instead of using a path, we set the location of the product in the data catalog with data_catalog["data"]. If the key does not exist, the data catalog will automatically create a PickleNode that allows you to save any Python object to a pickle file. The pickle file is stored within the .pytask folder.
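To make the "create a node on first access" behavior concrete, here is a minimal sketch built only on the standard library. It is not pytask's implementation: ToyPickleNode, ToyCatalog, and the `<key>.pkl` file naming are hypothetical stand-ins chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path


class ToyPickleNode:
    """Hypothetical stand-in for a node that pickles one value to one file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def save(self, value: object) -> None:
        self.path.write_bytes(pickle.dumps(value))

    def load(self) -> object:
        return pickle.loads(self.path.read_bytes())


class ToyCatalog:
    """Hypothetical catalog: unknown keys get a pickle-backed node on first access."""

    def __init__(self, storage: Path) -> None:
        self.storage = storage
        self.entries = {}

    def __getitem__(self, key: str) -> ToyPickleNode:
        if key not in self.entries:
            # Assumption: one file per key, named after the key.
            self.storage.mkdir(parents=True, exist_ok=True)
            self.entries[key] = ToyPickleNode(self.storage / f"{key}.pkl")
        return self.entries[key]


# Use a temporary directory in place of the project's .pytask folder.
storage = Path(tempfile.mkdtemp()) / ".pytask"
catalog = ToyCatalog(storage)

catalog["data"].save({"x": [1, 2], "y": [2, 4]})
restored = catalog["data"].load()
```

Repeated lookups of the same key return the same node, which is what lets one task save a value and another task load it under the same name.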
The following tabs show you how to use the data catalog given the interface you prefer.
Use data_catalog["key"] as a default argument to access the PickleNode within the task. When you are done transforming your DataFrame, save it with save().
from typing import Annotated
import numpy as np
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
def task_create_random_data(
    node: Annotated[PickleNode, Product] = data_catalog["data"]
) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    node.save(df)
Use data_catalog["key"] as a default argument to access the PickleNode within the task. When you are done transforming your DataFrame, save it with save().
import numpy as np
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
from typing_extensions import Annotated
def task_create_random_data(
    node: Annotated[PickleNode, Product] = data_catalog["data"]
) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    node.save(df)
Use data_catalog["key"] as a default argument to access the PickleNode within the task. When you are done transforming your DataFrame, save it with save().
import numpy as np
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
def task_create_random_data(produces: PickleNode = data_catalog["data"]) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    produces.save(df)
An elegant way to use the data catalog is via return type annotations. Add data_catalog["data"] to the annotated return type and simply return the DataFrame to store it.
You can read more about return type annotations in Using task returns.
from typing import Annotated
import numpy as np
import pandas as pd
from my_project.config import data_catalog
def task_create_random_data() -> Annotated[pd.DataFrame, data_catalog["data"]]:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    return pd.DataFrame({"x": x, "y": y})
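For readers curious how a return annotation can carry a storage location, the following sketch shows one way a runner could read the metadata of an Annotated return type and hand the return value to it. This is an illustration under assumptions, not pytask's internals: MemoryNode, task_create_numbers, and run_and_store are hypothetical names.

```python
from typing import Annotated, get_type_hints


class MemoryNode:
    """Hypothetical node that keeps the saved value in memory."""

    def __init__(self) -> None:
        self.value = None

    def save(self, value: object) -> None:
        self.value = value


node = MemoryNode()


def task_create_numbers() -> Annotated[list, node]:
    return [1, 2, 3]


def run_and_store(task) -> None:
    """Call the task and pass its return value to nodes in the return annotation."""
    hints = get_type_hints(task, include_extras=True)
    return_hint = hints.get("return")
    result = task()
    # Annotated[...] exposes its extra arguments as __metadata__.
    for meta in getattr(return_hint, "__metadata__", ()):
        if hasattr(meta, "save"):
            meta.save(result)


run_and_store(task_create_numbers)
```

The task body stays free of any storage code; where the value ends up is decided entirely by the object placed in the annotation.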
task_plot_data#
Next, we will define the second task that consumes the data set from the previous task. Following one of the interfaces gives you immediate access to the DataFrame in the task without any additional line to load it.
Add data_catalog["data"] to the Annotated type hint of the df argument. pytask then loads the stored DataFrame and passes it to the task, so no explicit call to load() is needed.
from pathlib import Path
from typing import Annotated
import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from my_project.config import data_catalog
from pytask import Product
def task_plot_data(
    df: Annotated[pd.DataFrame, data_catalog["data"]],
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")
    plt.savefig(path_to_plot)
    plt.close()
Add data_catalog["data"] to the Annotated type hint of the df argument. pytask then loads the stored DataFrame and passes it to the task, so no explicit call to load() is needed.
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from my_project.config import data_catalog
from pytask import Product
from typing_extensions import Annotated
def task_plot_data(
    df: Annotated[pd.DataFrame, data_catalog["data"]],
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")
    plt.savefig(path_to_plot)
    plt.close()
Finally, let’s execute the two tasks.
$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 2 tasks.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ . │
│ task_plot_data.py::task_plot_data │ . │
└───────────────────────────────────────────────────┴─────────┘
──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  2 Collected tasks  │
│  2 Succeeded (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 0.06 seconds ──────────────────────────
Adding data to the catalog#
In most projects, you have other data sets that you would like to access via the data catalog. To add them, call the add() method and supply a name and a path.
Let’s add file.csv to the data catalog.
my_project
│
├───pyproject.toml
│
├───src
│ └───my_project
│ ├────config.py
│ ├────file.csv
│ ├────task_data_preparation.py
│ └────task_plot_data.py
│
├───setup.py
│
├───.pytask
│ └────...
│
└───bld
├────file.pkl
└────plot.png
The path can be absolute or relative to the module of the data catalog.
from pathlib import Path
from pytask import DataCatalog
SRC = Path(__file__).parent.resolve()
BLD = SRC.joinpath("..", "..", "bld").resolve()
data_catalog = DataCatalog()
# Use either a relative or an absolute path.
data_catalog.add("csv", Path("file.csv"))
data_catalog.add("transformed_csv", BLD / "file.pkl")
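The rule "absolute, or relative to the module of the data catalog" can be sketched with pathlib alone. The helper below is a hypothetical illustration of that rule, not a function from pytask; resolve_entry and its parameters are assumed names.

```python
from pathlib import Path


def resolve_entry(path: Path, catalog_module: Path) -> Path:
    """Resolve ``path`` against the directory of the module defining the catalog."""
    if path.is_absolute():
        return path
    return (catalog_module.parent / path).resolve()


# Assumed location of the module that defines the catalog.
config_file = Path("src", "my_project", "config.py").resolve()

# A relative path is anchored next to config.py, not at the working directory.
csv_path = resolve_entry(Path("file.csv"), config_file)

# An absolute path is returned unchanged.
same_path = resolve_entry(config_file, config_file)
```

Anchoring relative paths at the catalog's module keeps entries stable regardless of the directory from which the build is started.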
You can now use the data catalog as in the previous example and use the Path in the task.
from pathlib import Path
from typing import Annotated
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
def task_transform_csv(
    path: Annotated[Path, data_catalog["csv"]],
    node: Annotated[PickleNode, Product] = data_catalog["transformed_csv"],
) -> None:
    df = pd.read_csv(path)
    ...
    node.save(df)
from pathlib import Path
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
from typing_extensions import Annotated
def task_transform_csv(
    path: Annotated[Path, data_catalog["csv"]],
    node: Annotated[PickleNode, Product] = data_catalog["transformed_csv"],
) -> None:
    df = pd.read_csv(path)
    ...
    node.save(df)
from pathlib import Path
from typing import Annotated
import pandas as pd
from my_project.config import data_catalog
def task_transform_csv(
    path: Annotated[Path, data_catalog["csv"]],
) -> Annotated[pd.DataFrame, data_catalog["transformed_csv"]]:
    df = pd.read_csv(path)
    ...
    return df
Developing with the DataCatalog#
You can also use the data catalog in a Jupyter notebook or in the Python interpreter in a terminal. Simply import the data catalog, select a node, and call its load() method to access its value.
>>> from my_project.config import data_catalog
>>> data_catalog.entries
['csv', 'data', 'transformed_csv']
>>> data_catalog["data"].load()
DataFrame(...)
>>> data_catalog["csv"].load()
WindowsPath('C:\Users\pytask-dev\git\my_project\file.csv')
data_catalog["data"] was stored with a PickleNode and returns the DataFrame, whereas data_catalog["csv"] becomes a PathNode and load() returns the path.