Using a data catalog¶
The previous tutorial explained how to use paths to define dependencies and products.
Two things will quickly become a nuisance in bigger projects.
- We have to define the same paths again and again.
- We have to define paths to files that we are not particularly interested in since they are just intermediate representations.
As a solution, pytask offers a
pytask.DataCatalog, which
is a purely optional feature. The tutorial focuses on the main features. To learn about
all the features, read the how-to guide.
Let us focus on the previous example and see how
pytask.DataCatalog helps
us.
The project structure is the same as in the previous example, except for the new
.pytask folder and the missing data.pkl in bld.
my_project
│
├───.pytask
│
├───bld
│ └────plot.png
│
├───src
│ └───my_project
│ ├────__init__.py
│ ├────config.py
│ ├────task_data_preparation.py
│ └────task_plot_data.py
│
├───pytask.lock
│
└───pyproject.toml
The DataCatalog¶
At first, we define the data catalog in config.py.
from pathlib import Path
from pytask import DataCatalog
SRC = Path(__file__).parent.resolve()
BLD = SRC.joinpath("..", "..", "bld").resolve()
data_catalog = DataCatalog()
task_create_random_data¶
Next, we look at the module task_data_preparation.py and its task
task_create_random_data. The task creates a dataframe with simulated data that should
be stored on the disk.
In the previous tutorial, we learned to use
pathlib.Paths to define
products of our tasks. Here we see again the signature of the task function.
from pathlib import Path
from typing import Annotated
import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product
def task_create_random_data(
    path_to_data: Annotated[Path, Product] = BLD / "data.pkl",
) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path_to_data)
from pathlib import Path
import numpy as np
import pandas as pd
from my_project.config import BLD
def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)
When we want to use the data catalog, we replace BLD / "data.pkl" with an entry of the
data catalog like data_catalog["data"]. If there is no entry with the name "data" yet,
the data catalog will automatically create a
pytask.PickleNode. The node allows you
to save any Python object to a pickle file.
You probably noticed that we did not need to define a path. That is because the data
catalog takes care of that and stores the pickle file in the .pytask folder.
Using data_catalog["data"] is thus equivalent to using PickleNode(path=Path(...)).
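To make the mechanism more concrete, here is a minimal sketch of the idea that uses only the standard library. It is not pytask's implementation: ToyPickleNode, ToyCatalog, and the file naming scheme are hypothetical stand-ins that only mimic how a missing entry could be created on first access and backed by a pickle file in a storage folder.

```python
import pickle
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class ToyPickleNode:
    """Hypothetical stand-in for a pickle-backed node."""

    name: str
    path: Path

    def save(self, value: Any) -> None:
        # Serialize any picklable Python object to the node's file.
        with self.path.open("wb") as f:
            pickle.dump(value, f)

    def load(self) -> Any:
        # Read the object back from the node's file.
        with self.path.open("rb") as f:
            return pickle.load(f)


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog that creates entries on demand."""

    storage: Path
    entries: dict[str, ToyPickleNode] = field(default_factory=dict)

    def __getitem__(self, name: str) -> ToyPickleNode:
        # Create a pickle-backed node the first time a name is requested.
        if name not in self.entries:
            self.entries[name] = ToyPickleNode(name, self.storage / f"{name}.pkl")
        return self.entries[name]


# A temporary directory plays the role of the .pytask folder here.
storage_dir = Path(tempfile.mkdtemp())
catalog = ToyCatalog(storage=storage_dir)

catalog["data"].save({"x": [1, 2, 3]})
restored = catalog["data"].load()
```

The caller never names a file: asking for catalog["data"] is enough, and the catalog decides where the pickle file lives.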
The following tabs show you how to use the data catalog given the interface you prefer.
Use data_catalog["data"] as a default argument to access the
pytask.PickleNode within the task. When
you are done transforming your pandas.DataFrame, save it with
pytask.PNode.save.
from typing import Annotated
import numpy as np
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
def task_create_random_data(
    node: Annotated[PickleNode, Product] = data_catalog["data"],
) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    node.save(df)
With the produces interface, use data_catalog["data"] as a default argument to access the
pytask.PickleNode within the task. When
you are done transforming your pandas.DataFrame, save it with
pytask.PNode.save.
import numpy as np
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
def task_create_random_data(produces: PickleNode = data_catalog["data"]) -> None:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    df = pd.DataFrame({"x": x, "y": y})
    produces.save(df)
An elegant way to use the data catalog is via return type annotations. Add
data_catalog["data"] to the annotated return and simply return the
pandas.DataFrame
to store it.
You can read more about return type annotations in Using task returns.
from typing import Annotated
import numpy as np
import pandas as pd
from my_project.config import data_catalog
def task_create_random_data() -> Annotated[pd.DataFrame, data_catalog["data"]]:
    rng = np.random.default_rng(0)
    beta = 2
    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)
    y = beta * x + epsilon
    return pd.DataFrame({"x": x, "y": y})
task_plot_data¶
Next, we will define the second task that consumes the data set from the previous task.
Following one of the interfaces gives you immediate access to the
pandas.DataFrame in the task without any additional line to load
it.
from pathlib import Path
from typing import Annotated
import matplotlib.pyplot as plt
import pandas as pd
from my_project.config import BLD
from my_project.config import data_catalog
from pytask import Product
def task_plot_data(
    df: Annotated[pd.DataFrame, data_catalog["data"]],
    path_to_plot: Annotated[Path, Product] = BLD / "plot.png",
) -> None:
    _, ax = plt.subplots()
    df.plot(x="x", y="y", ax=ax, kind="scatter")
    plt.savefig(path_to_plot)
    plt.close()
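How can a task receive a loaded value, or store its return value, just from annotations? The following is a rough sketch of one way a runner could do it with the standard library. It is an assumption for illustration, not pytask's actual implementation: InMemoryNode and run_task are hypothetical names, and the node keeps its value in memory instead of on disk.

```python
from typing import Annotated, Any, Callable, get_type_hints


class InMemoryNode:
    """Hypothetical node that keeps its value in memory instead of on disk."""

    def __init__(self) -> None:
        self.value: Any = None

    def save(self, value: Any) -> None:
        self.value = value

    def load(self) -> Any:
        return self.value


def run_task(func: Callable[..., Any]) -> Any:
    """Load annotated inputs, call func, and save an annotated return value."""
    hints = get_type_hints(func, include_extras=True)
    return_hint = hints.pop("return", None)

    kwargs = {}
    for name, hint in hints.items():
        # Annotated[...] hints expose their extra arguments via __metadata__.
        for meta in getattr(hint, "__metadata__", ()):
            if hasattr(meta, "load"):
                kwargs[name] = meta.load()

    result = func(**kwargs)

    # Store the return value in every node attached to the return annotation.
    for meta in getattr(return_hint, "__metadata__", ()):
        if hasattr(meta, "save"):
            meta.save(result)
    return result


numbers_node = InMemoryNode()


def task_create_numbers() -> Annotated[list, numbers_node]:
    return [1, 2, 3]


def task_sum_numbers(numbers: Annotated[list, numbers_node]) -> int:
    # The function receives the loaded list, not the node.
    return sum(numbers)


run_task(task_create_numbers)
total = run_task(task_sum_numbers)
```

The task functions stay free of any loading or saving code; the node in the annotation carries that responsibility.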
Finally, let's execute the two tasks.
$ pytask
────────────────────────── Start pytask session ─────────────────────────
Platform: win32 -- Python 3.13.0, pytask 0.6.0, pluggy 1.3.0
Root: C:\Users\pytask-dev\git\my_project
Collected 2 tasks.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
│ task_plot_data.py::task_plot_data                 │ .       │
└───────────────────────────────────────────────────┴─────────┘
─────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  2  Collected tasks            │
│  2  Succeeded (100.0%)         │
╰────────────────────────────────╯
─────────────────────── Succeeded in 0.06 seconds ───────────────────────
Adding data to the catalog¶
In most projects, you have other data sets that you would like to access via the data
catalog. To add them, call the
pytask.DataCatalog.add
method and supply a name and a path.
Let's add file.csv with the name "csv" to the data catalog and use it to create
data_catalog["transformed_csv"].
my_project
│
├───pyproject.toml
│
├───pytask.lock
│
├───src
│ └───my_project
│ ├────config.py
│ ├────file.csv
│ ├────task_data_preparation.py
│ └────task_plot_data.py
│
├───.pytask
│ └────...
│
└───bld
├────file.pkl
└────plot.png
We can use a relative or an absolute path to define the location of the file. A relative path means the location is relative to the module of the data catalog.
from pathlib import Path
from pytask import DataCatalog
SRC = Path(__file__).parent.resolve()
BLD = SRC.joinpath("..", "..", "bld").resolve()
data_catalog = DataCatalog()
# Use either a relative or an absolute path.
data_catalog.add("csv", Path("file.csv"))
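To illustrate what a path relative to the module of the data catalog means, here is a small sketch with plain pathlib. The helper resolve_against is hypothetical and only mirrors the rule described above; it assumes config.py lives in src/my_project as in the project tree.

```python
from pathlib import Path


def resolve_against(base_dir: Path, path: Path) -> Path:
    """Hypothetical helper: anchor relative paths at base_dir, keep absolute ones."""
    return path if path.is_absolute() else base_dir / path


# Assumed layout from the project tree: config.py lives in src/my_project.
config_dir = Path("src", "my_project")

# A relative path is interpreted next to the module that defines the catalog.
relative = resolve_against(config_dir, Path("file.csv"))

# An absolute path is used unchanged.
absolute = resolve_against(config_dir, Path.cwd() / "file.csv")
```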
You can now use the data catalog as in the previous example and use the
pathlib.Path in the task.
Note
Note that the value of data_catalog["csv"] inside the task becomes a
pathlib.Path. This is because a pathlib.Path passed to
pytask.DataCatalog.add
is not parsed to a pytask.PickleNode
but to a pytask.PathNode.
Read writing custom nodes for more information about the different node types; they are not relevant for now.
from pathlib import Path
from typing import Annotated
import pandas as pd
from my_project.config import data_catalog
from pytask import PickleNode
from pytask import Product
def task_transform_csv(
    path: Annotated[Path, data_catalog["csv"]],
    node: Annotated[PickleNode, Product] = data_catalog["transformed_csv"],
) -> None:
    df = pd.read_csv(path)
    # ... transform the data.
    node.save(df)
from pathlib import Path
from typing import Annotated
import pandas as pd
from my_project.config import data_catalog
def task_transform_csv(
    path: Annotated[Path, data_catalog["csv"]],
) -> Annotated[pd.DataFrame, data_catalog["transformed_csv"]]:
    df = pd.read_csv(path)
    # ... transform the data.
    return df
Developing with the DataCatalog¶
You can also use the data catalog in a Jupyter Notebook or in a Python interpreter in the terminal. This is especially helpful when you develop tasks interactively in a Jupyter Notebook.
Simply import the data catalog, select a node and call
pytask.PNode.load to access its value.
Here is an example with a terminal.
>>> from my_project.config import data_catalog
>>> data_catalog.entries
['csv', 'data', 'transformed_csv']
>>> data_catalog["data"].load()
DataFrame(...)
>>> data_catalog["csv"].load()
WindowsPath('C:/Users/pytask-dev/git/my_project/src/my_project/file.csv')
data_catalog["data"] is stored with a
pytask.PickleNode, so loading it returns the
pandas.DataFrame, whereas data_catalog["csv"] becomes a
pytask.PathNode, for which
pytask.PNode.load returns the path.