Structure of task files

This guide presents some best practices for structuring your task files. You do not have to follow them to use pytask or to create a reproducible research project. But if you are looking for orientation or inspiration, here are some tips.

TL;DR

  • Use task modules to separate task functions from one another. Separating tasks by the stages of a research project, such as data management, analysis, and plotting, is a good start. Separate further when task modules become crowded.

  • Task functions should be at the top of a task module so that readers can easily identify what the module is for.

    See also

    The only exception might be for repetitions.

  • The purpose of the task function is to handle IO operations like loading and saving files and calling Python functions on the task’s inputs. IO should not be handled in any other function.

  • Non-task functions in the task module are private functions and should only be used within this task module. These functions should not have side effects.

  • It should never be necessary to import from task modules. So if you need a function in multiple task modules, put it in a separate module (which does not start with task_).
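As a rough sketch of the last point, a helper that several task modules need could live in a shared module. The module name `shared.py` and the function `standardize_column_names` are hypothetical and chosen only for illustration; what matters is that the file name does not start with `task_`. How task modules import from it depends on how the project is laid out.

```python
# Contents of a hypothetical shared module, e.g. ``shared.py``.
# Task modules would import the helper instead of importing from each other,
# for example with ``from shared import standardize_column_names``.


def standardize_column_names(names: list[str]) -> list[str]:
    """Return lower-case, underscore-separated versions of the column names."""
    return [name.strip().lower().replace(" ", "_") for name in names]


# The helper has no side effects, so it can be used from any task module.
assert standardize_column_names([" Age ", "Household Size"]) == [
    "age",
    "household_size",
]
```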

Best Practices

Number of tasks in a module

There are two reasons to split tasks across several modules.

The first reason concerns readability and complexity. Tasks that deal with different concepts should be split. Even if tasks deal with the same concept, they might become very complex, and separate modules help the reader (most likely you or your colleagues) to focus on one thing.

The second reason is about runtime. If a task module is changed, all tasks within the module are re-run. If the combined runtime of all tasks in the module is high, you wait longer for your tasks to finish or for an error to occur, which prolongs your feedback loops and hurts your productivity.

See also

Use @pytask.mark.persist if you want to avoid accidentally triggering an expensive task. It is also explained in this tutorial.

Structure of the module

For the following example, let us assume that the task module contains one task.

The task function should be the first function in the module. It should have a descriptive name and a docstring which explains what the task accomplishes.

It should be the only public function in the module, which means the only function without a leading underscore. This convention keeps public functions separate from private functions (with a leading underscore), which must only be used in the same module and not imported elsewhere.

The body of the task function should contain two things:

  1. Any IO operations like reading and writing files which are necessary for this task.

    The reason is that IO operations introduce side effects: the result of the function depends not only on the function arguments but also on the IO resource (e.g., a file on the disk).

    If we bundle all IO operations in the task function, all other functions used in the task remain pure (without side effects), which makes testing them easier.

  2. Calls to other functions. These should be either private functions defined inside the task module or functions which are shared between tasks and defined in a module separate from all task modules.

The rest of the module is made of private functions with a leading underscore which are used to accomplish this and only this task.

Here is an example of a task module which follows all of this advice.

from pathlib import Path

import pandas as pd
from checks import perform_general_checks_on_data
from pytask import Product
from typing_extensions import Annotated


def task_prepare_census_data(
    path_to_raw_census: Path = Path("raw_census.csv"),
    path_to_census: Annotated[Path, Product] = Path("census.pkl"),
) -> None:
    """Prepare the census data.

    This task prepares the data in three steps.

    1. Clean the data.
    2. Create new variables.
    3. Perform some checks on the new data.

    """
    df = pd.read_csv(path_to_raw_census)
    df = _clean_data(df)
    df = _create_new_variables(df)
    perform_general_checks_on_data(df)
    df.to_pickle(path_to_census)


def _clean_data(df: pd.DataFrame) -> pd.DataFrame:
    ...


def _create_new_variables(df: pd.DataFrame) -> pd.DataFrame:
    ...

See also

The structure of the task module is greatly inspired by John Ousterhout’s “A Philosophy of Software Design”, in which he coins the term “deep modules”. In short, deep modules have simple interfaces, defined by one or a few public functions (or classes), which provide the functionality. The complexity is hidden inside the module in private functions which are called by the public functions.