Migrating from scripts to pytask#

Are you tired of managing tasks in your research workflows with scripts that get harder to maintain over time? Then pytask is here to help!

With pytask, you can enjoy features like:

  • Lazy builds. Only execute the scripts that need to be run or re-run because something has changed, saving you lots of time.

  • Parallelization. Use pytask-parallel to speed up your scripts by running them in parallel.

  • Cross-language projects. pytask has several plugins for running scripts written in other popular languages: pytask-r, pytask-julia, and pytask-stata.

The following guide will walk you through a series of steps to quickly migrate your scripts to a workflow managed by pytask. The focus is first on Python scripts, but the guide concludes with an additional example of an R script.

Installation#

To get started with pytask, simply install it with pip or conda:

$ pip install pytask pytask-parallel

$ conda install -c conda-forge pytask pytask-parallel

From Python script to task#

First, we need to rewrite your scripts and move the executable part into a task function. You might keep the code in the main namespace of your script, as in this example.

# Content of task_data_management.py
import pandas as pd


df = pd.read_csv("data.csv")

# Many operations.

df.to_pickle("data.pkl")

Or, you might use an if __name__ == "__main__" block like this example.

# Content of task_data_management.py
import pandas as pd


def main() -> None:
    df = pd.read_csv("data.csv")

    # Many operations.

    df.to_pickle("data.pkl")


if __name__ == "__main__":
    main()

For pytask, move the code into a task: a function whose name starts with task_, defined in a module whose name also starts with task_, like task_data_management.py.

# Content of task_data_management.py
import pandas as pd


def task_prepare_data() -> None:
    df = pd.read_csv("data.csv")

    # Many operations.

    df.to_pickle("data.pkl")

Any if __name__ == "__main__" block must be deleted.

Extracting dependencies and products#

To let pytask know the order in which to execute tasks and when to re-run them, you’ll need to specify task dependencies and products. Add dependencies as arguments to the function with default values. Do the same for products, but also add the special Product annotation with Annotated[Path, Product]. For example:

# Content of task_data_management.py
from pathlib import Path

import pandas as pd
from pytask import Product
from typing_extensions import Annotated


def task_prepare_data(
    path_to_csv: Path = Path("data.csv"),
    path_to_pkl: Annotated[Path, Product] = Path("data.pkl"),
) -> None:
    df = pd.read_csv(path_to_csv)

    # Many operations.

    df.to_pickle(path_to_pkl)

You can also use a dictionary to group multiple dependencies or products.

from pathlib import Path
from typing import Optional

import pandas as pd
from pytask import Product
from typing_extensions import Annotated


def task_merge_data(
    paths_to_input_data: Optional[dict[str, Path]] = None,
    path_to_merged_data: Annotated[Path, Product] = Path("merged_data.pkl"),
) -> None:
    if paths_to_input_data is None:
        paths_to_input_data = {
            "first": Path("data_1.csv"),
            "second": Path("data_2.csv"),
        }
    df1 = pd.read_csv(paths_to_input_data["first"])
    df2 = pd.read_csv(paths_to_input_data["second"])

    df = df1.merge(df2, on=...)

    df.to_pickle(path_to_merged_data)

See also

If you want to learn more about dependencies and products, read the tutorial.

Execution#

Finally, execute your newly defined tasks with pytask. Assuming your scripts lie in the current directory of your terminal or a subdirectory of it, run the following.

$ pytask
──────────────────────────── Start pytask session ────────────────────────────
Platform: win32 -- Python 3.10.0, pytask 0.4.0, pluggy 1.0.0
Root: C:\Users\pytask-dev\git\my_project
Collected 1 task.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                       ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_management.py::task_prepare_data │ .       │
└────────────────────────────────────────────┴─────────┘

──────────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  1  Collected tasks            │
│  1  Succeeded        (100.0%)  │
╰────────────────────────────────╯
───────────────────────── Succeeded in 30.6 seconds ──────────────────────────

Otherwise, pass the paths explicitly to the pytask executable.
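For example, if your task modules live in a folder such as src/my_project (a hypothetical path), point pytask at it:

```shell
$ pytask src/my_project
```

You can also pass several paths or individual task modules at once.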

If you have rewritten multiple scripts that can run independently of one another, use the -n/--n-workers option to set the number of workers. pytask-parallel will then automatically spawn multiple processes to run the workflow in parallel.

$ pytask -n 4

See also

You can find more information on pytask-parallel in its README on GitHub.

Bonus: From R script to task#

pytask wants to help you get your job done, and sometimes a different programming language can make your life easier. Thus, pytask has several plugins to integrate code written in R, Julia, and Stata. Here, we explore how to incorporate an R script with pytask-r. You can also find more information about the plugin in the repo’s readme.

First, we will install the package.

$ pip install pytask-r

$ conda install -c conda-forge pytask-r

See also

Check out pytask-julia and pytask-stata, too!

And here is the R script prepare_data.r that we want to integrate.

# Content of prepare_data.r
df <- read.csv("data.csv")

# Many operations.

saveRDS(df, "data.rds")

Next, we create a task function to point pytask to the script and the dependencies and products.

# Content of task_data_management.py
from pathlib import Path

import pytask
from pytask import Product
from typing_extensions import Annotated


@pytask.mark.r(script="prepare_data.r")
def task_prepare_data(
    input_path: Path = Path("data.csv"),
    output_path: Annotated[Path, Product] = Path("data.rds"),
) -> None:
    pass

pytask automatically makes the paths to the dependencies and products available to the R file via a JSON file. Let us amend the R script to load the information from the JSON file.

# Content of prepare_data.r
library(jsonlite)

# Read the JSON file whose path is passed to the script
args <- commandArgs(trailingOnly=TRUE)
path_to_json <- args[length(args)]
config <- read_json(path_to_json)

df <- read.csv(config$input_path)

# Many operations.

saveRDS(df, config$output_path)
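
To make the mechanism concrete, here is a rough Python sketch of the kind of JSON file the plugin passes to the script. The exact file layout is an assumption for illustration; the point is that the task function's path arguments are serialized under their parameter names, so the R script can look them up:

```python
import json
from pathlib import Path

# Assumed shape of the config file: the task's path arguments,
# keyed by parameter name and serialized as strings.
config = {
    "input_path": str(Path("data.csv")),
    "output_path": str(Path("data.rds")),
}

path = Path("config_sketch.json")
path.write_text(json.dumps(config))

# The R script reads this file and resolves the paths by parameter name.
loaded = json.loads(path.read_text())
print(loaded["input_path"])  # data.csv
```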

Conclusion#

Congrats! You have just set up your first workflow with pytask!

If you enjoyed what you have seen, you should discover the other parts of the documentation. The tutorials are a good entry point to start with pytask and learn about many concepts step-by-step.