3. Project templates

3.1. Templates

3.1.1. Most DS projects follow similar workflows

Figure: Example one-off data analysis workflow

Figure: Example train/deploy model workflow

3.1.2. Example project structure

src/
.... package_name/ # package source
........ __init__.py
tests/ # package tests
data/ # input data and artefacts
.... raw/
.... processed/
.... artefacts/
bin/ # scripts
CI/ # CI/CD stuff
setup.py # setuptools for package
environment.yml # environment configuration
Makefile
.gitignore
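
To make the layout above concrete, here is a minimal sketch that materializes it as empty folders and files in a scratch directory. The helper name create_skeleton and the two lists are illustrative assumptions, and package_name is the same placeholder used in the tree. Doing this by hand for every project is exactly the chore a template tool removes.

```python
import tempfile
from pathlib import Path

# Folders and files from the example layout; "package_name" is a placeholder.
DIRECTORIES = [
    "src/package_name",
    "tests",
    "data/raw",
    "data/processed",
    "data/artefacts",
    "bin",
    "CI",
]
FILES = [
    "src/package_name/__init__.py",
    "setup.py",
    "environment.yml",
    "Makefile",
    ".gitignore",
]


def create_skeleton(root: Path) -> list[str]:
    """Create the empty layout under root and return the requested paths."""
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file in FILES:
        (root / file).touch()
    return DIRECTORIES + FILES


if __name__ == "__main__":
    # Build the skeleton in a temporary directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_skeleton(Path(tmp)):
            print(path)
```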

There is no one-size-fits-all layout, and different projects may warrant different structures. Still, agreeing on a standard structure:

  • Promotes consistency & good practices

  • Facilitates review & reproducibility

  • Enables automation

Tools like cookiecutter or PyScaffold let us create and reuse project templates, which reduces cognitive load.

3.1.3. cookiecutter Hello world!

# Create a folder with these contents
hello_world_template/
...cookiecutter.json
...{{cookiecutter.directory_name}}/
......hello_world.py

# contents of hello_world.py
print("Hello, {{cookiecutter.your_name}}!")

# contents of cookiecutter.json
{"directory_name": "project_name", "your_name": "Arnau"}

And now, magic:

# to create a new project
cookiecutter hello_world_template
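
When run, cookiecutter prompts for each variable in cookiecutter.json, offering the values there as defaults, and then fills in the placeholders in folder names, file names and file contents. The sketch below imitates only that substitution step with a regular expression, to show what the generated project would contain if the defaults are accepted. It is a simplified stand-in, not cookiecutter's implementation (which uses the Jinja templating engine), and the render helper is a hypothetical name.

```python
import re

# Matches placeholders such as {{cookiecutter.your_name}} or {{ cookiecutter.your_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(text: str, context: dict) -> str:
    """Replace every {{cookiecutter.key}} placeholder with context[key]."""
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), text)


# Default values, as in the cookiecutter.json of the hello world template.
context = {"directory_name": "project_name", "your_name": "Arnau"}

if __name__ == "__main__":
    # The templated folder name becomes the project folder.
    print(render("{{cookiecutter.directory_name}}", context))
    # The templated file content becomes plain Python.
    print(render('print("Hello, {{cookiecutter.your_name}}!")', context))
```

With the defaults accepted, the result is a folder project_name/ containing a hello_world.py that prints "Hello, Arnau!".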

3.1.4. Recommendations

Organize your projects in a clear, consistent, meaningful manner

  1. Clear: each file and folder is easily understood from its context and name

  2. Consistent: every project follows the same logic, no surprises

  3. Meaningful: every choice has a proper motivation

Other opinionated guidelines:

3.1.5. cookiecutter: pre/post hooks

You can automate certain pre/post-project setup tasks with Python or shell scripts:

cookiecutter-something/
├── {{cookiecutter.project_name}}/
├── hooks
│   ├── pre_gen_project.py
│   └── post_gen_project.sh
└── cookiecutter.json

Examples:

  • Pre-hook: checking that package name is valid

  • Post-hook: initializing git repo

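# Example pre_gen_project.py
import re
import sys

PACKAGE_REGEX = r'^[_a-zA-Z][_a-zA-Z0-9]+$'
# cookiecutter renders this file before running it, so the placeholder must be quoted
package_name = '{{ cookiecutter.package_name }}'

if not re.match(PACKAGE_REGEX, package_name):
    print('ERROR: %s is not a valid package name' % package_name)
    # A non-zero exit status aborts project generation
    sys.exit(1)

# Example post_gen_project.sh
#!/usr/bin/env bash
echo "Initializing git repository..."
git init
# Commit project skeleton to the repository
git add . # or whatever we want to check in; unlike *, this includes dotfiles such as .gitignore
git commit -m '{{cookiecutter.module_name}} first commit'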
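
A pre-hook can check more than one rule. The following is a sketch of a slightly stricter validation, assuming we also want to reject reserved Python keywords such as class. The function name validate_package_name is hypothetical; the regular expression is the one from the example above, which requires at least two characters.

```python
import keyword
import re

PACKAGE_REGEX = re.compile(r"^[_a-zA-Z][_a-zA-Z0-9]+$")


def validate_package_name(name: str) -> list[str]:
    """Return the problems found with name; an empty list means it is acceptable."""
    problems = []
    if not PACKAGE_REGEX.match(name):
        problems.append(f"{name!r} does not match {PACKAGE_REGEX.pattern}")
    if keyword.iskeyword(name):
        problems.append(f"{name!r} is a reserved Python keyword")
    return problems


if __name__ == "__main__":
    for candidate in ["my_package", "my-package", "class"]:
        print(candidate, validate_package_name(candidate))
```

Inside hooks/pre_gen_project.py, the function would be called with the quoted placeholder for the package name, followed by sys.exit(1) if the returned list is not empty.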

3.1.6. cookiecutter: Jinja Templates

We can use Jinja templates in the cookiecutter.json values:

{
"mod_name": "",
"pkg_name": "{{ cookiecutter.mod_name|lower|replace(' ', '_')|replace('-', '_') }}",
"project_url": "https://github.com/atibaup/{{ cookiecutter.pkg_name }}",
"year": "{% now 'utc', '%Y' %}",
"_extensions": ["jinja2_time.TimeExtension"]
}
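
To see what the filter chain in pkg_name produces, here is a plain-Python equivalent of the lower and replace filters, plus the same year formatting done with the standard library. The function names are illustrative assumptions; cookiecutter itself evaluates the Jinja expressions, not this code.

```python
from datetime import datetime, timezone


def derive_pkg_name(mod_name: str) -> str:
    """Equivalent of mod_name|lower|replace(' ', '_')|replace('-', '_')."""
    return mod_name.lower().replace(" ", "_").replace("-", "_")


def derive_project_url(pkg_name: str) -> str:
    """Equivalent of the project_url value, built from the derived pkg_name."""
    return f"https://github.com/atibaup/{pkg_name}"


def current_year() -> str:
    """Four-digit current year in UTC, like {% now 'utc', '%Y' %}."""
    return datetime.now(timezone.utc).strftime("%Y")


if __name__ == "__main__":
    pkg_name = derive_pkg_name("My Cool-Module")
    print(pkg_name)                      # my_cool_module
    print(derive_project_url(pkg_name))  # https://github.com/atibaup/my_cool_module
    print(current_year())
```

Because later keys can reference earlier ones, the user only types mod_name and the remaining defaults are derived from it.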

3.1.7. Practice time: Build your own template!

Let’s put together the three things we just learned: Set up your own project template with:

1. Your favored project folder structure. You can find inspiration from cookiecutter-data-science, cookiecutter-pypackage, or an example from the 2019 edition.

2. A Makefile with setup, dataset and clean targets, setting up the conda environment and installing the local module (if there is one)

3. Pre-cookiecutter hooks to validate the cookiecutter variables

4. Post-cookiecutter hooks to initialize a repository

5. Bonus: set up pre-commit hooks