This repository contains ETL (Extract, Transform, Load) jobs for processing Open Government Data for Canton Basel-Stadt, Switzerland.
For more information about the OpenDataBS organization and its projects, visit [opendatabs](https://github.com/opendatabs) on GitHub.
Each ETL job is contained in its own folder at the root of this repository. Each folder represents an independent data processing pipeline that:
- Extracts data from source systems
- Transforms the data into a standardized format
- Loads the processed data to the web server for publication
Use the interactive setup script to create a new ETL job:
```bash
python setup_new_etl.py
```

The script will ask you a series of questions and automatically create all necessary files and folders with the correct structure.
When creating a new ETL job manually, create a new folder with the following structure:
- `Dockerfile` - Container definition that builds the ETL job image (a minimal sketch follows this list)
  - Must use the base image: `FROM ghcr.io/opendatabs/data-processing/base:latest`
  - Copies `uv.lock` and `pyproject.toml` and runs `uv sync --frozen`
  - Copies all files to `/code/`
  - Sets the command to: `CMD ["uv", "run", "-m", "etl"]`
- `etl.py` - Main ETL script that contains the data processing logic (see the sketch after this list)
  - Should have a `main()` function that is executed when the module runs
  - Uses the `common` library (imported from https://github.com/opendatabs/common)
  - Typically reads from `data_orig/` and writes to `data/`
- `pyproject.toml` - Python project configuration and dependencies
  - Defines project name, version, and Python requirements
  - Must include `common` as a dependency with a git source reference
  - Example:

    ```toml
    [project]
    name = "project-name"
    version = "0.1.0"
    requires-python = ">=3.12"
    dependencies = [
        "common",
        "pandas>=2.2.3",
        # ... other dependencies
    ]

    [tool.uv.sources]
    common = { git = "https://github.com/opendatabs/common", rev = "..." }
    ```

- `uv.lock` - Lock file for dependency versions (generated by `uv`)
- `data/` - Folder for processed/transformed data output
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Processed data files are written here by the ETL script
- `data_orig/` - Folder for original/source data
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Source data files are typically mounted here at runtime in Docker
  - Original data files are read from here by the ETL script
- `change_tracking/` - Folder for change tracking metadata
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Used by the `common.change_tracking` module to track data changes
- `.python-version` - Python version specification (typically `3.12`)
- `README.md` - Documentation specific to the ETL job
- `.gitignore` - Git ignore rules for the specific job
- Schema files, configuration files, or other job-specific resources
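Putting the `Dockerfile` requirements above together, a minimal sketch could look like the following (the `WORKDIR` and the exact layer ordering are assumptions; the required pieces are the base image, the frozen dependency sync, the copy to `/code/`, and the `CMD`):

```dockerfile
# Build on the shared base image (required)
FROM ghcr.io/opendatabs/data-processing/base:latest

WORKDIR /code

# Copy dependency files first so the sync layer can be cached
COPY uv.lock pyproject.toml /code/
RUN uv sync --frozen

# Copy the remaining job files
COPY . /code/

# Run the ETL module
CMD ["uv", "run", "-m", "etl"]
```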
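Likewise, a minimal `etl.py` sketch along these lines (the pandas calls and file names are illustrative placeholders; real jobs additionally use utilities from the `common` library, e.g. for change tracking):

```python
import logging
from pathlib import Path

import pandas as pd

DATA_ORIG = Path("data_orig")
DATA = Path("data")


def main():
    # Extract: read the source data (file name is a placeholder)
    df = pd.read_csv(DATA_ORIG / "source.csv")

    # Transform: standardize column names into a uniform format
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

    # Load: write the processed data, ready for publication
    df.to_csv(DATA / "export.csv", index=False)
    logging.info("Job successful!")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()
```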
Folder names should:
- Use lowercase letters
- Use underscores (`_`) to separate words
- Be descriptive and identify the data source and type
- Follow the pattern: `{organization}_{dataset}` or `{organization}_{data_type}`
Examples:
- `aue_umweltlabor` - Umweltlabor data from AUE (Amt für Umwelt und Energie)
- `gva_geodatenshop` - Geodatenshop data from GVA (Grundbuch- und Vermessungsamt)
Important for discoverability:
- Use clear, descriptive names that indicate the data source
- Include the organization abbreviation prefix (e.g., `aue_`, `gva_`, `stata_`, `kapo_`)
The repository includes a GitHub Actions workflow (`.github/workflows/docker_build.yaml`) that:

- Detects changes - Monitors which folders have been modified
- Builds base image - If the root `Dockerfile` changes, rebuilds the base image
- Builds job images - For each modified folder, builds and pushes a Docker image to GitHub Container Registry (GHCR)
  - Images are tagged with: `ghcr.io/opendatabs/data-processing/{folder_name}:latest`
  - Images are also tagged with the commit SHA for versioning
Important: When adding a new ETL job folder, you must add it to the workflow file (`.github/workflows/docker_build.yaml`) in the filters section so that changes to the folder trigger Docker image builds (a sketch of such an entry follows).
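Assuming the change-detection step uses a `paths-filter`-style `filters` block (an assumption — check the actual workflow file for the exact structure), the new entry would look roughly like this, with `your_job_folder` as a placeholder:

```yaml
# Hypothetical excerpt from .github/workflows/docker_build.yaml
filters: |
  your_job_folder:
    - 'your_job_folder/**'
```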
Also Important: After the first push, you must set the Docker image visibility to Public on GitHub Container Registry:

- Go to the repository's "Packages" section on GitHub.
- Click on the image (under "Packages") corresponding to your ETL job (e.g., `data-processing/your_job_folder`).
- Click the "Package settings" or gear icon.
- Under "Package visibility", change it from "Private" to "Public".
- Confirm the change.
This is necessary so that the image can be pulled and run without authentication.
The repository includes a Ruff workflow (`.github/workflows/ruff.yaml`) that:
- Automatically formats Python code
- Checks for linting issues
- Creates pull requests with auto-fixes
ETL jobs are designed to run in Docker containers. Each job:

- Reads source data from `data_orig/` (typically mounted as a volume; see the example run below)
- Processes the data using the logic in `etl.py`
- Writes processed data to `data/`
- May upload data to FTP servers or push to APIs as configured
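For illustration, running a job image locally with both data folders mounted might look like this (the `/code/...` mount targets follow from the Dockerfile layout above; the production invocation is defined in the Airflow DAGs):

```bash
docker run --rm \
  -v "$(pwd)/your_job_folder/data_orig:/code/data_orig" \
  -v "$(pwd)/your_job_folder/data:/code/data" \
  ghcr.io/opendatabs/data-processing/your_job_folder:latest
```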
Jobs are typically scheduled and orchestrated using Apache Airflow, with DAG definitions stored in a separate repository.
- Install dependencies using `uv`:

  ```bash
  uv sync
  ```

- Run the ETL script locally:

  ```bash
  uv run -m etl
  ```

- Ensure source data is available in `data_orig/` for testing
To test Docker builds locally:

```bash
docker build -t test-job ./your_job_folder
```

The main tools and dependencies are:

- Python 3.12+ - Required Python version
- uv - Fast Python package installer and resolver (used for dependency management)
- common - Shared library from https://github.com/opendatabs/common containing utilities for ETL jobs
- Docker - For containerization and deployment
The base Docker image (`ghcr.io/opendatabs/data-processing/base:latest`) provides:

- Python 3.12 environment
- Timezone configured to `Europe/Zurich`
- Locale configured to `de_CH.UTF-8`
- `uv` package manager pre-installed
All ETL job Dockerfiles extend this base image.
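To inspect the base image locally, you can pull it directly (standard Docker usage; this works without authentication once the image is public):

```bash
docker pull ghcr.io/opendatabs/data-processing/base:latest
```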