This repository contains ETL (Extract, Transform, Load) jobs for processing Open Government Data for Canton Basel-Stadt, Switzerland.
For more information about the OpenDataBS organization and its projects, visit [opendatabs](https://github.com/opendatabs) on GitHub.
Each ETL job is contained in its own folder at the root of this repository. Each folder represents an independent data processing pipeline that:
- Extracts data from source systems
- Transforms the data into a standardized format
- Loads the processed data to the web server for publication
Use the interactive setup script to create a new ETL job:
```bash
python setup_new_etl.py
```

The script will ask you a series of questions and automatically create all necessary files and folders with the correct structure.
When creating a new ETL job manually, create a new folder with the following structure:
- `Dockerfile` - Container definition that builds the ETL job image (a minimal sketch follows this list)
  - Must use the base image: `FROM ghcr.io/opendatabs/data-processing/base:latest`
  - Copies `uv.lock` and `pyproject.toml` and runs `uv sync --frozen`
  - Copies all files to `/code/`
  - Sets the command to: `CMD ["uv", "run", "-m", "etl"]`
- `etl.py` - Main ETL script that contains the data processing logic (see the sketch after this list)
  - Should have a `main()` function that is executed when the module runs
  - Uses the `common` library (imported from https://github.com/opendatabs/common)
  - Typically reads from `data_orig/` and writes to `data/`
- `pyproject.toml` - Python project configuration and dependencies
  - Defines project name, version, and Python requirements
  - Must include `common` as a dependency with a git source reference
  - Example:

    ```toml
    [project]
    name = "project-name"
    version = "0.1.0"
    requires-python = ">=3.12"
    dependencies = [
        "common",
        "pandas>=2.2.3",
        # ... other dependencies
    ]

    [tool.uv.sources]
    common = { git = "https://github.com/opendatabs/common", rev = "..." }
    ```

- `uv.lock` - Lock file for dependency versions (generated by `uv`)
- `data/` - Folder for processed/transformed data output
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Processed data files are written here by the ETL script
- `data_orig/` - Folder for original/source data
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Source data files are typically mounted here at runtime in Docker
  - Original data files are read from here by the ETL script
- `change_tracking/` - Folder for change tracking metadata
  - Contains `.gitkeep` to ensure the folder is tracked in git
  - Used by the `common.change_tracking` module to track data changes
- `.python-version` - Python version specification (typically `3.12`)
- `README.md` - Documentation specific to the ETL job
- `.gitignore` - Git ignore rules for the specific job
- Schema files, configuration files, or other job-specific resources
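Putting the `Dockerfile` requirements above together, a minimal sketch could look like the following (the `WORKDIR` and the exact layer ordering are assumptions; the required pieces are the base image, the frozen dependency sync, the copy to `/code/`, and the `CMD`):

```dockerfile
# Build on the shared base image (required)
FROM ghcr.io/opendatabs/data-processing/base:latest

WORKDIR /code

# Copy dependency files first so the sync layer can be cached
COPY uv.lock pyproject.toml /code/
RUN uv sync --frozen

# Copy the remaining job files
COPY . /code/

# Run the ETL module
CMD ["uv", "run", "-m", "etl"]
```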
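Likewise, a minimal `etl.py` sketch along these lines (the pandas calls and file names are illustrative placeholders; real jobs additionally use utilities from the `common` library, e.g. for change tracking):

```python
import logging
from pathlib import Path

import pandas as pd

DATA_ORIG = Path("data_orig")
DATA = Path("data")


def main():
    # Extract: read the source data (file name is a placeholder)
    df = pd.read_csv(DATA_ORIG / "source.csv")

    # Transform: standardize column names into a uniform format
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

    # Load: write the processed data, ready for publication
    df.to_csv(DATA / "export.csv", index=False)
    logging.info("Job successful!")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()
```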
Folder names should:
- Use lowercase letters
- Use underscores (`_`) to separate words
- Be descriptive and identify the data source and type
- Follow the pattern: `{organization}_{dataset}` or `{organization}_{data_type}`
Examples:
- `aue_umweltlabor` - Umweltlabor data from AUE (Amt für Umwelt und Energie)
- `gva_geodatenshop` - Geodatenshop data from GVA (Grundbuch- und Vermessungsamt)
Important for discoverability:
- Use clear, descriptive names that indicate the data source
- Include the organization abbreviation prefix (e.g., `aue_`, `gva_`, `stata_`, `kapo_`)
The repository includes a GitHub Actions workflow (`.github/workflows/docker_build.yaml`) that:

- Detects changes - Monitors which folders have been modified
- Builds base image - If the root `Dockerfile` changes, rebuilds the base image
- Builds job images - For each modified folder, builds and pushes a Docker image to GitHub Container Registry (GHCR)
  - Images are tagged with: `ghcr.io/opendatabs/data-processing/{folder_name}:latest`
  - Images are also tagged with the commit SHA for versioning
Important: When adding a new ETL job folder, you must add it to the workflow file (`.github/workflows/docker_build.yaml`) in the filters section so that changes to the folder trigger Docker image builds (a sketch of such an entry follows).
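Assuming the change-detection step uses a `paths-filter`-style `filters` block (an assumption — check the actual workflow file for the exact structure), the new entry would look roughly like this, with `your_job_folder` as a placeholder:

```yaml
# Hypothetical excerpt from .github/workflows/docker_build.yaml
filters: |
  your_job_folder:
    - 'your_job_folder/**'
```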
Also Important: After the first push, you must set the Docker image visibility to Public on GitHub Container Registry:

- Go to the repository's "Packages" section on GitHub.
- Click on the image (under "Packages") corresponding to your ETL job (e.g., `data-processing/your_job_folder`).
- Click the "Package settings" or gear icon.
- Under "Package visibility", change it from "Private" to "Public".
- Confirm the change.
This is necessary so that the image can be pulled and run without authentication.
The repository includes a Ruff workflow (`.github/workflows/ruff.yaml`) that:
- Automatically formats Python code
- Checks for linting issues
- Creates pull requests with auto-fixes
ETL jobs are designed to run in Docker containers. Each job:

- Reads source data from `data_orig/` (typically mounted as a volume; see the example run below)
- Processes the data using the logic in `etl.py`
- Writes processed data to `data/`
- May upload data to FTP servers or push to APIs as configured
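For illustration, running a job image locally with both data folders mounted might look like this (the `/code/...` mount targets follow from the Dockerfile layout above; the production invocation is defined in the Airflow DAGs):

```bash
docker run --rm \
  -v "$(pwd)/your_job_folder/data_orig:/code/data_orig" \
  -v "$(pwd)/your_job_folder/data:/code/data" \
  ghcr.io/opendatabs/data-processing/your_job_folder:latest
```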
Jobs are typically scheduled and orchestrated using Apache Airflow, with DAG definitions stored in a separate repository.
- Install dependencies using `uv`:

  ```bash
  uv sync
  ```

- Run the ETL script locally:

  ```bash
  uv run -m etl
  ```

- Ensure source data is available in `data_orig/` for testing
To test Docker builds locally:

```bash
docker build -t test-job ./your_job_folder
```

The main tools and dependencies are:

- Python 3.12+ - Required Python version
- uv - Fast Python package installer and resolver (used for dependency management)
- common - Shared library from https://github.com/opendatabs/common containing utilities for ETL jobs
- Docker - For containerization and deployment
The base Docker image (`ghcr.io/opendatabs/data-processing/base:latest`) provides:

- Python 3.12 environment
- Timezone configured to `Europe/Zurich`
- Locale configured to `de_CH.UTF-8`
- `uv` package manager pre-installed
All ETL job Dockerfiles extend this base image.
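To inspect the base image locally, you can pull it directly (standard Docker usage; this works without authentication once the image is public):

```bash
docker pull ghcr.io/opendatabs/data-processing/base:latest
```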