## Table of Contents

- Introduction
- Architecture
- Tools and Technologies
- Setup and Usage
- CI/CD Pipeline
- Directory Structure
- Future Work
- License
## Introduction

This project automates the extraction, transformation, and loading (ETL) of data using a robust pipeline built with Terraform, Airflow, DBT, and Docker. The pipeline orchestrates data workflows to:
- Extract raw data from APIs.
- Transform the data into fact and dimension tables using DBT.
- Load the transformed data into RDS for further analytics.
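The actual extraction code lives in the repository (see the directory structure below); purely as an illustration, the ingestion step can be sketched in a few lines of Python. The endpoint URL, bucket, and key here are hypothetical placeholders, not the project's real values:

```python
import json

import boto3
import requests


def extract_and_stage(api_url: str, bucket: str, key: str) -> None:
    """Pull raw JSON from an API and stage it in S3 for later transformation."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(response.json()).encode("utf-8"),
        ContentType="application/json",
    )


if __name__ == "__main__":
    # Hypothetical endpoint and bucket -- substitute your own values.
    extract_and_stage(
        api_url="https://api.example.com/v1/bookings",
        bucket="travel-agency-raw",
        key="raw/bookings.json",
    )
```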
## Architecture

- Data Ingestion: Extract raw data from APIs and upload it to S3.
- Storage: Use S3 for raw storage and RDS for structured data.
- Orchestration: Airflow manages pipeline execution via DAGs.
- Transformation: DBT handles data modeling into fact and dimension tables.
- Provisioning: Terraform provisions cloud infrastructure (S3, RDS, IAM).
- CI/CD: Automates linting, building, and deployment of Docker images.
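To make the orchestration concrete, here is a minimal, hypothetical Airflow DAG wiring the three stages in sequence. The real DAG (`travel_agency.py` under `airflow/dags/`) differs; the callables below are stubs standing in for the `services/` modules:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Stub callables -- the project's real logic lives in the services/ modules.
def extract():
    print("pull raw data from the API and upload it to S3")


def transform():
    print("run DBT models to build fact and dimension tables")


def load():
    print("load the modeled data into RDS")


with DAG(
    dag_id="travel_agency_etl",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the stages strictly in sequence.
    extract_task >> transform_task >> load_task
```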
## Tools and Technologies

- Terraform: Infrastructure as Code for scalable provisioning.
- Airflow: Orchestrates ETL workflows using directed acyclic graphs (DAGs).
- DBT: Simplifies SQL transformations for modeling data.
- Docker: Containerizes workflows for consistent deployments.
- GitHub Actions: Implements CI/CD for automated linting and deployment.
- AWS: Hosts S3 for storage and RDS for databases.
## Setup and Usage

Prerequisites:

- Install Docker, Terraform, and the AWS CLI.
- Set up AWS credentials for Terraform and Airflow.
- Clone the Repository:

  ```bash
  git clone <repo-url>
  cd Mile-stone-project
  ```

- Provision Infrastructure:

  ```bash
  cd infrastructure/
  terraform init
  terraform apply
  ```

- Run Airflow:

  ```bash
  cd airflow/
  docker-compose up -d
  ```

- Trigger Airflow DAGs:
  - Use the Airflow UI to trigger DAGs for extraction, transformation, and loading.

- Monitor CI/CD:
  - Push changes to GitHub to trigger the CI/CD pipeline (see the next section).
## CI/CD Pipeline

The workflow in `.github/workflows/ci_cd.yml` automates the pipeline and triggers on every push to the `main` branch. It:

- Lints Python code using `flake8` to ensure adherence to PEP 8 standards.
- Builds a Docker image for data processing.
- Pushes the Docker image to Docker Hub.
## Directory Structure

```text
Mile-stone-project/
├── airflow/
│   ├── dags/
│   │   ├── data_processor/
│   │   │   ├── extract_api_data.py
│   │   │   ├── load_to_s3.py
│   │   ├── services/
│   │   │   ├── api_handler.py
│   │   │   ├── config.py
│   │   │   ├── extract_from_s3.py
│   │   │   ├── rds_loader.py
│   │   │   ├── s3_uploader.py
│   │   │   ├── transform_s3.py
│   │   ├── travel_agency.py
│   ├── docker-compose.yml
│   ├── Dockerfile
├── dbt/
│   ├── travel_agency/
│   │   ├── models/
│   │   ├── dbt_project.yml
├── infrastructure/
│   ├── main.tf
│   ├── modules/
│   │   ├── rds/
│   │   ├── s3/
│   │   ├── iam/
├── .github/
│   ├── workflows/
│   │   ├── ci_cd.yml
```
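The modules under `services/` are not reproduced in this README. As an illustration only, a loader in the spirit of `rds_loader.py` could read staged data from S3 and append it to an RDS table; every name, key, and connection string below is a placeholder, not the project's actual configuration:

```python
import boto3
import pandas as pd
from sqlalchemy import create_engine


def load_to_rds(bucket: str, key: str, table: str, conn_uri: str) -> int:
    """Read a staged CSV from S3 and append it to an RDS table."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])  # the S3 StreamingBody is file-like

    engine = create_engine(conn_uri)
    df.to_sql(table, engine, if_exists="append", index=False)
    return len(df)


if __name__ == "__main__":
    # Hypothetical values -- in the real project these would come from
    # something like services/config.py.
    rows = load_to_rds(
        bucket="travel-agency-raw",
        key="transformed/bookings.csv",
        table="fact_bookings",
        conn_uri="postgresql+psycopg2://user:password@host:5432/travel_agency",
    )
    print(f"loaded {rows} rows")
```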
## Future Work

- Add real-time monitoring to Airflow and DBT workflows.
- Expand pipeline to support additional data sources.
- Integrate data quality checks and alerts.
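As a sketch of what the proposed data quality checks could look like, a first step might be a row-count assertion run as an Airflow task after each load; the table name, threshold, and connection string below are illustrative placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine


def check_row_count(conn_uri: str, table: str, minimum: int) -> None:
    """Fail the task if a freshly loaded table looks suspiciously empty."""
    engine = create_engine(conn_uri)
    # The table name should come from trusted config, not user input.
    count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", engine)["n"].iloc[0]
    if count < minimum:
        # Raising inside an Airflow task marks the run failed, which can
        # then be wired to an alert (email, Slack, etc.).
        raise ValueError(f"{table} has {count} rows, expected at least {minimum}")


if __name__ == "__main__":
    # Illustrative usage with placeholder values.
    check_row_count(
        conn_uri="postgresql+psycopg2://user:password@host:5432/travel_agency",
        table="fact_bookings",
        minimum=1,
    )
```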
## License

This project is licensed under the MIT License - see the LICENSE file for details.