Datalakes-And-Data-Integration

Project for the Datalakes & Data Integration class

Airflow Setup

This project uses Apache Airflow to orchestrate the data transformation pipeline. The Airflow environment is deployed with Docker Compose, which starts all the required containers at once (Airflow, Redis, PostgreSQL, Localstack, Cassandra, etc.).

Starting the Airflow environment

After you run the command docker-compose up -d, the following containers are started:

airflow-init : Initializes the Airflow metadata database and creates a default administrator user.
airflow-webserver : Serves the web interface used to view and manage the DAGs.
airflow-scheduler : Schedules task execution according to the dependencies defined in the DAGs.
airflow-worker : Executes the tasks distributed through Celery.

Accessing the Airflow webserver

Open your browser and go to:

http://localhost:8080

Log in with the following credentials:

Username : admin
Password : admin

Once authenticated, you can access the DAG that orchestrates all of the data transformations.
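
Before logging in, it can be useful to confirm that the webserver and scheduler containers are actually up. The sketch below is one way to do that, assuming an Airflow 2.x webserver that exposes its JSON health report at /health on the address given above. The function names are illustrative and are not part of this repository.

```python
import json
import urllib.request

AIRFLOW_URL = "http://localhost:8080"  # webserver address from this README


def unhealthy_components(health: dict) -> list[str]:
    """Return the names of components whose status is not 'healthy'."""
    return sorted(
        name
        for name, info in health.items()
        if isinstance(info, dict) and info.get("status") != "healthy"
    )


def fetch_health(base_url: str = AIRFLOW_URL, timeout: float = 5.0) -> dict:
    """Fetch the health report (assumes the Airflow 2.x /health path)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)
```

With the containers running, unhealthy_components(fetch_health()) should return an empty list when every reported component is healthy.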

Make sure you have a .env file containing the associated API keys
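
This README does not list which keys the .env file must contain, so the sketch below only shows how one might check a .env file for missing or empty entries before starting the containers. The key names used in the usage note are hypothetical placeholders, and the parser handles only plain KEY=VALUE lines.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in the .env text."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]
```

For example, missing_keys(open(".env").read(), ["EXAMPLE_API_KEY"]) returns the entries that still need to be filled in, where EXAMPLE_API_KEY stands in for whatever keys the project actually requires.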

API

Flask API

We chose Flask for our API because it is a simple and efficient way to build an API gateway for a proof of concept such as this one. To start the API, run the following command from the root of the project folder (you may need to replace python with python3 or py):

python src/main.py

Endpoints

The API exposes 4 endpoints, all of which use the POST method:

/ingest/blob : Ingests data from one or more blobs made of CSV files. It takes a JSON payload in the following format:

{
    "data" : [blob1, blob2]
}

/ingest/csv : Ingests data from one or more CSV files. It takes a JSON payload in the following format:

{
    "files" : ["file1.csv", "file2.csv"]
}
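
As an illustration of calling this endpoint, the sketch below builds a POST request carrying the payload shown above. The base URL is an assumption (Flask's default development port, 5000); adjust it to whatever src/main.py actually binds to. The helper name is illustrative.

```python
import json
import urllib.request

API_BASE = "http://localhost:5000"  # assumption: Flask's default port


def build_csv_ingest_request(
    files: list[str], base_url: str = API_BASE
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /ingest/csv."""
    body = json.dumps({"files": files}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ingest/csv",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request once the API is running.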

/ingest : Automatically redirects to the appropriate endpoint depending on the type of input. It takes a JSON payload in one of the following formats:

{
    "files" : ["file1.csv", "file2.csv"]
}
OR
{
    "data" : [blob1, blob2]
}
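
The routing rule implied by the two payload shapes above can be sketched as follows. This is not the project's actual implementation, only a minimal illustration that assumes the choice depends on which top-level key is present and that exactly one of the two must be supplied.

```python
def choose_endpoint(payload: dict) -> str:
    """Pick the target endpoint from the payload's top-level key.

    Hypothetical helper: "files" maps to /ingest/csv and "data" maps to
    /ingest/blob; a payload with both or neither key is rejected.
    """
    has_files = "files" in payload
    has_data = "data" in payload
    if has_files == has_data:
        raise ValueError('payload must contain exactly one of "files" or "data"')
    return "/ingest/csv" if has_files else "/ingest/blob"
```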

/ingest/fast : An optimised version of the /ingest endpoint. It accepts the same payloads.

TESTING ENDPOINTS

/ingest : Automatically redirects to the appropriate endpoint depending on the type of input. It takes a JSON payload in one of the following formats:

{
    "files" : ["file1.csv", "file2.csv"]
}
OR
{
    "data" : [blob1, blob2]
}

/ingest/fast : An optimised version of the /ingest endpoint. It accepts the same payloads.
