Example use of a PySpark transformation script on sample CSV and JSON files. It demonstrates data loading, cleaning, transformation, and aggregation, with practical operations such as column calculations and data type changes, along with logging of the ETL steps.
This repository contains a basic example of using Apache Spark to load, clean, transform, and aggregate data from CSV and JSON files, illustrating key features of Spark's DataFrame API.
The project is designed to show how to use Apache Spark for simple ETL (Extract, Transform, Load) operations: loading data from CSV and JSON files, cleaning it, transforming it by adding new columns, and finally performing aggregations on the result.
The script processes employee data:
- The CSV file contains general information about employees (such as salary).
- The JSON file contains detailed information about employees (such as their age and city).
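The exact logic lives in the script itself, but a minimal sketch of the kind of pipeline described above might look like the following. The file names, column names (id, salary, age, city), and the join key are assumptions made for illustration only:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-pyspark-use").getOrCreate()

# Load: general employee info (e.g. salary) from CSV, details (age, city) from JSON.
# File names are hypothetical.
csv_df = spark.read.csv("data/employees.csv", header=True, inferSchema=True)
json_df = spark.read.json("data/employees.json")

# Clean: drop rows with missing values and remove duplicates.
csv_df = csv_df.dropna().dropDuplicates()
json_df = json_df.dropna().dropDuplicates()

# Transform: join both sources on an assumed "id" column, cast salary to double,
# and add a derived column (e.g. an annual salary).
employees = (
    csv_df.join(json_df, on="id", how="inner")
    .withColumn("salary", F.col("salary").cast("double"))
    .withColumn("annual_salary", F.col("salary") * 12)
)

# Aggregate: average salary per city.
avg_salary_by_city = employees.groupBy("city").agg(F.avg("salary").alias("avg_salary"))
avg_salary_by_city.show()
```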
The project requires:
- Python 3.5
- PySpark
- Python's logging module (for simple logging of the ETL steps)
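For reference, the ETL-step logging mentioned above can be handled with Python's standard logging module. This is only a sketch of a typical setup, not the script's exact configuration:

```python
import logging

# Basic logging setup; the actual script's format and level may differ.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("etl")

logger.info("Starting ETL: loading CSV and JSON input files")
# ... load / clean / transform / aggregate ...
logger.info("ETL finished: aggregated results ready")
```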
To run this project, you'll need to have Python and Apache Spark installed locally.
- Clone this repository to your local machine:

  ```bash
  git clone https://github.com/ur64n/example-pyspark-use
  ```
- Install the necessary Python dependencies:

  ```bash
  pip install pyspark
  ```
- Ensure that Spark is installed and available on your system PATH.
- Add your CSV and JSON data files to the appropriate directory (in this case, /home/urban/Projekty/etl_test/practise/data/).
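If you don't have real data yet, you can generate small sample files. The snippet below is only an illustration; the file names (employees.csv, employees.json) and columns are hypothetical and merely match the description earlier in this README:

```python
import json
from pathlib import Path

# Hypothetical sample data; adjust the directory to your configured data path.
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)

# CSV: general information about employees, e.g. salary.
(data_dir / "employees.csv").write_text(
    "id,name,salary\n1,Anna,5200\n2,Marek,6100\n3,Ewa,4800\n"
)

# JSON (one object per line, the default layout Spark reads): details such as age and city.
with open(data_dir / "employees.json", "w") as f:
    for row in [
        {"id": 1, "age": 29, "city": "Warsaw"},
        {"id": 2, "age": 41, "city": "Krakow"},
        {"id": 3, "age": 35, "city": "Gdansk"},
    ]:
        f.write(json.dumps(row) + "\n")
```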
Once everything is set up, you can run the script from the appropriate directory with:

```bash
python3 transform.py
```