Run a Python package on AWS EMR
Development install:
$ pip install -e .
Testing:
$ pip install tox
$ tox
The easiest way to get EMR up and running is to go through the AWS web interface: create an SSH key and start a cluster by hand. This creates the required subnet (subnet_id) and EMR roles.
Create a config.yaml per project, or a default one at ~/.config/spark-emr.yaml:
bootstrap_uri: s3://foo/bar
master:
  instance_type: m4.large
  size_in_gb: 100
core:
  instance_type: m4.large
  instance_count: 2
  size_in_gb: 100
ssh_key: XXXXX
subnet_id: subnet-XXXXXX
python_version: python36
emr_version: emr-5.20.0
consistent: false
optimization: false
region: eu-central-1
job_flow_role: EMR_EC2_DefaultRole
service_role: EMR_DefaultRole
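The choice between a per-project config.yaml and the default file could be sketched as below. This is only an illustration: the helper name `resolve_config` and the precedence order (explicit path, then project file, then user default) are assumptions, not documented behavior of spark-emr.

```python
from pathlib import Path
from typing import Optional

# Default location named in this README.
DEFAULT_CONFIG = Path.home() / ".config" / "spark-emr.yaml"


def resolve_config(explicit: Optional[str] = None,
                   project_dir: Optional[Path] = None,
                   default: Path = DEFAULT_CONFIG) -> Path:
    """Pick a config file (assumed precedence: explicit, project, default)."""
    if explicit:
        # A path passed via --config wins without further lookup.
        return Path(explicit)
    project_dir = project_dir or Path.cwd()
    candidates = [project_dir / "config.yaml", default]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        "no config found; tried: " + ", ".join(str(c) for c in candidates)
    )
```

Calling `resolve_config()` from a project directory would return its config.yaml if present, and otherwise fall back to the file under ~/.config.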
To run Python code on EMR you need to build a proper Python package, i.e. one with a setup.py that defines console_scripts. The script name needs to end in .py, otherwise YARN won't be able to execute it.
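As a sketch of this rule, the entry points of such a package could be declared as in the dictionary below, which would be passed to `setuptools.setup(..., entry_points=ENTRY_POINTS)` in setup.py. The package name `etl_pypackage` and the target `etl_pypackage.main:main` are hypothetical, and the two helper functions are only an illustrative pre-flight check, not part of spark-emr.

```python
# Would be passed to setuptools.setup(..., entry_points=ENTRY_POINTS) in setup.py.
# "etl_pypackage.main:main" is a hypothetical module path and function.
ENTRY_POINTS = {
    "console_scripts": [
        # The command name (left of "=") ends in .py so YARN can execute it.
        "etl.py = etl_pypackage.main:main",
    ],
}


def script_names(entry_points):
    """Return the command names declared under console_scripts."""
    specs = entry_points.get("console_scripts", [])
    return [spec.split("=", 1)[0].strip() for spec in specs]


def scripts_missing_py_suffix(entry_points):
    """Return the console script names that do not end in .py."""
    return [name for name in script_names(entry_points)
            if not name.endswith(".py")]
```

An empty result from `scripts_missing_py_suffix(ENTRY_POINTS)` means every declared script follows the naming rule.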
Bootstrap a cluster, install the Python package, execute the task given in --cmdline, poll the cluster until it has finished, then stop the cluster:
$ spark-emr start \
    [--config config.yaml] \
    --name "Spark-ETL" \
    --bid-master 0.04 \
    --bid-core 0.04 \
    --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
    --tags foo 2 bar 4 \
    --poll \
    --yarn-log \
    --package "../"
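The polling step can be pictured as a loop that asks EMR for the cluster state until nothing is left to do. The sketch below is an assumption about how such polling might look, not spark-emr's actual code: `wait_for_cluster` is a hypothetical helper, and `client` is expected to behave like a boto3 EMR client (`boto3.client("emr")`), whose `describe_cluster` response carries the state under `Cluster` / `Status` / `State`. Treating WAITING as finished is also an assumption.

```python
import time

# States in which the cluster has no more work to do. TERMINATED and
# TERMINATED_WITH_ERRORS are final; WAITING means idle with all steps done
# (counting it as finished is an assumption of this sketch).
DONE_STATES = {"WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(client, cluster_id, interval=30.0, sleep=time.sleep):
    """Poll the cluster state until it is done and return that state."""
    while True:
        response = client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in DONE_STATES:
            return state
        # Pause between requests to avoid hammering the EMR API.
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise with a fake client and no real waiting.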
Run with a released version of the package (installed via pip):
$ spark-emr start \
    ... \
    --package pip+etl_pypackage
Return the status of a cluster (including terminated ones):
$ spark-emr status --cluster-id j-XXXXX
List all clusters, optionally filtered by tag:
$ spark-emr list [--config config.yaml] [--filter somekey somevalue]
Stop a running cluster:
$ spark-emr stop --cluster-id j-XXXXX
Return the spot price for all regions and the configured instance types:
$ spark-emr spot
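To give an idea of where such prices can come from, the sketch below reduces EC2 spot price history to the most recent price per availability zone and instance type. It assumes input shaped like the `SpotPriceHistory` list returned by the boto3 EC2 call `describe_spot_price_history`; the helper `latest_spot_prices` is hypothetical and not necessarily how spark-emr computes its output.

```python
def latest_spot_prices(history):
    """Map (availability zone, instance type) to its most recent spot price.

    `history` is a list of dicts with the keys AvailabilityZone,
    InstanceType, SpotPrice and Timestamp, as found in the
    "SpotPriceHistory" part of a describe_spot_price_history response.
    """
    latest = {}
    for entry in history:
        key = (entry["AvailabilityZone"], entry["InstanceType"])
        # Keep only the newest entry per zone and instance type.
        if key not in latest or entry["Timestamp"] > latest[key]["Timestamp"]:
            latest[key] = entry
    # SpotPrice is delivered as a string, so convert it to a number.
    return {key: float(entry["SpotPrice"]) for key, entry in latest.items()}
```

The resulting numbers can be compared against the values passed to --bid-master and --bid-core before starting a cluster.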
The generated spark-submit command can also be run directly on the master node:
$ /usr/bin/spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
    --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
    /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv
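How a --cmdline value and a Python version might map onto this command can be sketched as follows. The function `build_spark_submit` and its parameters are hypothetical and only mirror the command shown above; the real tool may assemble it differently.

```python
import shlex


def build_spark_submit(cmdline, python_version="python35",
                       bin_dir="/usr/local/bin"):
    """Return the spark-submit argument list for a console script call."""
    # Split "etl.py --input ..." into the script name and its arguments.
    script, *args = shlex.split(cmdline)
    return [
        "/usr/bin/spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        # Use the same interpreter for the application master and executors.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_version}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python_version}",
        # The console script as installed by pip on the master.
        f"{bin_dir}/{script}",
        *args,
    ]
```

Joining the result with `shlex.join(...)` (Python 3.8+) gives a single line that can be pasted into a shell on the master.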
To test whether our Spark job runs as expected, we can run it locally in Docker:
$ git clone https://github.com/delijati/spark-docker
$ cd spark-docker
$ docker build . --pull -t spark
Now we can run our Spark job locally:
$ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
    bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"