Data Pipeline

This document provides steps to set up Apache Spark on a system and use it to extract data from a SQL Server database into a CSV file, then upload that file to an S3 bucket. It installs Spark, configures environment variables and paths, starts the Spark master and slave, downloads the necessary JAR files, writes a Python script that queries the database and saves the result to a CSV file, configures AWS credentials, and runs the script to extract the data, save it to local disk, and upload it to S3.


java -version; git --version; python --version

cd /opt/spark
sudo wget https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz

tar xvf spark-*

ls -lrt spark-*
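The archive unpacks into a versioned subdirectory (spark-3.0.3-bin-hadoop2.7), while SPARK_HOME below points at /opt/spark itself. A minimal way to reconcile the two, assuming the default extraction location, is to move the extracted contents up one level (alternatively, point SPARK_HOME at the versioned directory instead):

sudo mv /opt/spark/spark-3.0.3-bin-hadoop2.7/* /opt/spark/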

vi ~/.profile

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

source ~/.profile

start-master.sh
http://127.0.0.1:8080/

start-slave.sh spark://0.0.0.0:8082

start-slave.sh spark://waplgmdalin_lab01:8082

start-slave.sh spark://0.0.0.0:8082 -c 4 -m 512M


stop-master.sh

stop-slave.sh

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
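A quick check that the PYTHONPATH wiring works, assuming the exports above are active in the current shell:

python3 -c "import pyspark; print(pyspark.__version__)"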

sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar


sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
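These JARs are typically placed in $SPARK_HOME/jars so Spark picks them up. The Python script below also expects the Microsoft JDBC driver at /opt/spark/jars/mssql-jdbc-9.2.1.jre8.jar, which the steps above never fetch; a hedged sketch, assuming the standard Maven Central layout:

cd /opt/spark/jars
sudo wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar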

vi sqlfile.py

query1 = """select * from [sales-data] where [date] >= '2021-01-01' and status = 'Completed'"""

vi config.ini

[aws]
ACCESS_KEY=BBIYYTU6L4U47BGGG&^CF
SECRET_KEY=Uy76BBJabczF7h6vv+BFssqTVLDVkKYN/f31puYHtG
BUCKET_NAME=s3-bucket-name
DIRECTORY=sales-data-directory

[mssql-onprem]
url = jdbc:sqlserver://PPOP24888S08YTA.APAC.PAD.COMPANY-DSN.COM:1433;databaseName=Transactions
database = Transactions
user = MSSQL-USER
password = MSSQL-Password
dbtable = sales-data
filename = data_extract.csv

from pyspark.sql import SparkSession

import shutil
import os
import glob
import boto3
from sqlfile import query1
from configparser import ConfigParser

# Load connection details and AWS credentials from config.ini
config = ConfigParser()
config.read('config.ini')

appName = "PySpark ETL Example - via MS-SQL JDBC"
master = "local"

spark = SparkSession \
    .builder \
    .master(master) \
    .appName(appName) \
    .config("spark.driver.extraClassPath", "/opt/spark/jars/mssql-jdbc-9.2.1.jre8.jar") \
    .getOrCreate()

url = config.get('mssql-onprem', 'url')
user = config.get('mssql-onprem', 'user')
password = config.get('mssql-onprem', 'password')
dbtable = config.get('mssql-onprem', 'dbtable')
filename = config.get('mssql-onprem', 'filename')

ACCESS_KEY = config.get('aws', 'ACCESS_KEY')
SECRET_KEY = config.get('aws', 'SECRET_KEY')
BUCKET_NAME = config.get('aws', 'BUCKET_NAME')
DIRECTORY = config.get('aws', 'DIRECTORY')

# Run the query from sqlfile.py against SQL Server over JDBC
jdbcDF = spark.read.format("jdbc") \
    .option("url", url) \
    .option("query", query1) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
jdbcDF.show(5)

# Write a single CSV (with header) to a local 'output' folder, move it out under the
# configured filename, then remove the temporary folder
path = 'output'
jdbcDF.coalesce(1).write.option("header", "true").option("sep", ",").mode("overwrite").csv(path)
shutil.move(glob.glob(os.getcwd() + '/' + path + '/' + r'*.csv')[0], os.getcwd() + '/' + filename)
shutil.rmtree(os.getcwd() + '/' + path)
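The hadoop-aws, aws-java-sdk and jets3t JARs downloaded earlier are only needed if Spark writes to S3 directly rather than going through boto3 as this script does; a minimal sketch of that alternative, assuming those JARs are on the Spark classpath:

# Alternative (assumption, not used by this script): write straight to S3 via the s3a:// connector
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", SECRET_KEY)
jdbcDF.coalesce(1).write.option("header", "true").mode("overwrite") \
    .csv("s3a://" + BUCKET_NAME + "/" + DIRECTORY)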

# Upload the extracted CSV to S3 using the credentials read from config.ini
session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
)
bucket_name = BUCKET_NAME
s3_output_key = DIRECTORY + '/' + filename
s3 = session.resource('s3')
# Filename - File to upload
# Bucket - Bucket to upload to (the top level directory under AWS S3)
# Key - S3 object name (can contain subdirectories). If not specified then file_name is used
s3.meta.client.upload_file(Filename=filename, Bucket=bucket_name, Key=s3_output_key)

# Remove the local copy once the upload has finished
if os.path.isfile(filename):
    os.remove(filename)
else:
    print("Error: %s file not found" % filename)
