Data Engineering & Apache Spark Guide

The document provides an overview of data engineering and Apache Spark. It discusses what data engineers do, reference architectures for data engineering platforms, and introduces Apache Spark and Databricks.

Uploaded by

thulasi narravula

ScholarNest

Introduction to Data Engineering

Data Engineers – Reference Architecture
Image Source: Google Cloud Documentation

What do they do?
• Develop and Manage Operational Systems
• Banking Apps
• E-Commerce Apps
• OTT Applications
• IoT Applications

What do they do?
• Collect data
• Transform
• Quality Check
• Standardize
• Prepare/Model
• Facilitate Consumption

What do they do?
• Optimize
• Fraud Prevention
• Grow
• Recommendations
• Monitor/Report
• Sales/Revenue

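
The data engineering steps listed above (collect, quality check, standardize, prepare) can be illustrated with a small plain-Python sketch. The record fields (`id`, `amount`, `currency`) and all function names are hypothetical, chosen only for illustration; a real platform would run these steps on a distributed engine rather than on in-memory lists.

```python
from collections import defaultdict

def collect():
    """Collect: stand-in for reading raw records from an operational system."""
    return [
        {"id": "1", "amount": "10.50", "currency": "usd"},
        {"id": "2", "amount": "oops", "currency": "USD"},   # fails the quality check
        {"id": "3", "amount": "4.25", "currency": " eur "},
    ]

def is_valid(record):
    """Quality check: keep only records whose amount parses as a number."""
    try:
        float(record["amount"])
        return True
    except ValueError:
        return False

def standardize(record):
    """Standardize: enforce types and a single currency-code format."""
    return {
        "id": int(record["id"]),
        "amount": float(record["amount"]),
        "currency": record["currency"].strip().upper(),
    }

def prepare(records):
    """Prepare/Model: total amount per currency, ready for consumption."""
    totals = defaultdict(float)
    for r in records:
        totals[r["currency"]] += r["amount"]
    return dict(totals)

clean = [standardize(r) for r in collect() if is_valid(r)]
report = prepare(clean)
print(report)  # {'USD': 10.5, 'EUR': 4.25}
```

Keeping each step as its own function mirrors the list: a failed quality check removes a record before standardization, so later steps can assume well-formed input.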
Data Engineering Platform – Reference Architecture

Data Engineering Functions

Approaches (Batch/Stream-RT/NRT)
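
To make the batch and stream approaches concrete, here is a minimal plain-Python sketch. It assumes that RT/NRT stands for real-time and near-real-time, and it imitates stream processing with small micro-batches; the event shape and function names are hypothetical. Batch computes one result over the complete dataset, while the streaming version keeps running state and emits an updated result as each chunk arrives.

```python
def batch_total(events):
    """Batch: process the complete, bounded dataset in one pass."""
    return sum(e["value"] for e in events)

def micro_batches(events, size):
    """Yield consecutive slices of the input, imitating data arriving over time."""
    for start in range(0, len(events), size):
        yield events[start:start + size]

def streaming_totals(events, size):
    """Stream: keep running state and emit an updated total after each micro-batch."""
    running = 0
    for chunk in micro_batches(events, size):
        running += sum(e["value"] for e in chunk)
        yield running

events = [{"value": v} for v in (3, 5, 2, 7, 1)]
print(batch_total(events))                # 18
print(list(streaming_totals(events, 2)))  # [8, 17, 18]
```

The last streaming value equals the batch result; the difference is how early intermediate answers become available.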
Lakehouse Medallion Architecture
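
The medallion architecture is commonly described as three layers: bronze (raw data as ingested), silver (cleaned and deduplicated), and gold (aggregated for consumption). A minimal plain-Python sketch of that flow follows. The sample rows and field names are hypothetical, and in a lakehouse each layer would be a set of tables rather than a Python list.

```python
# Bronze: raw rows kept exactly as ingested, including duplicates and bad values.
bronze = [
    {"order_id": 1, "city": "pune", "total": "120"},
    {"order_id": 1, "city": "pune", "total": "120"},   # duplicate delivery
    {"order_id": 2, "city": "Delhi", "total": None},   # missing total
    {"order_id": 3, "city": "Pune", "total": "80"},
]

def to_silver(rows):
    """Silver: drop incomplete rows, deduplicate by order_id, fix types and casing."""
    seen, out = set(), []
    for row in rows:
        if row["total"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        out.append({
            "order_id": row["order_id"],
            "city": row["city"].title(),
            "total": int(row["total"]),
        })
    return out

def to_gold(rows):
    """Gold: business-level aggregate, here revenue per city."""
    revenue = {}
    for row in rows:
        revenue[row["city"]] = revenue.get(row["city"], 0) + row["total"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'Pune': 200}
```

Because bronze is never modified, the silver and gold layers can be rebuilt from it if a cleaning rule changes.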

Introduction to Apache Spark

What is Apache Spark?

Apache Spark is an engine for executing data engineering, stream processing, and machine learning on distributed clusters.

Capabilities
• ANSI SQL
• Batch Processing API
• Stream Processing API
• Graph Processing API
• Machine Learning API

Who is using Apache Spark?

Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
What is Apache Spark – A Unified Framework

Programming API/DSL

Spark Framework

Resource Manager (YARN | Standalone | Kubernetes)

Compute Cluster

Distributed Storage (HDFS | S3 | ADLS | GCS)
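
The resource manager and storage layers in this stack usually surface to a Spark application as two strings: the master URL passed to `--master` (or `spark.master`) and the URI scheme of the data path. The sketch below builds both in plain Python. The helper names, host name, and bucket name are hypothetical; the master URL formats and the `hdfs://`, `s3a://`, `abfss://`, and `gs://` schemes are the ones commonly used with Spark and the Hadoop storage connectors, but check them against the versions you deploy.

```python
def master_url(manager, host=None, port=None):
    """Return the value Spark expects for --master / spark.master."""
    if manager == "yarn":
        return "yarn"  # cluster location comes from the Hadoop configuration
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "kubernetes":
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown resource manager: {manager}")

# URI scheme per storage system from the diagram.
STORAGE_SCHEMES = {
    "HDFS": "hdfs://",
    "S3": "s3a://",      # Hadoop S3A connector
    "ADLS": "abfss://",  # ADLS Gen2 over TLS
    "GCS": "gs://",      # Cloud Storage connector
}

def storage_uri(store, location):
    """Prefix a bucket/container path with the scheme for the given store."""
    return STORAGE_SCHEMES[store] + location.lstrip("/")

print(master_url("standalone", "spark-master.example.com", 7077))
# spark://spark-master.example.com:7077
print(storage_uri("S3", "my-bucket/raw/events/"))
# s3a://my-bucket/raw/events/
```

The application code stays the same across these combinations; only the two strings change, which is the abstraction the layered diagram is pointing at.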

Why Apache Spark?

• Abstraction
• Ease of use
• Unified Platform
• Open Source
• Wide Ecosystem

Spark Framework

Resource Manager (YARN | Standalone | Kubernetes)

Compute Cluster

Distributed Storage (HDFS | S3 | ADLS | GCS)

Missing features from Apache Spark

Data Storage Infrastructure

ACID Transaction capabilities

Metadata Catalog

Cluster Management

Automation APIs and Tools

Spark Platforms

Cloudera Hadoop Platform
Amazon EMR
Azure HDInsight
Google Cloud Dataproc
Databricks Platform

Introduction to Databricks
Databricks Features

Databricks Cloud

Spark as Cloud-Native Technology
Secure Cloud Storage Integration
ACID Transaction via Delta Lake Integration
Unity Catalog for Metadata Management
Cluster Management
Photon Query Engine
Notebooks and Workspace
Administration Controls
Optimized Spark Runtime
Automation Tools
Databricks Cloud – Key Integrations

| Service | Azure | AWS | GCP |
|---|---|---|---|
| CI/CD | Azure DevOps, GitHub Enterprise | AWS CodeBuild, AWS CodeDeploy, AWS CodePipeline | Google Cloud Build, Google Cloud Deploy |
| Data warehouse | Azure Synapse Analytics | Amazon Redshift | BigQuery |
| Data integration | Azure Data Factory | AWS Glue, AWS Data Pipeline | Google Cloud Data Fusion |
| Messaging | Azure Service Bus, Azure Event Hubs | Amazon Kinesis, Amazon SNS, Amazon SQS | Google Pub/Sub |
| Workflow orchestration | Azure Data Factory | AWS Data Pipeline, AWS Glue, Apache Airflow | Cloud Composer |
| Document data | Azure Cosmos DB | Amazon DocumentDB | Firestore |
| NoSQL (key/value) | Azure Cosmos DB | Amazon DynamoDB | Cloud Bigtable |
| RDBMS | Azure SQL Database | Amazon Aurora, Amazon RDS | Cloud SQL |
| Storage transfer | Azure Data Factory, Azure Storage Mover | AWS Storage Gateway, AWS DataSync | Storage Transfer Service |
| Network connectivity | Azure Virtual Private Network | AWS Virtual Private Network | Cloud VPN |
| Audit logging | Azure Audit Logs | AWS CloudTrail | Cloud Audit Logs |
| Key management | Azure Key Vault | AWS KMS | Cloud KMS |
| Identity | Azure Identity Management | AWS IAM | Google Cloud IAM |
| Storage | Azure Blob Storage, ADLS Gen2 | Amazon S3 | Google Cloud Storage |
