PIERIAN CLOUD
GCP Professional Data Engineer
Cloud Dataflow
● Section Overview:
○ Cloud Dataflow Overview
○ Cloud Dataflow
■ Pipelines
■ Templates
■ SQL
○ Cloud Dataflow Demonstration
Let’s get started!
Cloud Dataflow Overview
● We’ve discovered that many of the data
engineering services on GCP are managed
versions of open-source software.
● Cloud Dataflow is a managed service for
running Apache Beam pipelines.
● Apache Beam was actually developed at
Google and later donated to the Apache
Software Foundation.
● Using a managed service like Cloud
Dataflow allows you to focus your efforts
on designing the data processing job,
rather than dealing with the underlying
orchestration or infrastructure.
● Before we dive deeper into Apache Beam
and Cloud Dataflow, let’s quickly draw the
distinction between Cloud Dataproc and
Cloud Dataflow.
                          Cloud Dataflow                     Cloud Dataproc
Use Case:                 Unified batch and streaming data   Hadoop/Spark based applications
Autoscaling Capability:   Yes                                Yes
Fully-managed:            Yes                                No
Open Source Foundation:   Apache Beam                        Hadoop/Spark and derivatives (Pig, Presto, Hive, etc.)
● Apache Beam is an open source unified
programming model to define and execute
data processing pipelines, including ETL,
batch and stream (continuous) processing.
● The Apache Beam programming model
simplifies the mechanics of large-scale
data processing.
● As we’ve mentioned, Dataflow is a tool for
unified stream and batch data processing.
● This is actually the source of the name
Apache Beam:
○ Batch
○ Stream
● Using one of the Apache Beam SDKs, you
build a program that defines the pipeline.
● Then, one of Apache Beam's supported
distributed processing backends, such as
Dataflow, executes the pipeline.
● Let’s go through a few key concepts for
Apache Beam:
○ Pipelines
○ PCollection
○ Transforms
○ ParDo
● Pipelines:
○ A pipeline encapsulates the entire series
of computations involved in reading
input data, transforming that data, and
writing output data.
● Pipelines:
○ The input source and output sink can be
the same or of different types, allowing
you to convert data from one format to
another.
● Pipelines:
○ Apache Beam programs start by
constructing a Pipeline object, and then
using that object as the basis for
creating the pipeline's datasets.
○ Each pipeline represents a single,
repeatable job.
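A minimal sketch of this flow using the Apache Beam Python SDK (the file names and the pipeline steps here are illustrative assumptions, not part of the original slides):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Options select the execution backend. DirectRunner runs locally;
# switching to runner="DataflowRunner" (plus project, region, and
# staging options) would execute the same pipeline on Cloud Dataflow.
options = PipelineOptions(runner="DirectRunner")

# The Pipeline object encapsulates the whole job:
# read input, transform it, and write output.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read input" >> beam.io.ReadFromText("input.txt")   # source
        | "To uppercase" >> beam.Map(str.upper)                # transform
        | "Write output" >> beam.io.WriteToText("output")     # sink
    )
```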
Pipelines
[Diagram: PCollection_in → Transform1 → Transform2 → PCollection_out]
● PCollection:
○ A PCollection represents a potentially
distributed, multi-element dataset that
acts as the pipeline's data.
○ Apache Beam transforms use
PCollection objects as inputs and
outputs for each step in your pipeline.
● PCollection:
○ A PCollection can hold a dataset of a
fixed size or an unbounded dataset from
a continuously updating data source.
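A quick sketch of a bounded PCollection in the Python SDK (the element values are made up for illustration):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # beam.Create builds a bounded PCollection from an in-memory list;
    # a source like Pub/Sub would instead yield an unbounded PCollection.
    numbers = pipeline | "Create" >> beam.Create([1, 2, 3, 4, 5])
    numbers | "Print" >> beam.Map(print)
```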
● Transforms:
○ A transform represents a processing
operation that transforms data.
○ A transform takes one or more
PCollections as input, performs an
operation that you specify on each
element in that collection, and produces
one or more PCollections as output.
● Transforms:
○ A transform can perform nearly any kind
of processing operation, including
performing mathematical computations
on data, converting data from one
format to another, grouping data
together, reading and writing data,
filtering data, and more!
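A sketch of chaining a few built-in transforms in the Python SDK (the sample data and the odd-value filter are illustrative assumptions):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("a", 1), ("b", 2), ("a", 3)])
        # Filtering transform: keep only elements with odd values.
        | beam.Filter(lambda kv: kv[1] % 2 == 1)
        # Grouping transform: collect values per key.
        | beam.GroupByKey()
        | beam.Map(print)  # e.g. ('a', [1, 3])
    )
```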
Pipelines
[Diagram: PCollection_in branches into two parallel paths:
Transform1 produces PCollection_out_v1, and
Transform2 produces PCollection_out_v2]
● ParDo:
○ ParDo is the core parallel processing
operation in the Apache Beam SDKs,
invoking a user-specified function on
each of the elements of the input
PCollection.
● ParDo:
○ ParDo collects the zero or more output
elements into an output PCollection.
○ The ParDo transform processes
elements independently and possibly in
parallel.
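A minimal ParDo sketch in the Python SDK (the SplitWords DoFn and the sample sentences are illustrative assumptions):

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    # process() is invoked once per input element and may emit zero or
    # more output elements; ParDo collects them into the output PCollection.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(["the quick brown fox", "jumps over"])
        | beam.ParDo(SplitWords())
        | beam.Map(print)
    )
```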
● Leveraging Dataflow allows you to easily
use Apache Beam to connect unified
stream and batch data processing to other
GCP services.
○ Stream from Pub/Sub to BigQuery.
○ Connect TensorFlow ML to streaming
data sources.
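A sketch of the first pattern in the Python SDK (the topic, table, and schema names are placeholder assumptions; the job must run in streaming mode and the Pub/Sub and BigQuery resources must already exist):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Unbounded source: messages arrive continuously from Pub/Sub.
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | beam.Map(json.loads)
        # Streaming sink: append each row to a BigQuery table.
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="name:STRING,count:INTEGER",
        )
    )
```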
● Review:
○ We learned about Cloud Dataflow and its
relationship to Apache Beam.
● Up Next:
○ We’ll dive into specific Cloud Dataflow
operations, such as creating pipelines,
templates, and using SQL with Dataflow.
Cloud Dataflow: Pipelines, Templates, and SQL
● When creating data pipelines with Apache
Beam, pipeline development and job
execution typically all happen within a
development environment.
● Templated Dataflow jobs allow you to
separate the staging and execution steps.
● Dataflow templates allow you to stage
your pipelines on Google Cloud and run
them using the Google Cloud console, the
Google Cloud CLI, or REST API calls.
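For example, one of Google's provided classic templates can be run straight from the Google Cloud CLI. A sketch using the public Word_Count template (MY_BUCKET is a placeholder for your own Cloud Storage bucket):

```
gcloud dataflow jobs run wordcount-example \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region us-central1 \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://MY_BUCKET/results/output
```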
● There are currently two types of templates:
○ Classic templates are staged as execution
graphs on Cloud Storage.
○ Flex Templates package the pipeline as a
Docker image and stage these images in
your project's Container Registry or
Artifact Registry.
● Google recommends Flex Templates.
● Flex Templates offer more flexibility than
classic templates by allowing minor
variations of Dataflow jobs to be launched
from a single template and by allowing the
use of any source or sink I/O. (For classic
templates, the execution graph is built
during the template creation process.)
● Template Advantages:
○ You can run your pipelines without the
development environment and
associated dependencies that are
common with non-templated
deployment. This is useful for scheduling
recurring batch jobs.
○ Templates separate the pipeline
construction (performed by developers)
from the running of the pipeline.
○ Hence, there's no need to recompile the
code every time the pipeline is run.
○ Runtime parameters allow you to
customize the running of the pipeline.
○ Non-technical users can run templates
with the Google Cloud console, Google
Cloud CLI, or the REST API.
○ Google provides a huge list of templates
ready for you to use for common data
pipelines:
■ cloud.google.com/dataflow/docs/guides/templates/provided-templates
● Another great feature of Cloud Dataflow is
Cloud Dataflow SQL, which lets you use SQL
queries to develop and run Cloud Dataflow
jobs directly from the BigQuery web user
interface.
● Dataflow SQL queries use the Dataflow
SQL query syntax. The Dataflow SQL query
syntax is similar to BigQuery standard SQL.
● You can use the Dataflow SQL streaming
extensions to aggregate data from
continuously updating Dataflow sources
like Pub/Sub.
● For example, the following query counts
the passengers in a Pub/Sub stream of taxi
rides every minute:
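A sketch of such a query, modeled on the taxirides example in Google's Dataflow SQL documentation (the topic shown is Google's public pubsub-public-data taxirides-realtime feed; treat the exact names as assumptions):

```sql
SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE ride_status = "pickup"
GROUP BY TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')
```

TUMBLE groups the unbounded stream into fixed one-minute windows, which is what "every minute" refers to on the slide.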
● Dataflow SQL empowers anyone who knows
SQL to take full advantage of the
capabilities of Cloud Dataflow!
● Review:
○ We explored additional Cloud Dataflow
features for pipelines, including
templates and Dataflow SQL.
● Up Next:
○ We’ll go through a demonstration of
using Cloud Dataflow.
DEMO:
Cloud Dataflow