PIERIAN CLOUD
GCP Professional Data Engineer
Cloud Dataflow
● Section Overview:
○ Cloud Dataflow Overview
○ Cloud Dataflow
■ Pipelines
■ Templates
■ SQL
○ Cloud Dataflow Demonstration
Let’s get started!
Cloud Dataflow Overview
● We’ve discovered that many of the data
engineering services on GCP are managed
versions of open-source software.
● Cloud Dataflow is a managed service for
running Apache Beam pipelines.
● Apache Beam was actually developed at
Google and later donated to the Apache
Software Foundation.
● Using a managed service like Cloud
Dataflow allows you to focus your efforts
on designing the data processing job,
rather than dealing with the underlying
orchestration or infrastructure.
● Before we dive deeper into Apache Beam
and Cloud Dataflow, let’s quickly draw the
distinction between Cloud Dataproc and
Cloud Dataflow.
                          Cloud Dataflow                     Cloud Dataproc
Use Case:                 Unified batch and streaming data   Hadoop/Spark based applications
Autoscaling Capability:   Yes                                Yes
Fully-managed:            Yes                                No
Open Source Foundation:   Apache Beam                        Hadoop/Spark and derivatives (Pig, Presto, Hive, etc.)
● Apache Beam is an open source unified
programming model to define and execute
data processing pipelines, including ETL,
batch and stream (continuous) processing.
● The Apache Beam programming model
simplifies the mechanics of large-scale
data processing.
● As we’ve mentioned, Dataflow is a tool for
unified stream and batch data processing.
● This is actually the source of the name
Apache Beam:
○ Batch
○ Stream
● Using one of the Apache Beam SDKs, you
build a program that defines the pipeline.
● Then, one of Apache Beam's supported
distributed processing backends, such as
Dataflow, executes the pipeline.
● Let’s go through a few key concepts for
Apache Beam:
○ Pipelines
○ PCollection
○ Transforms
○ ParDo
● Pipelines:
○ A pipeline encapsulates the entire series
of computations involved in reading
input data, transforming that data, and
writing output data.
● Pipelines:
○ The input source and output sink can be
the same or of different types, allowing
you to convert data from one format to
another.
● Pipelines:
○ Apache Beam programs start by
constructing a Pipeline object, and then
using that object as the basis for
creating the pipeline's datasets.
○ Each pipeline represents a single,
repeatable job.
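A minimal sketch of this flow using the Apache Beam Python SDK (the file names and the pipeline steps here are illustrative assumptions, not part of the original slides):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Options select the execution backend. DirectRunner runs locally;
# switching to runner="DataflowRunner" (plus project, region, and
# staging options) would execute the same pipeline on Cloud Dataflow.
options = PipelineOptions(runner="DirectRunner")

# The Pipeline object encapsulates the whole job:
# read input, transform it, and write output.
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read input" >> beam.io.ReadFromText("input.txt")   # source
        | "To uppercase" >> beam.Map(str.upper)                # transform
        | "Write output" >> beam.io.WriteToText("output")     # sink
    )
```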
Pipelines
[Diagram: PCollection_in → Transform1 → Transform2 → PCollection_out]
● PCollection:
○ A PCollection represents a potentially
distributed, multi-element dataset that
acts as the pipeline's data.
○ Apache Beam transforms use
PCollection objects as inputs and
outputs for each step in your pipeline.
● PCollection:
○ A PCollection can hold a dataset of a
fixed size or an unbounded dataset from
a continuously updating data source.
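A quick sketch of a bounded PCollection in the Python SDK (the element values are made up for illustration):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # beam.Create builds a bounded PCollection from an in-memory list;
    # a source like Pub/Sub would instead yield an unbounded PCollection.
    numbers = pipeline | "Create" >> beam.Create([1, 2, 3, 4, 5])
    numbers | "Print" >> beam.Map(print)
```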
● Transforms:
○ A transform represents a processing
operation that transforms data.
○ A transform takes one or more
PCollections as input, performs an
operation that you specify on each
element in that collection, and produces
one or more PCollections as output.
● Transforms:
○ A transform can perform nearly any kind
of processing operation, including
performing mathematical computations
on data, converting data from one
format to another, grouping data
together, reading and writing data,
filtering data, and more!
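A sketch of chaining a few built-in transforms in the Python SDK (the sample data and the odd-value filter are illustrative assumptions):

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("a", 1), ("b", 2), ("a", 3)])
        # Filtering transform: keep only elements with odd values.
        | beam.Filter(lambda kv: kv[1] % 2 == 1)
        # Grouping transform: collect values per key.
        | beam.GroupByKey()
        | beam.Map(print)  # e.g. ('a', [1, 3])
    )
```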
Pipelines
[Diagram: PCollection_in branches into two parallel paths:
Transform1 produces PCollection_out_v1, and
Transform2 produces PCollection_out_v2]
● ParDo:
○ ParDo is the core parallel processing
operation in the Apache Beam SDKs,
invoking a user-specified function on
each of the elements of the input
PCollection.
● ParDo:
○ ParDo collects the zero or more output
elements into an output PCollection.
○ The ParDo transform processes
elements independently and possibly in
parallel.
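A minimal ParDo sketch in the Python SDK (the SplitWords DoFn and the sample sentences are illustrative assumptions):

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    # process() is invoked once per input element and may emit zero or
    # more output elements; ParDo collects them into the output PCollection.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create(["the quick brown fox", "jumps over"])
        | beam.ParDo(SplitWords())
        | beam.Map(print)
    )
```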
● Leveraging Dataflow allows you to easily
use Apache Beam to connect unified
stream and batch data processing to other
GCP services.
○ Stream from Pub/Sub to BigQuery.
○ Connect TensorFlow ML to streaming
data sources.
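A sketch of the first pattern in the Python SDK (the topic, table, and schema names are placeholder assumptions; the job must run in streaming mode and the Pub/Sub and BigQuery resources must already exist):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Unbounded source: messages arrive continuously from Pub/Sub.
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | beam.Map(json.loads)
        # Streaming sink: append each row to a BigQuery table.
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="name:STRING,count:INTEGER",
        )
    )
```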
● Review:
○ We learned about Cloud Dataflow and its
relationship to Apache Beam.
● Up Next:
○ We’ll dive into specific Cloud Dataflow
operations, such as creating pipelines,
templates, and using SQL with Dataflow.
Cloud Dataflow: Pipelines, Templates, and SQL
● When creating data pipelines with Apache
Beam, pipeline development and job
execution typically all happen within a
development environment.
● Templated Dataflow jobs allow you to
separate the staging and execution steps.
● Dataflow templates allow you to stage
your pipelines on Google Cloud and run
them using the Google Cloud console, the
Google Cloud CLI, or REST API calls.
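For example, one of Google's provided classic templates can be run straight from the Google Cloud CLI. A sketch using the public Word_Count template (MY_BUCKET is a placeholder for your own Cloud Storage bucket):

```
gcloud dataflow jobs run wordcount-example \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region us-central1 \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://MY_BUCKET/results/output
```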
● There are currently two types of templates:
○ Classic templates are staged as execution
graphs on Cloud Storage.
○ Flex Templates package the pipeline as a
Docker image and stage these images in
your project's Container Registry or
Artifact Registry.
● Google recommends Flex Templates.
● Flex Templates offer more flexibility than
classic templates by allowing minor
variations of Dataflow jobs to be launched
from a single template and by allowing the
use of any source or sink I/O. (For classic
templates, the execution graph is built
during the template creation process.)
● Template Advantages:
○ You can run your pipelines without the
development environment and
associated dependencies that are
common with non-templated
deployment. This is useful for scheduling
recurring batch jobs.
○ Templates separate the pipeline
construction (performed by developers)
from the running of the pipeline.
○ Hence, there's no need to recompile the
code every time the pipeline is run.
○ Runtime parameters allow you to
customize the running of the pipeline.
○ Non-technical users can run templates
with the Google Cloud console, Google
Cloud CLI, or the REST API.
○ Google provides a huge list of templates
ready for you to use for common data
pipelines:
■ cloud.google.com/dataflow/docs/guides/templates/provided-templates
● Another great feature of Cloud Dataflow is
Cloud Dataflow SQL, which lets you use SQL
queries to develop and run Cloud Dataflow
jobs directly from the BigQuery web user
interface.
● Dataflow SQL queries use the Dataflow
SQL query syntax. The Dataflow SQL query
syntax is similar to BigQuery standard SQL.
● You can use the Dataflow SQL streaming
extensions to aggregate data from
continuously updating Dataflow sources
like Pub/Sub.
● For example, the following query counts
the passengers in a Pub/Sub stream of taxi
rides every minute:
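A sketch of such a query, modeled on the taxirides example in Google's Dataflow SQL documentation (the topic shown is Google's public pubsub-public-data taxirides-realtime feed; treat the exact names as assumptions):

```sql
SELECT
  TUMBLE_START('INTERVAL 1 MINUTE') AS period_start,
  SUM(passenger_count) AS pickup_count
FROM pubsub.topic.`pubsub-public-data`.`taxirides-realtime`
WHERE ride_status = "pickup"
GROUP BY TUMBLE(event_timestamp, 'INTERVAL 1 MINUTE')
```

TUMBLE groups the unbounded stream into fixed one-minute windows, which is what "every minute" refers to on the slide.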
● Dataflow SQL empowers anyone who knows
SQL to take full advantage of the
capabilities of Cloud Dataflow!
● Review:
○ We explored additional Cloud Dataflow
features for pipelines, including
templates and Dataflow SQL.
● Up Next:
○ We’ll go through a demonstration of
using Cloud Dataflow.
DEMO:
Cloud Dataflow