Large Scale Data Pipelines

The document provides an overview of modern data pipelines and various technologies used to build them, including Kafka, Akka Streams, Play Framework, Flink, and Cassandra. It discusses why streaming data pipelines are important for real-time processing and intelligence, and outlines requirements like scalability, availability, and distribution. Traditional ETL is compared to modern streaming approaches. Key concepts for each technology are briefly explained, with Kafka focusing on its distributed commit log architecture and Akka Streams on reactive streams.

Modern Data Pipelines

Ryan Knight (@TODO)
James Ward (@_JamesWard)

Ryan Knight
• Architect at Starbucks
• Distributed Systems guru
• Scala, Akka, Cassandra Expert & Trainer
• Skis with his 5 boys in Park City, UT
• First time to jFokus

James Ward
• Developer at Salesforce
• Back-end Developer
• Creator of WebJars
• Blog: www.jamesward.com
• Not a JavaScript Fan
• In love with FP
Agenda

• Modern Data Pipeline Overview
• Kafka
• Akka Streams
• Play Framework
• Flink
• Cassandra
• Spark Streaming

Code: github.com/jamesward/koober
Modern Data Pipelines
Real-Time, Distributed, Decoupled
Why Streaming Pipelines
• Real Time Value - allow the business to react to data in real time instead of in batches
• Real Time Intelligence - provide real-time information so that apps can use it to adapt their user interactions
• Distributed data processing that is both scalable and resilient
• Clickstream analysis
• Real-time anomaly detection
• Instant (< 10 s) feedback - e.g. real-time concurrent video viewers / page views
Data Pipeline Requirements

• Ability to process massive amounts of data
• Handle data from a wider variety of sources
• Highly Available
• Resilient - not just fault tolerant
• Distributed for Scale of Data and Transactions
• Elastic
• Uniformity - all-JVM based for easy deployment and management
Traditional ETL
Data Integration Today
Data Pipelines today

http://ferd.ca/queues-don-t-fix-overload.html
Backpressure

​ http://ferd.ca/queues-don-t-fix-overload.html
Data Hub / Stream Processing
Pipeline Architecture

[Architecture diagram: Web Client, Play App, Kafka, Flink, Spark Streaming, Spark (core, streaming, graphx, mllib, ...), Spark Notebook, and Cassandra (cold data)]
Koober
github.com/jamesward/koober
Kafka
Distributed Commit Logs
What is Kafka?
• Kafka is a distributed and partitioned commit log
• A replacement for traditional message queues and publish-subscribe systems
• A central data backbone or hub
• Designed to scale transparently with replication across the cluster
Core Principles

1. One pipeline to rule them all


2. Stream processing >> messaging
3. Clusters not servers
4. Pull Not Push
Kafka Characteristics
• Scalability of a filesystem
  • Hundreds of MB/sec/server throughput
  • Many TB per server
• Durable - guarantees of a database
  • Messages strictly ordered (within a partition)
  • All data persistent
• Distributed by default
  • Replication
  • Partitioning model
Kafka is about logs
The Event Log
• Append-Only Logging
• Database of Facts
• Disks are Cheap
• Why Delete Data Anymore?
• Replay Events
Append Only Logging
Logs: pub/sub done right
Kafka Overview
• Producers write data to brokers.
• Consumers read data from brokers.
• Brokers - each server running Kafka is called a broker.
• All this is distributed.
• Data
  – Data is stored in topics.
  – Topics are split into partitions, which are replicated.
• Built-in parallelism and scale


http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Partitions
• A topic consists of partitions.
• Partition: an ordered, immutable sequence of messages that is continually appended to
Partition offsets
• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
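A minimal sketch (not from the deck) of producing keyed records from Scala with the standard Kafka Java client, so that all events for the same key land in the same partition and keep their order; the "rides" topic and the payload are hypothetical:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// records with the same key hash to the same partition, preserving their order
producer.send(new ProducerRecord[String, String]("rides", "rider-42", """{"lat":59.33,"lng":18.06}"""))
producer.close()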
Example: a fault-tolerant hash table on the Kafka log
[Diagram: a consumer group C1 applies the logged operations and arrives at the final state]
Heroku Kafka

• Managed Kafka Cloud Service
• https://www.heroku.com/kafka
Code
Akka Streams
Reactive Streams Built on Akka
Reactive Streams
​A JVM standard for asynchronous stream processing with non-blocking back pressure
Akka Streams

• Powered by Akka Actors
• An implementation of Reactive Streams
• Actors can be used directly or just internally
• Stream processing functions: map, filter, fold, etc.
Sink & Source

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

// running a stream requires an ActorSystem and (pre Akka 2.6) a materializer
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

val source = Source.repeat("hello, world")
val sink = Sink.foreach(println)
source.to(sink).run() // connects the Source to the Sink and runs the graph
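A minimal sketch of the map/filter/fold combinators mentioned above, assuming the same implicit system and materializer; the numbers are purely illustrative:

import scala.concurrent.Future
import system.dispatcher // execution context for the Future callback

// keep the even numbers, double them, then sum them with a folding Sink
val sum: Future[Int] =
  Source(1 to 10)
    .filter(_ % 2 == 0)
    .map(_ * 2)
    .runWith(Sink.fold(0)(_ + _))

sum.foreach(println) // prints 60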
Code
Play Framework
Web Framework Built on Akka Streams
Play Framework
Scala & Java – Built on Akka Streams

Declarative Routing:
GET  /foo  controllers.Foo.index

Controllers Hold Stateless Functions:

class Foo extends Controller {
  def index() = Action {
    Ok("hello, world")
  }
}
Reactive Requests
Don't block in wait states!

def doLater = Action.async {
  Promise.timeout(Ok("hello, world"), 5.seconds)
}

def reactiveRest = Action.async {
  ws.url("http://api.foo.com/bar").get().map { response =>
    Ok(response.json)
  }
}
WebSockets
Built on Akka Streams

def ws = WebSocket.accept[String, String] { request =>
  val sink = ...
  val source = ...
  Flow.fromSinkAndSource(sink, source)
}
Views
Serverside Templating with a Subset of Scala

app/views/blah.scala.html:

@(foo: String)
<html>
<body>
  @foo
</body>
</html>

Action {
  Ok(views.html.blah("bar"))
}

Renders as:

<html>
<body>
  bar
</body>
</html>
Demo & Code
Flink
Real-time Data Analytics
Flink
​Real-time Data Analytics

• Bounded & Unbounded Data Sets


• Stream processing
• Distributed Core
• Fault Tolerant
• Clustered

• Flexible Windowing
Apache Flink
Continuous Processing for Unbounded Datasets
[Diagram: a running count() over an unbounded stream]

Windowing
Bounding with Time, Count, Session, or Data
[Diagram: count() computed per one-second window]

Batch Processing
Stream Processing on Finite Streams
[Diagram: count() over a finite stream]
Data Processing
​What can we do?

• Aggregate / Accumulate: fold(), reduce(), sum(), min()
• Transform: map(), flatMap()
• Filter: filter(), distinct()
• Sort: sortGroup(), sortPartition()
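A minimal sketch of these operations in Flink's Scala API (a windowed word count over a socket source); the host, port, and window size are assumptions, not from the deck:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// transform, filter, and aggregate an unbounded stream in five-second windows
val counts = env
  .socketTextStream("localhost", 9999)
  .flatMap(_.toLowerCase.split("\\s+"))
  .filter(_.nonEmpty)
  .map((_, 1))
  .keyBy(_._1)
  .timeWindow(Time.seconds(5))
  .sum(1)

counts.print()
env.execute("windowed word count")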
Apache Flink
​Architecture
Partitioning
​Network Distribution
Demo & Code
Cassandra
Distributed NoSQL Database
Challenges with Relational Databases
• How do you scale and maintain high-availability with a
monolithic database?
• Is it possible to have ACID compliant distributed transactions?
• How can I synchronize a distributed data store?
• How do I resolve differing views of data?
Goals of a Distributed Database
• Consistency is not practical - give it up!
• Manual sharding & rebalancing is hard - Automatic
Sharding!
• Every moving part makes systems more complex
• Master / slave creates a Single Point of Failure / Bottleneck
- Simplify Architecture!
• Scaling up is expensive - Reduce Cost
• Leverage cloud / commodity hardware
What is Cassandra?
Distributed Database

✓ Individual DBs (nodes)
✓ Working in a cluster
✓ Nothing is shared

Cassandra Cluster

• Nodes in a peer-to-peer cluster
• No single point of failure
• Built-in data replication
• Data is always available
• 100% uptime
• Across data centers
• Failure avoidance

Multi-Data Center Design
Why Cassandra?
It has a flexible data model
Tables, wide rows, partitioned and distributed
✓ Data
✓ Blobs (documents, files, images)
✓ Collections (Sets, Lists, Maps)
✓ UDTs
Access it with CQL ← familiar syntax to SQL

Two knobs control Cassandra fault tolerance
​Replication Factor (server side)
​How many copies of the data should exist?
RF=3
[Diagram: a client writes A, and with RF=3 the write is stored on three replica nodes around the ring]
Two knobs control Cassandra fault tolerance
​Consistency Level (client side)

How many replicas do we need to hear from before we acknowledge?

[Diagram: the same write of A acknowledged at CL=ONE (one replica responds) versus CL=QUORUM (a majority of replicas respond)]
Consistency Levels
Applies to both reads and writes (i.e. it is set on each query)

ONE – one replica from any DC
LOCAL_ONE – one replica from the local DC
QUORUM – 51% of replicas from any DC
LOCAL_QUORUM – 51% of replicas from the local DC
ALL – all replicas
TWO – two replicas
Consistency Level and Speed

How many replicas we need to hear from affects how quickly we can read and write data in Cassandra.
[Diagram: a read of A at CL=QUORUM; nearby replicas acknowledge in roughly 5 µs and 12 µs while a remote replica takes about 300 µs]
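A hedged sketch of setting the consistency level per query from the client, assuming the DataStax Java driver 3.x; the keyspace, table, and column names are hypothetical:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("koober")

// LOCAL_QUORUM: a majority of replicas in the local DC must acknowledge the read
val stmt = new SimpleStatement("SELECT status FROM rides WHERE rider_id = ?", "rider-42")
  .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)

val row = session.execute(stmt).one()
println(row.getString("status"))

cluster.close()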
Consistency Level and Availability
Consistency Level choice affects availability.
For example, with RF=3, QUORUM can tolerate one replica being down and still be available.
[Diagram: a read of A at CL=QUORUM succeeds with two of the three replicas responding]
Reads in the cluster
As with writes in the cluster, reads are coordinated.
Any node can be the Coordinator Node.
[Diagram: a client reads A at CL=QUORUM through a coordinator node, which queries the replicas]
Spark Cassandra Connector
Spark Cassandra Connector

• Data locality-aware (speed)
• Read from and write to Cassandra
• Cassandra tables exposed as RDDs and DataFrames
• Server-side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Mapping of Java types to Cassandra types


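A minimal sketch of reading and writing Cassandra from Spark with the connector; the keyspace, tables, and columns ("koober", "rides", "ride_counts", etc.) are hypothetical:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("connector-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// read a Cassandra table as an RDD, pushing the filter down to the server
val completedRides = sc.cassandraTable("koober", "rides")
  .where("status = ?", "completed")

// aggregate and write the results back to another table
completedRides
  .map(row => (row.getString("city"), 1))
  .reduceByKey(_ + _)
  .saveToCassandra("koober", "ride_counts", SomeColumns("city", "total"))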
Code
Spark Streaming
Stream Processing Built on Spark
Hadoop?
Hadoop Limitations
• Master / Slave Architecture
• Every Processing Step requires Disk IO
• Difficult API and Programming Model
• Designed for batch-mode jobs
• No event-streaming / real-time processing
• Complex Ecosystem
What is Spark?
• Fast and general compute engine for large-scale data processing
• Fault Tolerant Distributed Datasets
• Distributed Transformations on Datasets
• Integrated Batch, Iterative and Streaming Analysis
• In-Memory Storage with Spill-over to Disk


Advantages of Spark
• Improves efficiency through:
  • In-memory data sharing
  • General computation graphs - lazy evaluation of data
  • 10x faster on disk, 100x faster in memory than Hadoop MapReduce
• Improves usability through:
  • Rich APIs in Java, Scala, and Python
  • 2 to 5x less code
  • Interactive shell
Spark Components
• Spark Master - a process which manages the resources of the Spark cluster; hosts the Spark Master UI on :7080
• Application (Spark Driver) - your application code, which creates the SparkContext; hosts the Application UI on :4040
• Workers - processes which shell out to create an Executor JVM
These processes are all separate and require networking to communicate.
Resilient Distributed Datasets (RDD)
• The primary abstraction in Spark
• Collection of data stored in the Spark Cluster
• Fault-tolerant
• Enables parallel processing on data sets
• In-Memory or On-Disk
RDD Operations
Transformations - similar to the Scala collections API; produce new RDDs:
filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract

Actions - require materialization of the records to generate a value:
collect: Array[T], count, fold, reduce, ...
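A minimal sketch of the lazy transformation / eager action distinction, assuming an existing SparkContext named sc:

// transformations are lazy: nothing is computed yet
val numbers = sc.parallelize(1 to 100)
val evensDoubled = numbers.filter(_ % 2 == 0).map(_ * 2)

// actions materialize records and trigger the computation
val total = evensDoubled.reduce(_ + _)   // 5100
val first = evensDoubled.take(5)         // Array(4, 8, 12, 16, 20)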
DataFrame
• Distributed collection of data
• Similar to a table in an RDBMS
• Common API for reading/writing data
• API for selecting, filtering, aggregating and plotting structured data
DataFrame Part 2
• Sources such as Cassandra, structured data files, tables in Hive, external databases, or existing RDDs
• Optimization and code generation through the Spark SQL Catalyst optimizer
• Decorator around RDD - previously SchemaRDD
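A hedged sketch of the DataFrame API against the Cassandra source (Spark 1.x era, assuming an existing SQLContext); the keyspace and table are hypothetical:

// load a Cassandra table as a DataFrame via the connector's data source
val rides = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "koober", "table" -> "rides"))
  .load()

// select, filter, and aggregate with the DataFrame API
rides
  .filter(rides("status") === "completed")
  .groupBy("city")
  .count()
  .show()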


Spark Versus Spark Streaming
Spark Streaming Data Sources
Spark Streaming General Architecture
DStream Micro Batches
Windowing
Streaming Resiliency without Kafka

• Streaming uses aggressive checkpointing and in-memory data replication to improve resiliency.
• Frequent checkpointing keeps RDD lineages down to a reasonable size.
• Checkpointing and replication are mandatory since streams don't have source data files to reconstruct lost RDD partitions (except for the directory ingest case).
• Write Ahead Logging to prevent data loss
Spark Streaming with the Kafka Direct API

• Uses the Kafka Direct approach (no receivers)
• Queries Kafka directly
• Automatically parallelizes based on Kafka partitions
• (Mostly) exactly-once processing - only moves the offset after processing
• Resiliency without copying data
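A minimal sketch of the receiver-less direct stream (Spark 1.x spark-streaming-kafka API); the batch interval, broker address, "rides" topic, and the existing SparkContext sc are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// offsets are tracked by the direct stream itself rather than by a receiver
val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("rides"))

// each RDD in the DStream maps 1:1 to the Kafka partitions of the topic
stream.map { case (_, value) => value }.count().print()

ssc.start()
ssc.awaitTermination()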
Demo & Code
