Making big data simple
with Databricks
We are Databricks, the company behind Spark
• Founded by the creators of Apache Spark in 2013
• 75% of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple
WORKING WITH BIG DATA IS DIFFICULT
“Through 2017, 60% of big-data
projects will fail to go beyond
piloting and experimentation
and will be abandoned.”
GARTNER
PROBLEM
Building infrastructure and data
pipelines is complex
Your difficult journey to finding value in data
Building a cluster → importing and exploring data with disparate tools → building and deploying data applications
Pipeline stages: ETL, Data Warehousing, Data Exploration, Advanced Analytics, Dashboards & Reports, Production Deployment
The result: long delays and high costs
3 main causes of this problem:
1. Infrastructure is complex to build and maintain
   • Expensive upfront investment
   • Months to build
   • Dedicated DevOps to operate
2. Tools are slow, clunky, and disparate
   • Not user-friendly
   • Long time to compute answers
   • Lots of integration required
3. Re-engineering of prototypes for deployment
   • Duplicated effort
   • Complexity to achieve production quality
SOLUTION
We built Databricks on top of Spark to make big data simple.
A complete solution, from ingest to production
Instant & secure infrastructure → fast and easy-to-use tools in a single platform → seamless transition to production
• Easy connection to diverse data sources
• Notebooks with one-click visualization
• Built-in ML and graph libraries (see the sketch after this list)
• Real-time query engine
• Customizable dashboards & 3rd-party apps
• Job scheduler
Pipeline stages: ETL, Data Warehousing, Data Exploration, Advanced Analytics, Dashboards & Reports, Production Deployment
The result: short time to value and lower costs
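For illustration only, a minimal sketch of how the built-in ML libraries might be used from a notebook, via Spark's MLlib in Scala; the data points and cluster count are invented for the example, and `sc` is the SparkContext a Databricks notebook provides.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Illustrative only: a handful of hand-made 2-D points.
// `sc` is the SparkContext provided by a Databricks notebook.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 8.5), Vectors.dense(8.8, 9.1)
)).cache()

// Cluster the points into two groups and inspect the resulting centers.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)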
Four components of Databricks
Make Big Data simple
Cluster Manager: managed Spark clusters
• Easily provision clusters
• Harness the power of Spark
• Import data seamlessly
Notebooks & Dashboards: interactive workspace with notebooks
• Explore data and develop code in Java, Python, Scala, or SQL (see the sketch below)
• Collaborate with the entire team
• Point-and-click visualization
• Publish customized dashboards
Jobs: production pipeline scheduler
• Schedule production workflows
• Implement complete pipelines
• Monitor progress and results
3rd-Party Apps: third-party applications
• Connect powerful BI tools
• Leverage a growing ecosystem of applications
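To make the notebook workflow above concrete, a short Scala sketch of the kind of interactive exploration a user might run; the dataset path and column names are hypothetical, and `sc` is the SparkContext a Databricks notebook provides.

import org.apache.spark.sql.SQLContext

// In a Databricks notebook, `sc` and `sqlContext` are already provided;
// constructing sqlContext explicitly is only needed outside the notebook.
val sqlContext = new SQLContext(sc)

// Load a JSON dataset into a DataFrame (path is hypothetical).
val events = sqlContext.read.json("/mnt/demo/events.json")

// Explore the schema and a few rows interactively.
events.printSchema()
events.show(5)

// Summarize with the DataFrame API, then query the same data with SQL.
events.groupBy("country").count().show()
events.registerTempTable("events")
sqlContext.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country ORDER BY n DESC").show()

// In Databricks, display(events) renders the result with one-click charting.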
Databricks benefits
Higher productivity
• Maintenance-free infrastructure
• Real-time processing
• Easy-to-use tools
Faster deployment of data pipelines
• Zero-management Spark clusters
• Instant transition from prototype to production
Data democratization
• One shared repository
• Seamless collaboration
• Easy to build sophisticated dashboards and notebooks
A few examples of Databricks in action
Prepare data (see the sketch below)
• Import data using APIs or connectors
• Cleanse malformed data
• Aggregate data to create a data warehouse
Perform analytics
• Explore large data sets in real time
• Find hidden patterns with regression analysis
• Publish customized dashboards
Build data products
• Rapid prototyping
• Implement advanced analytics algorithms
• Create and monitor robust production pipelines
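A minimal data-preparation sketch in the spirit of the "Prepare data" column above, written against the Spark DataFrame API in Scala; the paths and column names are invented, and `sqlContext` is assumed to be provided by the notebook.

import org.apache.spark.sql.functions._

// Import raw data (hypothetical path) using the notebook-provided sqlContext.
val raw = sqlContext.read.json("/mnt/demo/raw_orders.json")

// Cleanse malformed records: drop rows missing required fields
// and discard non-positive amounts.
val cleaned = raw
  .na.drop(Seq("order_id", "amount"))
  .filter(col("amount") > 0)

// Aggregate into a summary table that can back a warehouse or dashboard.
val daily = cleaned
  .groupBy("country", "order_date")
  .agg(sum("amount").as("revenue"), count("order_id").as("orders"))

daily.write.mode("overwrite").parquet("/mnt/demo/warehouse/daily_orders")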
CUSTOMER CASE STUDIES
Customer testimonials
“Without Databricks and the real-time insights from Spark, we wouldn't be
able to maintain our database at the pace needed for our customers”
Darian Shirazi, CEO, Radius Intelligence
“We condensed the 6 months we had planned for the initial prototype to
production process to just about a couple of weeks with Databricks.”
Rob Ferguson, Director of Engineering, Automatic Labs
“Databricks is used by over a third of our staff; After implementation, the
amount of analysis performed has increased sixfold, meaning more
questions are being asked, more hypotheses tested.”
Jaka Jančar, CTO, Celtra
Radius Intelligence
Gathering customer insights for marketers
CHALLENGE: Complex data integration
• 25 million businesses
• Over 100 billion points of data
RESULT: Speed up the data pipeline
• Entire data set processed in hours instead of days
• Deploy weekly updates to customers instead of monthly
BENEFIT: Higher productivity
Automatic Labs
IoT for drivers – making car sensor data useful
CHALLENGE: Product idea validation
• Ingest billions of data points (location-based data, driving habits)
• Rapidly test hypotheses
• Iterate on ideas in real time
RESULT: Shorter time from idea to product
• 3 weeks with Databricks vs. 2 months with the previous solution
BENEFIT: Faster deployment of data pipelines
Celtra
Building and serving rich digital ads across platforms
CHALLENGE: Analytics specialist bottleneck
• Billions of data points, operational data of
the entire company
• Huge backlog of analytics projects
RESULT: Enable self-service for non-specialists
• Grew the number of analysts by 4x
• Increased analytics projects completed by 6x in four months
BENEFIT: Data democratization
Sharethrough
Intelligent ad placement
CHALLENGE: Slow performance, costly DevOps
• Terabyte-scale clickstream data; long delays in new-feature prototyping
• Two full-time engineers to maintain infrastructure
RESULT: Faster answers, zero management
• Prototyped a new feature in record time
• Reduced system downtime with faster root-cause analysis
• Dramatically easier to maintain than Hive
BENEFIT: Higher productivity
Yesware
Delivering tools and analytics for sales teams
CHALLENGE: Infrastructure pain, slow & clunky tools
• 6 months to set up Pig; a very problematic pipeline
• Too slow to extend reporting history beyond 1 month
• Need to develop machine-learning algorithms
RESULT: Instant infrastructure, full suite of capabilities
• 3 weeks to set up a robust Spark pipeline with Databricks
• Doubled the data processed, in a fraction of the time
• Built-in machine learning libraries
BENEFIT: Faster deployment of data pipelines
A few of our customers
WHAT’S NEW?
What’s new with Databricks
• Databricks is now generally available (announced on June 15th, 2015)
• Upcoming features during second half of 2015:
• R-language notebooks: Analyze large-scale data sets using R in the
Databricks environment.
• Access control and private notebooks: Manage permissions to view
and execute code at an individual level.
• Version control: Track changes to source code in the Databricks
platform.
• Spark Streaming support: enabling fault-tolerant real-time processing.
What’s new with Spark
• The general availability of Spark 1.4 was announced on June 10th, 2015.
• Spark 1.4 is the largest Spark release to date: more than 220 contributors and 1,200 commits.
• Key new features introduced in Spark 1.4:
• New R language API (SparkR)
• Expansion of Spark's DataFrame API: window functions (see the sketch below), statistical and mathematical functions, and support for missing data.
• API to build complete machine learning pipelines.
• UI visualizations for debugging and monitoring programs: interactive event
timeline for jobs, DAG visualization, visual monitoring for Spark Streaming.
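To make the window-functions bullet concrete, a minimal Scala sketch against the Spark 1.4 DataFrame API; the table and column names are hypothetical, and `sqlContext` is assumed to be provided by the notebook.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes a registered "purchases" table (hypothetical).
val purchases = sqlContext.table("purchases")

// Rank each customer's purchases by amount within that customer's partition.
val byCustomer = Window.partitionBy("customer_id").orderBy(desc("amount"))

purchases
  .withColumn("rank", rowNumber().over(byCustomer))   // rowNumber() is the Spark 1.4-era name (later row_number)
  .filter(col("rank") <= 3)                            // keep the top three purchases per customer
  .show()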
Data science made easy with Apache Spark
From ingest to production
✓ Unified
✓ Zero management
✓ Fast at any scale
✓ Real-time and collaborative
✓ Flexible
✓ Instant to production
✓ No lock-in
✓ Open and extensible
Benefits: higher productivity, faster deployment of data pipelines, data democratization
Databricks is available today
Contact sales@databricks.com
Or sign up for a trial at
https://databricks.com/registration
Thank you