Making big data simple
with Databricks
We are Databricks, the company behind Spark
• Founded by the creators of Apache Spark in 2013
• 75% of Spark code contributed by Databricks in 2014
• Created Databricks on top of Spark to make big data simple
WORKING WITH BIG DATA IS DIFFICULT
“Through 2017, 60% of big-data
projects will fail to go beyond
piloting and experimentation
and will be abandoned.”
GARTNER
PROBLEM
Building infrastructure and data
pipelines is complex
Your difficult journey to finding value in data
Building a cluster → importing and exploring data with disparate tools → building and deploying data applications
Pipeline stages: ETL, Data Warehousing, Data Exploration, Advanced Analytics, Dashboards & Reports, Production Deployment
The result: long delays and high costs
3 main causes of this problem:
1. Infrastructure is complex to build and maintain
   • Expensive upfront investment
   • Months to build
   • Dedicated DevOps to operate
2. Tools are slow, clunky, and disparate
   • Not user-friendly
   • Long time to compute answers
   • Lots of integration required
3. Re-engineering of prototypes for deployment
   • Duplicated effort
   • Complexity to achieve production quality
SOLUTION
We built Databricks on top of Spark to make big data simple.
A complete solution, from ingest to production
Instant & secure infrastructure → fast and easy-to-use tools in a single platform → seamless transition to production
• Easy connection to diverse data sources
• Notebooks with one-click visualization
• Built-in ML and graph libraries (see the sketch after this list)
• Real-time query engine
• Customizable dashboards & 3rd-party apps
• Job scheduler
Pipeline stages: ETL, Data Warehousing, Data Exploration, Advanced Analytics, Dashboards & Reports, Production Deployment
The result: short time to value and lower costs
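For illustration only, a minimal sketch of how the built-in ML libraries might be used from a notebook, via Spark's MLlib in Scala; the data points and cluster count are invented for the example, and `sc` is the SparkContext a Databricks notebook provides.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Illustrative only: a handful of hand-made 2-D points.
// `sc` is the SparkContext provided by a Databricks notebook.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 8.5), Vectors.dense(8.8, 9.1)
)).cache()

// Cluster the points into two groups and inspect the resulting centers.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)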
Four components of Databricks
Make Big Data simple
Cluster Manager: managed Spark clusters
• Easily provision clusters
• Harness the power of Spark
• Import data seamlessly
Notebooks & Dashboards: interactive workspace with notebooks
• Explore data and develop code in Java, Python, Scala, or SQL (see the sketch below)
• Collaborate with the entire team
• Point-and-click visualization
• Publish customized dashboards
Jobs: production pipeline scheduler
• Schedule production workflows
• Implement complete pipelines
• Monitor progress and results
3rd-Party Apps: third-party applications
• Connect powerful BI tools
• Leverage a growing ecosystem of applications
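To make the notebook workflow above concrete, a short Scala sketch of the kind of interactive exploration a user might run; the dataset path and column names are hypothetical, and `sc` is the SparkContext a Databricks notebook provides.

import org.apache.spark.sql.SQLContext

// In a Databricks notebook, `sc` and `sqlContext` are already provided;
// constructing sqlContext explicitly is only needed outside the notebook.
val sqlContext = new SQLContext(sc)

// Load a JSON dataset into a DataFrame (path is hypothetical).
val events = sqlContext.read.json("/mnt/demo/events.json")

// Explore the schema and a few rows interactively.
events.printSchema()
events.show(5)

// Summarize with the DataFrame API, then query the same data with SQL.
events.groupBy("country").count().show()
events.registerTempTable("events")
sqlContext.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country ORDER BY n DESC").show()

// In Databricks, display(events) renders the result with one-click charting.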
Databricks benefits
Higher productivity
• Maintenance-free infrastructure
• Real-time processing
• Easy-to-use tools
Faster deployment of data pipelines
• Zero-management Spark clusters
• Instant transition from prototype to production
Data democratization
• One shared repository
• Seamless collaboration
• Easy to build sophisticated dashboards and notebooks
A few examples of Databricks in action
Prepare data (see the sketch below)
• Import data using APIs or connectors
• Cleanse malformed data
• Aggregate data to create a data warehouse
Perform analytics
• Explore large data sets in real time
• Find hidden patterns with regression analysis
• Publish customized dashboards
Build data products
• Rapid prototyping
• Implement advanced analytics algorithms
• Create and monitor robust production pipelines
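A minimal data-preparation sketch in the spirit of the "Prepare data" column above, written against the Spark DataFrame API in Scala; the paths and column names are invented, and `sqlContext` is assumed to be provided by the notebook.

import org.apache.spark.sql.functions._

// Import raw data (hypothetical path) using the notebook-provided sqlContext.
val raw = sqlContext.read.json("/mnt/demo/raw_orders.json")

// Cleanse malformed records: drop rows missing required fields
// and discard non-positive amounts.
val cleaned = raw
  .na.drop(Seq("order_id", "amount"))
  .filter(col("amount") > 0)

// Aggregate into a summary table that can back a warehouse or dashboard.
val daily = cleaned
  .groupBy("country", "order_date")
  .agg(sum("amount").as("revenue"), count("order_id").as("orders"))

daily.write.mode("overwrite").parquet("/mnt/demo/warehouse/daily_orders")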
CUSTOMER CASE STUDIES
Customer testimonials
“Without Databricks and the real-time insights from Spark, we wouldn't be
able to maintain our database at the pace needed for our customers”
Darian Shirazi, CEO, Radius Intelligence
“We condensed the 6 months we had planned for the initial prototype to
production process to just about a couple of weeks with Databricks.”
Rob Ferguson, Director of Engineering, Automatic Labs
“Databricks is used by over a third of our staff; After implementation, the
amount of analysis performed has increased sixfold, meaning more
questions are being asked, more hypotheses tested.”
Jaka Jančar, CTO, Celtra
Radius Intelligence
Gathering customer insights for marketers
CHALLENGE: Complex data integration
• 25 million businesses
• Over 100 billion points of data
RESULT: Speed up the data pipeline
• Entire data set processed in hours instead of days
• Deploy weekly updates to customers instead of monthly
BENEFIT: Higher productivity
Automatic Labs
IoT for drivers – making car sensor data useful
CHALLENGE: Product idea validation
• Ingest billions of data points (location-based data, driving habits)
• Rapidly test hypotheses
• Iterate on ideas in real time
RESULT: Shorter time from idea to product
• 3 weeks with Databricks vs. 2 months with the previous solution
BENEFIT: Faster deployment of data pipelines
Celtra
Building and serving rich digital ads across platforms
CHALLENGE: Analytics specialist bottleneck
• Billions of data points, operational data of
the entire company
• Huge backlog of analytics projects
RESULT: Enable self-service for non-specialists
• Grew the number of analysts by 4x
• Increased analytics projects completed by 6x in four months
BENEFIT: Data democratization
Sharethrough
Intelligent ad placement
CHALLENGE: Slow performance, costly DevOps
• Terabyte-scale clickstream data; long delays in new-feature prototyping
• Two full-time engineers to maintain infrastructure
RESULT: Faster answers, zero management
• Prototyped a new feature in record time
• Reduced system downtime with faster root-cause analysis
• Dramatically easier to maintain than Hive
BENEFIT: Higher productivity
Yesware
Delivering tools and analytics for sales teams
CHALLENGE: Infrastructure pain, slow & clunky tools
• 6 months to set up Pig; a very problematic pipeline
• Too slow to extend reporting history beyond 1 month
• Need to develop machine-learning algorithms
RESULT: Instant infrastructure, full suite of capabilities
• 3 weeks to set up a robust Spark pipeline with Databricks
• Doubled the data processed, in a fraction of the time
• Built-in machine learning libraries
BENEFIT: Faster deployment of data pipelines
A few of our customers
WHAT’S NEW?
What’s new with Databricks
• Databricks is now generally available (announced on June 15th, 2015)
• Upcoming features during second half of 2015:
• R-language notebooks: Analyze large-scale data sets using R in the
Databricks environment.
• Access control and private notebooks: Manage permissions to view
and execute code at an individual level.
• Version control: Track changes to source code in the Databricks
platform.
• Spark Streaming support: enabling fault-tolerant real-time processing.
What’s new with Spark
• The general availability of Spark 1.4 was announced on June 10th, 2015.
• Spark 1.4 is the largest Spark release to date: more than 220 contributors and 1,200 commits.
• Key new features introduced in Spark 1.4:
• New R language API (SparkR)
• Expansion of Spark's DataFrame API: window functions (see the sketch below), statistical and mathematical functions, and support for missing data.
• API to build complete machine learning pipelines.
• UI visualizations for debugging and monitoring programs: interactive event
timeline for jobs, DAG visualization, visual monitoring for Spark Streaming.
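To make the window-functions bullet concrete, a minimal Scala sketch against the Spark 1.4 DataFrame API; the table and column names are hypothetical, and `sqlContext` is assumed to be provided by the notebook.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumes a registered "purchases" table (hypothetical).
val purchases = sqlContext.table("purchases")

// Rank each customer's purchases by amount within that customer's partition.
val byCustomer = Window.partitionBy("customer_id").orderBy(desc("amount"))

purchases
  .withColumn("rank", rowNumber().over(byCustomer))   // rowNumber() is the Spark 1.4-era name (later row_number)
  .filter(col("rank") <= 3)                            // keep the top three purchases per customer
  .show()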
Data science made easy with Apache Spark
From ingest to production
✓ Unified
✓ Zero management
✓ Fast at any scale
✓ Real-time and collaborative
✓ Flexible
✓ Instant to production
✓ No lock-in
✓ Open and extensible
Benefits: higher productivity, faster deployment of data pipelines, data democratization
Databricks is available today
Contact sales@databricks.com
Or sign up for a trial at
https://databricks.com/registration
Thank you