Tackling Big Data Using MATLAB
Alka Nair
Application Engineer
© 2015 The MathWorks, Inc.
1
Building Machine Learning Models with Big Data
▪ Access
▪ Preprocess, Exploration & Model Development
▪ Scale up & Integrate with Production Systems
2
Case study: Predict Air Quality
My Weather Page (www.myweather.com/stats.html)
Factors Affecting Air Quality
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2
3
4
Building Machine Learning Models with Big Data
▪ Access
▪ Preprocess, Exploration & Model Development
▪ Scale up & Integrate with Production Systems
5
Challenges in Modeling and Deploying Big Data Applications
Access
▪ Distributed Data Storage
▪ Different Data Sources & Types
Managing Different APIs for Data Sources and Data Formats
Preprocess, Exploration & Model Development
▪ Preprocessing and Visualizing Big Data
▪ Parallelizing Jobs and Scaling up Computations to Cluster
▪ Rewriting Algorithms to Use Big Data Platforms
▪ Parallelizing Code to Scale up to Use Cluster and Cloud Compute
Scale up & Integrate with Production Systems
▪ Enterprise level deployment
Overhead in Moving the Algorithm to Production
6
Wouldn’t it be nice if you could:
▪ Easily access data however it is stored
▪ Prototype algorithms quickly using small data sets
▪ Scale up to big data sets running on large clusters
▪ Use the same intuitive MATLAB syntax you are used to
7
Building machine learning models with big data
▪ Access
▪ Preprocess, Exploration & Model Development
▪ Scale up & Integrate with Production Systems
8
Access and Manage Big Data
Different Data Types
▪ Text
▪ Images
▪ Spreadsheet
▪ Custom File Formats
Different Data Sources
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob Storage
• Relational Database
• HDFS on Hortonworks or Cloudera
Different Applications
• MapReduce
• Image Segmentation
• Image Classification
• Denoising Images
• Predictive Maintenance
Datastores
9
Datastore
[Diagram: a datastore points at one or more files and feeds them to a process that uses Single Machine Memory or Cluster of Machines Memory]
10
Air Quality Data on Local Folder
11
Accessing and Processing different types of data
▪ TabularTextDatastore: Text files containing column-oriented data, including CSV files
▪ ImageDatastore: Image files (image collection), including formats that are supported by imread such as JPEG and PNG
▪ SpreadsheetDatastore: Spreadsheet files with a supported Excel® format such as .xlsx
▪ MDFDatastore: Datastore for collection of MDF files
▪ Custom Datastore: Datastore for custom or proprietary format
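A minimal sketch of the first datastore type in the list above. It writes a tiny CSV file to a temporary folder and opens it with `tabularTextDatastore`. The file name and the variable names (`Temperature`, `Ozone`) are made up for illustration; `imageDatastore` and `spreadsheetDatastore` are created in a similar way for their file types.

```matlab
% Hypothetical sample data, written to a temporary folder for illustration.
folder = tempname;
mkdir(folder);
T = table([21.5; 23.1; 19.8], [0.031; 0.044; 0.027], ...
    'VariableNames', {'Temperature', 'Ozone'});
writetable(T, fullfile(folder, 'airquality_sample.csv'));

% A datastore points at the files; nothing is loaded until you read.
ds = tabularTextDatastore(folder);

% preview returns the first few rows (at most 8) as an ordinary table.
firstRows = preview(ds);
disp(firstRows)
```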
12
You have 1 TB of data you’ve never seen before. How do you access this data?
13
Historical files are on HDFS and real time data are available through an API
• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind Speed
• Wind Direction
• Ozone
• CO
• NO2
• SO2
14
Access air quality data using datastore
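One way this step might look, assuming the air quality readings are stored as several CSV files in one folder. The folder, the file names, and the `Ozone` column are hypothetical stand-ins for the data shown in the talk.

```matlab
% Create two small hypothetical CSV files standing in for the real data.
folder = tempname;
mkdir(folder);
writetable(table([0.031; 0.044], 'VariableNames', {'Ozone'}), ...
    fullfile(folder, 'site1.csv'));
writetable(table([0.027; 0.052; 0.038], 'VariableNames', {'Ozone'}), ...
    fullfile(folder, 'site2.csv'));

% One datastore covers every file that matches the wildcard.
ds = datastore(fullfile(folder, '*.csv'));

% Read the collection one chunk at a time instead of all at once.
rowCount = 0;
while hasdata(ds)
    chunk = read(ds);
    rowCount = rowCount + height(chunk);
end
fprintf('Read %d rows from %d files\n', rowCount, numel(ds.Files));
```

Because each call to `read` returns only one chunk, the full collection never has to fit in memory at once.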
15
Preview the data and adjust properties to best represent the data of interest
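A sketch of previewing and then adjusting datastore properties. It assumes a CSV file in which missing readings are written as `NA`; the file contents and column names are invented for illustration.

```matlab
% Hypothetical file with 'NA' marking missing readings.
folder = tempname;
mkdir(folder);
file = fullfile(folder, 'readings.csv');
fid = fopen(file, 'w');
fprintf(fid, 'Temperature,Pressure,Ozone\n');
fprintf(fid, '21.5,1012,0.031\n23.1,NA,0.044\n19.8,1009,NA\n');
fclose(fid);

% Tell the datastore that 'NA' means missing, then look at the first rows.
ds = tabularTextDatastore(file, 'TreatAsMissing', 'NA');
disp(preview(ds))

% Keep only the columns of interest and read them as floating-point numbers.
ds.SelectedVariableNames = {'Temperature', 'Ozone'};
ds.SelectedFormats = {'%f', '%f'};

% readall is fine here because the sample is tiny; missing values become NaN.
data = readall(ds);
```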
16
Access data from anywhere with minimal changes
Local disk
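A hedged sketch of what "minimal changes" can mean in practice: only the location string differs between environments. The local file is a stand-in created so the example runs; the HDFS host and S3 bucket names are hypothetical, and remote access needs credentials and configuration that are not shown here.

```matlab
% Stand-in local data so the sketch runs end to end (hypothetical file name).
folder = tempname;
mkdir(folder);
writetable(table([0.031; 0.044], 'VariableNames', {'Ozone'}), ...
    fullfile(folder, 'day1.csv'));

% Only the location changes between environments.
location = fullfile(folder, '*.csv');                       % local disk
% location = 'hdfs://myserver:8020/data/airquality/*.csv';  % HDFS (hypothetical)
% location = 's3://mybucket/airquality/*.csv';              % Amazon S3 (hypothetical)

% The rest of the code stays the same regardless of where the files live.
ds = datastore(location);
tt = tall(ds);
meanOzone = gather(mean(tt.Ozone));
```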
17
Datastores enable big data workflows
Deep Learning
18
Datastores enable big data workflows
Predictive
Maintenance
19
Datastores enable big data workflows
Fleet
Analytics
20
Datastores: Access Big Data with Minimal Changes
✓ Different Data Types
▪ Text
▪ Images
▪ Spreadsheet
▪ Custom File Formats
✓ Different Data Sources
• Hadoop Distributed File System (HDFS)
• Amazon S3
• Windows Azure Blob Storage
• Relational Database
• HDFS on Hortonworks or Cloudera
✓ Different Applications
• MapReduce
• Image Segmentation
• Image Classification
• Denoising Images
• Predictive Maintenance
21
Building machine learning models with big data
▪ Access
▪ Preprocess, Exploration & Model Development
▪ Scale up & Integrate with Production Systems
22
You have 1 TB of data you’ve never seen before. How do you visualize and process the data?
23
Use tall arrays to work with the data like any MATLAB array
24
▪ Introduction to Tall Arrays
▪ Tall Arrays for Big Data Visualization and Preprocessing
▪ Machine Learning for Big Data Using Tall Arrays
25
Tall arrays
▪ Data is in one or more files
▪ Files stacked vertically
▪ Typically tabular data
Challenge
▪ Data doesn’t fit into memory (even cluster memory)
▪ Takes a lot of time for even simple operations on data
[Diagram labels: Single Machine Memory, Cluster of Machines Memory]
26
Tall arrays (new R2016b)
▪ Create tall table from datastore
ds = datastore('*.csv')
tt = tall(ds)
▪ Operate on whole tall table just like ordinary table
summary(tt)
max(tt.EndTime - tt.StartTime)
[Diagram: Datastore → tall array → Process → Single Machine Memory or Cluster of Machines Memory]
27
tall arrays
▪ With Parallel Computing Toolbox, process several “chunks” at once
▪ Can scale up to clusters with MATLAB Distributed Computing Server
[Diagram: the tall array is split across several processes, each with its own Single Machine Memory, drawing on Cluster of Machines Memory]
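A rough sketch of switching tall array execution to a parallel pool. It assumes Parallel Computing Toolbox may or may not be installed, so the check falls back to the local session; the data is a synthetic column of numbers.

```matlab
% Use a parallel pool when Parallel Computing Toolbox is available.
hasPCT = license('test', 'Distrib_Computing_Toolbox') && ~isempty(which('parpool'));
if hasPCT
    pool = gcp;          % start (or reuse) the default parallel pool
    mapreducer(pool);    % tall calculations now run on the pool workers
else
    mapreducer(0);       % fall back to the local MATLAB session
end

% The tall array code itself does not change with the execution environment.
t = tall((1:1000)');
total = gather(sum(t));
```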
28
Use a Spark-enabled Hadoop cluster and MATLAB
Support for many other platforms through reference architectures
29
It’s easy to run MATLAB code on Spark + Hadoop
Spark Connection
Cluster Config for Spark
Hadoop Access
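The three pieces named on this slide could be wired together roughly as follows. This is a configuration sketch, not a tested setup: the install paths, the executor count, and the HDFS location are placeholders, and it needs a Spark-enabled Hadoop cluster with the MathWorks cluster products installed.

```matlab
% Hadoop access: hypothetical install locations, replace with your own.
setenv('HADOOP_HOME', '/path/to/hadoop/install');
setenv('SPARK_HOME', '/path/to/spark/install');

% Cluster config for Spark: describe the cluster and set Spark properties.
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.executor.instances') = '16';  % assumed value

% Spark connection: make the cluster the execution environment for tall arrays.
mapreducer(cluster);

% Hypothetical HDFS location for the historical files.
ds = datastore('hdfs:///data/airquality/*.csv');
tt = tall(ds);
```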
30
MATLAB Documentation for
31
Summary for tall arrays
▪ Local disk, Shared folders, Databases
▪ Process out-of-memory data on your Desktop to explore, analyze, gain insights and to develop analytics
▪ Use Parallel Computing Toolbox for increased performance
▪ Run on Compute Clusters or Spark + Hadoop (HDFS), for large scale analysis (MATLAB Distributed Computing Server, Spark+Hadoop)
▪ Develop your code locally using Tall Arrays or MapReduce only once
▪ Use the same code to scale up to cluster
32
Create a tall array for each datastore
ozone
33
Execution model makes operations more efficient on big data
tt : tall array
▪ Deferred evaluation
– Commands are not executed right
away
– Operations are added to a queue
▪ Execution triggers include:
– gather function
– summary function
– Machine learning models
– Plotting
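Deferred evaluation can be seen on a small synthetic example. The sketch below builds a tall array from in-memory random numbers (an assumption made so that it runs anywhere); the operations are only queued until `gather` is called.

```matlab
% Synthetic data standing in for a large data set.
t = tall(randn(10000, 1));

% These lines only queue operations; no pass over the data happens yet.
centered = t - mean(t);
spread = max(abs(centered));

% The result is still an unevaluated tall array at this point.
isDeferred = istall(spread);

% gather triggers execution and returns an ordinary in-memory value.
value = gather(spread);
```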
34
Execution model makes operations more efficient on big data
Unnecessary results are not
computed
35
✓ Introduction to Tall Arrays
▪ Tall Arrays for Big Data Visualization and Preprocessing
▪ Machine Learning for Big Data Using Tall Arrays
36
Explore Big Data with Tall Visualizations
plot
scatter
binscatter
histogram
histogram2
ksdensity
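As an illustration, two of the listed functions applied to a tall table. The data is synthetic and the variable names `Temperature` and `Ozone` are chosen to match the case study, not taken from the real files. `binscatter` assumes a release that includes it (R2017b or later).

```matlab
% Synthetic stand-in for the air quality table.
rng(1);
n = 20000;
T = table(20 + 5*randn(n, 1), 0.03 + 0.01*randn(n, 1), ...
    'VariableNames', {'Temperature', 'Ozone'});
tt = tall(T);

% Plotting a tall array triggers evaluation of the queued operations.
figure('Visible', 'off');
h = histogram(tt.Ozone);
title('Ozone distribution (synthetic data)');

% binscatter summarizes a scatter plot by counting points per bin.
figure('Visible', 'off');
b = binscatter(tt.Temperature, tt.Ozone);
xlabel('Temperature');
ylabel('Ozone');
```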
37
Get a summary of the data
tt – tall table
39
Use data types to best represent the data
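A small sketch of what choosing better data types might involve: text labels become `categorical` and date strings become `datetime`. The table contents and column names are hypothetical.

```matlab
% Hypothetical raw data as it might arrive from text files.
T = table({'Boston'; 'Natick'; 'Boston'}, ...
    {'2015-01-01'; '2015-01-01'; '2015-01-02'}, ...
    [0.031; 0.044; 0.027], ...
    'VariableNames', {'Location', 'Date', 'Ozone'});
tt = tall(T);

% Repeated text labels are more compact and easier to group as categorical.
tt.Location = categorical(tt.Location);

% Date strings become datetime values that support time-based operations.
tt.Date = datetime(tt.Date, 'InputFormat', 'yyyy-MM-dd');

result = gather(tt);
```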
40
Managing Big and Messy Time-stamped Data
41
Use the results of explorations to help make decisions
- Synchronize to daily data
- By location
42
Synchronize all data to daily times
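The idea can be sketched with `synchronize` on two small in-memory timetables sampled at different rates. The values are invented, and the talk applies the step to tall timetables, which may carry extra restrictions that are not covered here.

```matlab
% Ozone sampled every 6 hours, temperature every 12 hours (made-up values).
t1 = datetime(2015, 1, 1) + hours(0:6:42)';
ozone = timetable((1:8)', 'RowTimes', t1, 'VariableNames', {'Ozone'});
t2 = datetime(2015, 1, 1) + hours(0:12:36)';
temp = timetable([10; 20; 30; 40], 'RowTimes', t2, ...
    'VariableNames', {'Temperature'});

% Bring both onto a common daily time base, averaging within each day.
daily = synchronize(ozone, temp, 'daily', 'mean');
disp(daily)
```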
43
Clean messy data using common preprocessing functions
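A minimal cleaning sketch, assuming that missing readings appear either as `NaN` or as a sentinel value of -999. The sentinel and the data are hypothetical.

```matlab
% Made-up readings with a sentinel (-999) and NaN values.
T = table([21.5; -999; 19.8; NaN], [0.031; 0.044; NaN; 0.027], ...
    'VariableNames', {'Temperature', 'Ozone'});
tt = tall(T);

% Convert the sentinel to a standard missing value.
tt = standardizeMissing(tt, -999);

% Drop every row that contains a missing value.
tt = rmmissing(tt);

clean = gather(tt);
```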
44
Use familiar MATLAB functions on tall arrays
Functions Supported with Tall Arrays
45
You don’t need to leave MATLAB to monitor large jobs
46
Save preprocessed data
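One possible way to save and reload a tall table uses `write` and a datastore on the output folder. The output location here is a temporary folder chosen for illustration.

```matlab
% Small stand-in for the preprocessed tall table.
tt = tall(table((1:5)', 'VariableNames', {'Ozone'}));

% write evaluates the tall table and saves it as files in the folder.
location = tempname;
write(location, tt);

% A datastore on the same folder gives the data back as a tall table.
ds = datastore(location);
restored = gather(tall(ds));
```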
47
✓ Introduction to Tall Arrays
✓ Tall Arrays for Big Data Visualization and Preprocessing
▪ Machine Learning for Big Data Using Tall Arrays
48
Predict air quality
▪ Air Quality Index: Regression
▪ Air Quality Label: Classification
49
How do you know which model to use?
▪ Try them all ☺
50
Use apps for model exploration on a subset of data
▪ Air Quality Index: Regression Learner
▪ Air Quality Label: Classification Learner
51
Validate and Compare Machine Learning Models
52
Scale up with tall machine learning models
▪ Linear Regression (fitlm)
▪ Logistic & Generalized Linear Regression (fitglm)
▪ Discriminant Analysis Classification (fitcdiscr)
▪ K-means Clustering (kmeans)
▪ Principal Component Analysis (pca)
▪ Partition for Cross Validation (cvpartition)
▪ Linear Support Vector Machine (SVM) Classification (fitclinear)
▪ Naïve Bayes Classification (fitcnb)
▪ Random Forest Ensemble Classification (TreeBagger)
▪ Lasso Linear Regression (lasso)
▪ Linear Support Vector Machine (SVM) Regression (fitrlinear)
▪ Single Classification Decision Tree (fitctree)
▪ Linear SVM Classification with Random Kernel Expansion (fitckernel)
▪ Gaussian Kernel Regression (fitrkernel)
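As an example of the first entry in the list above, `fitlm` can be called on a tall table with the same syntax as on an in-memory table. The data below is synthetic, with a known linear relationship, and the predictor names are assumptions. Statistics and Machine Learning Toolbox is required.

```matlab
% Synthetic data: AQI depends linearly on two predictors plus small noise.
rng(0);
n = 5000;
x1 = randn(n, 1);
x2 = randn(n, 1);
aqi = 50 + 3*x1 - 2*x2 + 0.1*randn(n, 1);
tt = tall(table(x1, x2, aqi, ...
    'VariableNames', {'Temperature', 'WindSpeed', 'AQI'}));

% Fitting a model is one of the operations that triggers evaluation.
mdl = fitlm(tt, 'AQI ~ Temperature + WindSpeed');

% Estimated intercept and slopes, in the order they appear in the formula.
coeffs = mdl.Coefficients.Estimate;
disp(coeffs')
```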
56
Training Machine Learning Model against Spark for Air Quality Classification
57
Train and validate with tall data for Air Quality Index Prediction
58
Select the most important features
59
✓ Introduction to Tall Arrays
✓ Tall Arrays for Big Data Visualization and Preprocessing
✓ Machine Learning for Big Data Using Tall Arrays
61
Building machine learning models with big data
▪ Access
▪ Preprocess, Exploration & Model Development
▪ Scale up & Integrate with Production Systems
62
63
Predict air quality for given location
Current Weather
My Weather Page (www.myweather.com/stats.html)
Your Weather Conditions
Get weather conditions for your area.
Location: 01760
Temperature: 32F
Humidity: 76%
Wind: SSW 13 mph
MATLAB Runtime
Use MATLAB model running on Spark in Python web framework
64
Integrate analytics with systems
▪ Embedded Hardware: C, C++, HDL, PLC, GPU
▪ Enterprise Systems: Standalone Application, Excel Add-in, Hadoop/Spark, C/C++, Java, Python, .NET, MATLAB Production Server
▪ MATLAB Runtime
65
Package and test MATLAB code
66
Call MATLAB in production environment
AirQual.ctf
69
MATLAB Production Server
▪ Server software
– Manages packaged MATLAB programs and worker pool
▪ MATLAB Runtime libraries
– Single server can use runtimes from different releases
▪ RESTful JSON interface
▪ Lightweight client libraries
– C/C++, .NET, Python, and Java
[Diagram: an Enterprise Application (through the MPS Client Library) and Applications/Database Servers (through RESTful JSON) call MATLAB Production Server, where the Request Broker & Program Manager dispatches work to the MATLAB Runtime]
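A sketch of what a call through the RESTful JSON interface might look like from MATLAB. The server address, port, and function name `predictAirQuality` are hypothetical; `AirQual` is the archive name shown earlier in the deck. The request body uses the `nargout` and `rhs` fields of the MATLAB Production Server JSON format, and the actual call is commented out because it needs a running server.

```matlab
% Hypothetical endpoint: http://<host>:<port>/<archive>/<function>
url = 'http://localhost:9910/AirQual/predictAirQuality';

% One output requested; inputs (for example temperature and humidity) go in rhs.
request = struct('nargout', 1, 'rhs', {{32, 76}});
body = jsonencode(request);
disp(body)

% Sending the request requires a running server, so it is left commented out.
% options = weboptions('MediaType', 'application/json');
% response = webwrite(url, request, options);   % outputs arrive in response.lhs
```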
70
MATLAB for Modeling and Deploying Big Data Applications
Access
▪ Distributed Data Storage
▪ Different Data Sources & Types
Easily Access Data however/wherever it is stored using Datastore
Preprocess, Exploration & Model Development
▪ Preprocessing and Visualizing Big Data
▪ Parallelizing Jobs and Scaling up Computations to Cluster
Prototype and easily scale up algorithms to Big Data platforms using the familiar MATLAB Syntax with Tall Arrays
Scale up & Integrate with Production Systems
▪ Enterprise level deployment
Seamless integration with Enterprise level systems using MATLAB Production Server
71
How do you get started?
▪ Try Tall Array Based Processing on Your Own Set of Big Data
▪ Refer to the example mentioned below to get started:
https://in.mathworks.com/help/matlab/examples/analyze-big-data-in-matlab-using-tall-arrays.html
Other Resources
mathworks.com/big-data
mathworks.com/machine-learning eBook
72
MathWorks Training Offerings
http://www.mathworks.com/services/training/
73
Speaker Details
Email: Alka.Nair@mathworks.in
LinkedIn: https://www.linkedin.com/in/alka-nair-1820501a/
Contact MathWorks India
Products/Training Enquiry Booth
Call: 080-6632-6000
Email: info@mathworks.in
• Share your experience with MATLAB & Simulink on Social Media
▪ Use #MATLABEXPO
• Share your session feedback:
Please fill in your feedback for this session in the feedback form
74