Email: sashs at gmx dot com
PROJECTS RELATING TO
DATA SCIENCE
Part I
Predictive Model in Detail
Part II
Portfolio
Part III
Energy Efficiency in Building Systems
Part I: Building of a Predictive Model
Human Activity Recognition using
‘RandomForest’
Predictive Model in Detail
Conceptually…
Steps in building a predictive model
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
Predictive Model in Detail
Problem
Human Activity Prediction Using
Smartphones Data Set
Samsung Galaxy S II
Worn on the waist by 30 volunteers
Six activities
WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS,
SITTING, STANDING, LAYING
Sensors
Accelerometer and Gyroscope
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Predictive Model in Detail
Dataset
UCI Machine Learning Repository
561-feature vector with time and frequency
domain variables, augmented with “subject” and
“activity” => 563
3-axial linear acceleration
3-axial angular velocity
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Predictive Model in Detail
Duplicate Column Names
R Language
load(".\\samsungData.rda")
is.data.frame(samsungData)
# [1] TRUE
table(duplicated(names(samsungData))) # checking for duplicate headers
# FALSE TRUE
# 479 84
Predictive Model in Detail
Duplicate Column Names
samsDF <- data.frame(samsungData)
is.data.frame(samsDF)
# [1] TRUE
table(duplicated(names(samsDF))) # checking for duplicate headers
# FALSE
# 563
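The duplicate-header check and the renaming that data.frame() performs can be sketched outside R as well. The block below is a minimal Python illustration, not the R implementation: the header names are hypothetical placeholders, and make_unique only loosely imitates the suffixing R applies to repeated names.

```python
from collections import Counter


def duplicated(names):
    """Flag every repeat after the first occurrence, like R's duplicated()."""
    seen = set()
    flags = []
    for name in names:
        flags.append(name in seen)
        seen.add(name)
    return flags


def make_unique(names):
    """Append .1, .2, ... to repeated names (a loose imitation of R's renaming)."""
    seen = Counter()
    result = []
    for name in names:
        result.append(name if seen[name] == 0 else f"{name}.{seen[name]}")
        seen[name] += 1
    return result


# Hypothetical headers, used only to illustrate the check
headers = ["x", "y", "x", "x"]
flags = duplicated(headers)
unique_headers = make_unique(headers)
print(Counter(flags))
print(unique_headers)
```

After renaming, the same check reports no duplicates, which is the situation shown for samsDF.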
Predictive Model in Detail
Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561
which(sapply(samsDF, is.character))
# activity
# 563
which(sapply(samsDF, is.integer))
# subject
# 562
Predictive Model in Detail
Missing Data & Finite Values
dim(samsDF)
# [1] 7352 563
table(complete.cases(samsDF))
# TRUE
# 7352
table(sapply(samsDF[,1:561], is.finite))
# TRUE
# 4124472 #7352*561 = 4124472
Predictive Model in Detail
Balanced Data
table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073
sum(table(samsDF$activity))
# [1] 7352
round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
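The class proportions can be re-derived from the counts printed above. This is a small Python cross-check of the arithmetic only; it uses the counts from the slide and does not read the dataset.

```python
# Activity counts copied from the table(samsDF$activity) output
counts = {
    "laying": 1407,
    "sitting": 1286,
    "standing": 1374,
    "walk": 1226,
    "walkdown": 986,
    "walkup": 1073,
}

total = sum(counts.values())
proportions = {name: round(n / total, 2) for name, n in counts.items()}

print(total)
print(proportions)
```

No class falls below 13% or rises above 19% of the rows, so the classes are reasonably balanced.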
Predictive Model in Detail
Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)
# Split up the data using subset
train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563
round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
Predictive Model in Detail
Test Data
test = subset(samsDF, split==FALSE)
dim(test)
# [1] 2206 563
round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
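The split keeps the class proportions the same in both sets, which is what a stratified split does. The sketch below shows the idea in Python under stated assumptions: stratified_split is a hypothetical helper, the per-class rounding is a guess at how the 70% is applied, and the random draws will not match those of sample.split.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratio=0.7, seed=1000):
    """Return (train_idx, test_idx), sampling `ratio` of each class for training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratio * len(indices))  # assumed rounding rule
        train_idx.extend(indices[:n_train])

    train_set = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx


# Rebuild a label vector from the activity counts shown earlier
counts = {"laying": 1407, "sitting": 1286, "standing": 1374,
          "walk": 1226, "walkdown": 986, "walkup": 1073}
labels = [name for name, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these counts the sketch yields the same set sizes as the slides (5146 training rows and 2206 test rows).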
Predictive Model in Detail
Random Forest
library(randomForest)
set.seed(415)
trainF = train
trainF[562] = NULL
dim(trainF)
# [1] 5146 562
Predictive Model in Detail
Determining ntree
fit <- randomForest(as.factor(activity) ~ ., data=trainF,
importance=TRUE, ntree=500, do.trace=T)
ntree = 293
Initial Results:
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction, as.factor(test[,563])) # recent caret versions require factors
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
Predictive Model in Detail
Determining mtry
# mtry : Optimal number of variables selected at each split
mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
bestm
# [1] 11
Predictive Model in Detail
Building & testing the Model
fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
importance=TRUE, ntree=293, mtry=bestm, do.trace=T)
PredictionF <- predict(fitF, test[1:561])
library(caret)
confusionMatrix(PredictionF, as.factor(test[,563])) # recent caret versions require factors
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)
Reduction in Error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055
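The reduction-in-error figure is the drop in error rate relative to the initial error rate. A short Python check of that arithmetic, using the two accuracies reported above (the function name is illustrative):

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error removed by the improved model."""
    error_before = 1.0 - acc_before
    error_after = 1.0 - acc_after
    return (error_before - error_after) / error_before


reduction = relative_error_reduction(0.9782, 0.9805)
print(round(reduction, 4))
```

Tuning mtry therefore removes roughly a tenth of the remaining misclassifications, even though accuracy moves by only 0.23 percentage points.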
Predictive Model in Detail
AUC
library(pROC)
ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))
auc(ROC1)
# Multi-class area under the curve: 0.9953
Decision Tree by Hand: http://bit.ly/DTree123
Part II: Portfolio
MapReduce: Apache Weblog
Visualization: LTV
Streaming Data Analysis: Speech
Artificial Neural Network (ANN)
Water-Sludge interface Detection
MapReduce: Apache Weblog
Source: https://www.maxmind.com/en/home
MapReduce: Apache Weblog
Problem
Analyze Apache weblog and provide:
EpochTime (date and time the request was
processed by the server)
IP Address
Latitude, Longitude
URI
Referer
http://bit.ly/oFraud123
MapReduce: Apache Weblog
Combined Weblog Format
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
(%h) - IP address of the client (remote host)
(%l) - a "hyphen" indicates missing information
(%u) - the "userid" of the person requesting
(%t) - time of the request
…
…
Source: https://httpd.apache.org/docs/1.3/logs.html
MapReduce: Apache Weblog
Knowing your customers
through Apache Logs
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET
/svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1"
200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/31.0.1650.63 Safari/537.36"
IP address Date & Time
URI Referer
MapReduce: Apache Weblog
Challenges
Weblog needs to be parsed to extract the
required information
Time is not expressed in “Epoch Time”
Latitude and Longitude are not readily
available
MapReduce: Apache Weblog
Regular Expression & Testing
(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] \"(\S+) \/(.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)
https://regex101.com/
MapReduce: Apache Weblog
RegEx Groups
https://regex101.com/
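The regular expression can also be tried directly in Python against the sample log line from the earlier slide. This is a minimal sketch: the pattern is the one shown above, while parse_line and the field names in the returned dictionary are illustrative choices rather than the project's code.

```python
import re

# Pattern from the slide; group 7 excludes the leading "/" of the URI
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] \"(\S+) \/(.*?) '
    r'(\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)'
)

SAMPLE_LINE = (
    '198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
    '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
    '200 8768 "http://www.svds.com/rockandroll/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
)


def parse_line(line):
    """Return the fields of interest, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group(1),
        "timestamp": match.group(4),
        "utc_offset": match.group(5),
        "uri": "/" + match.group(7),
        "referer": match.group(11),
    }


fields = parse_line(SAMPLE_LINE)
print(fields)
```

Lines that do not follow the combined format return None, so a mapper can skip them instead of failing.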
MapReduce: Apache Weblog
EpochTime
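import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'
    fmt = '%d/%b/%Y:%H:%M:%S'
    utci = int(utc)  # signed offset, e.g. -800 for '-0800'
    # parse the string with the given format and read it as UTC seconds,
    # so the result does not depend on the server's own time zone
    epot = calendar.timegm(time.strptime(d, fmt))
    # offset in hours: minutes converted to hours + whole hours
    epod = (abs(utci) % 100) / 60.0 + (abs(utci) // 100)
    if utci < 0:
        epf = epot + epod * 3600  # local time is behind UTC
    else:
        epf = epot - epod * 3600  # local time is ahead of UTC
    return int(epf)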
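The epoch conversion can be cross-checked with the standard library, which parses the UTC offset through the %z directive. A minimal Python 3 sketch; to_epoch is an illustrative name, and the input is the timestamp from the sample log line.

```python
from datetime import datetime


def to_epoch(timestamp):
    """Convert 'dd/Mon/yyyy:HH:MM:SS +hhmm' to seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    return int(parsed.timestamp())


epoch = to_epoch("14/Jan/2014:09:36:51 -0800")
print(epoch)
```

A local time of 09:36:51 at offset -0800 is 17:36:51 UTC, so both spellings of that instant map to the same epoch value.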
MapReduce: Apache Weblog
Latitude and Longitude
Geolite2 from MaxMind
geolite2.lookup(<IP address>)
Reducer
http://bit.ly/ApaMapper
Source: https://www.maxmind.com/en/home
MapReduce: Apache Weblog
Mapper
#!/usr/bin/env python
import sys

# Iterate through every line passed in on stdin
for line in sys.stdin:
    value = line.strip()
    print(value)
http://bit.ly/ApaMapper
MapReduce: Apache Weblog
Hadoop
hadoop jar path/to/hadoop-streaming-0.20.203.0.jar \
-mapper path/to/mapper.py \
-reducer path/to/reducer.py \
-input path/to/input/* \
-output path/to/output
MapReduce: Apache Weblog
Sample Output
http://bit.ly/oFraud123
MapReduce: Apache Weblog
Impact
Helps to Detect Online Fraud and
Locate Online Visitors
Visualization: LTV
Visualization: LTV
Background
Gamers sign up each day and become part of
a cohort
LTV is computed for up to 30 days
Visualization: LTV
Problem
Use Tableau to:
Compute LTV
Compute weighted LTV
Visualization: LTV
Challenges
Tableau is relatively new
LTV computation was not readily available
Given dataset is irregular:
Visualization: LTV
Computed LTV
Visualization: LTV
Weighted LTV
Visualization: LTV
Impact
Customer LTV > Cost of Customer Acquisition (CAC)
CAC
$10 engagement -> 5 new users -> these users
acquire 15 more users at no cost
CAC = $10/(5+15) = $0.50
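The CAC example divides the spend by all users gained, including those who arrive through referrals at no extra cost. A small Python restatement of that arithmetic (function and argument names are illustrative):

```python
def customer_acquisition_cost(spend, paid_users, referred_users):
    """Spend divided by every user gained, paid or referred."""
    return spend / (paid_users + referred_users)


cac = customer_acquisition_cost(10.0, paid_users=5, referred_users=15)
print(f"${cac:.2f}")
```

Counting only the 5 directly acquired users would give $2.00 per user, so referrals cut the effective CAC by a factor of four in this example.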
Streaming Data: Speech
https://angel.co/freeaccent
Streaming Data: Speech
Language Learning over a
Chat session
Streaming Data: Speech
Problem
Learn a foreign language from a native
speaker
Student and Tutor are separated
Use computing device and internet
Streaming Data: Speech
Challenges
Collect the speech data off the web
Record: start record, stop record
Upload
Preprocessing speech data
End point detection
Noise
Extracting Accent Score frame by frame
Populating on the web page on demand
Streaming Data: Speech
Technology Stack
Collect the speech data off the web
Html5, JavaScript, PHP
Preprocessing speech data
Energy based algo, MFCC
Extracting Accent Score frame by frame
Proprietary algo
Populating on the web page on demand
AJAX
Streaming Data: Speech
User Interface
Streaming Data: Speech
Impact
Measurement tool
Motivational: helps to set goals
Customer retention
Artificial Neural Network (ANN)
ANN
McCulloch Pitts (MP) Neuron
Source: https://appliedgo.net/perceptron/
ANN
Diagram of the MP neuron
ANN
Equation of the MP neuron
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
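The equation image for this slide did not survive extraction, so the block below assumes the textbook form of a threshold unit: the neuron fires when the weighted sum of its inputs reaches a threshold. The weights and threshold are illustrative values chosen to realise a logical AND, not values from the slides.

```python
def mp_neuron(inputs, weights, threshold):
    """Return 1 if the weighted input sum reaches the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


# Illustrative parameters: two unit weights and threshold 2 behave like AND
and_outputs = [mp_neuron([a, b], weights=[1, 1], threshold=2)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Lowering the threshold to 1 with the same weights turns the unit into a logical OR.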
ANN
Multi-Layer Perceptron
Fully interconnected
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
ANN
Optimization function
Rumelhart et al – Gradient Descent
(Generalized Delta Rule)
ANN
Challenges
Saturation at Initialization
Known solutions:
Small initial weights
Hyperbolic Tangent Function instead of Sigmoidal
Other challenges relating to speech
processing
http://bit.ly/my_pubs
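Saturation can be illustrated by comparing activation derivatives, since gradient descent scales each weight update by that derivative. The Python sketch below uses the standard sigmoid and hyperbolic tangent definitions; the two net-input values are arbitrary examples of a small and a large initial weighted sum.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, at most 1.0."""
    return 1.0 - math.tanh(x) ** 2


# Arbitrary example net inputs: near zero versus deep in the flat region
for net in (0.0, 10.0):
    print(net, sigmoid_grad(net), tanh_grad(net))
```

Near zero the tanh derivative is four times the sigmoid derivative, and both collapse for large net inputs, which is consistent with the two known solutions listed above: keep initial weights small and prefer tanh.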
ANN
Hyperbolic Tangent Function
ANN
Saturation at Initialization
http://bit.ly/modANN
ANN
Introduced (N)
where (N) 1
http://bit.ly/modANN
ANN
Impact
Training time was significantly reduced
3 layers – not needed
(N) - empirical
Water-Sludge interface Detection
Thames Water Authority – Deephams
Station, Enfield
Water-Sludge interface Detection
Problem
Replace Turbidity meter
Piezo-electric transducer to detect water-
sludge interface
Measure water depth in a final stage settling
tank
Water-Sludge interface Detection
Piezo-electric Transducer
Receiver
Transmitter
Water-Sludge interface Detection
Final Stage Settling Tank
Water-Sludge interface Detection
Pulsed Sinusoidal Signal
Period of pulse 27.5 ms
Water-Sludge interface Detection
Collecting Data
Envelope Detection and Amplification
Water-Sludge interface Detection
Data Visualization
Average of the reverberated signal by the pulse period
Leakage
Reverberation
3.68 ms
Bottom of
the Tank
Water-Sludge interface Detection
Computing the Water Depth
Speed of sound in water ~1.5 x 10^3 m/s
Round-trip distance = 1.5 x 10^3 m/s x 3.68 ms
= 5.52 m
Depth of water = 5.52 m / 2
= 2.76 m
= 9.05 ft
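The depth calculation halves the distance travelled because the pulse goes down to the reflecting surface and back. A short Python restatement, assuming the 3.68 ms reading is the round-trip echo delay and using the approximate speed of sound quoted on the slide:

```python
SPEED_OF_SOUND_WATER = 1.5e3  # m/s, approximate value from the slide
METRES_PER_FOOT = 0.3048


def water_depth(round_trip_seconds, speed=SPEED_OF_SOUND_WATER):
    """One-way distance to the reflecting surface for a given echo delay."""
    return speed * round_trip_seconds / 2.0


depth_m = water_depth(3.68e-3)
depth_ft = depth_m / METRES_PER_FOOT
print(round(depth_m, 2), round(depth_ft, 2))
```

The same function applies to any other echo in the averaged signal, given its delay from the transmitted pulse.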
Water-Sludge interface Detection
Impact
Proof of concept was successful
Won a contract to develop an instrument
Water-Sludge interface Detection
Addition of Internet?
IoT
On a computer or a device
Part III
Energy Efficiency in Building Systems
Energy Efficiency in Building Systems
Powerwall by Tesla
Powerwall
Energy Efficiency in Building Systems
Solar Tubes and Walls
Energy Efficiency in Building Systems
Sun Shades
UC Davis West Village is the largest planned “zero net energy” community
Energy Efficiency in Building Systems
Net Zero Homes
New homes to be net-zero energy by 2020
California Public Utilities Commission (CPUC) and
California Energy Commission (CEC)
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
Energy Efficiency in Building Systems
Related Work
http://bit.ly/EnergyEff123
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020