Data Science Projects

Uploaded by

sashs

Email: sashs at gmx dot com

PROJECTS RELATING TO DATA SCIENCE

• Part I
  • Predictive Model in Detail
• Part II
  • Portfolio
• Part III
  • Energy Efficiency in Building Systems

Part I: Building of a Predictive Model

• Human Activity Recognition using ‘RandomForest’

Predictive Model in Detail

Conceptually…

• Steps in building a predictive model

1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results

Problem

• Human Activity Prediction Using Smartphones Data Set
  • Samsung Galaxy S II
  • 30 volunteers, each wearing the phone on the waist
  • Six activities: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
  • Sensors: accelerometer and gyroscope

Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Dataset

• UCI Machine Learning Repository
• 561-feature vector of time- and frequency-domain variables, augmented with “subject” and “activity” for 563 columns in total
  • 3-axial linear acceleration
  • 3-axial angular velocity

Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Duplicate Column Names

• R Language

load(".\\samsungData.rda")

is.data.frame(samsungData)
# [1] TRUE

table(duplicated(names(samsungData)))  # checking for duplicate headers
# FALSE  TRUE
#   479    84

Duplicate Column Names

samsDF <- data.frame(samsungData)  # data.frame() makes the column names unique

is.data.frame(samsDF)
# [1] TRUE

table(duplicated(names(samsDF)))  # checking for duplicate headers
# FALSE
#   563
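
The check above works because data.frame() repairs repeated headers by default. For readers working outside R, the same idea can be sketched in Python. The helper name make_unique and the sample headers are illustrative assumptions, and the suffix rule is a simplification rather than a copy of R's exact behaviour.

```python
from collections import Counter

def make_unique(names):
    """Append .1, .2, ... to repeated names (hypothetical helper, simplified rule)."""
    seen = Counter()       # how many times each original name has occurred so far
    taken = set(names)     # every name already in use, to avoid new collisions
    result = []
    for name in names:
        if seen[name] == 0:
            result.append(name)
        else:
            k = seen[name]
            candidate = "%s.%d" % (name, k)
            while candidate in taken:
                k += 1
                candidate = "%s.%d" % (name, k)
            taken.add(candidate)
            result.append(candidate)
        seen[name] += 1
    return result

# Illustrative headers only; these are not the real column names of the dataset.
headers = ["bandsEnergy", "bandsEnergy", "subject", "bandsEnergy", "activity"]
duplicates = [name for name, count in Counter(headers).items() if count > 1]
unique_headers = make_unique(headers)
print(duplicates)
print(unique_headers)
```

Keeping the first occurrence unchanged means existing references to that column keep working; only the later repeats receive a suffix.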

Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561

which(sapply(samsDF, is.character))
# activity
# 563

which(sapply(samsDF, is.integer))
# subject
# 562

Missing Data & Finite Values

dim(samsDF)
# [1] 7352 563

table(complete.cases(samsDF))
# TRUE
# 7352

table(sapply(samsDF[,1:561], is.finite))
# TRUE
# 4124472 #7352*561 = 4124472

Balanced Data

table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073

sum(table(samsDF$activity))
# [1] 7352

round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
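
The class shares above can be re-derived from the raw counts. The following Python sketch uses the counts reported on this slide; the imbalance ratio (largest class divided by smallest) is an extra summary that is not in the original, added here as one possible way to judge balance.

```python
# Activity counts as reported by table(samsDF$activity) on the slide.
counts = {
    "laying": 1407, "sitting": 1286, "standing": 1374,
    "walk": 1226, "walkdown": 986, "walkup": 1073,
}

total = sum(counts.values())
shares = {name: round(n / total, 2) for name, n in counts.items()}

# Assumption: ratio of largest to smallest class as a rough balance indicator.
imbalance = max(counts.values()) / min(counts.values())

print(total)
print(shares)
print(round(imbalance, 2))
```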

Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)

# Split up the data using subset

train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563

round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15

Test Data

test = subset(samsDF, split==FALSE)

dim(test)
# [1] 2206 563

round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
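
The train and test sets keep the same class shares because the split is made per class. A minimal Python sketch of such a stratified split follows. The function name stratified_split and the per-class rounding rule are assumptions, not the caTools implementation, and Python's random generator will not reproduce the rows chosen by R's set.seed(1000). With per-class rounding, the slide's counts happen to give the same 5146/2206 sizes.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio, seed):
    """Return (train_idx, test_idx), sampling `ratio` of each class for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train = []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # Assumption: round the per-class training size to the nearest integer.
        train.extend(idx[:round(ratio * len(idx))])

    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

# Rebuild a label vector from the activity counts shown on the slides.
counts = {
    "laying": 1407, "sitting": 1286, "standing": 1374,
    "walk": 1226, "walkdown": 986, "walkup": 1073,
}
labels = [name for name, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels, 0.7, seed=1000)
print(len(train_idx), len(test_idx))
```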

Random Forest

library(randomForest)
set.seed(415)

trainF = train
trainF[562] = NULL  # drop "subject" (column 562) so it is not used as a predictor
dim(trainF)
# [1] 5146 562

Determining ntree

fit <- randomForest(as.factor(activity) ~ ., data=trainF,
                    importance=TRUE, ntree=500, do.trace=TRUE)

ntree = 293

Initial Results:

Prediction <- predict(fit, test[1:561])
library(caret)
# recent caret versions require both arguments to be factors
confusionMatrix(Prediction, as.factor(test[,563]))
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)

Determining mtry

# mtry: number of variables randomly sampled as candidates at each split

mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
               stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)

bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1]

bestm
# [1] 11

Building & testing the Model

fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
                     importance=TRUE, ntree=293, mtry=bestm, do.trace=TRUE)

PredictionF <- predict(fitF, test[1:561])

library(caret)
# recent caret versions require both arguments to be factors
confusionMatrix(PredictionF, as.factor(test[,563]))
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)

Relative reduction in error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055
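
The reduction figure compares error rates rather than accuracies: the error falls from 1 - 0.9782 to 1 - 0.9805, and the difference is expressed as a share of the initial error. A short Python sketch of that arithmetic, with a hypothetical function name, is:

```python
def relative_error_reduction(acc_before, acc_after):
    """Share of the initial error rate removed by the improved model."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

# Accuracies reported on the slides for the initial and the tuned model.
reduction = relative_error_reduction(0.9782, 0.9805)
print(round(reduction, 4))
```

A 0.23 percentage-point gain in accuracy therefore corresponds to removing roughly a tenth of the remaining errors.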

AUC

library(pROC)

ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))

auc(ROC1)
# Multi-class area under the curve: 0.9953

Decision Tree by Hand: http://bit.ly/DTree123

Part II: Portfolio

• MapReduce: Apache Weblog
• Visualization: LTV
• Streaming Data Analysis: Speech
• Artificial Neural Network (ANN)
• Water-Sludge Interface Detection

MapReduce: Apache Weblog

Source: https://www.maxmind.com/en/home

Problem

• Analyze Apache weblog and provide:
  • EpochTime (date and time the request was processed by the server)
  • IP Address
  • Latitude, Longitude
  • URI
  • Referer

http://bit.ly/oFraud123

Combined Weblog Format

• "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
  • (%h) - IP address of the client (remote host)
  • (%l) - remote logname; a hyphen in the log indicates missing information
  • (%u) - the userid of the person requesting the document
  • (%t) - time of the request
  • …
  • …

Source: https://httpd.apache.org/docs/1.3/logs.html

Knowing your customers through Apache Logs

• 198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" 200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"

Fields highlighted: IP address, Date & Time, URI, Referer

Challenges

• Weblog needs to be parsed to extract the required information
• Time is not expressed in “Epoch Time”
• Latitude and Longitude are not readily available

Regular Expression & Testing

(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] \"(\S+) \/(.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"

https://regex101.com/
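
As a quick check outside regex101, the pattern can be applied to the sample log line from the earlier slide. The sketch below assumes the pattern ends with a closing quote after the last group; the field names in FIELDS and the function parse_line are hypothetical labels chosen for illustration.

```python
import re

# Pattern from the slide, split over two lines for readability.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] '
    r'\"(\S+) \/(.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Hypothetical names for the twelve capture groups, in order.
FIELDS = ("ip", "logname", "user", "timestamp", "utc_offset", "method",
          "uri", "protocol", "status", "size", "referer", "user_agent")

SAMPLE = (
    '198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
    '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
    '200 8768 "http://www.svds.com/rockandroll/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
)

def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))

record = parse_line(SAMPLE)
for name in ("ip", "timestamp", "utc_offset", "uri", "referer"):
    print(name, "=", record[name])
```

Because the leading slash sits outside the capture group, the extracted URI does not include it. Returning None for non-matching lines lets a caller skip malformed entries instead of failing.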

RegEx Groups

https://regex101.com/

EpochTime

import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'

    fmt = '%d/%b/%Y:%H:%M:%S'
    utci = int(utc)

    # parse the string given the format; timegm reads the fields as UTC,
    # so the result does not depend on the time zone of the machine
    epot = calendar.timegm(time.strptime(d, fmt))
    # offset in hours: minutes converted to hrs + int division in hrs
    epod = (abs(utci) % 100) / 60.0 + (abs(utci) // 100)

    # logged time = UTC + offset, so UTC = logged time - offset
    if utci < 0:
        epf = epot + epod * 3600
    else:
        epf = epot - epod * 3600

    return int(epf)
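
The conversion can be cross-checked with the standard library alone. This sketch assumes Python 3, where strptime understands the %z offset directive; the function name to_epoch is hypothetical, and this is an alternative route rather than the method used on the slide.

```python
from datetime import datetime

def to_epoch(timestamp, utc_offset):
    """Parse an Apache timestamp plus UTC offset; return seconds since the Unix epoch."""
    parsed = datetime.strptime(timestamp + " " + utc_offset,
                               "%d/%b/%Y:%H:%M:%S %z")
    # The datetime is timezone-aware, so timestamp() already accounts for the offset.
    return int(parsed.timestamp())

epoch = to_epoch("14/Jan/2014:09:36:50", "-0800")
print(epoch)
```

A time logged as 09:36:50 at -0800 is 17:36:50 UTC, so the epoch value must equal the one obtained for "14/Jan/2014:17:36:50" with "+0000".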

Latitude and Longitude

• Geolite2 from MaxMind
  • geolite2.lookup(<IP address>)
• Reducer
  • http://bit.ly/ApaMapper

Source: https://www.maxmind.com/en/home

Mapper

#!/usr/bin/env python

import sys

# Iterate through every line passed in on stdin
for line in sys.stdin:
    value = line.strip()
    print(value)

http://bit.ly/ApaMapper

Hadoop

hadoop jar path/to/hadoop-streaming-0.20.203.0.jar \
    -mapper path/to/mapper.py \
    -reducer path/to/reducer.py \
    -input path/to/input/* \
    -output path/to/output

Sample Output

http://bit.ly/oFraud123

Impact

Helps to detect online fraud and locate online visitors
Visualization: LTV

Background

• Gamers sign up each day and become part of a cohort
• LTV is computed for up to 30 days

Problem

• Use Tableau to:
  • Compute LTV
  • Compute weighted LTV

Challenges

• Tableau is relatively new
• LTV computation was not readily available
• The given dataset is irregular

Computed LTV

Weighted LTV

Impact

Customer LTV > Customer Acquisition Cost (CAC)

• CAC
  • $10 engagement -> 5 new users -> these users acquire 15 more users at no cost
  • CAC = $10/(5+15) = $0.50

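
The CAC arithmetic spreads the spend over every acquired user, including those who arrive through referrals at no extra cost. A minimal Python sketch follows; the function name customer_acquisition_cost and the comparison without referrals are illustrative assumptions.

```python
def customer_acquisition_cost(spend, paid_users, referred_users):
    """Spend divided by all users gained, whether paid for or referred."""
    return spend / (paid_users + referred_users)

# Figures from the slide: $10 brings 5 users, who bring 15 more at no cost.
cac = customer_acquisition_cost(10.0, 5, 15)

# For comparison only: the cost per user if referrals were ignored.
cac_without_referrals = customer_acquisition_cost(10.0, 5, 0)

print(cac, cac_without_referrals)
```

Counting referred users lowers the cost per user from $2.00 to $0.50, which is the value a customer's LTV has to exceed.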
Streaming Data: Speech