Data Science
Lecture # 3
Step 1 – Acquiring Data
• By the end of this discussion, you will be able to:
• List techniques and technologies to access and retrieve the data you
need
• Describe an example scenario that accesses data from a variety of
sources using different technologies
Note: All Images are taken from edx.org
Where’s the Data?
• Identify suitable data related to the problem
• Acquire all available data
• Leaving out even a small amount of data can lead to incorrect conclusions
• Data comes
• From many places, i.e. local and remote
• In many varieties, i.e. structured and unstructured
• At many different velocities, i.e. the speed at which data streams in
Where’s the Data?
• A lot of data exists in relational databases, e.g. the structured data produced by organizations
• SQL is used to access this data
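As a minimal sketch of querying a relational database with SQL from Python, here is an example using the standard-library sqlite3 module as a stand-in for an organizational database; the database file, the customers table, and its columns are hypothetical.

```python
import sqlite3

# Connect to a local SQLite database file (stand-in for an organizational RDBMS).
conn = sqlite3.connect("company.db")
cursor = conn.cursor()

# A hypothetical structured table of customers; SQL selects only the rows we need.
cursor.execute("SELECT name, city FROM customers WHERE city = ?", ("Lahore",))
for name, city in cursor.fetchall():
    print(name, city)

conn.close()
```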
Where’s the Data?
• Data can also exist in files such as text files and Excel spreadsheets
• Scripting and programming languages such as Python, VBA, JavaScript, Perl, PHP, R, Octave and MATLAB are used to get data from files
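For example, a short Python sketch using pandas to read a text (CSV) file and an Excel spreadsheet; the file names are placeholders, and reading Excel assumes an engine such as openpyxl is installed.

```python
import pandas as pd

# Read a plain-text CSV file into a DataFrame.
sales = pd.read_csv("sales.csv")

# Read the first sheet of an Excel workbook (requires an Excel engine, e.g. openpyxl).
budget = pd.read_excel("budget.xlsx", sheet_name=0)

print(sales.head())
print(budget.head())
```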
Where’s the Data?
• An increasingly popular
way to get data is from
websites
• Common formats are XML, JSON, etc.
• Many websites host web services, e.g. REST and WebSocket, to provide access to their data
Where’s the Data?
• REST stands for Representational State Transfer; it is an approach for implementing web services with performance, scalability and maintainability in mind
• WebSocket services allow real-time notifications from websites
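A hedged sketch of calling a REST web service that returns JSON, using the third-party requests library; the URL, parameters, and response fields are hypothetical.

```python
import requests

# Call a hypothetical REST endpoint; REST services typically return JSON or XML.
response = requests.get("https://api.example.com/v1/weather", params={"city": "Lahore"})
response.raise_for_status()   # fail loudly on HTTP errors

data = response.json()        # parse the JSON body into Python objects
print(data)
```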
Where’s the Data?
• NoSQL storage systems are increasingly used to manage a variety of data
• Examples are Cassandra, MongoDB and HBase
• They provide APIs to allow users to access the data
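As one illustration, MongoDB exposes its data through language drivers such as pymongo; a minimal sketch, assuming a MongoDB server running on localhost and a hypothetical customers collection.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Query a hypothetical collection of semi-structured customer documents.
for doc in db.customers.find({"city": "Lahore"}, {"_id": 0, "name": 1, "city": 1}):
    print(doc)
```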
A Real Example
Summary
Step 2A – Exploring Data
• By the end of this discussion, you will be able to:
• Explain the importance of exploring data
• Identify methods to perform preliminary analysis of your data
Step 2A – Exploring Data
• After getting the data you might be tempted to immediately build models to analyze it
• We must resist this temptation
• Perform a preliminary investigation to gain a better understanding of the specific characteristics of the data
• We'll be looking for correlations, general trends and outliers
Temptation: The desire to do something, especially something wrong or unwise
Why Explore?
• Correlation graphs explore dependencies between variables
• General trends show how the data is progressing over time
• Outliers are data points that are distant from the other data points
Why Explore?
• Summary statistics provide numerical values that describe the data
• Mean and median are measures of the location of the data
• Mode is the value that occurs most frequently
• Range and standard deviation are measures of the spread in the data
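These summary statistics are easy to compute; a small pandas sketch on made-up height data:

```python
import pandas as pd

heights = pd.Series([150, 160, 160, 165, 170, 172, 198])   # made-up sample data

print("mean:  ", heights.mean())                 # measure of location
print("median:", heights.median())               # measure of location
print("mode:  ", heights.mode()[0])              # most frequent value
print("range: ", heights.max() - heights.min())  # measure of spread
print("std:   ", heights.std())                  # measure of spread
```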
Visualize Data
• Heat maps show hot spots
• Histograms show the data distribution (and any unusual dispersion)
• Boxplots also show the data distribution
• Line graphs show the change of a value over time
• Scatter plots show the correlation between two variables
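A minimal matplotlib sketch of two of these plots, a histogram and a scatter plot, on randomly generated data; the variable names and numbers are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)              # made-up data
weight = 0.5 * height + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(height, bins=20)                      # histogram: data distribution
ax1.set_title("Height distribution")
ax2.scatter(height, weight, s=10)              # scatter plot: correlation between two variables
ax2.set_title("Height vs. weight")
plt.tight_layout()
plt.show()
```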
Step 2B – Pre-Processing Data
• By the end of this discussion, you will be able to:
• Identify some problems with real world data
• Describe what is needed to transform raw data to data that can be used
for analysis
Step 2B – Pre-Processing Data
• Clean: to address data quality issues
• Transform: to make the data suitable for analysis
Real-World Data is Messy!
• Inconsistent values: a customer with two different addresses
• Duplicate records: a customer recorded at two different locations
• Missing values: a missing customer age
• Invalid data: an invalid zip code, e.g. a 6-digit zip code
• Outliers: due to sensor failure, values are much higher or lower than expected for a period of time
Outliers: things situated away from or detached from the main body or system
Addressing Data Quality Issues
• Remove data with missing values
• Merge duplicate records
• Generate best estimates for invalid values
• Remove outliers
• Domain knowledge is required to address these issues
• Keep a record of the changes you make
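A hedged pandas sketch of these clean-up steps on a hypothetical customer table; the column names and the outlier rule are assumptions, not prescriptions.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age": [34, 34, None, 29, 250],      # a missing value and an implausible outlier
    "zip": ["54000", "54000", "53720", "123456", "54600"],
})

df = df.drop_duplicates()                # merge duplicate records
df = df.dropna(subset=["age"])           # remove rows with missing values
df = df[df["zip"].str.len() == 5]        # drop invalid 6-digit zip codes
df = df[df["age"].between(0, 120)]       # remove outliers (domain knowledge needed here)
print(df)
```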
Getting Data in Shape
• The second part is to manipulate the clean data into the format needed for analysis; this is called data manipulation, data pre-processing, data wrangling or data munging
• Some operations in data munging
• Scaling
• Transformation
• Feature selection
• Dimensionality reduction
• Data manipulation
Scaling
• Scaling involves changing the range of values, for example to between 0 and 1
• E.g. the magnitude of a weight value is much greater than the magnitude of a height value
• Scaling both values to between 0 and 1 will equalize their contributions
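A minimal sketch of min-max scaling both columns to the range 0 to 1 (scikit-learn's MinMaxScaler does the same thing); the numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 165, 180], "weight_kg": [45, 70, 95]})

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1],
# so weight no longer dominates height just because its magnitude is larger.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```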
Transformation
• Transformation reduces noise and variability
• Aggregation is one type of transformation; it results in data with less variability, which is useful for long-term analysis
• E.g. daily sales figures can be transformed into weekly or monthly sales figures
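For example, a pandas sketch that aggregates hypothetical daily sales figures into weekly totals:

```python
import pandas as pd

daily = pd.Series(
    [120, 90, 100, 130, 80, 60, 110, 95, 105, 125],          # made-up daily sales
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

weekly = daily.resample("W").sum()   # aggregate: weekly totals have less variability
print(weekly)
```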
Feature Selection
• Feature selection involves removing redundant features, combining features and creating new features
• If two features are highly correlated, one of them can be removed
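A small sketch of dropping one of a pair of highly correlated features; the data and the 0.9 correlation threshold are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],   # almost perfectly correlated with height_cm
    "weight_kg": [50, 62, 71, 85],
})

corr = df.corr().abs()
print(corr)

# If two features correlate above 0.9 (an assumed threshold), keep only one of them.
if corr.loc["height_cm", "height_in"] > 0.9:
    df = df.drop(columns=["height_in"])
print(df.columns.tolist())
```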
Dimensionality Reduction
• Dimensionality reduction is useful when the dataset has a large number of dimensions
• It involves finding a smaller subset of dimensions that captures most of the variation in the data
• E.g. principal component analysis (PCA)
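A minimal scikit-learn sketch of principal component analysis, projecting a made-up 4-dimensional dataset onto 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # made-up data with 4 dimensions

pca = PCA(n_components=2)              # keep the 2 directions with the most variation
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```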
Data Manipulation
• Raw data often has to be manipulated into the correct format for the analysis
• This can involve creating groups and capturing the mean, range and standard deviation for each group
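A pandas sketch of this kind of manipulation: grouping made-up sales records by region and computing the mean, spread and range for each group (range taken as max minus min).

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales":  [100, 140, 80, 95, 120],
})

stats = df.groupby("region")["sales"].agg(["mean", "std", "min", "max"])
stats["range"] = stats["max"] - stats["min"]   # range = max - min
print(stats)
```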
Summary
• Data preparation is a very important part of the data science process
• It is where we spend most of our time
• It can be tedious, but it is a crucial step
• We won't get good results if we don't put in the time and effort, no matter how sophisticated the analysis techniques we use
Step 3 – Analyze Data
• By the end of this discussion, you will be able to:
• Describe what is involved in applying an analysis technique to your data
• List three basic analysis techniques
Categories of Analysis Techniques
• There are different types of problems so there are different types
of analysis techniques. The main techniques are
• Classification
• Regression
• Clustering
• Association analysis
• Graph analysis
Classification
• The goal is to predict the category of the input data
• An example is predicting the weather as sunny, rainy, windy or cloudy
• Another example is identifying handwritten digits as being one of 10 categories, i.e. 0 to 9
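A hedged scikit-learn sketch of the handwritten-digits example, using a k-nearest-neighbours classifier on the bundled digits dataset; the choice of classifier is an assumption for illustration, not the lecture's prescription.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # images of digits 0-9 and their category labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))   # fraction of correctly predicted categories
```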
Regression
• When the model has to predict a numeric value, the task becomes a regression problem
• An example is predicting the price of a stock over time
• Another example is estimating the weekly sales of a new product
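A minimal sketch of a regression model predicting a numeric value with scikit-learn's LinearRegression; the weekly-sales data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = np.arange(1, 21).reshape(-1, 1)                   # weeks since product launch
sales = 50 + 12 * weeks.ravel() + rng.normal(0, 5, 20)    # synthetic weekly sales figures

model = LinearRegression().fit(weeks, sales)
print("predicted sales in week 25:", model.predict([[25]])[0])
```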
Clustering
• The goal is to organize similar items into groups
• An example is grouping a company's customers as seniors, teenagers and adults
• Another example is determining different weather groups, like rainy, cold or snowy
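A small k-means sketch grouping customers by age into three clusters, loosely matching the seniors / teenagers / adults example; the ages and the choice of k = 3 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([[13], [15], [17], [34], [38], [42], [67], [70], [75]])   # made-up customer ages

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
for age, label in zip(ages.ravel(), kmeans.labels_):
    print(age, "-> cluster", label)
```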
Association Analysis
• The goal is to find rules that capture associations between items or events
• A common example is market basket analysis to understand customer purchasing behavior
• E.g. a banking customer with a CD may also be interested in other investments
• The diaper-beer example
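A hedged sketch of market basket analysis using the third-party mlxtend library (not mentioned in the lecture); the tiny one-hot basket table and the support/confidence thresholds are made up.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction; True means the item was in the basket (made-up data).
baskets = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, True, False]],
    columns=["diapers", "beer", "chips"],
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)           # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```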
Graph Analytics
• When the data has a lot of entities and connections, like a social network, we use graph analytics
• E.g. exploring the spread of a disease by analyzing doctors' records
• Or identifying security threats by monitoring social media, email, etc.
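A minimal networkx sketch of graph analytics on a made-up contact network, asking which person is most connected (degree centrality); the names and edges are placeholders.

```python
import networkx as nx

# Made-up "who met whom" contact network.
G = nx.Graph()
G.add_edges_from([("Ali", "Sara"), ("Ali", "Omar"), ("Sara", "Omar"), ("Omar", "Zara")])

centrality = nx.degree_centrality(G)         # how connected each entity is
print(max(centrality, key=centrality.get))   # the most connected person
```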
Modeling
• Modeling starts with selecting one of these techniques
• Construct the model using the prepared data
• To validate the model, apply it to new data samples
• Divide the prepared data into a set for constructing the model and reserve some for evaluating the model
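A short sketch of that split and evaluation with scikit-learn: hold out part of the prepared data, build the model on the rest, then compare predicted and correct outputs on the held-out samples. The dataset and the choice of model are placeholders.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Reserve 25% of the prepared data for evaluating the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare the model's predictions with the known correct outputs.
print("accuracy on held-out data:", accuracy_score(y_test, y_pred))
```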
How to Evaluate Each Model?
• For classification and regression we will have the correct output for each sample in our data
• Comparing the correct output with the output predicted by the model provides a way to evaluate the model
How to Evaluate Each Model?
• The groups from
clustering should be
examined to see if they
make sense for our
application
• E.g. do the customer
segments reflect your
customer base?
• Are they helpful for use
in our targeted
marketing campaigns?
How to Evaluate Each Model?
• In some cases, further investigation will be needed to see if the results are correct
• E.g. predicted network traffic delays need to be investigated to see whether what our model predicts is actually happening
Determine Next Steps
Summary
Step 4 – Reporting Insights
• By the end of this discussion, you will be able to:
• Determine what to present in reporting your findings
• Identify techniques to communicate your results
What to Present?
• Look at the results and decide what to present
• This means determining which parts of the analysis are most important to our company
• Our findings determine what the next step should be
What to Present?
• All findings must be presented so that informed decisions can be made
• If your conclusions are later found to be wrong, your credibility could be seriously damaged
• It is better to tell a complete and true story, even if it isn't very clean, than to try to finesse things and make them sound clearer than they really are
How to Present?
• Visualization is an important tool in presenting results
• Scatter plots, line graphs, etc. are effective ways to represent your results visually
• Tables with details support deeper analysis
Visualization Tools
Step 5 – Turning Insights into Action
• By the end of this discussion, you will be able to:
• Explain what turning insights into action means
• Connect your results with your business question
Step 5 – Turning Insights into Action
• We bring together large datasets to find actionable insights that help answer a scientific or commercial question
Questions
• Business questions
• Is there something wrong in our process?
• Is there data that should be added to our application to make it more accurate?
• Science questions
• Were the benefits from a drug trial statistically significant?
• What is the rate of deforestation? Can we predict how much forest will remain in 15 years?
Implementation
• Now we have to figure out how to implement the actions
• How should they be automated, if they can be?
• Stakeholders need to be identified and involved in this change
Implementation
• We need to monitor and
measure the impact of
the action on the
process
• Be sure to think about
what data you should
collect during and after
the change to properly
evaluate its impact
Determine Next Steps
• Big data and data science are only useful if the insights can be turned into actions, and those actions should be carefully defined and evaluated