Data Science
Lecture # 3
Step 1 – Acquiring Data
• By the end of this discussion, you will be able to:
• List techniques and technologies to access and retrieve the data you
need
• Describe an example scenario that accesses data from a variety of
sources using different technologies
Note: All Images are taken from edx.org
Where’s the Data?
• Identify suitable data related to the problem
• Acquire all available data
• Leaving out even a small amount of data can lead to incorrect conclusions
• Data comes
• From many places, i.e. local and remote
• In many varieties, i.e. structured and unstructured
• At many different velocities, i.e. the speed at which data streams in
Where’s the Data?
• A lot of data exists in relational databases, e.g. the structured data produced by organizations
• SQL is used to access this data
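As a minimal sketch of querying a relational database with SQL from Python, here is an example using the standard-library sqlite3 module as a stand-in for an organizational database; the database file, the customers table, and its columns are hypothetical.

```python
import sqlite3

# Connect to a local SQLite database file (stand-in for an organizational RDBMS).
conn = sqlite3.connect("company.db")
cursor = conn.cursor()

# A hypothetical structured table of customers; SQL selects only the rows we need.
cursor.execute("SELECT name, city FROM customers WHERE city = ?", ("Lahore",))
for name, city in cursor.fetchall():
    print(name, city)

conn.close()
```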
Where’s the Data?
• Data can also exist in files such as text files and Excel spreadsheets
• Scripting and programming languages such as Python, VBA, JavaScript, Perl, PHP, R, Octave and MATLAB are used to get data from files
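For example, a short Python sketch using pandas to read a text (CSV) file and an Excel spreadsheet; the file names are placeholders, and reading Excel assumes an engine such as openpyxl is installed.

```python
import pandas as pd

# Read a plain-text CSV file into a DataFrame.
sales = pd.read_csv("sales.csv")

# Read the first sheet of an Excel workbook (requires an Excel engine, e.g. openpyxl).
budget = pd.read_excel("budget.xlsx", sheet_name=0)

print(sales.head())
print(budget.head())
```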
Where’s the Data?
• An increasingly popular
way to get data is from
websites
• Common formats are XML, JSON, etc.
• Many websites host web services, e.g. REST and WebSocket, to provide access to their data
Where’s the Data?
• REST stands for Representational State Transfer; it is an approach for implementing web services with performance, scalability and maintainability in mind
• WebSocket services allow real-time notifications from websites
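A hedged sketch of calling a REST web service that returns JSON, using the third-party requests library; the URL, parameters, and response fields are hypothetical.

```python
import requests

# Call a hypothetical REST endpoint; REST services typically return JSON or XML.
response = requests.get("https://api.example.com/v1/weather", params={"city": "Lahore"})
response.raise_for_status()   # fail loudly on HTTP errors

data = response.json()        # parse the JSON body into Python objects
print(data)
```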
Where’s the Data?
• NoSQL storage systems are increasingly used to manage a variety of data
• Examples are Cassandra, MongoDB and HBase
• They provide APIs to allow users to access the data
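As one illustration, MongoDB exposes its data through language drivers such as pymongo; a minimal sketch, assuming a MongoDB server running on localhost and a hypothetical customers collection.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Query a hypothetical collection of semi-structured customer documents.
for doc in db.customers.find({"city": "Lahore"}, {"_id": 0, "name": 1, "city": 1}):
    print(doc)
```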
A Real Example
Summary
Step 2A – Exploring Data
• By the end of this discussion, you will be able to:
• Explain the importance of exploring data
• Identify methods to perform preliminary analysis of your data
Step 2A – Exploring Data
• After getting the data you might be tempted to immediately build models to analyze it
• We must resist this temptation
• Perform a preliminary investigation to gain a better understanding of the specific characteristics of the data
• We'll be looking for correlations, general trends and outliers
Temptation: The desire to do something, especially something wrong or unwise
Why Explore?
• Correlation graphs explore dependencies between variables
• General trends show how the data is progressing over time
• Outliers are data points that are distant from the other data points
Why Explore?
• Summary statistics provide numerical values that describe the data
• Mean and median are measures of the location of the data
• Mode is the value that occurs most frequently
• Range and standard deviation are measures of the spread in the data
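These summary statistics are easy to compute; a small pandas sketch on made-up height data:

```python
import pandas as pd

heights = pd.Series([150, 160, 160, 165, 170, 172, 198])   # made-up sample data

print("mean:  ", heights.mean())                 # measure of location
print("median:", heights.median())               # measure of location
print("mode:  ", heights.mode()[0])              # most frequent value
print("range: ", heights.max() - heights.min())  # measure of spread
print("std:   ", heights.std())                  # measure of spread
```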
Visualize Data
• Heat maps show hot spots
• Histograms show the data distribution (and any unusual dispersion)
• Boxplots also show the data distribution
• Line graphs show the change of a value over time
• Scatter plots show the correlation between two variables
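A minimal matplotlib sketch of two of these plots, a histogram and a scatter plot, on randomly generated data; the variable names and numbers are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)              # made-up data
weight = 0.5 * height + rng.normal(0, 5, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(height, bins=20)                      # histogram: data distribution
ax1.set_title("Height distribution")
ax2.scatter(height, weight, s=10)              # scatter plot: correlation between two variables
ax2.set_title("Height vs. weight")
plt.tight_layout()
plt.show()
```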
Step 2B – Pre-Processing Data
• By the end of this discussion, you will be able to:
• Identify some problems with real world data
• Describe what is needed to transform raw data to data that can be used
for analysis
Step 2B – Pre-Processing Data
• Clean: to address data quality issues
• Transform: to make the data suitable for analysis
Real-World Data is Messy!
• Inconsistent values: a customer with two different addresses
• Duplicate records: a customer recorded at two different locations
• Missing values: a missing customer age
• Invalid data: an invalid zip code, e.g. a 6-digit zip code
• Outliers: due to sensor failure, values are much higher or lower than expected for a period of time
Outliers: things situated away from or detached from the main body or system
Addressing Data Quality Issues
• Remove data with missing values
• Merge duplicate records
• Generate best estimates for invalid values
• Remove outliers
• Domain knowledge is required to address these issues
• Keep a record of the changes you make
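A hedged pandas sketch of these clean-up steps on a hypothetical customer table; the column names and the outlier rule are assumptions, not prescriptions.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age": [34, 34, None, 29, 250],      # a missing value and an implausible outlier
    "zip": ["54000", "54000", "53720", "123456", "54600"],
})

df = df.drop_duplicates()                # merge duplicate records
df = df.dropna(subset=["age"])           # remove rows with missing values
df = df[df["zip"].str.len() == 5]        # drop invalid 6-digit zip codes
df = df[df["age"].between(0, 120)]       # remove outliers (domain knowledge needed here)
print(df)
```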
Getting Data in Shape
• The second part is to manipulate the clean data into the format needed for analysis; this is called data manipulation, data pre-processing, data wrangling or data munging
• Some operations in data munging
• Scaling
• Transformation
• Feature selection
• Dimensionality reduction
• Data manipulation
Scaling
• Scaling involves changing the range of values, for example to between 0 and 1
• E.g. the magnitude of a weight value is much greater than the magnitude of a height value
• Scaling both values to between 0 and 1 will equalize their contributions
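A minimal sketch of min-max scaling both columns to the range 0 to 1 (scikit-learn's MinMaxScaler does the same thing); the numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 165, 180], "weight_kg": [45, 70, 95]})

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1],
# so weight no longer dominates height just because its magnitude is larger.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```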
Transformation
• Transformation reduces noise and variability
• Aggregation is one type of transformation; it results in data with less variability, which is useful for long-term analysis
• E.g. daily sales figures can be transformed into weekly or monthly sales figures
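For example, a pandas sketch that aggregates hypothetical daily sales figures into weekly totals:

```python
import pandas as pd

daily = pd.Series(
    [120, 90, 100, 130, 80, 60, 110, 95, 105, 125],          # made-up daily sales
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

weekly = daily.resample("W").sum()   # aggregate: weekly totals have less variability
print(weekly)
```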
Feature Selection
• Feature selection involves removing redundant features, combining features and creating new features
• If two features are highly correlated, one of them can be removed
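A small sketch of dropping one of a pair of highly correlated features; the data and the 0.9 correlation threshold are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],   # almost perfectly correlated with height_cm
    "weight_kg": [50, 62, 71, 85],
})

corr = df.corr().abs()
print(corr)

# If two features correlate above 0.9 (an assumed threshold), keep only one of them.
if corr.loc["height_cm", "height_in"] > 0.9:
    df = df.drop(columns=["height_in"])
print(df.columns.tolist())
```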
Dimensionality Reduction
• Dimensionality reduction is useful when the dataset has a large number of dimensions
• It involves finding a smaller subset of dimensions that captures most of the variation in the data
• E.g. principal component analysis (PCA)
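A minimal scikit-learn sketch of principal component analysis, projecting a made-up 4-dimensional dataset onto 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # made-up data with 4 dimensions

pca = PCA(n_components=2)              # keep the 2 directions with the most variation
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component
```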
Data Manipulation
• Raw data often has to be manipulated into the correct format for the analysis
• This can involve creating groups and capturing the mean, range and standard deviation for each group
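A pandas sketch of this kind of manipulation: grouping made-up sales records by region and computing the mean, spread and range for each group (range taken as max minus min).

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales":  [100, 140, 80, 95, 120],
})

stats = df.groupby("region")["sales"].agg(["mean", "std", "min", "max"])
stats["range"] = stats["max"] - stats["min"]   # range = max - min
print(stats)
```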
Summary
• Data preparation is a very important part of the data science process
• It is where we spend most of our time
• It can be tedious, but it is a crucial step
• We won't get good results if we don't put in the time and effort, no matter how sophisticated the analysis techniques we use
Step 3 – Analyze Data
• By the end of this discussion, you will be able to:
• Describe what is involved in applying an analysis technique to your data
• List three basic analysis techniques
Categories of Analysis Techniques
• There are different types of problems so there are different types
of analysis techniques. The main techniques are
• Classification
• Regression
• Clustering
• Association analysis
• Graph analysis
Classification
• The goal is to predict the category of the input data
• An example is predicting the weather as sunny, rainy, windy or cloudy
• Another example is identifying handwritten digits as being one of 10 categories, i.e. 0 to 9
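A hedged scikit-learn sketch of the handwritten-digits example, using a k-nearest-neighbours classifier on the bundled digits dataset; the choice of classifier is an assumption for illustration, not the lecture's prescription.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # images of digits 0-9 and their category labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))   # fraction of correctly predicted categories
```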
Regression
• When the model has to predict a numeric value, the task becomes a regression problem
• An example is predicting the price of a stock over time
• Another example is estimating the weekly sales of a new product
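A minimal sketch of a regression model predicting a numeric value with scikit-learn's LinearRegression; the weekly-sales data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weeks = np.arange(1, 21).reshape(-1, 1)                   # weeks since product launch
sales = 50 + 12 * weeks.ravel() + rng.normal(0, 5, 20)    # synthetic weekly sales figures

model = LinearRegression().fit(weeks, sales)
print("predicted sales in week 25:", model.predict([[25]])[0])
```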
Clustering
• The goal is to organize similar items into groups
• An example is grouping a company's customers as seniors, teenagers and adults
• Another example is determining different weather groups, like rainy, cold or snowy
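A small k-means sketch grouping customers by age into three clusters, loosely matching the seniors / teenagers / adults example; the ages and the choice of k = 3 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([[13], [15], [17], [34], [38], [42], [67], [70], [75]])   # made-up customer ages

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
for age, label in zip(ages.ravel(), kmeans.labels_):
    print(age, "-> cluster", label)
```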
Association Analysis
• The goal is to find rules that capture associations between items or events
• A common example is market basket analysis to understand customer purchasing behavior
• E.g. a banking customer with a CD may also be interested in other investments
• The diaper-beer example
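A hedged sketch of market basket analysis using the third-party mlxtend library (not mentioned in the lecture); the tiny one-hot basket table and the support/confidence thresholds are made up.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a transaction; True means the item was in the basket (made-up data).
baskets = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, True, False]],
    columns=["diapers", "beer", "chips"],
)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)           # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```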
Graph Analytics
• When the data has a lot of entities and connections, like a social network, we use graph analytics
• E.g. exploring the spread of a disease by analyzing doctors' records
• Or identifying security threats by monitoring social media, email, etc.
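A minimal networkx sketch of graph analytics on a made-up contact network, asking which person is most connected (degree centrality); the names and edges are placeholders.

```python
import networkx as nx

# Made-up "who met whom" contact network.
G = nx.Graph()
G.add_edges_from([("Ali", "Sara"), ("Ali", "Omar"), ("Sara", "Omar"), ("Omar", "Zara")])

centrality = nx.degree_centrality(G)         # how connected each entity is
print(max(centrality, key=centrality.get))   # the most connected person
```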
Modeling
• Modeling starts with selecting one of these techniques
• Construct the model using the prepared data
• To validate the model, apply it to new data samples
• Divide the prepared data into a set for constructing the model and reserve some for evaluating the model
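A short sketch of that split and evaluation with scikit-learn: hold out part of the prepared data, build the model on the rest, then compare predicted and correct outputs on the held-out samples. The dataset and the choice of model are placeholders.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Reserve 25% of the prepared data for evaluating the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare the model's predictions with the known correct outputs.
print("accuracy on held-out data:", accuracy_score(y_test, y_pred))
```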
How to Evaluate Each Model?
• For classification and regression we will have the correct output for each sample in our data
• Comparing the correct output with the output predicted by the model provides a way to evaluate the model
How to Evaluate Each Model?
• The groups from
clustering should be
examined to see if they
make sense for our
application
• E.g. do the customer
segments reflect your
customer base?
• Are they helpful for use
in our targeted
marketing campaigns?
How to Evaluate Each Model?
• In some cases, further investigation will be needed to see if the results are correct
• E.g. predicted network traffic delays need to be investigated to see whether what our model predicts is actually happening
Determine Next Steps
Summary
Step 4 – Reporting Insights
• By the end of this discussion, you will be able to:
• Determine what to present in reporting your findings
• Identify techniques to communicate your results
What to Present?
• Look at the results and decide what to present
• This means determining which parts of the analysis are most important to our company
• Our findings determine what the next step should be
What to Present?
• All findings must be presented so that informed decisions can be made
• If your conclusions are later found to be wrong, your credibility could be seriously damaged
• It is better to tell a complete and true story, even if it isn't very clean, than to try to finesse things and make them sound clearer than they really are
How to Present?
• Visualization is an important tool in presenting results
• Scatter plots, line graphs, etc. are effective ways to represent your results visually
• Tables with details support deeper analysis
Visualization Tools
Step 5 – Turning Insights into Action
• By the end of this discussion, you will be able to:
• Explain what turning insights into action means
• Connect your results with your business question
Step 5 – Turning Insights into Action
• We bring together large datasets to find actionable insights that help answer a scientific or commercial question
Questions
• Business questions
• Is there something wrong in our process?
• Is there data that should be added to our application to make it more accurate?
• Science questions
• Were the benefits from a drug trial statistically significant?
• What is the rate of deforestation? Can we predict how much forest will remain in 15 years?
Implementation
• Now we have to figure out how to implement the actions
• How should they be automated, if they can be?
• Stakeholders need to be identified and involved in this change
Implementation
• We need to monitor and
measure the impact of
the action on the
process
• Be sure to think about
what data you should
collect during and after
the change to properly
evaluate its impact
Determine Next Steps
• Big data and data science are only useful if the insights can be turned into actions, and those actions should be carefully defined and evaluated