Intro to Data Science
WELCOME TO GA
GENERAL ASSEMBLY
Travis Huang (He/Him)
Data Science Part-Time Lead Instructor
● Technical Program Manager
● Stats Nerd
● Casual Gamer
travis.huang@generalassemb.ly
https://www.linkedin.com/in/huangtravis/
2 | © 2018 General Assembly
What is General Assembly?
General Assembly is a pioneer in education and
career transformation, specializing in today's most
in-demand skills. We foster a flourishing community
of professionals pursuing careers they love.
What We Teach
● Coding
● UX & Design
● Data
● Marketing
● Business
● Career Development
Our Agreement
● Turn off or silence your devices.
● Be present — engage in active learning,
collaborate, and ask questions.
● Be curious.
You’ll receive digital copies of these slides
after class has ended and you’ve filled out the
survey.
Agenda
Defining Data Science
The Data Science Workflow
Crafting Good Questions
Supervised Learning
Decision Trees
Our Goals For Today
Define data science.
Identify the Data Science Workflow and explain the value it adds to solving
a business challenge.
Construct a good data science question.
Observe how decision trees are used in data science.
Big-Picture Goal
This workshop represents the first step toward
improving your data science literacy.
Intro to Data Science
Defining Data Science
What Is Data Science, Anyway?
Discussion:
Data Science Careers
How would you define data science?
Real Cases:
Data Science on the Job
How They’re Using Data Science:
● Prioritizes listings in popular areas, making desirable
Airbnbs easier for users to find.
● Basketball hoop rim sensors track real-time data to
better predict court placement for successful shots.
● Optimizes package drop-off and delivery transport
using machine learning and AI to predict delivery
obstacles (e.g., weather, traffic).
“
The ability to take data — to be able to understand it,
to process it, to extract value from it, to visualize it, to
communicate it — that’s going to be a hugely
important skill in the next decades.”
Hal Varian, chief economist at Google | UC Berkeley professor
Data Science is the Extraction of Knowledge From Data.
Real Cases:
Data Science on the Job
Consider these three products and services:
● How do they utilize data science?
● What kinds of data do you think they use?
● How might they leverage data science in other parts of their business?
Discussion:
Data Science Careers
What skills and competencies do you think are most
important for data scientists?
Makeup of a Data Scientist
● Tech: SAS, R, Python, Perl, Excel, SQL, Hadoop, JavaScript, IoT.
● Soft Skills: Influencing, critical thinking, systems thinking, visual thinking, design.
● Math Methods: Statistics techniques, quantitative and qualitative methods.
● Domain Knowledge: Industry knowledge, workflows, data operations, analytics.
Intro to Data Science
The Data Science Workflow
Overcoming Challenges With Data Science
Going from answering...
“Let’s optimize our sales funnel to improve our conversion rates.”
To...
“Here are actionable recommendations drawn from data-driven insights.”
Why Does It Matter?
Think of the steps in the Data
Science Workflow as
problem-solving guidelines.
Steps in the Workflow
● Frame: Develop hypothesis-driven questions for your analysis.
● Prepare: Select, import, and clean relevant data.
● Analyze: Structure, visualize, and complete your analysis.
● Interpret: Make business decisions based on data.
● Communicate: Present data-driven insights to your audience.
Iterative: repeat as needed!
Frame: “What Is the Challenge?”
Intro to Data Science
A Closer Look at
the Data Science Workflow
“
Asking the right questions is what separates data
scientists that know ‘why’ from folks that only know
‘what’ (tools and technologies).”
Kayode Ayankoya, MBA, PhD | clinical data scientist
Discussion:
Frame Problems With Good Questions
What makes a question “good”?
Asking Good Questions...
Establishes the basis for reproducibility.
Enables collaboration through clear goals.
Produces actionable recommendations and strategies for stakeholders.
Some Good Questions
“Which ad distribution channels would yield the greatest volume at the lowest cost of acquisition?”
“Which markets are most attractive in terms of profit potential?”
“The past three quarters have seen a year-over-year decline of 5%. What are the top five changes in competitive dynamics?”
Some Not-So-Good Questions
“What is the best way to attract more users?”
“Which markets should we enter?”
“What is causing the decline in sales?”
Discussion:
Spot the Differences
Good question: “Which ad distribution channels would yield the greatest volume at the lowest cost of acquisition?”
Not-so-good question: “What is the best way to attract more users?”

Good question: “Which markets are most attractive in terms of profit potential?”
Not-so-good question: “Which markets should we enter?”

Good question: “The past three quarters have seen a year-over-year decline of 5%. What are the top five changes in competitive dynamics?”
Not-so-good question: “What is causing the decline in sales?”
Group Exercise:
Restructure the Question
Consider the wording of this question:
“What is going to happen with my stock?”
How could you rephrase the question to make it stronger?
Real Cases:
Data Science In Action: Survival Prediction
On April 15, 1912, the RMS Titanic sank after
colliding with an iceberg.
The crash resulted in 1,502 fatalities out of
2,224 passengers and crew members.
Some groups were more likely to survive
than others, such as women, children, and
members of the upper class.
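As a quick worked example, the figures quoted above can be turned into overall rates. This is a minimal sketch that uses only the two numbers on this slide; the variable names are illustrative.

```python
# Figures quoted on the slide above.
FATALITIES = 1502
PEOPLE_ON_BOARD = 2224  # passengers and crew members

# Share of people on board who died, and the complementary survival share.
fatality_rate = FATALITIES / PEOPLE_ON_BOARD
survival_rate = 1 - fatality_rate

print(f"Fatality rate: {fatality_rate:.1%}")
print(f"Survival rate: {survival_rate:.1%}")
```

An overall rate like this is only a baseline; the interesting questions are about how the rate differs between groups.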
Real Cases:
Data Science In Action: Survival Prediction (Cont.)
If we wanted to explore which groups of
people were likely to survive, we could apply
machine learning tools to predict which
passengers survived the tragedy, examining
the passenger attributes that were
associated with higher survival rates.
Discussion:
Data Science In Action: Framing Survival Prediction
What sorts of questions would you ask to identify
attributes of passengers with higher survival rates?
Prepare: “What’s Needed?”
Why Bother Cleaning and Preparing Data?
Cleaning and Preparing Data...
Ensures that data is defined and structured.
Helps to check and polish data formatting.
Preprocesses data into a format that’s interpretable
by machine learning frameworks.
Examples of preprocessing for machine learning frameworks:
● Text (string data such as tweets or product reviews) into a form suitable for natural language processing.
● Categorical data into binary dummies (1/0).
● Images into multi-dimensional NumPy arrays.
● Timestamps into datetime format.
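Two of the conversions listed above can be sketched with the Python standard library alone. This is an illustrative sketch, not a prescribed method: the helper names `to_dummies` and `to_datetimes`, the sample values, and the timestamp format are all assumptions made for the example.

```python
from datetime import datetime


def to_dummies(values):
    """Turn a list of category labels into one 1/0 column per category."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


def to_datetimes(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse timestamp strings into datetime objects (format is assumed)."""
    return [datetime.strptime(ts, fmt) for ts in timestamps]


# Hypothetical sample values, invented for this example.
ticket_classes = ["first", "third", "second", "third"]
dummies = to_dummies(ticket_classes)
parsed = to_datetimes(["2021-01-05 09:30:00", "2021-01-06 17:45:00"])

print(dummies["third"])
print(parsed[1] - parsed[0])
```

In practice a data library would usually handle these steps; the point here is only to show what "into binary dummies" and "into datetime format" mean.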
How Do You Prepare Data?
Often, we’re given secondary data, or data that was collected previously.
In these cases, we have to learn as much as possible about our data, using
tools like data dictionaries to determine how the set was gathered.
Real Cases:
Warby Parker
● Used an open-source project to generate its data dictionary.
● Needed all business units to agree on terms of the dictionary.
● Secured approval from the co-CEOs to implement a data dictionary
sign-off date.
● Top-down support proved to be valuable to its data teams.
Using Data Dictionaries
Data Dictionary: A list of key terms and metrics with definitions.
Ensures that all stakeholders are on the same page with the meanings of all variables.
Here’s an example:
Variable   Description                      Variable Type
survival   Fate of passenger                Binary
pclass     Ticket class                     Discrete
age        Age in years of passenger        Continuous
fare       Price of ticket (1912 dollars)   Continuous
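One way to keep a data dictionary close to the analysis is to store it as a plain mapping. The sketch below encodes the four Titanic variables from this slide; the structure and the helper names `describe` and `undocumented` are assumptions for illustration, not part of any particular tool.

```python
# The four variables from the example data dictionary.
DATA_DICTIONARY = {
    "survival": {"description": "Fate of passenger", "type": "Binary"},
    "pclass": {"description": "Ticket class", "type": "Discrete"},
    "age": {"description": "Age in years of passenger", "type": "Continuous"},
    "fare": {"description": "Price of ticket (1912 dollars)", "type": "Continuous"},
}


def describe(variable):
    """Return a one-line definition of a documented variable."""
    entry = DATA_DICTIONARY[variable]
    return f"{variable}: {entry['description']} ({entry['type']})"


def undocumented(columns):
    """List the columns that have no entry in the data dictionary."""
    return [column for column in columns if column not in DATA_DICTIONARY]


print(describe("fare"))
# "cabin" is a hypothetical extra column used to show the check.
print(undocumented(["age", "cabin"]))
```

A check like `undocumented` is a simple way to notice when a data set contains columns that stakeholders have not yet agreed on.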
Variable Types
A data dictionary is a list of key terms and metrics with definitions. Each variable in it has a type:
● Binary (survival): Binary data is discrete data that can only be in one of two categories: either yes or no, 1 or 0, off or on, etc. It can be thought of as ordinal, nominal, count, or interval data.
● Discrete (pclass): Discrete data can’t be measured, but it can be counted.
● Continuous (age, fare): Continuous data represents measurements. Its values can’t be counted, but they can be measured.
Discussion:
Data Science In Action: Framing
What sort of features would you want to see in this data
set that are necessary for determining survival rates?
Analyze: “What Happened?”
Digging Deeper With Data Analysis
After you’ve collected the right
data to answer your questions, it’s
time to start data analysis.
Digging Deeper With Data Analysis (Cont.)
This step — the initial analysis of
trends, correlations, variations, and
outliers in your data — helps you to...
● Focus your data analysis on
answering your initial questions
in better ways.
● Address any objections others
might have.
Common Stats
Data scientists often check the mean, standard deviation,
or specific frequency counts of their data.
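The three checks named above can be sketched with Python's standard library. The values below are invented for illustration and are not taken from any real data set.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample values, invented for this example.
ages = [24, 31, 45, 19, 31]
ticket_classes = [3, 1, 3, 2, 3]

mean_age = mean(ages)       # average value
std_age = stdev(ages)       # sample standard deviation (spread around the mean)
class_counts = Counter(ticket_classes)  # how often each category occurs
class_share = {
    ticket_class: count / len(ticket_classes)
    for ticket_class, count in class_counts.items()
}

print(f"Mean age: {mean_age}")
print(f"Standard deviation of age: {std_age:.2f}")
for ticket_class in sorted(class_share):
    print(f"Class {ticket_class}: {class_share[ticket_class]:.0%}")
```

Means and standard deviations suit continuous variables such as age, while frequency counts suit binary and discrete variables such as ticket class.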
Real Cases:
Data Science In Action: Survival Prediction Statistics
Revisiting our Titanic example from earlier…
The following are summary statistics we might expect for the survival variables:
Variable   Mean or Frequency (%)
survival   38.38%
pclass     1: 24.24%, 2: 20.65%, 3: 55.11%
age        29.70 years
fare       $32.20
Interpret: “Why and How Did This Happen?”
Interpreting Your Data
Now that you’ve analyzed
your data, you can begin to
interpret the results.
Discussion:
Keys to Interpreting Data
What factors should you keep in mind when
interpreting data?
Questions to Form a Hypothesis
1. Does the data answer your original question?
2. Does the data help you defend against any objections?
3. Are there any limitations on your conclusions?
If your interpretation of the data holds up under
all of these questions and considerations, then
you have likely come to a productive conclusion.
Forming Conclusions
Now that you have a hypothesis, what
are some things you should check?
Can you convert your findings into a
conclusion or next step?
Communicate: “How Do We Share This?”
Show (and Explain) The Results
You’ve framed the problem, and you’ve
prepared, analyzed, and interpreted
the data to develop a solution.
Now, you need to distill that into
something that can be clearly
communicated to an audience.
Discussion:
Presenting Data Effectively
What are some key factors to consider when
presenting data science findings and conclusions?
Capture Their Attention
Presentations are a critical part of
your analysis.
The most basic form of a data science
presentation should describe your
results in the simplest and most
engaging way for your audience.
A Good Story: The Key to Effective Data Presentations
Set the scene for your listeners, relating the problem to your audience's interests.
Focus on your hypothesis/solution. Help your audience see what you’re proposing.
Highlight your methodology. How did you come to your conclusion? Be concise —
present steps at a high level.
Feature contributions made and results. Highlight how your results made an impact.
Real Cases:
Static Presentation of Data
● The PollEverywhere team wanted
to look for opportunities to
improve employee benefits
packages.
● The author of this data presentation
highlighted the main takeaways
for the audience, explaining what
each axis means.
● This data helped to clearly
illustrate next steps for the
company in crafting benefits
packages.
Real Cases:
Interactive Presentation of Data
Data science presentations can also be far more complex and exciting, like some of the
research presented by Nate Silver's FiveThirtyEight blog.
Do Your
Thing
Let’s Recap
● The Data Science Workflow is used to iteratively
develop solutions.
● Crafting good questions is key.
● Cleaning and preparing your data is crucial.
● Analyzing data helps answer outstanding questions.
● Interpreting data leads you to form a
hypothesis/solution.
● Clearly communicating findings creates relevancy
for your audience.
Intro to Data Science
Decision Trees
Imagine a flow chart where each level is a question
with a yes or no answer, eventually leading to a
solution to the original question.
That’s a decision tree.
What Are Decision Trees?
Decision trees are a machine
learning model for regression and
classification that helps to break
complex data science challenges
into a series of simpler decisions.
Back to the Workflow...
When are decision trees used?
[Figure: the five workflow steps (Frame, Prepare, Analyze, Interpret, Communicate) with a callout marking where you’d use decision trees.]
Real Cases:
Non-Data Science Decision Tree
● This tree models a set of sequential, hierarchical decisions that ultimately lead to some final result.
● Decisions remain “high level” to keep the tree small and achieve a higher level of accuracy.
Alone or with friends?
  Alone: Weather outside?
    Sunny: Video games
    Rainy: Video games
  Friends: Weather outside?
    Sunny: Soccer
    Rainy: Movie
Discussion:
Decision Tree Questions
Let’s say we’re using a data set consisting of animals with lots of different
characteristics, and we want to classify them as mammals, birds, or fish.
What might be a good decision tree question to start predicting their
classification?
Does the animal breathe air?
  Yes: Mammal or bird (needs another question)
  No: Fish
Discussion:
Decision Tree Questions (Cont.)
What’s a second question that could further determine their class?
Does the animal breathe air?
  No: Fish
  Yes: Does the animal lay eggs?
    Yes: Bird
    No: Mammal
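The two questions above translate directly into nested yes/no checks. This is a minimal sketch of that hand-built tree; the function name `classify_animal` and its boolean parameters are illustrative assumptions.

```python
def classify_animal(breathes_air: bool, lays_eggs: bool) -> str:
    """Follow the two yes/no questions from the slide to reach a class."""
    # First question (the starting point): does the animal breathe air?
    if not breathes_air:
        return "fish"
    # Second question, asked only on the "yes" side: does the animal lay eggs?
    if lays_eggs:
        return "bird"
    return "mammal"


print(classify_animal(breathes_air=True, lays_eggs=True))
print(classify_animal(breathes_air=True, lays_eggs=False))
print(classify_animal(breathes_air=False, lays_eggs=True))
```

Here a person chose the questions. In data science, an algorithm chooses them by looking at the data.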
Decision Trees
In data science, the creation of
decision tree rules is governed by
an algorithm that learns which
questions to ask by analyzing an
entire data set.
The “knowledge” learned by a
decision tree is directly formulated
into a hierarchical structure, which
is determined by which yes/no rules
best predict the outcome variable.
Decision Trees (Cont.)
These yes/no rules appear as a
tree with several branching paths, or
splits.
❗ Adding too many splits makes decision trees overly complex and not adaptable
to new data.
Decision Trees (Cont.)
● Root: The starting point of a decision tree is referred to as the root.
● Node: Subsequent points are called nodes.
● Branch: Splits resulting from nodes are called branches.
● Leaves: Nodes that do not split further are then called leaves.
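To make the vocabulary concrete, here is a small sketch that stores the animal tree from earlier as linked nodes and then counts its leaves and branches. The `Node` class and the helper functions are assumptions made for this example, not a standard library API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One point in the tree: a question, or a final answer if it is a leaf."""
    label: str
    yes: Optional["Node"] = None  # branch followed when the answer is yes
    no: Optional["Node"] = None   # branch followed when the answer is no

    def is_leaf(self) -> bool:
        return self.yes is None and self.no is None


def children(node):
    """Return the nodes reached by the branches leaving this node."""
    return [child for child in (node.yes, node.no) if child is not None]


def leaves(node):
    """Collect the labels of all nodes that do not split further."""
    if node.is_leaf():
        return [node.label]
    return [label for child in children(node) for label in leaves(child)]


def count_branches(node):
    """Count every branch (link from a node to a child) in the tree."""
    return sum(1 + count_branches(child) for child in children(node))


# The root is the first question; its children are further nodes or leaves.
root = Node(
    "Does the animal breathe air?",
    yes=Node("Does the animal lay eggs?", yes=Node("Bird"), no=Node("Mammal")),
    no=Node("Fish"),
)

print(leaves(root))
print(count_branches(root))
```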
Splitting a Decision Tree
Two metrics are commonly used to decide how to split a tree...
● Gini impurity: A measurement of the likelihood that a new instance would be
classified incorrectly if it were labeled at random according to the mix of classes in a node.
● Entropy: The measure of impurity (or uncertainty) in a node’s class labels. Splits that
reduce entropy the most are preferred, which affects how a decision tree draws its boundaries.
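Both metrics can be computed from the class labels in a node. The sketch below uses the standard formulas (Gini impurity is 1 minus the sum of squared class proportions; entropy is the negative sum of p times log2 p) and then picks the yes/no question with the lowest weighted impurity. The toy animal rows and the helper names are invented for illustration; real libraries implement this with many more details.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: negative sum of p * log2(p) over the classes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_impurity(rows, feature, metric):
    """Weighted impurity of the two groups produced by a yes/no question."""
    yes = [label for features, label in rows if features[feature]]
    no = [label for features, label in rows if not features[feature]]
    return (len(yes) * metric(yes) + len(no) * metric(no)) / len(rows)


def best_question(rows, metric=gini):
    """Pick the feature whose yes/no split leaves the purest groups."""
    features = rows[0][0].keys()
    return min(features, key=lambda f: split_impurity(rows, f, metric))


# Hypothetical toy rows, invented for this example.
ANIMALS = [
    ({"breathes_air": True, "lays_eggs": False}, "mammal"),
    ({"breathes_air": True, "lays_eggs": False}, "mammal"),
    ({"breathes_air": True, "lays_eggs": True}, "bird"),
    ({"breathes_air": True, "lays_eggs": True}, "bird"),
    ({"breathes_air": False, "lays_eggs": True}, "fish"),
    ({"breathes_air": False, "lays_eggs": True}, "fish"),
    ({"breathes_air": False, "lays_eggs": True}, "fish"),
]

print(gini(["a", "a", "b", "b"]))
print(entropy(["a", "a", "b", "b"]))
print(best_question(ANIMALS))
```

A node with a single class has impurity 0 under both metrics, while an even two-class mix gives a Gini impurity of 0.5 and an entropy of 1 bit.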
Group Exercise:
Knowledge Check
Let’s take a look at this example of a decision tree.
● Which is the node, which is a branch, and
which are leaves?
● Why is each subsequent level of branches
wider?
Cognitive
Load Break
Back to Titanic Survival Prediction
Let’s see how data sets and a decision
tree model could be used to predict
Titanic passenger survival rates.
Group Exercise:
Titanic Survival Prediction
In this example, our decision (leaves) will be survival (0 = no; 1 = yes).
Features will be the following conditions (nodes):
● sex: Sex (0 = female; 1 = male)
● pclass: Passenger class (1 = first; 2 = second; 3 = third)
● fare: Passenger fare (in 1912 dollars)
● age: Age (in years)
Group Exercise:
Titanic Survival Prediction (Cont.)
Each condition (node) represents a feature.
In this case, this would be either a category, such as male or female, or a range
of numbers (such as age greater than or equal to 10).
For variables that have more than two categories (passenger class, for example),
you would make another branch off of a condition.*
*For example, first-class passengers are those who are NOT Class 3 and also NOT Class 2.
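The footnote can be read as two yes/no conditions applied in sequence. The sketch below traces which conditions a passenger class passes through; the function name `passenger_class_path` and the order of the checks are assumptions for illustration, and the sketch makes no claim about survival outcomes.

```python
def passenger_class_path(pclass: int) -> list:
    """Return the yes/no conditions followed to isolate one passenger class."""
    if pclass not in (1, 2, 3):
        raise ValueError("pclass must be 1, 2, or 3")

    path = []
    # First condition: is the passenger in third class?
    if pclass == 3:
        path.append("pclass == 3: yes")
        return path
    path.append("pclass == 3: no")

    # Second condition, branching off the first: second class?
    if pclass == 2:
        path.append("pclass == 2: yes")
        return path
    path.append("pclass == 2: no")

    # Not Class 3 and not Class 2, so only first class remains.
    return path


for pclass in (1, 2, 3):
    print(pclass, passenger_class_path(pclass))
```

This is why a three-category variable needs two yes/no conditions: each condition can only separate one group from the rest.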
Group Exercise:
Titanic Passenger Survival Prediction
They’re most likely going to survive.
They probably won’t.
Group Exercise:
Titanic Passenger Survival Prediction (Cont.)
1. Given that the root node is sex (male = 1), why would you think that this is the best way to
predict if someone died when the Titanic sank?
2. What is the probability of death, given you are a male in second or third class?
3. What is the survival rate of a female in first or second class who paid more than $32?
4. If you were a 7-year-old boy in third class, would you be more likely to survive than a
7-year-old boy in first class? What's the difference in your chances of survival?
Decision Trees Visually Explained
Today We’ve...
● Defined data science.
● Outlined the Data Science Workflow.
● Defined supervised and unsupervised learning.
● Explored decision trees.
● Examined the differences between regression and classification.
AMA: Ask Me Anything!
What’s Next?
Let us know what you liked about this class
and what we can improve.
Complete a quick survey at:
ga.co/introclass
This survey is mobile- and laptop-friendly.
Want to Learn More?
Career-Changing Courses: bit.ly/fulltimeclasses
10–12-week immersive courses developed to help you make a career pivot.
Skill-Building Courses: bit.ly/parttimeclasses
8–10-week part-time or 1-week accelerated courses developed to help you advance your career.
Short-Form Workshops: bit.ly/galaworkshops
Learn a skill in as little as two hours, or tackle something in more depth for 1–2 days.
Thank You!