[go: up one dir, main page]

0% found this document useful (0 votes)
37 views63 pages

DSILYTC Session 3 - Data Visualization and Presentation

This document covers data visualization and presentation techniques using spreadsheet and statistical software, focusing on summarizing categorical and numerical data. It includes methods for creating frequency distributions, bar charts, pie charts, and histograms, along with practical examples using the Titanic dataset. Additionally, it discusses the importance of skewness in data interpretation and the use of pivot tables for data analysis.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views63 pages

DSILYTC Session 3 - Data Visualization and Presentation

This document covers data visualization and presentation techniques using spreadsheet and statistical software, focusing on summarizing categorical and numerical data. It includes methods for creating frequency distributions, bar charts, pie charts, and histograms, along with practical examples using the Titanic dataset. Additionally, it discusses the importance of skewness in data interpretation and the use of pivot tables for data analysis.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 63

DSILYTC:

Introduction to
Analytics
SESSION 4: DATA VISUALIZATION AND PRESENTATION
Learning Outline and Outcomes

 Data Visualization and  CLO1: Extract data from


presentation using primary and secondary
spreadsheet and
statistical software sources and convert them
into processed information
 Summarizing Data Using
Graphs and Charts using useful for decision-making.
various Software  CLO5: Construct and present
 Frequency Distribution information using data-
using Count if Function appropriate for visualization
and Pivot Table tools and techniques.
Types of Data
Summarizing Categorical Data
(Qualitative Data Type)
 Frequency Distribution
 Relative Frequency Distribution
 Percent Frequency Distribution
 Bar Chart
 Pie Chart
Summarizing Categorical Data

 Frequency Distribution: tabular summary of data


showing the number (frequency) of
observations in each of several non-overlapping
categories or classes.
 Theobjective is to provide insights about the
data that cannot be quickly obtained by
looking only at the original data.
Frequency Distribution Example
 Softdrink purchasers were asked to choose one
of the five popular soft drinks: Coca-Cola, Diet
Coke, Dr. Pepper, Pepsi, or Sprite.
Table 1: Soft Drinks Selected by a Sample of 20 Purchasers  Create your
Coca-Cola Pepsi Dr. Pepper own excel file
Diet Coke Dr. Pepper Dr. Pepper containing this
Dr. Pepper Coca-Cola Pepsi data.
Pepsi Diet Coke Diet Coke
 Follow the steps
Pepsi Pepsi Dr. Pepper
as we discuss.
Pepsi Pepsi Sprite
Pepsi Pepsi
Frequency Distribution Formula
 UsingEXCEL’S COUNTIF Function to construct a
Frequency Distribution
 =COUNTIF(range, criteria)
 range is the range of cells where you have your
data (e.g., A2:A11)
 criteriais the category you want to count (e.g.,
D2 for Coca-Cola).
Frequency Distribution Example
 Using
EXCEL’S COUNTIF Function to construct a
Frequency Distribution
 =COUNTIF(range, criteria)
Table 1: Soft Drinks Selected by a Sample of 20 Purchasers Table 2: Frequency
Coca-Cola Pepsi Dr. Pepper Soft drinks Frequency
Diet Coke Dr. Pepper Dr. Pepper Coca-Cola 2
Dr. Pepper Coca-Cola Pepsi Diet Coke 3
Pepsi Diet Coke Diet Coke Dr. Pepper 5
Pepsi Pepsi Dr. Pepper Pepsi 9
Pepsi Pepsi Sprite Sprite 1
Pepsi Pepsi Total 20
Summarizing Categorical Data

 RelativeFrequency Distribution: the proportion


of times each value appears compared to the
total number of data points.
 You
calculate this by dividing the frequency of
each value by the total number of values.
Relative and Percentage
Frequency Example
Table 3: Formula for Relative / Percentage Frequency Table 3: Relative / Percentage Frequency
Relative Percentage Relative Percentage
Soft drinks Frequency Soft drinks
Frequency Frequency Frequency Frequency
Coca-Cola =COUNTIF(A$2:A$21,C2) =D2/$D$7 =E2*100 Coca-Cola 0.1 10
Diet Coke =COUNTIF(A$2:A$21,C3) =D3/$D$7 =E3*100 Diet Coke 0.15 15
Dr. Pepper =COUNTIF(A$2:A$21,C4) =D4/$D$7 =E4*100 Dr. Pepper 0.25 25
Pepsi =COUNTIF(A$2:A$21,C5) =D5/$D$7 =E5*100 Pepsi 0.45 45
Sprite =COUNTIF(A$2:A$21,C6) =D6/$D$7 =E6*100 Sprite 0.05 5
Total =SUM(D2:D6) =SUM(E2:E6) =SUM(F2:F6) Total 1.00 100

Note: Column A-B and Rows 9-21 are not shown.


Summarizing Categorical Data
 Bar Chart: displays qualitative data using bars to
represent categories, with one axis for labels and another
for values, and bars spaced to highlight distinct
categories.
 The horizontal axis shows the different categories or
labels. The vertical axis shows how often each category
occurs, or how big its relative size or percentage is.
 Each category has a bar. The height of the bar shows the
amount or frequency for that category.
 The bars are separated to highlight that each category is
distinct.
Using Excel to Create Bar Charts
 Select Data: Highlight the data you want to include in
the chart.
 Insert Chart: Go to the Insert tab on the Ribbon.
 Choose Bar Chart: Click on the Bar Chart icon and select
the desired bar chart type.
 Adjust Chart: Customize the chart using the Chart Tools
options (e.g., add titles, adjust axis labels).
 Format Chart: Use the Format tab to style the chart as
needed.
Bar Chart Example
Pareto Diagram
 A specific type of bar chart where bars are arranged in
descending order of height from left to right, with the
most frequently occurring cause appearing first. This
helps highlight the most significant issues or causes
according to the Pareto Principle.
 In quality control, bar charts are used to identify the most
important causes of problems.
Summarizing Categorical Data
 Pie Chart: displays relative frequencies and percentages
for categorical data.
 First draw a circle; then use the relative frequencies to
subdivide the circle into sectors that correspond to the
relative frequency for each class.
 Since there are 360 degrees in a circle, a class with a
relative frequency of .25 would consume .25(360) = 90
degrees of the circle.
Pie Chart Example

 Almost one-half of the


customers surveyed
preferred Pepsi (looking at
the left side of the pie).
 The second preference is for
Dr. Pepper with 25% of the
customers opting for it.
 Only 5% of the customers
opted for Sprite.
Using Excel to Create Pie Charts

 To display a pie chart, click anywhere in the bar chart to


display three tabs (Analyze, Design, and Format) located
on the ribbon under the heading PivotChart Tools.
 Click the Design tab and choose the Change Chart Type
option to display the Change Chart Type dialog box.
 Click the Pie option and then OK.
Summarizing Numerical Data
(Quantitative Data Type)
 Frequency Distribution
 Relative Frequency Distribution
 Percent Frequency Distribution
 Dot Plot
 Histogram
 Cumulative Distribution
 Stem and Leaf Display
Categorical vs. Numerical Data

 The process for computing frequency, relative frequency,


and percent frequency is similar for both categorical and
numerical data, but there are some key differences in
how the data is organized.
Categorical vs. Numerical Data

 Frequency
Categorical
Data: You count how many times
each category (e.g., "male" or "female")
appears.
Numerical Data: You group the data into
intervals or classes (e.g., "1-10", "11-20") and
count how many data points fall within each
range. =COUNTIFS(A2:A100,">=0", A2:A100,"<=10")
Categorical vs. Numerical Data
 Relative Frequency
CategoricalData: The relative frequency is the
frequency of each category divided by the
total number of observations.
Numerical Data: The same principle applies, but
you calculate the relative frequency for each
interval or class instead of individual categories.
=COUNTIFS(A2:A100,">=0", A2:A100,"<=10") / B1
Categorical vs. Numerical Data

 Percent Frequency
CategoricalData: Multiply the relative
frequency of each category by 100 to get the
percentage.
Numerical Data: Multiply the relative frequency
for each interval or class by 100 to get the
percentage.
Categorical vs. Numerical Data

 Key Difference
Categorical Data: Focuses on individual, distinct
categories.
Numerical Data: Requires grouping numbers
into intervals or classes before calculating
frequencies.
Summarizing Numerical Data
(Quantitative Data Type)
 Frequency Distribution
 Relative Frequency Distribution
 Percent Frequency Distribution
 Dot Plot
 Histogram
 Cumulative Distribution
 Stem and Leaf Display
LabActivity: Titanic Data Set
 Using the "Sex" column from the Titanic dataset, summarize the categorical
data through the following:
 Create a Frequency Distribution for male and female passengers.
 Calculate the Relative Frequency Distribution for each category.
 Present the Percent Frequency Distribution for both male and female
passengers.
 Construct a Bar Chart to visually display the frequency of male and
female passengers.
 Create a Pie Chart to show the proportion of male and female
passengers.
 Submit a table with the Frequency, Relative Frequency, and Percent
Frequency, along with the Bar Chart and Pie Chart. A short analysis of the
findings will be a plus.
LabActivity: Titanic Data Set
 Using the “Age" column from the Titanic dataset,
summarize the numerical data through the following:
 Createa frequency distribution by grouping
passengers' ages into intervals (e.g., 0-10, 11-20, etc.).
 Calculatethe relative and percent frequencies for
each age group.
 A table showing the Frequency, Relative Frequency, and
Percent Frequency for age groups.
Determining Number of Classes
 Guidelines
Use between 5 and 20 classes, unless there is a
standard classes available.
Data sets with larger number of elements usually
require a larger number of classes.
Smaller data sets usually fewer classes.
The goal is to use enough classes to show the
variation in the data, but not so many classes
that some contain only a few data items.
Determining Number of Classes

 Guidelines
Inpractice, the number of classes and the
appropriate class width are determined by trial
and error.
Ultimately,the analyst uses judgement to
determine the combination of the number of
classes and class width that provides the best
frequency distribution for summarizing the data.
Pivot Tables in Excel

 Allows you to summarize, analyze, and


explore large datasets by automatically
organizing and aggregating data.
 Widely used for business analytics and
reporting because of their flexibility and
ease of use
Pivot Tables in Excel
 Steps
1. Highlight the entire dataset, including headers,
or click anywhere inside your dataset.
2. Go to the Insert tab on the Ribbon.
3. Click on PivotTable in the "Tables" group.
4. In the dialog box that appears, Excel will auto-
select your dataset. You can adjust the range
if necessary.
Pivot Tables in Excel
 Steps
5. Choose whether
you want the Pivot
Table in a New
Worksheet or an
Existing Worksheet,
then click OK.
Pivot Tables in Excel
 Steps
6. After the Pivot Table
is created, you'll see
a blank grid and a
PivotTable Fields
pane on the right.
7. The Fields pane
contains the column
headers from your
dataset.
Pivot Tables in Excel
 Titanic Data Set
You want to analyze
how Survival Rates
differed based on
Passenger Class and
Gender.
Pivot Tables in Excel
 Titanic Data Set
1. Drag the “Pclass” field to the Rows area. This
will group the data by passenger class (1st,
2nd, 3rd class).
2. Drag the “Sex” field to the Columns area. This
will create a comparison of survival rates
between Male and Female passengers.
Pivot Tables in Excel
 Titanic Data Set
3. Drag the “Survived” field to the Values area. By
default, Excel will count the number of
"Survived" entries.
4. To get survival rates, you can:
• Right-click the Survived field in the Values area.
• Select Value Field Settings.
• Change the Summarize Values By option to Sum
(this counts the number of survivors).
Dot Plot
A type of data visualization used to show
the distribution or frequency of data points.
 Each dot in a dot plot represents a single
data point or occurrence.
 A simple way to compare frequencies or
observe patterns in small to moderately
sized datasets.
Dot Plot using Titanic Data Set
 Each dot represents
one survivor, and
the dots are
stacked vertically
for each age group
to show their
frequency. This
simple plot provides
a clear visual
comparison of how
many survivors
came from each
age range.
Histograms
A type of data visualization that displays the
distribution of a dataset by grouping data
into bins or intervals and showing the
frequency of data points within each bin.
 Unlike bar charts, which display categorical
data, histograms are used for continuous or
quantitative data.
Histogram using Jamovi
 Open Jamovi and import the
Titanic dataset (you can use
the same .csv file).
 Click on the Descriptives
module from the toolbar.
 Choose the "Age" variable
from the dataset to include it
in the analysis.
 Under the Plots option in the
Descriptives module, check
the Histogram option.
Histogram using Excel

https://www.youtube.com/watch?v=yh5ihdHwmTk
Histogram using Excel
 Insert a Histogram: Go to the
Insert tab.
 In the Charts group, click on
Insert Statistic Chart.
 Choose Histogram from the
dropdown options.
 After the histogram appears,
right-click on the x-axis.
 Select Format Axis.
 You can adjust the bin width
or number of bins to better
display your data.
Skewness in Data Visualization
 Itis a measure of how unevenly data is
distributed in a dataset. It tells us about the
direction and degree of asymmetry in the
data.
 Symmetrical (Zero Skewness):When data is evenly
distributed around a central point (like a bell curve)
 Positive Skew (Right Skew):The tail of the distribution is
longer on the right side.
 Negative Skew (Left Skew):The tail of the distribution is
longer on the left side.
Skewness in Data Visualization
Skewness in Data Visualization
 Helps identify how data points are spread
out. A symmetrical distribution suggests that
the data is balanced, while skewed
distributions indicate potential outliers or
trends.
 Provides context for data interpretation,
helps clarify the significance of averages,
and aids in making informed decisions
based on the distribution of data.
Skewness in Data Visualization
 Impact on Averages: In skewed distributions,
the mean (average) can be misleading. For
example, in a positively skewed dataset (like
income), the average may be higher than
most values, which can distort the true
picture of the data. In such cases, the
median (the middle value) might provide a
better representation.
Skewness in Data Visualization
 Data Interpretation: Reports can use
skewness to explain trends. For instance, if a
report shows a negative skew in test scores,
it can highlight that most students
performed well, with a few low scores
affecting the average.
Skewness in Data Visualization
 Decision Making: Understanding skewness
can influence decisions. For instance, if a
company sees a positive skew in sales data,
it might investigate high-performing
products to replicate their success.
Skewness in Data Visualization
 Decision Making: Understanding skewness
can influence decisions. For instance, if a
company sees a positive skew in sales data,
it might investigate high-performing
products to replicate their success.
Skewness in Data Visualization
 Statistical
Analysis: Skewness can affect the
choice of statistical tests. Many tests assume
normality (symmetrical distribution), so
knowing the skewness helps analysts choose
the appropriate methods.
Stem and Leaf Display
 Isa method for displaying quantitative data in a
way that retains the original data values while
also showing their distribution. It combines the
benefits of a histogram and a table, making it
easy to visualize data and see individual data
points.
 An effective tool for organizing and visualizing
numerical data, making it easier to identify
patterns, distributions, and individual data points.
Stem and Leaf Display
 Stem: The stem consists of the leading digits of
each data point. For example, in the number 45,
the stem would be 4.
 Leaf:
The leaf consists of the last digit of each
data point. For the same number 45, the leaf
would be 5.
Stem and Leaf Display
 How to Create a Stem-and-Leaf Display
 Organize the Data: Sort the data in ascending
order.
 Determine the Stems: Identify the unique leading
digits and list them in a vertical column.
 Add Leaves: Next to each stem, write the
corresponding leaves in increasing order.
Stem and Leaf Display
Stem and Leaf Display

 Stem and leaf


plots are the
basis for bar
graphs and
curves.
Crosstabulation
 Often referred to as a "crosstab," is a statistical tool
used to analyze the relationship between two or
more categorical variables.
 Itdisplays the distribution of one variable across
the categories of another variable in a matrix
format, allowing for easier comparison and
analysis.
Crosstabulation
 Key Features
 Presented in a table where one variable's categories
are listed in rows and another's in columns. The cells of
the table show the frequency (count) of occurrences
for each combination of categories.
 Helpsidentify patterns or relationships between
variables.
 In
addition to counts, crosstabs often include row,
column, or total percentages, providing insights into
the distribution of responses.
Crosstabulation
 Can be used when…
 one
variable is categorical, and the other is
quantitative,
 both variables are categorical, or
 both variables are quantitative.
 Theleft and top margin labels define the classes for
the two variables.
Crosstabulation

 Example:
MEAL PRICE
Quality
Rating
$10-19 $20-29 $30-39 $40-49 Total
Crosstabulation of
Good 42 40 2 0 84
quality rating and
Very Good 34 64 46 6 150
Excellent 2 14 28 22 66
meal price of 300
TOTAL 78 118 76 28 300 restaurants

 Try to generate insights based on the crosstab.


Table vs. Graphs
 Inmost case, a graphical display is more useful
than a table for recognizing patterns and trends.
 Displaying data in creative ways can lead to
powerful insights.
 Scatter diagram and trendlines are useful in
exploring the relationship between two (2)
variables.
Scatter Diagram
 Also
known as a scatter plot, is a graphical
representation that shows the relationship
between two quantitative variables.
 Eachpoint on the diagram represents an
observation, with the x-axis representing one
variable and the y-axis representing the other.
Scatter Diagram
 If you were to plot the
relationship between hours
studied (x-axis) and test
scores (y-axis) for a group of
students, each student’s
data would be represented
as a point on the scatter
plot.

 Positive Correlation: If most points trend upwards, it suggests that more hours
studied are associated with higher test scores.
 No Correlation: If points are scattered without a discernible pattern, it suggests no
correlation between the two variables.
DSILYTC:
Introduction to
Analytics
SESSION 3: DATA VISUALIZATION AND PRESENTATION

You might also like