1. Introduction to Data & Descriptive Statistics
Business Statistics
June 30, 2025
Parasuram Balasubramanian
Assistant Professor of Strategy
About the course
• 4 weeks, 12 sessions
• Sessions: lecture + exercise
• Assigned seating and name tags
• Attendance – Programs office policy
• Punctuality
• Office hours
– Thursday: 11 am – 12 pm, 2 – 3 pm
– After lectures, or email for appointment
• Reference books if needed (see course outline)
• Assignments: 3 (due on 10/07, 18/07, 25/07)
• Mid-term exam: written, on paper (July 12, Saturday)
• Final exam: written + MS Excel (July 28, Monday)
• Use of AI
• Honor code and academic integrity
Session outline
1. Introduction to Data & Descriptive Statistics
2. Introduction to Probability
3. Probability Distributions
4. Sampling and Estimation
5. Hypothesis Testing
6. Comparing Groups: independent and paired t-tests
7. Simple Linear Regression
8. Simple Linear Regression
9. Multiple Linear Regression
10. Multiple Linear Regression
11. Non-Linear Response
12. Miscellaneous Topics and Recap
Outline
• What is Business Statistics? Its importance in business decision-making.
• Types of data: categorical vs. numerical, scales of measurement (nominal, ordinal,
interval, ratio)
• Data collection methods: surveys, experiments, observations, and sampling
techniques
• Descriptive statistics: measures of central tendency (mean, median, mode) and
dispersion (range, variance, standard deviation)
• Data visualizations
• In-class exercise
What is Statistics?
• Statistics - collection, analysis, interpretation, and presentation of data
• Descriptive statistics - Organize and summarize data
– Includes measures such as mean, median, standard deviation, and
visualization tools
• Inferential statistics - a formal method to draw conclusions from the data
– Use of probability to determine confidence in the conclusions
– Includes hypothesis testing, confidence intervals, regression analysis
What is Business Statistics?
• Business statistics is the application of statistical techniques to analyze and
interpret data for effective decision-making
• Involves collecting, summarizing, and interpreting quantitative information
to measure performance, identify trends, as well as predict future outcomes
• Use cases
– to analyze sales data
– to understand customer behavior
– to improve operational efficiency
– to enable managers to make data-driven decisions
Types of data
• Numerical data – count or measure attributes of a population
– Number of people in a town, amount of money, number of students in
university, stock price
– Discrete: no. of students, children; number of stocks in portfolio
– Continuous: height, weight, stock prices
• Categorical data
– Type of car: sedan, hatchback, SUV
– Movie genre: action, comedy, drama, kids
– Education level: dropout, high school, college, master’s, PhD
Methods of data collection
• Questionnaires, surveys
– Customer feedback, common in primary market research
• Experiments: hypothesis testing in controlled settings
– A/B testing in marketing campaigns, testing of new drugs
• Observational studies: collect data without interference (the researcher does not intervene)
– Tracking number of customers in a store
• Archival or secondary data sources
– Quantitative data such as stock prices, financial data
Descriptive vs Inferential statistics
• Descriptive statistics summarizes and provides a description of the sample
(dataset)
– Includes measures such as mean, median, standard deviation, minimum,
maximum
– Visualizations such as histograms, box plots
• Inferential statistics uses sample data to make an inference or prediction
about a population
– Hypothesis testing, confidence intervals, regression analysis
Scales of measurement
• Categorical (qualitative) variables
– Nominal: categories without any particular order, e.g., color, marital status
– Ordinal: categories that can be ordered, e.g., rankings, education level
• Numerical (quantitative) variables
– Discrete: takes on distinct, countable values, e.g., number of steps, number of births
– Continuous: can take on any value within a range, with infinitely many possible values in that range, e.g., distance walked, weight of newborn babies
Descriptive statistics
Measures of central tendency
• Central tendency – extent to which all data values group
around a typical or central value
– Mean, median, mode
• Mean is used quite often, unless outliers or extreme values
exist
• Since median is not sensitive to outlier values, this measure is
also commonly used
– E.g. median home prices are often reported
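• Report mean and median together: a large gap between them often signals skew or outliers (e.g., mean well above median suggests a right skew)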
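The sensitivity of the mean to extreme values can be illustrated with a minimal sketch using Python's standard `statistics` module. The home prices below are made-up numbers (in thousands), not course data.

```python
from statistics import mean, median, mode

# Hypothetical home prices in thousands (illustrative values only).
prices = [200, 210, 210, 230, 250]
# The same list with one extreme value appended.
prices_with_outlier = prices + [2000]

print("mean:", mean(prices))      # 220
print("median:", median(prices))  # 210
print("mode:", mode(prices))      # 210

# One extreme value pulls the mean far more than the median.
print("mean with outlier:", mean(prices_with_outlier))
print("median with outlier:", median(prices_with_outlier))
```

Here the single extreme price moves the mean from 220 to above 500, while the median only moves from 210 to 220.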
Measures of dispersion
• Variation – amount of dispersion or scattering of values
• Standard deviation: numerical measure of overall amount of
variation in a dataset
– Can be used to determine whether data values are close to or far off
from the mean
• A small standard deviation means values are bunched around the mean; a large one means they are spread out
• If ‘x’ is a data value, then ‘x – mean’ is called its deviation
• Variance: average of squares of deviations; the standard deviation is its square root
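As a rough sketch of the definitions above, the following computes deviations, variance, and standard deviation by hand and compares them with Python's `statistics` module. The data values are hypothetical. Note the assumption that "average" means dividing by the number of values (the population form); the sample variance divides by n - 1 instead.

```python
from statistics import mean, pstdev, pvariance, variance

# Hypothetical data values (illustrative only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

m = mean(data)                      # 5
deviations = [x - m for x in data]  # each value minus the mean

# Population variance: average of the squared deviations.
var_by_hand = sum(d ** 2 for d in deviations) / len(data)
sd_by_hand = var_by_hand ** 0.5

print(var_by_hand, sd_by_hand)         # 4.0 2.0
print(pvariance(data), pstdev(data))   # same quantities from the library

# Sample variance divides by n - 1, so it is slightly larger.
print(variance(data))
```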
Skewness
Figure: three panels showing a symmetric distribution, a left (negatively) skewed distribution, and a right (positively) skewed distribution
• Skewness: quantifies the degree to which a distribution’s tail extends toward
one side
• The long, thin part of the curve is the skewed portion
• What distribution does the income of a population follow?
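One common way to put a number on skewness is the moment coefficient (third central moment divided by the 1.5 power of the second). The slides do not say which formula the course uses, so the sketch below is only one possible choice, and the three datasets are made up for illustration.

```python
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness (population form): m3 / m2**1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical datasets (illustrative only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]       # long tail to the right
left_skewed = [-10, -3, -2, -2, -1, -1]  # long tail to the left

print(skewness(symmetric))     # zero for a symmetric dataset
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```

The sign of the coefficient matches the direction of the long tail.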
Population parameters and sample statistics
Measure              Population Parameter   Sample Statistic
Mean                 μ                      X̄
Variance             σ²                     s²
Standard Deviation   σ                      s
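For reference, the usual textbook definitions behind these symbols can be written out as follows. The population size N, the sample size n, and the individual values x_i are notation introduced here and do not appear on the slide. The n - 1 divisor in the sample variance is the standard convention, assumed here to be the one used in the course.

```latex
% Population parameters (N values in the population)
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}

% Sample statistics (n values in the sample)
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2, \qquad
s = \sqrt{s^2}
```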
Quartiles
• Quartiles split ordered data into four segments with an equal number of values in each segment
25% 25% 25% 25%
Q1 Q2 Q3
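• First quartile Q1: value for which 25% of observations are smaller and 75% are larger
• Q2: same as the median, with 50% of observations on either side
• Third quartile Q3: value for which 75% of observations are smaller and 25% are larger
• Interquartile range (IQR) = Q3 – Q1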
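A minimal sketch of computing quartiles and the IQR with Python's `statistics.quantiles`. The data are hypothetical. Software packages use slightly different interpolation conventions for quartiles, so results from other tools may differ a little; the "inclusive" method is just one choice.

```python
from statistics import median, quantiles

# Hypothetical, already-sorted data (illustrative only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 requests the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)    # 3.0 5.0 7.0
print(iqr)           # 4.0
print(median(data))  # 5, the same value as Q2
```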
Outliers
• Outliers are observations that lie far outside the typical range of a dataset
• They can arise from data entry errors, measurement issues, or genuine
extreme events
• Why should we care about outliers?
– They skew summary statistics (mean, variance)
– Distort model estimates and weaken predictive accuracy
– However, outliers may signal data quality problems or important rare
events
Detecting Outliers
• Boxplot (IQR Rule): values < Q1 – 1.5·IQR; or values > Q3 + 1.5·IQR
• Values more than 3 standard deviations away from the mean can also be flagged as extreme
• Visualization: scatterplots, histograms
• Example: Monthly sales data at the store level
• Confirm true value before deciding to exclude or model separately
• Use robust measures (median, trimmed mean) if genuine extreme values
are business-relevant
• Should outliers be dropped?
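The two detection rules above can be sketched as follows. The monthly sales figures are invented for illustration, the function names are hypothetical, and the quartile convention ("inclusive") is an assumption; other tools may place Q1 and Q3 slightly differently.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR (boxplot rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

def sd_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)
    return [x for x in values if abs(x - m) > threshold * s]

# Hypothetical monthly sales for one store (illustrative only).
monthly_sales = [52, 48, 50, 47, 51, 49, 53, 50, 46, 54, 50, 120]

print(iqr_outliers(monthly_sales))  # [120]
print(sd_outliers(monthly_sales))   # [120]
```

Both rules only flag candidates; whether to keep, drop, or model the flagged value separately is still a judgment call.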
Data visualization
• Histograms
• Scatter plot
• Box plot
• Bar graph, pie chart, dot plot, etc.
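To show what a histogram does under the hood, here is a small text-only sketch that groups values into equal-width bins and draws one mark per observation. The prices and the helper name `bin_counts` are hypothetical; in practice a charting tool would draw this for you.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical nightly prices (illustrative only).
prices = [85, 92, 110, 115, 118, 130, 135, 160, 210, 95]

counts = bin_counts(prices, 50)
for start, count in counts.items():
    # One '#' per observation in the bin.
    print(f"{start:>3}-{start + 49:<3} {'#' * count}")
```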
In-class exercise
• Dataset: “s1_hotels_Vienna.xlsx”
• Calculate descriptive statistics for hotel prices per night in Vienna
• Generate data visualizations
• Interpret your results