Statistics and Probability in Data Science
Presentation Summary
Introduction
▶ Statistics and probability are foundational to data science
▶ Essential for AI, machine learning, and deep learning
▶ Mathematics is embedded in every aspect of our lives
Data Types
▶ Qualitative Data
  ▶ Nominal: Categories with no inherent order (e.g., gender, race)
  ▶ Ordinal: Categories with a meaningful order (e.g., ratings)
▶ Quantitative Data
  ▶ Discrete: Countable values (e.g., number of students)
  ▶ Continuous: Any value within a range (e.g., weight)
Variable Types
▶ Discrete variables (often categorical)
▶ Continuous variables
▶ Independent variables
▶ Dependent variables
Statistics Overview
▶ Definition: A branch of applied mathematics concerned with data
collection, analysis, interpretation, and presentation
▶ Types:
  ▶ Descriptive Statistics: Summarizes the main features of a data set
  ▶ Inferential Statistics: Draws conclusions about a population from a sample
▶ Key concepts: Population and Sample
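As a minimal sketch of descriptive statistics, the snippet below summarizes a small sample with Python's standard `statistics` module. The exam scores are made-up illustration data, not values from the presentation.

```python
import statistics

# Hypothetical sample: exam scores of 8 students drawn from a larger population.
sample = [62, 70, 70, 75, 81, 84, 90, 92]

mean = statistics.mean(sample)      # central tendency: arithmetic average
median = statistics.median(sample)  # middle value of the sorted sample
mode = statistics.mode(sample)      # most frequent value
stdev = statistics.stdev(sample)    # sample standard deviation (n - 1 denominator)

print(mean, median, mode, round(stdev, 2))
```

These numbers only describe the sample itself. Saying anything about the wider population is the job of inferential statistics.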
Sampling Techniques
▶ Probability Sampling
  ▶ Random sampling
  ▶ Systematic sampling
  ▶ Stratified sampling
  ▶ Cluster sampling
▶ Non-probability Sampling
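The four probability sampling techniques listed above can be sketched with the standard library as follows. The population, the `region` field, and all function names are hypothetical and exist only for illustration. Each function assumes the requested size does not exceed what is available.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, each tagged with a region.
REGIONS = ("north", "south", "east", "west")
population = [{"id": i, "region": REGIONS[i % 4]} for i in range(100)]

def simple_random_sample(pop, n):
    """Random sampling: every member has the same chance of selection."""
    return random.sample(pop, n)

def systematic_sample(pop, n):
    """Systematic sampling: every k-th member after a random start."""
    k = len(pop) // n
    start = random.randrange(k)
    return pop[start::k][:n]

def stratified_sample(pop, key, per_stratum):
    """Stratified sampling: a random draw from every stratum (group)."""
    strata = defaultdict(list)
    for member in pop:
        strata[member[key]].append(member)
    return [m for group in strata.values()
            for m in random.sample(group, per_stratum)]

def cluster_sample(pop, key, n_clusters):
    """Cluster sampling: pick whole clusters at random, keep all their members."""
    clusters = defaultdict(list)
    for member in pop:
        clusters[member[key]].append(member)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [m for name in chosen for m in clusters[name]]

print(len(simple_random_sample(population, 10)))
print(len(stratified_sample(population, "region", 3)))
```

Stratified sampling draws from every group, while cluster sampling keeps only some groups but takes them whole.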
Information Gain and Entropy
▶ Entropy: Measure of uncertainty in data
▶ Information Gain: Reduction in entropy obtained by splitting on a
feature, i.e., how much information the feature provides about the
final outcome
▶ Used in decision trees and random forests
▶ Example: Predicting if a match can be played based on
weather conditions
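A minimal sketch of both quantities, using the match-and-weather scenario. The 14-day table is assumed toy data (9 days played, 5 not), and the `outlook` and `play` field names are illustrative. Entropy is computed as H = -Σ p·log2(p), and information gain as the entropy of the target minus the weighted entropy after splitting on a feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Assumed toy data: (outlook, match played?) -> number of days observed.
counts = {
    ("sunny", "yes"): 2, ("sunny", "no"): 3,
    ("overcast", "yes"): 4,
    ("rainy", "yes"): 3, ("rainy", "no"): 2,
}
rows = [{"outlook": o, "play": p}
        for (o, p), n in counts.items() for _ in range(n)]

base_entropy = entropy([r["play"] for r in rows])
gain_outlook = information_gain(rows, "outlook", "play")
print(round(base_entropy, 3), round(gain_outlook, 3))
```

A decision tree would compute this gain for every candidate feature and split on the one with the highest value.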
Probability Theory
▶ Probability: Measure of how likely an event will occur
▶ Key concepts:
  ▶ Random experiment
  ▶ Sample space
  ▶ Event
▶ Types of events: Disjoint and Non-disjoint
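To make these terms concrete, the sketch below treats one roll of a fair die as the random experiment (an assumed example). The sample space and events are plain Python sets, and `prob` is a hypothetical helper that assumes equally likely outcomes.

```python
from fractions import Fraction

# Random experiment: roll one fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space), equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
odd = {1, 3, 5}
greater_than_3 = {4, 5, 6}

# Disjoint events share no outcomes, so their probabilities simply add.
p_even_or_odd = prob(even | odd)
assert even.isdisjoint(odd)
assert p_even_or_odd == prob(even) + prob(odd)

# Non-disjoint events overlap, so the shared outcomes must be subtracted once.
p_even_or_gt3 = prob(even | greater_than_3)
assert p_even_or_gt3 == prob(even) + prob(greater_than_3) - prob(even & greater_than_3)

print(p_even_or_odd, p_even_or_gt3)
```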
Types of Probability
▶ Marginal Probability: Probability of an event on its own, irrespective
of any other event
▶ Joint Probability: Probability of two events occurring together
▶ Conditional Probability: Probability of an event given that another
event has occurred
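The three types can be read off a single contingency table. The sketch below uses invented counts for 100 candidates, cross-classified by whether they took training and whether they received a job offer. The table and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical counts for 100 candidates: (training, outcome) -> count.
counts = {
    ("trained", "offer"): 30,
    ("trained", "no_offer"): 15,
    ("untrained", "offer"): 20,
    ("untrained", "no_offer"): 35,
}
total = sum(counts.values())

def joint(training, outcome):
    """P(training AND outcome): both events occur together."""
    return Fraction(counts[(training, outcome)], total)

def marginal_outcome(outcome):
    """P(outcome): sum the joint probabilities over every training status."""
    return sum(joint(t, outcome) for t in ("trained", "untrained"))

def marginal_training(training):
    """P(training): sum the joint probabilities over every outcome."""
    return sum(joint(training, o) for o in ("offer", "no_offer"))

def conditional(outcome, given_training):
    """P(outcome | training) = P(training AND outcome) / P(training)."""
    return joint(given_training, outcome) / marginal_training(given_training)

p_offer = marginal_outcome("offer")
p_trained_and_offer = joint("trained", "offer")
p_offer_given_trained = conditional("offer", "trained")
print(p_offer, p_trained_and_offer, p_offer_given_trained)
```

In this invented table, conditioning on training raises the probability of an offer above its marginal value, which is the kind of dependence that conditional probability is meant to expose.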
Probability Distribution Functions
▶ Probability Density Function (PDF)
▶ Normal Distribution
▶ Central Limit Theorem
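A small simulation can illustrate the Central Limit Theorem: means of repeated samples from a non-normal distribution cluster around the population mean in an approximately normal shape. The choice of a uniform(0, 1) source, the sample size of 30, and the 5000 repetitions are all assumptions made for this sketch.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from uniform(0, 1), a distribution that is clearly not normal."""
    return statistics.fmean([random.random() for _ in range(n)])

n = 30
means = [sample_mean(n) for _ in range(5000)]

# uniform(0, 1) has mean 0.5 and variance 1/12, so the CLT predicts that
# sample means have standard deviation sqrt(1/12) / sqrt(n).
expected_sd = (1 / 12) ** 0.5 / n ** 0.5
observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)

# For a normal distribution, about 68% of values lie within one standard deviation.
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)

print(round(observed_mean, 3), round(observed_sd, 4), round(expected_sd, 4))
print(round(within_one_sd, 3))
```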
Bayes’ Theorem
▶ Relates a conditional probability P(A|B) to its inverse P(B|A)
▶ Formula: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
▶ Used in naive Bayes algorithm (e.g., spam filtering)
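As a worked instance of the formula in the spam-filtering setting, the sketch below computes the probability that a message is spam given that it contains a particular word. All three input probabilities are invented for illustration, and `bayes_posterior` is a hypothetical helper. The denominator P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) from the law of total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers, not taken from any real mail corpus.
p_spam = 0.2               # P(A): prior probability that a message is spam
p_word_given_spam = 0.6    # P(B|A): the word appears in a spam message
p_word_given_ham = 0.05    # P(B|not A): the word appears in a legitimate message

p_spam_given_word = bayes_posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(round(p_spam_given_word, 2))  # about 0.75
```

A naive Bayes classifier applies the same update to many words at once, under the simplifying assumption that words occur independently given the class.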
Inferential Statistics
▶ Forms inferences and predictions about a population based on
a sample
▶ Point Estimation vs. Interval Estimation
▶ Confidence Interval and Margin of Error
▶ Methods of Estimation:
  ▶ Method of Moments
  ▶ Maximum Likelihood
  ▶ Bayes Estimator
  ▶ Best Unbiased Estimator
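Point estimation, interval estimation, and the margin of error can be sketched together for a population mean. The sample below is synthetic (generated from an assumed normal population), and the interval uses a normal (z) approximation. For small samples a t-based critical value would give a somewhat wider interval.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic sample of 50 heights in cm; the population parameters are assumptions.
sample = [random.gauss(170, 8) for _ in range(50)]

def mean_confidence_interval(data, confidence=0.95):
    """Return (point estimate, margin of error, (lower, upper)) for the mean."""
    n = len(data)
    point_estimate = statistics.fmean(data)                 # point estimation
    standard_error = statistics.stdev(data) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin_of_error = z * standard_error
    interval = (point_estimate - margin_of_error,           # interval estimation
                point_estimate + margin_of_error)
    return point_estimate, margin_of_error, interval

estimate, margin, (lower, upper) = mean_confidence_interval(sample)
print(round(estimate, 2), round(margin, 2), (round(lower, 2), round(upper, 2)))
```

The point estimate is a single best guess, while the confidence interval is that guess plus or minus the margin of error.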
Conclusion
▶ Statistics and probability are crucial for data science
▶ Understanding these concepts helps in:
  ▶ Data analysis
  ▶ Machine learning model development
  ▶ Interpreting results
  ▶ Making informed decisions