DA Notes

Unit 1

Data Analytics
“Data” refers to raw facts, figures, or symbols that represent conditions, ideas, objects, or
events. Data on its own does not carry meaning until it is processed or analyzed.
 Structured Data:
Data that is organized in a fixed format, usually in rows and columns like in relational
databases.
Example: Tables in MySQL or Excel sheets.
 Semi-Structured Data:
Data that does not reside in a traditional table but still has some organizational properties
like tags or markers.
Example: XML, JSON files.
 Unstructured Data:
Data that has no predefined format or organization. It is often complex and difficult to
analyze directly.
Example: Images, videos, emails, audio files, social media posts.
 Categorical Data:
Categorical data represents characteristics or labels that can be divided into different
groups or categories. It can be nominal (no order) or ordinal (ordered).
Example:
o Nominal: Gender (Male, Female)
o Ordinal: Education Level (High School, Bachelor, Master, PhD)
Temporal and Spatial Data:
 Temporal Data: Data related to time-based events. It records time-related attributes such
as date, time, or duration.
Example: Daily temperature readings, stock market prices over time.
 Spatial Data: Data that represents the physical location and shape of objects. Often used
in geographical information systems (GIS).
Example: Maps, GPS coordinates, land boundaries.
Temporal Data (Time-Based Data)
Definition:
Temporal data refers to data that is associated with time—such as dates, times, or timestamps.
Importance in Analytics:
 Trend Analysis: Helps identify patterns or trends over time.
Example: Sales increase during festivals or weekends.
 Forecasting: Used in predictive models to forecast future outcomes.
Example: Weather prediction, stock price forecasting.
 Performance Monitoring: Tracks performance over time.
Example: Monthly website traffic, daily machine productivity.
 Event Detection: Detects anomalies or events at a specific time.
Example: Sudden drop in server response time.
Spatial Data (Location-Based Data)
Definition:
Spatial data is related to location—such as coordinates (latitude and longitude), addresses, or
regions.
Importance in Analytics:
 Geographic Insights: Understand location-specific trends or behaviors.
Example: Most online orders come from urban areas.
 Resource Optimization: Helps in route planning, logistics, and delivery.
Example: Delivery companies use spatial data for shortest path algorithms.
 Risk Management: Analyzes environmental or regional risks.
Example: Flood-prone zones, crime heatmaps.
 Marketing Strategies: Target location-specific advertising or promotions.
Example: Local ads on Google/Facebook based on user’s location.
*********************************************************************
Real-World Applications of Data Analytics
 Healthcare – Predicting disease outbreaks, patient diagnosis, personalized treatment.
 Finance – Fraud detection, risk assessment, stock market analysis.
 Retail – Customer segmentation, recommendation systems, inventory management.
 Marketing – Targeted advertising, customer behavior analysis.
 Transportation – Route optimization, traffic prediction, fleet management.
**************************************************************
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) involves analyzing datasets visually and statistically to
uncover patterns, detect anomalies, test hypotheses, and check assumptions. It helps in
understanding the structure, trends, and relationships within the data before applying formal
models.
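As a quick illustration, a minimal EDA pass in Python with pandas might look like the sketch below (the file name sales.csv and the column names are assumptions for the example):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset (file name and columns are assumed for illustration)
df = pd.read_csv("sales.csv")

# Structure and summary statistics
df.info()                 # column types and missing-value counts
print(df.describe())      # mean, std, quartiles for numeric columns

# Simple visual checks
df["revenue"].hist(bins=30)                  # distribution of a numeric column
df.plot.scatter(x="ad_spend", y="revenue")   # relationship between two columns
plt.show()
```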
ETL Process: Extract, Transform, Load
ETL stands for Extract, Transform, Load — a data integration process used to collect data
from various sources, clean and structure it, and then load it into a target system like a data
warehouse, database, or data lake.

A closely related variant is ELT (Extract, Load, Transform), in which the raw data is loaded first and then transformed inside the target system:
1. Extract – Data is collected from different sources such as Facebook, MySQL, Salesforce,
and Shopify.
2. Load – The raw data is stored in a data warehouse.
3. Transform – The data is cleaned, processed, and converted into a useful format inside
the warehouse.
4. Analyze – The transformed data is used to create charts, reports, and insights for
decision-making.
Steps of the ETL Process:
 Data Extraction
 Data Staging
 Data Transformation
 Data Validation
 Data Loading
 Monitoring and Logging
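A minimal ETL sketch in Python using pandas and SQLite; the source file, table name, and cleaning rules are assumptions chosen for illustration:

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a source (a CSV here; could be an API or database)
raw = pd.read_csv("orders_raw.csv")

# Transform: clean and restructure the data
clean = (
    raw.drop_duplicates()
       .dropna(subset=["order_id", "amount"])                      # basic validation
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
)

# Load: write the cleaned data into a target database table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
    print(f"Loaded {len(clean)} rows")                             # simple monitoring/logging
```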
*************************************************************************
****************************************************************************
Case Study: Data Analytics Life Cycle
Improving Student Performance and Retention
🎯 Phase 1: Discovery – Understanding the Problem
Business Objective:
A university is experiencing a drop in student academic performance and rising dropout
rates. The goal is to analyze student data to predict students at risk of poor performance or
dropping out and intervene early.
Key Questions:
 Which students are at risk of failing or dropping out?
 What are the factors affecting student performance?
Stakeholders:
 University administration
 Faculty members
 Academic advisors
Challenges:
 Vague definition of “at-risk” student
 Multiple departments involved with different data systems

🗃️ Phase 2: Data Preparation – Gathering & Cleaning Data


Data Sources:
 Student academic records (grades, attendance)
 Demographic data (age, gender, background)
 LMS (Learning Management System) usage logs
 Feedback and counseling reports
Data Cleaning:
 Handled missing attendance and assignment scores
 Standardized course codes and exam formats
 Merged data from multiple departments and systems
Challenges:
 Incomplete attendance records
 Inconsistent grade scales (CGPA vs percentage)
 Privacy and data security concerns
📊 Phase 3: Data Exploration (EDA) – Understanding the Data
Analytical Tasks:
 Univariate Analysis: Analyzed grade distributions across departments
 Bivariate Analysis: Checked correlation between attendance and grades
 Visualization:
o Bar charts for department-wise failure rates
o Heatmaps for subject correlation
o Pie charts for dropout causes
Findings:
 Students with attendance < 70% often scored below 50%
 LMS activity correlated positively with academic performance
 First-generation college students showed higher dropout rates
Challenges:
 Large number of variables
 Need for anonymization during visualization for privacy

Phase 4: Modeling – Predictive Analytics


Modeling Task:
Develop a predictive model to identify students at risk of failure or dropout.
Models Tested:
 Logistic Regression
 Decision Tree
 Random Forest (best performance)
Features Used:
 Attendance %
 Assignment submission count
 LMS activity hours
 Previous semester grades
Model Accuracy: 89%
Recall (for identifying at-risk students): 83%
Challenges:
 Imbalanced data (only 15% were dropouts)
 Need to balance interpretability for non-technical users (faculty)

✅ Phase 5: Validation – Model Testing


Tasks:
 Used 10-fold cross-validation
 Compared model performance on unseen test data
 Involved faculty to review prediction explanations
Challenges:
 Some false predictions flagged students who were not actually at risk
 Required human feedback loop to refine features

🚀 Phase 6: Deployment & Monitoring – Putting the Model into Action


Deployment Steps:
 Integrated model into the Student Management System (SMS)
 Developed a dashboard for faculty and advisors
 Sent monthly reports of at-risk students to academic counselors
Monitoring:
 Feedback from faculty and counselors was collected each semester
 Re-trained model with updated data every academic term
 Adjusted thresholds based on program-specific needs
Challenges:
 Initial resistance from some departments
 Continuous monitoring and re-training required to maintain accuracy
 Ensuring ethical use and student privacy
📌 Summary Table of the Lifecycle

Phase | Key Activity | Challenge Faced
Discovery | Defined student performance issues | Vague risk factors, multiple stakeholders
Data Prep | Cleaned and merged student records | Missing/inconsistent data
EDA | Identified patterns using stats & visuals | Complex variables, privacy
Modeling | Built predictive model for at-risk students | Imbalanced data, model interpretability
Validation | Tested accuracy and recall | False positives, faculty input needed
Deployment | Alerts, dashboards, periodic updates | Ethical use, data drift, change management

*****************************************************************************
Data Analytics Life Cycle: Phases and Challenges
The Data Analytics Life Cycle defines the process from identifying a business problem to
delivering insights or solutions using data. It consists of 6 major phases:
✅ 1. Discovery (Problem Definition)
 What it is: Understanding the business problem and defining the goal of the analysis.
 Example: For a store, the goal might be to predict next month’s sales.
What happens:
 Understand the business problem.
 Define goals and success criteria.
 Identify required resources.
Challenges:
 Vague or unclear business objectives.
 Miscommunication between business and data teams.
 Lack of domain knowledge.
✅ 2. Data Preparation (Data Collection & Cleaning)
 What it is: Gathering the relevant data needed for analysis.
 Example: Collect sales data, customer information, and product details.
What happens:
 Collect data from various sources (databases, APIs, files).
 Clean data (remove duplicates, fix errors, handle missing values).
 Integrate data into a usable form.
Challenges:
 Incomplete, inconsistent, or noisy data.
 Missing values or corrupted files.
 Difficulty in accessing relevant data sources.

✅ 3. Data Exploration (EDA - Exploratory Data Analysis)


 What it is: Exploring the data to find patterns or insights using visualizations and basic
statistics.
 Example: Plotting sales data to see if there are any seasonal trends.
What happens:
 Analyze data distributions and relationships.
 Identify patterns, trends, and anomalies.
 Visualize data using charts and graphs.
Challenges:
 Dealing with large volumes of data.
 Identifying misleading patterns.
 Choosing the right visualization or summary method.

✅ 4. Modeling (Apply Algorithms)


 What it is: Using algorithms (e.g., regression, classification) to build a model that can
make predictions based on the data.
 Example: Building a sales prediction model using historical sales data.
What happens:
 Apply statistical or machine learning models.
 Train and test models to make predictions.
 Optimize model parameters.
Challenges:
 Choosing the right algorithm.
 Overfitting or underfitting of models.
 Need for high computing power for large datasets.

✅ 5. Validation (Evaluation & Testing)


What happens:
 Evaluate model performance using metrics (accuracy, precision, recall, etc.).
 Validate using test data or cross-validation.
 Ensure the model generalizes well to new data.
Challenges:
 Biased or unbalanced datasets.
 Misinterpretation of model performance metrics.
 Lack of proper validation data.

✅ 6. Deployment and Monitoring


 What it is: Putting the model into a live environment so it can start providing insights or
predictions.
 Example: Using the model to predict next month’s sales automatically.
What happens:
 Deploy the model into production (real-time or batch).
 Monitor model performance over time.
 Provide dashboards or reports to stakeholders.
Challenges:
 Integration issues with existing systems.
 Performance degradation over time (data drift).
 Need for constant monitoring and updating of the model.
************************************************************************
Supervised Learning
Supervised Learning is a type of machine learning where the model is trained using labeled
data, meaning that each input in the training dataset is paired with a correct output (label or target).
Types of Supervised Learning:
1. Classification: Predicts discrete labels (categories).
Example: Spam vs. Not Spam, Disease Present vs. Absent.
Algorithms: Logistic Regression, Decision Trees, Random Forest, Support Vector
Machine (SVM), k-NN
2. Regression: Predicts continuous values.
Example: Predicting house prices, temperature, or salary.
Algorithms: Linear Regression, Ridge Regression, Lasso Regression

Algorithms in Supervised Learning


1. Linear Regression
 Type: Regression
 Use: Predict continuous output
 Characteristic: Assumes linear relationship between input and output.
 Example: Predicting salary based on years of experience.
2. Logistic Regression
 Type: Classification
 Use: Binary or multi-class classification
 Characteristic: Estimates probabilities using a logistic function.
 Example: Predicting if a tumor is malignant or benign.
3. Decision Tree
 Type: Both Classification & Regression
 Use: Predict class or value based on tree structure of decisions
 Characteristic: Easy to interpret, prone to overfitting
 Example: Approving loan applications based on income, credit score
4. Random Forest
 Type: Both
 Use: Ensemble of decision trees
 Characteristic: Reduces overfitting and improves accuracy
 Example: Predicting customer churn or defaulting on loans
5. Support Vector Machine (SVM)
 Type: Mostly Classification
 Use: Finds optimal hyperplane to separate classes
 Characteristic: Works well for high-dimensional data
 Example: Classifying images of cats vs dogs
6. k-Nearest Neighbors (k-NN)
 Type: Both
 Use: Based on similarity with nearby data points
 Characteristic: Simple, lazy learner, sensitive to noise
 Example: Handwriting recognition
7. Naive Bayes
 Type: Classification
 Use: Based on Bayes’ Theorem with independence assumption
 Characteristic: Fast, works well with text data
 Example: Spam filtering, sentiment analysis
📊 Comparison Table of Supervised Learning Algorithms

Algorithm | Type | Strengths | Weaknesses
Linear Regression | Regression | Simple, interpretable | Assumes linearity
Logistic Regression | Classification | Probabilistic output | Not good for complex patterns
Decision Tree | Both | Easy to visualize | Prone to overfitting
Random Forest | Both | Accurate, reduces overfitting | Less interpretable
SVM | Classification | Works well on complex data | Slow with large datasets
k-NN | Both | Easy to implement | Slow for large datasets
Naive Bayes | Classification | Fast, works well on text data | Assumes feature independence
*********************************************************
Unsupervised Learning
 Unsupervised learning is a machine learning approach where the model is trained on
unlabeled data, i.e., no output or target variable is provided.
 The algorithm tries to identify hidden patterns, structures, or groupings in the data on
its own.

🎯 Key Goals of Unsupervised Learning:


 Clustering: Group similar data points.
 Dimensionality Reduction: Simplify data while preserving structure.
 Anomaly Detection: Find unusual or unexpected patterns.
 Association Mining: Find rules and relationships among variables.

Clustering with K-Means – In-Depth


K-Means is one of the most widely used unsupervised clustering algorithms.
🔧 How it works:
1. Choose the number of clusters (K).
2. Initialize K centroids randomly.
3. Assign each data point to the nearest centroid.
4. Recalculate centroids as the mean of assigned points.
5. Repeat steps 3–4 until centroids stabilize (converge).
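The steps above translate almost line for line into NumPy; the following is a bare-bones sketch on random toy data (K = 3, and it assumes no cluster ever becomes empty):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # toy 2-D data
K = 3

# Step 2: initialize K centroids by picking K random data points
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # Step 3: assign each point to the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute centroids as the mean of assigned points
    # (assumes no cluster is empty, which is fine for this toy data)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 5: stop once the centroids stabilize
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```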

📘 Common Unsupervised Learning Techniques

Technique | Purpose | Example
K-Means Clustering | Partition data into groups | Customer segmentation
Hierarchical Clustering | Create nested clusters | Gene similarity analysis
DBSCAN | Density-based clustering | Identifying unusual patterns in traffic data
PCA (Principal Component Analysis) | Dimensionality reduction | Image compression, visualization
Autoencoders | Feature learning / reduction | Image noise removal
Apriori / FP-Growth | Association rule mining | Market basket analysis (e.g., Amazon/Flipkart)

How Initial Centroids Affect Clustering


🔁 Effect of Initial Centroids:
 Different initial centroids can lead to different final clusters.
 Poor initial placement can result in:
o Sub-optimal clustering
o Slower convergence
o Local minima (not the best solution)
✅ Solution:
 Use K-Means++ initialization which spreads centroids apart to improve stability and
clustering quality.
📌 Impact on Clustering:
 The choice of distance metric directly affects how “similarity” is measured.
 Inappropriate distance measures may group unrelated points together or split natural
clusters.

Centroid & Distance Metric Effects

Aspect Impact on Clustering

Initial Centroid Choice Affects convergence and final clusters

Distance Metric Defines similarity; affects shape and composition of clusters

Bad Initialization May trap K-Means in local minima, misleading clusters

Poor Metric Choice May group dissimilar points or miss natural clusters

******************************************************************************
UNIT 2
Data cleaning
Data cleaning is a foundational process in the data analytics lifecycle. It ensures that the data is
accurate, consistent, and ready for meaningful analysis, helping organizations make informed,
data-driven decisions.
Key Roles of Data Cleaning:
1. Improves Data Quality
2. Handles Missing Data
3. Ensures Consistency
4. Eliminates Irrelevant Data
5. Detects and Removes Outliers
6. Enables Accurate Analysis and Modeling
7. Reduces Bias and Errors
*****************************************************************************
Data Extraction
Data Extraction is the process of collecting raw data from different data sources—such as
databases, spreadsheets, web pages, APIs, sensors, or cloud platforms—and converting it into a
structured format for further analysis.
Data extraction is the process of retrieving relevant data from various sources (structured, semi-
structured, or unstructured) for analysis. It is the first step in the data analytics lifecycle and plays
a critical role in ensuring accurate and relevant insights.
Key Objectives:
 Collect relevant data from multiple sources.
 Prepare raw data for further processing.
 Enable efficient and effective decision-making through clean, accessible data.
******************************************************************************
Data Preprocessing
 Data preprocessing is the process of cleaning, transforming, and organizing raw data
to make it suitable for analysis or machine learning.
 Raw data often contains inconsistencies, missing values, noise, and errors—
preprocessing helps correct these issues to ensure better accuracy and performance in data
analytics.
Two Common Preprocessing Techniques:
🔹 1. Handling Missing Data
 Why it's needed: Real-world data is rarely complete; missing entries can skew analysis.
 Techniques:
o Deletion: Remove rows or columns with too many missing values.
o Imputation: Fill missing values using:
 Mean or median (for numerical data)
 Mode (for categorical data)
 Predictive methods (e.g., regression)
 Example:
If the age of a student is missing, it can be filled using the average age of the class.
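A minimal pandas sketch of both approaches on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [20, 22, None, 24, 21],
    "grade": ["A", None, "B", "B", "A"],
})

# Deletion: drop rows that have too many missing values
df_drop = df.dropna(thresh=2)

# Imputation: mean for numeric columns, mode for categorical columns
df["age"] = df["age"].fillna(df["age"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
print(df)
```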

2. Removing Outliers in Data Preprocessing


✅ What Are Outliers?
Outliers are data points that differ significantly from the rest of the data. They can result from
errors in data collection or represent rare but important cases. Regardless, they can skew statistical
analysis and affect the performance of many machine learning algorithms.

Methods to Detect and Remove Outliers:
 Z-Score Method: Flag points that lie more than about 3 standard deviations from the mean.
 IQR Method: Flag points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
 Visualization: Box plots and scatter plots help spot extreme values before deciding whether to remove or cap them.
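As an illustration, a minimal pandas sketch of the IQR rule on a toy score column:

```python
import pandas as pd

scores = pd.Series([52, 55, 51, 53, 54, 50, 120])   # 120 is a likely outlier

# Compute the interquartile range and the usual 1.5×IQR fences
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
cleaned  = scores[(scores >= lower) & (scores <= upper)]
print("Outliers:", outliers.tolist())
print("Cleaned:", cleaned.tolist())
```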
***************************************************************************
Hypothesis : Null and Alternative Hypothesis
Hypothesis testing
 Hypothesis testing is a statistical method used to make inferences or draw conclusions
about a population based on a sample of data.
 It is a way of testing whether a claim or assumption about a population parameter (such as
a mean, proportion, or variance) is likely to be true.

Steps to Formulate a Hypothesis :


Formulating a hypothesis involves a systematic approach to making assumptions and testing
them based on data.
🔍 Steps:
1. Identify the Research Problem
o Define what you are investigating.
o Example: Does exercise affect weight loss?
2. Define Variables
o Independent Variable (IV): Exercise
o Dependent Variable (DV): Weight loss
3. State the Null Hypothesis (H₀)
o Assumes no effect or no difference.
o Example: H₀: Exercise has no effect on weight loss.
4. State the Alternative Hypothesis (H₁ or Ha)
o Assumes an effect or a difference.
o Example: H₁: Exercise does affect weight loss.
5. Select the Appropriate Test
o Based on data type, sample size, and design.
6. Set the Significance Level (α)
o Common values: 0.05 or 0.01.
7. Collect Data & Conduct the Test
o Use tools like t-test, ANOVA, etc.
8. Make a Decision
o If p-value < α → Reject H₀
o If p-value ≥ α → Fail to reject H₀
Difference Between Null and Alternative Hypothesis
Aspect | Null Hypothesis (H₀) | Alternative Hypothesis (H₁ or Ha)
Meaning | Assumes no effect or difference | Assumes effect or difference exists
Symbol | H₀ | H₁ or Ha
Assumption | Status quo / default | Contradicts the null
Decision Rule | Reject only if strong evidence exists | Accepted if null is rejected
Example | H₀: μ₁ = μ₂ (no mean difference) | H₁: μ₁ ≠ μ₂ (means are different)

Types of t-Tests with Examples


The t-test is used to determine whether there is a significant difference between means.
✳️ A. One-Sample t-Test
 Use: Compare sample mean to a known population mean.
 Example:
A class has an average test score of 65. A new batch scores 70. Is this improvement
significant?
✳️ B. Independent Two-Sample t-Test
 Use: Compare means of two independent groups.
 Example:
Comparing average scores of boys vs. girls in a test.
Hypotheses:
 H₀: μ₁ = μ₂ (no difference)
 H₁: μ₁ ≠ μ₂ (there is a difference)
✳️ C. Paired Sample t-Test (Dependent t-test)
 Use: Compare means from same group at two different times.
 Example:
Measure students' weights before and after a fitness program.
Hypotheses:
 H₀: μ_d = 0 (no change)
 H₁: μ_d ≠ 0 (significant change)

Test Type | Use Case | Groups | Example
One-Sample t-Test | Sample vs. known population mean | 1 sample | Class average vs. national average
Two-Sample t-Test | Compare two independent groups | 2 samples | Male vs. Female performance
Paired t-Test | Same group at two different times | 1 paired sample | Before vs. After training test scores
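The three variants map directly onto SciPy functions; the sketch below uses small made-up samples and a 0.05 significance level:

```python
from scipy import stats

# One-sample t-test: new batch scores vs. a known class average of 65
new_batch = [70, 68, 72, 66, 74, 69]
t1, p1 = stats.ttest_1samp(new_batch, popmean=65)

# Independent two-sample t-test: boys vs. girls test scores
boys  = [62, 65, 70, 58, 66]
girls = [68, 72, 64, 70, 75]
t2, p2 = stats.ttest_ind(boys, girls)

# Paired t-test: same students' weights before and after a fitness program
before = [70, 82, 65, 90, 77]
after  = [68, 80, 66, 85, 75]
t3, p3 = stats.ttest_rel(before, after)

for name, p in [("one-sample", p1), ("two-sample", p2), ("paired", p3)]:
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(name, "p-value =", round(float(p), 4), "->", decision)
```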

*****************************************************************************
Central Tendency
Central Tendency is a statistical measure that identifies a single value as representative of an
entire dataset. It aims to provide an accurate description of the "center" or "average" of the data
distribution.
Main Measures of Central Tendency:
Mean (Arithmetic Average):
o The sum of all values divided by the number of values.
Median:
 The middle value when data is arranged in order.
 If even number of values, median is the average of the two middle values.
Mode:
 The value that occurs most frequently in a dataset.
 A dataset can have no mode, one mode (unimodal), or more than one mode (bimodal,
multimodal).
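For example, with Python's standard statistics module:

```python
import statistics

data = [10, 12, 12, 14, 18, 20, 12]

print("Mean:",   statistics.mean(data))    # 14  (sum 98 divided by 7 values)
print("Median:", statistics.median(data))  # 12  (middle value after sorting)
print("Mode:",   statistics.mode(data))    # 12  (most frequent value)
```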
Why Central Tendency is Important:
 Summarizes large data sets with a single value.
 Helps in comparing different data sets.
 Essential in decision-making and predictive analysis.
*************************************************************************
Probability Distributions

✅ 1. Discrete Probability Distributions


a) Bernoulli Distribution
 Description: Models a single trial with two outcomes: success (1) or failure (0).
 Parameter: p (probability of success)
 Example Use: Tossing a coin (Heads = 1, Tails = 0)

b) Binomial Distribution
 Description: Models the number of successes in a fixed number of independent Bernoulli
trials.
 Parameters: n (number of trials), p (probability of success)
 Example Use: Number of heads in 10 coin tosses

c) Poisson Distribution
 Description: Models the number of events in a fixed interval of time or space.
 Parameter: λ (average rate of occurrence)
 Example Use: Number of emails received in an hour
✅ 2. Continuous Probability Distributions
a) Normal (Gaussian) Distribution
 Description: Symmetrical bell-shaped distribution; most values cluster around the mean.
 Parameters: μ (mean), σ (standard deviation)
 Example Use: Heights, test scores, measurement errors

b) Exponential Distribution
 Description: Models time between events in a Poisson process.
 Parameter: λ (rate)
 Example Use: Time until next customer arrives at a service desk

c) Uniform Distribution
 Description: All values in a given interval are equally likely.
 Parameters: a, b (min and max values)
 Example Use: Random number generation, dice roll
Distribution | Type | Use Case Example
Bernoulli | Discrete | Tossing a coin
Binomial | Discrete | Number of successes in exams
Poisson | Discrete | Calls received per minute in a call center
Normal | Continuous | Heights, IQ scores, product dimensions
Exponential | Continuous | Time between server requests
Uniform | Continuous | Random selection in lottery
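All of these distributions are available in scipy.stats; a small sketch (the parameter values are arbitrary examples):

```python
from scipy import stats

# Discrete distributions
print(stats.bernoulli.pmf(1, p=0.5))        # P(heads) for a fair coin
print(stats.binom.pmf(6, n=10, p=0.5))      # P(exactly 6 heads in 10 tosses)
print(stats.poisson.pmf(3, mu=5))           # P(3 emails in an hour, average 5/hour)

# Continuous distributions
print(stats.norm.cdf(180, loc=170, scale=10))   # P(height <= 180 cm), mean 170, sd 10
print(stats.expon.cdf(2, scale=1 / 0.5))        # P(wait <= 2 min), rate λ = 0.5/min
print(stats.uniform.rvs(loc=1, scale=5))        # random value uniform on [1, 6]
```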

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
Hypothesis testing
 Hypothesis testing is a statistical method used to make inferences or draw conclusions
about a population based on a sample of data.
 It is a way of testing whether a claim or assumption about a population parameter (such as
a mean, proportion, or variance) is likely to be true.

Type II Error
 A Type II Error (also called false negative) occurs in hypothesis testing when:
 The null hypothesis (H₀) is false, but we fail to reject it.
 We miss detecting a real effect or difference that actually exists.
Type II error occurs when:
 The sample size is too small.
 The test is not sensitive enough.
 The significance level (α) is set too low (e.g., 0.01).
 The variability in data is high.
It is denoted by β (beta).
The power of the test is defined as 1 − β, which is the probability of correctly rejecting a false
null hypothesis.
Example:
Scenario: Medical Testing
 H₀ (Null Hypothesis): The patient does not have a disease.
 H₁ (Alternative Hypothesis): The patient has the disease.
If the patient actually has the disease (H₀ is false), but the test result says "No disease", then:
🎯 A Type II error has occurred — the disease is missed.

T-Test
 A t-test is used to compare the means of one or more groups to determine if there is a
statistically significant difference between them.
 There are different types of t-tests depending on the number of samples and the
relationship between them.
📌 Types of t-Tests
 One-Sample t-test
 Independent Two-Sample t-test
 Paired Sample t-test

Chi-Square Test
The Chi-Square test is used for categorical data to test the association between variables or to
determine if the observed frequency distribution matches an expected distribution.
Types of Chi-Square Tests:
 Chi-Square Goodness-of-Fit Test
 Chi-Square Test of Independence
************************************************************************************

Pearson’s Correlation Coefficient


The strength and direction of correlation are measured using a correlation coefficient. The most
common one is Pearson’s Correlation Coefficient (r), which ranges from -1 to +1:
 r = +1 → Perfect positive correlation
 r = -1 → Perfect negative correlation
 r = 0 → No correlation
Assumptions of Pearson’s Correlation
 Both variables should be continuous (interval or ratio scale).
 The relationship should be linear.
 Data should be normally distributed (or approximately normal).
 No significant outliers.
Use of Pearson’s Correlation:
 To find the degree of relationship between two continuous variables.
 Used in:
o Feature selection (in machine learning)
o Statistical analysis (e.g., income vs. education level)
o Business analytics (e.g., advertising spend vs. sales)

Example:
Data: Study Hours vs. Test Scores
Student | Hours Studied (X) | Test Score (Y)
A | 2 | 50
B | 4 | 60
C | 6 | 70
D | 8 | 80
E | 10 | 90

Step 1: Find the Means
x̄ = 6, ȳ = 70
Step 2: Apply the Formula
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² )
Result:
 r = 1 → Perfect positive correlation
 Interpretation: As study hours increase, test scores increase proportionally.
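The same result can be checked with SciPy:

```python
from scipy import stats

hours  = [2, 4, 6, 8, 10]
scores = [50, 60, 70, 80, 90]

r, p_value = stats.pearsonr(hours, scores)
print(r)   # 1.0 -> perfect positive correlation
```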
************************************************************************

Comparison between Normal and Binomial distributions:


Normal Distribution
 A continuous probability distribution.
 Symmetrical, bell-shaped curve.
 Describes variables that can take any real value.
 Common in natural phenomena like height, weight, test scores.
🔹 Binomial Distribution
 A discrete probability distribution.
 Represents the number of successes in a fixed number of independent Bernoulli trials
(yes/no outcomes).
 Used for counting things (e.g., number of heads in coin tosses).
Parameters

Distribution Parameters

Normal Mean (μ), Standard Deviation (σ)

Binomial Number of trials (n), Probability of success (p)

Shape and Characteristics

Feature | Normal Distribution | Binomial Distribution
Shape | Bell-shaped, symmetric | Symmetric if p = 0.5 and n is large; otherwise skewed
Values | Any real number (−∞ to ∞) | Integer values from 0 to n
Symmetry | Always symmetric | Only symmetric if p = 0.5


Summary Table

Feature | Normal Distribution | Binomial Distribution
Data Type | Continuous | Discrete (counts of events)
Shape | Bell curve | Skewed or symmetric
Parameters | μ (mean), σ (std. deviation) | n (trials), p (success probability)
Use Case | Heights, weights, scores | Coin flips, pass/fail, yes/no outcomes
Domain | (−∞, +∞) | 0, 1, 2, …, n

*****************************************************************
UNIT 3
Principal Component Analysis
PCA is a dimensionality reduction technique used in data analytics and machine learning to
reduce the number of features (variables) in a dataset while preserving as much variance
(information) as possible.

🎯 Purpose of PCA in Dimensionality Reduction

1. 🔻 Reduce Complexity
2. 🔍 Eliminate Redundancy

3. 🚀 Improve Model Performance


4. Aid in Visualization
How PCA Works (Simplified Steps):
1. Standardize the data (mean = 0, std = 1).
2. Compute the covariance matrix of the features.
3. Calculate eigenvalues and eigenvectors of the covariance matrix.
4. Sort eigenvectors based on eigenvalues (variance explained).
5. Select top k eigenvectors → These form the new axes (principal components).
6. Project data onto these axes → This gives reduced-dimension data.
How PCA Works – Step-by-Step
Step 1: Standardize the Data
 PCA is sensitive to scale. So, features are standardized (mean = 0, variance = 1).
Step 2: Compute Covariance Matrix
 Measures how features vary together.
 For n features, an n × n covariance matrix is computed.
Step 3: Compute Eigenvalues and Eigenvectors
 Eigenvectors → Directions of new feature space (principal components).
 Eigenvalues → Measure of variance explained by each component.
Step 4: Sort Eigenvalues and Select Top k Components
 Rank principal components by variance (eigenvalue).
 Choose top k components that explain the most variance.
Step 5: Transform Original Data
 Project the original data onto the new k-dimensional feature space.
 The result is a reduced dataset with minimal information loss.
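In practice these steps are usually delegated to scikit-learn; a minimal sketch on random toy data, keeping 2 components:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)                # toy data: 200 samples, 10 features

# Step 1: standardize the features
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: covariance, eigen-decomposition, selection and projection
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                     # (200, 2)
print(pca.explained_variance_ratio_)       # variance explained by PC1 and PC2
```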
🔄 How PCA Transforms Features into Principal Components
 PCA forms new features (principal components) as linear combinations of original
features.
 These new axes:
o Are orthogonal (uncorrelated).
o Capture the most meaningful variation in the data.
 The first principal component (PC1) captures the maximum variance, the second
captures the next highest, and so on.
📊 Impact of PCA on Model Performance
✅ Advantages:
1. Reduces Overfitting
2. Speeds Up Training
3. Improves Generalization
4. Enhances Visualization
⚠️ Limitations:
 Loss of Interpretability: Principal components are linear combinations of the original features, so they are harder to explain in business terms.
 Loss of Information: Dropping the lower-variance components discards some detail from the data.
************************************************************************
Decision Tree
A Decision Tree is a flowchart-like tree structure used for decision-making and classification.
Each internal node represents a test on an attribute, each branch represents the outcome of
the test, and each leaf node represents a class label or decision.

Applications:
 Classification problems
 Customer segmentation
 Risk analysis
              Weather
             /   |    \
        Sunny  Rainy  Cloudy
        /   \            \
     Play  Don't Play    Play

 Root node: "Weather"

 Branches: Sunny, Rainy, Cloudy

 Leaves: Decision to Play or Not

Working Steps:

1. Select the Best Attribute:


o Use a splitting criterion like Information Gain or Gini Index to choose the
feature that best separates the classes.
2. Split the Dataset:
o Divide the dataset into subsets based on the selected feature’s values.
3. Recursive Tree Building:
o Repeat steps 1–2 for each child subset until:
 All records in a node belong to the same class, or
 No more attributes are left, or
 A stopping condition (e.g., tree depth or minimum samples) is met.
4. Leaf Nodes:
o Assign class labels to terminal nodes (leaves).

Handling Overfitting

Overfitting happens when the model learns noise from training data.

Techniques to prevent overfitting:

1. Pruning (Post or Pre):


o Remove nodes with low significance.
2. Set Maximum Depth:
o Limit tree depth to control complexity.
3. Minimum Samples Split/Leaf:
o Ensure a node must have a minimum number of samples before splitting.
4. Cross-validation:
o Evaluate model performance on unseen data.
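A hedged scikit-learn sketch that combines the tree-building steps with the overfitting controls above (the Iris dataset is used only as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion selects the splitting measure ("gini" or "entropy");
# max_depth and min_samples_leaf limit complexity to reduce overfitting.
tree = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0
)

# Cross-validation estimates performance on unseen data
scores = cross_val_score(tree, X, y, cv=5)
print("Mean accuracy:", scores.mean().round(3))
```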

INFORMATION GAIN AND GINI INDEX


 Information Gain and Gini Index are two popular metrics used in decision tree algorithms
(like ID3, C4.5, and CART) to determine the best attribute to split the data at each node.
 Both metrics help identify the "best" attribute for splitting data by measuring the purity
or impurity of subsets resulting from a split.
 Information Gain (used in ID3 and C4.5) measures the reduction in entropy (disorder)
after a split.
 Gini Index (used in CART) measures impurity; a lower Gini value means a purer node.
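A small worked computation of entropy and Gini impurity for a hypothetical node containing 9 records of one class and 5 of the other:

```python
import math

def entropy(counts):
    """Entropy of a node given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node given class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

node = [9, 5]                    # 9 records of one class, 5 of the other
print(round(entropy(node), 3))   # ≈ 0.940  (high disorder)
print(round(gini(node), 3))      # ≈ 0.459  (impure node)
```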

*********************************************************************
Precision
 Precision is the ratio of correct positive predictions to the total predicted positives:
Precision = TP / (TP + FP)
 It tells us how many of the items we labeled as positive are actually positive.
 TP = True Positives (correctly predicted positive cases)
 FP = False Positives (incorrectly predicted as positive)
Examples: Email Spam Detection, Cancer Detection Test

Predicted Positive Predicted Negative

Actual Positive ✅ True Positive (TP) ❌ False Negative (FN)

Actual Negative ❌ False Positive (FP) ✅ True Negative (TN)

Precision focuses only on the Predicted Positive column.
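With scikit-learn, precision can be computed directly from true and predicted labels (a toy spam example):

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# TP = 3, FP = 1  ->  precision = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))
```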

************************************************************************************

Cross-Validation
Cross-validation is a resampling technique used to evaluate the generalization ability of a
model on unseen data by partitioning the dataset into multiple folds.
Role and Importance:
 Reduces overfitting by testing the model on multiple subsets.
 Provides a more accurate estimate of model performance than a single train-test split.
 Helps in model selection and hyperparameter tuning.
 Ensures robust evaluation, especially with limited data.

Confusion Matrix
A confusion matrix is a performance measurement tool for classification problems, showing
the counts of true and false classifications for each class.
Structure (for Binary Classification):

Predicted Positive Predicted Negative

Actual Positive True Positive (TP) False Negative (FN)

Actual Negative False Positive (FP) True Negative (TN)

Role and Importance:


 Provides detailed insight into the types of errors a model is making.
 Helps identify imbalances in classification, especially in skewed datasets.
 Essential for calculating other performance metrics like precision, recall, and F1-score.
 Useful in evaluating models in sensitive applications like healthcare or fraud detection.
Comparison:

Aspect | Cross-Validation | Confusion Matrix
Purpose | Validate model performance across data subsets | Evaluate prediction quality (TP, FP, FN, TN)
Type | Resampling strategy | Evaluation metric tool
Output | Average performance scores (accuracy, etc.) | Count-based matrix of predictions
Use Case | Model selection, overfitting detection | Classification error analysis
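A sketch showing both tools on the same classifier; the breast-cancer dataset and logistic regression are chosen purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Cross-validation: average performance across 5 folds
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))

# Confusion matrix: detailed error breakdown on a held-out test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))   # [[TN FP], [FN TP]]
```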
*****************************************************************************
K-Means Clustering
K-Means Clustering is an unsupervised machine learning algorithm used to group similar
data points into K clusters based on feature similarity.

How It Works: Steps in K-Means Algorithm


1. Choose the number of clusters (K)
Decide how many clusters you want to divide your data into.
2. Initialize K centroids randomly
These centroids are initially placed at random positions in the data space.
3. Assign each data point to the nearest centroid
Based on Euclidean distance (or other distance metrics), assign data points to the closest
cluster.
4. Update the centroids
Recalculate the centroid of each cluster as the mean of all points assigned to it.
5. Repeat steps 3–4 until convergence
Iteration continues until the assignments no longer change or the centroids stabilize.
Key Characteristics:
 Unsupervised: No labeled data required.
 Centroid-based: Groups data based on distance to the cluster mean.
 Iterative: Repeats assignment and update steps to minimize variance.
Use Cases:
 Customer segmentation
 Image compression
 Market basket analysis
 Document classification
Choosing K: The Elbow Method
Plot the Within-Cluster Sum of Squares (WCSS) for different values of K.
 Look for an "elbow point" where the rate of decrease slows down.
 This point gives an optimal number of clusters.
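A minimal elbow-method sketch with scikit-learn on synthetic data; the model's inertia_ attribute is the WCSS:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D data with three well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)               # within-cluster sum of squares

for k, w in zip(range(1, 11), wcss):
    print(k, round(w, 1))                  # look for the "elbow" (here around k = 3)
```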

Strengths:
 Simple and fast for large datasets
 Easily interpretable
 Works well when clusters are spherical and well-separated

Limitations:
 Requires pre-defining K
 Sensitive to initial centroids
 Poor performance with non-spherical or overlapping clusters
 Affected by outliers

Applications:
 Customer segmentation
 Market basket analysis
 Image compression
 Document classification
***********************************************************************
Time series data
Time series data refers to data points collected or recorded at specific time intervals (e.g., daily,
monthly, yearly). Analyzing time series involves breaking it down into its fundamental
components to understand patterns and make accurate forecasts.
🔹 Main Components of Time Series Data
1. Trend (T)
 Definition: The long-term direction of the data over a period of time.
 Role:
o Shows overall increase or decrease in the data (e.g., rising stock prices).
o Helps in understanding underlying growth or decline.
 Example: Steady rise in temperature due to global warming.
2. Seasonality (S)
 Definition: Repeating short-term cycle in the data occurring at regular intervals (e.g.,
monthly, quarterly).
 Role:
o Captures periodic fluctuations due to weather, holidays, or other repeating
events.
o Critical for planning and forecasting in retail, agriculture, tourism, etc.
 Example: Higher ice cream sales during summer months every year.
3. Cyclic Component (C)
 Definition: Long-term up-and-down movements not of fixed period (unlike seasonality).
 Role:
o Reflects economic cycles, business conditions, etc.
o Important for strategic decision-making.
 Example: Business cycles with periods of expansion and recession.
4. Irregular or Random Component (R)
 Definition: Unpredictable, random variations in the data.
 Role:
o Represents noise or unexpected events.
o Helps to isolate meaningful patterns by filtering out randomness.
 Example: Sudden dip in stock prices due to geopolitical crisis.
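These components can be separated with a classical decomposition; a sketch using statsmodels on a synthetic monthly series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + random noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (np.arange(48) * 2
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
          + np.random.normal(0, 1, 48))
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())     # Trend (T)
print(result.seasonal.head())           # Seasonality (S)
print(result.resid.dropna().head())     # Irregular/random component (R)
```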
******************************************************************************
Logistic Regression
 Logistic Regression is a supervised machine learning algorithm used for classification
problems.
 Logistic regression is a statistical algorithm that analyzes the relationship between one or more independent variables and a categorical outcome.
 Logistic regression is used for binary classification, where the sigmoid function takes a linear combination of the independent variables and produces a probability value between 0 and 1.

Use of Logistic Regression


Logistic regression is widely used in various fields for predictive classification, especially
when the output involves two or more categories.

Types of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.
Sigmoid Function
 Sigmoid Function (also known as the logistic function) is a mathematical function used to
map any real-valued number into a value between 0 and 1.
 It is widely used in logistic regression, neural networks, and other models that need to
output probabilities.

Working of Logistic Regression: Steps


1. Input Features: Take independent variables (e.g., age, income).
2. Linear Combination: Compute z = w₀ + w₁x₁ + w₂x₂ + ...
3. Apply Sigmoid Function:
Convert z to a probability using σ(z) = 1 / (1 + e^(−z)).
4. Classification Rule:
o If σ(z)≥0.5 → Class = 1
o If σ(z)<0.5 → Class = 0
5. Training:
Use optimization (like Gradient Descent) to minimize log-loss (cross-entropy loss).
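A minimal sketch of the sigmoid and a scikit-learn logistic regression fit; the hours-studied/pass-fail data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Toy data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
prob = model.predict_proba([[4.5]])[0, 1]      # sigmoid(w0 + w1 * 4.5)
print(round(prob, 2), "-> class", int(prob >= 0.5))
```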

Use-Cases

Domain Use-Case Example

Healthcare Predicting whether a patient has a disease (yes/no)

Finance Credit card fraud detection

Marketing Customer churn prediction (will leave/stay)

E-commerce Whether a user will click on an ad


***********************************************************************
K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a lazy learner and non-parametric algorithm used for
classification and regression.
 Lazy learner: It doesn't learn a model during training; instead, it stores all data and
classifies new data at prediction time.
 Non-parametric: It makes no assumptions about the underlying data distribution.

3. How It Works: Step-by-Step


Step 1: Choose the value of K
 K is the number of neighbors to consider (common choices: 3, 5, 7, etc.)
Step 2: Measure Distance
Most common: Euclidean Distance
Other distances:
 Manhattan distance (L1)
 Minkowski distance
 Cosine similarity
Step 3: Find the K nearest neighbors
 Sort training data by distance to the query (test) point
 Pick top K closest
Step 4: Vote or average
 For classification: Take the majority class among the neighbors.
 For regression: Take the average value of neighbors.
📦 Example: Fruit Classification
We want to classify a new fruit based on weight and texture (smooth/rough).
We already have the following labeled data:

Fruit Weight (g) Texture Label

Apple 150 Smooth 🍎 Apple

Orange 170 Rough 🍊 Orange

Apple 160 Smooth 🍎 Apple

Orange 180 Rough 🍊 Orange

A new fruit has:


 Weight = 155 g
 Texture = Smooth
Steps:
1. Calculate distances to all training points (numerical + encoded texture).
2. Assume we encode texture: Smooth = 0, Rough = 1.
3. Apply distance formula.
4. Choose K = 3.
5. Suppose 2 of the 3 closest neighbors are Apples, 1 is Orange.
✅ Prediction: Apple (majority class)
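The same fruit example can be reproduced with scikit-learn, encoding texture as Smooth = 0 and Rough = 1:

```python
from sklearn.neighbors import KNeighborsClassifier

# Training data: [weight (g), texture]; Smooth = 0, Rough = 1
X = [[150, 0], [170, 1], [160, 0], [180, 1]]
y = ["Apple", "Orange", "Apple", "Orange"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# New fruit: 155 g, smooth -> two of the three nearest neighbors are Apples
print(knn.predict([[155, 0]]))   # ['Apple']
```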

📌 Key Characteristics:
 Instance-based: No learning/training phase
 Non-parametric: Makes no assumptions about data distribution
 Simple and intuitive
📉 Limitations: Slow for large datasets (needs to compare with all training points)
 Sensitive to irrelevant features and scaling
 Requires choosing optimal K value

Limitations of KNN
❌ Slow prediction time on large datasets (needs to calculate distance to every point)
❌ Sensitive to irrelevant features or noisy data

❌ Requires feature scaling (e.g., normalization)


❌ Suffers from curse of dimensionality in high dimensions

🔹 Applications of KNN
 Recommender systems (e.g., books, movies)
 Image recognition
 Handwriting detection
 Medical diagnosis
 Anomaly detection
***************************************************************************
Comparison between K-Nearest Neighbors (KNN) and Decision Tree
K-Nearest Neighbors (KNN)
 Type: Instance-based (lazy learner)
 Approach: Classifies based on the majority vote of the K closest training instances in
feature space.
 Training time: Fast (no model building)
 Prediction time: Slow (distance calculation needed)
🔹 Decision Tree
 Type: Model-based (eager learner)
 Approach: Splits data using feature-based decision rules (based on Gini index or
Information Gain) to build a tree structure.
 Training time: Slower (due to tree construction)
 Prediction time: Fast (simple rule traversal)

Feature/Aspect | KNN | Decision Tree
Model Type | Lazy learner (stores entire training data) | Eager learner (builds model/tree)
Working Principle | Distance-based voting | Recursive partitioning using decision rules
Data Preprocessing | Needs scaling (e.g., normalization) | Not sensitive to scaling
Handling Non-linearity | Good with low-dimensional non-linear data | Handles non-linear data well
Overfitting Risk | Low if K is large | High (use pruning or max depth to control)
Noise Sensitivity | High (neighbor-based) | Medium (depends on splitting criteria)
Explainability | Hard to explain (no model) | Easy to explain with decision paths

Use-Case Recommendations

Situation | Recommended Classifier
Small dataset with clear structure | KNN
Need for human interpretability | Decision Tree
High-dimensional data | Decision Tree (or tuned)
Real-time prediction needed | Decision Tree
Dataset with outliers or missing values | Decision Tree (more robust)
Multiclass classification | Both (but Decision Tree scales better)


****************************************************************************
ARIMA Forecasting with Exponential Smoothing for Time Series Prediction
Time series forecasting is critical in fields like finance, sales, inventory, weather prediction, and
more. Two widely used methods are:
1. ARIMA (AutoRegressive Integrated Moving Average)
2. Exponential Smoothing (ETS)
Though distinct, both methods aim to capture patterns in time series data and predict future
values. Let’s discuss both approaches and how they compare or complement each other.

📘 1. ARIMA Model
🔹 Definition:
ARIMA combines three components:
 AR (AutoRegressive): Relationship between an observation and its previous values
 I (Integrated): Differencing of observations to make the time series stationary
 MA (Moving Average): Relationship between an observation and a residual error from a
moving average model

ARIMA(p, d, q):
 p = Number of AR terms
 d = Number of differencing required to make series stationary
 q = Number of MA terms

🔁 Steps to Use ARIMA:


1. Check stationarity (use ADF test)
2. Difference the series if needed (to remove trend)
3. Determine p and q using ACF/PACF plots
4. Fit ARIMA model
5. Forecast future values
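A hedged sketch of these steps with statsmodels; the file monthly_sales.csv, its columns, and the ARIMA(1, 1, 1) order are placeholders, since in practice p, d, q come from the ADF test and the ACF/PACF plots:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

# Assumed input: a monthly sales series indexed by date
sales = pd.read_csv("monthly_sales.csv", index_col="month", parse_dates=True)["sales"]

# Step 1: check stationarity with the ADF test (p-value < 0.05 suggests stationary)
adf_stat, p_value, *_ = adfuller(sales)
print("ADF p-value:", round(p_value, 4))

# Steps 2-4: difference once (d = 1) and fit an ARIMA(1, 1, 1) as a starting point
model = ARIMA(sales, order=(1, 1, 1)).fit()

# Step 5: forecast the next 12 months
print(model.forecast(steps=12))
```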
📘 2. Exponential Smoothing (ETS)
Exponential Smoothing models forecast future values by giving exponentially decreasing
weights to past observations.

📦 Types of ETS Models:


1. Simple Exponential Smoothing (SES): For series with no trend/seasonality
2. Holt’s Linear Trend Method: For series with trend
3. Holt-Winters Method: For series with trend and seasonality

🔁 ETS Model Components:


 Error (E): Additive or Multiplicative
 Trend (T): Additive, Multiplicative, or None
 Seasonality (S): Additive, Multiplicative, or None
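A Holt-Winters sketch with statsmodels, assuming monthly data with additive trend and seasonality:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Toy monthly series with an upward trend and yearly seasonality
idx = pd.date_range("2019-01-01", periods=60, freq="MS")
y = pd.Series(np.arange(60) + 15 * np.sin(2 * np.pi * np.arange(60) / 12) + 100,
              index=idx)

model = ExponentialSmoothing(y, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(12))       # forecast the next 12 months
```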
🔁 ARIMA vs Exponential Smoothing: A Comparison

Aspect | ARIMA | Exponential Smoothing (ETS)
Data Assumption | Requires stationarity | No stationarity required
Components | AR, I, MA | Level, Trend, Seasonality
Handles Seasonality | SARIMA extension needed | Holt-Winters handles seasonality
Captures Autocorrelation | Yes | No
Interpretability | Moderate | High (component-based)
Accuracy (in practice) | Often similar to ETS | Often similar to ARIMA
Preferred When | Autocorrelation is present | Clear trend and seasonal patterns

***************************************************************************
ARIMA model for forecasting monthly sales data
ARIMA stands for:
 AR – AutoRegressive: Uses the relationship between a current observation and a number
of lagged (previous) observations.
 I – Integrated: Uses differencing of raw observations to make the data stationary (i.e.,
removing trends or seasonality).
 MA – Moving Average: Models the relationship between an observation and a residual
error from a moving average model.
An ARIMA model is denoted as:
ARIMA(p,d,q)
Where:
 p = number of autoregressive terms
 d = number of differencing required to make the series stationary
 q = number of lagged forecast errors in the prediction equation

******************************************************************************
K-Means clustering algorithm for customer segmentation using a sample dataset (e.g.,
customer age and spending score).
Steps Involved:
1. 📂 Data Preprocessing

2. 📉 Choosing Optimal K (Elbow Method)


3. Applying K-Means Algorithm
4. 📊 Visualizing Clusters
5. 📌 Discussing Strengths & Limitations of K-Means

🔹 Step 1: Data Preprocessing


Assume a simple dataset:

CustomerID | Age | Spending Score
1 | 19 | 39
2 | 21 | 81
3 | 23 | 6
4 | 31 | 40
5 | 45 | 99
6 | 50 | 5
… | … | …

Tasks:
 Select relevant features: Age and Spending Score
 Standardize the data (optional but helps improve clustering accuracy)
 Remove duplicates or missing values

🔹 Step 2: Choosing Optimal K using the Elbow Method


Goal:
Find the number of clusters (K) that best fits the data.
Process:
 Run K-Means for a range of values (e.g., K=1 to K=10)
 Record WCSS (Within-Cluster Sum of Squares) for each K
 Plot K vs. WCSS
 Identify the elbow point (where the reduction in WCSS slows down)
📌 Elbow Point = Optimal number of clusters
🔹 Step 3: Applying the K-Means Algorithm
Algorithm steps:
1. Randomly initialize K cluster centroids
2. Assign each data point to the nearest centroid (Euclidean distance)
3. Compute new centroids by taking the mean of points in each cluster
4. Repeat steps 2–3 until cluster assignments don’t change (convergence)
Output:
 Cluster labels for each data point
 Coordinates of final cluster centroids
🔹 Step 4: Visualizing the Clusters
2D Plot:
 X-axis: Age
 Y-axis: Spending Score
 Use colors to show cluster groups
 Mark centroids with a special symbol (e.g., yellow X)
🎯 This helps visually identify:
 High-spending young customers
 Low-spending older customers
 Average-spending mid-age customers, etc.
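Putting Steps 1-4 together on the sample data above, a minimal sketch (K = 3 is assumed here; normally it would come from the elbow plot):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Sample data: [age, spending score]
X = np.array([[19, 39], [21, 81], [23, 6], [31, 40], [45, 99], [50, 5]])

# Step 1: standardize the features
scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

# Step 3: apply K-Means (K = 3 assumed; normally chosen via the elbow method)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Step 4: visualize clusters and centroids in the original units
centroids = scaler.inverse_transform(km.cluster_centers_)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="X", s=200,
            c="yellow", edgecolors="black")
plt.xlabel("Age")
plt.ylabel("Spending Score")
plt.show()
```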
🔹 Step 5: Strengths and Limitations of K-Means

✅ Strengths:

Feature Benefit

Simplicity Easy to understand and implement

Speed Fast and efficient with large datasets

Versatility Can be used for customer segmentation, image compression, etc.

Scalability Works well with large samples (better than hierarchical clustering)

⚠️ Limitations:

Limitation Explanation

Requires K Must manually define the number of clusters

Sensitive to outliers A few extreme points can distort the results

Assumes spherical clusters Doesn’t perform well if clusters are not circular

Random initialization May converge to a local minimum — use KMeans++

Affected by feature scaling Needs standardization for meaningful distance metrics

📈 Real-Life Use Cases of Customer Segmentation with K-Means:


 Targeted marketing campaigns
 Personalized customer experiences
 Loyalty program design
 Product recommendation systems
***********************************************************************
Unit 4
Common types of data visualization charts
Different charts in identifying trends, relationships, and outliers in data
1. Line Charts – Identifying Trends Over Time
✅ Purpose:
 Line charts are ideal for visualizing data points over time and identifying trends, such
as upward, downward, or cyclical patterns.
✅ Use in Identifying:
 Trends (e.g., increasing/decreasing sales)
 Seasonality
 Sudden changes or disruptions
✅ Example:
 A company tracks monthly revenue over 2 years. A line chart shows:
o A steady increase during festive seasons
o A sudden dip during a pandemic period
✅ Decision-Making Aid:
 Helps managers forecast future revenue and adjust marketing strategies during high-
or low-sales months.
 Enables budgeting decisions based on historical trends.
📊 2. Scatter Plots – Exploring Relationships Between Variables

✅ Purpose:
 Scatter plots display individual data points based on two numerical variables, helping
detect correlations, clusters, and outliers.
✅ Use in Identifying:
 Relationships (linear, non-linear, or no correlation)
 Outliers (points far from the rest)
 Clusters (grouped data points)
✅ Example:
 A retailer analyzes the relationship between advertising spend and monthly sales:
o A strong upward trend suggests positive correlation
o A few data points far from the cluster indicate campaigns with low ROI
✅ Decision-Making Aid:
 Supports marketing decisions by optimizing ad budgets.
 Identifies ineffective campaigns to avoid future waste of resources.
📦 3. Box Plots – Understanding Distributions and Outliers

✅ Purpose:
 Box plots summarize distribution of a dataset using median, quartiles, and outliers. They
are great for comparing datasets side by side.
✅ Use in Identifying:
 Outliers
 Spread (variability)
 Skewness of data
 Comparative distributions across categories
✅ Example:
 A school evaluates exam scores across three classes using box plots:
o One class shows a wide range and many outliers
o Another has tightly grouped scores around the median
✅ Decision-Making Aid:
 Helps in identifying students who need support (outliers with low scores).
 Allows teachers to compare teaching effectiveness across classes.

Chart Type | Purpose | Insights Provided
Line Chart | Shows trends over time | Trends, seasonality, increasing/decreasing patterns
Scatter Plot | Displays relationships between two numeric variables | Correlation, clusters, and outliers
Box Plot (Box-and-Whisker) | Summarizes distribution and outliers | Median, quartiles, and outliers
Histogram | Shows distribution of a single variable | Skewness, modality, and outliers
Bar Chart | Compares categories with frequency or values | Category-wise comparison, some trend observations
Heatmap | Visualizes correlation or frequency in a matrix format | Pattern recognition, relationship between variables
Area Chart | Similar to line chart, but with filled area | Trends + volume over time
Pair Plot (in Seaborn) | Plots multiple scatter plots for feature pairs | Multivariate relationships and outliers
Bar Chart
 Purpose: Compare quantities across categories.
 Use Case: Comparing sales across different products.
 Types: Vertical bar chart, horizontal bar chart, stacked bar chart.
Line Chart
 Purpose: Show trends over time.
 Use Case: Stock prices, temperature over days.
 Features: Can show multiple lines for comparison.
Pie Chart
 Purpose: Show proportions or percentages of a whole.
 Use Case: Market share of companies.
 Note: Avoid for datasets with many categories.
Histogram
 Purpose: Show the distribution of a continuous variable.
 Use Case: Examining age distribution, income levels.
 Note: X-axis shows bins (intervals).
Scatter Plot
 Purpose: Show relationships or correlations between two variables.
 Use Case: Height vs weight, sales vs ad spend.
 Features: Points may be clustered or form a trend line.

****************************************************************************
ARIMA: A Forecasting Method in Time Series Analysis
ARIMA stands for:
 AR – AutoRegressive: Uses dependency between an observation and a number of lagged
observations (past values).
 I – Integrated: Makes the data stationary by differencing (subtracting previous values).
 MA – Moving Average: Uses dependency between an observation and a residual error
from a moving average model applied to lagged observations.
ARIMA Notation: ARIMA(p, d, q)
 p: Number of autoregressive terms.
 d: Number of times the data needs to be differenced to make it stationary.
 q: Number of moving average terms.
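A minimal sketch of fitting an ARIMA(p, d, q) model with the statsmodels library (assuming it is installed); the sales series below is an illustrative placeholder.

# Fit an ARIMA model and forecast the next few periods
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

model = ARIMA(sales, order=(1, 1, 1))   # p=1 AR term, d=1 differencing step, q=1 MA term
fitted = model.fit()

forecast = fitted.forecast(steps=3)     # forecast the next 3 periods
print(forecast)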
***************************************************************************
Visualization Tools in Decision-Making
Data visualization tools play a critical role in the decision-making process by transforming raw
data into clear, meaningful visual formats that help individuals and organizations understand
trends, patterns, and insights.
✅ 1. Simplifying Complex Data
 Role: Convert massive and complex datasets into simple visuals.
 Impact: Helps decision-makers quickly grasp information without needing advanced
statistical knowledge.
 Example: Dashboards showing real-time KPIs (Key Performance Indicators).
✅ 2. Identifying Trends and Patterns
 Role: Reveal hidden trends, correlations, or anomalies.
 Impact: Supports strategic planning, forecasting, and risk assessment.
 Example: Sales trend line charts showing seasonality or growth.
✅ 3. Supporting Faster and Informed Decisions
 Role: Enables real-time data monitoring and reporting.
 Impact: Reduces time to decision and improves responsiveness.
 Example: Managers taking quick action based on live stock level heatmaps.
✅ 4. Enhancing Communication and Collaboration
 Role: Offers a universal language for presenting insights across teams.
 Impact: Improves cross-functional understanding and collaboration.
 Example: Boardroom presentations using pie charts to show market share distribution.
✅ 5. Enabling Predictive and Prescriptive Analytics
 Role: Integrates with forecasting models to visualize future scenarios.
 Impact: Guides decisions based on predictions (what will happen) and recommendations
(what to do).
 Example: Forecast dashboards using ARIMA models and visual output.
✅ 6. Supporting Data-Driven Culture
 Role: Encourages reliance on data over intuition.
 Impact: Builds a culture of accountability and continuous improvement.
 Example: BI tools like Tableau, Power BI, and Google Data Studio driving regular
performance reviews.
🔧 Popular Visualization Tools Used in Decision-Making

Tool | Strengths
Tableau | Interactive dashboards, advanced analytics
Power BI | Integration with Microsoft ecosystem
Google Data Studio | Free, easy integration with Google tools
Qlik Sense | Associative data engine, intuitive visuals
Excel | Widely used for basic visualization tasks

***************************************************************************
Scatter Plots and Box Plots in Data Visualization
Scatter plots and box plots are powerful tools in data visualization that serve different but
complementary purposes. Together, they help analysts understand patterns, relationships,
and distributions in datasets.
📊 1. Scatter Plots

✅ Purpose
Scatter plots are used to visualize the relationship between two continuous variables.
✅ Key Features
 Each point represents a data observation.
 X-axis and Y-axis represent two different variables.
 Optionally, color or size can represent additional dimensions.
✅ Uses
 Correlation Analysis: Positive, negative, or no correlation.
 Trend Identification: Linear or nonlinear patterns.
 Outlier Detection: Unusual points that deviate from the general pattern.
 Clustering: Identify natural groupings in data.
✅ Example
Plotting Advertising Budget (X) vs. Sales (Y) to determine if increased spending leads to higher
sales.
✅ Benefits
 Simple and effective for spotting relationships and trends.
 Helps in regression analysis and model validation.
📦 2. Box Plots (Box-and-Whisker Plots)
✅ Purpose
Box plots are used to summarize the distribution of a dataset using five-number summary:
 Minimum, Q1 (25th percentile), Median, Q3 (75th percentile), and Maximum.
✅ Key Features
 Box shows interquartile range (IQR = Q3 − Q1).
 Line inside the box shows the median.
 Whiskers extend to min and max (excluding outliers).
 Outliers are shown as individual points.
✅ Uses
 Distribution Comparison: Across multiple categories.
 Outlier Detection: Clearly highlights data points outside 1.5 × IQR.
 Spread and Skewness: Indicates variability and symmetry of data.
✅ Example
Visualizing exam scores across different classes to compare performance.
✅ Benefits
 Excellent for comparing data distributions across categories.
 Helps in identifying skewed data and data variability.
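A minimal sketch of computing the five-number summary and the 1.5 × IQR outlier fence that a box plot visualizes; the scores are hypothetical.

# Five-number summary and IQR-based outlier detection
import numpy as np

scores = np.array([35, 48, 52, 55, 58, 60, 62, 65, 70, 95])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(f"Min={scores.min()}, Q1={q1}, Median={median}, Q3={q3}, Max={scores.max()}")
print("Outliers beyond 1.5 x IQR:", outliers)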
🆚 Scatter Plot vs. Box Plot – Comparison Table

Feature | Scatter Plot | Box Plot
Type of Variables | Two continuous variables | One variable (or grouped comparisons)
Main Focus | Relationship between variables | Distribution and spread of values
Best For | Correlation, clustering, trend spotting | Outlier detection, summary statistics
Visual Element | Dots on a 2D plane | Box with whiskers and median line
**************************************************************************
Different visualization libraries like Matplotlib, Seaborn, and Power BI
1. Matplotlib (Python Library)

Aspect | Details
Type | Programming-based (Python)
Ease of Use | Low – Requires more code and configuration
Customization | Very High – Fully customizable with granular control
Output Formats | Static images (PNG, PDF, SVG), interactive plots (via extensions)
Strengths | Fine control over plots; Large community; Good for static graphics
Weaknesses | Verbose code; Not user-friendly for quick visualizations
Best Use Cases | Custom plotting in Python; Academic publications; Engineering & scientific plotting
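A minimal sketch illustrating Matplotlib's fine-grained control: two styled subplots saved as a static image. The data values and the output filename are hypothetical.

# Two subplots with explicit styling, exported as a static figure
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot([1, 2, 3, 4], [10, 14, 12, 18], color="tab:blue", linewidth=2)
ax1.set_title("Trend")

ax2.bar(["A", "B", "C"], [5, 9, 7], color="tab:orange")
ax2.set_title("Category Comparison")

fig.tight_layout()
fig.savefig("report_figure.png", dpi=150)   # static output, e.g. for publications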

🎨 2. Seaborn (Built on Matplotlib)


Aspect | Details
Type | Programming-based (Python)
Ease of Use | Medium – Simpler than Matplotlib, but still requires coding
Customization | High – Built-in themes, colors, and automatic summaries
Output Formats | Static and some interactive via Matplotlib
Strengths | Great for statistical plots; Clean, attractive default styles; Works well with Pandas
Weaknesses | Less flexible than Matplotlib for fine-tuning
Best Use Cases | Exploratory Data Analysis (EDA); Quick insights with statistical context; Visualizing distributions and categories
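A minimal sketch of Seaborn's statistical plotting on a Pandas DataFrame, using the "tips" sample dataset that ships with Seaborn (downloading it the first time may require an internet connection).

# Category-wise distribution plot with Seaborn defaults
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.boxplot(data=tips, x="day", y="total_bill")   # compares the bill distribution per day
plt.title("Total Bill by Day")
plt.show()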

📈 3. Power BI (Microsoft Tool)

Aspect | Details
Type | GUI-based Business Intelligence Tool
Ease of Use | Very High – No coding required, drag-and-drop interface
Customization | Medium – Custom visuals supported, but some limitations
Output Formats | Interactive dashboards, reports, exports to PDF, Excel, and web
Strengths | Real-time dashboards; Easy integration with Excel, SQL, Azure; Role-based access, sharing
Weaknesses | Limited statistical depth; Custom visuals may require Power BI Pro or DAX knowledge
Best Use Cases | Business reporting; KPI tracking; Executive dashboards; Data storytelling for decision-makers

Comparison Table

Feature / Tool | Matplotlib | Seaborn | Power BI
Coding Required | Yes (High) | Yes (Medium) | No
Ease of Learning | Moderate to Hard | Moderate | Easy
Interactivity | Low (native), add-ons | Low (via Matplotlib) | High (built-in)
Customization | Very High | High | Medium
Statistical Support | Moderate | Strong | Limited
Integration | Python ecosystem | Python ecosystem | Microsoft ecosystem
Target Users | Developers, researchers | Data scientists | Business analysts, executives
************************************************************************
Unit 5

Types of Bias in Data Science

1. Sampling Bias
 What it is: When the data sample is not representative of the entire population.
 Sampling bias occurs when the sample collected is not representative of the population
intended to be analyzed.
 This bias can lead to inaccurate conclusions because some members of the population are
either overrepresented or underrepresented in the sample.
 Implication: Models trained on biased samples produce inaccurate predictions for
underrepresented groups.
 Example: Surveying only urban users about internet usage excludes rural populations.

2. Selection Bias
 What it is: Occurs when data is selected in a way that it is not random or systematically
excludes certain groups.
 Implication: Leads to over- or under-estimation of outcomes.
 Example: Using data from only successful loan applicants to predict credit risk.
3. Measurement Bias (or Instrument Bias)
 What it is: When the tools or methods used to collect data introduce error.
 Implication: Data collected is systematically inaccurate.
 Example: A faulty sensor recording temperature always 2°C higher than actual.
4. Observer Bias
 What it is: When a person’s expectations or beliefs influence the data recording.
 Implication: Subjective observations skew the dataset.
 Example: A doctor unconsciously recording more symptoms in patients they think are at
high risk.
5. Confirmation Bias
 What it is: Focusing on data that supports a pre-existing belief while ignoring opposing
data.
 Implication: Misleads model development and interpretation.
 Example: Ignoring data that contradicts a hypothesis during model validation.
6. Algorithmic Bias
 What it is: When machine learning algorithms inherit bias from training data or design.
 Implication: Can lead to unfair or discriminatory decisions.
 Example: A facial recognition system performing poorly on darker skin tones due to
imbalanced training data.

***************************************************************************
Algorithmic Bias
Algorithmic bias refers to systematic errors in a computer system that create unfair outcomes,
such as privileging one group over another. It usually stems from:
 Biased or incomplete training data
 Unbalanced feature selection
 Lack of diversity in model testing
 Human assumptions coded into algorithms
Effects of Algorithmic Bias:
1. Unfair Decisions
 People or groups may be favored or disadvantaged unfairly based on race, gender, age,
etc.
2. Reinforcement of Social Inequality
o Existing inequalities in the data are amplified, not corrected.
3. Loss of Trust
o Users lose confidence in systems that consistently show bias.
4. Legal and Ethical Issues
o Organizations may face lawsuits or penalties for discrimination.
5. Poor Model Performance
o Biased algorithms often fail to generalize and perform poorly in real-world scenarios.
Example:
A recruitment algorithm used by a tech company is trained on historical hiring data. Since most
hires in the past were male, the model learns to favor male candidates and reject female applicants
with similar or better qualifications.
🔴 Effect: Qualified women are unfairly rejected, leading to gender discrimination and loss of
talent.
How to Mitigate Algorithmic Bias
1. Use diverse and representative training datasets
2. Regularly audit models for fairness and accuracy
3. Implement bias detection tools (e.g., IBM Fairness 360, Google’s What-If Tool)
4. Promote transparency in model design
5. Include domain experts and ethicists in the development process

Ethics Matter in Data-Driven Decision-Making


Data-driven decision-making is the process of using data insights, analytics, and statistical
models to guide business strategies and actions. It allows organizations to make informed,
objective, and efficient decisions.
1. ✅ Ensures Fairness and Prevents Discrimination
 Data can reflect societal biases (e.g., gender or racial bias).
 Without ethical oversight, algorithms can reinforce inequalities.
 Example: A hiring algorithm may favor male applicants if trained on biased historical
data.
2. ✅ Protects Privacy and Data Rights
 Ethical frameworks guide how personal data is collected, stored, and used.
 Prevents misuse or overreach, ensuring compliance with laws like GDPR.
 Example: A health app must anonymize user data to prevent identity leaks.
3. ✅ Builds Trust and Transparency
 When decisions are explainable and transparent, users are more likely to trust the system.
 Ethics fosters accountability and openness in automated decision-making.
 Example: Financial institutions must explain why a loan application is rejected.
4. ✅ Supports Informed and Responsible Decisions
 Ethical principles guide data scientists to consider social, legal, and moral
consequences.
 Promotes thoughtful use of insights, not just blindly following data patterns.
 Example: A predictive policing model must be evaluated for ethical and societal impacts
before deployment.
5. ✅ Promotes Human-Centric AI
 Keeps the human impact of automated decisions at the core.
 Encourages inclusion, accessibility, and social benefit.
 Example: Ethical guidelines ensure that education platforms don’t disadvantage learners
from low-income backgrounds.
Real-World Examples

Scenario | Ethical Concern | Impact of Ignoring Ethics
Predictive policing | Racial profiling from biased datasets | Reinforces systemic discrimination
Healthcare AI | Incorrect predictions due to poor data | Puts patient lives at risk
Credit scoring models | Unfairly denying loans to minorities | Economic exclusion
Targeted advertising | Data misuse, manipulation of opinions | Influences elections, promotes fake news
Ethical Principles to Follow:


 Beneficence (do good)
 Non-maleficence (do no harm)
 Autonomy (respect user rights)
 Justice (ensure fairness)
 Accountability and Transparency

***************************************************************************
Analyzing Techniques to Detect and Mitigate Bias in AI/ML Models
Bias in AI/ML models can lead to unfair decisions and ethical concerns. Detecting and mitigating
this bias is essential for building fair, transparent, and responsible AI systems.
We can classify bias mitigation techniques into three main categories:
🔍 1. Bias Detection Techniques
Before mitigation, it’s crucial to detect bias using metrics such as:
 Statistical Parity (Demographic Parity): Checks if all groups have equal outcomes.
 Equal Opportunity: Ensures true positive rates are the same across groups.
 Disparate Impact: Measures if a protected group receives a negative outcome more
frequently.
 Calibration: Ensures predicted probabilities are equally accurate for all groups.
Example:
If an ML model is used for hiring and it selects 80% of male applicants but only 40% of female
applicants, disparate impact is present.
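A minimal sketch of checking disparate impact for this hiring example; the applicant counts are hypothetical, and the 0.8 cut-off follows the commonly cited four-fifths rule.

# Compute the disparate impact ratio between two groups' selection rates
selected = {"male": 80, "female": 40}     # hypothetical number selected per group
applied  = {"male": 100, "female": 100}   # hypothetical number of applicants per group

rate_male = selected["male"] / applied["male"]        # 0.80
rate_female = selected["female"] / applied["female"]  # 0.40

disparate_impact = rate_female / rate_male            # 0.50
print(f"Disparate impact ratio: {disparate_impact:.2f}")

if disparate_impact < 0.8:
    print("Potential bias: the protected group's selection rate is below 80% of the majority's.")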
🛠 2. Bias Mitigation Techniques
Bias mitigation can be applied at different stages of the ML pipeline:
📌 A. Pre-processing Techniques (Before model training)
These methods aim to clean and balance the data to remove bias.

Technique | Description | Example
Re-sampling | Over-sample minority groups or under-sample majority | Balancing male and female applicant data in training
Reweighing | Assign different weights to different groups to balance distributions | Assign more weight to female applicants in biased hiring data
Data Transformation | Remove or encode sensitive attributes like race or gender | Remove gender from resume features before training

✅ Advantage: Prevents the model from learning bias from the start.
⚠ Limitation: May reduce information that could be useful for fairness.
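A minimal sketch of the reweighing idea on a tiny hypothetical hiring sample: each (group, label) combination is weighted by its expected frequency divided by its observed frequency, so that group and outcome look statistically independent in the weighted data.

# Reweighing: compute per-row weights that balance group/label combinations
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "hired":  [1,   1,   1,   0,   1,   0,   0,   0],
})

weights = []
for _, row in df.iterrows():
    p_group = (df["gender"] == row["gender"]).mean()
    p_label = (df["hired"] == row["hired"]).mean()
    p_joint = ((df["gender"] == row["gender"]) & (df["hired"] == row["hired"])).mean()
    weights.append(p_group * p_label / p_joint)   # expected frequency / observed frequency

df["weight"] = weights
print(df)   # under-represented combinations (e.g. hired women) receive weights > 1

These weights can typically be passed to a learning algorithm as sample weights during training.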

⚙ B. In-processing Techniques (During model training)


These methods modify the learning algorithm itself to promote fairness.
Technique | Description | Example
Fairness Constraints | Add constraints during optimization to enforce fairness | Ensure equal opportunity constraint during training
Adversarial Debiasing | Use adversarial networks to remove bias signals from features | Train one model to predict outcomes, another to remove gender bias

✅ Advantage: Incorporates fairness directly into model learning.
⚠ Limitation: Complex to implement and may affect performance.
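A minimal sketch of the "fairness constraint in the loss" idea: a hand-rolled logistic regression trained by gradient descent with an added penalty on the gap between the two groups' average predicted scores (a soft demographic-parity constraint). The data, the penalty weight LAMBDA, and the learning rate are hypothetical choices, not a standard library implementation.

# Logistic regression with a demographic-parity penalty added to the loss
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # hypothetical features
group = rng.integers(0, 2, size=200)       # sensitive attribute (0 or 1)
y = (X[:, 0] + 0.5 * group + rng.normal(scale=0.5, size=200) > 0).astype(float)

w = np.zeros(3)
LAMBDA, LR = 1.0, 0.1                      # penalty weight and learning rate (hypothetical)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w)
    grad_logloss = X.T @ (p - y) / len(y)
    # Penalty term: squared gap between the groups' mean predicted probabilities
    gap = p[group == 1].mean() - p[group == 0].mean()
    s = p * (1 - p)                         # derivative of the sigmoid
    grad_gap = (X[group == 1] * s[group == 1, None]).mean(axis=0) \
             - (X[group == 0] * s[group == 0, None]).mean(axis=0)
    w -= LR * (grad_logloss + LAMBDA * 2 * gap * grad_gap)

p = sigmoid(X @ w)
print("Mean score gap after training:", abs(p[group == 1].mean() - p[group == 0].mean()))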

C. Post-processing Techniques (After model training)


These techniques adjust the model predictions to reduce unfair outcomes.

Technique | Description | Example
Threshold Adjustment | Use different decision thresholds for different groups | Set lower threshold for female candidates if they're under-selected
Output Modification | Modify predicted labels or probabilities to balance outcomes | Re-label some rejected female applicants as selected

✅ Advantage: Can be applied to any model.
⚠ Limitation: May seem unfair or artificial to stakeholders.
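A minimal sketch of group-wise threshold adjustment applied after training; the scores, groups, and thresholds are hypothetical.

# Post-processing: apply a different decision threshold per group
import numpy as np

scores = np.array([0.9, 0.7, 0.55, 0.4, 0.8, 0.6, 0.45, 0.3])   # model scores
group  = np.array(["M", "M", "M", "M", "F", "F", "F", "F"])

thresholds = {"M": 0.6, "F": 0.45}   # lower threshold for the under-selected group

decisions = np.array([scores[i] >= thresholds[group[i]] for i in range(len(scores))])

for g in ("M", "F"):
    rate = decisions[group == g].mean()
    print(f"Selection rate for {g}: {rate:.2f}")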

Stage | Method | Goal | Example
Pre-processing | Reweighing, Resampling | Balance input data | Balance male and female data in hiring dataset
In-processing | Adversarial Debiasing | Learn fair model | Block model from using gender during learning
Post-processing | Threshold Adjustment | Balance final predictions | Adjust thresholds to equalize loan approvals
**************************************************************************
Ethical AI Framework: Components and Design

Comparison of Ethical AI Framework Components

Aspect | Data Ethics Layer | Algorithmic Ethics Layer | Fairness-Aware ML Layer | Corporate Responsibility Layer
Primary Focus | Ensuring data quality and representativeness | Detecting and mitigating bias in algorithms | Embedding fairness in model design | Promoting accountability and ethical governance
Bias Addressed | Sampling Bias | Algorithmic Bias | Both Sampling and Algorithmic Bias | Institutional & Societal Bias
Key Techniques | Diverse data sampling; Data audits; Documentation | Bias detection tools; Explainability; Fairness metrics | Pre-processing (e.g., reweighting); In-processing (fair loss); Post-processing (thresholds) | AI Ethics board; Impact assessments; Regulation compliance
Fairness Approach | Collect fair data | Evaluate fairness of predictions | Build models to ensure fairness | Ensure organizational support and ethical use
Examples | Balancing gender/race in datasets | Monitoring loan approvals across race/gender | Adjusting models to reduce bias during/after training | Publishing model cards or fairness reports
Tools Used | Datasheets, Data Cards | SHAP, LIME, Aequitas, Fairness Indicators | AI Fairness 360, Fairlearn, Themis-ML | Governance policies, Ethical audits, Legal compliance frameworks
Strength | Prevents bias at the source | Improves model transparency and fairness assessment | Directly modifies the modeling process to be fair | Builds long-term trust, ensures accountability
Limitation | May not fix bias already present in the system | Reactive rather than proactive | Can reduce model accuracy if over-constrained | Slow process; may be influenced by organizational priorities

******************************************************************************
Fairness-Aware Algorithms
 Fairness-aware algorithms are machine learning models and techniques specifically
designed to detect, prevent, or reduce bias and unfair treatment of individuals or
groups—especially those based on protected attributes such as gender, race, age, or
disability.
 These algorithms aim to ensure that model decisions are equitable across all
demographic segments, even if the training data contains biases.
🔍 Why Do We Need Fairness-Aware Algorithms?
Traditional machine learning algorithms optimize for accuracy, not fairness. If historical data is
biased (due to societal inequalities), the model will learn and amplify those patterns, resulting
in unfair decisions.
Types of Fairness-Aware Algorithms
Fairness-aware approaches can be applied at three stages of the ML pipeline:
🔹 1. Pre-processing Algorithms
 Modify the input data to reduce bias before training the model.
 Techniques:
o Reweighing: Assign weights to balance groups
o Data transformation or sampling
🔹 2. In-processing Algorithms
 Modify the learning process to incorporate fairness directly.
 Techniques:
o Fairness constraints in the loss function
o Adversarial debiasing to remove sensitive attribute influence
🔹 3. Post-processing Algorithms
 Modify the output predictions to achieve fairness.
 Techniques:
o Threshold adjustment for different groups
o Equalizing false positive/negative rates
🎯 Why Are Fairness-Aware Algorithms Important in Ethical AI Design?

Reason | Explanation
1. Prevents Discrimination | Ensures decisions do not unfairly disadvantage any group.
2. Builds Public Trust | Transparent and fair systems are more likely to gain user and societal trust.
3. Ensures Legal Compliance | Aligns with anti-discrimination laws (e.g., GDPR, EEOC, etc.)
4. Promotes Inclusion and Diversity | Encourages equitable access to opportunities and resources.
5. Enhances Social Responsibility | Reflects ethical values in algorithm design and deployment.

*****************************************************************************
Corporate Responsibility in Ensuring Ethical Data Practices
 Corporate responsibility plays a critical role in ensuring ethical practices in AI and data
analytics.
 As organizations increasingly rely on data-driven technologies, they are expected to
operate ethically, transparently, and accountably—not only to comply with legal
requirements but to build public trust and long-term sustainability.

🔹 The Role of Corporate Responsibility


Corporate responsibility encompasses an organization’s commitment to:
 Fair and ethical use of data and AI
 Protecting individual rights (privacy, consent, fairness)
 Preventing harm and bias
 Promoting transparency and accountability
When embedded in the core strategy, it influences every step of data handling—from collection
and processing to algorithm design and deployment.

🔹 Influence on Ethical Decision-Making in AI and Data Analytics


1. Ethical Governance Structures
 Companies adopt ethics review boards or AI ethics committees to:
o Evaluate potential risks.
o Oversee AI project approvals.
o Monitor compliance with ethical standards.
✅ Example:
Google’s Advanced Technology External Advisory Council was initially set up to review
ethical challenges in AI, although it faced challenges and was later disbanded, highlighting the
complexity of governance.

2. Bias Mitigation Strategies


 Tools and processes are introduced to detect, audit, and reduce algorithmic bias.
 Techniques include:
o Balanced datasets.
o Fairness-aware machine learning.
o Regular audits of algorithmic decisions.

3. Stakeholder Engagement
 Ethical corporate behavior includes engaging:
o Customers (for informed consent and data rights).
o Employees (for internal training and ethics culture).
o Communities and experts (for diverse perspectives).
✅ Example:
Microsoft engages with external researchers, NGOs, and human rights organizations while
developing AI-based tools, ensuring ethical implications are considered.

4. Compliance with Ethical Frameworks


 Following established codes like:
o ACM Code of Ethics
o IEEE Guidelines
o GDPR (for data protection)
✅ Example:
Salesforce employs a Chief Ethical and Humane Use Officer to ensure AI tools align with
human rights and ethical norms, using frameworks like ACM’s principles (public good,
avoidance of harm, fairness, etc.).

🔹 Real-World Scenario: Amazon’s AI Hiring Tool


 Issue: Amazon’s AI tool for recruitment showed gender bias against women, learned from
historical (biased) hiring data.
 Ethical Lapse: Lack of pre-deployment bias testing and oversight.
 Corporate Responsibility Gap: No ethical review board or bias mitigation processes were
applied before using the tool.
 Outcome: Tool was discontinued after public backlash.
 Lesson: Demonstrates the necessity of proactive ethical governance and responsibility
to avoid reputational damage and harm.
