DIFFERENCE BETWEEN ML/DL/DATA SCIENCE/AI
✅ 1. Artificial Intelligence (AI)
AI is the science of creating systems that simulate human intelligence — enabling machines to reason,
learn, adapt, and act autonomously.
Scope: Broadest field — includes logic, rules-based systems, ML, DL, robotics, etc.
Key Trait: Decision-making ability like humans.
Examples: Self-driving cars, fraud detection, AI-powered chatbots.
💡 Think of AI as the goal: making machines behave smartly.
✅ 2. Machine Learning (ML)
ML is a subset of AI that uses algorithms to learn patterns from data and make predictions or
decisions without being explicitly programmed.
Scope: Statistical models trained on historical data.
Key Trait: Learns from data, improves over time.
Examples: Spam filters, churn prediction, recommendation engines.
💡 ML is the engine that powers AI behavior from data.
✅ 3. Deep Learning (DL)
DL is an advanced subset of ML that uses multi-layered neural networks to model complex patterns in
high-dimensional data. In simpler terms, it is loosely inspired by the human brain, where interconnected
neurons pass signals through successive layers.
Scope: Uses artificial neural networks (ANNs), particularly useful in unstructured data (images,
audio, text).
Key Trait: Learns features automatically — needs big data + compute.
Examples: Language translation, facial recognition, autonomous driving perception systems.
💡 DL mimics how the human brain works — with multiple layers of abstraction.
✅ 4. Data Science
Data Science is an interdisciplinary field that blends statistics, machine learning, domain expertise,
and programming to extract actionable insights from data.
Scope: End-to-end data workflow — from collection to modeling and storytelling.
Key Trait: Business impact from data-driven decisions.
Examples: Sales forecasting, customer segmentation, A/B testing.
💡 Data Science = Insight + Impact, not just models.
WHERE ARE ML AND DEEP LEARNING USED?
Criteria: Machine Learning vs. Deep Learning
Data Dependencies: ML performs well on small/medium datasets; DL needs large datasets to excel.
Hardware Dependencies: ML works on low-end machines; DL requires powerful machines with GPUs due to intensive matrix computations.
Feature Engineering: ML requires manual feature selection and domain understanding; DL automatically learns relevant features from data.
Execution Time: ML training ranges from minutes to hours; DL can take days to weeks because neural networks compute a large number of weights.
Interpretability: Some ML models (e.g., logistic regression, decision trees) are interpretable while others (e.g., SVM, XGBoost) are harder to interpret; DL is often difficult to interpret and considered a black-box approach.
Types of Data Analysis
Descriptive analysis
Descriptive analysis, as the name suggests, describes or summarizes raw data and makes it
interpretable. It involves analyzing historical data to understand what has happened in the past. This
type of analysis is used to identify patterns and trends over time.
For example, a business might use descriptive analysis to understand the average monthly sales for the
past year.
Diagnostic analysis
Diagnostic analysis goes a step further than descriptive analysis by determining why something
happened. It involves more detailed data exploration and comparing different data sets to understand
the cause of a particular outcome.
For instance, if a company's sales dropped in a particular month, diagnostic analysis could be used to
find out why.
Predictive analysis
Predictive analysis uses statistical models and forecasting techniques to understand the future. It
involves using data from the past to predict what could happen in the future. This type of analysis is
often used in risk assessment, marketing, and sales forecasting.
For example, a company might use predictive analysis to forecast the next quarter's sales based on
historical data.
Prescriptive analysis
Prescriptive analysis is the most advanced type of data analysis. It not only predicts future outcomes
but also suggests actions to benefit from these predictions. It uses sophisticated tools and technologies
like machine learning and artificial intelligence to recommend decisions.
For example, a prescriptive analysis might suggest the best marketing strategies to increase future sales.
How to Explain EDA in an Interview:
“EDA is the process of exploring and understanding the data before applying any modeling. It helps
detect data quality issues, understand distributions, spot patterns or anomalies, and generate
hypotheses. I usually break it into five stages:”
✅ Complete Explanation with Practical & Interview-Ready Detail
🔹 1. Mean (Average)
➤ What it is:
The arithmetic average — add up all values and divide by the number of values.
➤ Why it's important:
Gives a single number that represents the center of the data. It’s used in many statistical methods like
regression.
➤ When to use it:
When data is symmetrically distributed
No extreme outliers or heavy tails
For quick central tendency
➤ Insight from our example:
[30,000, 35,000, 38,000, 40,000, 5,000,000]
Mean = 1,028,600
💡 The mean is not representative here — it's misleading because the last value (5M) skews it
drastically.
➤ Interview Line:
“I always compare mean with median. If they differ a lot, the data is probably skewed — I may
need to use median instead.”
🔹 2. Median
➤ What it is:
The middle value in a sorted dataset. If the number of values is even, it’s the average of the two middle
values.
➤ Why it's important:
It’s robust to extreme values. It tells you what a typical individual looks like even when the data is
skewed.
➤ When to use:
If you suspect outliers or skewed distribution
When comparing typical behavior of users, prices, transactions, etc.
➤ Insight from example:
Sorted: [30K, 35K, 38K, 40K, 5M]
Median = 38,000
💡 The median gives a much more accurate central income than the mean.
➤ Interview Line:
“In skewed distributions like income or house prices, I prefer median since it’s resistant to extreme
values.”
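To make the mean-vs-median comparison concrete, here is a minimal sketch (pandas assumed) using the sample incomes from the example above:
🧪 Python:
import pandas as pd

# Sample incomes from the example above (note the extreme 5,000,000 value)
income = pd.Series([30_000, 35_000, 38_000, 40_000, 5_000_000])

print(income.mean())    # 1028600.0 -> pulled up by the single extreme value
print(income.median())  # 38000.0   -> a much better picture of a "typical" income

# A large gap between mean and median suggests a skewed distribution.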
🔹 3. Standard Deviation (std)
➤ What it is:
It tells how much the data varies from the mean. It’s the square root of variance.
➤ Why it's important:
Shows consistency or volatility
Useful in understanding spread in time, price, or behavior
➤ When to use:
To measure spread or variability
To identify if data points are close to mean or highly scattered
➤ Insight from example:
Here, std will be very large because 5M is so far from the mean → high variability.
💡 If you’re building a model, such high std means you need to either normalize or handle outliers.
➤ Interview Line:
“High std tells me the data is inconsistent — so I may need to segment it or transform it.”
🔹 4. Min
➤ What it is:
The smallest value in the column.
➤ Why it's important:
Helps identify errors, e.g., age = -10
Shows the lower bound of real-world values
➤ When to use:
During data validation or range checking
To detect possible invalid entries
➤ Insight from example:
Min = 30,000 → reasonable for income.
But if it were -3,000, that would indicate a data entry error.
➤ Interview Line:
“I always check min/max to make sure no values fall outside acceptable real-world ranges.”
🔹 5. Max
➤ What it is:
The largest value in the column.
➤ Why it's important:
Highlights potential outliers
Reveals maximum bounds, which may require capping or transformation
➤ When to use:
In fraud detection (e.g., suspiciously large transactions)
In business planning to understand best-case scenarios
➤ Insight from example:
Max = 5,000,000 → clearly an outlier
➤ Interview Line:
“If max is too far from median, I investigate whether it's an error, outlier, or VIP customer.”
🔹 6. Count
➤ What it is:
The number of non-null entries in a column.
➤ Why it's important:
Shows data completeness
Helps determine if a feature can be used as-is or needs imputation
➤ When to use:
Before modeling or visualizing
To decide whether to drop or fill missing values
➤ Insight:
If your dataset has 1000 rows, and income count is 850 → you have 150 missing values.
➤ Interview Line:
“I always check count early in EDA to find missing values that may break my model later.”
🔹 7. Skewness
➤ What it is:
Describes asymmetry of the distribution.
Value Interpretation
Skew = 0 Symmetrical
Skew > 0 Right-skewed (long right tail)
Skew < 0 Left-skewed (long left tail)
➤ When to use:
To decide whether to transform features
When choosing model assumptions (e.g., regression)
➤ Insight from example:
Skew = high positive → long tail on right
💡 Could signal income disparity. You may want to use log(income) in modeling.
➤ Interview Line:
“I apply log/square-root transformations when I see strong skewness in numeric variables.”
Summary (Metric: Best Used For; What It Tells You; What to Watch Out For)
Mean: symmetric data; central value (sensitive to outliers); watch out: skewed by large values.
Median: skewed data; middle value (robust to outliers); watch out: doesn't reflect spread.
Std: spread of data; high std = inconsistency; watch out: inflated by outliers.
Min/Max: validity check or outlier detection; range and anomalies; watch out: may not be realistic.
Count: completeness check; helps with missing-value handling; watch out: nulls need attention.
Skewness: distribution shape; guides transformation decisions; watch out: high skew = likely outliers.
Kurtosis: tailedness (outlier detection); shows risk of extreme values; watch out: high kurtosis = unstable model behavior.
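As a hedged illustration, most of the metrics in this summary can be pulled from pandas in a couple of calls (the income column is just the running example):
🧪 Python:
import pandas as pd

df = pd.DataFrame({'income': [30_000, 35_000, 38_000, 40_000, 5_000_000]})

# count, mean, std, min, quartiles and max in one call
print(df['income'].describe())

# Shape of the distribution: asymmetry and tailedness
print("skew:", df['income'].skew())      # strongly positive -> long right tail
print("kurtosis:", df['income'].kurt())  # high -> heavy tails / risk of extreme values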
📚 EDA & Descriptive Statistics Interview Questions
1. General EDA Concepts
What is EDA? What is its purpose?
Exploratory Data Analysis (EDA) is the process of examining and visualizing data to extract insights,
identify patterns, detect anomalies, test assumptions, and prepare for modeling.
The primary goal of EDA is to understand the structure and quality of the data before applying any
machine learning algorithms. It acts as a diagnostic phase where we assess:
Data distribution (e.g., normal vs skewed)
Missing values and outliers
Relationships between variables
Trends, clusters, and anomalies
📊 Key Techniques Include:
Descriptive statistics: mean, median, std, skew, kurtosis
Visualizations: histograms, boxplots, scatter plots, correlation heatmaps
Bivariate/multivariate analysis
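A minimal, illustrative EDA sketch that ties these techniques together (pandas, seaborn, and matplotlib assumed; the file name data.csv is a placeholder):
🧪 Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                 # placeholder dataset

print(df.describe())                         # descriptive statistics
print(df.isnull().sum())                     # missing values per column

df.hist(figsize=(10, 6))                     # distributions of numeric columns
plt.show()

sns.boxplot(data=df.select_dtypes("number")) # spread and outliers
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True)  # relationships between variables
plt.show()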
Describe the lifecycle of a data science project.
Phase 1: Problem Definition (Business Understanding)
Identify a business problem or opportunity that can be addressed through data analysis
Define project goals, objectives, and key performance indicators (KPIs)
Develop a clear understanding of the problem domain and stakeholders' needs
Create a project proposal and obtain stakeholder buy-in
Phase 2: Data Collection and Ingestion
Identify relevant data sources (internal and external)
Collect, store, and process data from various sources (e.g., databases, APIs, files)
Ensure data quality, integrity, and security
Perform initial data exploration and cleaning
Phase 3: Data Exploration and Analysis (Data Understanding)
Explore data distributions, relationships, and trends
Perform statistical analysis and data visualization
Identify correlations, patterns, and anomalies
Develop a deeper understanding of the data and its limitations
Phase 4: Data Preprocessing and Feature Engineering
Clean and preprocess data (e.g., handling missing values, outliers)
Transform and normalize data (e.g., scaling, encoding)
Create new features through feature engineering (e.g., dimensionality reduction, feature
extraction)
Prepare data for modeling
Phase 5: Modeling and Algorithm Development
Select suitable machine learning algorithms and techniques
Train and test models using various evaluation metrics (e.g., accuracy, precision, recall)
Perform hyperparameter tuning and model selection
Develop a robust and accurate predictive model
Phase 6: Model Evaluation and Validation
Evaluate model performance on unseen data (e.g., validation set, cross-validation)
Assess model interpretability and explainability
Validate model assumptions and limitations
Compare model performance to baseline models or benchmarks
Phase 7: Deployment and Integration
Deploy the model in a production-ready environment (e.g., API, container, cloud)
Integrate the model with existing systems and workflows
Ensure model scalability, reliability, and maintainability
Monitor model performance and data drift
Phase 8: Monitoring and Maintenance
Continuously monitor model performance and data quality
Update and retrain models as needed (e.g., concept drift, data changes)
Address model interpretability and explainability concerns
Refine and improve the model over time
Phase 9: Communication and Storytelling
Present findings and insights to stakeholders
Communicate model results and recommendations
Visualize and summarize complex data insights
Drive business decisions and actions through data-driven storytelling
Keep in mind that this is a general outline, and the specifics may vary depending on the project's
scope, complexity, and requirements. A data science project lifecycle may be iterative, and some
phases may overlap or repeat. Effective project management, collaboration, and communication
are essential to ensure successful project outcomes.
Explain the difference between descriptive and inferential statistics.
Descriptive Statistics
Descriptive statistics aim to describe and summarize the basic features of a dataset. This type of
statistics provides an overview of the data, including:
1. Measures of Central Tendency: mean, median, mode
2. Measures of Variability: range, variance, standard deviation
3. Data Distribution: histograms, box plots, density plots
Descriptive statistics help you:
Understand the data's shape and distribution
Identify patterns and outliers
Summarize large datasets
Examples of descriptive statistics:
Calculating the average age of customers
Creating a histogram to visualize the distribution of exam scores
Inferential Statistics
Inferential statistics aim to make inferences or conclusions about a population based on a
sample of data. This type of statistics helps you:
1. Test Hypotheses: determine if a relationship exists between variables
2. Estimate Population Parameters: make educated estimates about population characteristics
3. Predict Outcomes: forecast future events or trends
Inferential statistics involve:
1. Sampling: collecting a representative sample from a population
2. Statistical Modeling: using statistical techniques to analyze the sample data
3. Inference: drawing conclusions about the population based on the sample data
Examples of inferential statistics:
Conducting a t-test to compare the average scores of two groups
Using regression analysis to predict house prices based on features like location and size
Key differences:
1. Purpose: Descriptive statistics describe and summarize data, while inferential statistics make
inferences about a population.
2. Scope: Descriptive statistics focus on the sample data, while inferential statistics aim to
generalize findings to a larger population.
3. Methodology: Descriptive statistics involve simple calculations and visualizations, while
inferential statistics require more complex statistical techniques and modeling.
By understanding the difference between descriptive and inferential statistics, you'll be better
equipped to analyze and interpret data, make informed decisions, and communicate your
findings effectively.
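As a small, illustrative example of the inferential side (the scores below are made up purely for demonstration):
🧪 Python:
from scipy import stats

# Hypothetical exam scores for two groups
group_a = [72, 75, 78, 80, 82, 85]
group_b = [68, 70, 71, 74, 76, 77]

# Two-sample t-test: do the group means differ significantly?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# If p < 0.05 we would reject the null hypothesis of equal means.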
2. Univariate / Bivariate / Multivariate Analysis
Define and provide examples of univariate, bivariate, and multivariate analysis.
🔹 1. Univariate Analysis
Definition:
Analysis of a single variable to understand its distribution, central tendency, and variability.
Usage:
Check data quality: spot outliers, missing values.
Feature selection: determine if a variable has useful variance.
Business reporting: e.g., average customer age, most frequent product sold.
Techniques:
Numeric: mean, median, std, histogram, boxplot
Categorical: value_counts(), bar plot
Example:
Understanding the age distribution of customers.
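A short univariate sketch (the age values are placeholders):
🧪 Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'age': [22, 25, 25, 31, 34, 38, 41, 45, 52, 70]})

print(df['age'].describe())      # central tendency and spread of a single variable

sns.histplot(df['age'], bins=5)  # shape of the distribution
plt.show()

sns.boxplot(x=df['age'])         # spread and potential outliers
plt.show()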
🔹 2. Bivariate Analysis
Definition:
Analysis of two variables to discover relationships or dependencies.
Usage:
Explore feature-target relationships before modeling
Detect correlation (e.g., does experience influence salary?)
Find interactions (e.g., does gender affect churn?)
Techniques:
Numeric vs Numeric: scatter plot, correlation
Categorical vs Numeric: boxplot, grouped mean
Categorical vs Categorical: crosstab, stacked bar chart
Example:
Analyzing how experience impacts salary.
🧪 Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'experience': [1, 2, 3, 4, 5], 'salary': [30, 40, 45, 60, 80]})
sns.scatterplot(x='experience', y='salary', data=df)
plt.title("Bivariate Analysis - Experience vs Salary")
plt.show()
🔹 3. Multivariate Analysis
Definition:
Analysis of three or more variables to understand combined effects, patterns, and interactions.
Usage:
Build predictive models (regression, classification)
Reduce dimensionality (e.g., using PCA)
Detect hidden patterns (e.g., clustering, segmentations)
Multicollinearity checks before modeling
Techniques:
pairplot, heatmap, multiple regression, PCA, clustering, decision trees
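A hedged multivariate sketch combining a correlation heatmap with PCA (scikit-learn assumed; the three columns are illustrative):
🧪 Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5, 6],
    'salary':     [30, 40, 45, 60, 80, 90],
    'age':        [22, 25, 27, 30, 34, 37],
})

# Pairwise relationships among several variables at once
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

# Dimensionality reduction: project the three features onto two components
X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component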
How do you analyze numerical vs categorical variables?
Numerical plots (Method: Purpose; What It Shows)
Histogram: understand the distribution; frequency of values, skewness, modality.
Boxplot: detect outliers and spread; min, Q1, median, Q3, max, outliers.
Violin Plot: combine distribution and spread; density plus IQR by group.
Density Plot (KDE): smoothed histogram; distribution shape (useful for comparisons).
Line Plot: trend over time (time series); changes in a numeric value over time.
Scatter Plot: relationship between two numeric variables; correlation, clusters, patterns.
Heatmap (Correlation Matrix): numeric feature relationships; correlation values among features.
Categorical plots (Method: Purpose; What It Shows)
Bar Plot: show count or average per category; bar height = frequency or summary value.
Count Plot: quick count of categories; good for spotting class imbalance.
Pie Chart (less preferred): proportion of categories; relative percentages (only for a small number of categories).
Stacked Bar Plot: compare categorical variables; distribution within/between categories.
Boxplot (mixed with a numeric variable): numeric stats per category; e.g., income by gender.
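A small sketch of the typical plotting calls for numeric vs categorical columns (seaborn assumed; the gender/income columns are placeholders):
🧪 Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M', 'F', 'M'],
    'income': [40_000, 38_000, 52_000, 41_000, 60_000, 39_000],
})

# Numeric column: distribution and outliers
sns.histplot(df['income'], kde=True)
plt.show()

# Categorical column: class counts
sns.countplot(x='gender', data=df)
plt.show()

# Mixed: numeric stats per category (e.g., income by gender)
sns.boxplot(x='gender', y='income', data=df)
plt.show()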
3. Basic Descriptive Metrics
What is variance vs. standard deviation?
Variance (σ² or Var): The average of the squared differences from the mean. It measures how far each data point is from the mean on average, but in squared units.
Standard Deviation (σ or std): The square root of the variance. It brings the measure of spread back to the original units of the data, making it easier to interpret.
I use standard deviation when I want to understand or explain the spread in real-world units — for example, saying that salaries typically vary by 10,000 PKR is meaningful.
But I use variance when I'm doing internal calculations or modeling — it's mathematically convenient because it's additive and shows up in many algorithms like PCA, clustering, and Gaussian models.
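A quick numeric check of the relationship (NumPy assumed; the salaries are illustrative):
🧪 Python:
import numpy as np

salaries = np.array([30_000, 35_000, 38_000, 40_000, 42_000])

variance = salaries.var()   # average squared deviation from the mean (squared units)
std_dev = salaries.std()    # square root of the variance, back in original units

print(variance, std_dev)
print(np.isclose(std_dev, np.sqrt(variance)))  # True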
Define range, IQR (interquartile range), skewness, and kurtosis — what do they
tell us?
Range, IQR, Skewness, and Kurtosis
1️⃣ Range
Definition: The difference between the maximum and minimum value in a dataset.
Formula: Range = Max − Min
Use: Gives a quick estimate of how spread out the values are.
Limitation: Highly sensitive to outliers; doesn't show how values are distributed in between.
🔍 Example:
If income = [30k, 35k, 40k, 200k] → Range = 200k - 30k = 170k
2️⃣ IQR (Interquartile Range)
Definition: The range between the 25th percentile (Q1) and 75th percentile (Q3), i.e., the middle 50% of the data.
Formula: IQR = Q3 − Q1
Use: Measures the spread of the central values and is robust to outliers. Used in boxplots and outlier detection.
Outlier Rule: Anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is commonly treated as an outlier.
🔍 Example:
If Q1 = 40k and Q3 = 70k → IQR = 30k
Percentile: Meaning
25th percentile (Q1): 25% of the data falls below this value
50th percentile (Median): 50% of the data falls below this value
75th percentile (Q3): 75% of the data falls below this value
Formula (Conceptual)
Percentiles are calculated by:
1. Sorting the data in ascending order
2. Finding the rank using: Rank = (P / 100) × (n + 1)
where P is the percentile and n is the number of data points.
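A short sketch computing percentiles and the IQR outlier bounds directly (NumPy assumed; note that NumPy's default interpolation can differ slightly from the (n + 1) rank formula above):
🧪 Python:
import numpy as np

data = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 200])  # illustrative values (in thousands)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(q1, median, q3, iqr)
print("outliers:", data[(data < lower) | (data > upper)])  # the 200 is flagged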
3️⃣ Skewness
Definition: Measures the asymmetry of the distribution.
Skew = 0: Data is symmetrical (normal distribution)
Skew > 0: Right-skewed (long tail on the right)
Skew < 0: Left-skewed (long tail on the left)
Use: Tells you whether to apply transformations (e.g., log) before modeling.
🔍 Example:
Income data is usually right-skewed: most people earn relatively modest amounts, while a few earn a lot more.
✅ Fix right skew: Apply log, sqrt, or Box-Cox transformation.
Positive skew: Mean > Median
Negative skew: Mean < Median
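A hedged sketch of checking skewness and applying a log transform (the income values are illustrative):
🧪 Python:
import numpy as np
import pandas as pd

income = pd.Series([30_000, 35_000, 38_000, 40_000, 42_000, 45_000, 500_000, 5_000_000])

print("skew before:", income.skew())      # large positive value -> right-skewed

log_income = np.log1p(income)             # log(1 + x) also handles zeros safely
print("skew after :", log_income.skew())  # much closer to 0

print(income.mean() > income.median())    # True: mean > median confirms positive skew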
4. Missing Data & Outliers
How do you detect and handle missing values? (e.g., MCAR, MAR, MNAR, imputation)
Type: Meaning; Example; Can it be fixed by imputation?
MCAR (Missing Completely at Random): missingness has no pattern — it's random. Example: a sensor fails to transmit some readings. Fixable by imputation: ✅ Yes.
MAR (Missing at Random): missingness depends on other observed variables. Example: income is missing more often for younger users. Fixable by imputation: ✅ Yes.
MNAR (Missing Not at Random): missingness depends on the unobserved value itself. Example: people with high income don't report it. Fixable by imputation: ❌ Hard to fix (requires modeling or domain knowledge).
Handling Missing Values (Imputation Strategies)
Strategy: When to Use; Notes
Drop rows/columns: when few rows/columns are missing; use only when the impact is negligible.
Mean/Median Imputation: MCAR or MAR for numerical values; median is more robust to outliers.
Mode Imputation: for categorical variables; fill with the most frequent value.
KNN Imputation: when data has patterns between features; uses neighboring points.
Regression Imputation: predict the missing value from other variables; more accurate but adds complexity.
Missing Indicator: create a new feature such as was_missing; especially useful for tree-based models.
Advanced: MICE (Multiple Imputation by Chained Equations): when missingness is MAR and you want statistical reliability.
I first analyze the missingness type — whether it’s MCAR, MAR, or MNAR — using .isnull(), heatmaps,
and checking if missingness correlates with other features.
For MCAR or MAR, I often use mean/median/mode imputation depending on data type. For more
accurate models, I might use KNN imputation or predictive models. If it’s MNAR, I consult domain
experts or use techniques like creating missing indicators.
I always assess the impact of imputation on the distribution and model performance using visualizations
and cross-validation
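A minimal sketch of that workflow (scikit-learn assumed; the columns and values are placeholders):
🧪 Python:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    'age':    [25, 32, np.nan, 41, 29, np.nan],
    'income': [40_000, np.nan, 52_000, 60_000, np.nan, 39_000],
})

# 1. Quantify missingness per column
print(df.isnull().sum())

# 2. Keep a missingness indicator before imputing (useful for tree-based models)
df['income_was_missing'] = df['income'].isna().astype(int)

# 3. Simple median imputation (robust to outliers) for MCAR/MAR numeric columns
df['income'] = SimpleImputer(strategy='median').fit_transform(df[['income']]).ravel()

# 4. KNN imputation when features carry information about each other
df['age'] = KNNImputer(n_neighbors=2).fit_transform(df[['age', 'income']])[:, 0]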
How do you detect outliers? (e.g., IQR, z-scores, boxplots)
1. IQR (Interquartile Range) Method
📌 Concept:
Based on the middle 50% of the data
Values below or above a certain range are flagged as outliers
🧮 Formula:
IQR = Q3 − Q1
Lower bound = Q1 − 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR
Any data point outside this range is an outlier
✅ Best for:
Non-normal, skewed data
Easy and interpretable
2. Z-Score Method (Standard Deviation Method)
📌 Concept:
Measures how many standard deviations a point is from the mean
Z > 3 or Z < -3 → outlier (assuming normal distribution)
🧮 Formula:
Z = (x − μ) / σ
✅ Best for:
Normally distributed data
Fast and simple method
3️⃣ Boxplot – Visualization-Based Detection
🔹 Concept:
Boxplots display:
Median (Q2)
Q1 and Q3 (the box)
Whiskers (min/max within 1.5 × IQR)
Outliers (points outside the whiskers)
📊 Use:
Quickly visualize the spread
Spot outliers by eye
Great for comparing distributions by category
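A small detection sketch combining z-scores and a boxplot (illustrative data; the IQR-based filter appears in the removal example below):
🧪 Python:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 99 "normal" values plus one extreme point
rng = np.random.default_rng(42)
s = pd.Series(np.append(rng.normal(40, 5, size=99), 500))

# Z-score method: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
print(s[np.abs(z) > 3])   # only the extreme 500 is flagged

# Boxplot: outliers appear as points beyond the whiskers (1.5 × IQR)
sns.boxplot(x=s)
plt.show()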
What are strategies to treat outliers? (capping, winsorization, removal)
Strategies to Treat Outliers
(Capping, Winsorization, Removal & More)
1️⃣ Remove Outliers
🔹 What it is Completely delete rows containing outliers
✅ Best When Outliers are errors, or make up <5% of data
❌ Avoid When Data is small or outliers are important (e.g., fraud)
🧪 Python:
# Using IQR
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]
2️⃣ Capping (Truncation)
🔹 What it is Replace outliers with upper or lower thresholds (based on IQR or percentiles)
✅ Best When You want to keep dataset size but reduce extreme influence
❌ Avoid When Business logic demands raw extreme values
🧪 Python (95% cap):
lower_cap = df['value'].quantile(0.05)
upper_cap = df['value'].quantile(0.95)
df['value'] = df['value'].clip(lower_cap, upper_cap)
3️⃣ Winsorization
🔹 What it is Like capping, but replaces the top/bottom X% with the corresponding percentile values instead of fixed thresholds
✅ Best When You want a robust, statistical way to reduce influence
🔧 Tool Use scipy.stats.mstats.winsorize()
🧪 Python:
from scipy.stats.mstats import winsorize
df['value_winsor'] = winsorize(df['value'], limits=[0.05, 0.05]) # Cap bottom/top 5%
4️⃣ Transformation (Log / Sqrt)
🔹 What it is Apply a mathematical transformation to reduce skew
✅ Best When Right-skewed distributions or outliers that need soft adjustment
❌ Avoid When Data contains zeros or negatives (for log)
🧪 Python:
import numpy as np
df['log_value'] = np.log1p(df['value']) # log1p = log(1 + x)
5. Distribution Analysis
How do you check if data is normally distributed?
What is the empirical rule (68-95-99.7)?
6. Skewness & Kurtosis
Define skewness and kurtosis, and explain what they imply about data distribution.
How can skewness or kurtosis impact your model?
What transformations help address skewness/kurtosis issues?
7. Correlation & Multicollinearity
What is the difference between covariance and correlation?
How do you detect and handle multicollinearity? (e.g., correlation matrix, VIF)
8. Feature Reduction
How does PCA (Principal Component Analysis) work for dimensionality reduction?
9. Statistical Testing & Confidence
Explain hypothesis testing (null/alternative), t-tests, chi-square, ANOVA, p-values, and
confidence intervals.
What is the Central Limit Theorem and why is it important?
10. Advanced & Miscellaneous
What is autocorrelation, and how does it differ from correlation?
Explain sampling distribution vs. probability distribution.
What is the difference between one-tailed vs two-tailed hypothesis testing?
Define type I vs type II errors.