Data Science Interview 2025

The document outlines the differences between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Data Science, highlighting their scopes, key traits, and examples. It also discusses various types of data analysis, including descriptive, diagnostic, predictive, and prescriptive analysis, along with the importance of Exploratory Data Analysis (EDA) in data science projects. Additionally, it covers the lifecycle of a data science project and the distinctions between descriptive and inferential statistics.


DIFFERENCE BETWEEN ML/DL/DATA SCIENCE/AI

✅ 1. Artificial Intelligence (AI)

AI is the science of creating systems that simulate human intelligence — enabling machines to reason,
learn, adapt, and act autonomously.

 Scope: Broadest field — includes logic, rules-based systems, ML, DL, robotics, etc.

 Key Trait: Decision-making ability like humans.

 Examples: Self-driving cars, fraud detection, AI-powered chatbots.

💡 Think of AI as the goal: making machines behave smartly.

✅ 2. Machine Learning (ML)

ML is a subset of AI that uses algorithms to learn patterns from data and make predictions or
decisions without being explicitly programmed.

 Scope: Statistical models trained on historical data.

 Key Trait: Learns from data, improves over time.

 Examples: Spam filters, churn prediction, recommendation engines.

💡 ML is the engine that powers AI behavior from data.

✅ 3. Deep Learning (DL)

DL is an advanced subset of ML that uses multi-layered neural networks to model complex patterns in
high-dimensional data. In simpler terms, it is loosely inspired by the human brain, where many
connected neurons process information in layers.

 Scope: Uses artificial neural networks (ANNs), particularly useful in unstructured data (images,
audio, text).

 Key Trait: Learns features automatically — needs big data + compute.

 Examples: Language translation, facial recognition, autonomous driving perception systems.

💡 DL mimics how the human brain works — with multiple layers of abstraction.

✅ 4. Data Science

Data Science is an interdisciplinary field that blends statistics, machine learning, domain expertise,
and programming to extract actionable insights from data.

 Scope: End-to-end data workflow — from collection to modeling and storytelling.


 Key Trait: Business impact from data-driven decisions.

 Examples: Sales forecasting, customer segmentation, A/B testing.

💡 Data Science = Insight + Impact, not just models.

WHERE ARE ML AND DEEP LEARNING USED?


Criteria | Machine Learning | Deep Learning
Data Dependencies | Excellent performance on small/medium datasets | Excellent performance on large datasets
Hardware Dependencies | Works on low-end machines | Requires powerful machines with GPUs due to intensive matrix computations
Feature Engineering | Requires manual feature selection and understanding | Automatically learns relevant features from data
Execution Time | Ranges from minutes to hours | Can take days to weeks; neural networks compute a large number of weights
Interpretability | Some models (e.g., logistic regression, decision trees) are interpretable; others (e.g., SVM, XGBoost) are harder to interpret | Often difficult to interpret; considered a black-box approach

Types of Data Analysis


Descriptive analysis

Descriptive analysis, as the name suggests, describes or summarizes raw data and makes it
interpretable. It involves analyzing historical data to understand what has happened in the past. This
type of analysis is used to identify patterns and trends over time.

For example, a business might use descriptive analysis to understand the average monthly sales for the
past year.

Diagnostic analysis

Diagnostic analysis goes a step further than descriptive analysis by determining why something
happened. It involves more detailed data exploration and comparing different data sets to understand
the cause of a particular outcome.

For instance, if a company's sales dropped in a particular month, diagnostic analysis could be used to
find out why.

Predictive analysis

Predictive analysis uses statistical models and forecasting techniques to understand the future. It
involves using data from the past to predict what could happen in the future. This type of analysis is
often used in risk assessment, marketing, and sales forecasting.

For example, a company might use predictive analysis to forecast the next quarter's sales based on
historical data.

Prescriptive analysis
Prescriptive analysis is the most advanced type of data analysis. It not only predicts future outcomes
but also suggests actions to benefit from these predictions. It uses sophisticated tools and technologies
like machine learning and artificial intelligence to recommend decisions.

For example, a prescriptive analysis might suggest the best marketing strategies to increase future sales.

How to Explain EDA in an Interview:

“EDA is the process of exploring and understanding the data before applying any modeling. It helps
detect data quality issues, understand distributions, spot patterns or anomalies, and generate
hypotheses. I usually break it into five stages:”

✅ Complete Explanation with Practical & Interview-Ready Detail

🔹 1. Mean (Average)

➤ What it is:

The arithmetic average — add up all values and divide by the number of values.

➤ Why it's important:

Gives a single number that represents the center of the data. It’s used in many statistical methods like
regression.

➤ When to use it:

 When data is symmetrically distributed

 No extreme outliers or heavy tails

 For quick central tendency

➤ Insight from our example:


python

import numpy as np

incomes = [30_000, 35_000, 38_000, 40_000, 5_000_000]
print(np.mean(incomes))  # 1,028,600

💡 The mean is not representative here — it's misleading because the last value (5M) skews it
drastically.

➤ Interview Line:

“I always compare mean with median. If they differ a lot, the data is probably skewed — I may
need to use median instead.”

🔹 2. Median
➤ What it is:

The middle value in a sorted dataset. If the number of values is even, it’s the average of the two middle
values.

➤ Why it's important:

It’s robust to extreme values. It tells you what a typical individual looks like even when the data is
skewed.

➤ When to use:

 If you suspect outliers or skewed distribution

 When comparing typical behavior of users, prices, transactions, etc.

➤ Insight from example:

python

import numpy as np

incomes = [30_000, 35_000, 38_000, 40_000, 5_000_000]  # sorted: [30K, 35K, 38K, 40K, 5M]
print(np.median(incomes))  # 38,000

💡 The median gives a much more accurate central income than the mean.

➤ Interview Line:

“In skewed distributions like income or house prices, I prefer median since it’s resistant to extreme
values.”

🔹 3. Standard Deviation (std)

➤ What it is:

It tells how much the data varies from the mean. It’s the square root of variance.

➤ Why it's important:

 Shows consistency or volatility

 Useful in understanding spread in time, price, or behavior

➤ When to use:

 To measure spread or variability

 To identify if data points are close to mean or highly scattered

➤ Insight from example:

Here, std will be very large because 5M is so far from the mean → high variability.
💡 If you’re building a model, such high std means you need to either normalize or handle outliers.

➤ Interview Line:

“High std tells me the data is inconsistent — so I may need to segment it or transform it.”
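
A minimal sketch of this check, reusing the hypothetical income list from above:

python

import numpy as np

incomes = [30_000, 35_000, 38_000, 40_000, 5_000_000]
print(np.std(incomes))       # ≈ 1.99M, inflated by the single 5M value
print(np.std(incomes[:-1]))  # ≈ 3.8K once the outlier is removed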

🔹 4. Min

➤ What it is:

The smallest value in the column.

➤ Why it's important:

 Helps identify errors, e.g., age = -10

 Understands lower bound of real-world values

➤ When to use:

 During data validation or range checking

 To detect possible invalid entries

➤ Insight from example:

Min = 30,000 → reasonable for an income column.

But if it were -3,000, that would indicate a data-entry error.

➤ Interview Line:

“I always check min/max to make sure no values fall outside acceptable real-world ranges.”

🔹 5. Max

➤ What it is:

The largest value in the column.

➤ Why it's important:

 Highlights potential outliers

 Reveals maximum bounds, which may require capping or transformation

➤ When to use:

 In fraud detection (e.g., suspiciously large transactions)

 In business planning to understand best-case scenarios


➤ Insight from example:

Max = 5,000,000 → clearly an outlier

➤ Interview Line:

“If max is too far from median, I investigate whether it's an error, outlier, or VIP customer.”

🔹 6. Count

➤ What it is:

The number of non-null entries in a column.

➤ Why it's important:

 Shows data completeness

 Helps determine if a feature can be used as-is or needs imputation

➤ When to use:

 Before modeling or visualizing

 To decide whether to drop or fill missing values

➤ Insight:

If your dataset has 1000 rows, and income count is 850 → you have 150 missing values.

➤ Interview Line:

“I always check count early in EDA to find missing values that may break my model later.”
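
A quick sketch of this check (the income column and values here are hypothetical):

python

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [30_000, np.nan, 38_000, 40_000]})
print(df['income'].count())         # number of non-null entries
print(df['income'].isnull().sum())  # number of missing entries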

🔹 7. Skewness

➤ What it is:

Describes asymmetry of the distribution.

Value | Interpretation
Skew = 0 | Symmetrical
Skew > 0 | Right-skewed (long right tail)
Skew < 0 | Left-skewed (long left tail)


➤ When to use:

 To decide whether to transform features

 When choosing model assumptions (e.g., regression)

➤ Insight from example:

Skew = high positive → long tail on the right

💡 Could signal income disparity. You may want to use log(income) in modeling.

➤ Interview Line:

“I apply log/square-root transformations when I see strong skewness in numeric variables.”
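
A minimal sketch of both steps, measuring skew and applying a log transform (the income series is hypothetical):

python

import numpy as np
import pandas as pd

income = pd.Series([30_000, 35_000, 38_000, 40_000, 5_000_000])
print(income.skew())            # strongly positive → right-skewed
print(np.log1p(income).skew())  # log transform pulls the right tail in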

Metric | Best Used For | What It Tells You | What to Watch Out For
Mean | Symmetric data | Central value (sensitive to outliers) | Skewed by large values
Median | Skewed data | Middle value (robust to outliers) | Doesn’t reflect spread
Std | Spread of data | High = inconsistency | Inflated by outliers
Min/Max | Range and anomalies | Validity check or outlier detection | May not be realistic
Count | Completeness check | Helps with missing value handling | Nulls need attention
Skewness | Distribution shape | Guides transformation decisions | High skew = likely outliers
Kurtosis | Tailedness (outlier detection) | Shows risk of extreme values | High = unstable model behavior

📚 EDA & Descriptive Statistics Interview Questions

1. General EDA Concepts

What is EDA? What is its purpose?


Exploratory Data Analysis (EDA) is the process of examining and visualizing data to extract insights,
identify patterns, detect anomalies, test assumptions, and prepare for modeling.
The primary goal of EDA is to understand the structure and quality of the data before applying any
machine learning algorithms. It acts as a diagnostic phase where we assess:

 Data distribution (e.g., normal vs skewed)

 Missing values and outliers

 Relationships between variables

 Trends, clusters, and anomalies

📊 Key Techniques Include:

 Descriptive statistics: mean, median, std, skew, kurtosis

 Visualizations: histograms, boxplots, scatter plots, correlation heatmaps

 Bivariate/multivariate analysis
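
A minimal first-pass EDA sketch along these lines; the file path and column contents are placeholders:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')  # hypothetical dataset
print(df.describe())          # mean, std, min/max, quartiles
print(df.isnull().sum())      # missing values per column
df.hist(figsize=(10, 6))      # univariate distributions
sns.heatmap(df.corr(numeric_only=True), annot=True)  # pairwise correlations
plt.show()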

Describe the lifecycle of a data science project.


Phase 1: Problem Definition (Business Understanding)

 Identify a business problem or opportunity that can be addressed through data analysis

 Define project goals, objectives, and key performance indicators (KPIs)

 Develop a clear understanding of the problem domain and stakeholders' needs

 Create a project proposal and obtain stakeholder buy-in

Phase 2: Data Collection and Ingestion

 Identify relevant data sources (internal and external)

 Collect, store, and process data from various sources (e.g., databases, APIs, files)

 Ensure data quality, integrity, and security

 Perform initial data exploration and cleaning

Phase 3: Data Exploration and Analysis (Data Understanding)

 Explore data distributions, relationships, and trends

 Perform statistical analysis and data visualization

 Identify correlations, patterns, and anomalies

 Develop a deeper understanding of the data and its limitations

Phase 4: Data Preprocessing and Feature Engineering

 Clean and preprocess data (e.g., handling missing values, outliers)


 Transform and normalize data (e.g., scaling, encoding)

 Create new features through feature engineering (e.g., dimensionality reduction, feature
extraction)

 Prepare data for modeling

Phase 5: Modeling and Algorithm Development

 Select suitable machine learning algorithms and techniques

 Train and test models using various evaluation metrics (e.g., accuracy, precision, recall)

 Perform hyperparameter tuning and model selection

 Develop a robust and accurate predictive model

Phase 6: Model Evaluation and Validation

 Evaluate model performance on unseen data (e.g., validation set, cross-validation)

 Assess model interpretability and explainability

 Validate model assumptions and limitations

 Compare model performance to baseline models or benchmarks

Phase 7: Deployment and Integration

 Deploy the model in a production-ready environment (e.g., API, container, cloud)

 Integrate the model with existing systems and workflows

 Ensure model scalability, reliability, and maintainability

 Monitor model performance and data drift

Phase 8: Monitoring and Maintenance

 Continuously monitor model performance and data quality

 Update and retrain models as needed (e.g., concept drift, data changes)

 Address model interpretability and explainability concerns

 Refine and improve the model over time

Phase 9: Communication and Storytelling

 Present findings and insights to stakeholders

 Communicate model results and recommendations

 Visualize and summarize complex data insights


 Drive business decisions and actions through data-driven storytelling

Keep in mind that this is a general outline, and the specifics may vary depending on the project's
scope, complexity, and requirements. A data science project lifecycle may be iterative, and some
phases may overlap or repeat. Effective project management, collaboration, and communication
are essential to ensure successful project outcomes.

Explain the difference between descriptive and inferential statistics.


Descriptive Statistics
Descriptive statistics aim to describe and summarize the basic features of a dataset. This type of
statistics provides an overview of the data, including:

1. Measures of Central Tendency: mean, median, mode

2. Measures of Variability: range, variance, standard deviation

3. Data Distribution: histograms, box plots, density plots

Descriptive statistics help you:

 Understand the data's shape and distribution

 Identify patterns and outliers

 Summarize large datasets

Examples of descriptive statistics:

 Calculating the average age of customers

 Creating a histogram to visualize the distribution of exam scores

Inferential Statistics
Inferential statistics aim to make inferences or conclusions about a population based on a
sample of data. This type of statistics helps you:

1. Test Hypotheses: determine if a relationship exists between variables

2. Estimate Population Parameters: make educated estimates about population characteristics

3. Predict Outcomes: forecast future events or trends

Inferential statistics involve:

1. Sampling: collecting a representative sample from a population

2. Statistical Modeling: using statistical techniques to analyze the sample data


3. Inference: drawing conclusions about the population based on the sample data

Examples of inferential statistics:

 Conducting a t-test to compare the average scores of two groups

 Using regression analysis to predict house prices based on features like location and size
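
A minimal sketch of the t-test example above, using scipy on two hypothetical score groups:

python

from scipy import stats

group_a = [78, 85, 90, 72, 88, 81]  # hypothetical exam scores
group_b = [70, 75, 80, 68, 74, 72]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value suggests the group means differ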

Key differences:

1. Purpose: Descriptive statistics describe and summarize data, while inferential statistics make
inferences about a population.

2. Scope: Descriptive statistics focus on the sample data, while inferential statistics aim to
generalize findings to a larger population.

3. Methodology: Descriptive statistics involve simple calculations and visualizations, while
inferential statistics require more complex statistical techniques and modeling.

By understanding the difference between descriptive and inferential statistics, you'll be better
equipped to analyze and interpret data, make informed decisions, and communicate your
findings effectively.

2. Univariate / Bivariate / Multivariate Analysis

Define and provide examples of univariate, bivariate, and multivariate analysis.


🔹 1. Univariate Analysis

Definition:
Analysis of a single variable to understand its distribution, central tendency, and variability.

Usage:

 Check data quality: spot outliers, missing values.

 Feature selection: determine if a variable has useful variance.

 Business reporting: e.g., average customer age, most frequent product sold.

Techniques:

 Numeric: mean, median, std, histogram, boxplot

 Categorical: value_counts(), bar plot

Example:
Understanding the age distribution of customers.
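
A minimal sketch for that example, with a hypothetical age column:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'age': [22, 25, 31, 35, 41, 47, 52, 29, 33, 38]})
print(df['age'].describe())        # central tendency and spread
sns.histplot(df['age'], kde=True)  # distribution shape
plt.show()
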
🔹 2. Bivariate Analysis

Definition:
Analysis of two variables to discover relationships or dependencies.

Usage:

 Explore feature-target relationships before modeling

 Detect correlation (e.g., does experience influence salary?)

 Find interactions (e.g., does gender affect churn?)

Techniques:

 Numeric vs Numeric: scatter plot, correlation

 Categorical vs Numeric: boxplot, grouped mean

 Categorical vs Categorical: crosstab, stacked bar chart

Example:
Analyzing how experience impacts salary.

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'experience': [1, 2, 3, 4, 5], 'salary': [30, 40, 45, 60, 80]})
sns.scatterplot(x='experience', y='salary', data=df)
plt.title("Bivariate Analysis - Experience vs Salary")
plt.show()
🔹 3. Multivariate Analysis

Definition:
Analysis of three or more variables to understand combined effects, patterns, and interactions.

Usage:

 Build predictive models (regression, classification)

 Reduce dimensionality (e.g., using PCA)

 Detect hidden patterns (e.g., clustering, segmentations)

 Multicollinearity checks before modeling

Techniques:

 pairplot, heatmap, multiple regression, PCA, clustering, decision trees
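
A minimal sketch, extending the earlier hypothetical salary data with a third variable and checking pairwise correlations:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'experience': [1, 2, 3, 4, 5],
                   'salary': [30, 40, 45, 60, 80],
                   'age': [24, 27, 29, 33, 38]})
sns.heatmap(df.corr(), annot=True)  # correlation matrix across all three variables
plt.show()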


How do you analyze numerical vs categorical variables?

Numerical plots

Method | Purpose | What It Shows
Histogram | Understand distribution | Frequency of values, skewness, modality
Boxplot | Detect outliers and spread | Min, Q1, Median, Q3, Max, outliers
Violin Plot | Combine distribution + spread | Density + IQR by group
Density Plot (KDE) | Smoothed histogram | Distribution shape (useful for comparing)
Line Plot | Trend over time (time series) | Changes in numeric value over time
Scatter Plot | Relationship between two numeric variables | Correlation, clusters, patterns
Heatmap (Correlation Matrix) | Numeric feature relationships | Correlation values among features

Categorical plots

Method | Purpose | What It Shows
Bar Plot | Show count or average per category | Height = frequency or summary value
Count Plot | Quick count of categories | Good for class imbalance
Pie Chart (less preferred) | Proportion of categories | Relative % (only for small category counts)
Stacked Bar Plot | Categorical variable comparisons | Distribution within/between categories
Boxplot (if mixed with numeric) | Numeric stats per category | e.g., income by gender

3. Basic Descriptive Metrics

What is variance vs. standard deviation?


Concept | Definition
Variance (σ² or Var) | The average of the squared differences from the mean. It measures how far each data point is from the mean on average, but in squared units.
Standard Deviation (σ or std) | The square root of the variance. It brings the measure of spread back to the original units of the data, making it easier to interpret.

I use standard deviation when I want to understand or explain the spread in real-world units — for
example, saying that salaries typically vary by 10,000 PKR is meaningful.

But I use variance when I’m doing internal calculations or modeling — it’s mathematically convenient
because it’s additive and shows up in many algorithms like PCA, clustering, and Gaussian models.
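
A minimal sketch of the relationship on hypothetical salaries:

python

import numpy as np

salaries = [50_000, 60_000, 55_000, 70_000, 65_000]
var = np.var(salaries)  # squared units (PKR²)
std = np.std(salaries)  # back to original units (PKR)
print(var, std)         # std equals the square root of the variance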

Define range, IQR (interquartile range), skewness, and kurtosis — what do they
tell us?
Range, IQR, Skewness, and Kurtosis

1️⃣ Range

Definition | The difference between the maximum and minimum value in a dataset.
Formula | Range = Max − Min
Use | Gives a quick estimate of how spread out the values are.
Limitation | Highly sensitive to outliers — doesn't show how values are distributed in between.

🔍 Example:
If income = [30k, 35k, 40k, 200k] → Range = 200k - 30k = 170k

2️⃣ IQR (Interquartile Range)

Definition | The range between the 25th percentile (Q1) and 75th percentile (Q3) — the middle 50% of the data.
Formula | IQR = Q3 − Q1
Use | Measures spread of central values, robust to outliers. Used in boxplots and outlier detection.
Outliers Rule | Common rule: anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is an outlier.

🔍 Example:
If Q1 = 40k and Q3 = 70k → IQR = 30k

Example | Meaning
25th percentile (Q1) | 25% of the data falls below this value
50th percentile (Median) | 50% of the data falls below this value
75th percentile (Q3) | 75% of the data falls below this value

Formula (Conceptual)

Percentiles are calculated by:

1. Sorting the data in ascending order

2. Finding the rank using:

\text{Rank} = \frac{P}{100} \times (n + 1)

where P is the percentile and n is the number of data points.
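
A minimal sketch with numpy (note that np.percentile interpolates linearly by default, so results can differ slightly from the (n + 1) rank formula above):

python

import numpy as np

income = [30_000, 35_000, 40_000, 50_000, 70_000, 90_000]
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)                     # quartiles and spread of the middle 50%
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # common outlier fences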


3️⃣ Skewness

Definition | Measures the asymmetry of the distribution.
Skew = 0 | Data is symmetrical (normal distribution)
Skew > 0 | Right-skewed (long tail on the right)
Skew < 0 | Left-skewed (long tail on the left)
Use | Tells if you should apply transformations (e.g., log) before modeling.

🔍 Example:
Income data is usually right-skewed — most people earn around the average, but a few earn a lot more.

✅ Fix right skew: Apply log, sqrt, or Box-Cox transformation.

 Positive skew: Mean > Median

 Negative skew: Mean < Median

4. Missing Data & Outliers

 How do you detect and handle missing values? (e.g., MCAR, MAR, MNAR, imputation)

Type | Meaning | Example | Can be fixed by Imputation?
MCAR (Missing Completely at Random) | Missingness has no pattern — it's random | A sensor fails to transmit some readings | ✅ Yes
MAR (Missing at Random) | Missing depends on other observed variables | Income missing more often for younger users | ✅ Yes
MNAR (Missing Not at Random) | Missing depends on unobserved data | People with high income don’t report it | ❌ Hard to fix (requires modeling or domain knowledge)

Handling Missing Values (Imputation Strategies)


Strategy | When to Use | Notes
Drop rows/columns | When few rows/cols are missing | Use only when impact is negligible
Mean/Median Imputation | MCAR or MAR for numerical values | Median is more robust to outliers
Mode Imputation | For categorical variables | Fill with most frequent value
KNN Imputation | When data has patterns between features | Uses neighboring points
Regression Imputation | Predict missing value from other variables | More accurate but adds complexity
Indicator for missing | Create a new feature: was_missing | Especially useful for tree-based models
Advanced: MICE | When missing is MAR and you want statistical reliability | Multiple Imputation with Chained Equations

I first analyze the missingness type — whether it’s MCAR, MAR, or MNAR — using .isnull(), heatmaps,
and checking if missingness correlates with other features.

For MCAR or MAR, I often use mean/median/mode imputation depending on data type. For more
accurate models, I might use KNN imputation or predictive models. If it’s MNAR, I consult domain
experts or use techniques like creating missing indicators.

I always assess the impact of imputation on the distribution and model performance using visualizations
and cross-validation.
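
A minimal sketch of a few of these strategies on a hypothetical DataFrame:

python

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [30_000, np.nan, 38_000, 40_000],
                   'city': ['A', 'B', None, 'A']})
df['income_was_missing'] = df['income'].isnull()           # missing indicator
df['income'] = df['income'].fillna(df['income'].median())  # median imputation
df['city'] = df['city'].fillna(df['city'].mode()[0])       # mode imputation
print(df)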

How do you detect outliers? (e.g., IQR, z-scores, boxplots)


1. IQR (Interquartile Range) Method

📌 Concept:

 Based on the middle 50% of the data

 Values below or above a certain range are flagged as outliers

🧮 Formula:

IQR = Q3 − Q1

 Lower bound = Q1 − 1.5 × IQR


 Upper bound = Q3 + 1.5 × IQR

 Any data point outside this range is an outlier

✅ Best for:

 Non-normal, skewed data

 Easy and interpretable

2. Z-Score Method (Standard Deviation Method)

📌 Concept:

 Measures how many standard deviations a point is from the mean

 Z > 3 or Z < -3 → outlier (assuming normal distribution)

🧮 Formula:

Z = \frac{x - \mu}{\sigma}

✅ Best for:

 Normally distributed data

 Fast and simple method
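
A minimal sketch of the z-score rule next to the IQR rule, on synthetic data with one injected outlier:

python

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(50_000, 5_000, 500), 5_000_000))

z = (values - values.mean()) / values.std()
print(values[z.abs() > 3])  # z-score rule (assumes roughly normal data)

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])  # IQR rule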

3️⃣ Boxplot – Visualization-Based Detection

🔹 Concept:

 Boxplots display:

o Median (Q2)

o Q1 and Q3 (box)

o Whiskers (min/max within 1.5×IQR)

o Outliers (points outside the whiskers)

📊 Use:

 Quickly visualize the spread

 Spot outliers by eye

 Great for comparing distributions by category
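
A minimal sketch, comparing a numeric column across a hypothetical category:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'income': [30, 35, 38, 40, 500],
                   'gender': ['M', 'F', 'M', 'F', 'M']})
sns.boxplot(x='gender', y='income', data=df)  # outliers appear beyond the whiskers
plt.show()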


What are strategies to treat outliers? (capping, winsorization, removal)
Strategies to Treat Outliers

(Capping, Winsorization, Removal & More)

1️⃣ Remove Outliers

🔹 What it is | Completely delete rows containing outliers

✅ Best When | Outliers are errors, or make up <5% of data

❌ Avoid When | Data is small or outliers are important (e.g., fraud)

🧪 Python:

# Using IQR

Q1 = df['value'].quantile(0.25)

Q3 = df['value'].quantile(0.75)

IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR

upper = Q3 + 1.5 * IQR

df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]

2️⃣ Capping (Truncation)

🔹 What it is | Replace outliers with upper or lower thresholds (based on IQR or percentiles)

✅ Best When | You want to keep dataset size but reduce extreme influence

❌ Avoid When | Business logic demands raw extreme values

🧪 Python (95% cap):

lower_cap = df['value'].quantile(0.05)

upper_cap = df['value'].quantile(0.95)

df['value'] = df['value'].clip(lower_cap, upper_cap)


3️⃣ Winsorization

🔹 What it is | Like capping, but replaces the top/bottom X% with percentile values instead of fixed thresholds

✅ Best When | You want a robust, statistical way to reduce influence

🔧 Tool | scipy.stats.mstats.winsorize()

🧪 Python:

from scipy.stats.mstats import winsorize

df['value_winsor'] = winsorize(df['value'], limits=[0.05, 0.05]) # Cap bottom/top 5%

4️⃣ Transformation (Log / Sqrt)

🔹 What it is | Apply a mathematical transformation to reduce skew

✅ Best When | Right-skewed distributions or outliers that need soft adjustment

❌ Avoid When | Data contains zeros or negatives (for log)

🧪 Python:

import numpy as np

df['log_value'] = np.log1p(df['value']) # log1p = log(1 + x)

5. Distribution Analysis

How do you check if data is normally distributed?


 What is the empirical rule (68-95-99.7)?

6. Skewness & Kurtosis

 Define skewness and kurtosis, and explain what they imply about data distribution.

 How can skewness or kurtosis impact your model?

 What transformations help address skewness/kurtosis issues?

7. Correlation & Multicollinearity

 What is the difference between covariance and correlation?

 How do you detect and handle multicollinearity? (e.g., correlation matrix, VIF)
8. Feature Reduction

 How does PCA (Principal Component Analysis) work for dimensionality reduction?

9. Statistical Testing & Confidence

 Explain hypothesis testing (null/alternative), t-tests, chi-square, ANOVA, p-values, and confidence intervals.

 What is the Central Limit Theorem and why is it important?

10. Advanced & Miscellaneous

 What is autocorrelation, and how does it differ from correlation?

 Explain sampling distribution vs. probability distribution.

 What is the difference between one-tailed vs two-tailed hypothesis testing?

 Define type I vs type II errors.
