Top 100 Data Analyst Interview Questions
1. What is the difference between INNER JOIN and LEFT JOIN in SQL?
INNER JOIN returns only matching rows; LEFT JOIN returns all rows from the left table and matches from the right,
with NULLs if no match.
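The difference is easiest to see on a tiny example. The sketch below uses Python's built-in sqlite3 module; the customers and orders tables and their rows are hypothetical, made up for illustration.

```python
import sqlite3

# Hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 50), (2, 75);
""")

# INNER JOIN: only customers that have a matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT JOIN: every customer; amount is None (NULL) when there is no order.
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(inner_rows)  # [('Ana', 50), ('Ben', 75)]
print(left_rows)   # [('Ana', 50), ('Ben', 75), ('Cy', None)]
conn.close()
```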
2. How would you find duplicate records in a SQL table?
SELECT column, COUNT(*)
FROM table
GROUP BY column
HAVING COUNT(*) > 1;
3. Explain normalization in SQL and why it’s important.
Normalization organizes data to reduce redundancy. Common forms:
● 1NF: Atomic columns
● 2NF: Remove partial dependencies
● 3NF: Remove transitive dependencies
It improves integrity and efficiency.
4. What’s the difference between WHERE and HAVING in SQL?
● WHERE: filters rows before grouping
● HAVING: filters groups after aggregation
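A minimal sketch of both filters in one query, run through Python's sqlite3 module. The orders table and the thresholds (10 and 100) are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('N', 5), ('N', 60), ('N', 70),
        ('S', 40), ('S', 30),
        ('E', 200), ('E', 8);
""")

big_regions = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10          -- row filter: drops small orders before grouping
    GROUP BY region
    HAVING SUM(amount) > 100    -- group filter: keeps regions whose total exceeds 100
    ORDER BY region
""").fetchall()

print(big_regions)  # [('E', 200), ('N', 130)]
conn.close()
```

Region S survives the WHERE step (both rows are at least 10) but is removed by HAVING because its total is only 70.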
5. Write a query to get the second highest salary from a table.
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
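The query above can be checked on sample data, including a tie at the top. This sketch runs it with Python's sqlite3 module; the salary values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('A', 100), ('B', 100), ('C', 90), ('D', 80);
""")

# The subquery finds the top salary; the outer MAX looks only below it,
# so a tie for first place does not hide the second-highest value.
second_highest = conn.execute("""
    SELECT MAX(salary)
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

print(second_highest)  # 90
conn.close()
```

If every row has the same salary, the query returns NULL (None in Python), which is usually the desired behavior.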
6. What is a window function in SQL?
Performs a calculation across a set of rows related to the current row without collapsing them into a single
output row (e.g., ROW_NUMBER(), RANK(), LAG(), SUM() OVER ()).
7. What is a CTE in SQL and when do you use it?
Common Table Expression (WITH clause) improves readability and simplifies nested queries.
8. How do you use VLOOKUP in Excel?
VLOOKUP is a function in Excel that searches for a value in the first column of a range and returns a value in
the same row from a specified column.
=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])
Used to find corresponding data from another column.
9. What is a Pivot Table used for in Excel?
Summarizes large datasets dynamically using drag-and-drop rows, columns, values, filters.
10. How do you remove duplicates from a column in Excel?
Select column → Data tab → Remove Duplicates
11. What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column) in SQL?
● COUNT(*): total rows
● COUNT(column): non-NULL values
● COUNT(DISTINCT column): unique non-NULL values
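The three forms give different answers as soon as a column contains NULLs or repeats. A small sketch with Python's sqlite3 module and a hypothetical responses table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE responses (city TEXT);
    INSERT INTO responses VALUES ('Paris'), ('Paris'), (NULL), ('Rome');
""")

total_rows, non_null, distinct_non_null = conn.execute("""
    SELECT COUNT(*), COUNT(city), COUNT(DISTINCT city)
    FROM responses
""").fetchone()

print(total_rows, non_null, distinct_non_null)  # 4 3 2
conn.close()
```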
12. How do you calculate the median in SQL?
Use PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column) in databases that support it.
13. In pandas, how do you drop rows with missing values?
df.dropna()
14. What is the use of .groupby() in pandas?
Groups data based on one or more columns to apply aggregation like sum, count, mean, etc.
15. How do you merge two datasets in pandas?
pd.merge(df1, df2, on='key', how='inner')
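The how argument controls which keys survive the merge. A short sketch with two made-up DataFrames (the column names name and amount are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
df2 = pd.DataFrame({"key": [1, 2, 4], "amount": [50, 75, 20]})

inner = pd.merge(df1, df2, on="key", how="inner")  # keys present in both: 1, 2
left = pd.merge(df1, df2, on="key", how="left")    # all keys from df1: 1, 2, 3
outer = pd.merge(df1, df2, on="key", how="outer")  # keys from either: 1, 2, 3, 4

print(len(inner), len(left), len(outer))
print(left)  # the row for key 3 has NaN in the amount column
```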
16. What is the difference between supervised and unsupervised learning?
● Supervised: labeled data (e.g., regression, classification)
● Unsupervised: no labels (e.g., clustering, PCA)
17. What is linear regression used for?
Predicts a continuous value by modeling the relationship between dependent and independent variables.
18. What is overfitting in machine learning?
When a model learns the training data too well, including noise, leading to poor generalization on unseen
data.
19. How would you handle missing values in a dataset?
Options:
● Drop rows (dropna())
● Fill with mean/median/mode (fillna())
● Use prediction models or interpolation
20. How do you visualize distributions in Python?
Use seaborn.histplot() to see the shape of the distribution and seaborn.boxplot() to see outliers and spread.
21. What is the difference between correlation and covariance?
● Covariance: direction of relationship
● Correlation: direction + strength (scaled -1 to 1)
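One way to see the difference is to change the units of a variable: covariance changes with the scale, correlation does not. The sketch below computes both by hand on made-up height and weight data.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Made-up data: the same heights expressed in metres and in centimetres.
height_m = [1.5, 1.6, 1.7, 1.8]
weight_kg = [50.0, 58.0, 63.0, 72.0]
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm)    # covariance is 100 times larger in centimetres
print(corr_m, corr_cm)  # correlation is the same in both units
```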
22. How would you detect outliers in a dataset?
● Boxplot (IQR method)
● Z-score method
● Visualization (scatter/histogram)
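A minimal sketch of the IQR method using only the standard library. The 1.5 multiplier is the usual convention, and the data is made up; note that quartile definitions differ slightly between tools, so borderline points may be flagged differently elsewhere.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 95]
print(iqr_outliers(data))  # [95]
```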
23. What does the COUNTIFS function do in Excel?
Counts rows that meet multiple criteria.
=COUNTIFS(range1, criteria1, range2, criteria2, ...)
24. What is data wrangling?
Transforming and cleaning raw data into a usable format: handling NA, renaming, reshaping, type
conversion.
25. What is the purpose of exploratory data analysis (EDA)?
To understand patterns, detect anomalies, check assumptions, and build intuition before modeling.
26. What is a JOIN in SQL?
A JOIN combines rows from two or more tables based on a related column. Common types include:
● INNER JOIN
● LEFT JOIN
● RIGHT JOIN
● FULL OUTER JOIN
27. What is a subquery in SQL?
A subquery is a query within another query, used to return a result for the main query, often in WHERE or
FROM clauses.
28. How would you find the unique values in a SQL column?
SELECT DISTINCT column_name FROM table;
29. What are aggregate functions in SQL?
Functions that summarize data, such as:
● COUNT()
● SUM()
● AVG()
● MIN()
● MAX()
30. Explain what a Normal Distribution is in statistics.
A symmetrical, bell-shaped probability distribution where most observations cluster around the mean.
31. What is the purpose of GROUP BY in SQL?
Groups rows that share the same value into summary rows, so that aggregate functions such as SUM(), COUNT(), or AVG() return one result per group.
32. How do you perform data cleaning in pandas?
Steps might include:
● Handling missing values (fillna(), dropna())
● Removing duplicates (drop_duplicates())
● Correcting data types (astype())
33. What are primary and foreign keys in SQL?
● Primary Key: Unique identifier for a record in a table
● Foreign Key: Links to a primary key in another table
34. How do you calculate the mode in SQL?
SELECT column, COUNT(*) AS freq
FROM table
GROUP BY column
ORDER BY freq DESC
LIMIT 1;
35. What is a UNION in SQL?
Combines results from two or more queries and eliminates duplicates.
SELECT column FROM table1
UNION
SELECT column FROM table2;
36. How do you use RANK() in SQL?
The RANK() function assigns a rank to each row within a partition of the result set. Tied rows share the same rank, and the next rank is skipped.
SELECT column1, RANK() OVER (ORDER BY column2 DESC) AS rnk
FROM table;
37. Explain the difference between a clustered and non-clustered index in SQL.
● Clustered Index: Determines the physical order of data in the table. Only one per table.
● Non-Clustered Index: A separate structure pointing to the data, multiple allowed.
38. How do you calculate cumulative sum in SQL?
SELECT column1,
SUM(column2) OVER (ORDER BY column1) AS cumulative_sum
FROM table;
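The running total can be tried out with Python's sqlite3 module. This sketch assumes SQLite 3.25 or newer (needed for window functions) and uses a hypothetical daily_sales table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day INTEGER, amount INTEGER);
    INSERT INTO daily_sales VALUES (1, 10), (2, 20), (3, 5), (4, 15);
""")

# SUM(...) OVER (ORDER BY day) adds up every row from the first day
# through the current one.
running = conn.execute("""
    SELECT day, amount,
           SUM(amount) OVER (ORDER BY day) AS cumulative_sum
    FROM daily_sales
    ORDER BY day
""").fetchall()

print(running)  # [(1, 10, 10), (2, 20, 30), (3, 5, 35), (4, 15, 50)]
conn.close()
```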
39. What is the difference between a procedure and a function in SQL?
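● Procedure: Executes a set of SQL statements and can modify data. It does not have to return a value, though it can return results through output parameters or result sets.
● Function: Returns a value (a scalar or, in some databases, a table) and can be used inside queries.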
40. How would you handle an imbalanced dataset in machine learning?
Techniques include:
● Resampling (over-sampling or under-sampling)
● Synthetic data generation (SMOTE)
● Adjusting class weights in models
41. What is a decision tree algorithm in machine learning?
A tree-like structure used for classification and regression tasks. Splits data into subsets based on feature
values.
42. How do you handle categorical variables in machine learning?
● Label Encoding
● One-Hot Encoding
● Frequency Encoding
43. What is cross-validation in machine learning?
Cross-validation splits the dataset into multiple folds to train and test the model on different subsets,
helping detect overfitting and giving a more reliable estimate of performance on unseen data.
44. What is precision and recall in classification problems?
● Precision: Measures the accuracy of positive predictions.
● Recall: Measures the ability to find all positive instances.
45. Explain the difference between linear regression and logistic regression.
● Linear Regression: Predicts continuous values.
● Logistic Regression: Predicts probabilities of categorical outcomes (usually binary).
46. How do you detect multicollinearity in a dataset?
Using:
● Correlation matrix
● Variance Inflation Factor (VIF)
47. What is the use of .apply() in pandas?
It allows you to apply a function along a specific axis (rows/columns) of a DataFrame.
48. How would you handle missing values in Excel?
● Manual fill: Replace with mean, median, or custom values.
● Data tools: Use Find & Replace or Go To Special for handling blanks.
49. How do you visualize a correlation matrix in Python?
Using seaborn.heatmap():
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
50. What is a confusion matrix in machine learning?
A table used to evaluate classification algorithms, showing the actual vs predicted classifications:
● True Positive (TP)
● False Positive (FP)
● True Negative (TN)
● False Negative (FN)
51. What is the purpose of using a "WHERE" clause in SQL?
The WHERE clause is used to filter records that meet specific conditions. It is applied before any aggregation
or grouping.
52. What is normalization in the context of data cleaning?
Normalization adjusts the scale of numeric data to a standard range (e.g., 0 to 1), which is important for
many machine learning models.
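A minimal sketch of min-max scaling to the 0 to 1 range in plain Python. The ages list is made up, and returning zeros for a constant column is one possible convention, not the only one.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale (assumed convention: all zeros).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 42, 66]
scaled = min_max_scale(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```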
53. What is a histogram and how do you interpret it?
A histogram shows the frequency distribution of a continuous variable. It helps identify patterns like
skewness, spread, or outliers.
54. How would you identify trends in a dataset?
By:
● Plotting time series data (using line charts)
● Using moving averages
● Checking correlations over time
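As an illustration of the moving-average approach, the sketch below smooths a made-up monthly series with a 3-period window; the window size is an assumption and would normally be chosen to match the seasonality of the data.

```python
def moving_average(values, window):
    """Simple moving average; returns one value per full window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

monthly_sales = [10, 12, 11, 15, 18, 17, 21]
trend = moving_average(monthly_sales, window=3)
print(trend)
```

The raw series dips in months 3 and 6, but the smoothed values rise at every step, which makes the upward trend easier to see.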
55. How do you use "GROUP BY" with an aggregate function in SQL?
You use GROUP BY to aggregate data based on a specified column. Example:
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
56. What is a pivot chart and when would you use it?
A Pivot Chart is a graphical representation of data summarized in a Pivot Table. It’s useful for visualizing
trends and patterns in grouped data.
57. How do you perform outlier detection in Excel?
● Use Conditional Formatting to highlight data points above or below a certain threshold.
● Use Z-Score or IQR method in formulas to find outliers.
58. What is the difference between a LEFT JOIN and an OUTER JOIN in SQL?
● LEFT JOIN: Returns all records from the left table, matched with records from the right table, filling
unmatched rows with NULL.
● OUTER JOIN: A general term for LEFT, RIGHT, and FULL OUTER joins, all of which keep unmatched rows from at least one table.
59. Explain the difference between covariance and correlation.
● Covariance: Measures the degree to which two variables change together.
● Correlation: A standardized version of covariance, scaled between -1 and 1.
60. What is a "case statement" in SQL?
A CASE statement allows conditional logic within SQL queries to return different values based on conditions.
Example:
SELECT name,
CASE
WHEN age >= 18 THEN 'Adult'
ELSE 'Minor'
END AS status
FROM users;
61. What is the role of a "foreign key" in SQL?
A foreign key is a column (or set of columns) in one table that references the primary key of another table,
establishing a relationship between them and enforcing referential integrity.
62. How do you calculate the percentile in SQL?
In SQL, you can use PERCENTILE_CONT() for calculating percentiles.
Example:
SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY column) AS percentile_90
FROM table;
63. Explain the concept of data wrangling.
Data wrangling is the process of cleaning, structuring, and transforming raw data into a usable format for
analysis, often involving missing data handling, type conversion, and reshaping.
64. What are some common aggregation functions used in SQL?
● SUM()
● AVG()
● COUNT()
● MAX()
● MIN()
65. What is the difference between "UNION" and "UNION ALL" in SQL?
● UNION: Combines results and removes duplicates.
● UNION ALL: Combines results and includes duplicates.
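A quick comparison using Python's sqlite3 module, with two hypothetical single-column tables that share the value 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (n INTEGER);
    CREATE TABLE t2 (n INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (2), (3);
""")

union_rows = conn.execute(
    "SELECT n FROM t1 UNION SELECT n FROM t2 ORDER BY n"
).fetchall()
union_all_rows = conn.execute(
    "SELECT n FROM t1 UNION ALL SELECT n FROM t2 ORDER BY n"
).fetchall()

print(union_rows)      # [(1,), (2,), (3,)]
print(union_all_rows)  # [(1,), (2,), (2,), (3,)]
conn.close()
```

UNION ALL is usually faster because it skips the duplicate-removal step, so it is the better choice when duplicates are impossible or wanted.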
66. What is the use of "HAVING" in SQL?
HAVING is used to filter data after it’s grouped, often with aggregate functions like SUM(), COUNT(), or
AVG().
67. How do you perform linear regression in Python?
Using the LinearRegression class from sklearn:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
68. How do you handle categorical data in machine learning?
● Label Encoding: Converts categories into numbers.
● One-Hot Encoding: Creates binary columns for each category.
69. What is feature scaling and why is it important?
Feature scaling standardizes the range of features (using methods like MinMax Scaling or Standardization).
It’s important for algorithms like K-means or SVM that rely on distance.
70. Explain the purpose of using "RANK()" and "DENSE_RANK()" in SQL.
● RANK(): Assigns a rank but leaves gaps in ranking when there are ties.
● DENSE_RANK(): Assigns a rank without gaps, even when there are ties.
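The gap behavior shows up as soon as two rows tie. This sketch runs both functions side by side with Python's sqlite3 module; it assumes SQLite 3.25 or newer, and the scores table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('A', 90), ('B', 90), ('C', 80), ('D', 70);
""")

ranked = conn.execute("""
    SELECT name, score,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for row in ranked:
    print(row)
# ('A', 90, 1, 1)
# ('B', 90, 1, 1)
# ('C', 80, 3, 2)   <- RANK skips 2, DENSE_RANK does not
# ('D', 70, 4, 3)
conn.close()
```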
71. What is the difference between supervised and unsupervised learning?
● Supervised Learning: Uses labeled data for training. Common algorithms: Decision Trees, Linear
Regression.
● Unsupervised Learning: No labeled data, used to identify patterns or groupings. Common
algorithms: K-means, PCA.
72. How do you calculate the Z-score of a dataset?
Z-score is calculated as:
Z = (X - mean) / standard deviation
It shows how far a data point is from the mean in terms of standard deviations.
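A small worked example with the standard library. It uses the population standard deviation (statistics.pstdev); using the sample standard deviation instead is an equally common choice and gives slightly smaller Z-scores.

```python
import statistics

def z_scores(values):
    """Z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z = z_scores(data)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```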
73. What is the difference between classification and regression?
● Classification: Predicts discrete labels (e.g., spam vs. not spam).
● Regression: Predicts continuous values (e.g., house prices).
74. How do you create a scatter plot in Python?
Using matplotlib:
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.show()
75. What is the purpose of "Logistic Regression"?
Used for classification tasks, most often binary ones where the output is either 0 or 1. It predicts class
probabilities using the logistic (sigmoid) function.
76. What is the difference between bagging and boosting?
● Bagging: Reduces variance by training multiple models independently and averaging predictions
(e.g., Random Forest).
● Boosting: Reduces bias by sequentially training models, each correcting the previous one's errors
(e.g., XGBoost, AdaBoost).
77. What is a box plot and how do you interpret it?
A box plot shows the distribution of data based on quartiles. It highlights the median, interquartile range
(IQR), and outliers.
78. What are the key assumptions of linear regression?
● Linearity
● Independence of errors
● Homoscedasticity (constant variance)
● Normality of errors
79. What is the purpose of cross-validation in machine learning?
Cross-validation splits the data into multiple subsets to evaluate the model’s performance across different
data sets, helping to detect overfitting.
80. How do you perform a Chi-Square test in statistics?
The Chi-Square test evaluates if there is a significant association between categorical variables.
Formula:
χ² = Σ[(O - E)² / E]
Where O = observed, E = expected frequency.
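The formula can be worked through by hand for a contingency table, where each expected count is (row total × column total) / grand total. The sketch below does this in plain Python for a made-up 2×2 table; it computes only the statistic, not the p-value.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# Made-up counts: rows are two groups, columns are two outcomes.
table = [[30, 10], [20, 40]]
chi2 = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom
print(chi2, dof)
```

Here the expected counts are 20, 20, 30, 30, giving a statistic of about 16.67 with 1 degree of freedom. That is well above 3.841, the critical value at the 0.05 level for 1 degree of freedom, so the association would be considered significant.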
81. What are the benefits of using Power BI over Excel for reporting?
Power BI provides more powerful data processing, visualization, and interactive dashboards, allowing
integration from multiple data sources with real-time updates.
82. What is the purpose of a confusion matrix in machine learning?
A confusion matrix is a table that helps evaluate the performance of a classification model by comparing
actual and predicted values.
83. How would you deal with missing values in a time series dataset?
● Fill gaps using linear interpolation or forward fill (carrying the last known value forward).
● Use time-based imputation methods to fill in missing data points.
84. How do you calculate precision, recall, and F1-score in a classification problem?
● Precision = TP / (TP + FP)
● Recall = TP / (TP + FN)
● F1-score = 2 * (Precision * Recall) / (Precision + Recall)
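A minimal sketch that counts TP, FP, and FN from two label lists and applies the formulas above. The labels are made up, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # TP = 2, FN = 2, FP = 1

precision, recall, f1 = classification_metrics(y_true, y_pred)
print(precision, recall, f1)
```

With these labels precision is 2/3, recall is 1/2, and the F1-score is 4/7 (about 0.571).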
85. How would you visualize a time series dataset in Python?
Using matplotlib or seaborn to plot data over time:
import matplotlib.pyplot as plt
plt.plot(time, data)
plt.show()
86. What is an ROC curve and what does it represent?
An ROC (Receiver Operating Characteristic) curve is used to evaluate classification models. It plots the True
Positive Rate (TPR) vs. False Positive Rate (FPR) at various thresholds.
87. What is multicollinearity and how do you detect it?
Multicollinearity occurs when two or more predictor variables are highly correlated. It can be detected using
the Variance Inflation Factor (VIF).
88. What is the difference between deep learning and machine learning?
● Machine Learning: Uses algorithms to learn patterns from data.
● Deep Learning: A subset of ML with neural networks that learn from large amounts of data,
especially for complex tasks like image recognition.
89. What is PCA (Principal Component Analysis)?
PCA is a dimensionality reduction technique that transforms correlated variables into a smaller set of
uncorrelated variables called principal components.
90. What are precision-recall curves, and when should they be used?
Precision-recall curves are used in classification problems with imbalanced datasets. They help evaluate the
trade-off between precision and recall at different thresholds.
91. How would you handle large datasets in Python?
● Use dask for parallel processing of large datasets.
● Use chunksize while reading large CSV files to avoid memory overload.
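A short sketch of the chunksize approach. An in-memory CSV stands in for a large file here, and the id and amount columns are made up; with a real file, the path would be passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for a large CSV file: 10 rows with hypothetical columns.
csv_data = io.StringIO(
    "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 11))
)

total = 0
n_chunks = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["amount"].sum()
    n_chunks += 1

print(total, n_chunks)  # 110 3
```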
92. How do you handle categorical variables in Excel?
By creating dummy variables (one-hot encoding) or using VLOOKUP to map categories to numeric values.
93. What is a time series analysis and when would you use it?
Time series analysis is used to analyze data points collected or recorded at specific time intervals. It’s useful
for forecasting and detecting trends.
94. What is the difference between L1 and L2 regularization?
● L1 Regularization: Adds the absolute value of the coefficients to the loss function, promoting
sparsity (Lasso).
● L2 Regularization: Adds the squared value of the coefficients, reducing large coefficients but not
zeroing them out (Ridge).
95. What is clustering in machine learning?
Clustering is an unsupervised learning technique used to group similar data points together based on
certain characteristics (e.g., K-means, DBSCAN).
96. What is the difference between a sample and a population in statistics?
● Population: Entire set of data points you are interested in.
● Sample: A subset of the population used for analysis.
97. What is the significance of the p-value in hypothesis testing?
The p-value is the probability of observing results at least as extreme as the data, assuming the null
hypothesis is true. A lower p-value indicates stronger evidence against the null hypothesis.
98. How do you identify the best-fit line in a scatter plot?
By applying linear regression, which finds the line that minimizes the sum of squared residuals between actual
and predicted values, and plotting that line over the scatter plot.
99. How do you evaluate the performance of a regression model?
● R² (R-squared) value indicates the proportion of variance explained by the model.
● Mean Absolute Error (MAE) and Mean Squared Error (MSE) for error measurement.
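These metrics are simple enough to compute by hand. The sketch below does so in plain Python on made-up actual and predicted values.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and R-squared for paired actual and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return {"mae": mae, "mse": ss_res / n, "r2": 1 - ss_res / ss_tot}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)  # MAE 0.25, MSE 0.125, R-squared 0.975
```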
100. How would you interpret a correlation coefficient of 0.8?
A correlation coefficient of 0.8 indicates a strong positive linear relationship between two variables.