Cheatsheets: https://www.emmading.com/free-data-science-interview-resources?cid=091a9f1e-48c3-4e52-9936-4a568ab0cc30
Note: for analyzing relationships between variables during EDA, see the detailed EDA section below.
The Data Science Lifecycle consists of several structured steps, starting from obtaining raw data to
deploying a model and monitoring its performance. Below is a detailed breakdown of the entire
lifecycle of a Data Science project:
1. Problem Definition & Business Understanding
Before diving into the data, it's crucial to define the problem statement and understand business
objectives.
Key Steps:
Identify the business problem (e.g., predicting customer churn, fraud detection).
Define KPIs (Key Performance Indicators) to measure success.
Engage with stakeholders (business teams, clients, etc.) to understand requirements.
Define constraints (e.g., computing resources, accuracy expectations).
📌 Example: A bank wants to predict whether a customer will default on a loan. The success metric
could be improving loan approval accuracy while minimizing false positives.
2. Data Collection (Raw Data Acquisition)
Gather raw data from various sources. The quality of data affects model performance.
Data Sources:
Databases: SQL, NoSQL (MySQL, PostgreSQL, MongoDB).
APIs & Web Scraping: Using REST APIs, BeautifulSoup, Scrapy.
Files: CSV, Excel, JSON, XML.
Logs & Sensor Data: System logs, IoT data.
Third-Party Data: Public datasets (Kaggle, UCI Machine Learning, Open Data).
📌 Example: For a loan default prediction model, data might be collected from bank transaction
records, credit scores, customer demographics, and employment history.
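As a rough illustration, raw data from several sources can be pulled together in Python with pandas and requests. The endpoint, token, file name, and column names below are hypothetical stand-ins, not a real bank API:

```python
import pandas as pd
import requests

# Load customer demographics from a local CSV file (hypothetical file name).
customers = pd.read_csv("customer_demographics.csv")

# Pull transaction records from a REST API (hypothetical endpoint and token),
# assuming it returns a JSON list of records.
response = requests.get(
    "https://api.example-bank.com/v1/transactions",
    headers={"Authorization": "Bearer <API_TOKEN>"},
    timeout=30,
)
response.raise_for_status()
transactions = pd.DataFrame(response.json())

# Combine the sources on a shared customer identifier (assumed column name).
raw_data = customers.merge(transactions, on="customer_id", how="left")
print(raw_data.shape)
```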
3. Data Exploration & Cleaning (Data Preprocessing)
Raw data is often messy! Cleaning and structuring the data is essential.
Key Steps:
✅ Handling Missing Data: Impute (mean/median/mode) or drop missing values.
✅ Handling Outliers: Use box plots, Z-score, IQR methods to detect and remove them.
✅ Data Type Conversion: Convert categorical to numerical (one-hot encoding, label encoding).
✅ Dealing with Duplicates: Remove redundant entries.
✅ Data Transformation: Scaling (MinMax, StandardScaler), normalization, log transformation.
📌 Example: If a customer’s income is missing, we might fill it using the median salary for similar
profiles.
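A minimal cleaning sketch in Python covering the steps above; the file name and column names ("income", "loan_amount", "employment_type") are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan_data.csv")  # hypothetical raw loan dataset

# 1. Missing values: median imputation for a numeric column.
df["income"] = df["income"].fillna(df["income"].median())

# 2. Outliers: cap values outside the 1.5 * IQR range.
q1, q3 = df["loan_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["loan_amount"] = df["loan_amount"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Duplicates: drop exact duplicate rows.
df = df.drop_duplicates()

# 4. Categorical to numeric: one-hot encode a categorical column.
df = pd.get_dummies(df, columns=["employment_type"], drop_first=True)

# 5. Scaling: standardize numeric features.
scaler = StandardScaler()
df[["income", "loan_amount"]] = scaler.fit_transform(df[["income", "loan_amount"]])
```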
4. Exploratory Data Analysis (EDA)
EDA helps understand patterns, relationships, and distributions within data.
Key Techniques:
📊 Univariate Analysis – Histograms, box plots, KDE plots.
📉 Bivariate Analysis – Correlation heatmaps, scatter plots.
📈 Multivariate Analysis – Pairplots, PCA for dimensionality reduction.
📌 Feature Engineering – Creating new variables (e.g., credit-to-income ratio).
📌 Example: We might discover that customers with higher credit scores rarely default—a key insight
for model training.
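A quick EDA sketch with matplotlib and seaborn, assuming the cleaned DataFrame `df` from the previous step and hypothetical columns such as "credit_score" and a binary "default" target:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: distribution of credit scores.
sns.histplot(df["credit_score"], kde=True)
plt.show()

# Bivariate: correlation heatmap of numeric features.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Multivariate: pairwise relationships, colored by the target.
sns.pairplot(df[["credit_score", "income", "loan_amount", "default"]], hue="default")
plt.show()
```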
5. Feature Selection & Engineering
Choosing the right features improves model performance.
Feature Selection Techniques:
Statistical Methods: Correlation, Chi-square test, ANOVA.
Dimensionality Reduction: PCA, t-SNE, LDA.
Domain Knowledge: Using business insights to engineer new features.
📌 Example: Instead of using raw salary and loan amount, we create a new feature: Debt-to-Income
Ratio.
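A small sketch of engineering the new feature and applying a statistical selection method; it assumes `df` already contains only numeric features plus a binary "default" target (column names hypothetical):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Domain-knowledge feature: debt-to-income ratio.
df["debt_to_income_ratio"] = df["loan_amount"] / df["income"]

# Statistical selection: keep the k features most associated with the
# target according to ANOVA F-scores.
X = df.drop(columns=["default"])
y = df["default"]
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])
```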
6. Model Selection & Training
Choose the right model based on the problem type.
Types of Models:
Supervised Learning:
- Classification: Logistic Regression, Random Forest, XGBoost, Neural Networks.
- Regression: Linear Regression, Decision Trees, Gradient Boosting.
Unsupervised Learning:
- Clustering: K-Means, DBSCAN.
- Anomaly Detection: Isolation Forest, Autoencoders.
Deep Learning:
- CNNs for image classification, LSTMs for time series.
📌 Example: For loan default prediction (binary classification), we might start with Logistic Regression
and later try Random Forest for better performance.
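A training sketch for exactly that progression, assuming the selected feature matrix `X` and target `y` from the previous step:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out a stratified test set for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline model: logistic regression.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Stronger candidate: random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Forest accuracy:  ", forest.score(X_test, y_test))
```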
7. Model Evaluation
Evaluate the model’s performance using appropriate metrics.
Common Metrics:
Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.
Regression: RMSE (Root Mean Squared Error), R² (Coefficient of Determination).
Clustering: Silhouette Score, Davies–Bouldin Index.
📌 Example: A high recall is crucial in fraud detection since missing fraud cases is costly.
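A short evaluation sketch for the classification case, reusing the `forest` model and test split from the previous step:

```python
from sklearn.metrics import classification_report, roc_auc_score

y_pred = forest.predict(X_test)
y_proba = forest.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))        # precision, recall, F1 per class
print("ROC-AUC:", roc_auc_score(y_test, y_proba))   # threshold-independent ranking quality
```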
8. Hyperparameter Tuning & Optimization
Optimize the model by tuning hyperparameters.
Techniques:
Grid Search: Exhaustive search over parameter combinations.
Random Search: Randomly selects parameter values for faster tuning.
Bayesian Optimization: Uses probability-based methods for efficient tuning.
AutoML: Automated hyperparameter tuning using libraries like TPOT, AutoKeras.
📌 Example: Optimizing the number of trees in a Random Forest model to improve accuracy.
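A grid search sketch for that exact example, assuming the training split from step 6; the parameter grid is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Exhaustive search over a small random forest grid, scored by ROC-AUC
# with 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```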
9. Model Deployment
Deploy the trained model into a real-world environment.
Deployment Methods:
🚀 Web Services: Flask, FastAPI, Django REST API.
☁️ Cloud Deployment: AWS SageMaker, Google AI Platform, Azure ML.
Edge Deployment: Deploying on mobile devices or IoT devices.
📌 Example: A fraud detection model might be integrated into a banking system to flag suspicious
transactions in real time.
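A minimal FastAPI sketch of serving such a model as a web endpoint; the model file, feature names, and the 0.5 flagging threshold are illustrative assumptions:

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("fraud_model.joblib")  # hypothetical serialized model

class Transaction(BaseModel):
    amount: float
    merchant_category: int
    hour_of_day: int

@app.post("/predict")
def predict(txn: Transaction):
    features = pd.DataFrame([dict(txn)])
    probability = float(model.predict_proba(features)[0, 1])
    return {"fraud_probability": probability, "flagged": probability > 0.5}
```

In practice such a service would be run behind an ASGI server such as Uvicorn and called by the banking system for each incoming transaction.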
10. Monitoring & Maintenance
Once deployed, the model must be monitored for performance degradation.
Key Tasks:
📉 Drift Detection: Monitor if input data distribution changes over time.
📊 Performance Tracking: Log model accuracy, precision, recall, etc.
🔄 Retraining: Periodically retrain the model with new data.
📌 Example: If a credit scoring model starts underperforming, we update it with the latest customer
data.
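One simple way to check for input drift is a two-sample statistical test per feature. The sketch below uses the Kolmogorov-Smirnov test from SciPy; the income arrays are synthetic stand-ins for training-time versus current data:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> bool:
    """Flag drift when the current feature distribution differs
    significantly from the reference (training-time) distribution."""
    statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Synthetic example: compare training-time incomes with this month's customers.
reference_income = np.random.lognormal(10.0, 0.5, size=5000)
current_income = np.random.lognormal(10.3, 0.5, size=5000)
print("Drift detected:", detect_drift(reference_income, current_income))
```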
11. Model Explainability & Ethical Considerations
Ensure transparency and fairness in AI models.
Techniques:
SHAP & LIME: Explain how the model makes decisions.
Bias Detection: Check if the model unfairly favors certain groups.
Regulatory Compliance: GDPR, AI Ethics Guidelines.
📌 Example: If a loan approval model discriminates based on gender, we need to remove the
sensitive attribute or apply bias-mitigation techniques before redeploying the model.
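A short SHAP sketch for the loan model (the `forest` model and `X_test` are assumed from earlier steps; the exact return shape of `shap_values` varies between shap versions, hence the defensive handling):

```python
import shap

# Explain the random forest's predictions on a sample of test rows.
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X_test.iloc[:100])

# Some shap versions return one array per class for binary classifiers;
# keep the positive ("default") class if so.
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif getattr(shap_values, "ndim", 2) == 3:
    shap_values = shap_values[:, :, 1]

# Global view: which features push predictions up or down.
shap.summary_plot(shap_values, X_test.iloc[:100])
```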
12. Continuous Improvement
The model lifecycle never stops!
Gather feedback from business users.
Improve the feature set with new data.
Optimize performance with better models or tuning.
📌 Example: A recommendation system such as Netflix's movie recommender is continuously updated
based on user interactions.
Exploratory Data Analysis (EDA) – Detailed
EDA is about understanding your data before modeling. It helps uncover patterns, spot anomalies,
and guide feature selection.
Practical Steps:
✅ Understand Data Types
Identify numerical (e.g., age, salary) vs. categorical (e.g., gender, country) columns.
Check if any column is wrongly classified (e.g., dates stored as text).
✅ Check for Missing Values
Identify missing values in each column.
Decide whether to remove rows/columns or fill missing data (e.g., mean/median for
numbers, mode for categories).
Scenario | Best Strategy
<5% missing values | Mode, mean, or median imputation
>50% missing in a column | Drop the column
Categorical missing values | Fill with mode or "Unknown"
Time-series data | Forward fill, backward fill, or interpolation
High correlation with other features | Predictive imputation (ML models)
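Translating the table into pandas, using hypothetical column names:

```python
import pandas as pd

# df is the dataset being cleaned; the columns below are illustrative.
df["age"] = df["age"].fillna(df["age"].median())        # <5% missing, numeric
df = df.drop(columns=["rarely_filled_field"])           # >50% missing: drop the column
df["city"] = df["city"].fillna("Unknown")               # categorical missing values
df["daily_balance"] = df["daily_balance"].ffill()       # time series: forward fill
```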
✅ Look for Duplicates
Check for repeated rows and remove them if necessary.
✅ Identify Outliers
Use box plots or histograms to detect extreme values.
Decide whether to remove, transform, or cap them.
✅ Check Distribution of Data
Use histograms or KDE plots to understand how numerical data is spread.
See if data is skewed (right or left).
✅ Analyze Relationships Between Variables
Use scatter plots for numerical variables (e.g., income vs. spending).
Use correlation heatmaps to see which variables are strongly related.
Use pivot tables or group-by functions to analyze categorical variables.
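For example (`df` and the column names are hypothetical):

```python
import pandas as pd

# Numerical vs. numerical: scatter plot and correlation.
df.plot.scatter(x="income", y="spending")
print(df[["income", "spending"]].corr())

# Categorical vs. numerical: group-by and pivot-table summaries.
print(df.groupby("region")["spending"].mean())
print(pd.pivot_table(df, values="spending", index="region",
                     columns="age_group", aggfunc="mean"))
```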
✅ Segment Data for Better Insights
Group data based on categories (e.g., analyze customer behavior by region or age group).
✅ Check for Imbalanced Data (for Classification Problems)
See if one class dominates the others (e.g., 90% "No Fraud" vs. 10% "Fraud").
Consider balancing techniques (oversampling, undersampling).
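A quick balance check and a simple oversampling sketch (the target column "is_fraud" is hypothetical; dedicated libraries such as imbalanced-learn offer SMOTE and other techniques):

```python
import pandas as pd
from sklearn.utils import resample

# How skewed is the target?
print(df["is_fraud"].value_counts(normalize=True))

# Naive oversampling: duplicate minority-class rows until classes match.
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["is_fraud"].value_counts())
```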
Feature Selection
Once you understand your data, the next step is choosing the right features and creating new ones
to improve model performance.
Feature Selection (Keeping Only Useful Variables)
✅ Remove Unnecessary Columns
Drop columns that don’t contribute (e.g., ID numbers, unnecessary text fields).
✅ Check for High Correlation
Remove redundant features (e.g., if "height" and "BMI" are strongly correlated, keep one).
✅ Use Statistical Tests
Use techniques like ANOVA, chi-square tests, or mutual information to pick the best
features.
✅ Use Automated Methods
Try feature selection methods like Recursive Feature Elimination (RFE) or LASSO Regression.
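A sketch of both automated approaches, assuming a numeric feature matrix `X` and binary target `y` from earlier steps:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive Feature Elimination: repeatedly drop the weakest feature
# until only the requested number remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(X, y)
print(X.columns[rfe.support_])

# LASSO-style alternative: an L1-penalized model shrinks weak coefficients to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(X.columns[(l1_model.coef_ != 0).ravel()])
```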
Feature Engineering (Creating New Features)
✅ Convert Categorical Data to Numeric
Use One-Hot Encoding (turn categories into multiple columns).
Use Label Encoding (assign numbers to categories).
✅ Create Meaningful Features
Combine existing columns (e.g., "loan amount" ÷ "income" to get debt-to-income ratio).
Extract insights from dates (e.g., "year of joining" → "years of experience").
✅ Transform Features for Better Distribution
Apply log transformation if a feature is highly skewed.
Use scaling techniques (MinMaxScaler, StandardScaler) for better model performance.
✅ Handle Text Data (If Needed)
Convert text into numerical data using TF-IDF or Word Embeddings.
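A minimal TF-IDF sketch; the text column name is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn a free-text column into a sparse matrix of TF-IDF features.
vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
text_features = vectorizer.fit_transform(df["loan_purpose_text"])
print(text_features.shape)  # (rows, up to 500 TF-IDF columns)
```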