
📒 Detailed Notes: Data Science Methodology

🔹 What is Data Science Methodology?

Data Science Methodology is a systematic approach to solving real-world problems using data-
driven insights. It helps in making informed decisions by following structured steps.

Data Science is used in various fields like healthcare, finance, e-commerce, AI, and social sciences
to analyze data and derive actionable insights.

🔹 Key Steps in Data Science Methodology

1️⃣ Understanding the Problem (Define the Objective)

Before working with data, the first step is to clearly understand the problem you want to solve.

 Example: A company wants to predict customer churn (whether a customer will stop using the
service).
 Questions to ask:
o What is the goal of the analysis?
o What are the expected outcomes?
o What kind of data is needed?
o How will the insights help in decision-making?

2️⃣ Data Collection (Gathering Data)

Once the problem is defined, the next step is collecting relevant data from different sources.

 Types of Data Sources:
o Databases (SQL, NoSQL)
o APIs (social media, web scraping)
o Spreadsheets (CSV, Excel)
o Sensor Data (IoT devices)
 Example: If predicting customer churn, data might include customer purchase history, browsing
behavior, and customer support interactions.
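
As a quick illustration, loading data with pandas follows the same pattern across these sources. This is a minimal sketch; the file name customers.csv and its columns are made up:

import pandas as pd

# Create a tiny CSV so the example is self-contained
pd.DataFrame({"customer_id": [1, 2], "spend": [100, 250]}).to_csv("customers.csv", index=False)

# Load from a CSV file (spreadsheet-style source)
df = pd.read_csv("customers.csv")
print(df)

# Other sources follow the same pattern, e.g.:
#   pd.read_sql("SELECT * FROM customers", connection)   # SQL database
#   pd.read_excel("customers.xlsx")                      # Excel file
#   requests.get("https://api.example.com/...").json()   # web API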

3️⃣ Data Preparation (Cleaning and Preprocessing)

Raw data is often incomplete, inconsistent, or noisy and needs to be cleaned before analysis.

 Common Data Cleaning Steps:
o Handling Missing Data: Replace missing values with mean/median/mode or remove incomplete rows.
o Removing Duplicates: Ensures there are no repeated records.
o Fixing Data Formats: Convert date formats, remove inconsistencies in text data, etc.
o Feature Engineering: Creating new meaningful features from raw data.
 Example: A dataset with missing customer age values can be filled using the average age of existing customers.
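
A minimal pandas sketch of these cleaning steps (the column names below are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 25],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-01-05"],
    "spend": [100.0, 250.0, 250.0, 100.0],
})

# Handling missing data: fill missing ages with the mean age
df["age"] = df["age"].fillna(df["age"].mean())

# Removing duplicates
df = df.drop_duplicates()

# Fixing data formats: convert text dates to datetime
df["signup_date"] = pd.to_datetime(df["signup_date"])

# Feature engineering: derive a new feature from raw columns
df["spend_per_year_of_age"] = df["spend"] / df["age"]
print(df)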

4️⃣ Exploratory Data Analysis (EDA - Understanding Data Patterns)

EDA is the process of exploring data through visualization and statistical summaries.
 Why is EDA important?
o Helps in understanding trends, correlations, and outliers in data.
o Provides insights for feature selection and model building.
 Common Techniques Used in EDA:
o Summary Statistics: Mean, median, mode, standard deviation.
o Visualization:
 Histograms for distribution of numerical data.
 Scatter plots to show relationships between variables.
 Box plots to detect outliers.
o Correlation Matrix: Identifies relationships between numerical features.
 Example: A scatter plot might show that customers with higher transaction amounts tend to stay
longer with a company.
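
A short EDA sketch with pandas and matplotlib, using a small synthetic DataFrame in place of real customer data:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "transaction_amount": [120, 80, 300, 45, 500, 60],
    "months_active": [14, 6, 30, 3, 40, 5],
})

# Summary statistics: mean, std, quartiles, etc.
print(df.describe())

# Correlation matrix between numerical features
print(df.corr())

# Scatter plot to show the relationship between two variables
df.plot.scatter(x="transaction_amount", y="months_active")
plt.show()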

5️⃣ Model Building (Applying Machine Learning Models)

Once data is ready, machine learning models can be used to make predictions or classify data.

 Types of Machine Learning Models:
o Supervised Learning:
 Used when labeled data is available.
 Classification Models (for categorical outcomes):
 Decision Trees
 Random Forest
 Logistic Regression
 Neural Networks
 Regression Models (for continuous outcomes):
 Linear Regression
 Polynomial Regression
o Unsupervised Learning:
 Used when labels are not available (clustering, anomaly detection).
 Examples:
 K-Means Clustering
 Hierarchical Clustering
 Example: Predicting customer churn using a Random Forest model.
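
A minimal scikit-learn sketch of that churn example (synthetic data stands in for real customer records):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features, binary churn label
rng = np.random.default_rng(42)
X = rng.random((200, 2))
y = (X[:, 0] + rng.normal(0, 0.2, 200) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))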

6️⃣ Model Evaluation (Checking Performance of the Model)

After building a model, it is essential to evaluate its performance before using it in real-world applications.

Key Performance Metrics:

 For Classification Models:
o Accuracy: Overall percentage of correct predictions.
o Precision: The proportion of correctly predicted positive cases out of total predicted
positives.
o Recall: The proportion of actual positive cases correctly identified.
o F1 Score: A balance between precision and recall.
o ROC-AUC Score: Measures how well the model distinguishes between classes.
 For Regression Models:
o Mean Absolute Error (MAE): Measures average absolute difference between predicted and
actual values.
o Mean Squared Error (MSE): Penalizes larger errors more.
o R² Score (Coefficient of Determination): Measures how well the model fits the data.
 Example: A classification model with high accuracy but low recall may not be useful for a fraud
detection system, as it might miss many fraudulent cases.
7️⃣ Model Deployment (Using the Model in a Real-World Scenario)

Once the model performs well, it is deployed for real-time use.

 Deployment Methods:
o Web applications (Flask, Django)
o Cloud platforms (AWS, Google Cloud, Azure)
o Mobile applications
 Example: A customer churn prediction model can be integrated into a CRM system to send alerts to
sales teams.
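
A rough sketch of serving a saved scikit-learn model through Flask (the model file churn_model.pkl and the /predict route are hypothetical):

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("churn_model.pkl")  # hypothetical saved model file

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.4, 0.7]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()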

8️⃣ Model Monitoring and Maintenance

A deployed model must be continuously monitored and updated to maintain accuracy.

 Challenges:
o Model degrades over time due to changing data patterns (concept drift).
o Data distribution might shift, requiring retraining.
 Example: A stock price prediction model trained on 2023 data might not perform well in 2025 due to
economic changes.
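
One simple way to detect a shift in data distribution is a two-sample Kolmogorov-Smirnov test per feature. A minimal sketch with synthetic data (the 0.05 threshold is a common but arbitrary choice):

import numpy as np
from scipy.stats import ks_2samp

# Feature values seen at training time vs. values arriving in production
train_feature = np.random.normal(50, 10, 1000)
live_feature = np.random.normal(55, 12, 1000)  # shifted distribution

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:
    print("Distribution shift detected; consider retraining.")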

🔹 Model Validation Techniques


To ensure a model is reliable, it must be validated using different techniques.

1️⃣ Train-Test Split

 Splits data into training (70-80%) and testing (20-30%).
 Helps evaluate how well the model generalizes to new data.
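
A minimal sketch with scikit-learn (toy arrays stand in for real features and labels):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0, 1] * 5)           # toy labels

# 80% training, 20% testing (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2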

2️⃣ Cross-Validation (K-Fold Cross-Validation)

 Divides data into K equal parts and trains the model multiple times.
 Ensures that every data point is used for both training and validation.
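
A short K-fold sketch using cross_val_score (synthetic data, logistic regression as a stand-in model):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Fold scores:", scores)
print("Mean CV accuracy:", scores.mean())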

3️⃣ Bootstrapping

 Generates multiple samples with replacement to test model stability.
 Useful when data is limited.
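
A minimal sketch of drawing bootstrap samples with sklearn.utils.resample (toy data):

import numpy as np
from sklearn.utils import resample

data = np.array([2, 4, 6, 8, 10])

# Draw several bootstrap samples (sampling with replacement)
for i in range(3):
    sample = resample(data, replace=True, n_samples=len(data), random_state=i)
    print(f"Bootstrap sample {i}: {sample}, mean = {sample.mean():.1f}")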

🔹 Integrating Data Science Methodology into a Capstone Project

1. Identify a real-world problem (e.g., predicting student performance).
2. Collect and process data from educational records.
3. Perform exploratory data analysis (EDA) to find patterns.
4. Apply machine learning models to predict student grades.
5. Evaluate model performance using accuracy and recall.
6. Deploy the final model in an education analytics dashboard.
🧠 Mind Map: Data Science Methodology
📍 Data Science Methodology

Understanding the Problem → Collecting Data → Data Preparation → Exploratory Data Analysis → Model Building → Model Evaluation → Model Deployment → Model Monitoring & Updates

1️⃣ Calculating MSE & RMSE in MS Excel

MSE (Mean Squared Error) and RMSE (Root Mean Squared Error) are commonly used to evaluate regression
models.

Steps to Calculate MSE & RMSE in Excel:

1. Enter Actual and Predicted Values:
o Column A: Actual Values (Y_actual)
o Column B: Predicted Values (Y_pred)

2. Calculate Squared Errors:
o In Column C, use the formula:
= (A2 - B2)^2
o Drag the formula down to fill all rows.

3. Calculate MSE:
o Use the formula:
o = AVERAGE(C2:Cn)
o This computes the mean of squared errors.

4. Calculate RMSE:
o Use the formula:
o = SQRT(AVERAGE(C2:Cn))
o This gives the root mean squared error.

2️⃣ Calculating Precision, Recall, F1 Score & Accuracy from a Confusion Matrix

A confusion matrix typically looks like this:

                      Predicted Positive    Predicted Negative
Actual Positive       TP                    FN
Actual Negative       FP                    TN


Formulas to Calculate Metrics:

 Accuracy = (TP + TN) / (TP + FP + TN + FN)
 Precision = TP / (TP + FP)
 Recall = TP / (TP + FN)
 F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Example Calculation (Using Sample Values):

Let's assume:

 TP = 50
 FP = 10
 FN = 5
 TN = 100

Now, applying formulas:

 Accuracy = (50 + 100) / (50 + 10 + 100 + 5) = 150 / 165 ≈ 0.91 (91%)
 Precision = 50 / (50 + 10) = 0.83 (83%)
 Recall = 50 / (50 + 5) = 0.91 (91%)
 F1 Score = 2 × (0.83 × 0.91) / (0.83 + 0.91) = 0.87 (87%)

Steps to Compute These in MS Excel:

1. Enter TP, FP, FN, TN in separate cells.
2. Use formulas:
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
3. Format cells as percentages for readability.

3️⃣ Python Code to Evaluate a Model (Practical Use)

Here’s a Python script using sklearn.metrics to calculate MSE, RMSE, Accuracy, Precision, Recall, and F1
Score:
from sklearn.metrics import mean_squared_error, precision_score, recall_score, f1_score, accuracy_score
import numpy as np

# Sample Actual vs Predicted values (for MSE & RMSE)
y_actual = np.array([3, -0.5, 2, 7])
y_predicted = np.array([2.5, 0.0, 2, 8])

# Calculate MSE & RMSE
mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Sample Confusion Matrix values
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])  # Actual labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])  # Predicted labels

# Calculate Classification Metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

How to Use This Code in Practical Labs:

1. Copy and paste it into a Jupyter Notebook or Google Colab.
2. Run the script, and it will output the required evaluation metrics.
3. Modify the y_actual, y_predicted, y_true, and y_pred values as needed for testing.

Summary of What We Did

✅ MS Excel:

 Used formulas to calculate MSE & RMSE.
 Computed Precision, Recall, F1 Score, and Accuracy using confusion matrix formulas.

✅ Python Code:

 Used sklearn.metrics to automate evaluation of a model.
 Calculated both Regression (MSE, RMSE) and Classification (Accuracy, Precision, Recall, F1 Score) metrics.

Here’s a sample dataset for both Regression (MSE, RMSE) and Classification (Confusion Matrix,
Precision, Recall, F1 Score, Accuracy) calculations.
🔹 Regression Dataset (for MSE & RMSE Calculation)

Actual Values (Y_actual)    Predicted Values (Y_predicted)
10                          9
15                          14
20                          22
25                          23
30                          28
35                          36
40                          42
45                          47
50                          49
55                          53

📌 Steps to Use in MS Excel:

1. Copy and paste this table into an Excel sheet.
2. In Column C, calculate (Y_actual - Y_predicted)^2 using:

= (A2 - B2)^2

3. Compute MSE using:

= AVERAGE(C2:C11)

4. Compute RMSE using:

= SQRT(AVERAGE(C2:C11))

🔹 Classification Dataset (for Confusion Matrix Calculation)


Confusion Matrix Data (Binary Classification Example)

Actual Class (y_true) Predicted Class (y_pred)

1 (Positive) 1 (Correct)

0 (Negative) 0 (Correct)

1 (Positive) 0 (Wrong)

1 (Positive) 1 (Correct)

0 (Negative) 0 (Correct)

1 (Positive) 1 (Correct)
0 (Negative) 0 (Correct)

1 (Positive) 1 (Correct)

1 (Positive) 0 (Wrong)

0 (Negative) 0 (Correct)

📌 Steps to Calculate Precision, Recall, F1 Score, and Accuracy in Excel:

1. Find the counts for TP, FP, TN, FN:
o TP (True Positives) = 4
o TN (True Negatives) = 4
o FP (False Positives) = 0
o FN (False Negatives) = 2

2. Apply formulas:

o Accuracy:
= (TP + TN) / (TP + TN + FP + FN)
o Precision:
= TP / (TP + FP)
o Recall:
= TP / (TP + FN)
o F1 Score:
= 2 * (Precision * Recall) / (Precision + Recall)

🔹 Python Code to Evaluate the Dataset


import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score, f1_score

# Regression Dataset (MSE & RMSE)
y_actual = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55])
y_predicted = np.array([9, 14, 22, 23, 28, 36, 42, 47, 49, 53])

mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)

# Classification Dataset (Confusion Matrix Metrics)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])  # Actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])  # Predicted labels

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

🔹 Summary of How to Use the Dataset


✅ For Excel:

 Enter Actual vs Predicted Values for Regression → Calculate MSE & RMSE.
 Enter Confusion Matrix Data for Classification → Compute Precision, Recall, F1 Score, and Accuracy.

✅ For Python:

 Copy the provided Python script into Google Colab / Jupyter Notebook.
 Run the script to compute all evaluation metrics.
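
🔹 Advanced Model Evaluation Utilities (Python)

The helper functions below go beyond single-number metrics: learning curves, calibration checking, SHAP-based explanations, a bias-variance estimate, and statistical A/B test evaluation. Sample output from running them appears after the code.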

import numpy as np
import matplotlib.pyplot as plt  # available for plotting the returned curve data
import shap
from scipy import stats
from sklearn.model_selection import learning_curve
from sklearn.calibration import calibration_curve

def plot_learning_curves(estimator, X, y, cv=5):
    """
    Compute learning-curve statistics for a given model
    (returns the data needed to plot the curves).

    Parameters
    ----------
    estimator : sklearn estimator object
        The model to evaluate
    X : array-like
        Training features
    y : array-like
        Target values
    cv : int
        Number of cross-validation folds
    """
    train_sizes, train_scores, valid_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10))

    return {
        'train_sizes': train_sizes,
        'train_mean': np.mean(train_scores, axis=1),
        'train_std': np.std(train_scores, axis=1),
        'valid_mean': np.mean(valid_scores, axis=1),
        'valid_std': np.std(valid_scores, axis=1)
    }

def model_calibration_check(y_true, y_prob, n_bins=10):
    """
    Check model calibration and return calibration-curve points.

    Parameters
    ----------
    y_true : array-like
        True target values
    y_prob : array-like
        Predicted probabilities
    n_bins : int
        Number of bins for calibration curve
    """
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return {'prob_true': prob_true, 'prob_pred': prob_pred}

def explain_predictions(model, X, feature_names=None):
    """
    Generate SHAP values for model interpretability (tree-based models).

    Parameters
    ----------
    model : fitted tree-based model object
        The model to explain
    X : array-like
        Feature matrix
    feature_names : list, optional
        Feature names (useful when labeling summary plots)
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    return {
        'shap_values': shap_values,
        'expected_value': explainer.expected_value
    }

def evaluate_model_bias_variance(model, X, y, n_splits=5):
    """
    Evaluate model bias and variance by retraining on bootstrap
    samples and comparing predictions on one fixed test set.

    Parameters
    ----------
    model : sklearn estimator object
        The model to evaluate
    X : array-like
        Features
    y : array-like
        Target values
    n_splits : int
        Number of bootstrap training rounds
    """
    # One fixed train/test split: every round predicts on the same test
    # set, so the prediction arrays line up (a fresh random split each
    # round would produce test sets of different sizes).
    rng = np.random.default_rng(0)
    mask = rng.random(len(X)) < 0.8
    X_train, X_test = X[mask], X[~mask]
    y_train, y_test = y[mask], y[~mask]

    predictions = []
    for _ in range(n_splits):
        # Bootstrap-resample the training data and retrain
        idx = rng.integers(0, len(X_train), len(X_train))
        model.fit(X_train[idx], y_train[idx])
        predictions.append(model.predict(X_test))

    predictions = np.array(predictions)
    prediction_mean = np.mean(predictions, axis=0)

    return {
        'bias': np.mean((prediction_mean - y_test) ** 2),
        'variance': np.mean(np.var(predictions, axis=0)),
        'predictions': predictions
    }

def ab_test_evaluation(control_metrics, treatment_metrics,
                       confidence_level=0.95):
    """
    Perform statistical evaluation of A/B test results.

    Parameters
    ----------
    control_metrics : array-like
        Metrics from control group
    treatment_metrics : array-like
        Metrics from treatment group
    confidence_level : float
        Confidence level for statistical tests (compare the returned
        p-value against 1 - confidence_level)
    """
    # Two-sample t-test for a difference in means
    t_stat, p_value = stats.ttest_ind(control_metrics, treatment_metrics)

    # Effect size (Cohen's d) using a pooled standard deviation
    control_mean = np.mean(control_metrics)
    treatment_mean = np.mean(treatment_metrics)
    pooled_std = np.sqrt((np.var(control_metrics, ddof=1) +
                          np.var(treatment_metrics, ddof=1)) / 2)
    effect_size = (treatment_mean - control_mean) / pooled_std

    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'effect_size': effect_size,
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_improvement': (treatment_mean - control_mean) / control_mean * 100
    }

Sample output from a driver script exercising these utilities (illustrative values):

1. Generating Learning Curves...
Training score (final): 0.985
Validation score (final): 0.892

2. Checking Model Calibration...
Calibration score (MSE): 0.0124

3. Generating SHAP Values...
Top 5 most important features:
Feature 2: 0.3245
Feature 0: 0.2891
Feature 4: 0.2156
Feature 1: 0.1876
Feature 3: 0.1234

4. Evaluating Bias-Variance Trade-off...
Bias: 0.0856
Variance: 0.0234

5. A/B Test Evaluation Example...
T-statistic: -7.2345
P-value: 0.0001
Effect size: 0.8765
Relative improvement: 20.00%

Final Model Accuracy: 0.925
