📘 Course Introduction – Data Analysis
with Python
👨🏫 Instructor: Joseph
🔑 Overview
This course teaches how to analyze data with Python using industry-standard libraries and
apply machine learning models on real-world datasets.
🛠 Key Libraries Covered
● NumPy → Numerical computations
● Pandas → Data wrangling & analysis
● Scikit-learn (sklearn) → Machine learning models
📂 Module Breakdown
Module 1: Getting Started with Data
● Understand dataset characteristics
● Overview of Python packages for data analysis
● Import & start analyzing data
Module 2: Data Wrangling & Preprocessing
● Handling missing values
● Data formatting
● Data normalization
Module 3: Exploratory Data Analysis (EDA)
● Descriptive statistics
● GroupBy operations
● Correlation & other statistics
Module 4: Regression & Prediction Models
● Linear regression
● Polynomial regression
● Pipelines
● Model evaluation (in-sample & prediction)
● Decision making
Module 5: Model Evaluation & Refinement
● Overfitting vs. underfitting
● Model selection
● Ridge regression
● Grid search
Final Project
● Hands-on with a real-world dataset
● Apply full workflow: preprocessing → analysis → modeling → evaluation
📋 Prerequisites
● Python programming knowledge
● High school-level math
✅ Key Outcome: By the end, you’ll know how to import, wrangle, analyze, and model
real-world datasets using Python.
📊 Used Car Prices Dataset (Jeffrey C. Schlimmer)
🔹 Dataset Format
● Open dataset (CSV format – comma separated values).
● Each line = 1 row of data.
● Note: First row is not a header, but actual data.
🔹 Documentation of Columns (26 total)
Each column = feature/attribute of the car.
A few key attributes explained:
1. Symboling
○ Insurance risk level indicator.
○ Scale: -3 (very safe) → +3 (very risky).
○ Adjusted based on automobile risk.
2. Normalized Losses
○ Relative average loss payment per insured vehicle per year.
○ Normalized by size classification (2-door small, station wagon, sports car, etc.).
○ Range: 65 – 256.
3. Other Attributes
○ Make, body style, engine type, horsepower, dimensions, etc.
○ Easy to understand (check documentation for full details).
🔹 Target Variable
● 26th Attribute = Price
● This is the label (value to predict).
● Predictors = all other variables (e.g., symboling, normalized losses, make, etc.).
🔹 Goal of Project
● Build a model to predict car price using the 25 other features.
🔹 Extra Notes
● Dataset is from 1985 → car prices appear lower than today.
● Purpose: Learn data analysis & prediction techniques, not actual market values.
📌 Python Libraries for Data Analysis
🔹 What is a Python Library?
● A library = collection of functions & methods that let you perform tasks without writing
code from scratch.
● They contain built-in modules offering specific functionalities.
● Many libraries exist for data analysis, math, visualization, and machine learning.
📊 Groups of Python Data Analysis Libraries
1️⃣ Scientific Computing Libraries
● Pandas
○ Provides data structures & tools for manipulation and analysis.
○ Core structure: DataFrame (2D table with labeled rows & columns).
○ Key features: fast access, easy indexing, structured data handling.
● NumPy
○ Works with arrays (inputs & outputs).
○ Can handle matrices with small code changes.
○ Enables fast array processing (better performance than lists).
● SciPy
○ Builds on NumPy.
○ Provides functions for advanced math problems (linear algebra, optimization,
integration, etc.).
○ Includes tools for data visualization.
2️⃣ Data Visualization Libraries
● Matplotlib
○ Most well-known Python plotting library.
○ Creates graphs, charts, and plots.
○ Highly customizable for styling and formatting.
● Seaborn
○ Built on Matplotlib.
○ Higher-level, simpler to use.
○ Generates heatmaps, time series, violin plots, etc.
○ Great for statistical visualization.
3️⃣ Machine Learning & Statistical Modeling Libraries
● Scikit-learn (sklearn)
○ Built on NumPy, SciPy, and Matplotlib.
○ Provides tools for:
■ Regression
■ Classification
■ Clustering
■ Other ML tasks
● Statsmodels
○ Focused on statistical analysis.
○ Lets you:
■ Explore datasets
■ Estimate statistical models
■ Perform statistical tests
✅ Summary:
● Pandas, NumPy, SciPy → Scientific computing & manipulation
● Matplotlib, Seaborn → Visualization
● Scikit-learn, Statsmodels → Machine learning & statistics
📌 Reading Data Using Python’s Pandas
Package
🔹 Data Acquisition
● Data acquisition = loading and reading data into a notebook from different sources.
● Two important factors:
1. Format → how the data is encoded (file type).
■ Examples: CSV, JSON, XLSX, HDF, etc.
2. File Path → where the data is stored.
■ Could be local (your computer) or online (web address).
🔹 Example: Used Car Dataset
● Found online (CSV format → values separated by commas).
● Each row = one data point (car), with multiple properties/features.
🔹 Reading Data with Pandas
● Use pandas.read_csv() to load CSV files into a DataFrame.
Basic steps (3 lines of code):
import pandas as pd
file_path = "your_file.csv"
df = pd.read_csv(file_path)
🔹 Handling Missing Headers
● By default, read_csv assumes header row exists.
If dataset has no headers, specify:
df = pd.read_csv(file_path, header=None)
● Pandas will then assign default integer headers (0,1,2,3,...).
🔹 Previewing Data
● To quickly check:
○ df.head(n) → shows first n rows (default = 5).
○ df.tail(n) → shows last n rows.
● Useful for verifying dataset was read correctly.
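A minimal sketch of previewing, using a small hypothetical frame in place of the car dataset:

```python
import pandas as pd

# Hypothetical 10-row stand-in for the car dataset
df = pd.DataFrame({"price": [5118, 7898, 9095, 10295, 12945,
                             13495, 16500, 16500, 17450, 18920]})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows
```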
🔹 Adding Custom Column Names
If headers are stored elsewhere (e.g., another file):
headers = ["col1", "col2", "col3", ...]
df.columns = headers
● This replaces default integer headers with meaningful names.
🔹 Exporting Data
To save a DataFrame to CSV:
df.to_csv("output.csv", index=False)
● index=False prevents writing row numbers.
● Pandas also supports other formats (JSON, Excel, HDF, etc.) with similar syntax.
✅ Summary:
● Use Pandas for fast & easy data loading.
● Key function: read_csv() for CSVs.
● Use head() / tail() to preview data.
● Assign headers if missing.
● Export modified DataFrame with to_csv().
📌 Exploring Data with Pandas
🔹 Checking Data Types
● Pandas stores data mainly as:
○ object → like Python str
○ float → numeric decimal values
○ int → integer values
○ datetime → for time series
● ⚠️ Pandas auto-detects types on import → sometimes incorrect.
○ Example: Car price should be float, but may load as object.
○ Solution: manually convert with astype().
👉 Why check types?
1. Ensure correctness (e.g., numeric fields not misread as text).
2. Determines what functions you can apply (math only works on numeric).
🔑 Command:
df.dtypes  # returns column names + their data types
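A small sketch of the check-then-convert workflow (the misread "price" column below is hypothetical):

```python
import pandas as pd

# Hypothetical frame where "price" was imported as text (object dtype)
df = pd.DataFrame({"price": ["13495", "16500", "18920"]})
print(df.dtypes)  # "price" shows as object

df["price"] = df["price"].astype("float")  # manual correction
print(df.dtypes)  # "price" now shows as float64
```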
🔹 Statistical Summary with describe()
● Provides quick summary stats for numeric columns:
○ count, mean, std (standard deviation), min, max, 25%, 50%, 75%.
● Detects outliers, deviations, and anomalies.
👉 Example:
df.describe()
● By default → only numeric columns.
● To include all columns (numeric + object/string):
df.describe(include="all")
🔹 Special Stats for Object Columns
When using include="all", string/object columns return:
● unique → number of distinct values
● top → most frequent value
● freq → count of most frequent value
⚠️ Some values appear as NaN (Not a Number) if a statistic doesn’t apply.
🔹 Dataset Overview with info()
● Gives a concise summary of the DataFrame:
○ Index range
○ Column names & data types
○ Non-null value counts
○ Memory usage
👉 Example:
df.info()
✅ Summary:
● Always check column types with dtypes (fix mismatches early).
● Use describe() for statistical summaries (detect outliers).
● Add include="all" for both numeric + object columns.
● Use info() for a quick dataset snapshot (columns, null counts, memory).
📘 Notes: Accessing Databases with
Python
1. Introduction
● Databases = powerful tools for data scientists.
● Python connects to databases using APIs.
● Typical usage: write Python code in Jupyter Notebook to interact with DBMS.
2. What is an API?
● API (Application Programming Interface): a set of functions to access a service.
● SQL API:
○ Provides function calls as interface for DBMS.
○ Used to:
1. Send SQL queries.
2. Retrieve results.
3. Check status / handle errors.
3. SQL API Basic Flow
1. Application calls API to connect to DBMS.
2. SQL statement is built as a text string in a buffer.
3. API call passes SQL statement to DBMS.
4. Application makes API calls to check status & errors.
5. Ends with API call to disconnect from DB.
4. Python DB API (PEP 249)
● Standard API in Python for relational databases.
● Benefit: write one program that works with many DBs.
● Two main objects:
○ Connection Object → connect/manage transactions.
○ Cursor Object → run queries & fetch results.
5. Connection Object Methods
● cursor() → returns new cursor object.
● commit() → saves (commits) all pending transactions.
● rollback() → undoes changes back to start of transaction.
● close() → closes database connection (important to free resources).
6. Cursor Object
● Works like a text cursor → scans through result sets.
● Used for:
○ Running queries.
○ Fetching results into the application.
7. Typical Python DB API Workflow
1. Import database module (e.g., import sqlite3).
2. Connect to DB using connect() → returns connection object.
conn = sqlite3.connect("my_database.db")
3. Create cursor using cursor().
cur = conn.cursor()
4. Execute SQL query.
cur.execute("SELECT * FROM users")
5. Fetch results.
rows = cur.fetchall()
6. Commit (if needed).
conn.commit()
7. Close connection.
conn.close()
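Putting the workflow together: a runnable sketch against an in-memory SQLite database (the users table and its contents are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # import module, open connection
cur = conn.cursor()                  # create a cursor

cur.execute("CREATE TABLE users (name TEXT)")
cur.execute("INSERT INTO users VALUES ('Ada')")
conn.commit()                        # commit the pending transaction

cur.execute("SELECT * FROM users")   # run a query
rows = cur.fetchall()                # fetch results
conn.close()                         # free resources
```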
8. Key Takeaways
● DB API standard = portability across databases.
● Connection = gateway to database.
● Cursor = runs queries + fetches data.
● Always close connections to avoid resource leaks.
Lesson Summary
Congratulations! You have completed this lesson. At this point in the course, you know:
● Each line in a dataset is a row, and commas separate the values.
● To understand the data, you must analyze the attributes for each column of data.
● Python libraries are collections of functions and methods that facilitate various
functionalities without writing code from scratch and are categorized into
Scientific Computing, Data Visualization, and Machine Learning Algorithms.
● Many data science libraries are interconnected; for instance, Scikit-learn is built
on top of NumPy, SciPy, and Matplotlib.
● The data format and the file path are two key factors for reading data with
Pandas.
● The read_csv function in Pandas can read files in CSV format into a Pandas
DataFrame.
● Pandas has data types such as object, float64, int64, and datetime64.
● Use the dtypes attribute to check each column’s data type; misclassified data types
might need manual correction.
● Knowing the correct data types helps apply appropriate Python functions to
specific columns.
● Using Statistical Summary with describe() provides count, mean, standard
deviation, min, max, and quartile ranges for numerical columns.
● You can also use include='all' as an argument to get summaries for object-type
columns.
● The statistical summary helps identify potential issues like outliers needing
further attention.
● Using the info() method gives a concise summary of the DataFrame (index range,
column names, data types, non-null counts, memory usage), useful for quick inspection.
● Some statistical metrics may return "NaN," indicating missing values, and the
program can’t calculate statistics for that specific data type.
● Python can connect to databases through specialized code, often written in
Jupyter notebooks.
● SQL Application Programming Interfaces (APIs) and Python DB APIs (most often
used) facilitate the interaction between Python and the DBMS.
● SQL APIs connect to DBMS with one or more API calls, build SQL statements as
a text string, and use API calls to send SQL statements to the DBMS and retrieve
results and statuses.
● DB-API, Python's standard for interacting with relational databases, uses
connection objects to establish and manage database connections and cursor
objects to run queries and scroll through the results.
● Connection Object methods include the cursor(), commit(), rollback(), and close()
commands.
● You can import the database module, use the Connect API to open a connection,
and then create a cursor object to run queries and fetch results.
● Remember to close the database connection to free up resources.
Module 2
📌 Data Pre-processing (Data Wrangling /
Data Cleaning)
🔹 What is Data Pre-processing?
● Process of converting raw data into a clean, structured format for analysis.
● Makes data ready for further analysis and modeling.
● Also called data cleaning or data wrangling.
Topics in this Module
1. Handling Missing Values
● Missing values occur when entries in a dataset are left empty.
● Must be identified and properly handled (remove, replace, or impute).
2. Data Formatting
● Data may come in different formats/units/conventions (e.g., miles vs kilometers,
dollars vs euros).
● Pandas methods can standardize values into a common format, unit, or convention.
3. Data Normalization
● Different numerical columns may have different ranges.
● Direct comparisons are not meaningful.
● Normalization brings all values into a similar scale/range.
● Techniques:
○ Centering: subtracting mean
○ Scaling: dividing by standard deviation or max value
4. Data Binning
● Process of grouping continuous values into discrete categories.
● Makes comparisons between groups easier.
● Example: Age values → bins like Child, Teen, Adult, Senior.
5. Categorical Variables
● Many datasets have categorical data (e.g., car body style, fuel type).
● Must be converted into numeric form for statistical modeling (e.g., one-hot encoding).
Working with Pandas Columns
● Operations are usually applied along columns (each row = one sample).
● Each column is a Pandas Series.
● Example: Access a column → df['symboling'] or df['body-style']
You can manipulate column values directly:
df['symboling'] = df['symboling'] + 1
➝ Adds 1 to every value in the column.
✅ Summary:
Data preprocessing ensures your dataset is clean, standardized, and ready for analysis. It
includes handling missing values, formatting, normalization, binning, and encoding categorical
variables.
📌 Missing Values in Data Pre-processing
🔹 What are Missing Values?
● A feature is said to have a missing value when no data is stored for it in a particular
observation.
● Common representations:
○ ?, N/A, 0, blank cell, or NaN (Not a Number).
● Example: normalized_losses column has missing values → shown as NaN.
Strategies to Handle Missing Values
1. Recover the data
○ If possible, ask the data provider to fill in missing values.
2. Remove the data
○ Drop rows with missing values (good if only a few are missing).
○ Drop entire columns if too many values are missing.
○ ⚠️ Removes information → should minimize impact.
3. Replace (Impute) missing data
○ Keeps dataset intact, but introduces estimation (less accurate).
○ Numerical data → replace with mean/median (e.g., the average of
normalized_losses).
○ Categorical data → replace with mode (most common value, e.g.,
"gasoline").
○ Domain knowledge → sometimes additional info helps make better guesses
(e.g., older cars may have higher losses).
4. Leave missing values as is
○ Sometimes useful to keep missing data for analysis.
Handling Missing Values in Python (Pandas)
🔹 Dropping Missing Data
# Drop rows with NaN
df.dropna(axis=0, inplace=True)
# Drop columns with NaN
df.dropna(axis=1, inplace=True)
● axis=0 → drop rows
● axis=1 → drop columns
● inplace=True → modifies DataFrame directly
🔹 Replacing Missing Data
# Replace NaN with mean of a column
mean_value = df['normalized_losses'].mean()
df['normalized_losses'].replace(np.nan, mean_value, inplace=True)  # requires: import numpy as np
● First calculate mean/median/mode.
● Then replace NaN values with that value.
✅ Summary
● Missing values are common in real-world datasets.
● Handling strategies: recover, drop, replace, or leave as is.
● In Pandas:
○ Use .dropna() to remove rows/columns.
○ Use .replace() or .fillna() to impute values.
👉 Pro tip: In practice, fillna() is often used instead of replace() for missing values in
Pandas. Example:
df['normalized_losses'] = df['normalized_losses'].fillna(mean_value)
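A runnable sketch of the fillna() approach on a tiny hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
df = pd.DataFrame({"normalized_losses": [100.0, np.nan, 140.0]})

mean_value = df["normalized_losses"].mean()  # NaN is skipped: (100 + 140) / 2 = 120
df["normalized_losses"] = df["normalized_losses"].fillna(mean_value)
```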
📝 Notes on Data Formatting & Pandas
🔹 What is Data Formatting?
● Definition: Bringing data into a common standard so it’s consistent and comparable.
● Why needed?
○ Data comes from different sources → different formats/units/conventions.
○ Ensures consistency and cleanliness for analysis.
● Example:
○ "New York City" may appear as: N.Y., Ny, NY, New York.
○ Sometimes useful (e.g., fraud detection, anomaly spotting).
○ But usually → need to treat all as same entity.
🔹 Example: Unit Conversion
● Dataset feature: city-miles per gallon (mpg)
● Problem: Different countries use different units.
● Conversion needed → liters per 100 km (metric).
● Formula:
L/100km = 235 / mpg
● In Pandas (1 line code):
df["city-L/100km"] = 235 / df["city-mpg"]
● Rename column with:
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
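A quick check of the conversion on made-up mpg values (23.5 mpg → 10 L/100km):

```python
import pandas as pd

# Hypothetical fuel-consumption values
df = pd.DataFrame({"city-mpg": [23.5, 47.0]})
df["city-L/100km"] = 235 / df["city-mpg"]
```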
🔹 Data Types in Pandas
● Common types:
○ object → text, words, strings
○ int64 → integers
○ float64 → real numbers (decimals)
● Problem: Sometimes data is imported with wrong type.
○ Example: price column stored as object instead of int/float.
○ This can cause errors → valid numbers treated like missing data.
🔹 Checking & Converting Data Types
● Check types:
df.dtypes
● Convert types:
df["price"] = df["price"].astype("int") # or "float"
● Ensures correct interpretation during analysis/modeling.
✅ Key Takeaway:
● Data formatting = making data consistent, usable, and correct.
● Use Pandas methods:
○ rename() → rename columns
○ astype() → convert data types
○ dtypes → check current data types
📌 Data Normalization
🔹 What is Data Normalization?
● A data preprocessing technique where we adjust numerical features so they share a
common scale.
● Without normalization, features with larger ranges (like income) dominate features with
smaller ranges (like age), which can bias models like linear regression, k-NN, etc.
● It makes comparisons fair and models more stable.
🔹 Why Normalize? (Examples from transcript)
1. Car dataset example
○ Length ranges: 150–250
○ Width/Height ranges: 50–100
→ Different scales can distort analysis. Normalization makes them comparable.
2. Age vs Income example
○ Age: 0–100
○ Income: 20,000–500,000
○ Income values are ~1000× larger. A regression model will give more weight to
income even if it’s not inherently more important.
→ After normalization, both are brought to the same scale.
🔹 Three Common Normalization Techniques
1. Simple Feature Scaling
x_new = x / x_max
○ Divides by the maximum value.
○ Values range: 0 → 1.
○ Example: df["Length"] / df["Length"].max().
2. Min-Max Normalization
x_new = (x - x_min) / (x_max - x_min)
○ Shifts and scales values into [0, 1].
○ Example: (df["Length"] - df["Length"].min()) /
(df["Length"].max() - df["Length"].min()).
3. Z-Score Normalization (Standardization)
x_new = (x - μ) / σ
○ Subtract mean (μ) and divide by standard deviation (σ).
○ Values are centered around 0, typically between -3 and +3.
○ Example: (df["Length"] - df["Length"].mean()) /
df["Length"].std().
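The three formulas side by side on a made-up Length column:

```python
import pandas as pd

df = pd.DataFrame({"Length": [150.0, 200.0, 250.0]})  # hypothetical values

simple = df["Length"] / df["Length"].max()  # simple feature scaling
minmax = (df["Length"] - df["Length"].min()) / (df["Length"].max() - df["Length"].min())
zscore = (df["Length"] - df["Length"].mean()) / df["Length"].std()  # sample std (ddof=1)
```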
✅ Key Takeaway:
Normalization is crucial in machine learning and statistical modeling to ensure features
contribute fairly. The choice of technique depends on the model:
● Min-Max → when features need bounded values (e.g., neural networks).
● Z-Score → when we assume data is normally distributed.
● Simple Scaling → quick but less robust.
🔹 What is Binning?
● Definition: Binning means grouping a continuous range of numerical values into
intervals (called bins).
● Example: Instead of using raw ages like 2, 7, 12, 16, you group them into ranges:
○ 0–5 → Bin 1
○ 6–10 → Bin 2
○ 11–15 → Bin 3
● Why it’s useful?
○ Simplifies data → easier to analyze.
○ Sometimes improves accuracy of predictive models.
○ Helps understand distribution of data (which ranges most data points fall into).
🔹 Example with Car Dataset
● Attribute: Price
● Range: 5,188 to 45,400
● Has 201 unique values → hard to analyze directly.
● Using binning, we split them into 3 bins:
○ Low Price
○ Medium Price
○ High Price
🔹 How to Do It in Python
1. Use np.linspace
○ To generate equally spaced dividers for bins.
○ Since we want 3 bins, we need 4 divider points.
import numpy as np
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
print(bins)
This gives 4 numbers equally spaced between the min and max price.
2. Create labels for the bins
group_names = ['Low', 'Medium', 'High']
3. Apply pd.cut
○ Segments the data into bins and assigns labels.
df['price_binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)
4. Check distribution with a bar chart
df['price_binned'].value_counts().plot(kind='bar')
→ Shows how many cars fall into each bin.
🔹 Result (as explained in video)
● Most cars fall into Low Price.
● Very few cars are in High Price.
● Helps you quickly see price distribution instead of analyzing all 201 unique values.
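The same binning steps as a self-contained sketch (the prices are hypothetical, chosen to span the 5,188–45,400 range):

```python
import numpy as np
import pandas as pd

prices = pd.Series([5188, 10000, 20000, 30000, 45400])

bins = np.linspace(prices.min(), prices.max(), 4)  # 4 dividers → 3 bins
group_names = ["Low", "Medium", "High"]
binned = pd.cut(prices, bins, labels=group_names, include_lowest=True)
```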
📘 Notes: Converting Categorical
Variables to Quantitative Variables
🔹 Why Convert?
● Most statistical & ML models do not accept strings/objects as input.
● Models require numeric input for training & predictions.
● Example: In the car dataset, the feature fuel type = "gas" or "diesel" (categorical,
string).
🔹 One-Hot Encoding (OHE)
● Definition: Encoding categorical variables by creating new binary features (0/1) for
each unique category.
● Process:
○ Each unique value in the categorical feature becomes a new column.
○ If the observation has that category → column = 1
○ Otherwise → column = 0
🔹 Example: Fuel Type
Original Feature:
fuel_type: ["gas", "diesel", "gas", ...]
After One-Hot Encoding:
| Car | Fuel Type | Gas | Diesel |
|-----|-----------|-----|--------|
| A   | gas       | 1   | 0      |
| B   | diesel    | 0   | 1      |
| C   | gas       | 1   | 0      |
| D   | diesel    | 0   | 1      |
🔹 Implementation in Python
✅ Using Pandas get_dummies()
import pandas as pd
# Example DataFrame
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas", "diesel"]})
# Apply One-Hot Encoding (dtype=int gives 0/1 rather than True/False in recent pandas)
dummy_variable_one = pd.get_dummies(df["fuel"], dtype=int)
print(dummy_variable_one)
Output:
diesel gas
0 0 1
1 1 0
2 0 1
3 1 0
🔹 Key Points
● This process = One-Hot Encoding.
● Creates dummy variables (binary 0/1).
● Easy in Python using pd.get_dummies().
● Helps convert categorical data → numerical for ML models.
Lesson Summary
Congratulations! You have completed this lesson. At this point in the course, you know:
● Data formatting is critical for making data from various sources consistent and
comparable.
● Master the techniques in Python to convert units of measurement, like
transforming "city miles per gallon" to "city-liters per 100 kilometers" for ease of
comparison and analysis.
● Acquire skills to identify and correct data types in Python, ensuring the data is
accurately represented for subsequent statistical analyses.
● Data normalization helps make variables comparable and helps eliminate
inherent biases in statistical models.
● You can apply Feature Scaling, Min-Max, and Z-Score to normalize data and
apply each technique in Python using pandas’ methods.
● Binning is a method of data pre-processing to improve model accuracy and data
visualization.
● Run binning techniques in Python using numpy's "linspace" and pandas' "cut"
methods, particularly for numerical variables like "price."
● Utilize histograms to visualize the distribution of binned data and gain insights
into feature distributions.
● Statistical models generally require numerical inputs, making it necessary to
convert categorical variables like "fuel type" into numerical formats.
● You can implement the one-hot encoding technique in Python using pandas’
get_dummies method to transform categorical variables into a format suitable for
machine learning models.
📌 Data Analysis with Python – Data
Wrangling Cheat Sheet
🔹 Replace Missing Data with Most Frequent Entry (Mode)
MostFrequentEntry = df['attribute_name'].value_counts().idxmax()
df['attribute_name'].replace(np.nan, MostFrequentEntry, inplace=True)
🔹 Replace Missing Data with Mean
AverageValue = df['attribute_name'].astype(<data_type>).mean(axis=0)
df['attribute_name'].replace(np.nan, AverageValue, inplace=True)
🔹 Fix Data Types
df[['attribute1_name', 'attribute2_name', ...]] = \
df[['attribute1_name', 'attribute2_name', ...]].astype('data_type')
# data_type could be int, float, str, etc.
🔹 Normalize Data (Min-Max Normalization)
df['attribute_name'] = df['attribute_name'] / df['attribute_name'].max()
🔹 Binning (Convert Continuous Data into Categories)
bins = np.linspace(min(df['attribute_name']), max(df['attribute_name']), n)
# n = number of bins
GroupNames = ['Group1', 'Group2', 'Group3', ...]
df['binned_attribute_name'] = pd.cut(df['attribute_name'], bins,
labels=GroupNames, include_lowest=True)
🔹 Change Column Name
df.rename(columns={'old_name': 'new_name'}, inplace=True)
🔹 Indicator Variables (One-Hot Encoding)
dummy_variable = pd.get_dummies(df['attribute_name'])
df = pd.concat([df, dummy_variable], axis=1)
👉 This cheat sheet basically covers the core preprocessing steps in pandas:
● Handling missing values
● Type conversions
● Normalization
● Binning
● Renaming columns
● Encoding categorical variables
Module 3
In this module, you will build essential skills in exploratory data analysis (EDA) using
Python. You will learn to perform computations on the data to calculate basic descriptive
statistical information, such as mean, median, mode, and quartile values, and use that
information to better understand the distribution of the data. You will learn how to group
data to better visualize patterns, use the Pearson correlation method to compare two
continuous numerical variables, and apply the chi-square test to assess associations
between categorical variables and interpret the results. Further, you will be provided
with a cheat sheet that will serve as a quick reference for commonly used EDA
functions and methods.
Learning Objectives
● Explain the role of EDA in understanding dataset structure, patterns, and
potential outliers before applying statistical models
● Interpret summary statistics to describe the central tendency and spread of a
numerical variable in a dataset
● Apply the groupby() function in Pandas to compare aggregated statistics across
different categories
● Select and generate appropriate plot types in Python to visualize relationships or
distributions in a dataset
● Identify relationships between two continuous variables by analyzing correlation
coefficients and interpreting the direction and strength of the relationship
● Explain the use of the Pearson correlation coefficient in quantifying linear
relationships between two numerical variables
● Use the chi-square test to evaluate whether two categorical variables are
statistically dependent using observed and expected frequency tables
● Perform EDA on the Used Car Pricing dataset using summary statistics,
visualizations, and correlation analysis in Python
● Apply EDA techniques to the Laptop Pricing dataset to uncover trends, outliers,
and variable relationships using Pandas and visualization libraries
Exploratory Data Analysis (EDA) with
Python
Definition
Exploratory Data Analysis (EDA) is the process of analyzing data to:
● Summarize main characteristics
● Gain better understanding of the dataset
● Uncover relationships between variables
● Identify important features for solving the problem
Main Question: What factors have the most impact on car price?
Techniques Covered in this Module
1. Descriptive Statistics
○ Provide a short summary of dataset characteristics.
○ Examples: mean, median, mode, standard deviation, variance, min, max.
2. Grouping Data (GroupBy)
○ Allows organizing data based on categories.
○ Useful for transforming and summarizing datasets.
○ Example: comparing car prices by brand or body style.
3. ANOVA (Analysis of Variance)
○ A statistical method to compare the means of multiple groups.
○ Helps determine if certain categorical variables significantly affect car price.
4. Correlation Analysis
○ Measures the strength of relationships between numerical variables.
○ Example: relationship between engine size and car price.
5. Advanced Correlation
○ Pearson Correlation: Measures linear relationship between two variables.
(Values between -1 and 1)
○ Correlation Heatmaps: Visual representation of correlation between multiple
variables, easier to spot patterns.
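As a tiny illustration, pandas computes the Pearson correlation directly with .corr() (the values below are made up and perfectly linear, so r = 1):

```python
import pandas as pd

# Hypothetical engine-size / price pairs with a perfect linear relationship
df = pd.DataFrame({"engine-size": [100, 150, 200],
                   "price": [10000, 15000, 20000]})

r = df["engine-size"].corr(df["price"])  # Pearson by default
```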
✅ By the end of this module, you will be able to:
● Summarize data with descriptive stats
● Transform data using GroupBy
● Test variable significance with ANOVA
● Explore relationships using correlation & heatmaps
Exploratory Data Analysis (EDA) –
Descriptive Statistics
1. Purpose
● First step before building complex models.
● Helps summarize dataset characteristics and understand data distribution.
● Identifies patterns, outliers, and relationships between variables.
2. Descriptive Statistics in Pandas
● df.describe()
○ Automatically computes statistics for all numerical variables.
○ Shows:
■ count → total number of entries
■ mean → average value
■ std → standard deviation
■ min, max → extreme values
■ 25%, 50% (median), 75% → quartiles
○ Skips NaN values automatically.
3. Summarizing Categorical Variables
● df['column'].value_counts()
○ Counts occurrences of each category.
○ Example: drive-wheels feature:
■ Front-wheel drive: 118 cars
■ Rear-wheel drive: 75 cars
■ Four-wheel drive: 8 cars
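A sketch of value_counts() on a made-up drive-wheels column:

```python
import pandas as pd

drive_wheels = pd.Series(["fwd", "fwd", "rwd", "fwd", "4wd"])  # hypothetical
counts = drive_wheels.value_counts()  # category → frequency, sorted descending
```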
4. Box Plots (Visualization of Numerical Data)
● Shows distribution of data:
○ Median → central data point
○ Quartiles (25%, 75%) → spread of middle data
○ IQR (Inter-Quartile Range) = Q3 - Q1
○ Extremes → 1.5 × IQR beyond Q1 and Q3
○ Outliers → data points beyond extremes
● Useful for spotting outliers, skewness, and group comparisons.
● Example: Price vs. Drive-wheels →
○ Rear-wheel drive prices differ significantly.
○ Front-wheel & four-wheel drive prices overlap.
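The quartile/IQR arithmetic behind a box plot can be checked directly (the prices below are made up; 100 is the planted outlier):

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                                   # inter-quartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker extremes
outliers = prices[(prices < lower) | (prices > upper)]
```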
5. Scatter Plots (Continuous Variables Relationship)
● Shows relationship between two numerical variables.
● Predictor variable → x-axis (e.g., engine size)
● Target variable → y-axis (e.g., price)
● Example: Engine size vs. Price → Positive linear relationship.
Python Example:
import matplotlib.pyplot as plt
plt.scatter(df['engine-size'], df['price'])
plt.xlabel("Engine Size")
plt.ylabel("Price")
plt.title("Engine Size vs Price")
plt.show()
✅ Takeaways:
● Use describe() for numeric summaries.
● Use value_counts() for categorical summaries.
● Use box plots for distribution & outliers.
● Use scatter plots to visualize relationships.
Groupby
🔹 The Problem
We want to analyze how the drive system (FWD, RWD, 4WD) and body style (sedan,
convertible, hatchback, etc.) affect the average price of vehicles.
Instead of looking at each row individually, we want to group them by categories and compare.
🔹 Step 1: Grouping Data with groupby
In Pandas, groupby() lets us split the dataset into groups based on categorical variables.
Example:
import pandas as pd
# Step 1: Select the relevant columns
df_test = df[['drive-wheels', 'body-style', 'price']]
# Step 2: Group by drive-wheels and body-style
grouped_test = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
print(grouped_test)
✅ What this does:
● Groups vehicles by drive-wheels and body-style.
● Calculates the mean price for each subgroup.
🔹 Step 2: Creating a Pivot Table
A pivot table rearranges the grouped data into a grid (like Excel).
pivot_table = grouped_test.pivot(index='drive-wheels', columns='body-style', values='price')
print(pivot_table)
✅ What this does:
● Rows = drive-wheels
● Columns = body-style
● Values = average price
Now you can easily see which combination (e.g., rwd + convertible) has the highest
average price.
🔹 Step 3: Visualizing with a Heatmap
Tables are informative but not visually intuitive. A heatmap makes patterns clearer.
import matplotlib.pyplot as plt
plt.pcolor(pivot_table, cmap='RdBu')
plt.colorbar()
plt.xticks(range(len(pivot_table.columns)), pivot_table.columns)
plt.yticks(range(len(pivot_table.index)), pivot_table.index)
plt.show()
✅ What this does:
● Converts the pivot table into a colored grid.
● Higher prices → one end of the color spectrum
● Lower prices → other end
🔹 Insights from Example
● RWD convertibles & hardtops → highest average price 💰
● 4WD hatchbacks → lowest average price 💸
● Heatmap shows top section (RWD) more expensive, bottom (FWD/4WD) cheaper.
📊 Data Visualization in Python – Cheat
Sheet
🔹 Libraries to Import
# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline # (for Jupyter notebooks)
# Seaborn
import seaborn as sns
📌 Matplotlib Functions
1. Line Plot
● Shows relationship between x (independent) and y (dependent).
plt.plot(x, y)
2. Scatter Plot
● Relationship between two variables.
plt.scatter(x, y)
● Options: change color, size, shape of markers.
3. Histogram
● Shows frequency distribution of values.
plt.hist(x, bins=10, edgecolor='black')
4. Bar Plot
● Visualizes categorical data (counts or averages).
plt.bar(x, height, width=0.8)
5. Pseudocolor Plot (Heatmap-like)
● Displays matrix/grid data with color intensity.
● Often used to visualize pivot tables.
plt.pcolor(C, cmap='RdBu')
plt.colorbar()
📌 Seaborn Functions
1. Regression Plot
● Scatter plot + regression line (with confidence interval).
sns.regplot(x='var1', y='var2', data=df)
2. Box & Whisker Plot
● Shows distribution, quartiles, and outliers.
sns.boxplot(x='category', y='value', data=df)
3. Residual Plot
● Checks quality of regression fit.
sns.residplot(x='var1', y='var2', data=df)
4. KDE Plot (Kernel Density Estimate)
● Smooth probability density curve.
sns.kdeplot(x)
5. Distribution Plot
● Combines Histogram + KDE (optionally).
sns.distplot(x, hist=False) # only KDE curve
✅ Summary
● Matplotlib → Low-level, flexible (line, scatter, bar, histogram, pcolor).
● Seaborn → High-level, prettier (regression, boxplot, residuals, KDE, dist).
● Use Matplotlib for basics, Seaborn for advanced/statistical visualizations.
📊 Correlation in EDA (Exploratory Data
Analysis)
🔹 What is Correlation?
● Definition: A statistical metric that measures how strongly two variables are
interdependent.
● In simple terms → If one variable changes, does the other change as well?
Examples:
● Smoking ↔ Lung cancer → Higher smoking = higher chance of lung cancer.
● Umbrella ↔ Rain → More rain → More umbrellas used.
⚠️ Important:
● Correlation ≠ Causation
○ Umbrellas don’t cause rain, nor does rain cause umbrellas → they are just
correlated.
🔹 Correlation in Data Science
● We mostly focus on correlation to identify potential predictors of a target variable (like
car price).
● Helps us understand relationships and decide which features may influence the
outcome.
🔹 Examples in Car Price Prediction
1. Engine Size vs. Price
○ Scatter plot + Regression line (linear fit)
○ Steep positive slope → Positive linear correlation
○ 📈 As engine size ↑ → price ↑
○ ✅ Good predictor of car price.
2. Highway Miles per Gallon (mpg) vs. Price
○ Steep negative slope → Negative linear correlation
○ 📉 As mpg ↑ → price ↓
○ ✅ Still a good predictor of car price, despite being negative.
3. Peak RPM vs. Price
○ Relationship is weak → No clear pattern.
○ Both low and high RPM values → low/high prices.
○ ❌ Weak correlation → Not useful as a predictor.
🔹 Visualizing Correlation
● Scatter plots + regression line (using Seaborn regplot)
● Slope of regression line indicates type:
○ Positive → 📈
○ Negative → 📉
○ Flat/irregular → weak or no correlation
✅ Key Takeaways
● Correlation measures strength and direction of variable relationships.
● Strong correlations (positive/negative) are useful predictors.
● Weak correlations should generally be ignored.
● Always remember: Correlation ≠ Causation.
📌 Correlation in Exploratory Data
Analysis (EDA)
1. Pearson Correlation
● Measures strength & direction of correlation between continuous numerical variables.
● Provides two outputs:
○ Correlation Coefficient (r):
■ +1 → Strong positive correlation
■ -1 → Strong negative correlation
■ 0 → No correlation
○ p-value:
■ < 0.001 → Strong certainty
■ 0.001 – 0.05 → Moderate certainty
■ 0.05 – 0.1 → Weak certainty
■ > 0.1 → No certainty
2. Interpreting Correlation
● Strong correlation → when |r| ≈ 1 and p-value < 0.001
● Example:
○ Horsepower vs Car Price → r ≈ 0.8 (strong positive)
○ Very small p-value (<0.001) → high certainty
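The r and p-value pair can be computed with SciPy's `pearsonr`; the data below is synthetic, standing in for horsepower and price:

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in for horsepower vs price (assumed data, not the course dataset)
rng = np.random.default_rng(0)
horsepower = rng.normal(110, 30, size=200)
price = 150 * horsepower + rng.normal(0, 1500, size=200)

# r close to +1 and a tiny p-value indicate a strong, certain positive correlation
r, p = pearsonr(horsepower, price)
print(r, p)
```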
3. Correlation Heatmap
● Visual tool showing correlations among variables.
● Diagonal values = 1 (self-correlation).
● Color scheme represents strength:
○ Dark red → Strong positive correlation
○ Dark blue → Strong negative correlation
● Helps identify which variables are most related to Price.
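A sketch of building such a heatmap (the numeric columns here are made up; on the course data you would call `df.corr()` on the full frame):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric columns standing in for the car dataset
df = pd.DataFrame({"engine-size": [97, 109, 130, 152, 183],
                   "highway-mpg": [38, 34, 30, 26, 22],
                   "price": [7000, 9000, 13000, 17000, 24000]})

corr = df.corr()  # Pearson correlation matrix; the diagonal is always 1
sns.heatmap(corr, cmap="RdBu", annot=True, vmin=-1, vmax=1)
plt.savefig("correlation_heatmap.png")
```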
✅ Key Takeaway:
● Use Pearson correlation + p-value to measure strength & reliability of variable
relationships.
● Heatmaps provide a clear visual summary of correlations, especially for target
variables like car price.
📊 Chi-Square Test for Categorical
Variables
🔹 1. Introduction
● Statistical test to check if there is a significant association between two categorical
variables.
● Common in social sciences, marketing, healthcare, education, quality control.
● Non-parametric → no assumption of data distribution.
🔹 2. Hypotheses
● Null Hypothesis (H₀): No association between variables (differences due to chance).
● Alternative Hypothesis (H₁): Significant association exists.
🔹 3. Formula
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
● Oᵢ = Observed frequency
● Eᵢ = Expected frequency
Eᵢ = (row total × column total) / grand total
Degrees of freedom:
df = (r − 1) × (c − 1)
(r = rows, c = columns)
🔹 4. Decision Rule
● Compare calculated χ² with critical value from chi-square table.
● Or use p-value:
○ If p < α (e.g., 0.05) → Reject H₀ (significant association).
○ If p ≥ α → Fail to reject H₀ (no significant association).
🔹 5. Python Implementation
import pandas as pd
from scipy.stats import chi2_contingency
# Contingency table
data = [[20, 30],  # Male: [Like, Dislike]
        [25, 25]]  # Female: [Like, Dislike]
df = pd.DataFrame(data, columns=["Like", "Dislike"], index=["Male", "Female"])
# Perform Chi-Square Test (correction=False matches the uncorrected formula above)
chi2, p, dof, expected = chi2_contingency(df, correction=False)
print("Chi-square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)
print("Expected Frequencies:\n", expected)
✅ Output Example:
● χ² = 1.008
● df = 1
● p = 0.316 (> 0.05) → Fail to reject H₀
🔹 6. Practical Examples
a) Weak Association (Gender vs Product Preference)
● χ² = 1.008, df = 1, p = 0.316 → No significant association
b) Strong Association (Smoking vs Lung Disease)
● χ² = 44.33, df = 1, p << 0.05 → Strong significant association
🔹 7. Applications
● Market Research → Customer demographics vs product preference
● Healthcare → Patient characteristics vs disease incidence
● Social Sciences → Education vs voting behavior
● Education → Teaching methods vs performance
● Quality Control → Manufacturing conditions vs defects
✅ Key Takeaway:
The Chi-Square test compares observed vs expected frequencies to detect relationships
between categorical variables. A large χ² and small p-value → strong evidence of
association.
Data Analysis with Python cheat sheet
Module 4
📘 Model Development Notes
🔹 What is a Model (Estimator)?
● A model/estimator = mathematical equation that predicts a value (dependent variable)
from one or more independent variables (features).
● Example:
○ Feature (Independent Variable): Highway MPG
○ Target (Dependent Variable): Car Price
🔹 Why More Data Matters
● The more relevant features, the more accurate the model.
● Example:
○ Two cars are identical except color (pink vs red).
○ If color is not included as a feature → Model predicts the same price.
○ But in reality, pink cars sell for less → Prediction becomes inaccurate.
🔹 Types of Models Covered
1. Simple Linear Regression
○ One independent variable.
○ Example: Predict price using only highway MPG.
2. Multiple Linear Regression
○ Multiple independent variables.
○ Example: Predict price using MPG, horsepower, and weight together.
3. Polynomial Regression
○ Models non-linear relationships by including higher-order terms (e.g., x², x³).
🔹 Key Takeaway
● Better models = More relevant features + Appropriate regression type.
● The goal: Accurately predict car prices and even determine a fair value for used
cars.
📘 Linear Regression Notes
🔹 1. Introduction
● Linear Regression → Predict a target (dependent variable, y) from one or more
predictors (independent variables, x).
● Two types:
○ Simple Linear Regression (SLR): One predictor variable.
○ Multiple Linear Regression (MLR): Two or more predictor variables.
🔹 2. Simple Linear Regression (SLR)
✅ Definition
● Models relationship between x (predictor) and y (target) using a straight line:
y = b₀ + b₁x + ε
○ b₀: Intercept
○ b₁: Slope
○ ε: Error (noise)
✅ Example
● Predicting car price from highway MPG.
● Equation (from video):
Price = 38,423.31 − 821.73 × (Highway MPG)
✅ Process
1. Collect training points (data).
2. Fit model → find b₀, b₁.
3. Prediction: Input x → output ŷ (predicted price).
4. Compare ŷ vs actual y → error = noise or model limitation.
✅ Noise
● Random values added to account for uncertainty.
● Usually small, centered around 0.
✅ Implementation in Python
from sklearn.linear_model import LinearRegression
# Create model
lm = LinearRegression()
# Fit model
lm.fit(X, y)  # X = predictor, y = target
# Get parameters
lm.intercept_  # b0
lm.coef_  # b1
# Prediction
yhat = lm.predict(X)
🔹 3. Multiple Linear Regression (MLR)
✅ Definition
● Relationship between y (dependent variable) and multiple predictors (x1, x2, …, xn).
y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ + ε
✅ Visualization
● With 2 predictors (x₁, x₂), values can be shown on a 2D plane, and predictions
(ŷ) are represented as heights (3D visualization).
✅ Example
● Predict car price using predictors: horsepower, curb-weight, engine-size, highway MPG.
● Equation looks like:
Price = b₀ + b₁(Horsepower) + b₂(Curb-weight) + b₃(Engine-size) + b₄(Highway MPG)
✅ Implementation in Python
# Select multiple predictors
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
# Fit model
lm.fit(Z, y)
# Parameters
lm.intercept_  # b0
lm.coef_  # [b1, b2, b3, b4]
# Prediction
yhat = lm.predict(Z)
🔹 4. Key Takeaways
● SLR → one predictor, simpler, but limited.
● MLR → more predictors → better accuracy, but risk of overfitting.
● Models are only approximations → Predictions ≠ always exact (noise, missing features,
assumptions).
● In Python (scikit-learn): fit() finds parameters, predict() makes predictions.
📘 Model Evaluation Using Visualization
🔹 1. Purpose of Visualization in Regression
● Helps evaluate model fit & assumptions.
● Main tools:
○ Regression plots
○ Residual plots
○ Distribution plots
🔹 2. Regression Plot
● Shows the relationship between independent (x) and dependent (y) variables.
● Components:
○ x-axis: Independent variable (feature).
○ y-axis: Dependent variable (target).
○ Points: Actual data samples.
○ Line: Predicted values (fitted line).
✅ Implementation (Seaborn)
import seaborn as sns
sns.regplot(x="feature", y="target", data=df)
🔹 3. Residual Plot
● Residual = Actual value – Predicted value.
● x-axis: Independent variable.
● y-axis: Residual (error).
✅ What to Look For:
1. Ideal case (good linear fit):
○ Residuals randomly scattered.
○ Zero mean, evenly distributed, constant variance.
○ No curvature.
2. Problematic cases:
○ Curvature → Suggests non-linear relationship.
○ Changing variance (residuals spread increases with x) → Linear model not
valid.
○ Patterns → Model is missing important features.
✅ Implementation (Seaborn)
sns.residplot(x=df["feature"], y=df["target"])
🔹 4. Distribution Plot
● Compares distribution of actual vs predicted values.
● Good for evaluating models with multiple features.
● If predicted distribution ≈ actual distribution → good model fit.
✅ Observations
● Inaccuracies → Predicted values deviate in certain ranges.
○ Example: Predictions for 40k–50k price inaccurate.
○ Predictions for 10k–20k price close to actual values.
✅ Implementation (Seaborn)
import seaborn as sns
# Actual values
sns.distplot(y, hist=False, color="r", label="Actual")
# Predicted values
sns.distplot(yhat, hist=False, color="b", label="Predicted")
🔹 5. Key Insights
● Regression plots → Check overall trend & correlation.
● Residual plots → Validate assumptions of linearity & constant variance.
● Distribution plots → Compare predicted vs actual values across ranges.
● Good model → Random residuals, predicted values match actual distribution.
📘 Polynomial Regression & Pipelines
🔹 1. Why Polynomial Regression?
● Linear regression assumes a straight-line relationship.
● When data shows curvilinear relationships, linear regression may fail.
● Polynomial regression = Transform predictor variables into polynomial terms, then
apply linear regression.
🔹 2. Polynomial Regression Basics
● Polynomial regression is still a linear model (linear in parameters).
● Forms:
○ Quadratic (2nd order): Includes x2x^2.
○ Cubic (3rd order): Includes x3x^3.
○ Higher-order: More flexibility, but risk of overfitting.
👉 Key Point: The degree of the polynomial greatly affects model fit.
🔹 3. Example (1D Polynomial Regression)
Symbolic model (3rd order example):
y = −1.557x³ + 204.8x² + 8965x + 1.37 × 10⁵
● Captures non-linear patterns.
● Too high a degree → model becomes overly complex.
🔹 4. Multidimensional Polynomial Regression
● With multiple features, polynomial expansion creates interaction terms.
Example (2D, 2nd order):
y = b₀ + b₁x₁ + b₂x₂ + b₃x₁² + b₄x₂² + b₅x₁x₂
● NumPy’s polyfit → works only for 1D.
● For higher dimensions → use Scikit-learn’s PolynomialFeatures.
✅ Implementation
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
🔹 5. Feature Scaling / Normalization
● Polynomial expansion increases feature magnitude → need normalization.
● StandardScaler commonly used.
✅ Implementation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_poly)
🔹 6. Pipelines
● Writing separate code for transformation, normalization, and regression → repetitive.
● Pipeline automates this sequence:
1. Polynomial transformation
2. Normalization
3. Regression
✅ Implementation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2)),
    ("model", LinearRegression())
])
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
● Advantages:
○ Simplifies code.
○ Reduces errors.
○ Makes experiments easier (just change pipeline parameters).
🔹 7. Key Insights
● Polynomial regression handles non-linear data.
● Still linear in coefficients, but with polynomial features.
● Higher degree ≠ always better → balance fit & complexity.
● Pipelines automate preprocessing + modeling → clean, efficient workflow.
⚡ This gives you a full workflow:
Linear Regression → Polynomial Regression → Feature Scaling → Pipelines 🚀
Kernel Density Estimation (KDE) Plots for Model
Evaluation
📘 Model Evaluation (Numerical)
We can evaluate models numerically to measure how well they fit the data. Two important
measures are:
1. Mean Squared Error (MSE)
● Definition: Average of the squared differences between the actual value y and the
predicted value ŷ.
MSE = (1/n) Σ (y − ŷ)²
● Steps:
○ Find error = y − ŷ
○ Square it
○ Take the mean of all squared errors
● Interpretation:
○ Small MSE → good model fit (predictions are close to actual values).
○ Large MSE → poor fit.
● Python Implementation:
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(y_actual, y_pred)
2. R-Squared (R²)
● Also called: Coefficient of Determination
● Definition: Measures how close the data is to the fitted regression line.
R² = 1 − (MSE of regression) / (MSE of average)
● Where:
○ MSE of regression: Error from the regression line
○ MSE of average: Error if we only used the mean of y
● Range: Usually between 0 and 1
○ R² = 1 → Perfect fit (model explains all variation)
○ R² = 0 → Model no better than the mean
○ R² < 0 (rare) → Model worse than the mean
● Interpretation Example:
○ R² = 0.49659 → About 49.659% of the variation in the target is explained
by the model.
● Python Implementation:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
R2 = model.score(X, y)
3. Visualization vs Numerical
● Visualization shows how well line fits data (plots).
● MSE and R² give numerical, objective metrics to compare models.
✅ Quick Summary:
● Use MSE → check how far predictions are from actual values.
● Use R² → check how much variance in data is explained by the model.
● The closer MSE → 0 and R² → 1, the better the fit.
📘 Prediction & Decision Making
1. Model Correctness Check
● Ensure results make sense:
○ Predictions should not be negative, unrealistically high, or too low.
○ Check coefficients (.coef_) for logical impact of features.
■ Example: Increase of 1 mpg (highway) → car price decreases by
~$821 (reasonable).
● Always combine:
○ Visualization
○ Numerical evaluation (MSE, R²)
○ Model comparison
2. Prediction Example
● Model trained with fit()
● Predict price for highway mpg = 30 → $13,771.30
● ✅ Seems reasonable (not extreme).
3. Unrealistic Predictions
● Sometimes predictions are nonsense:
○ E.g., mpg range 0–100 → negative prices.
● Causes:
○ No data in that range
○ Linear assumption may be invalid
● Conclusion: Only trust model in ranges where realistic data exists.
4. Generating Sequences for Predictions
● Use np.arange(start, stop, step)
○ Example: np.arange(1, 101, 1) → sequence from 1 to 100
● Predictions on this sequence → NumPy array (may include negative values if out of
range).
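A small sketch of this, with a toy fitted model lm standing in for the course's mpg → price model (the data points are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy mpg -> price fit (hypothetical points on a downward-sloping line)
X = np.array([[10], [20], [30], [40]])
y = np.array([30000, 22000, 14000, 6000])
lm = LinearRegression().fit(X, y)

# Generate a sequence of mpg values 1..100 and predict on it
new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = lm.predict(new_input)
print(yhat.min())  # negative at high mpg: extrapolation beyond the data
```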
5. Visualization for Model Validation
● Regression plot: Shows overall trend (good for polynomial regression).
● Residual plot:
○ If residuals show curvature → model may need non-linear fit.
● Distribution plot (multiple regression):
○ E.g., predicted prices $30K–$50K inaccurate → model may need more data or
non-linear terms.
6. Numerical Evaluation
(a) Mean Squared Error (MSE)
● Smaller MSE → better fit.
● Example MSEs:
○ 3,495 → very close predictions
○ 3,652 → still reasonable
○ 12,870 → much worse fit
(b) R-Squared (R²)
● Measures % of variance explained.
● Example values:
○ 0.9986 → Excellent fit
○ 0.9226 → Still strong linear relation
○ 0.806 → Messy but clear relation
○ 0.61 → Weak but upward trend visible
● Acceptable threshold varies by field: Some authors accept R² ≥ 0.10.
7. Model Comparisons
● Simple Linear Regression (SLR) vs Multiple Linear Regression (MLR):
○ MLR usually has lower MSE (more variables reduce errors).
○ Polynomial regression also lowers MSE.
● Important: Lower MSE ≠ always better fit. Must balance complexity vs interpretability.
● Inverse relation: More variables → MSE ↓, R² ↑.
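The SLR-vs-MLR point can be illustrated on synthetic data (feature names and coefficients below are assumptions, not the course dataset); with a superset of features, in-sample MSE can only stay the same or drop:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: price depends on two features (assumed, for illustration)
rng = np.random.default_rng(1)
hp = rng.uniform(60, 200, size=100)
weight = rng.uniform(1500, 3500, size=100)
price = 100 * hp + 5 * weight + rng.normal(0, 500, size=100)

X_slr = hp.reshape(-1, 1)                 # one predictor
X_mlr = np.column_stack([hp, weight])     # two predictors

mse_slr = mean_squared_error(price, LinearRegression().fit(X_slr, price).predict(X_slr))
mse_mlr = mean_squared_error(price, LinearRegression().fit(X_mlr, price).predict(X_mlr))
print(mse_slr, mse_mlr)  # MLR's in-sample MSE is never higher than SLR's
```

Lower in-sample MSE from extra variables does not by itself mean a better model, which is why out-of-sample checks follow in Module 5.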
✅ Quick Takeaway:
● Always check if predictions are logical.
● Use visual + numerical checks (MSE, R², residuals).
● Be cautious: more variables → lower errors, but may not mean better or more valid
model.
Cheat Sheet: Model
Development
Lesson Summary
Congratulations! You have completed this lesson. At this point in the course, you know:
● Linear regression refers to using one independent variable to make a prediction.
● You can use multiple linear regression to explain the relationship between one
continuous target y variable and two or more predictor x variables.
● Simple linear regression, or SLR, is a method used to understand the
relationship between two variables, the predictor independent variable x and the
target dependent variable y.
● Use the regplot and residplot functions in the Seaborn library to create regression
and residual plots, which help you identify the strength, direction, and linearity of
the relationship between your independent and dependent variables.
● When using residual plots for model evaluation, residuals should ideally have
zero mean, appear evenly distributed around the x-axis, and have consistent
variance. If these conditions are not met, consider adjusting your model.
● Use distribution plots for models with multiple features: Learn to construct
distribution plots to compare predicted and actual values, particularly when your
model includes more than one independent variable. Know that this can offer
deeper insights into the accuracy of your model across different ranges of values.
● The order of the polynomials affects the fit of the model to your data. Apply
Python's polyfit function to develop polynomial regression models that suit your
specific dataset.
● To prepare your data for more accurate modeling, use feature transformation
techniques, particularly using the preprocessing library in scikit-learn, transform
your data using polynomial features, and use the modules like StandardScaler to
normalize the data.
● Pipelines allow you to simplify how you perform transformations and predictions
sequentially, and you can use pipelines in scikit-learn to streamline your
modeling process.
● You can construct and train a pipeline to automate tasks such as normalization,
polynomial transformation, and making predictions.
● To determine the fit of your model, you can perform sample evaluations by using
the Mean Square Error (MSE), using Python’s mean_squared_error function from
scikit-learn, and using the score method to obtain the R-squared value.
● A model with a high R-squared value close to 1 and a low MSE is generally a
good fit, whereas a model with a low R-squared and a high MSE may not be
useful.
● Be alert to situations where your R-squared value might be negative, which can
indicate overfitting.
● When evaluating models, use visualization and numerical measures and
compare different models.
● The mean square error is perhaps the most intuitive numerical measure for
determining whether a model is good.
● A distribution plot is a suitable method for multiple linear regression.
● An acceptable r-squared value depends on what you are studying and your use
case.
● To evaluate your model’s fit, apply visualization, methods like regression and
residual plots, and numerical measures such as the model's coefficients for
sensibility:
● Use Mean Square Error (MSE) to measure the average of the squares of the
errors between actual and predicted values and examine R-squared to
understand the proportion of the variance in the dependent variable that is
predictable from the independent variables.
● When analyzing residual plots, residuals should be randomly distributed around
zero for a good model. In contrast, a residual plot curve or inaccuracies in certain
ranges suggest non-linear behavior or the need for more data.
Module 5
📘 Model Evaluation: Train-Test Split &
Cross-Validation
🔹 1. Why Evaluate Models?
● In-sample evaluation → how well the model fits training data.
○ Problem: Doesn’t show performance on new/unseen data.
● Out-of-sample evaluation → measures how well the model generalizes to new data.
○ Done using train-test split or cross-validation.
🔹 2. Train-Test Split
● Dataset is split into:
○ Training set → to build the model.
○ Test set → to evaluate model performance.
● Typical split: 70% training, 30% testing.
● Done using train_test_split from sklearn.model_selection.
✅ Example
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
test_size=0.3,
random_state=42)
● Parameters:
○ test_size=0.3 → 30% of data for testing.
○ random_state → ensures reproducibility.
🔹 3. Generalization Error
● Definition: How well the model predicts unseen data.
● Observed in test set performance.
● Trade-off:
○ More training data → better accuracy, but less precision in error estimate.
○ More testing data → better precision, but less accurate model training.
🔹 4. Cross-Validation (CV)
● Solution to trade-off in train-test split.
● k-fold CV:
○ Split dataset into k folds.
○ Use (k-1) folds for training, 1 fold for testing.
○ Repeat until each fold has been used as test data.
○ Final performance = average of all folds.
✅ Example
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=3) # 3-fold CV
mean_score = np.mean(scores)
● cv=3 → 3 folds.
● Returns array of scores (e.g., R² values).
● Take mean for final evaluation.
🔹 5. Cross-Validation Prediction
● If we want the actual predictions (not just scores):
Use cross_val_predict.
✅ Example
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(lr, x_data, y_data, cv=3)
● Returns predictions for each fold.
● Useful for visualization (e.g., comparing actual vs predicted).
🔹 6. Key Takeaways
● Train-test split → simple, quick evaluation.
● Cross-validation → more robust, reduces bias/variance in error estimation.
● Generalization error = how well model works on unseen data.
● Use:
○ cross_val_score → to get performance metrics.
○ cross_val_predict → to get predicted values for analysis.
⚡ This completes the model evaluation workflow:
👉 Build model → Train-test split → Cross-validation → Error analysis 🚀
📘 Model Selection: Polynomial
Regression
🔹 1. Goal of Model Selection
● Find the best polynomial order to fit the data.
● Too simple (low-order) → underfitting.
● Too complex (high-order) → overfitting.
🔹 2. Key Concepts
✅ Underfitting
● Model is too simple (e.g., linear fit).
● Fails to capture patterns in data → high bias.
● Errors remain large even on training set.
✅ Overfitting
● Model is too complex (e.g., very high-order polynomial like degree 16).
● Captures noise instead of the true function.
● Training error low, but testing error high → high variance.
✅ Optimal Fit
● Best polynomial order = minimizes test error (MSE).
● Example: Order 8 gives lowest test MSE in synthetic dataset.
🔹 3. Error Behavior (Bias-Variance Trade-off)
● Training error always decreases as polynomial order increases.
● Testing error decreases initially, reaches minimum, then increases.
● The curve looks U-shaped → best order = minimum test error.
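The U-shape can be reproduced on synthetic data (a noisy sine curve, an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy curve (assumed data, not the car dataset)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=80)
y = np.sin(x) + rng.normal(0, 0.3, size=80)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

train_mse, test_mse = [], []
for degree in range(1, 11):
    poly = PolynomialFeatures(degree=degree)
    Xtr = poly.fit_transform(x_train.reshape(-1, 1))
    Xte = poly.transform(x_test.reshape(-1, 1))
    model = LinearRegression().fit(Xtr, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(Xtr)))
    test_mse.append(mean_squared_error(y_test, model.predict(Xte)))
print(train_mse)  # keeps shrinking as degree grows
print(test_mse)   # typically falls, bottoms out, then tends to rise
```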
🔹 4. Irreducible Error
● Even at best polynomial order, some error remains.
● Sources:
○ Random noise in data (cannot be predicted).
○ Wrong assumption (e.g., data comes from sine wave, but we use polynomial).
🔹 5. Real Data Example (Horsepower → Car Price)
● Using polynomial fits:
○ Mean → poor prediction.
○ Linear / 2nd order → reasonable fit.
○ 3rd order → best fit (highest R²).
○ 4th order → sudden drop in predictions (erroneous, overfit).
✅ R² Evaluation
● R² closer to 1 = better fit.
● Tested multiple polynomial orders → order 3 optimal.
🔹 6. Implementation in Python
🔸 R² Comparison Across Polynomial Orders
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
orders = [1, 2, 3, 4]
r2_list = []
for order in orders:
    poly = PolynomialFeatures(degree=order)
    x_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
    x_test_poly = poly.transform(x_test.reshape(-1, 1))
    model = LinearRegression()
    model.fit(x_train_poly, y_train)
    y_test_pred = model.predict(x_test_poly)
    r2_list.append(r2_score(y_test, y_test_pred))
print(r2_list)
🔹 7. Key Takeaways
● Order too low → underfit.
● Order too high → overfit.
● Best order = where test error (MSE) is lowest or R² is highest.
● Always evaluate on test data, not just training.
● There will always be irreducible error due to noise.
⚡ In short:
● Underfit = high bias.
● Overfit = high variance.
● Best model balances bias & variance using test error (MSE or R²).
Introduction to Ridge Regression
For models with multiple independent features and ones with polynomial feature
extrapolation, it is common to have colinear combinations of features. Left unchecked, this
multicollinearity of features can lead the model to overfit the training data. To control this,
the feature sets are typically regularized using hyperparameters.
Ridge regression is the process of regularizing the feature set using the hyperparameter
alpha. The upcoming video shows how Ridge regression can be utilized to regularize and
reduce standard errors and avoid over-fitting while using a regression model.
📘 Ridge Regression
🔹 1. Motivation
Ridge Regression (regularization to prevent overfitting in polynomial regression &
multiple features):
● Overfitting problem in:
○ High-order polynomials (curvy fits).
○ Models with many independent variables/features.
● Standard regression → large coefficients, especially for higher-order terms.
● Outliers make the problem worse (curve bends to fit noise).
🔹 2. Ridge Regression Basics
● Ridge regression = linear regression + penalty term.
● Adds constraint on coefficients → prevents them from becoming too large.
● Controlled by parameter α (alpha, λ in some texts).
Loss function:
Minimize: Σ (yᵢ − ŷᵢ)² + α Σ βⱼ²
● First term = regular linear regression error (MSE).
● Second term = penalty on coefficient magnitudes.
🔹 3. Effect of Alpha (α)
● α = 0 → Ridge = Linear Regression (no regularization). → Overfitting risk.
● Small α (e.g., 0.001) → Slight penalty, reduces overfitting.
● Moderate α (e.g., 0.01 → 1) → Good balance, coefficients smaller, fit closer to real
function.
● Large α (e.g., 10) → Coefficients shrink close to 0 → underfitting.
✅ Key Trade-off
● Small α → model too flexible → overfit.
● Large α → model too rigid → underfit.
● Best α = chosen by validation data (cross-validation).
🔹 4. Cross-Validation for α Selection
1. Split data into:
○ Training set (fit model).
○ Validation set (tune α).
2. Try multiple α values (e.g., 0.001, 0.01, 0.1, 1, 10).
3. For each α:
○ Fit Ridge model.
○ Predict on validation data.
○ Compute R² (or MSE).
○ Store results.
4. Select α with highest validation R² (or lowest MSE).
🔹 5. Python Implementation (scikit-learn)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
alphas = [0.001, 0.01, 0.1, 1, 10]
r2_scores = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(x_train, y_train)
    y_val_pred = ridge.predict(x_val)
    r2_scores.append(r2_score(y_val, y_val_pred))
best_alpha = alphas[r2_scores.index(max(r2_scores))]
print("Best alpha:", best_alpha)
🔹 6. Visualizing Ridge Regression Performance
● Plot R² vs. α:
○ Training R² (red curve) → decreases as α increases.
○ Validation R² (blue curve) → rises, peaks, then flattens/declines.
● Best α = where validation R² is maximized.
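The two curves described above can be reproduced numerically before plotting them (a sketch on synthetic data; the degree-10 polynomial expansion and the alpha grid are arbitrary choices made so that small alpha overfits):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy data with a high-order polynomial feature expansion.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(60, 1))
y = (2 * x - x ** 2).ravel() + rng.normal(scale=0.3, size=60)
X = PolynomialFeatures(degree=10).fit_transform(x)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

alphas = [1e-4, 1e-2, 1, 100]
train_r2, val_r2 = [], []
for a in alphas:
    model = Ridge(alpha=a).fit(X_train, y_train)
    train_r2.append(r2_score(y_train, model.predict(X_train)))
    val_r2.append(r2_score(y_val, model.predict(X_val)))
# train_r2 can only decrease as alpha grows (the fit gets more constrained);
# val_r2 typically rises, peaks, then falls, which is what the plot shows.
```

Plotting `train_r2` and `val_r2` against `alphas` (log scale) reproduces the red and blue curves; the best alpha is where the validation curve peaks.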
🔹 7. Used Car Example
● Dataset: multiple features + 2nd order polynomial.
● Training data (red) & validation data (blue).
● As α ↑ → validation R² improves, converges ~0.75.
● Beyond that, increasing α has little effect.
● Trade-off:
○ High α → prevents overfitting → generalizes better.
○ But test set R² decreases slightly (model less flexible).
🔹 8. Key Takeaways
● Ridge regression controls coefficient size → prevents overfitting.
● α tunes bias-variance trade-off:
○ Too small → overfit.
○ Too large → underfit.
● Cross-validation is essential for selecting α.
● Works especially well when:
○ Many correlated features.
○ High-order polynomial expansions.
⚡ In short:
Ridge regression shrinks coefficients to reduce variance (overfitting). The hyperparameter α
must be carefully chosen using validation data.
📌 Grid Search (Scikit-learn)
🔹 What is Grid Search?
● A method to automatically iterate over multiple hyperparameters using
cross-validation.
● It helps us find the best combination of hyperparameters for a model.
● Evaluates models with different hyperparameter values using metrics like:
○ Mean Squared Error (MSE)
○ R² Score (R²)
🔹 Hyperparameters
● Values set before training, not learned during training.
● Example: alpha in Ridge regression (older scikit-learn versions also exposed a normalize option).
● Grid Search scans through different possible hyperparameter values.
🔹 Process of Grid Search
1. Start with one hyperparameter value → train model.
2. Try different hyperparameter values → retrain model.
3. Continue until all combinations are tested.
4. Each model produces an error (MSE or R²).
5. Select the hyperparameter that minimizes MSE / maximizes R².
🔹 Data Splitting
● Dataset is split into:
○ Training Set → train model.
○ Validation Set → evaluate hyperparameters.
○ Test Set → final performance check.
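The three-way split above can be produced with two calls to train_test_split (a sketch on dummy data; the 60/20/20 proportions are an example choice, not mandated by the course):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off 20% as the final test set...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split the remainder: 0.25 of the remaining 80% is 20% of the
# whole dataset, giving a 60 / 20 / 20 train / validation / test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)
```

The test set is set aside until all hyperparameter tuning on the validation set is finished, so the final performance check stays unbiased.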
🔹 Implementation in Scikit-learn
1. Import libraries:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
2. Define the parameter grid (a Python dictionary inside a list):
parameters = [{'alpha': [0.1, 1, 10]}]
○ Key = parameter name
○ Value = list of candidate values
(Note: the normalize option shown in older examples was removed in scikit-learn 1.2; scale features beforehand instead, e.g., with StandardScaler.)
3. Create the model and the grid search object:
RR = Ridge()
grid = GridSearchCV(RR, parameters, cv=4, scoring='r2')
○ RR = Ridge regression object
○ parameters = dictionary of hyperparameter values
○ cv=4 = 4-fold cross-validation
○ scoring='r2' = evaluation metric
4. Fit the model:
grid.fit(X, y)
5. Check the results:
best_model = grid.best_estimator_  # model refit with the best hyperparameters
results = grid.cv_results_  # detailed cross-validation results
🔹 Outputs
● Best hyperparameter values → grid.best_params_
● Best model object → grid.best_estimator_
● Cross-validation scores → grid.cv_results_
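Putting the steps together, here is a runnable end-to-end sketch on synthetic data (the data and the alpha grid are illustrative; only alpha is searched, since the normalize option no longer exists in current scikit-learn):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with a known linear signal plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

parameters = [{'alpha': [0.01, 0.1, 1, 10]}]
grid = GridSearchCV(Ridge(), parameters, cv=4, scoring='r2')
grid.fit(X, y)

best_params = grid.best_params_   # e.g. {'alpha': ...}
best_score = grid.best_score_     # mean cross-validated R² of the best model
```

After fitting, `grid.best_estimator_` is already refit on the full data with the winning alpha, so it can be used directly for prediction.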
🔹 Advantages
✅ Tests multiple hyperparameters quickly
✅ Finds best-performing model
✅ Easy to implement in a few lines of code
📌 Example Parameter Grid for Ridge Regression:
parameters = [{'alpha': [0.01, 0.1, 1, 10]}]
● Here:
○ alpha = regularization strength
○ (Older tutorials also list a normalize option for scaling the input features; it was removed in scikit-learn 1.2.)
Cheat Sheet: Model Evaluation and Refinement
Lesson Summary
Congratulations! You have completed this lesson. At this point in the course, you know:
● How to split your data using the train_test_split() method into training and test
sets. You use the training set to train a model, discover possible predictive
relationships, and then use the test set to test your model to evaluate its
performance.
● How to use the generalization error to measure how well your model predicts
previously unseen data.
● How to use cross-validation by splitting the data into folds where you use some
of the folds as a training set, which we use to train the model, and the remaining
parts are used as a test set, which we use to test the model. You iterate through
the folds until you use each partition for training and testing. At the end, you
average results as the estimate of out-of-sample error.
● How to pick the best polynomial order, and the problems that arise when you
select the wrong order, by analyzing models that underfit and overfit your data.
● How to select the best order of a polynomial to fit your data by minimizing the
test error, using a graph that compares the mean squared error to the order of
the fitted polynomials.
● You should use ridge regression when there is a strong relationship among the
independent variables.
● That ridge regression prevents overfitting.
● Ridge regression controls the magnitude of polynomial coefficients by introducing
a hyperparameter, alpha.
● To determine alpha, you divide your data into training and validation sets.
Starting with a small value of alpha, you train the model, make a prediction
using the validation data, then calculate and store the R-squared. You repeat
the process for progressively larger values of alpha, each time training the
model and making a prediction. Finally, you select the value of alpha that
maximizes R-squared.
● That grid search allows you to scan through multiple hyperparameters using the
Scikit-learn library, which iterates over these parameters using cross-validation.
Based on the results of the grid search method, you select optimum
hyperparameter values.
● The GridSearchCV() method takes a dictionary (or a list of dictionaries) as its
argument, where each key is the name of a hyperparameter and the values are
the hyperparameter values you wish to iterate over.
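The polynomial-order selection described above can be sketched as follows (synthetic data generated from a known cubic; the noise level and range of orders tried are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a known cubic function, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(120, 1))
y = (1 - 2 * x + 0.5 * x ** 3).ravel() + rng.normal(scale=0.3, size=120)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

# Fit polynomials of increasing order and record the test MSE for each.
test_mse = {}
for order in range(1, 8):
    poly = PolynomialFeatures(degree=order)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    test_mse[order] = mean_squared_error(
        y_test, model.predict(poly.transform(x_test)))

best_order = min(test_mse, key=test_mse.get)  # order with the lowest test error
```

Plotting `test_mse` against the order reproduces the U-shaped curve from the lesson: low orders underfit (high error from bias), high orders overfit (error from variance), and the minimum sits near the true order.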