📘 Class Notes + Colab Code: Pandas DataFrame Basics
1. Introduction to Pandas
Pandas is a Python library for data analysis.
It provides two main data structures:
- Series → one-dimensional (like a single column).
- DataFrame → two-dimensional (like an Excel spreadsheet).
Why use Pandas instead of spreadsheets?
- Automation: repeat tasks easily.
- Reproducibility: every step is written in code.
- Flexibility: works across operating systems and integrates with many data sources.
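To make the Series / DataFrame distinction concrete, here is a minimal sketch that builds one of each by hand. The names life_exp and table and the numbers in them are made up for illustration; they are not taken from the dataset used later.

```python
import pandas as pd

# A Series is a single labelled column of values (one dimension)
life_exp = pd.Series([28.8, 30.3, 32.0], name="lifeExp")

# A DataFrame is a table: several columns sharing the same row index
table = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "lifeExp": [28.8, 30.3, 32.0],
})

print(life_exp.ndim)   # number of dimensions of the Series
print(table.ndim)      # number of dimensions of the DataFrame
print(table.shape)     # (rows, columns)
```

Each column of a DataFrame is itself a Series, which is why selecting one column later returns a Series.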
2. Load Your First Dataset
# Import pandas
import pandas as pd
# Load the Gapminder dataset (tab-separated file)
df = pd.read_csv("https://raw.githubusercontent.com/jennybc/gapminder/master/data/gapminder.tsv", sep="\t")
# Print first few rows
print(df.head())
👉 Teaching Point:
- .read_csv() loads delimited text files; pass sep="\t" for tab-separated (TSV) data.
- Always check .head() to preview data.
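If you want to try .read_csv() without a network connection, the sketch below parses a tiny tab-separated string instead of a URL. The three rows and the name tsv_text are invented for illustration; only the column names mirror the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a tab-separated layout
tsv_text = (
    "country\tyear\tlifeExp\n"
    "A\t1952\t40.0\n"
    "A\t1957\t42.5\n"
    "B\t1952\t60.1\n"
)

# io.StringIO wraps the string so read_csv can treat it like a file
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample.head(2))   # preview the first two rows
```

Without sep="\t", read_csv would assume commas and load each line into a single column.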
3. Inspect DataFrame Structure
# Type of object
print(type(df))
# Shape: rows and columns
print("Shape:", df.shape)
# Column names
print("Columns:", df.columns)
# Data types
print(df.dtypes)
# More detailed info
df.info()  # prints its own summary and returns None, so no print() is needed
👉 Teaching Point:
- .shape is an attribute, not a method → no parentheses.
- Common column dtypes: object (typically text), int64 (integers), float64 (decimals).
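The attribute-versus-method point is easy to check for yourself. This is a small sketch on a made-up two-row frame (sample is a hypothetical name): it unpacks .shape and shows what happens if you add parentheses by mistake.

```python
import pandas as pd

# Made-up data, just to have something to inspect
sample = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "lifeExp": [40.0, 42.5],
})

# .shape is a plain tuple, so it can be unpacked directly
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Adding parentheses tries to call the tuple, which raises TypeError
try:
    sample.shape()
except TypeError as err:
    print("TypeError:", err)

# One dtype per column
print(sample.dtypes)
```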
4. Select Columns
# Single column → Series
country_series = df['country']
print(type(country_series))
# Single column → DataFrame
country_df = df[['country']]
print(type(country_df))
# Multiple columns
subset = df[['country', 'year', 'lifeExp']]
print(subset.head())
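# Dot notation (shortcut; only for names that are valid identifiers and do not
# clash with DataFrame attributes, e.g. use df['pop'] rather than df.pop)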
print(df.country.head())
👉 Teaching Point:
- df['col'] → Series
- df[['col']] → DataFrame
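A quick way to see the single-bracket / double-bracket difference is to compare shapes. The sketch below uses a made-up frame (sample and its values are hypothetical) and also shows one case where dot notation does not return a column.

```python
import pandas as pd

sample = pd.DataFrame({
    "country": ["A", "B"],
    "pop": [100, 200],   # same name as the DataFrame.pop method
})

as_series = sample["country"]    # single brackets, one name
as_frame = sample[["country"]]   # double brackets, a list of names

print(as_series.shape)   # one dimension
print(as_frame.shape)    # two dimensions, one column

# Dot notation finds the DataFrame.pop method before the "pop" column
print(callable(sample.pop))
print(sample["pop"].tolist())    # bracket notation always gets the column
```

Bracket notation is the safer default; dot notation is a convenience for quick exploration.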
5. Select Rows
# By label with .loc[]
print(df.loc[0]) # Row labelled 0 (the first row with the default index)
print(df.loc[[0, 99]]) # Multiple rows
# By integer position with .iloc[]
print(df.iloc[0]) # First row
print(df.iloc[-1]) # Last row
print(df.iloc[[0, 99, 999]])
👉 Teaching Point:
- .loc[] → uses labels (row index names).
- .iloc[] → uses integer positions (row numbers, starting at 0).
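With the default index, labels and positions happen to be the same numbers, which hides the difference. The sketch below uses a made-up frame whose row labels (10, 20, 30) are chosen on purpose to differ from the positions (0, 1, 2).

```python
import pandas as pd

sample = pd.DataFrame(
    {"lifeExp": [40.0, 42.5, 60.1]},
    index=[10, 20, 30],   # row labels deliberately differ from positions
)

print(sample.loc[10])    # label 10 is the first row
print(sample.iloc[0])    # position 0 is the same first row
print(sample.iloc[-1])   # negative positions count from the end

# .loc looks up labels only, and no row is labelled -1 here
try:
    sample.loc[-1]
except KeyError:
    print("KeyError: no row labelled -1")
```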
6. Subset Rows and Columns
# Select rows 0, 99, 999 and columns country, lifeExp, gdpPercap
print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])
# Same with iloc (by position)
print(df.iloc[[0, 99, 999], [0, 3, 5]])
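Hard-coded column positions like [0, 3, 5] break silently if the column order changes. One possible alternative, sketched here on a made-up frame (sample, wanted, and the values are hypothetical), is to look the positions up from the column names.

```python
import pandas as pd

sample = pd.DataFrame({
    "country": ["A", "B", "C"],
    "continent": ["X", "X", "Y"],
    "lifeExp": [40.0, 42.5, 60.1],
})

wanted = ["country", "lifeExp"]

# Label-based subset: rows labelled 0 and 2, columns by name
by_label = sample.loc[[0, 2], wanted]

# Look up the integer position of each wanted column, then use .iloc
positions = sample.columns.get_indexer(wanted)
by_position = sample.iloc[[0, 2], positions]

print(positions)
print(by_label.equals(by_position))   # both routes select the same cells
```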
7. Grouped and Aggregated Statistics
# Average life expectancy by year
print(df.groupby('year')['lifeExp'].mean())
# Average lifeExp and gdpPercap by year + continent
grouped = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
print(grouped.head())
# Flatten the grouped result
print(grouped.reset_index().head())
# Number of countries per continent
print(df.groupby('continent')['country'].nunique())
👉 Teaching Point:
- .groupby() = split → apply → combine.
- Use .mean(), .sum(), .count(), etc.
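To see what split → apply → combine means, the sketch below does the three steps by hand on a made-up four-row frame and compares the result with the one-line .groupby() version. All names and values here are invented for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 44.0, 60.0, 62.0],
})

# Split: iterate over one sub-DataFrame per continent
# Apply: take the mean of lifeExp within each piece
pieces = {}
for name, group in sample.groupby("continent"):
    pieces[name] = group["lifeExp"].mean()

# Combine: collect the per-group results into one Series
manual = pd.Series(pieces)

# The same three steps in a single expression
automatic = sample.groupby("continent")["lifeExp"].mean()

print(manual)
print(automatic)

# Several aggregations at once with .agg()
summary = sample.groupby("continent")["lifeExp"].agg(["mean", "count"])
print(summary)
```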
8. Basic Plotting
import matplotlib.pyplot as plt
# Global yearly life expectancy trend
global_yearly_life = df.groupby('year')['lifeExp'].mean()
# Plot
global_yearly_life.plot(title="Average Life Expectancy Over Time")
plt.xlabel("Year")
plt.ylabel("Life Expectancy")
plt.show()
👉 Teaching Point:
- Pandas integrates with Matplotlib.
- .plot() quickly visualizes trends.
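The integration works because .plot() returns a Matplotlib Axes object, so the labels can also be set on that object directly instead of through plt. A minimal sketch, using a made-up three-point Series (yearly and its values are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly averages; the index becomes the x-axis
yearly = pd.Series(
    [50.0, 52.0, 55.0],
    index=[1952, 1957, 1962],
    name="lifeExp",
)

# .plot() draws a line chart and hands back the Matplotlib Axes
ax = yearly.plot(title="Average Life Expectancy Over Time")
ax.set_xlabel("Year")
ax.set_ylabel("Life Expectancy")

plt.show()
```

Keeping a reference to ax is handy when a notebook contains several figures, because it makes clear which chart each label belongs to.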