LTI CheckList Assignment 1.ipynb - Colab

The document details a data preprocessing workflow for a customer dataset using pandas in Python. It includes loading the dataset, handling missing values through mean and mode imputation, and applying encoding techniques such as label encoding and one-hot encoding. The final output shows a cleaned dataset with no missing values and transformed categorical variables ready for analysis.

7/26/25, 10:44 AM LTI CheckList assignment 1.ipynb - Colab

import pandas as pd

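# Load the dataset
df = pd.read_csv("Customer_Data - Customer_Data.csv")

# Quick overview
print(df.head())
# df.info() prints its summary itself and returns None,
# so wrapping it in print() also emits a trailing "None"
print(df.info())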

   CustomerID  Gender   Age  Country Subscribed  MonthlyIncome      Education  \
0           1    Male  25.0    India        Yes        50000.0       Graduate
1           2  Female  30.0      USA         No        60000.0  Post-Graduate
2           3  Female  22.0       UK        Yes            NaN  Undergraduate
3           4    Male  45.0    India         No        45000.0       Graduate
4           5  Female   NaN  Germany        Yes        70000.0            NaN

   LoyaltyScore PreferredDevice  TotalPurchases
0           7.0          Mobile              12
1           8.0          Laptop              15
2           6.0          Tablet               8
3           9.0             NaN              20
4           5.0          Laptop              10
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 20 non-null int64
1 Gender 18 non-null object
2 Age 17 non-null float64
3 Country 20 non-null object
4 Subscribed 20 non-null object
5 MonthlyIncome 18 non-null float64
6 Education 18 non-null object
7 LoyaltyScore 18 non-null float64
8 PreferredDevice 18 non-null object
9 TotalPurchases 20 non-null int64
dtypes: float64(3), int64(2), object(5)
memory usage: 1.7+ KB
https://colab.research.google.com/drive/1wuNW7PYu7fCu7ar7eUze2tJ5_Z4oLQOW#scrollTo=GndyhzlfJmGd&printMode=true 1/13
None
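
The non-null counts above imply the number of missing entries per column (20 rows minus the non-null count). A minimal sketch of tabulating that directly before imputing is given here. It uses only the five preview rows printed above as a stand-in, since the full CSV is not reproduced in this document; the names `preview` and `missing` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in data: the five rows shown by df.head() above, not the full 20-row CSV.
preview = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [25.0, 30.0, 22.0, 45.0, np.nan],
    "MonthlyIncome": [50000.0, 60000.0, np.nan, 45000.0, 70000.0],
    "Education": ["Graduate", "Post-Graduate", "Undergraduate", "Graduate", np.nan],
    "PreferredDevice": ["Mobile", "Laptop", "Tablet", np.nan, "Laptop"],
})

# Count and percentage of missing entries per column.
missing = pd.DataFrame({
    "n_missing": preview.isna().sum(),
    "pct_missing": preview.isna().mean() * 100,
})

# Show only the columns that actually have gaps.
print(missing[missing["n_missing"] > 0])
```

Running the same two expressions on the full `df` would give the counts for all 20 rows.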

import pandas as pd

# Load the dataset
df = pd.read_csv("Customer_Data - Customer_Data.csv")

# Mean imputation for numerical columns
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean(), inplace=True)
df['LoyaltyScore'].fillna(df['LoyaltyScore'].mean(), inplace=True)

# Mode imputation for categorical columns
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['PreferredDevice'].fillna(df['PreferredDevice'].mode()[0], inplace=True)
df['Education'].fillna(df['Education'].mode()[0], inplace=True)

# Optional: Check if all missing values are filled
print("Missing values after imputation:\n", df.isnull().sum())

# Preview the first 5 rows
print("\nCleaned Dataset Preview:")
print(df.head())

Missing values after imputation:
CustomerID 0
Gender 0
Age 0
Country 0
Subscribed 0
MonthlyIncome 0
Education 0
LoyaltyScore 0
PreferredDevice 0
TotalPurchases 0
dtype: int64

Cleaned Dataset Preview:

   CustomerID  Gender        Age  Country Subscribed  MonthlyIncome  \
0           1    Male  25.000000    India        Yes   50000.000000
1           2  Female  30.000000      USA         No   60000.000000
2           3  Female  22.000000       UK        Yes   56666.666667
3           4    Male  45.000000    India         No   45000.000000
4           5  Female  33.352941  Germany        Yes   70000.000000

       Education  LoyaltyScore PreferredDevice  TotalPurchases
0       Graduate           7.0          Mobile              12
1  Post-Graduate           8.0          Laptop              15
2  Undergraduate           6.0          Tablet               8
3       Graduate           9.0          Mobile              20
4       Graduate           5.0          Laptop              10
/tmp/ipython-input-9-1909683163.py:7: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['Age'].fillna(df['Age'].mean(), inplace=True)
/tmp/ipython-input-9-1909683163.py:8: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean(), inplace=True)
/tmp/ipython-input-9-1909683163.py:9: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['LoyaltyScore'].fillna(df['LoyaltyScore'].mean(), inplace=True)
/tmp/ipython-input-9-1909683163.py:12: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through c
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
/tmp/ipython-input-9-1909683163.py:13: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through c
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se
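
The FutureWarning above recommends assigning the filled column back instead of calling `fillna(..., inplace=True)` on a column selection. A minimal sketch of that pattern follows, on a small stand-in frame built from the preview rows (so the mean differs from the full-data value of 33.352941); `toy` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Stand-in data taken from the first five rows of the dataset preview.
toy = pd.DataFrame({
    "Age": [25.0, 30.0, 22.0, 45.0, np.nan],
    "PreferredDevice": ["Mobile", "Laptop", "Tablet", np.nan, "Laptop"],
})

# Assign the filled column back instead of using inplace=True on toy[col].
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["PreferredDevice"] = toy["PreferredDevice"].fillna(
    toy["PreferredDevice"].mode()[0]
)

# Equivalent form from the warning text: a {column: value} mapping on the frame.
# toy = toy.fillna({"Age": toy["Age"].mean()})

print(toy)
```

Both forms operate on the DataFrame itself rather than on an intermediate copy, which is what the warning is about.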

# Label Encoding for 'Subscribed' column
df['Subscribed'] = df['Subscribed'].map({'Yes': 1, 'No': 0})
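
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key in the dictionary becomes `NaN`. The sketch below uses a hypothetical lowercase `"yes"` entry, which does not appear in the dataset preview, purely to illustrate the behaviour.

```python
import pandas as pd

# Hypothetical answers; the lowercase "yes" is an invented inconsistency.
answers = pd.Series(["Yes", "No", "yes"])

# Values missing from the mapping are converted to NaN.
mapped = answers.map({"Yes": 1, "No": 0})

print(mapped)
print("Unmapped values:", int(mapped.isna().sum()))
```

Checking `isna().sum()` after the mapping is a cheap way to confirm that every label was covered.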

from sklearn.preprocessing import LabelEncoder

# List of categorical columns to encode
categorical_cols = ['Gender', 'Education', 'PreferredDevice', 'Country']

# Apply Label Encoding
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# One-Hot Encoding for Country, PreferredDevice, and Education
# (these columns already hold integer codes from the loop above, so the
# dummy columns are named Country_0, Country_1, ... instead of by category)
df = pd.get_dummies(df, columns=['Country', 'PreferredDevice', 'Education'])

# Calculate percentage of missing data in each column
missing_percentage = df.isnull().mean() * 100

# Display only columns with missing values
missing_percentage = missing_percentage[missing_percentage > 0]

print("Percentage of Missing Data in Each Column:\n")
print(missing_percentage)

Percentage of Missing Data in Each Column:

Series([], dtype: float64)

df.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 20 non-null int64
1 Gender 20 non-null int64
2 Age 20 non-null float64
3 Subscribed 20 non-null int64
4 MonthlyIncome 20 non-null float64
5 LoyaltyScore 20 non-null float64
6 TotalPurchases 20 non-null int64
7 Country_0 20 non-null bool
8 Country_1 20 non-null bool
9 Country_2 20 non-null bool
10 Country_3 20 non-null bool
11 PreferredDevice_0 20 non-null bool
12 PreferredDevice_1 20 non-null bool
13 PreferredDevice_2 20 non-null bool
14 Education_0 20 non-null bool
15 Education_1 20 non-null bool
16 Education_2 20 non-null bool
dtypes: bool(10), float64(3), int64(4)
memory usage: 1.4 KB
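
The numeric suffixes in names such as `Country_0` come from one-hot encoding columns that were already converted to integer codes. The sketch below contrasts the two orders on five country values from the preview. It uses pandas category codes as a stand-in for `LabelEncoder` (an assumption made to keep the example dependency-free; for string input both assign codes in sorted order).

```python
import pandas as pd

countries = pd.DataFrame({"Country": ["India", "USA", "UK", "India", "Germany"]})

# Path 1: integer-encode first, then one-hot (the order used in the cells above).
encoded = countries.copy()
encoded["Country"] = encoded["Country"].astype("category").cat.codes
ohe_from_codes = pd.get_dummies(encoded, columns=["Country"])

# Path 2: one-hot directly from the original strings.
ohe_from_names = pd.get_dummies(countries, columns=["Country"])

print(list(ohe_from_codes.columns))
print(list(ohe_from_names.columns))
```

The second path keeps the category names in the column labels, which makes later inspection easier.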

from sklearn.preprocessing import LabelEncoder

df_label = df.copy()
for col in ['Gender', 'Subscribed']:  # Assume Country, PreferredDevice, Education already one-hot encoded
    df_label[col] = LabelEncoder().fit_transform(df_label[col])

df_label.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 20 non-null int64
1 Gender 20 non-null int64
2 Age 20 non-null float64
3 Subscribed 20 non-null int64
4 MonthlyIncome 20 non-null float64
5 LoyaltyScore 20 non-null float64
6 TotalPurchases 20 non-null int64
7 Country_0 20 non-null bool
8 Country_1 20 non-null bool
9 Country_2 20 non-null bool
10 Country_3 20 non-null bool
11 PreferredDevice_0 20 non-null bool
12 PreferredDevice_1 20 non-null bool
13 PreferredDevice_2 20 non-null bool
14 Education_0 20 non-null bool
15 Education_1 20 non-null bool
16 Education_2 20 non-null bool
dtypes: bool(10), float64(3), int64(4)
memory usage: 1.4 KB

df = pd.read_csv("Customer_Data - Customer_Data.csv")

# Fill missing values
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean(), inplace=True)
df['LoyaltyScore'].fillna(df['LoyaltyScore'].mean(), inplace=True)

df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)
df['PreferredDevice'].fillna(df['PreferredDevice'].mode()[0], inplace=True)
df['Education'].fillna(df['Education'].mode()[0], inplace=True)

# Now safe to apply one-hot encoding
df_ohe = pd.get_dummies(df, columns=['Country', 'PreferredDevice', 'Education'])
df_ohe.info(memory_usage='deep')

8 Country_India 20 non-null bool
9 Country_UK 20 non-null bool
10 Country_USA 20 non-null bool
11 PreferredDevice_Laptop 20 non-null bool
12 PreferredDevice_Mobile 20 non-null bool
13 PreferredDevice_Tablet 20 non-null bool
14 Education_Graduate 20 non-null bool
15 Education_Post-Graduate 20 non-null bool
16 Education_Undergraduate 20 non-null bool
dtypes: bool(10), float64(3), int64(2), object(2)
memory usage: 3.5 KB
/tmp/ipython-input-20-417497002.py:2: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['Age'].fillna(df['Age'].mean(), inplace=True)
/tmp/ipython-input-20-417497002.py:3: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['MonthlyIncome'].fillna(df['MonthlyIncome'].mean(), inplace=True)
/tmp/ipython-input-20-417497002.py:4: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['LoyaltyScore'].fillna(df['LoyaltyScore'].mean(), inplace=True)


For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['PreferredDevice'].fillna(df['PreferredDevice'].mode()[0], inplace=True)
/tmp/ipython-input-20-417497002.py:8: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through ch
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are se

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] =

df['Education'].fillna(df['Education'].mode()[0], inplace=True)
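
The one-hot frame above reports 3.5 KB, against 1.4 KB for the fully encoded frame earlier. A likely reason is that `Gender` and `Subscribed` are still `object` columns here, and `memory_usage='deep'` counts the Python strings they hold. The sketch below is a rough illustration with 20 synthetic values, not the actual column.

```python
import pandas as pd

# 20 string values stored as Python objects (stand-in for the Gender column).
as_object = pd.Series(["Male", "Female"] * 10, dtype=object)

# The same information stored as 64-bit integer codes.
as_int = as_object.map({"Male": 1, "Female": 0}).astype("int64")

# deep=True includes the size of each Python string; index=False skips the index.
object_bytes = as_object.memory_usage(index=False, deep=True)
int_bytes = as_int.memory_usage(index=False, deep=True)

print(object_bytes, int_bytes)
```

The integer column costs 8 bytes per row, while each string object carries its own per-object overhead on top of the pointer.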

# Count unique customer profiles
profile_counts = df.groupby(['Gender', 'Country', 'PreferredDevice']).size()

# Display the result
print(profile_counts)

# Optional: Count total number of unique combinations
print(f"\nTotal unique customer profiles: {profile_counts.shape[0]}")

Gender  Country  PreferredDevice
Female  Germany  Laptop             2
                 Tablet             1
        India    Laptop             1
                 Mobile             2
                 Tablet             2
        UK       Mobile             1
                 Tablet             1
        USA      Laptop             1
                 Tablet             1
Male    Germany  Laptop             1
        India    Mobile             2
        UK       Mobile             2
        USA      Laptop             1
                 Mobile             2
dtype: int64

Total unique customer profiles: 14
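
The profile count is the number of distinct (Gender, Country, PreferredDevice) combinations, so it can be cross-checked by de-duplicating those three columns. A small sketch with invented rows (`toy` is hypothetical data, not the customer file):

```python
import pandas as pd

# Hypothetical rows: two of the five repeat an earlier combination.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "Country": ["India", "USA", "USA", "India", "UK"],
    "PreferredDevice": ["Mobile", "Laptop", "Laptop", "Mobile", "Mobile"],
})
keys = ["Gender", "Country", "PreferredDevice"]

# groupby(...).size() has one entry per distinct combination...
n_groups = toy.groupby(keys).size().shape[0]

# ...so its length matches the number of de-duplicated key rows.
n_unique = toy[keys].drop_duplicates().shape[0]

print(n_groups, n_unique)
```

The two counts agree only when the key columns contain no missing values, because `groupby` drops rows with `NaN` keys by default; this is one reason the imputation step comes first.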



# Behavioral analysis: Mean income and purchases grouped by subscription status
avg_behavior = df.groupby('Subscribed')[['MonthlyIncome', 'TotalPurchases']].mean()

print("Average MonthlyIncome and TotalPurchases by Subscription Status:")
print(avg_behavior)

Average MonthlyIncome and TotalPurchases by Subscription Status:
            MonthlyIncome  TotalPurchases
Subscribed
No           57777.777778       14.888889
Yes          55757.575758       12.636364
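
With only 20 rows, group means are easier to judge when the group sizes are printed next to them. A sketch using named aggregation follows, on the five preview rows as a stand-in (so the numbers differ from the full-data output above); the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in data: the five preview rows, before imputation.
toy = pd.DataFrame({
    "Subscribed": ["Yes", "No", "Yes", "No", "Yes"],
    "MonthlyIncome": [50000.0, 60000.0, np.nan, 45000.0, 70000.0],
    "TotalPurchases": [12, 15, 8, 20, 10],
})

# One row per group: row count plus the two means (NaN is skipped by mean).
summary = toy.groupby("Subscribed").agg(
    n=("MonthlyIncome", "size"),
    mean_income=("MonthlyIncome", "mean"),
    mean_purchases=("TotalPurchases", "mean"),
)

print(summary)
```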

# Filter users aged below 30
young_users = df[df['Age'] < 30]

# Count device preferences
device_trend = young_users['PreferredDevice'].value_counts()

print("Preferred Devices Among Users Aged Below 30:")
print(device_trend)

Preferred Devices Among Users Aged Below 30:
PreferredDevice
Mobile    3
Tablet    3
Laptop    1
Name: count, dtype: int64
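
The same counts can be expressed as shares with `value_counts(normalize=True)`. The sketch below rebuilds the seven under-30 preferences from the counts printed above rather than from the original file.

```python
import pandas as pd

# Seven device preferences reconstructed from the printed counts (3, 3, 1).
devices = pd.Series(
    ["Mobile"] * 3 + ["Tablet"] * 3 + ["Laptop"], name="PreferredDevice"
)

# normalize=True returns fractions of the total instead of raw counts.
shares = devices.value_counts(normalize=True)

print(shares.round(3))
```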

import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of LoyaltyScore by Gender
sns.boxplot(x='Gender', y='LoyaltyScore', data=df)
plt.title('Loyalty Score Distribution by Gender')
plt.ylabel('Loyalty Score')

plt.xlabel('Gender')
plt.show()
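
The plot itself is not reproduced in this printout. The box in a boxplot spans the first to third quartile with a line at the median, and those numbers can be computed directly. A sketch on the five preview rows (a stand-in, not the full dataset):

```python
import pandas as pd

# Stand-in data: Gender and LoyaltyScore from the five preview rows.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "LoyaltyScore": [7.0, 8.0, 6.0, 9.0, 5.0],
})

# Quartiles per gender; unstack turns the quantile level into columns.
quartiles = (
    toy.groupby("Gender")["LoyaltyScore"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```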

# Group by Education level and compute mean MonthlyIncome and LoyaltyScore
edu_insight = df.groupby('Education')[['MonthlyIncome', 'LoyaltyScore']].mean()

print("Average MonthlyIncome and LoyaltyScore by Education Level:")
print(edu_insight)

# Optional: sort by highest income or loyalty
edu_sorted_income = edu_insight.sort_values(by='MonthlyIncome', ascending=False)
edu_sorted_loyalty = edu_insight.sort_values(by='LoyaltyScore', ascending=False)
print("\nSorted by Income:")
print(edu_sorted_income)

print("\nSorted by Loyalty:")
print(edu_sorted_loyalty)

Average MonthlyIncome and LoyaltyScore by Education Level:
               MonthlyIncome  LoyaltyScore
Education
Graduate        55000.000000      6.929293
Post-Graduate   58200.000000      7.600000
Undergraduate   59333.333333      7.000000

Sorted by Income:
               MonthlyIncome  LoyaltyScore
Education
Undergraduate   59333.333333      7.000000
Post-Graduate   58200.000000      7.600000
Graduate        55000.000000      6.929293

Sorted by Loyalty:
               MonthlyIncome  LoyaltyScore
Education
Post-Graduate   58200.000000      7.600000
Undergraduate   59333.333333      7.000000
Graduate        55000.000000      6.929293
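
If only the leading education level per metric is needed, `idxmax` returns it without sorting the whole table. The sketch below re-enters the three rows printed above by hand (`edu_table` is an illustrative name).

```python
import pandas as pd

# Values copied from the printed "by Education Level" table.
edu_table = pd.DataFrame(
    {
        "MonthlyIncome": [55000.0, 58200.0, 59333.333333],
        "LoyaltyScore": [6.929293, 7.6, 7.0],
    },
    index=pd.Index(["Graduate", "Post-Graduate", "Undergraduate"], name="Education"),
)

# For each column, the index label of its largest value.
top_by_metric = edu_table.idxmax()

print(top_by_metric)
```

The differences between groups are small relative to a 20-row sample, so these rankings are best read as descriptive only.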

# Aggregate total purchases and average income by country
country_stats = df.groupby('Country').agg({
    'TotalPurchases': 'sum',
    'MonthlyIncome': 'mean'
})

# Sort and get top 2 countries for each metric
top_purchases = country_stats.sort_values(by='TotalPurchases', ascending=False).head(2)
top_income = country_stats.sort_values(by='MonthlyIncome', ascending=False).head(2)

print("🔝 Top 2 Countries by Total Purchases:\n", top_purchases)

print("\n💰 Top 2 Countries by Average Monthly Income:\n", top_income)

🔝 Top 2 Countries by Total Purchases:
          TotalPurchases  MonthlyIncome
Country
India                93   50428.571429
USA                  74   60533.333333

💰 Top 2 Countries by Average Monthly Income:
          TotalPurchases  MonthlyIncome
Country
Germany              58   65250.000000
USA                  74   60533.333333
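
`DataFrame.nlargest` is a compact alternative to `sort_values(...).head(2)`. The sketch below re-enters the three countries that appear in the printed output; the UK row is left out because its values are not shown anywhere above.

```python
import pandas as pd

# Values copied from the printed top-2 tables (UK is not shown, so it is omitted).
country_table = pd.DataFrame(
    {
        "TotalPurchases": [93, 74, 58],
        "MonthlyIncome": [50428.571429, 60533.333333, 65250.0],
    },
    index=pd.Index(["India", "USA", "Germany"], name="Country"),
)

# Top 2 rows by each metric, already ordered from largest to smallest.
top_purchases_alt = country_table.nlargest(2, "TotalPurchases")
top_income_alt = country_table.nlargest(2, "MonthlyIncome")

print(top_purchases_alt)
print(top_income_alt)
```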
