8/23/24, 11:48 AM descriptive analytics.ipynb - Colab
Descriptive analytics
1. statistics module
2. pandas
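The outline lists the statistics module, but the cells in this notebook only use pandas. As a minimal sketch of what the standard-library route could look like, the block below applies it to the first five Mthly_HH_Income values shown in the data.head() output of this notebook. The variable name incomes is illustrative only.

```python
import statistics

# First five Mthly_HH_Income values, copied from the data.head() output.
incomes = [5000, 6000, 10000, 10000, 12500]

mean_income = statistics.mean(incomes)      # arithmetic mean
median_income = statistics.median(incomes)  # middle value of the sorted list
mode_income = statistics.mode(incomes)      # most frequent value
sd_income = statistics.stdev(incomes)       # sample standard deviation (n - 1)

print('mean:', mean_income)
print('median:', median_income)
print('mode:', mode_income)
print('stdev:', round(sd_income, 2))
```

statistics.stdev divides by n - 1, which matches the default of pandas' std(); statistics.pstdev is the population version.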
pandas
load dataset
import pandas as pd
data = pd.read_csv('/content/sample_data/Inc_Exp_Data.csv')
print('Dataset dimension:',data.shape)
print('Columns :\n',data.columns)
Dataset dimension: (50, 7)
Columns :
Index(['Mthly_HH_Income', 'Mthly_HH_Expense', 'No_of_Fly_Members',
'Emi_or_Rent_Amt', 'Annual_HH_Income', 'Highest_Qualified_Member',
'No_of_Earning_Members'],
dtype='object')
data.head()
Mthly_HH_Income Mthly_HH_Expense No_of_Fly_Members Emi_or_Rent_Amt Annual_HH_I
0 5000 8000 3 2000
1 6000 7000 2 3000
2 10000 4500 2 0 1
3 10000 2000 1 0
4 12500 12000 2 3000 1
The dataset contains 50 rows and 7 columns. Six features are numeric and Highest_Qualified_Member is a string. There are no null values.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Mthly_HH_Income 50 non-null int64
1 Mthly_HH_Expense 50 non-null int64
2 No_of_Fly_Members 50 non-null int64
3 Emi_or_Rent_Amt 50 non-null int64
4 Annual_HH_Income 50 non-null int64
5 Highest_Qualified_Member 50 non-null object
6 No_of_Earning_Members 50 non-null int64
dtypes: int64(6), object(1)
memory usage: 2.9+ KB
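The dtype and null counts reported by data.info() can also be checked programmatically. The sketch below is an assumption-laden stand-in: instead of reading the CSV, it rebuilds only rows 46 to 49, which are printed in full in the outlier section of this notebook, and the names sample, numeric_cols, text_cols and null_count are illustrative.

```python
import pandas as pd

# Rows 46-49 as printed in the outlier section; a stand-in for the full CSV.
sample = pd.DataFrame(
    {
        'Mthly_HH_Income': [98000, 100000, 100000, 100000],
        'Mthly_HH_Expense': [25000, 30000, 50000, 40000],
        'No_of_Fly_Members': [5, 6, 4, 6],
        'Emi_or_Rent_Amt': [0, 0, 20000, 10000],
        'Annual_HH_Income': [1152480, 1404000, 1032000, 1320000],
        'Highest_Qualified_Member': ['Professional', 'Graduate',
                                     'Professional', 'Post-Graduate'],
        'No_of_Earning_Members': [2, 3, 2, 1],
    },
    index=[46, 47, 48, 49],
)

# Split the columns by dtype and count missing cells across the whole frame.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
text_cols = sample.select_dtypes(exclude='number').columns.tolist()
null_count = int(sample.isnull().sum().sum())

print('numeric columns:', len(numeric_cols))
print('non-numeric columns:', text_cols)
print('null cells:', null_count)
```

Running the same three expressions on data instead of sample would check the full dataset.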
data.describe()
Mthly_HH_Income Mthly_HH_Expense No_of_Fly_Members Emi_or_Rent_Amt Annual_
count 50.000000 50.000000 50.000000 50.000000 5.0
mean 41558.000000 18818.000000 4.060000 3060.000000 4.9
std 26097.908979 12090.216824 1.517382 6241.434948 3.2
min 5000.000000 2000.000000 1.000000 0.000000 6.4
25% 23550.000000 10000.000000 3.000000 0.000000 2.5
50% 35000.000000 15500.000000 4.000000 0.000000 4.4
75% 50375.000000 25000.000000 5.000000 3500.000000 5.9
max 100000.000000 50000.000000 7.000000 35000.000000 1.4
visualize summary statistics
import matplotlib.pyplot as plt
bp = data[['Mthly_HH_Income', 'Mthly_HH_Expense','No_of_Fly_Members','No_of_Earning_Mem
plt.show()
Monthly income, monthly expenses and No_of_Earning_Members have outliers.
bp=plt.boxplot(data['Mthly_HH_Income'])
print(type(bp))
print(bp.keys())
<class 'dict'>
dict_keys(['whiskers', 'caps', 'boxes', 'medians', 'fliers', 'means'])
{'whiskers': [<matplotlib.lines.Line2D object at 0x7fb109d0e5c0>, <matplotlib.lines.L
remove outliers
mn, mx = [item.get_ydata()[1] for item in bp['caps']]
print('max:', mx,'min:', mn)
max: 90000 min: 5000
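The cap values come from the boxplot's whisker rule. With matplotlib's default whis=1.5, the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box. A short worked check, using the Mthly_HH_Income quartiles from the data.describe() output above (the variable names are illustrative):

```python
# Quartiles of Mthly_HH_Income, copied from the data.describe() output.
q1, q3 = 23550.0, 50375.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # points below this are drawn as fliers
upper_fence = q3 + 1.5 * iqr   # points above this are drawn as fliers

print('IQR:', iqr)
print('lower fence:', lower_fence)
print('upper fence:', upper_fence)
```

The upper fence works out to 90612.5, so the upper cap sits at 90000, the largest income not exceeding it. The lower fence is negative, so the lower cap is simply the minimum, 5000.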
data.drop(data[data['Mthly_HH_Income']>mx].index, inplace=True)
plt.boxplot(data['Mthly_HH_Income'])
{'whiskers': [<matplotlib.lines.Line2D at 0x7fb109ec6020>,
<matplotlib.lines.Line2D at 0x7fb109ec62c0>],
'caps': [<matplotlib.lines.Line2D at 0x7fb109ec6560>,
<matplotlib.lines.Line2D at 0x7fb109ec6800>],
'boxes': [<matplotlib.lines.Line2D at 0x7fb109ec5d80>],
'medians': [<matplotlib.lines.Line2D at 0x7fb109ec6aa0>],
'fliers': [<matplotlib.lines.Line2D at 0x7fb109ec6d40>],
'means': []}
print('outliers in monthly income:\n',data[data['Mthly_HH_Income']>90000])
outliers in monthly income:
Mthly_HH_Income Mthly_HH_Expense No_of_Fly_Members Emi_or_Rent_Amt \
46 98000 25000 5 0
47 100000 30000 6 0
48 100000 50000 4 20000
49 100000 40000 6 10000
Annual_HH_Income Highest_Qualified_Member No_of_Earning_Members
46 1152480 Professional 2
47 1404000 Graduate 3
48 1032000 Professional 2
49 1320000 Post-Graduate 1
print('Average & SD on monthly income: ',data['Mthly_HH_Income'].mean(), round(data['Mthl
print('Average & SD on monthly expenses: ',data['Mthly_HH_Expense'].mean(),round(data['Mt
print('Average & SD on annual income: ',data['Annual_HH_Income'].mean(),round(data['Annu
print('Average earning members: ',data['No_of_Earning_Members'].mean())
Average & SD on monthly income: 41558.0 26097.91
Average & SD on monthly expenses: 18818.0 12090.22
Average & SD on annual income: 490019.04 320135.79
Average earning members: 1.46
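The SD figures above are sample standard deviations: pandas' std() divides by n - 1 by default (ddof=1). The sketch below shows the difference from the population version on the four Mthly_HH_Expense values of rows 46 to 49 printed earlier. It is only an illustration on that small subset, not a recomputation of the figures above.

```python
import pandas as pd

# Mthly_HH_Expense for rows 46-49, copied from the outlier listing above.
expenses = pd.Series([25000, 30000, 50000, 40000], name='Mthly_HH_Expense')

sample_sd = expenses.std()            # default ddof=1, divides by n - 1
population_sd = expenses.std(ddof=0)  # divides by n

print('mean:', expenses.mean())
print('sample SD:', round(sample_sd, 2))
print('population SD:', round(population_sd, 2))
```

With 50 rows the two versions differ by about 1%, but the choice should still be stated when the numbers are reported.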
print('max & min monthly income: ',data['Mthly_HH_Income'].max(),data['Mthly_HH_Income'].
print('max & min monthly expenses: ',data['Mthly_HH_Expense'].max(),data['Mthly_HH_Expens
print('max & min earning members: ',data['No_of_Earning_Members'].max(),data['No_of_Earni
print('max & min annual income: ',data['Annual_HH_Income'].max(),data['Annual_HH_Income']
max & min monthly income: 100000 5000
max & min monthly expenses: 50000 2000
max & min earning members: 4 1
max & min annual income: 1404000 64200
categorical fields
print('Most occurring value in highest qualified : ',data['Highest_Qualified_Member'].mode
print('Most occurring value in no. of earning members: ',data['No_of_Earning_Members'].mod
Most occurring value in highest qualified : 0 Graduate
Name: Highest_Qualified_Member, dtype: object
Most occurring value in no. of earning members: 0 1
Name: No_of_Earning_Members, dtype: int64
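The printed result includes an index and a dtype line because mode() returns a Series, since ties are possible. A minimal sketch of pulling out a plain scalar instead is given below. It rebuilds the qualification column from the category counts reported by value_counts() in this notebook, and the names counts, qualification, top and top_count are illustrative.

```python
import pandas as pd

# Category counts as reported by value_counts() in this notebook.
counts = {'Graduate': 19, 'Under-Graduate': 10, 'Professional': 10,
          'Post-Graduate': 6, 'Illiterate': 5}

# Expand the counts back into one label per household.
qualification = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name='Highest_Qualified_Member',
)

modes = qualification.mode()   # Series: may hold several values on a tie
top = modes.iloc[0]            # first (here the only) modal value
top_count = int(qualification.value_counts().iloc[0])

print(f'Most frequent qualification: {top} ({top_count} of {len(qualification)})')
```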
visualizations
import matplotlib.pyplot as plt
earn_members = data['No_of_Earning_Members'].unique()
earn_members
array([1, 2, 3, 4])
earn_members = data['No_of_Earning_Members'].unique()
plt.hist(data['No_of_Earning_Members'])
plt.title('Number of earning members in the families')
plt.xlabel('No. of earning members')
plt.ylabel('Count of families')
plt.xticks(earn_members )
#plt.yticks(range(1,len(data),3))
plt.show()
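With the default of 10 equal-width bins, the bars of a histogram over integer values do not line up with the integer ticks. One possible alternative, sketched here under the assumption that the column holds only the values 1 to 4, is to place bin edges at half-integers. The data is rebuilt from the value_counts() output in this notebook, and the names earning and edges are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Rebuild No_of_Earning_Members from the value_counts() output: 33, 12, 4, 1.
earning = np.repeat([1, 2, 3, 4], [33, 12, 4, 1])

# Edges at 0.5, 1.5, ..., 4.5 give one bin centred on each integer.
edges = np.arange(0.5, 5.5, 1.0)

hist_counts, _, _ = plt.hist(earning, bins=edges, rwidth=0.8)
plt.title('Number of earning members in the families')
plt.xlabel('No. of earning members')
plt.ylabel('Count of families')
plt.xticks([1, 2, 3, 4])
plt.show()

print(hist_counts)
```

For a discrete column like this one, a bar chart of value_counts() would be an equally reasonable choice.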
family_members = data['No_of_Fly_Members'].unique()
plt.hist(data['No_of_Fly_Members'])
plt.title('Number of family members in the families')
plt.xlabel('No. of family Members')
plt.ylabel('Count of families')
plt.xticks(family_members )
plt.show()
bar chart
x = range(len(data))
idx = [i+0.4 for i in x]
plt.bar(x,data['No_of_Fly_Members'], width=0.4, label='Family members')
plt.bar(idx,data['No_of_Earning_Members'], width=0.4,label='Earning members')
plt.title('No. of family members & earning members in each family')
plt.legend()
plt.xticks(range(0,51,5))
plt.ylabel('Count')
plt.show()
plt.plot(data['Mthly_HH_Income'], label='Income')
plt.plot(data['Mthly_HH_Expense'], label='Expenditure')
plt.legend()
plt.title('Family Income vs Expenditure')
plt.ylabel('Amount ')
plt.show()
x = data['No_of_Earning_Members'].value_counts()
print(x)
plt.pie(x,labels=x.index, autopct='%.0f%%' )
plt.title('Proportion of No. of Earning members in the families ')
plt.show()
1 33
2 12
3 4
4 1
Name: No_of_Earning_Members, dtype: int64
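The percentages that autopct writes on the pie slices are each count divided by the total, times 100. A short worked check using the counts printed above (the names earning_counts, share and labels are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
earning_counts = pd.Series({1: 33, 2: 12, 3: 4, 4: 1})

# Share of each category as a percentage of all 50 families.
share = earning_counts / earning_counts.sum() * 100

# Same formatting as autopct='%.0f%%'.
labels = [f'{pct:.0f}%' for pct in share]
print(dict(zip(earning_counts.index, labels)))
```

About two thirds of the families (66%) have a single earning member, and 2% have four.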
x = data['Highest_Qualified_Member'].value_counts()
print(x)
plt.pie(x,labels=x.index,autopct='%.0f%%' )
plt.title('Proportion of highest qualified in the families ')
plt.show()
Graduate 19
Under-Graduate 10
Professional 10
Post-Graduate 6
Illiterate 5
Name: Highest_Qualified_Member, dtype: int64