AIML 202046702
ASSIGNMENT-1
importing required libraries
In [1]: import pandas as pd
import matplotlib.pyplot as plt
reading the dataset
In [2]: df=pd.read_csv('amazon.csv')
1. Display Top 5 Rows of The Dataset.
In [3]: df.head(5)
Out[3]: year state month number date
0 1998 Acre Janeiro 0.0 1998-01-01
1 1999 Acre Janeiro 0.0 1999-01-01
2 2000 Acre Janeiro 0.0 2000-01-01
3 2001 Acre Janeiro 0.0 2001-01-01
4 2002 Acre Janeiro 0.0 2002-01-01
2. Check Last 5 Rows.
In [4]: df.tail(5)
Out[4]: year state month number date
6449 2012 Tocantins Dezembro 128.0 2012-01-01
6450 2013 Tocantins Dezembro 85.0 2013-01-01
6451 2014 Tocantins Dezembro 223.0 2014-01-01
6452 2015 Tocantins Dezembro 373.0 2015-01-01
6453 2016 Tocantins Dezembro 119.0 2016-01-01
3. Find Shape of Our Dataset (Number of Rows
and Number of Columns).
In [5]: print('No. of rows: ',df.shape[0])
print('No. of columns: ',df.shape[1])
12202040501049 PARAM H DHOLAKIA
No. of rows: 6454
No. of columns: 5
4. Get Information About Our Dataset: Total
Number of Rows, Total Number of Columns,
Datatype of Each Column and Memory
Requirement.
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 6454 non-null int64
1 state 6454 non-null object
2 month 6454 non-null object
3 number 6454 non-null float64
4 date 6454 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 252.2+ KB
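The `info()` output lists `date` as `object`, meaning the values are still plain strings. One option, sketched here on a few rows copied from the `head()` output rather than on the real `amazon.csv`, is to let `read_csv` parse the column while loading. The name `sample_csv` is a stand-in introduced only for this illustration.

```python
import io
import pandas as pd

# Stand-in for amazon.csv: three rows copied from the head() output above
sample_csv = io.StringIO(
    "year,state,month,number,date\n"
    "1998,Acre,Janeiro,0.0,1998-01-01\n"
    "1999,Acre,Janeiro,0.0,1999-01-01\n"
    "2000,Acre,Janeiro,0.0,2000-01-01\n"
)

# parse_dates converts the 'date' column to datetime64 while reading
sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(sample.dtypes)
```

With the column parsed up front, a later `pd.to_datetime` call on a filtered slice is no longer needed.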
5. Check For Duplicate Data and Drop Them.
In [7]: df.columns
Out[7]: Index(['year', 'state', 'month', 'number', 'date'], dtype='object')
In [8]: duplicate=df[df.duplicated()]
duplicate
Out[8]: year state month number date
259 2017 Alagoas Janeiro 38.0 2017-01-01
2630 1998 Mato Grosso Janeiro 0.0 1998-01-01
2650 1998 Mato Grosso Fevereiro 0.0 1998-01-01
2670 1998 Mato Grosso Março 0.0 1998-01-01
2690 1998 Mato Grosso Abril 0.0 1998-01-01
2710 1998 Mato Grosso Maio 0.0 1998-01-01
3586 1998 Paraiba Janeiro 0.0 1998-01-01
3606 1998 Paraiba Fevereiro 0.0 1998-01-01
3621 2013 Paraiba Fevereiro 9.0 2013-01-01
3626 1998 Paraiba Março 0.0 1998-01-01
3646 1998 Paraiba Abril 0.0 1998-01-01
3666 1998 Paraiba Maio 0.0 1998-01-01
4542 1998 Rio Janeiro 0.0 1998-01-01
4562 1998 Rio Fevereiro 0.0 1998-01-01
4582 1998 Rio Março 0.0 1998-01-01
4585 2001 Rio Março 0.0 2001-01-01
4590 2006 Rio Março 8.0 2006-01-01
4602 1998 Rio Abril 0.0 1998-01-01
4608 2004 Rio Abril 3.0 2004-01-01
4613 2009 Rio Abril 1.0 2009-01-01
4622 1998 Rio Maio 0.0 1998-01-01
4631 2007 Rio Maio 2.0 2007-01-01
4632 2008 Rio Maio 0.0 2008-01-01
4645 2001 Rio Junho 13.0 2001-01-01
4781 1998 Rio Janeiro 0.0 1998-01-01
4800 2017 Rio Janeiro 28.0 2017-01-01
4801 1998 Rio Fevereiro 0.0 1998-01-01
4821 1998 Rio Março 0.0 1998-01-01
4841 1998 Rio Abril 0.0 1998-01-01
4861 1998 Rio Maio 0.0 1998-01-01
4864 2001 Rio Maio 4.0 2001-01-01
4910 2007 Rio Julho 7.0 2007-01-01
In [9]: df=df.drop_duplicates()
In [10]: df
Out[10]: year state month number date
0 1998 Acre Janeiro 0.0 1998-01-01
1 1999 Acre Janeiro 0.0 1999-01-01
2 2000 Acre Janeiro 0.0 2000-01-01
3 2001 Acre Janeiro 0.0 2001-01-01
4 2002 Acre Janeiro 0.0 2002-01-01
... ... ... ... ... ...
6449 2012 Tocantins Dezembro 128.0 2012-01-01
6450 2013 Tocantins Dezembro 85.0 2013-01-01
6451 2014 Tocantins Dezembro 223.0 2014-01-01
6452 2015 Tocantins Dezembro 373.0 2015-01-01
6453 2016 Tocantins Dezembro 119.0 2016-01-01
6422 rows × 5 columns
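The row count drops from 6454 to 6422, which matches the 32 duplicate rows listed in Out[8]. The same check can be done without printing the duplicates, and the index can be renumbered afterwards so it has no gaps. This is a minimal sketch on a hypothetical toy frame (`toy` is not part of the assignment data).

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical
toy = pd.DataFrame({
    "year": [1998, 1998, 1998, 1999],
    "state": ["Acre", "Rio", "Rio", "Rio"],
    "number": [0.0, 0.0, 0.0, 5.0],
})

# duplicated() flags every row that repeats an earlier row
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first copy; reset_index renumbers from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```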
6. Check Null Values in The Dataset.
In [11]: #count the null values in each column
df.isna().sum()
Out[11]: year 0
state 0
month 0
number 0
date 0
dtype: int64
7. Get Overall Statistics About the Dataframe.
In [12]: df.describe()
Out[12]: year number
count 6422.000000 6422.000000
mean 2007.490969 108.815178
std 5.731806 191.142482
min 1998.000000 0.000000
25% 2003.000000 3.000000
50% 2007.000000 24.497000
75% 2012.000000 114.000000
max 2017.000000 998.000000
8. Rename Month Names to English.
In [13]: df['month'].unique()
Out[13]: array(['Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho', 'Julho',
'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro'],
dtype=object)
In [14]: month_map={'Janeiro':'January','Fevereiro':'February','Março':'March','Abril':'A
'Agosto':'August', 'Setembro':'September', 'Outubro':'October', 'Novembro
In [15]: df['month']=df['month'].map(month_map)
df['month'].unique()
Out[15]: array(['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November', 'December'],
dtype=object)
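The mapping cell is cut off in this export. A complete dictionary consistent with the twelve Portuguese names in Out[13] and the twelve English names in Out[15] might look like the sketch below. It is a reconstruction from those two outputs, not a copy of the original cell. Note that `Series.map` returns `NaN` for any value missing from the dictionary, so counting nulls afterwards is a cheap sanity check.

```python
import pandas as pd

# Reconstructed from Out[13] and Out[15]; not copied from the truncated cell
month_map = {
    "Janeiro": "January", "Fevereiro": "February", "Março": "March",
    "Abril": "April", "Maio": "May", "Junho": "June",
    "Julho": "July", "Agosto": "August", "Setembro": "September",
    "Outubro": "October", "Novembro": "November", "Dezembro": "December",
}

# Small illustrative series instead of the full 'month' column
months = pd.Series(["Janeiro", "Março", "Dezembro"])
translated = months.map(month_map)

# Any name missing from month_map would show up here as NaN
unmapped = translated.isna().sum()
print(translated.tolist(), unmapped)
```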
9. Total Number of Fires Registered.
In [16]: print('Total fires registered: ',df.shape[0])
Total fires registered: 6422
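`df.shape[0]` counts rows, so 6422 is the number of records left after de-duplication rather than the number of fires. Assuming the `number` column holds the fire count for each record, the total would be the sum of that column; the `describe()` output above implies roughly 108.815178 × 6422 ≈ 698,811. A minimal sketch of the difference on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: three records with different fire counts
toy = pd.DataFrame({
    "state": ["Acre", "Acre", "Rio"],
    "number": [0.0, 12.0, 30.0],
})

n_records = toy.shape[0]       # how many rows there are
n_fires = toy["number"].sum()  # how many fires those rows report
print(n_records, n_fires)
```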
10. In Which Month Was the Maximum Number of
Forest Fires Reported?
In [17]: df.columns
Out[17]: Index(['year', 'state', 'month', 'number', 'date'], dtype='object')
In [18]: no_of_cases=df.groupby('month')['number'].sum().sort_values(ascending=False).ind
print(no_of_cases[0],' is the month with highest no. of cases')
July is the month with highest no. of cases
11. In Which Year Was the Maximum Number of
Forest Fires Reported?
In [19]: no_of_cases=df.groupby('year')['number'].sum().sort_values(ascending=False).inde
print(no_of_cases[0],' is the year with highest no. of cases')
2003 is the year with highest no. of cases
12. In Which State Was the Maximum Number of
Forest Fires Reported?
In [20]: no_of_cases=df.groupby('state')['number'].sum().sort_values(ascending=False).ind
print(no_of_cases[0],' is the state with highest no. of cases')
Mato Grosso is the state with highest no. of cases
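Questions 10 to 12 share one pattern: group by a column, sum `number`, and pick the label with the largest total. Sorting the whole result works, but `idxmax()` returns that label directly. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "month": ["July", "July", "May", "May"],
    "number": [10.0, 5.0, 7.0, 3.0],
})

totals = toy.groupby("month")["number"].sum()
top_month = totals.idxmax()  # label of the group with the largest sum
top_value = totals.max()     # the largest sum itself
print(top_month, top_value)
```

Swapping `"month"` for `"year"` or `"state"` gives the other two answers with the same two calls.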
13. Find the Total Number of Fires Reported in
Amazonas.
In [21]: df.columns
Out[21]: Index(['year', 'state', 'month', 'number', 'date'], dtype='object')
In [22]: #extract rows with state Amazonas
df2=df[df['state']=='Amazonas']
In [23]: print("Total number of forest fires in Amazonas:",df2['number'].sum()) #Get tota
Total number of forest fires in Amazonas: 30650.129
14. Display the Number of Fires Reported in
Amazonas (Year-Wise).
In [24]: df.columns
Out[24]: Index(['year', 'state', 'month', 'number', 'date'], dtype='object')
In [25]: df3=df[df['state']=='Amazonas'].groupby('year')['number'].sum()
df3
Out[25]: year
1998 946.000
1999 1061.000
2000 853.000
2001 1297.000
2002 2852.000
2003 1524.268
2004 2298.207
2005 1657.128
2006 997.640
2007 589.601
2008 2717.000
2009 1320.601
2010 2324.508
2011 1652.538
2012 1110.641
2013 905.217
2014 2385.909
2015 1189.994
2016 2060.972
2017 906.905
Name: number, dtype: float64
15. Display the Number of Fires Reported in
Amazonas (Day-Wise).
In [26]: #extract rows with state Amazonas; .copy() makes df2 an independent frame
df2=df[df['state']=='Amazonas'].copy()
In [27]: #convert date column to date-time format
df2['date'] = pd.to_datetime(df2['date'])
df3=df2.groupby(df2['date'].dt.dayofweek)['number'].sum()
In [28]: #pandas dayofweek numbers Monday as 0 and Sunday as 6
day_names = {0: 'Monday',1: 'Tuesday',2: 'Wednesday',3: 'Thursday',4: 'Friday',5: 'Saturday',6: 'Sunday'}
In [29]: #map numeric day to names of day
df3.index = df3.index.map(day_names)
In [30]: df3
Out[30]: date
Monday 1886.601
Tuesday 6474.217
Wednesday 3910.177
Thursday 5754.802
Friday 5446.480
Saturday 4162.666
Sunday 3015.186
Name: number, dtype: float64
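`Series.dt.dayofweek` counts Monday as 0 and Sunday as 6, so any hand-written lookup has to start at Monday. `Series.dt.day_name()` avoids the lookup entirely. The sketch below uses three Amazonas yearly totals from Out[25] (2001, 2002 and 2007) as stand-in rows. Every date shown in this notebook falls on 1 January, so this grouping appears to reflect the weekday of New Year's Day in each year rather than the day a fire occurred.

```python
import pandas as pd

# Stand-in rows: Amazonas totals for 2001, 2002 and 2007 taken from Out[25]
toy = pd.DataFrame({
    "date": ["2001-01-01", "2002-01-01", "2007-01-01"],
    "number": [1297.0, 2852.0, 589.601],
})
toy["date"] = pd.to_datetime(toy["date"])

# day_name() yields 'Monday', 'Tuesday', ... with no manual mapping
by_day = toy.groupby(toy["date"].dt.day_name())["number"].sum()
print(by_day)
```

Both 1 January 2001 and 1 January 2007 were Mondays, so their totals land in the same group.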
16. Find the Total Number of Fires Reported in
2015 and Visualize the Data by Month.
In [31]: #total fire reports in each month for 2015
df2=df[df['year']==2015].groupby('month')['number'].sum().reset_index()
In [32]: df2
Out[32]: month number
0 April 2573.000
1 August 4363.125
2 December 4088.522
3 February 2309.000
4 January 4635.000
5 July 4364.392
6 June 3260.552
7 March 2202.000
8 May 2384.000
9 November 4034.518
10 October 4499.525
11 September 2494.658
In [33]: plt.figure(figsize=(20, 5)) #to ensure image readability
plt.bar(df2['month'],df2['number'])
plt.show()
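`groupby('month')` sorts the month names alphabetically, so the bars run April, August, December and so on. If calendar order is wanted, one approach is to turn the column into an ordered categorical before sorting. This is a sketch on three rows taken from Out[32]; `month_order` is a helper list introduced here, not part of the original notebook.

```python
import pandas as pd

# Helper list (an assumption of this sketch) giving the desired order
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Three rows copied from Out[32], in the alphabetical order groupby produced
toy = pd.DataFrame({
    "month": ["April", "January", "March"],
    "number": [2573.0, 4635.0, 2202.0],
})

# An ordered categorical sorts by position in month_order, not by spelling
toy["month"] = pd.Categorical(toy["month"], categories=month_order, ordered=True)
ordered = toy.sort_values("month").reset_index(drop=True)
print(ordered)
```

Passing `ordered['month'].astype(str)` and `ordered['number']` to `plt.bar` would then draw the bars from January onward.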
17. Find the Average Number of Fires Reported,
from Highest to Lowest (State-Wise).
In [34]: #Group the data by state and find average reports state-wise
df2=df.groupby('state')['number'].mean().reset_index()
In [35]: #sort values from highest to lowest average
df2.sort_values('number',ascending=False)
Out[35]: state number
20 Sao Paulo 213.896226
10 Mato Grosso 203.479975
4 Bahia 187.222703
15 Piau 158.174674
8 Goias 157.721841
11 Minas Gerais 156.800243
22 Tocantins 141.037176
3 Amazonas 128.243218
5 Ceara 127.314071
12 Paraiba 111.073979
9 Maranhao 105.142808
13 Pará 102.561272
14 Pernambuco 102.502092
18 Roraima 102.029598
19 Santa Catarina 101.924067
2 Amapa 91.345506
17 Rondonia 84.876272
0 Acre 77.255356
16 Rio 64.698515
7 Espirito Santo 27.389121
1 Alagoas 19.271967
6 Distrito Federal 14.899582
21 Sergipe 13.543933
18. Find the State Names Where Fires Were
Reported in the 'dec' Month.
In [36]: states=df[df['month']=='December']['state'].unique()
In [37]: print("List of states:")
for i in states:
    print(i)
List of states:
Acre
Alagoas
Amapa
Amazonas
Bahia
Ceara
Distrito Federal
Espirito Santo
Goias
Maranhao
Mato Grosso
Minas Gerais
Pará
Paraiba
Pernambuco
Piau
Rio
Rondonia
Roraima
Santa Catarina
Sao Paulo
Sergipe
Tocantins
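The filter in cell 36 keeps every state that has a December record, including records whose `number` is 0. If "fires were reported" is read as at least one fire, a second condition on `number` is needed. A minimal sketch on a hypothetical toy frame (whether the full dataset actually contains such zero-only states is not checked here):

```python
import pandas as pd

# Hypothetical toy frame; Acre has a December record but zero fires
toy = pd.DataFrame({
    "state": ["Acre", "Rio", "Tocantins"],
    "month": ["December", "December", "December"],
    "number": [0.0, 3.0, 128.0],
})

# Combine both conditions with &; each side needs its own parentheses
mask = (toy["month"] == "December") & (toy["number"] > 0)
states_with_fires = sorted(toy.loc[mask, "state"].unique())
print(states_with_fires)
```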