Assignment - 10 - Pandas

The document outlines a series of steps for analyzing the iris dataset using Pandas in Python. It includes loading the dataset, performing various data manipulations such as calculating statistics, filtering data, and adding new columns. Each step is accompanied by code snippets demonstrating the required operations.

pandas-assignment-day-1-iris

November 8, 2024

Step 1. Load the iris.csv dataset.


[11]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')
df

[11]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. … … … … …
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

1. Write a Pandas program to get the data types of the fields of the given csv data (iris.csv)

[1]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Display the data types of each column


print(df.dtypes)

Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object

2. Write a Pandas program to find the sum, mean, max, min value of ‘SepalLengthCm’
column of dataframe
[2]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Calculate sum, mean, max, and min for 'SepalLengthCm' column


sepal_length_sum = df['SepalLengthCm'].sum()
sepal_length_mean = df['SepalLengthCm'].mean()
sepal_length_max = df['SepalLengthCm'].max()
sepal_length_min = df['SepalLengthCm'].min()

# Display the results


print(f"Sum of SepalLengthCm: {sepal_length_sum}")
print(f"Mean of SepalLengthCm: {sepal_length_mean}")
print(f"Max of SepalLengthCm: {sepal_length_max}")
print(f"Min of SepalLengthCm: {sepal_length_min}")

Sum of SepalLengthCm: 876.5


Mean of SepalLengthCm: 5.843333333333334
Max of SepalLengthCm: 7.9
Min of SepalLengthCm: 4.3
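The four separate reductions above can also be collapsed into a single `.agg()` call, which returns all of them as one labelled Series. A minimal sketch on a small inline Series (so it runs without `iris.csv`; the values are the first five SepalLengthCm entries shown earlier):

```python
import pandas as pd

# Inline stand-in for the SepalLengthCm column (first five iris values)
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name='SepalLengthCm')

# One .agg() call computes all four statistics at once
stats = s.agg(['sum', 'mean', 'max', 'min'])
print(stats)
```

The result is indexed by the aggregation names, so `stats['mean']` etc. read naturally.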

3. Write a Pandas program to import the iris data, skipping the first twenty rows, into a
Pandas dataframe.
[3]: ### code here
import pandas as pd

# Load the iris dataset from csv file, skipping the first 20 data rows
# Note: skiprows=20 would also consume the header line; passing the data
# row numbers 1..20 instead keeps the column names
df = pd.read_csv('iris.csv', skiprows=range(1, 21))

# Display the first few rows of the dataframe to verify


print(df.head())

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0  21            5.4           3.4            1.7           0.2  Iris-setosa
1  22            5.1           3.7            1.5           0.4  Iris-setosa
2  23            4.6           3.6            1.0           0.2  Iris-setosa
3  24            5.1           3.3            1.7           0.5  Iris-setosa
4  25            4.8           3.4            1.9           0.2  Iris-setosa
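The distinction between passing an int and a range to `skiprows` matters here: an int counts lines from the top of the file, header included, while a range of data line numbers leaves the header intact. A small demonstration on an in-memory CSV:

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n5,6\n"

# skiprows=<int> counts from the top of the file, so the header line is consumed too
df_int = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# skiprows=range(1, 2) skips only data line 1 and keeps the header row
df_range = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 2))

print(df_int.columns.tolist())    # the first data row became the header
print(df_range.columns.tolist())  # ['a', 'b'] preserved
```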

4. Write a Pandas program to display the last ten rows.


[4]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Display the last 10 rows of the dataframe


print(df.tail(10))

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


140 141 6.7 3.1 5.6 2.4
141 142 6.9 3.1 5.1 2.3
142 143 5.8 2.7 5.1 1.9
143 144 6.8 3.2 5.9 2.3
144 145 6.7 3.3 5.7 2.5
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
140 Iris-virginica
141 Iris-virginica
142 Iris-virginica
143 Iris-virginica
144 Iris-virginica
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

5. Write a Pandas program to create a subtotal of “PetalLengthCm” against Species
from the dataframe.
[5]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Group by 'Species' and calculate the sum of 'PetalLengthCm'


subtotal = df.groupby('Species')['PetalLengthCm'].sum()

# Display the subtotal


print(subtotal)

Species
Iris-setosa 73.2
Iris-versicolor 213.0
Iris-virginica 277.6
Name: PetalLengthCm, dtype: float64
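`groupby` is not limited to a single subtotal; passing a list to `.agg()` yields several statistics per species in one pass. A minimal sketch on an inline frame (illustrative values, not the real dataset):

```python
import pandas as pd

# Tiny stand-in for the iris frame
df = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'virginica'],
    'PetalLengthCm': [1.4, 1.3, 5.2],
})

# One pass computes sum, mean, and count per species
summary = df.groupby('Species')['PetalLengthCm'].agg(['sum', 'mean', 'count'])
print(summary)
```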
6. Write a Pandas program to find details where “PetalLengthCm” > 2.
[6]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Filter the rows where 'PetalLengthCm' is greater than 2


filtered_data = df[df['PetalLengthCm'] > 2]

# Display the filtered data


print(filtered_data)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


50 51 7.0 3.2 4.7 1.4
51 52 6.4 3.2 4.5 1.5
52 53 6.9 3.1 4.9 1.5
53 54 5.5 2.3 4.0 1.3
54 55 6.5 2.8 4.6 1.5
.. … … … … …
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species

50 Iris-versicolor
51 Iris-versicolor
52 Iris-versicolor
53 Iris-versicolor
54 Iris-versicolor
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[100 rows x 6 columns]


7. Write a Pandas program find all records that include two specific Species.
[7]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Filter rows where 'Species' is 'Iris-setosa' or 'Iris-versicolor'
# Note: the Species values carry the 'Iris-' prefix, so plain 'setosa'
# and 'versicolor' would match nothing and return an empty DataFrame


filtered_data = df[df['Species'].isin(['Iris-setosa', 'Iris-versicolor'])]

# Display the filtered data


print(filtered_data)

     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0     1            5.1           3.5            1.4           0.2  Iris-setosa
..   ..            ...           ...            ...           ...          ...

[100 rows x 6 columns]
8. Write a Pandas program to sort the records by the SepalLengthCm column.
[8]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Sort the records by 'SepalLengthCm' in ascending order


sorted_data = df.sort_values(by='SepalLengthCm')

# Display the sorted data


print(sorted_data)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


13 14 4.3 3.0 1.1 0.1
42 43 4.4 3.2 1.3 0.2
38 39 4.4 3.0 1.3 0.2
8 9 4.4 2.9 1.4 0.2

41 42 4.5 2.3 1.3 0.3
.. … … … … …
122 123 7.7 2.8 6.7 2.0
118 119 7.7 2.6 6.9 2.3
117 118 7.7 3.8 6.7 2.2
135 136 7.7 3.0 6.1 2.3
131 132 7.9 3.8 6.4 2.0

Species
13 Iris-setosa
42 Iris-setosa
38 Iris-setosa
8 Iris-setosa
41 Iris-setosa
.. …
122 Iris-virginica
118 Iris-virginica
117 Iris-virginica
135 Iris-virginica
131 Iris-virginica

[150 rows x 6 columns]

9. Write a Pandas program to insert a column in the sixth position of dataframe and
fill it with NaN values.
[9]: import pandas as pd
import numpy as np

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Insert a new column at the sixth position (index 5) and fill it with NaN values

df.insert(5, 'NewColumn', np.nan)

# Display the first few rows of the updated dataframe


print(df.head())

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm NewColumn \


0 1 5.1 3.5 1.4 0.2 NaN
1 2 4.9 3.0 1.4 0.2 NaN
2 3 4.7 3.2 1.3 0.2 NaN
3 4 4.6 3.1 1.5 0.2 NaN
4 5 5.0 3.6 1.4 0.2 NaN

Species
0 Iris-setosa

1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa

10. Write a Pandas program to add a summation row to the dataframe.


[10]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Calculate the sum of the numeric columns


summation_row = df.sum(numeric_only=True)

# Append the summation row to the DataFrame


df.loc['Total'] = summation_row

# Display the updated DataFrame


print(df)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1.0 5.1 3.5 1.4 0.2
1 2.0 4.9 3.0 1.4 0.2
2 3.0 4.7 3.2 1.3 0.2
3 4.0 4.6 3.1 1.5 0.2
4 5.0 5.0 3.6 1.4 0.2
… … … … … …
146 147.0 6.3 2.5 5.0 1.9
147 148.0 6.5 3.0 5.2 2.0
148 149.0 6.2 3.4 5.4 2.3
149 150.0 5.9 3.0 5.1 1.8
Total 11325.0 876.5 458.1 563.8 179.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
… …
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Total NaN

[151 rows x 6 columns]
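Note that in the output above the integer Id column prints as 1.0, 2.0, …: assigning a float Series as a new row via `.loc` upcasts integer columns to float. A minimal sketch of that side effect, on a small inline frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]})

# Appending the float summation Series via .loc upcasts the int column 'x'
df.loc['Total'] = df.sum(numeric_only=True)
print(df.dtypes)
```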

pandas-assignment-day-2

November 8, 2024

[20]: import pandas as pd
import numpy as np

data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no',
                    'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Write a Pandas program to create a dataframe from the given dictionary, with the given index labels
[21]: # Create a DataFrame
df = pd.DataFrame(data, index=labels)

# Display the created DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

1. Write a Pandas program to select the rows where the score is missing, i.e., NaN
[22]: # Create a DataFrame
df = pd.DataFrame(data)

# Select the rows where 'Score' is NaN


missing_score_rows = df[df['score'].isna()]

# Display the rows with missing scores
print(missing_score_rows)

name score attempts qualify


3 James NaN 3 no
7 Laura NaN 1 no
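The complement of `isna()` is `notna()`, which selects the rows with a present score. A minimal sketch on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['James', 'Emily'], 'score': [np.nan, 9.0]})

missing = df[df['score'].isna()]    # rows where score is NaN
present = df[df['score'].notna()]   # the complement
print(len(missing), len(present))
```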

2. Write a Pandas program to change the score in row ‘d’ to 11.5.


[27]: # Create a DataFrame
df = pd.DataFrame(data, index=labels) # Setting custom index

# Change the score in row 'd' to 11.5


df.loc['d', 'score'] = 11.5

# Display the updated DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James 11.5 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

3. Write a Pandas program to select the ‘name’ and ‘score’ columns from the following
DataFrame.
[28]: # Create a DataFrame
df = pd.DataFrame(data)

# Select the 'name' and 'score' columns


selected_columns = df[['name', 'score']]

# Display the selected columns


print(selected_columns)

name score
0 Anastasia 12.5
1 Dima 9.0
2 Katherine 16.5
3 James NaN
4 Emily 9.0
5 Michael 20.0

6 Matthew 14.5
7 Laura NaN
8 Kevin 8.0
9 Jonas 19.0

4. Write a Pandas program to calculate the mean score for each different student in
DataFrame.
[29]: # Create a DataFrame
df = pd.DataFrame(data)

# Calculate the mean score for each student


mean_scores = df.groupby('name')['score'].mean()

# Display the mean scores


print(mean_scores)

name
Anastasia 12.5
Dima 9.0
Emily 9.0
James NaN
Jonas 19.0
Katherine 16.5
Kevin 8.0
Laura NaN
Matthew 14.5
Michael 20.0
Name: score, dtype: float64
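`groupby().mean()` skips NaN within each group, so a group whose only values are NaN (James and Laura above) yields NaN rather than an error. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['James', 'Dima', 'Dima'],
                   'score': [np.nan, 9.0, 11.0]})

# mean() ignores NaN inside each group; an all-NaN group produces NaN
means = df.groupby('name')['score'].mean()
print(means)
```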

5. Write a Pandas program to append a new row 'k' to the data frame with a value for
each column, then delete the row 'c' and return the DataFrame.
[30]: ### code here
# Create a DataFrame with the row labels
df = pd.DataFrame(data, index=labels)

# Append a new row 'k' with a value for each column
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)


new_row = pd.DataFrame({'name': ['Suresh'], 'score': [15.5],
                        'attempts': [1], 'qualify': ['yes']}, index=['k'])
df = pd.concat([df, new_row])

# Delete the row labelled 'c'


df = df.drop(index='c')

# Display the modified DataFrame


print(df)

        name  score  attempts qualify
a  Anastasia   12.5         1     yes
b       Dima    9.0         3      no
d      James    NaN         3      no
e      Emily    9.0         2      no
f    Michael   20.0         3     yes
g    Matthew   14.5         1     yes
h      Laura    NaN         1      no
i      Kevin    8.0         2      no
j      Jonas   19.0         1     yes
k     Suresh   15.5         1     yes

import the titanic dataset


[43]: import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')
df

[43]: survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. … … … … … … … … …
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female NaN 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


0 man True NaN Southampton no False NaN
1 woman False C Cherbourg yes False NaN
2 woman False NaN Southampton yes True NaN
3 woman False C Southampton yes False NaN
4 man True NaN Southampton no True NaN

.. … … … … … … …
886 man True NaN Southampton no True NaN
887 woman False B Southampton yes True NaN
888 woman False NaN Southampton no False NaN
889 man True C Cherbourg yes True NaN
890 man True NaN Queenstown no True NaN

[891 rows x 16 columns]

1. Write a Pandas program to create a Pivot table and find the total fare amount
class wise, gender wise.
[31]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find total fare amount class-wise and gender-wise
pivot_table = pd.pivot_table(df, values='fare', index='pclass', columns='sex',
                             aggfunc='sum', fill_value=0)

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 9975.8250 8201.5875
2 1669.7292 2132.1125
3 2321.1086 4393.5865

2. Write a Pandas program to create a Pivot table and find survival rate by gender.
[32]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find survival rate by gender


pivot_table = pd.pivot_table(df,
                             values='survived',
                             index='sex',
                             aggfunc='mean')  # Mean of 0/1 gives the survival rate

# Display the pivot table

print(pivot_table)

survived
sex
female 0.742038
male 0.188908
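Adding `margins=True` appends an 'All' row holding the overall survival rate alongside the per-gender rates. A minimal sketch on an inline frame (illustrative values, not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                   'survived': [1, 1, 0, 1]})

# margins=True adds an 'All' row with the overall mean
pivot = pd.pivot_table(df, values='survived', index='sex',
                       aggfunc='mean', margins=True)
print(pivot)
```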
3. Write a Pandas program to create a pivot table and find the total count of people who
survived, pclass-wise and gender-wise.
[33]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find the total count of people who survived,
# grouped by pclass and sex

pivot_table = pd.pivot_table(df,
values='survived',
index='pclass',
columns='sex',
aggfunc='sum',
fill_value=0) # Fill missing values with 0

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 91 45
2 70 17
3 72 47
4. Write a Pandas program to create a Pivot table and calculate how many women and men were
in a particular cabin class.
[34]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to count how many women and men were in each cabin class
pivot_table = pd.pivot_table(df,
                             values='class',
                             index='pclass',
                             columns='sex',
                             aggfunc='count',
                             fill_value=0)  # Fill missing values with 0

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 94 122
2 76 108
3 144 347

5. Write a Pandas program to count the number of missing values in each column.
[35]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Count the number of missing (NaN) values in each column


missing_values = df.isnull().sum()

# Display the result


print(missing_values)

survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
Unnamed: 15 891
dtype: int64
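The same mask can report missing values as percentages: `isnull().mean()` is the fraction of True values per column. A minimal sketch on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [22.0, np.nan, 26.0, np.nan],
                   'fare': [7.25, 71.28, 7.92, 53.1]})

# Counts and percentages of missing values per column
counts = df.isnull().sum()
percent = df.isnull().mean() * 100
print(counts)
print(percent)
```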

6. Write a Pandas program to replace null values with the value from the previous
row or the next row in a given DataFrame.

[36]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Replace NaN values with the value from the previous row (forward fill)
df_ffill = df.ffill()

# Display the result


print("\nDataFrame after forward fill (previous row):\n", df_ffill)

DataFrame after forward fill (previous row):


survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. … … … … … … … … …
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female 19.0 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


0 man True NaN Southampton no False NaN
1 woman False C Cherbourg yes False NaN
2 woman False C Southampton yes True NaN
3 woman False C Southampton yes False NaN
4 man True C Southampton no True NaN
.. … … … … … … …
886 man True C Southampton no True NaN
887 woman False B Southampton yes True NaN
888 woman False B Southampton no False NaN
889 man True C Cherbourg yes True NaN
890 man True C Queenstown no True NaN

[891 rows x 16 columns]
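Forward fill cannot repair a NaN in the very first row (there is no previous value); chaining `bfill()` after `ffill()` covers that case with the next row's value. A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# ffill leaves the leading NaN untouched; bfill then fills it from below
filled = s.ffill().bfill()
print(filled.tolist())
```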


7. Write a Pandas program to replace nulls with the median or mean of the specified columns in a
given DataFrame.
[37]: ### code here
import pandas as pd

# Load the titanic dataset from csv file
df = pd.read_csv('titanic.csv')

# Replace NaN values with the mean of the respective column


df_mean_filled = df.copy()
df_mean_filled['age'] = df_mean_filled['age'].fillna(df['age'].mean())
df_mean_filled['fare'] = df_mean_filled['fare'].fillna(df['fare'].mean())

# Display the result


print("\nDataFrame after filling NaNs with mean:\n", df_mean_filled)

DataFrame after filling NaNs with mean:


survived pclass sex age sibsp parch fare embarked \
0 0 3 male 22.000000 1 0 7.2500 S
1 1 1 female 38.000000 1 0 71.2833 C
2 1 3 female 26.000000 0 0 7.9250 S
3 1 1 female 35.000000 1 0 53.1000 S
4 0 3 male 35.000000 0 0 8.0500 S
.. … … … … … … … …
886 0 2 male 27.000000 0 0 13.0000 S
887 1 1 female 19.000000 0 0 30.0000 S
888 0 3 female 29.699118 1 2 23.4500 S
889 1 1 male 26.000000 0 0 30.0000 C
890 0 3 male 32.000000 0 0 7.7500 Q

class who adult_male deck embark_town alive alone Unnamed: 15


0 Third man True NaN Southampton no False NaN
1 First woman False C Cherbourg yes False NaN
2 Third woman False NaN Southampton yes True NaN
3 First woman False C Southampton yes False NaN
4 Third man True NaN Southampton no True NaN
.. … … … … … … … …
886 Second man True NaN Southampton no True NaN
887 First woman False B Southampton yes True NaN
888 Third woman False NaN Southampton no False NaN
889 First man True C Cherbourg yes True NaN
890 Third man True NaN Queenstown no True NaN

[891 rows x 16 columns]
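The median variant works the same way and is less sensitive to outliers, which matters for skewed columns like fare. A minimal sketch contrasting the two fill values on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [20.0, np.nan, 30.0, 100.0]})

# The outlier 100 pulls the mean up; the median stays near the bulk of the data
df['age_median'] = df['age'].fillna(df['age'].median())
df['age_mean'] = df['age'].fillna(df['age'].mean())
print(df)
```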


8. Write a Pandas program to select first 2 rows, 2 columns and specific two columns from the
dataframe.
[38]: #### code here
import pandas as pd

# Load the titanic dataset from csv file

df = pd.read_csv('titanic.csv')

# Select first 2 rows and 2 columns


first_2_rows_2_columns = df.iloc[:2, :2]

# Select specific columns 'Sex' and 'Age'


specific_columns = df[['sex', 'age']]

# Display the result


print("\nSpecific two columns 'Sex' and 'Age':\n", specific_columns)

# Display the result


print("\nFirst 2 rows and 2 columns:\n", first_2_rows_2_columns)

Specific two columns 'Sex' and 'Age':


sex age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
.. … …
886 male 27.0
887 female 19.0
888 female NaN
889 male 26.0
890 male 32.0

[891 rows x 2 columns]

First 2 rows and 2 columns:


survived pclass
0 0 3
1 1 1
9. Write a Pandas program to remove the duplicates from ‘embark_town’ column of titanic dataset.
[39]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Remove duplicates from 'embark_town' column


embark_town_unique = df['embark_town'].drop_duplicates()

# Display the result
print("\n'Embark_town' column after removing duplicates:\n", embark_town_unique)

'Embark_town' column after removing duplicates:


0 Southampton
1 Cherbourg
5 Queenstown
61 NaN
Name: embark_town, dtype: object
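`drop_duplicates()` keeps the first occurrence of each value, including a single NaN, and preserves the original index (hence the 0, 1, 5, 61 labels above); `.unique()` returns the same values as a plain array. A minimal sketch:

```python
import pandas as pd
import numpy as np

towns = pd.Series(['Southampton', 'Cherbourg', 'Southampton', np.nan, 'Queenstown'])

# First occurrence of each value survives, NaN included
deduped = towns.drop_duplicates()
print(deduped.tolist())
print(towns.unique())
```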
10. Write a Pandas program to find the details of passengers who survived (survived is 1),
embarked from 'Southampton', and are children.
[40]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter passengers who survived, embarked from 'Southampton', and are
# children (age < 18)

survived_children_southampton = df[(df['survived'] == 1) &
                                   (df['embark_town'] == 'Southampton') &
                                   (df['age'] < 18)]

# Display the filtered results


print("\nDetails of survived children from Southampton:\n",
      survived_children_southampton)

Details of survived children from Southampton:


survived pclass sex age sibsp parch fare embarked class \
10 1 3 female 4.00 1 1 16.7000 S Third
58 1 2 female 5.00 1 2 27.7500 S Second
68 1 3 female 17.00 4 2 7.9250 S Third
78 1 2 male 0.83 0 2 29.0000 S Second
84 1 2 female 17.00 0 0 10.5000 S Second
165 1 3 male 9.00 0 2 20.5250 S Third
172 1 3 female 1.00 1 1 11.1333 S Third
183 1 2 male 1.00 2 1 39.0000 S Second
184 1 3 female 4.00 0 2 22.0250 S Third
193 1 2 male 3.00 1 1 26.0000 S Second
220 1 3 male 16.00 0 0 8.0500 S Third
233 1 3 female 5.00 4 2 31.3875 S Third
237 1 2 female 8.00 0 2 26.2500 S Second

261 1 3 male 3.00 4 2 31.3875 S Third
305 1 1 male 0.92 1 2 151.5500 S First
340 1 2 male 2.00 1 1 26.0000 S Second
348 1 3 male 3.00 1 1 15.9000 S Third
407 1 2 male 3.00 1 1 18.7500 S Second
435 1 1 female 14.00 1 2 120.0000 S First
445 1 1 male 4.00 0 2 81.8583 S First
446 1 2 female 13.00 0 1 19.5000 S Second
479 1 3 female 2.00 0 1 12.2875 S Third
489 1 3 male 9.00 1 1 15.9000 S Third
504 1 1 female 16.00 0 0 86.5000 S First
530 1 2 female 2.00 1 1 26.0000 S Second
535 1 2 female 7.00 0 2 26.2500 S Second
549 1 2 male 8.00 1 1 36.7500 S Second
618 1 2 female 4.00 2 1 39.0000 S Second
689 1 1 female 15.00 0 1 211.3375 S First
720 1 2 female 6.00 0 1 33.0000 S Second
750 1 2 female 4.00 1 1 23.0000 S Second
751 1 3 male 6.00 0 1 12.4750 S Third
755 1 2 male 0.67 1 1 14.5000 S Second
777 1 3 female 5.00 0 0 12.4750 S Third
781 1 1 female 17.00 1 0 57.0000 S First
788 1 3 male 1.00 1 2 20.5750 S Third
802 1 1 male 11.00 1 2 120.0000 S First
831 1 2 male 0.83 1 1 18.7500 S Second
853 1 1 female 16.00 0 1 39.4000 S First
869 1 3 male 4.00 1 1 11.1333 S Third

who adult_male deck embark_town alive alone Unnamed: 15


10 child False G Southampton yes False NaN
58 child False NaN Southampton yes False NaN
68 woman False NaN Southampton yes False NaN
78 child False NaN Southampton yes False NaN
84 woman False NaN Southampton yes True NaN
165 child False NaN Southampton yes False NaN
172 child False NaN Southampton yes False NaN
183 child False F Southampton yes False NaN
184 child False NaN Southampton yes False NaN
193 child False F Southampton yes False NaN
220 man True NaN Southampton yes True NaN
233 child False NaN Southampton yes False NaN
237 child False NaN Southampton yes False NaN
261 child False NaN Southampton yes False NaN
305 child False C Southampton yes False NaN
340 child False F Southampton yes False NaN
348 child False NaN Southampton yes False NaN
407 child False NaN Southampton yes False NaN
435 child False B Southampton yes False NaN

445 child False A Southampton yes False NaN
446 child False NaN Southampton yes False NaN
479 child False NaN Southampton yes False NaN
489 child False NaN Southampton yes False NaN
504 woman False B Southampton yes True NaN
530 child False NaN Southampton yes False NaN
535 child False NaN Southampton yes False NaN
549 child False NaN Southampton yes False NaN
618 child False F Southampton yes False NaN
689 child False B Southampton yes False NaN
720 child False NaN Southampton yes False NaN
750 child False NaN Southampton yes False NaN
751 child False E Southampton yes False NaN
755 child False NaN Southampton yes False NaN
777 child False NaN Southampton yes True NaN
781 woman False B Southampton yes False NaN
788 child False NaN Southampton yes False NaN
802 child False B Southampton yes False NaN
831 child False NaN Southampton yes False NaN
853 woman False D Southampton yes False NaN
869 child False NaN Southampton yes False NaN
11. Write a Pandas program to filter records starting from the 2nd row and access every 5th row
of the dataframe.
[41]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter the DataFrame starting from the 2nd row (index 1) and access every
# 5th row

filtered_df = df.iloc[1::5]

# Display the filtered DataFrame


print("\nFiltered DataFrame (starting from 2nd row, every 5th row):\n",
      filtered_df)

Filtered DataFrame (starting from 2nd row, every 5th row):


survived pclass sex age sibsp parch fare embarked class \
1 1 1 female 38.0 1 0 71.2833 C First
6 0 1 male 54.0 0 0 51.8625 S First
11 1 1 female 58.0 0 0 26.5500 S First
16 0 3 male 2.0 4 1 29.1250 Q Third
21 1 2 male 34.0 0 0 13.0000 S Second

.. … … … … … … … … …
866 1 2 female 27.0 1 0 13.8583 C Second
871 1 1 female 47.0 1 1 52.5542 S First
876 0 3 male 20.0 0 0 9.8458 S Third
881 0 3 male 33.0 0 0 7.8958 S Third
886 0 2 male 27.0 0 0 13.0000 S Second

who adult_male deck embark_town alive alone Unnamed: 15


1 woman False C Cherbourg yes False NaN
6 man True E Southampton no True NaN
11 woman False C Southampton yes True NaN
16 child False NaN Queenstown no False NaN
21 man True D Southampton yes True NaN
.. … … … … … … …
866 woman False NaN Cherbourg yes False NaN
871 woman False D Southampton yes False NaN
876 man True NaN Southampton no True NaN
881 man True NaN Southampton no True NaN
886 man True NaN Southampton no True NaN

[178 rows x 16 columns]
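The slice `iloc[1::5]` uses ordinary Python slice semantics on positions: start at positional index 1 and step by 5. A minimal sketch on an inline frame:

```python
import pandas as pd

df = pd.DataFrame({'v': list(range(12))})

# Positions 1, 6, 11 are selected: start at 1, then every 5th row
every_fifth = df.iloc[1::5]
print(every_fifth['v'].tolist())
```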


12. Write a Pandas program to filter those records where embarked matches with multiple values
(C, Q) from the dataset.

[42]: #### code here


import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter records where 'embarked' matches either 'C' or 'Q'


filtered_records = df[df['embarked'].isin(['C', 'Q'])]

# Display the filtered records


print("\nFiltered records where embarked is 'C' or 'Q':\n", filtered_records)

Filtered records where embarked is 'C' or 'Q':


survived pclass sex age sibsp parch fare embarked class \
1 1 1 female 38.0 1 0 71.2833 C First
5 0 3 male NaN 0 0 8.4583 Q Third
9 1 2 female 14.0 1 0 30.0708 C Second
16 0 3 male 2.0 4 1 29.1250 Q Third
19 1 3 female NaN 0 0 7.2250 C Third
.. … … … … … … … … …
875 1 3 female 15.0 0 0 7.2250 C Third
879 1 1 female 56.0 0 1 83.1583 C First

885 0 3 female 39.0 0 5 29.1250 Q Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


1 woman False C Cherbourg yes False NaN
5 man True NaN Queenstown no True NaN
9 child False NaN Cherbourg yes False NaN
16 child False NaN Queenstown no False NaN
19 woman False NaN Cherbourg yes True NaN
.. … … … … … … …
875 child False NaN Cherbourg yes True NaN
879 woman False C Cherbourg yes False NaN
885 woman False NaN Queenstown no False NaN
889 man True C Cherbourg yes True NaN
890 man True NaN Queenstown no True NaN

[245 rows x 16 columns]
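Negating the same mask with `~` selects the complement, i.e. passengers who embarked anywhere other than 'C' or 'Q'. A minimal sketch on an inline frame:

```python
import pandas as pd

df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S', 'C']})

in_cq = df[df['embarked'].isin(['C', 'Q'])]    # matches either value
not_cq = df[~df['embarked'].isin(['C', 'Q'])]  # ~ negates the boolean mask
print(len(in_cq), len(not_cq))
```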

pandas-assignment-day-3

November 8, 2024

[13]: import pandas as pd
import numpy as np

data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no',
                    'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Write a Pandas program to create a dataframe from the given dictionary, with the given index labels
[14]: ### code here
# Create a DataFrame
df = pd.DataFrame(data, index=labels)

# Display the created DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

1. Write a Pandas program to sort the DataFrame first by ‘name’ in descending order,
then by ‘score’ in ascending order
[63]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Sort the DataFrame first by 'name' in descending order, then by 'score' in
# ascending order

sorted_df = df.sort_values(by=['name', 'score'], ascending=[False, True])

# Display the sorted DataFrame


print("\nSorted DataFrame:\n", sorted_df)

Sorted DataFrame:
name score attempts qualify
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
2 Katherine 16.5 2 yes
9 Jonas 19.0 1 yes
3 James NaN 3 no
4 Emily 9.0 2 no
1 Dima 9.0 3 no
0 Anastasia 12.5 1 yes
2. Write a Pandas program to replace the values 'yes' and 'no' in the 'qualify' column with
True and False
[64]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Map 'yes' to True and 'no' to False in the 'qualify' column
# (map avoids the FutureWarning that replace raises about silent downcasting)
df['qualify'] = df['qualify'].map({'yes': True, 'no': False})

# Display the updated DataFrame


print("\nUpdated DataFrame:\n", df)

Updated DataFrame:
         name  score  attempts  qualify
0  Anastasia   12.5         1     True
1       Dima    9.0         3    False
2  Katherine   16.5         2     True
3      James    NaN         3    False
4      Emily    9.0         2    False
5    Michael   20.0         3     True
6    Matthew   14.5         1     True
7      Laura    NaN         1    False
8      Kevin    8.0         2    False
9      Jonas   19.0         1     True

3. Write a Pandas program to delete the ‘attempts’ column from the DataFrame
[65]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Delete the 'attempts' column


df = df.drop(columns=['attempts'])

# Display the updated DataFrame


print("\nUpdated DataFrame (after deleting 'attempts' column):\n", df)

Updated DataFrame (after deleting 'attempts' column):


name score qualify
0 Anastasia 12.5 yes
1 Dima 9.0 no
2 Katherine 16.5 yes
3 James NaN no
4 Emily 9.0 no
5 Michael 20.0 yes
6 Matthew 14.5 yes
7 Laura NaN no
8 Kevin 8.0 no
9 Jonas 19.0 yes

4. Write a Pandas program to insert a new column in existing DataFrame.


[66]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Insert a new column 'grade' based on the scores
# (thresholds chosen to fit the 0-20 score scale; a NaN score fails both
# comparisons and falls through to 'C')
df['grade'] = ['A' if score >= 15 else 'B' if score >= 10 else 'C'
               for score in df['score']]

# Display the updated DataFrame

print("\nUpdated DataFrame (after inserting 'grade' column):\n", df)

Updated DataFrame (after inserting 'grade' column):


        name  score  attempts qualify grade
0  Anastasia   12.5         1     yes     B
1       Dima    9.0         3      no     C
2  Katherine   16.5         2     yes     A
3      James    NaN         3      no     C
4      Emily    9.0         2      no     C
5    Michael   20.0         3     yes     A
6    Matthew   14.5         1     yes     B
7      Laura    NaN         1      no     C
8      Kevin    8.0         2      no     C
9      Jonas   19.0         1     yes     A

5. Write a Pandas program to change the name ‘James’ to ‘Maxwell’ in the name column of the DataFrame.
[67]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Change 'James' to 'Maxwell' in the 'name' column


df['name'] = df['name'].replace('James', 'Maxwell')

# Alternatively, you can use boolean indexing:


# df.loc[df['name'] == 'James', 'name'] = 'Maxwell'

# Display the updated DataFrame


print("\nUpdated DataFrame (after changing 'James' to 'Maxwell'):\n", df)

Updated DataFrame (after changing 'James' to 'Maxwell'):


name score attempts qualify
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 Maxwell NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

6. Write a Pandas program to rename columns of a given DataFrame

[68]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Method 1: Rename specific columns using the rename() method


df_renamed = df.rename(columns={'name': 'Full Name', 'score': 'Total Score'})

# Display the updated DataFrame after renaming


print("\nUpdated DataFrame (after renaming specific columns):\n", df_renamed)

# Method 2: Rename all columns by assigning a new list of column names


df.columns = ['Student Name', 'Total Marks', 'Total Attempts', 'Qualified']

# Display the updated DataFrame after renaming all columns


print("\nUpdated DataFrame (after renaming all columns):\n", df)

Updated DataFrame (after renaming specific columns):


Full Name Total Score attempts qualify
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

Updated DataFrame (after renaming all columns):


Student Name Total Marks Total Attempts Qualified
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

7. Write a Pandas program to convert the datatype of a given column (floats to ints).

[69]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Fill NaN values with 0 or any other appropriate value


df['score'] = df['score'].fillna(0)

# Convert the 'score' column from float to int


df['score'] = df['score'].astype(int)

# Display the updated DataFrame and data types


print("\nUpdated DataFrame (after converting 'score' to int):\n", df)
print("\nUpdated Data Types:\n", df.dtypes)

Updated DataFrame (after converting 'score' to int):


name score attempts qualify
0 Anastasia 12 1 yes
1 Dima 9 3 no
2 Katherine 16 2 yes
3 James 0 3 no
4 Emily 9 2 no
5 Michael 20 3 yes
6 Matthew 14 1 yes
7 Laura 0 1 no
8 Kevin 8 2 no
9 Jonas 19 1 yes

Updated Data Types:


name object
score int32
attempts int64
qualify object
dtype: object
8. Write a Pandas program to convert the index into a column of the given DataFrame.
[70]: ### code here

#Create a dataframe
df = pd.DataFrame(data, index=labels)

# Convert the index into a column


df_reset = df.reset_index()

# Display the updated DataFrame with index as a column


print("\nUpdated DataFrame (after converting index to column):\n", df_reset)

Updated DataFrame (after converting index to column):
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d James NaN 3 no
4 e Emily 9.0 2 no
5 f Michael 20.0 3 yes
6 g Matthew 14.5 1 yes
7 h Laura NaN 1 no
8 i Kevin 8.0 2 no
9 j Jonas 19.0 1 yes

9. Write a Pandas program to shuffle a DataFrame rows.


[71]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Shuffle the rows of the DataFrame


df_shuffled = df.sample(frac=1).reset_index(drop=True)

# Display the shuffled DataFrame


print("\nShuffled DataFrame:\n", df_shuffled)

Shuffled DataFrame:
name score attempts qualify
0 Jonas 19.0 1 yes
1 Laura NaN 1 no
2 James NaN 3 no
3 Michael 20.0 3 yes
4 Katherine 16.5 2 yes
5 Matthew 14.5 1 yes
6 Kevin 8.0 2 no
7 Emily 9.0 2 no
8 Dima 9.0 3 no
9 Anastasia 12.5 1 yes
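`sample(frac=1)` produces a different order on every run; passing `random_state` makes the shuffle reproducible — a minimal sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine'],
                   'score': [12.5, 9.0, 16.5]})

# Same seed -> same permutation on every call
s1 = df.sample(frac=1, random_state=42).reset_index(drop=True)
s2 = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(s1.equals(s2))
```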

10. Write a Pandas program to write a DataFrame to a CSV file using a tab separator.
[72]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Specify the filename for the output CSV file

filename = 'output_data.tsv' # Using .tsv as an extension for tab-separated␣
↪values

# Write the DataFrame to a CSV file using tab as a separator


df.to_csv(filename, sep='\t', index=False)

print(f"DataFrame written to {filename} using tab as a separator.")

DataFrame written to output_data.tsv using tab as a separator.
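To verify the separator round-trips, the file can be read back with the same `sep='\t'` — a minimal self-contained sketch (the filename and values are illustrative):

```python
import pandas as pd

# Round-trip sketch: write a small frame with a tab separator, then read it back
df_out = pd.DataFrame({'name': ['Anastasia', 'Dima'], 'score': [12.5, 9.0]})
df_out.to_csv('output_data.tsv', sep='\t', index=False)

# Reading requires the same separator, otherwise each row parses as one column
df_back = pd.read_csv('output_data.tsv', sep='\t')
print(df_back)
```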

[ ]:

[ ]:

[36]: import pandas as pd

student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed␣
↪Bernal', 'Kwame Morin'],

'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser␣
↪William', 'Madeeha Preston'],

'marks': [201, 200, 198, 219, 201]})

11. Write a Pandas program to join the two given dataframes along rows and assign
all data.
[37]: ### code here

# Joining the two DataFrames along rows


combined_student_data = pd.concat([student_data1, student_data2],␣
↪ignore_index=True)

# Display the combined DataFrame


print("Combined DataFrame:\n", combined_student_data)

Combined DataFrame:
student_id name marks
0 S1 Danniella Fenton 200
1 S2 Ryder Storey 210
2 S3 Bryce Jensen 190
3 S4 Ed Bernal 222
4 S5 Kwame Morin 199
5 S4 Scarlette Fisher 201

6 S5 Carla Williamson 200
7 S6 Dante Morse 198
8 S7 Kaiser William 219
9 S8 Madeeha Preston 201

12. Write a Pandas program to join the two given dataframes along columns and
assign all data.
[38]: ### code here

# Joining the two DataFrames along columns


combined_student_data = pd.concat([student_data1, student_data2], axis=1)

# Display the combined DataFrame


print("Combined DataFrame:\n", combined_student_data)

Combined DataFrame:
student_id name marks student_id name marks
0 S1 Danniella Fenton 200 S4 Scarlette Fisher 201
1 S2 Ryder Storey 210 S5 Carla Williamson 200
2 S3 Bryce Jensen 190 S6 Dante Morse 198
3 S4 Ed Bernal 222 S7 Kaiser William 219
4 S5 Kwame Morin 199 S8 Madeeha Preston 201

13. Write a Pandas program to join the two dataframes using the common column of
both dataframes.
[39]: ### code here

# Using the common column 'student_id' to merge both DataFrames


merged_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪suffixes=('_data1', '_data2'))

# Display the result of the merged DataFrame


print("Merged DataFrame:\n", merged_data)

Merged DataFrame:
student_id name_data1 marks_data1 name_data2 marks_data2
0 S4 Ed Bernal 222 Scarlette Fisher 201
1 S5 Kwame Morin 199 Carla Williamson 200
14. Write a Pandas program to join (left join) the two dataframes using keys from left dataframe
only.
[40]: #### code here

# Performing a left join using the common column 'student_id'


left_joined_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪how='left', suffixes=('_data1', '_data2'))

# Display the result of the left joined DataFrame
print("Left Joined DataFrame:\n", left_joined_data)

Left Joined DataFrame:


student_id name_data1 marks_data1 name_data2 marks_data2
0 S1 Danniella Fenton 200 NaN NaN
1 S2 Ryder Storey 210 NaN NaN
2 S3 Bryce Jensen 190 NaN NaN
3 S4 Ed Bernal 222 Scarlette Fisher 201.0
4 S5 Kwame Morin 199 Carla Williamson 200.0
15. Write a Pandas program to join two dataframes using keys from right dataframe only.
[41]: ### code here

# Performing a right join using the common column 'student_id'


right_joined_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪how='right', suffixes=('_data1', '_data2'))

# Display the result of the right joined DataFrame


print("Right Joined DataFrame:\n", right_joined_data)

Right Joined DataFrame:


student_id name_data1 marks_data1 name_data2 marks_data2
0 S4 Ed Bernal 222.0 Scarlette Fisher 201
1 S5 Kwame Morin 199.0 Carla Williamson 200
2 S6 NaN NaN Dante Morse 198
3 S7 NaN NaN Kaiser William 219
4 S8 NaN NaN Madeeha Preston 201
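Beyond left and right joins, `how='outer'` keeps the union of keys from both frames, filling the non-matching side with NaN — a minimal sketch on two small frames modeled on the student data above:

```python
import pandas as pd

student_data1 = pd.DataFrame({'student_id': ['S1', 'S4'], 'marks': [200, 222]})
student_data2 = pd.DataFrame({'student_id': ['S4', 'S6'], 'marks': [201, 198]})

# Outer join: every key from either side appears once in the result
outer = pd.merge(student_data1, student_data2, on='student_id',
                 how='outer', suffixes=('_data1', '_data2'))
print(outer)
```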

[ ]:

[ ]:

import the UFO dataset


[76]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')
df

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\428872793.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')

[76]: Date_time city state/province country \
0 10/10/1949 20:30 san marcos tx us
1 10/10/1949 21:00 lackland afb tx NaN
2 10/10/1955 17:00 chester (uk/england) NaN gb
3 10/10/1956 21:00 edna tx us
4 10/10/1960 20:00 kaneohe hi us
… … … … …
80327 9/9/2013 21:15 nashville tn us
80328 9/9/2013 22:00 boise id us
80329 9/9/2013 22:00 napa ca us
80330 9/9/2013 22:20 vienna va us
80331 9/9/2013 23:00 edmond ok us

UFO_shape length_of_encounter_seconds described_duration_of_encounter \


0 cylinder 2700 45 minutes
1 light 7200 1-2 hrs
2 circle 20 20 seconds
3 circle 20 1/2 hour
4 light 900 15 minutes
… … … …
80327 light 600.0 10 minutes
80328 circle 1200.0 20 minutes
80329 other 1200.0 hour
80330 circle 5.0 5 seconds
80331 cigar 1020.0 17 minutes

description date_documented \
0 This event took place in early fall around 194… 4/27/2004
1 1949 Lackland AFB&#44 TX. Lights racing acros… 12/16/2005
2 Green/Orange circular disc over Chester&#44 En… 1/21/2008
3 My older brother and twin sister were leaving … 1/17/2004
4 AS a Marine 1st Lt. flying an FJ4B fighter/att… 1/22/2004
… … …
80327 Round from the distance/slowly changing colors… 9/30/2013
80328 Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r… 9/30/2013
80329 Napa UFO&#44 9/30/2013
80330 Saw a five gold lit cicular craft moving fastl… 9/30/2013
80331 2 witnesses 2 miles apart&#44 Red &amp; White… 9/30/2013

latitude longitude
0 29.8830556 -97.941111
1 29.38421 -98.581082
2 53.2 -2.916667
3 28.9783333 -96.645833
4 21.4180556 -157.803611
… … …
80327 36.165833 -86.784444

80328 43.613611 -116.202500
80329 38.297222 -122.284444
80330 38.901111 -77.265556
80331 35.652778 -97.477778

[80332 rows x 11 columns]

16. Write a Pandas program to create today’s date.


[42]: ### code here

import pandas as pd

# Create today's date using pd.Timestamp


today_date = pd.Timestamp.now()

# Alternatively, create today's date using pd.to_datetime


today_date_alt = pd.to_datetime("today")

# Display today's date


print("Today's date using pd.Timestamp:", today_date)
print("Today's date using pd.to_datetime:", today_date_alt)

Today's date using pd.Timestamp: 2024-10-11 15:08:13.536164


Today's date using pd.to_datetime: 2024-10-11 15:08:13.560834
17. Write a Pandas program to get the current date, the oldest date, and the number of days between the current date and the oldest date of the UFO dataset.
[73]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce') # Use␣
↪errors='coerce' to handle invalid parsing

# Get the current date as a Timestamp


current_date = pd.Timestamp.now()

# Get the oldest date from the dataset


oldest_date = df['Date_time'].min() # This is a Timestamp

# Convert oldest_date to date for calculation


oldest_date_as_date = oldest_date.date()

# Calculate the number of days between current date and oldest date

number_of_days = (current_date.date() - oldest_date_as_date).days # Convert␣
↪current_date to date

# Display results
print("Current date:", current_date.date())
print("Oldest date:", oldest_date_as_date)
print("Number of days between current date and oldest date:", number_of_days)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\560927568.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Current date: 2024-10-11
Oldest date: 1906-11-11
Number of days between current date and oldest date: 43069

[74]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce') #␣
↪Use errors='coerce' to handle invalid parsing

# Get the current date as a Timestamp


current_date = pd.Timestamp.now()

# Get the oldest date from the dataset


oldest_date = df['Date_time'].min() # This is a Timestamp

# Calculate the number of days between current date and oldest date
number_of_days = (current_date - oldest_date).days # Both are Timestamps

# Display results
print("Current date:", current_date)
print("Oldest date:", oldest_date)
print("Number of days between current date and oldest date:", number_of_days)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\2957761521.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Current date: 2024-10-11 16:31:57.747711
Oldest date: 1906-11-11 00:00:00
Number of days between current date and oldest date: 43069

18. Write a Pandas program to get all the rows of the DataFrame between 1950-10-10 and 1960-10-10.
[75]: ### code here

import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce')

# Define the start and end dates


start_date = '1950-10-10'
end_date = '1960-10-10'

# Filter the DataFrame for dates between the specified range


filtered_df = df[(df['Date_time'] >= start_date) & (df['Date_time'] <=␣
↪end_date)]

# Display the filtered DataFrame


print("Filtered DataFrame between 1950-10-10 and 1960-10-10:")
print(filtered_df)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\122481063.py:6: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Filtered DataFrame between 1950-10-10 and 1960-10-10:
Date_time city state/province country \
2 1955-10-10 17:00:00 chester (uk/england) NaN gb
3 1956-10-10 21:00:00 edna tx us
480 1952-10-01 03:30:00 fukuoka (japan) NaN NaN
481 1952-10-01 12:00:00 kansas city mo us
482 1954-10-01 19:00:00 flatwoods wv us
… … … … …
79262 1960-09-05 21:00:00 buffalo ny us
79668 1958-09-07 19:00:00 arthur nd us
80101 1952-09-09 20:00:00 philadelphia pa us
80102 1954-09-09 12:30:00 beaumont tx us
80103 1956-09-09 05:55:00 norfolk va us

UFO_shape length_of_encounter_seconds described_duration_of_encounter \


2 circle 20 20 seconds
3 circle 20 1/2 hour
480 disk 1200 about 20 mins
481 disk 90 90 sec.

482 circle 60 1 minute
… … … …
79262 oval 180.0 3 minutes
79668 unknown 900.0 5-15 min.
80101 circle 180.0 3 minutes
80102 disk 300.0 5 minutes
80103 cigar 90.0 1.5 minutes

description date_documented \
2 Green/Orange circular disc over Chester&#44 En… 1/21/2008
3 My older brother and twin sister were leaving … 1/17/2004
480 UFO seen by multiple U. S. military personnel;… 12/7/2006
481 1952 daylight sighting of multiple discs in fo… 10/31/2008
482 I saw the craft go across the horizon. It app… 4/12/2013
… … …
79262 Precise movements of a &quot;craft&quot; appar… 9/15/2005
79668 Two lights of alternating color traveling and … 10/31/2003
80101 saucers in a line over Phila Pa. 12/12/2009
80102 Aprox. 30 Disk shaped UFOs fell out of clouds … 1/17/2004
80103 Unidentified Object Hovering over Interstate 2… 2/24/2007

latitude longitude
2 53.2 -2.916667
3 28.9783333 -96.645833
480 33.590355 130.401716
481 39.0997222 -94.578333
482 38.7230556 -80.650000
… … …
79262 42.886389 -78.878611
79668 47.104167 -97.217778
80101 39.952222 -75.164167
80102 30.085833 -94.101667
80103 36.846667 -76.285556

[444 rows x 11 columns]
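An equivalent filter uses `Series.between`, which is inclusive of both endpoints by default — a minimal sketch on a tiny synthetic `Date_time` column (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date_time': pd.to_datetime(
    ['1949-10-10', '1955-10-10', '1956-10-10', '1972-06-01'])})

# between() is inclusive on both ends by default
mask = df['Date_time'].between('1950-10-10', '1960-10-10')
print(df[mask])
```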

[ ]:

pandas-assignment-day-4

November 8, 2024

import the Diamond dataset


[24]: import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')
df

[24]: Unnamed: 0 carat cut color clarity depth table price x \


0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89
2 3 0.23 Good E VS1 56.9 65.0 327 4.05
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
… … … … … … … … … …
53935 53936 0.72 Ideal D SI1 60.8 57.0 2757 5.75
53936 53937 0.72 Good D SI1 63.1 55.0 2757 5.69
53937 53938 0.70 Very Good D SI1 62.8 60.0 2757 5.66
53938 53939 0.86 Premium H SI2 61.0 58.0 2757 6.15
53939 53940 0.75 Ideal D SI2 62.2 55.0 2757 5.83

y z
0 3.98 2.43
1 3.84 2.31
2 4.07 2.31
3 4.23 2.63
4 4.35 2.75
… … …
53935 5.76 3.50
53936 5.75 3.61
53937 5.68 3.56
53938 6.12 3.74
53939 5.87 3.64

[53940 rows x 11 columns]

1. Write a Pandas program to find the number of rows and columns and data type of each column
of diamonds Dataframe.

[4]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Get the number of rows and columns


num_rows, num_columns = df.shape

# Get the data type of each column


column_data_types = df.dtypes

# Display the results


print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
print("\nData type of each column:")
print(column_data_types)

Number of rows: 53940


Number of columns: 11

Data type of each column:


Unnamed: 0 int64
carat float64
cut object
color object
clarity object
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object

2. Write a Pandas program to summarize only the ‘object’ columns of the diamonds DataFrame.
[5]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Summarize only the 'object' columns


object_columns_summary = df.select_dtypes(include=['object']).describe()

# Display the summary of object columns
print("Summary of object columns:")
print(object_columns_summary)

Summary of object columns:


cut color clarity
count 53940 53940 53940
unique 5 7 8
top Ideal G SI1
freq 21551 11292 13065
3. Write a Pandas program to remove the second column of the diamonds Dataframe. (don’t use
original dataset)

[6]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Create a copy of the original DataFrame


df_copy = df.copy()

# Remove the second column (index 1)


df_copy.drop(df_copy.columns[1], axis=1, inplace=True)

# Display the first few rows of the modified DataFrame


print("DataFrame after removing the second column:")
print(df_copy.head())

DataFrame after removing the second column:


Unnamed: 0 cut color clarity depth table price x y z
0 1 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
4. Write a Pandas program to remove multiple rows at once (axis=0 refers to rows) from the diamonds DataFrame. (don’t use the original dataset)

[7]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file

df = pd.read_csv('diamonds.csv')

# Create a copy of the original DataFrame


df_copy = df.copy()

# Define the indices of the rows to be removed (e.g., removing rows 0, 1, and 2)
rows_to_remove = [0, 1, 2]

# Remove the specified rows


df_copy.drop(rows_to_remove, axis=0, inplace=True)

# Display the first few rows of the modified DataFrame


print("DataFrame after removing specified rows:")
print(df_copy.head())

DataFrame after removing specified rows:


Unnamed: 0 carat cut color clarity depth table price x \
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
5 6 0.24 Very Good J VVS2 62.8 57.0 336 3.94
6 7 0.24 Very Good I VVS1 62.3 57.0 336 3.95
7 8 0.26 Very Good H SI1 61.9 55.0 337 4.07

y z
3 4.23 2.63
4 4.35 2.75
5 3.96 2.48
6 3.98 2.47
7 4.11 2.53
5. Write a Pandas program to sort the ‘cut’ Series of the diamonds DataFrame in ascending order (returns a Series).
[8]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sort the 'cut' Series in ascending order and return it as a Series


sorted_cut_series = df['cut'].sort_values()

# Display the sorted Series


print("Sorted 'cut' Series in ascending order:")
print(sorted_cut_series)

Sorted 'cut' Series in ascending order:

3850 Fair
51464 Fair
51466 Fair
10237 Fair
10760 Fair

7402 Very Good
43101 Very Good
16893 Very Good
16898 Very Good
21164 Very Good
Name: cut, Length: 53940, dtype: object
6. Write a Pandas program to sort the entire diamonds DataFrame by the ‘carat’ Series in ascending
and descending order.
[9]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sort the DataFrame by the 'carat' Series in ascending order


sorted_df_ascending = df.sort_values(by='carat', ascending=True)

# Sort the DataFrame by the 'carat' Series in descending order


sorted_df_descending = df.sort_values(by='carat', ascending=False)

# Display the first few rows of the sorted DataFrames


print("DataFrame sorted by 'carat' in ascending order:")
print(sorted_df_ascending.head())

print("\nDataFrame sorted by 'carat' in descending order:")


print(sorted_df_descending.head())

DataFrame sorted by 'carat' in ascending order:


Unnamed: 0 carat cut color clarity depth table price x \
31593 31594 0.2 Premium E VS2 61.1 59.0 367 3.81
31597 31598 0.2 Ideal D VS2 61.5 57.0 367 3.81
31596 31597 0.2 Premium F VS2 62.6 59.0 367 3.73
31595 31596 0.2 Ideal E VS2 59.7 55.0 367 3.86
31594 31595 0.2 Premium E VS2 59.7 62.0 367 3.84

y z
31593 3.78 2.32
31597 3.77 2.33
31596 3.71 2.33

31595 3.84 2.30
31594 3.80 2.28

DataFrame sorted by 'carat' in descending order:


Unnamed: 0 carat cut color clarity depth table price x \
27415 27416 5.01 Fair J I1 65.5 59.0 18018 10.74
27630 27631 4.50 Fair J I1 65.8 58.0 18531 10.23
27130 27131 4.13 Fair H I1 64.8 61.0 17329 10.00
25999 26000 4.01 Premium J I1 62.5 62.0 15223 10.02
25998 25999 4.01 Premium I I1 61.0 61.0 15223 10.14

y z
27415 10.54 6.98
27630 10.16 6.72
27130 9.85 6.43
25999 9.94 6.24
25998 10.10 6.17
7. Write a Pandas program to filter the DataFrame rows to only show carat weight at least 0.3.
[10]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Filter the DataFrame for rows where the carat weight is at least 0.3
filtered_df = df[df['carat'] >= 0.3]

# Display the filtered DataFrame


print("Filtered DataFrame with carat weight at least 0.3:")
print(filtered_df)

Filtered DataFrame with carat weight at least 0.3:


Unnamed: 0 carat cut color clarity depth table price x \
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
10 11 0.30 Good J SI1 64.0 55.0 339 4.25
13 14 0.31 Ideal J SI2 62.2 54.0 344 4.35
15 16 0.32 Premium E I1 60.9 58.0 345 4.38
16 17 0.30 Ideal I SI2 62.0 54.0 348 4.31
… … … … … … … … … …
53935 53936 0.72 Ideal D SI1 60.8 57.0 2757 5.75
53936 53937 0.72 Good D SI1 63.1 55.0 2757 5.69
53937 53938 0.70 Very Good D SI1 62.8 60.0 2757 5.66
53938 53939 0.86 Premium H SI2 61.0 58.0 2757 6.15
53939 53940 0.75 Ideal D SI2 62.2 55.0 2757 5.83

y z
4 4.35 2.75
10 4.28 2.73
13 4.37 2.71
15 4.42 2.68
16 4.34 2.68
… … …
53935 5.76 3.50
53936 5.75 3.61
53937 5.68 3.56
53938 6.12 3.74
53939 5.87 3.64

[52341 rows x 11 columns]


8. Write a Pandas program to find the details of the diamonds where length>5, width>5 and
depth>5.
[11]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Filter the DataFrame for diamonds where length, width, and depth are all␣
↪greater than 5

# Note: The column names in the diamonds dataset for length, width, and depth␣
↪are 'x', 'y', and 'z'

filtered_diamonds = df[(df['x'] > 5) & (df['y'] > 5) & (df['z'] > 5)]

# Display the filtered DataFrame


print("Diamonds where length > 5, width > 5, and depth > 5:")
print(filtered_diamonds)

Diamonds where length > 5, width > 5, and depth > 5:


Unnamed: 0 carat cut color clarity depth table price x \
11778 11779 1.83 Fair J I1 70.0 58.0 5083 7.34
13002 13003 2.14 Fair J I1 69.4 57.0 5405 7.74
13118 13119 2.15 Fair J I1 65.5 57.0 5430 8.01
13562 13563 1.96 Fair F I1 66.6 60.0 5554 7.59
13757 13758 2.22 Fair J I1 66.7 56.0 5607 8.04
… … … … … … … … … …
27748 27749 2.00 Very Good G SI1 63.5 56.0 18818 7.90
27749 27750 2.29 Premium I VS2 60.8 60.0 18823 8.50
48410 48411 0.51 Very Good E VS1 61.8 54.7 1970 5.12
49189 49190 0.51 Ideal E VS1 61.8 55.0 2075 5.15
49905 49906 0.50 Very Good G VVS1 63.7 58.0 2180 5.01

y z
11778 7.28 5.12
13002 7.70 5.36
13118 7.95 5.23
13562 7.56 5.04
13757 8.02 5.36
… … …
27748 7.97 5.04
27749 8.47 5.16
48410 5.15 31.80
49189 31.80 5.12
49905 5.04 5.06

[1457 rows x 11 columns]


9. Write a Pandas program to calculate the mean of each row of diamonds DataFrame.
[12]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the mean of each row


# You may want to exclude non-numeric columns (like 'cut', 'color', 'clarity')␣
↪before calculating the mean

row_means = df.select_dtypes(include='number').mean(axis=1)

# Add the row means to the DataFrame (optional)


df['Row_Mean'] = row_means

# Display the original DataFrame with the new Row_Mean column


print("Diamonds DataFrame with Row Means:")
print(df[['carat', 'x', 'y', 'z', 'Row_Mean']].head()) # Displaying relevant␣
↪columns for clarity

Diamonds DataFrame with Row Means:


carat x y z Row_Mean
0 0.23 3.95 3.98 2.43 56.76125
1 0.21 3.89 3.84 2.31 57.38125
2 0.23 4.05 4.07 2.31 57.82000
3 0.29 4.20 4.23 2.63 58.71875
4 0.31 4.34 4.35 2.75 59.13125
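One caveat: `select_dtypes(include='number')` also picks up the `Unnamed: 0` index column and `price`, so the row means above are dominated by those large values. Restricting the mean to the physical measurement columns gives more interpretable numbers — a sketch with illustrative sample values:

```python
import pandas as pd

# Illustrative values modeled on the first two diamonds rows
df = pd.DataFrame({'Unnamed: 0': [1, 2], 'carat': [0.23, 0.21],
                   'price': [326, 326],
                   'x': [3.95, 3.89], 'y': [3.98, 3.84], 'z': [2.43, 2.31]})

# Mean over just the measurement columns, not the index or price
dim_mean = df[['carat', 'x', 'y', 'z']].mean(axis=1)
print(dim_mean)
```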
10. Write a Pandas program to calculate the mean price for each cut and find the top 3 cuts with the highest mean price in the diamonds DataFrame.

[13]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the mean price for each cut


mean_price_by_cut = df.groupby('cut')['price'].mean()

# Find the top 3 cuts with the highest mean price


top_3_cuts = mean_price_by_cut.nlargest(3)

# Display the results


print("Top 3 cuts with the highest mean price:")
print(top_3_cuts)

Top 3 cuts with the highest mean price:


cut
Premium 4584.257704
Fair 4358.757764
Very Good 3981.759891
Name: price, dtype: float64

11. Write a Pandas program to calculate count, minimum, maximum price for each
cut of diamonds DataFrame.
[14]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate count, minimum, and maximum price for each cut


cut_price_stats = df.groupby('cut')['price'].agg(['count', 'min', 'max'])

# Display the results


print("Count, minimum, and maximum price for each cut:")
print(cut_price_stats)

Count, minimum, and maximum price for each cut:


count min max
cut
Fair 1610 337 18574
Good 4906 327 18788
Ideal 21551 326 18806
Premium 13791 326 18823

Very Good 12082 336 18818
12. Write a Pandas program to display and count the unique values in cut series of diamonds
DataFrame.
[15]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Display unique values in the 'cut' series


unique_cuts = df['cut'].unique()
print("Unique values in the 'cut' series:")
print(unique_cuts)

# Count unique values in the 'cut' series


cut_counts = df['cut'].value_counts()
print("\nCount of unique values in the 'cut' series:")
print(cut_counts)

Unique values in the 'cut' series:


['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

Count of unique values in the 'cut' series:


cut
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: count, dtype: int64
13. Write a Pandas program to count the number of missing values in each Series of diamonds
DataFrame.
[16]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Count the number of missing values in each Series (column)


missing_values_count = df.isnull().sum()

# Display the count of missing values


print("Number of missing values in each column:")

print(missing_values_count)

Number of missing values in each column:


Unnamed: 0 0
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
price 0
x 0
y 0
z 0
dtype: int64
14. Write a Pandas program to calculate the product of x, y and z for each cut of the diamonds DataFrame.
[17]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the product of x, y, and z and create a new column 'xyz_product'


df['xyz_product'] = df['x'] * df['y'] * df['z']

# Group by 'cut' and calculate the sum of the 'xyz_product' for each cut
cut_xyz_product = df.groupby('cut')['xyz_product'].sum()

# Display the results


print("Total product of x, y, and z for each cut:")
print(cut_xyz_product)

Total product of x, y, and z for each cut:


cut
Fair 2.655704e+05
Good 6.684782e+05
Ideal 2.486876e+06
Premium 2.000414e+06
Very Good 1.582739e+06
Name: xyz_product, dtype: float64

15. Write a Pandas program to read rows 0 through 2 (inclusive), columns ‘color’ and
‘price’ of diamonds DataFrame.

[18]: ## code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Select rows 0 through 2 (inclusive) and columns 'color' and 'price'


selected_data = df.loc[0:2, ['color', 'price']]

# Display the selected data


print("Selected rows and columns from the diamonds DataFrame:")
print(selected_data)

Selected rows and columns from the diamonds DataFrame:


color price
0 E 326
1 E 326
2 E 327
16. Write a Pandas program to read rows in positions 0 and 1, columns in positions 0 and 3 of
diamonds DataFrame.
[19]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Select rows in positions 0 and 1 and columns in positions 0 and 3


selected_data = df.iloc[0:2, [0, 3]] # 0:2 selects rows 0 and 1; [0, 3]␣
↪selects columns 0 and 3

# Display the selected data


print("Selected rows and columns from the diamonds DataFrame:")
print(selected_data)

Selected rows and columns from the diamonds DataFrame:


Unnamed: 0 color
0 1 E
1 2 E

17. Write a Pandas program to get randomly sample rows from diamonds DataFrame.
[20]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file
df = pd.read_csv('diamonds.csv')

# Get a random sample of 5 rows from the diamonds DataFrame


random_sample = df.sample(n=5, random_state=1) # Set random_state for␣
↪reproducibility

# Display the random sample


print("Random sample of rows from the diamonds DataFrame:")
print(random_sample)

Random sample of rows from the diamonds DataFrame:


Unnamed: 0 carat cut color clarity depth table price x \
2714 2715 0.33 Ideal H SI1 61.7 55.0 564 4.43
14653 14654 1.20 Ideal I VS2 62.1 57.0 5914 6.78
52760 52761 0.62 Ideal D VS1 61.0 57.0 2562 5.51
48658 48659 0.34 Good H VS2 63.1 56.0 537 4.41
14812 14813 1.20 Ideal E SI2 62.5 55.0 5964 6.77

y z
2714 4.46 2.74
14653 6.71 4.19
52760 5.54 3.37
48658 4.46 2.80
14812 6.84 4.25
18. Write a Pandas program to sample 75% of the diamonds DataFrame’s rows without replacement and store the remaining 25% of the rows in another DataFrame.
[21]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sample 75% of the rows without replacement


sampled_df = df.sample(frac=0.75, random_state=1) # frac=0.75 means 75%

# Store the remaining 25% of the rows in another DataFrame


remaining_df = df.drop(sampled_df.index)

# Display the shapes of the sampled and remaining DataFrames


print("Shape of the sampled DataFrame (75%):", sampled_df.shape)
print("Shape of the remaining DataFrame (25%):", remaining_df.shape)

# Optional: Display the first few rows of each DataFrame


print("\nSampled DataFrame (75%):")

print(sampled_df.head())

print("\nRemaining DataFrame (25%):")


print(remaining_df.head())

Shape of the sampled DataFrame (75%): (40455, 11)


Shape of the remaining DataFrame (25%): (13485, 11)

Sampled DataFrame (75%):


Unnamed: 0 carat cut color clarity depth table price x \
2714 2715 0.33 Ideal H SI1 61.7 55.0 564 4.43
14653 14654 1.20 Ideal I VS2 62.1 57.0 5914 6.78
52760 52761 0.62 Ideal D VS1 61.0 57.0 2562 5.51
48658 48659 0.34 Good H VS2 63.1 56.0 537 4.41
14812 14813 1.20 Ideal E SI2 62.5 55.0 5964 6.77

y z
2714 4.46 2.74
14653 6.71 4.19
52760 5.54 3.37
48658 4.46 2.80
14812 6.84 4.25

Remaining DataFrame (25%):


Unnamed: 0 carat cut color clarity depth table price x \
9 10 0.23 Very Good H VS1 59.4 61.0 338 4.00
14 15 0.20 Premium E SI2 60.2 62.0 345 3.79
15 16 0.32 Premium E I1 60.9 58.0 345 4.38
18 19 0.30 Good J SI1 63.8 56.0 351 4.23
20 21 0.30 Good I SI2 63.3 56.0 351 4.26

y z
9 4.05 2.39
14 3.75 2.27
15 4.42 2.68
18 4.26 2.71
20 4.30 2.71
19. Write a Pandas program to read the diamonds DataFrame and detect duplicate values in the color column.
[22]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Check for duplicates in the 'color' column

duplicates = df[df['color'].duplicated(keep=False)]  # keep=False marks all duplicates as True

# Display the duplicate colors


print("Duplicate colors in the diamonds DataFrame:")
print(duplicates[['color']].drop_duplicates())

Duplicate colors in the diamonds DataFrame:


color
0 E
3 I
4 J
7 H
12 F
25 G
28 D
20. Write a Pandas program to count the duplicate rows of the diamonds DataFrame.
[23]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Count duplicate rows in the DataFrame


duplicate_count = df.duplicated().sum() # Returns the count of duplicate rows

# Display the count of duplicate rows


print("Number of duplicate rows in the diamonds DataFrame:", duplicate_count)

Number of duplicate rows in the diamonds DataFrame: 0

[ ]:

