Assignment - 10 - Pandas

The document outlines a series of steps for analyzing the iris dataset using Pandas in Python. It includes loading the dataset, performing various data manipulations such as calculating statistics, filtering data, and adding new columns. Each step is accompanied by code snippets demonstrating the required operations.

pandas-assignment-day-1-iris

November 8, 2024

Step 1. Load the iris.csv dataset.


[11]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')
df

[11]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. … … … … …
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

1. Write a Pandas program to get the data types of the fields of the given csv data (iris.csv)

[1]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Display the data types of each column


print(df.dtypes)

Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object

2. Write a Pandas program to find the sum, mean, max, min value of ‘SepalLengthCm’
column of dataframe
[2]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Calculate sum, mean, max, and min for 'SepalLengthCm' column


sepal_length_sum = df['SepalLengthCm'].sum()
sepal_length_mean = df['SepalLengthCm'].mean()
sepal_length_max = df['SepalLengthCm'].max()
sepal_length_min = df['SepalLengthCm'].min()

# Display the results


print(f"Sum of SepalLengthCm: {sepal_length_sum}")
print(f"Mean of SepalLengthCm: {sepal_length_mean}")
print(f"Max of SepalLengthCm: {sepal_length_max}")
print(f"Min of SepalLengthCm: {sepal_length_min}")

Sum of SepalLengthCm: 876.5


Mean of SepalLengthCm: 5.843333333333334
Max of SepalLengthCm: 7.9
Min of SepalLengthCm: 4.3
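The four separate reductions above can also be collapsed into a single `.agg()` call, which returns all of them as one labelled Series. A minimal sketch on a small inline Series (so it runs without `iris.csv`; the values are the first five SepalLengthCm entries shown earlier):

```python
import pandas as pd

# Inline stand-in for the SepalLengthCm column (first five iris values)
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name='SepalLengthCm')

# One .agg() call computes all four statistics at once
stats = s.agg(['sum', 'mean', 'max', 'min'])
print(stats)
```

The result is indexed by the aggregation names, so `stats['mean']` etc. read naturally.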

3. Write a Pandas program to import the iris data, skipping the first twenty rows, into a
Pandas dataframe.
[3]: ### code here
import pandas as pd

# Load the iris dataset from csv file, skipping the first 20 data rows
# Note: skiprows=20 would also consume the header line; passing the data
# row numbers 1..20 instead keeps the column names
df = pd.read_csv('iris.csv', skiprows=range(1, 21))

# Display the first few rows of the dataframe to verify


print(df.head())

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0  21            5.4           3.4            1.7           0.2  Iris-setosa
1  22            5.1           3.7            1.5           0.4  Iris-setosa
2  23            4.6           3.6            1.0           0.2  Iris-setosa
3  24            5.1           3.3            1.7           0.5  Iris-setosa
4  25            4.8           3.4            1.9           0.2  Iris-setosa
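The distinction between passing an int and a range to `skiprows` matters here: an int counts lines from the top of the file, header included, while a range of data line numbers leaves the header intact. A small demonstration on an in-memory CSV:

```python
import io
import pandas as pd

csv_text = "a,b\n1,2\n3,4\n5,6\n"

# skiprows=<int> counts from the top of the file, so the header line is consumed too
df_int = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# skiprows=range(1, 2) skips only data line 1 and keeps the header row
df_range = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 2))

print(df_int.columns.tolist())    # the first data row became the header
print(df_range.columns.tolist())  # ['a', 'b'] preserved
```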

4. Write a Pandas program to display the last ten rows.


[4]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Display the last 10 rows of the dataframe


print(df.tail(10))

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


140 141 6.7 3.1 5.6 2.4
141 142 6.9 3.1 5.1 2.3
142 143 5.8 2.7 5.1 1.9
143 144 6.8 3.2 5.9 2.3
144 145 6.7 3.3 5.7 2.5
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
140 Iris-virginica
141 Iris-virginica
142 Iris-virginica
143 Iris-virginica
144 Iris-virginica
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

5. Write a Pandas program to create a subtotal of “PetalLengthCm” against Species
from the dataframe.
[5]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Group by 'Species' and calculate the sum of 'PetalLengthCm'


subtotal = df.groupby('Species')['PetalLengthCm'].sum()

# Display the subtotal


print(subtotal)

Species
Iris-setosa 73.2
Iris-versicolor 213.0
Iris-virginica 277.6
Name: PetalLengthCm, dtype: float64
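`groupby` is not limited to a single subtotal; passing a list to `.agg()` yields several statistics per species in one pass. A minimal sketch on an inline frame (illustrative values, not the real dataset):

```python
import pandas as pd

# Tiny stand-in for the iris frame
df = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'virginica'],
    'PetalLengthCm': [1.4, 1.3, 5.2],
})

# One pass computes sum, mean, and count per species
summary = df.groupby('Species')['PetalLengthCm'].agg(['sum', 'mean', 'count'])
print(summary)
```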
6. Write a Pandas program to find details where “PetalLengthCm” > 2.
[6]: ### code here
import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Filter the rows where 'PetalLengthCm' is greater than 2


filtered_data = df[df['PetalLengthCm'] > 2]

# Display the filtered data


print(filtered_data)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


50 51 7.0 3.2 4.7 1.4
51 52 6.4 3.2 4.5 1.5
52 53 6.9 3.1 4.9 1.5
53 54 5.5 2.3 4.0 1.3
54 55 6.5 2.8 4.6 1.5
.. … … … … …
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species

50 Iris-versicolor
51 Iris-versicolor
52 Iris-versicolor
53 Iris-versicolor
54 Iris-versicolor
.. …
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[100 rows x 6 columns]


7. Write a Pandas program find all records that include two specific Species.
[7]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Filter rows where 'Species' is 'Iris-setosa' or 'Iris-versicolor'
# Note: the Species values carry the 'Iris-' prefix, so plain 'setosa'
# and 'versicolor' would match nothing and return an empty DataFrame


filtered_data = df[df['Species'].isin(['Iris-setosa', 'Iris-versicolor'])]

# Display the filtered data


print(filtered_data)

     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0     1            5.1           3.5            1.4           0.2  Iris-setosa
..   ..            ...           ...            ...           ...          ...

[100 rows x 6 columns]
8. Write a Pandas program to sort the records by the SepalLengthCm column.
[8]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Sort the records by 'SepalLengthCm' in ascending order


sorted_data = df.sort_values(by='SepalLengthCm')

# Display the sorted data


print(sorted_data)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


13 14 4.3 3.0 1.1 0.1
42 43 4.4 3.2 1.3 0.2
38 39 4.4 3.0 1.3 0.2
8 9 4.4 2.9 1.4 0.2

41 42 4.5 2.3 1.3 0.3
.. … … … … …
122 123 7.7 2.8 6.7 2.0
118 119 7.7 2.6 6.9 2.3
117 118 7.7 3.8 6.7 2.2
135 136 7.7 3.0 6.1 2.3
131 132 7.9 3.8 6.4 2.0

Species
13 Iris-setosa
42 Iris-setosa
38 Iris-setosa
8 Iris-setosa
41 Iris-setosa
.. …
122 Iris-virginica
118 Iris-virginica
117 Iris-virginica
135 Iris-virginica
131 Iris-virginica

[150 rows x 6 columns]

9. Write a Pandas program to insert a column in the sixth position of dataframe and
fill it with NaN values.
[9]: import pandas as pd
import numpy as np

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Insert a new column at the sixth position (index 5) and fill it with NaN values

df.insert(5, 'NewColumn', np.nan)

# Display the first few rows of the updated dataframe


print(df.head())

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm NewColumn \


0 1 5.1 3.5 1.4 0.2 NaN
1 2 4.9 3.0 1.4 0.2 NaN
2 3 4.7 3.2 1.3 0.2 NaN
3 4 4.6 3.1 1.5 0.2 NaN
4 5 5.0 3.6 1.4 0.2 NaN

Species
0 Iris-setosa

1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa

10. Write a Pandas program to add a summation row to the dataframe.


[10]: import pandas as pd

# Load the iris dataset from csv file


df = pd.read_csv('iris.csv')

# Calculate the sum of the numeric columns


summation_row = df.sum(numeric_only=True)

# Append the summation row to the DataFrame


df.loc['Total'] = summation_row

# Display the updated DataFrame


print(df)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1.0 5.1 3.5 1.4 0.2
1 2.0 4.9 3.0 1.4 0.2
2 3.0 4.7 3.2 1.3 0.2
3 4.0 4.6 3.1 1.5 0.2
4 5.0 5.0 3.6 1.4 0.2
… … … … … …
146 147.0 6.3 2.5 5.0 1.9
147 148.0 6.5 3.0 5.2 2.0
148 149.0 6.2 3.4 5.4 2.3
149 150.0 5.9 3.0 5.1 1.8
Total 11325.0 876.5 458.1 563.8 179.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
… …
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Total NaN

[151 rows x 6 columns]
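Note that in the output above the integer Id column prints as 1.0, 2.0, …: assigning a float Series as a new row via `.loc` upcasts integer columns to float. A minimal sketch of that side effect, on a small inline frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]})

# Appending the float summation Series via .loc upcasts the int column 'x'
df.loc['Total'] = df.sum(numeric_only=True)
print(df.dtypes)
```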

pandas-assignment-day-2

November 8, 2024

[20]: import pandas as pd
import numpy as np

data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no',
                    'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Write a Pandas program to create a dataframe from the given dictionary, with the given index labels
[21]: # Create a DataFrame
df = pd.DataFrame(data, index=labels)

# Display the created DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

1. Write a Pandas program to select the rows where the score is missing, i.e., NaN
[22]: # Create a DataFrame
df = pd.DataFrame(data)

# Select the rows where 'Score' is NaN


missing_score_rows = df[df['score'].isna()]

# Display the rows with missing scores
print(missing_score_rows)

name score attempts qualify


3 James NaN 3 no
7 Laura NaN 1 no
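The complement of `isna()` is `notna()`, which selects the rows with a present score. A minimal sketch on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['James', 'Emily'], 'score': [np.nan, 9.0]})

missing = df[df['score'].isna()]    # rows where score is NaN
present = df[df['score'].notna()]   # the complement
print(len(missing), len(present))
```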

2. Write a Pandas program to change the score in row ‘d’ to 11.5.


[27]: # Create a DataFrame
df = pd.DataFrame(data, index=labels) # Setting custom index

# Change the score in row 'd' to 11.5


df.loc['d', 'score'] = 11.5

# Display the updated DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James 11.5 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

3. Write a Pandas program to select the ‘name’ and ‘score’ columns from the following
DataFrame.
[28]: # Create a DataFrame
df = pd.DataFrame(data)

# Select the 'name' and 'score' columns


selected_columns = df[['name', 'score']]

# Display the selected columns


print(selected_columns)

name score
0 Anastasia 12.5
1 Dima 9.0
2 Katherine 16.5
3 James NaN
4 Emily 9.0
5 Michael 20.0

6 Matthew 14.5
7 Laura NaN
8 Kevin 8.0
9 Jonas 19.0

4. Write a Pandas program to calculate the mean score for each different student in
DataFrame.
[29]: # Create a DataFrame
df = pd.DataFrame(data)

# Calculate the mean score for each student


mean_scores = df.groupby('name')['score'].mean()

# Display the mean scores


print(mean_scores)

name
Anastasia 12.5
Dima 9.0
Emily 9.0
James NaN
Jonas 19.0
Katherine 16.5
Kevin 8.0
Laura NaN
Matthew 14.5
Michael 20.0
Name: score, dtype: float64
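`groupby().mean()` skips NaN within each group, so a group whose only values are NaN (James and Laura above) yields NaN rather than an error. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['James', 'Dima', 'Dima'],
                   'score': [np.nan, 9.0, 11.0]})

# mean() ignores NaN inside each group; an all-NaN group produces NaN
means = df.groupby('name')['score'].mean()
print(means)
```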

5. Write a Pandas program to append a new row 'k' to the data frame with a value for
each column, then delete the row 'c' and return the DataFrame.
[30]: ### code here
# Create a DataFrame with the row labels
df = pd.DataFrame(data, index=labels)

# Append a new row 'k' with a value for each column
# (DataFrame.append was removed in pandas 2.0; use pd.concat instead)


new_row = pd.DataFrame({'name': ['Suresh'], 'score': [15.5],
                        'attempts': [1], 'qualify': ['yes']}, index=['k'])
df = pd.concat([df, new_row])

# Delete the row labelled 'c'


df = df.drop(index='c')

# Display the modified DataFrame


print(df)

        name  score  attempts qualify
a  Anastasia   12.5         1     yes
b       Dima    9.0         3      no
d      James    NaN         3      no
e      Emily    9.0         2      no
f    Michael   20.0         3     yes
g    Matthew   14.5         1     yes
h      Laura    NaN         1      no
i      Kevin    8.0         2      no
j      Jonas   19.0         1     yes
k     Suresh   15.5         1     yes

import the titanic dataset


[43]: import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')
df

[43]: survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. … … … … … … … … …
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female NaN 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


0 man True NaN Southampton no False NaN
1 woman False C Cherbourg yes False NaN
2 woman False NaN Southampton yes True NaN
3 woman False C Southampton yes False NaN
4 man True NaN Southampton no True NaN

.. … … … … … … …
886 man True NaN Southampton no True NaN
887 woman False B Southampton yes True NaN
888 woman False NaN Southampton no False NaN
889 man True C Cherbourg yes True NaN
890 man True NaN Queenstown no True NaN

[891 rows x 16 columns]

1. Write a Pandas program to create a Pivot table and find the total fare amount
class wise, gender wise.
[31]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find total fare amount class-wise and gender-wise
pivot_table = pd.pivot_table(df, values='fare', index='pclass', columns='sex',
                             aggfunc='sum', fill_value=0)

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 9975.8250 8201.5875
2 1669.7292 2132.1125
3 2321.1086 4393.5865

2. Write a Pandas program to create a Pivot table and find survival rate by gender.
[32]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find survival rate by gender


pivot_table = pd.pivot_table(df,
                             values='survived',
                             index='sex',
                             aggfunc='mean')  # Mean of 0/1 gives the survival rate

# Display the pivot table

print(pivot_table)

survived
sex
female 0.742038
male 0.188908
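Adding `margins=True` appends an 'All' row holding the overall survival rate alongside the per-gender rates. A minimal sketch on an inline frame (illustrative values, not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['female', 'female', 'male', 'male'],
                   'survived': [1, 1, 0, 1]})

# margins=True adds an 'All' row with the overall mean
pivot = pd.pivot_table(df, values='survived', index='sex',
                       aggfunc='mean', margins=True)
print(pivot)
```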
3. Write a Pandas program to create a pivot table and find the total count of people who
survived, pclass-wise and gender-wise.
[33]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to find the total count of people who survived,
# grouped by pclass and sex

pivot_table = pd.pivot_table(df,
values='survived',
index='pclass',
columns='sex',
aggfunc='sum',
fill_value=0) # Fill missing values with 0

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 91 45
2 70 17
3 72 47
4. Write a Pandas program to create a Pivot table and calculate how many women and men were
in a particular cabin class.
[34]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Create a pivot table to count how many women and men were in each cabin class
pivot_table = pd.pivot_table(df,
                             values='class',
                             index='pclass',
                             columns='sex',
                             aggfunc='count',
                             fill_value=0)  # Fill missing values with 0

# Display the pivot table


print(pivot_table)

sex female male


pclass
1 94 122
2 76 108
3 144 347

5. Write a Pandas program to count the number of missing values in each column.
[35]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Count the number of missing (NaN) values in each column


missing_values = df.isnull().sum()

# Display the result


print(missing_values)

survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
Unnamed: 15 891
dtype: int64
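The same mask can report missing values as percentages: `isnull().mean()` is the fraction of True values per column. A minimal sketch on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [22.0, np.nan, 26.0, np.nan],
                   'fare': [7.25, 71.28, 7.92, 53.1]})

# Counts and percentages of missing values per column
counts = df.isnull().sum()
percent = df.isnull().mean() * 100
print(counts)
print(percent)
```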

6. Write a Pandas program to replace null values with the value from the previous
row or the next row in a given DataFrame.

[36]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Replace NaN values with the value from the previous row (forward fill)
df_ffill = df.ffill()

# Display the result


print("\nDataFrame after forward fill (previous row):\n", df_ffill)

DataFrame after forward fill (previous row):


survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. … … … … … … … … …
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female 19.0 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


0 man True NaN Southampton no False NaN
1 woman False C Cherbourg yes False NaN
2 woman False C Southampton yes True NaN
3 woman False C Southampton yes False NaN
4 man True C Southampton no True NaN
.. … … … … … … …
886 man True C Southampton no True NaN
887 woman False B Southampton yes True NaN
888 woman False B Southampton no False NaN
889 man True C Cherbourg yes True NaN
890 man True C Queenstown no True NaN

[891 rows x 16 columns]
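Forward fill cannot repair a NaN in the very first row (there is no previous value); chaining `bfill()` after `ffill()` covers that case with the next row's value. A minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# ffill leaves the leading NaN untouched; bfill then fills it from below
filled = s.ffill().bfill()
print(filled.tolist())
```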


7. Write a Pandas program to replace nulls with the median or mean of the specified columns in a
given DataFrame.
[37]: ### code here
import pandas as pd

# Load the titanic dataset from csv file
df = pd.read_csv('titanic.csv')

# Replace NaN values with the mean of the respective column


df_mean_filled = df.copy()
df_mean_filled['age'] = df_mean_filled['age'].fillna(df['age'].mean())
df_mean_filled['fare'] = df_mean_filled['fare'].fillna(df['fare'].mean())

# Display the result


print("\nDataFrame after filling NaNs with mean:\n", df_mean_filled)

DataFrame after filling NaNs with mean:


survived pclass sex age sibsp parch fare embarked \
0 0 3 male 22.000000 1 0 7.2500 S
1 1 1 female 38.000000 1 0 71.2833 C
2 1 3 female 26.000000 0 0 7.9250 S
3 1 1 female 35.000000 1 0 53.1000 S
4 0 3 male 35.000000 0 0 8.0500 S
.. … … … … … … … …
886 0 2 male 27.000000 0 0 13.0000 S
887 1 1 female 19.000000 0 0 30.0000 S
888 0 3 female 29.699118 1 2 23.4500 S
889 1 1 male 26.000000 0 0 30.0000 C
890 0 3 male 32.000000 0 0 7.7500 Q

class who adult_male deck embark_town alive alone Unnamed: 15


0 Third man True NaN Southampton no False NaN
1 First woman False C Cherbourg yes False NaN
2 Third woman False NaN Southampton yes True NaN
3 First woman False C Southampton yes False NaN
4 Third man True NaN Southampton no True NaN
.. … … … … … … … …
886 Second man True NaN Southampton no True NaN
887 First woman False B Southampton yes True NaN
888 Third woman False NaN Southampton no False NaN
889 First man True C Cherbourg yes True NaN
890 Third man True NaN Queenstown no True NaN

[891 rows x 16 columns]
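The median variant works the same way and is less sensitive to outliers, which matters for skewed columns like fare. A minimal sketch contrasting the two fill values on an inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [20.0, np.nan, 30.0, 100.0]})

# The outlier 100 pulls the mean up; the median stays near the bulk of the data
df['age_median'] = df['age'].fillna(df['age'].median())
df['age_mean'] = df['age'].fillna(df['age'].mean())
print(df)
```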


8. Write a Pandas program to select first 2 rows, 2 columns and specific two columns from the
dataframe.
[38]: #### code here
import pandas as pd

# Load the titanic dataset from csv file

df = pd.read_csv('titanic.csv')

# Select first 2 rows and 2 columns


first_2_rows_2_columns = df.iloc[:2, :2]

# Select specific columns 'Sex' and 'Age'


specific_columns = df[['sex', 'age']]

# Display the result


print("\nSpecific two columns 'Sex' and 'Age':\n", specific_columns)

# Display the result


print("\nFirst 2 rows and 2 columns:\n", first_2_rows_2_columns)

Specific two columns 'Sex' and 'Age':


sex age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
.. … …
886 male 27.0
887 female 19.0
888 female NaN
889 male 26.0
890 male 32.0

[891 rows x 2 columns]

First 2 rows and 2 columns:


survived pclass
0 0 3
1 1 1
9. Write a Pandas program to remove the duplicates from ‘embark_town’ column of titanic dataset.
[39]: #### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Remove duplicates from 'embark_town' column


embark_town_unique = df['embark_town'].drop_duplicates()

# Display the result
print("\n'Embark_town' column after removing duplicates:\n", embark_town_unique)

'Embark_town' column after removing duplicates:


0 Southampton
1 Cherbourg
5 Queenstown
61 NaN
Name: embark_town, dtype: object
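`drop_duplicates()` keeps the first occurrence of each value, including a single NaN, and preserves the original index (hence the 0, 1, 5, 61 labels above); `.unique()` returns the same values as a plain array. A minimal sketch:

```python
import pandas as pd
import numpy as np

towns = pd.Series(['Southampton', 'Cherbourg', 'Southampton', np.nan, 'Queenstown'])

# First occurrence of each value survives, NaN included
deduped = towns.drop_duplicates()
print(deduped.tolist())
print(towns.unique())
```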
10. Write a Pandas program to find the details of passengers who survived (survived is 1),
embarked from 'Southampton', and are children.
[40]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter passengers who survived, embarked from 'Southampton', and are
# children (age < 18)

survived_children_southampton = df[(df['survived'] == 1) &
                                   (df['embark_town'] == 'Southampton') &
                                   (df['age'] < 18)]

# Display the filtered results


print("\nDetails of survived children from Southampton:\n",
      survived_children_southampton)

Details of survived children from Southampton:


survived pclass sex age sibsp parch fare embarked class \
10 1 3 female 4.00 1 1 16.7000 S Third
58 1 2 female 5.00 1 2 27.7500 S Second
68 1 3 female 17.00 4 2 7.9250 S Third
78 1 2 male 0.83 0 2 29.0000 S Second
84 1 2 female 17.00 0 0 10.5000 S Second
165 1 3 male 9.00 0 2 20.5250 S Third
172 1 3 female 1.00 1 1 11.1333 S Third
183 1 2 male 1.00 2 1 39.0000 S Second
184 1 3 female 4.00 0 2 22.0250 S Third
193 1 2 male 3.00 1 1 26.0000 S Second
220 1 3 male 16.00 0 0 8.0500 S Third
233 1 3 female 5.00 4 2 31.3875 S Third
237 1 2 female 8.00 0 2 26.2500 S Second

261 1 3 male 3.00 4 2 31.3875 S Third
305 1 1 male 0.92 1 2 151.5500 S First
340 1 2 male 2.00 1 1 26.0000 S Second
348 1 3 male 3.00 1 1 15.9000 S Third
407 1 2 male 3.00 1 1 18.7500 S Second
435 1 1 female 14.00 1 2 120.0000 S First
445 1 1 male 4.00 0 2 81.8583 S First
446 1 2 female 13.00 0 1 19.5000 S Second
479 1 3 female 2.00 0 1 12.2875 S Third
489 1 3 male 9.00 1 1 15.9000 S Third
504 1 1 female 16.00 0 0 86.5000 S First
530 1 2 female 2.00 1 1 26.0000 S Second
535 1 2 female 7.00 0 2 26.2500 S Second
549 1 2 male 8.00 1 1 36.7500 S Second
618 1 2 female 4.00 2 1 39.0000 S Second
689 1 1 female 15.00 0 1 211.3375 S First
720 1 2 female 6.00 0 1 33.0000 S Second
750 1 2 female 4.00 1 1 23.0000 S Second
751 1 3 male 6.00 0 1 12.4750 S Third
755 1 2 male 0.67 1 1 14.5000 S Second
777 1 3 female 5.00 0 0 12.4750 S Third
781 1 1 female 17.00 1 0 57.0000 S First
788 1 3 male 1.00 1 2 20.5750 S Third
802 1 1 male 11.00 1 2 120.0000 S First
831 1 2 male 0.83 1 1 18.7500 S Second
853 1 1 female 16.00 0 1 39.4000 S First
869 1 3 male 4.00 1 1 11.1333 S Third

who adult_male deck embark_town alive alone Unnamed: 15


10 child False G Southampton yes False NaN
58 child False NaN Southampton yes False NaN
68 woman False NaN Southampton yes False NaN
78 child False NaN Southampton yes False NaN
84 woman False NaN Southampton yes True NaN
165 child False NaN Southampton yes False NaN
172 child False NaN Southampton yes False NaN
183 child False F Southampton yes False NaN
184 child False NaN Southampton yes False NaN
193 child False F Southampton yes False NaN
220 man True NaN Southampton yes True NaN
233 child False NaN Southampton yes False NaN
237 child False NaN Southampton yes False NaN
261 child False NaN Southampton yes False NaN
305 child False C Southampton yes False NaN
340 child False F Southampton yes False NaN
348 child False NaN Southampton yes False NaN
407 child False NaN Southampton yes False NaN
435 child False B Southampton yes False NaN

445 child False A Southampton yes False NaN
446 child False NaN Southampton yes False NaN
479 child False NaN Southampton yes False NaN
489 child False NaN Southampton yes False NaN
504 woman False B Southampton yes True NaN
530 child False NaN Southampton yes False NaN
535 child False NaN Southampton yes False NaN
549 child False NaN Southampton yes False NaN
618 child False F Southampton yes False NaN
689 child False B Southampton yes False NaN
720 child False NaN Southampton yes False NaN
750 child False NaN Southampton yes False NaN
751 child False E Southampton yes False NaN
755 child False NaN Southampton yes False NaN
777 child False NaN Southampton yes True NaN
781 woman False B Southampton yes False NaN
788 child False NaN Southampton yes False NaN
802 child False B Southampton yes False NaN
831 child False NaN Southampton yes False NaN
853 woman False D Southampton yes False NaN
869 child False NaN Southampton yes False NaN
11. Write a Pandas program to filter records starting from the 2nd row and access every 5th row
of the dataframe.
[41]: ### code here
import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter the DataFrame starting from the 2nd row (index 1) and access every
# 5th row

filtered_df = df.iloc[1::5]

# Display the filtered DataFrame


print("\nFiltered DataFrame (starting from 2nd row, every 5th row):\n",
      filtered_df)

Filtered DataFrame (starting from 2nd row, every 5th row):


survived pclass sex age sibsp parch fare embarked class \
1 1 1 female 38.0 1 0 71.2833 C First
6 0 1 male 54.0 0 0 51.8625 S First
11 1 1 female 58.0 0 0 26.5500 S First
16 0 3 male 2.0 4 1 29.1250 Q Third
21 1 2 male 34.0 0 0 13.0000 S Second

.. … … … … … … … … …
866 1 2 female 27.0 1 0 13.8583 C Second
871 1 1 female 47.0 1 1 52.5542 S First
876 0 3 male 20.0 0 0 9.8458 S Third
881 0 3 male 33.0 0 0 7.8958 S Third
886 0 2 male 27.0 0 0 13.0000 S Second

who adult_male deck embark_town alive alone Unnamed: 15


1 woman False C Cherbourg yes False NaN
6 man True E Southampton no True NaN
11 woman False C Southampton yes True NaN
16 child False NaN Queenstown no False NaN
21 man True D Southampton yes True NaN
.. … … … … … … …
866 woman False NaN Cherbourg yes False NaN
871 woman False D Southampton yes False NaN
876 man True NaN Southampton no True NaN
881 man True NaN Southampton no True NaN
886 man True NaN Southampton no True NaN

[178 rows x 16 columns]
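The slice `iloc[1::5]` uses ordinary Python slice semantics on positions: start at positional index 1 and step by 5. A minimal sketch on an inline frame:

```python
import pandas as pd

df = pd.DataFrame({'v': list(range(12))})

# Positions 1, 6, 11 are selected: start at 1, then every 5th row
every_fifth = df.iloc[1::5]
print(every_fifth['v'].tolist())
```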


12. Write a Pandas program to filter those records where embarked matches with multiple values
(C, Q) from the dataset.

[42]: #### code here


import pandas as pd

# Load the titanic dataset from csv file


df = pd.read_csv('titanic.csv')

# Filter records where 'embarked' matches either 'C' or 'Q'


filtered_records = df[df['embarked'].isin(['C', 'Q'])]

# Display the filtered records


print("\nFiltered records where embarked is 'C' or 'Q':\n", filtered_records)

Filtered records where embarked is 'C' or 'Q':


survived pclass sex age sibsp parch fare embarked class \
1 1 1 female 38.0 1 0 71.2833 C First
5 0 3 male NaN 0 0 8.4583 Q Third
9 1 2 female 14.0 1 0 30.0708 C Second
16 0 3 male 2.0 4 1 29.1250 Q Third
19 1 3 female NaN 0 0 7.2250 C Third
.. … … … … … … … … …
875 1 3 female 15.0 0 0 7.2250 C Third
879 1 1 female 56.0 0 1 83.1583 C First

885 0 3 female 39.0 0 5 29.1250 Q Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone Unnamed: 15


1 woman False C Cherbourg yes False NaN
5 man True NaN Queenstown no True NaN
9 child False NaN Cherbourg yes False NaN
16 child False NaN Queenstown no False NaN
19 woman False NaN Cherbourg yes True NaN
.. … … … … … … …
875 child False NaN Cherbourg yes True NaN
879 woman False C Cherbourg yes False NaN
885 woman False NaN Queenstown no False NaN
889 man True C Cherbourg yes True NaN
890 man True NaN Queenstown no True NaN

[245 rows x 16 columns]
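Negating the same mask with `~` selects the complement, i.e. passengers who embarked anywhere other than 'C' or 'Q'. A minimal sketch on an inline frame:

```python
import pandas as pd

df = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S', 'C']})

in_cq = df[df['embarked'].isin(['C', 'Q'])]    # matches either value
not_cq = df[~df['embarked'].isin(['C', 'Q'])]  # ~ negates the boolean mask
print(len(in_cq), len(not_cq))
```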

pandas-assignment-day-3

November 8, 2024

[13]: import pandas as pd
import numpy as np

data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
                 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
        'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
        'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no',
                    'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Write a Pandas program to create a dataframe from the given dictionary, with the given index labels
[14]: ### code here
# Create a DataFrame
df = pd.DataFrame(data, index=labels)

# Display the created DataFrame


print(df)

name score attempts qualify


a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
f Michael 20.0 3 yes
g Matthew 14.5 1 yes
h Laura NaN 1 no
i Kevin 8.0 2 no
j Jonas 19.0 1 yes

1. Write a Pandas program to sort the DataFrame first by ‘name’ in descending order,
then by ‘score’ in ascending order
[63]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Sort the DataFrame first by 'name' in descending order, then by 'score' in
# ascending order

sorted_df = df.sort_values(by=['name', 'score'], ascending=[False, True])

# Display the sorted DataFrame


print("\nSorted DataFrame:\n", sorted_df)

Sorted DataFrame:
name score attempts qualify
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
2 Katherine 16.5 2 yes
9 Jonas 19.0 1 yes
3 James NaN 3 no
4 Emily 9.0 2 no
1 Dima 9.0 3 no
0 Anastasia 12.5 1 yes
2. Write a Pandas program to replace the values 'yes' and 'no' in the 'qualify' column with
True and False
[64]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Map 'yes' to True and 'no' to False in the 'qualify' column
# (map avoids the FutureWarning that replace raises about silent downcasting)
df['qualify'] = df['qualify'].map({'yes': True, 'no': False})

# Display the updated DataFrame


print("\nUpdated DataFrame:\n", df)

Updated DataFrame:
         name  score  attempts  qualify
0  Anastasia   12.5         1     True
1       Dima    9.0         3    False
2  Katherine   16.5         2     True
3      James    NaN         3    False
4      Emily    9.0         2    False
5    Michael   20.0         3     True
6    Matthew   14.5         1     True
7      Laura    NaN         1    False
8      Kevin    8.0         2    False
9      Jonas   19.0         1     True

3. Write a Pandas program to delete the ‘attempts’ column from the DataFrame
[65]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Delete the 'attempts' column


df = df.drop(columns=['attempts'])

# Display the updated DataFrame


print("\nUpdated DataFrame (after deleting 'attempts' column):\n", df)

Updated DataFrame (after deleting 'attempts' column):


name score qualify
0 Anastasia 12.5 yes
1 Dima 9.0 no
2 Katherine 16.5 yes
3 James NaN no
4 Emily 9.0 no
5 Michael 20.0 yes
6 Matthew 14.5 yes
7 Laura NaN no
8 Kevin 8.0 no
9 Jonas 19.0 yes

4. Write a Pandas program to insert a new column in existing DataFrame.


[66]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Insert a new column 'grade' based on the scores
# (thresholds chosen to fit the 0-20 score scale; a NaN score fails both
# comparisons and falls through to 'C')
df['grade'] = ['A' if score >= 15 else 'B' if score >= 10 else 'C'
               for score in df['score']]

# Display the updated DataFrame

print("\nUpdated DataFrame (after inserting 'grade' column):\n", df)

Updated DataFrame (after inserting 'grade' column):


        name  score  attempts qualify grade
0  Anastasia   12.5         1     yes     B
1       Dima    9.0         3      no     C
2  Katherine   16.5         2     yes     A
3      James    NaN         3      no     C
4      Emily    9.0         2      no     C
5    Michael   20.0         3     yes     A
6    Matthew   14.5         1     yes     B
7      Laura    NaN         1      no     C
8      Kevin    8.0         2      no     C
9      Jonas   19.0         1     yes     A

5. Write a Pandas program to change the name ‘James’ to ‘Maxwell’ in the name column of the DataFrame.
[67]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Change 'James' to 'Maxwell' in the 'name' column


df['name'] = df['name'].replace('James', 'Maxwell')

# Alternatively, you can use boolean indexing:


# df.loc[df['name'] == 'James', 'name'] = 'Maxwell'

# Display the updated DataFrame


print("\nUpdated DataFrame (after changing 'James' to 'Maxwell'):\n", df)

Updated DataFrame (after changing 'James' to 'Maxwell'):


name score attempts qualify
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 Maxwell NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

6. Write a Pandas program to rename columns of a given DataFrame

[68]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Method 1: Rename specific columns using the rename() method


df_renamed = df.rename(columns={'name': 'Full Name', 'score': 'Total Score'})

# Display the updated DataFrame after renaming


print("\nUpdated DataFrame (after renaming specific columns):\n", df_renamed)

# Method 2: Rename all columns by assigning a new list of column names


df.columns = ['Student Name', 'Total Marks', 'Total Attempts', 'Qualified']

# Display the updated DataFrame after renaming all columns


print("\nUpdated DataFrame (after renaming all columns):\n", df)

Updated DataFrame (after renaming specific columns):


Full Name Total Score attempts qualify
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

Updated DataFrame (after renaming all columns):


Student Name Total Marks Total Attempts Qualified
0 Anastasia 12.5 1 yes
1 Dima 9.0 3 no
2 Katherine 16.5 2 yes
3 James NaN 3 no
4 Emily 9.0 2 no
5 Michael 20.0 3 yes
6 Matthew 14.5 1 yes
7 Laura NaN 1 no
8 Kevin 8.0 2 no
9 Jonas 19.0 1 yes

7. Write a Pandas program to convert the datatype of a given column (floats to ints).

[69]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Fill NaN values with 0 or any other appropriate value


df['score'] = df['score'].fillna(0)

# Convert the 'score' column from float to int


df['score'] = df['score'].astype(int)

# Display the updated DataFrame and data types


print("\nUpdated DataFrame (after converting 'score' to int):\n", df)
print("\nUpdated Data Types:\n", df.dtypes)

Updated DataFrame (after converting 'score' to int):


name score attempts qualify
0 Anastasia 12 1 yes
1 Dima 9 3 no
2 Katherine 16 2 yes
3 James 0 3 no
4 Emily 9 2 no
5 Michael 20 3 yes
6 Matthew 14 1 yes
7 Laura 0 1 no
8 Kevin 8 2 no
9 Jonas 19 1 yes

Updated Data Types:


name object
score int32
attempts int64
qualify object
dtype: object
8. Write a Pandas program to convert the index into a column of the given DataFrame.
[70]: ### code here

#Create a dataframe
df = pd.DataFrame(data, index=labels)

# Convert the index into a column


df_reset = df.reset_index()

# Display the updated DataFrame with index as a column


print("\nUpdated DataFrame (after converting index to column):\n", df_reset)

Updated DataFrame (after converting index to column):
index name score attempts qualify
0 a Anastasia 12.5 1 yes
1 b Dima 9.0 3 no
2 c Katherine 16.5 2 yes
3 d James NaN 3 no
4 e Emily 9.0 2 no
5 f Michael 20.0 3 yes
6 g Matthew 14.5 1 yes
7 h Laura NaN 1 no
8 i Kevin 8.0 2 no
9 j Jonas 19.0 1 yes

9. Write a Pandas program to shuffle a DataFrame rows.


[71]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Shuffle the rows of the DataFrame


df_shuffled = df.sample(frac=1).reset_index(drop=True)

# Display the shuffled DataFrame


print("\nShuffled DataFrame:\n", df_shuffled)

Shuffled DataFrame:
name score attempts qualify
0 Jonas 19.0 1 yes
1 Laura NaN 1 no
2 James NaN 3 no
3 Michael 20.0 3 yes
4 Katherine 16.5 2 yes
5 Matthew 14.5 1 yes
6 Kevin 8.0 2 no
7 Emily 9.0 2 no
8 Dima 9.0 3 no
9 Anastasia 12.5 1 yes
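`sample(frac=1)` produces a different order on every run; passing `random_state` makes the shuffle reproducible — a minimal sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine'],
                   'score': [12.5, 9.0, 16.5]})

# Same seed -> same permutation on every call
s1 = df.sample(frac=1, random_state=42).reset_index(drop=True)
s2 = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(s1.equals(s2))
```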

10. Write a Pandas program to write a DataFrame to a CSV file using a tab separator.
[72]: ### code here

# Create a DataFrame
df = pd.DataFrame(data)

# Specify the filename for the output CSV file

filename = 'output_data.tsv' # Using .tsv as an extension for tab-separated␣
↪values

# Write the DataFrame to a CSV file using tab as a separator


df.to_csv(filename, sep='\t', index=False)

print(f"DataFrame written to {filename} using tab as a separator.")

DataFrame written to output_data.tsv using tab as a separator.
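To verify the separator round-trips, the file can be read back with the same `sep='\t'` — a minimal self-contained sketch (the filename and values are illustrative):

```python
import pandas as pd

# Round-trip sketch: write a small frame with a tab separator, then read it back
df_out = pd.DataFrame({'name': ['Anastasia', 'Dima'], 'score': [12.5, 9.0]})
df_out.to_csv('output_data.tsv', sep='\t', index=False)

# Reading requires the same separator, otherwise each row parses as one column
df_back = pd.read_csv('output_data.tsv', sep='\t')
print(df_back)
```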

[ ]:

[ ]:

[36]: import pandas as pd

student_data1 = pd.DataFrame({
'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed␣
↪Bernal', 'Kwame Morin'],

'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({
'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],
'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser␣
↪William', 'Madeeha Preston'],

'marks': [201, 200, 198, 219, 201]})

11. Write a Pandas program to join the two given dataframes along rows and assign
all data.
[37]: ### code here

# Joining the two DataFrames along rows


combined_student_data = pd.concat([student_data1, student_data2],␣
↪ignore_index=True)

# Display the combined DataFrame


print("Combined DataFrame:\n", combined_student_data)

Combined DataFrame:
student_id name marks
0 S1 Danniella Fenton 200
1 S2 Ryder Storey 210
2 S3 Bryce Jensen 190
3 S4 Ed Bernal 222
4 S5 Kwame Morin 199
5 S4 Scarlette Fisher 201

6 S5 Carla Williamson 200
7 S6 Dante Morse 198
8 S7 Kaiser William 219
9 S8 Madeeha Preston 201

12. Write a Pandas program to join the two given dataframes along columns and
assign all data.
[38]: ### code here

# Joining the two DataFrames along columns


combined_student_data = pd.concat([student_data1, student_data2], axis=1)

# Display the combined DataFrame


print("Combined DataFrame:\n", combined_student_data)

Combined DataFrame:
student_id name marks student_id name marks
0 S1 Danniella Fenton 200 S4 Scarlette Fisher 201
1 S2 Ryder Storey 210 S5 Carla Williamson 200
2 S3 Bryce Jensen 190 S6 Dante Morse 198
3 S4 Ed Bernal 222 S7 Kaiser William 219
4 S5 Kwame Morin 199 S8 Madeeha Preston 201

13. Write a Pandas program to join the two dataframes using the common column of
both dataframes.
[39]: ### code here

# Using the common column 'student_id' to merge both DataFrames


merged_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪suffixes=('_data1', '_data2'))

# Display the result of the merged DataFrame


print("Merged DataFrame:\n", merged_data)

Merged DataFrame:
student_id name_data1 marks_data1 name_data2 marks_data2
0 S4 Ed Bernal 222 Scarlette Fisher 201
1 S5 Kwame Morin 199 Carla Williamson 200
14. Write a Pandas program to join (left join) the two dataframes using keys from left dataframe
only.
[40]: #### code here

# Performing a left join using the common column 'student_id'


left_joined_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪how='left', suffixes=('_data1', '_data2'))

# Display the result of the left joined DataFrame
print("Left Joined DataFrame:\n", left_joined_data)

Left Joined DataFrame:


student_id name_data1 marks_data1 name_data2 marks_data2
0 S1 Danniella Fenton 200 NaN NaN
1 S2 Ryder Storey 210 NaN NaN
2 S3 Bryce Jensen 190 NaN NaN
3 S4 Ed Bernal 222 Scarlette Fisher 201.0
4 S5 Kwame Morin 199 Carla Williamson 200.0
15. Write a Pandas program to join two dataframes using keys from right dataframe only.
[41]: ### code here

# Performing a right join using the common column 'student_id'


right_joined_data = pd.merge(student_data1, student_data2, on='student_id',␣
↪how='right', suffixes=('_data1', '_data2'))

# Display the result of the right joined DataFrame


print("Right Joined DataFrame:\n", right_joined_data)

Right Joined DataFrame:


student_id name_data1 marks_data1 name_data2 marks_data2
0 S4 Ed Bernal 222.0 Scarlette Fisher 201
1 S5 Kwame Morin 199.0 Carla Williamson 200
2 S6 NaN NaN Dante Morse 198
3 S7 NaN NaN Kaiser William 219
4 S8 NaN NaN Madeeha Preston 201
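Beyond left and right joins, `how='outer'` keeps the union of keys from both frames, filling the non-matching side with NaN — a minimal sketch on two small frames modeled on the student data above:

```python
import pandas as pd

student_data1 = pd.DataFrame({'student_id': ['S1', 'S4'], 'marks': [200, 222]})
student_data2 = pd.DataFrame({'student_id': ['S4', 'S6'], 'marks': [201, 198]})

# Outer join: every key from either side appears once in the result
outer = pd.merge(student_data1, student_data2, on='student_id',
                 how='outer', suffixes=('_data1', '_data2'))
print(outer)
```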

[ ]:

[ ]:

import the UFO dataset


[76]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')
df

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\428872793.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')

[76]: Date_time city state/province country \
0 10/10/1949 20:30 san marcos tx us
1 10/10/1949 21:00 lackland afb tx NaN
2 10/10/1955 17:00 chester (uk/england) NaN gb
3 10/10/1956 21:00 edna tx us
4 10/10/1960 20:00 kaneohe hi us
… … … … …
80327 9/9/2013 21:15 nashville tn us
80328 9/9/2013 22:00 boise id us
80329 9/9/2013 22:00 napa ca us
80330 9/9/2013 22:20 vienna va us
80331 9/9/2013 23:00 edmond ok us

UFO_shape length_of_encounter_seconds described_duration_of_encounter \


0 cylinder 2700 45 minutes
1 light 7200 1-2 hrs
2 circle 20 20 seconds
3 circle 20 1/2 hour
4 light 900 15 minutes
… … … …
80327 light 600.0 10 minutes
80328 circle 1200.0 20 minutes
80329 other 1200.0 hour
80330 circle 5.0 5 seconds
80331 cigar 1020.0 17 minutes

description date_documented \
0 This event took place in early fall around 194… 4/27/2004
1 1949 Lackland AFB&#44 TX. Lights racing acros… 12/16/2005
2 Green/Orange circular disc over Chester&#44 En… 1/21/2008
3 My older brother and twin sister were leaving … 1/17/2004
4 AS a Marine 1st Lt. flying an FJ4B fighter/att… 1/22/2004
… … …
80327 Round from the distance/slowly changing colors… 9/30/2013
80328 Boise&#44 ID&#44 spherical&#44 20 min&#44 10 r… 9/30/2013
80329 Napa UFO&#44 9/30/2013
80330 Saw a five gold lit cicular craft moving fastl… 9/30/2013
80331 2 witnesses 2 miles apart&#44 Red &amp; White… 9/30/2013

latitude longitude
0 29.8830556 -97.941111
1 29.38421 -98.581082
2 53.2 -2.916667
3 28.9783333 -96.645833
4 21.4180556 -157.803611
… … …
80327 36.165833 -86.784444

80328 43.613611 -116.202500
80329 38.297222 -122.284444
80330 38.901111 -77.265556
80331 35.652778 -97.477778

[80332 rows x 11 columns]

16. Write a Pandas program to create today’s date.


[42]: ### code here

import pandas as pd

# Create today's date using pd.Timestamp


today_date = pd.Timestamp.now()

# Alternatively, create today's date using pd.to_datetime


today_date_alt = pd.to_datetime("today")

# Display today's date


print("Today's date using pd.Timestamp:", today_date)
print("Today's date using pd.to_datetime:", today_date_alt)

Today's date using pd.Timestamp: 2024-10-11 15:08:13.536164


Today's date using pd.to_datetime: 2024-10-11 15:08:13.560834
17. Write a Pandas program to get the current date, the oldest date, and the number of days between the current date and the oldest date of the UFO dataset.
[73]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce') # Use␣
↪errors='coerce' to handle invalid parsing

# Get the current date as a Timestamp


current_date = pd.Timestamp.now()

# Get the oldest date from the dataset


oldest_date = df['Date_time'].min() # This is a Timestamp

# Convert oldest_date to date for calculation


oldest_date_as_date = oldest_date.date()

# Calculate the number of days between current date and oldest date

number_of_days = (current_date.date() - oldest_date_as_date).days # Convert␣
↪current_date to date

# Display results
print("Current date:", current_date.date())
print("Oldest date:", oldest_date_as_date)
print("Number of days between current date and oldest date:", number_of_days)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\560927568.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Current date: 2024-10-11
Oldest date: 1906-11-11
Number of days between current date and oldest date: 43069

[74]: import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce') #␣
↪Use errors='coerce' to handle invalid parsing

# Get the current date as a Timestamp


current_date = pd.Timestamp.now()

# Get the oldest date from the dataset


oldest_date = df['Date_time'].min() # This is a Timestamp

# Calculate the number of days between current date and oldest date
number_of_days = (current_date - oldest_date).days # Both are Timestamps

# Display results
print("Current date:", current_date)
print("Oldest date:", oldest_date)
print("Number of days between current date and oldest date:", number_of_days)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\2957761521.py:4: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Current date: 2024-10-11 16:31:57.747711
Oldest date: 1906-11-11 00:00:00
Number of days between current date and oldest date: 43069

18. Write a Pandas program to get all the rows of the DataFrame between 1950-10-10 and 1960-10-10.
[75]: ### code here

import pandas as pd

# Load the ufo_sighting_data dataset from CSV file


df = pd.read_csv('ufo_sighting_data.csv')

# Convert 'Date_time' column to datetime


df['Date_time'] = pd.to_datetime(df['Date_time'], errors='coerce')

# Define the start and end dates


start_date = '1950-10-10'
end_date = '1960-10-10'

# Filter the DataFrame for dates between the specified range


filtered_df = df[(df['Date_time'] >= start_date) & (df['Date_time'] <=␣
↪end_date)]

# Display the filtered DataFrame


print("Filtered DataFrame between 1950-10-10 and 1960-10-10:")
print(filtered_df)

C:\Users\indra\AppData\Local\Temp\ipykernel_21872\122481063.py:6: DtypeWarning:
Columns (5,9) have mixed types. Specify dtype option on import or set
low_memory=False.
df = pd.read_csv('ufo_sighting_data.csv')
Filtered DataFrame between 1950-10-10 and 1960-10-10:
Date_time city state/province country \
2 1955-10-10 17:00:00 chester (uk/england) NaN gb
3 1956-10-10 21:00:00 edna tx us
480 1952-10-01 03:30:00 fukuoka (japan) NaN NaN
481 1952-10-01 12:00:00 kansas city mo us
482 1954-10-01 19:00:00 flatwoods wv us
… … … … …
79262 1960-09-05 21:00:00 buffalo ny us
79668 1958-09-07 19:00:00 arthur nd us
80101 1952-09-09 20:00:00 philadelphia pa us
80102 1954-09-09 12:30:00 beaumont tx us
80103 1956-09-09 05:55:00 norfolk va us

UFO_shape length_of_encounter_seconds described_duration_of_encounter \


2 circle 20 20 seconds
3 circle 20 1/2 hour
480 disk 1200 about 20 mins
481 disk 90 90 sec.

482 circle 60 1 minute
… … … …
79262 oval 180.0 3 minutes
79668 unknown 900.0 5-15 min.
80101 circle 180.0 3 minutes
80102 disk 300.0 5 minutes
80103 cigar 90.0 1.5 minutes

description date_documented \
2 Green/Orange circular disc over Chester&#44 En… 1/21/2008
3 My older brother and twin sister were leaving … 1/17/2004
480 UFO seen by multiple U. S. military personnel;… 12/7/2006
481 1952 daylight sighting of multiple discs in fo… 10/31/2008
482 I saw the craft go across the horizon. It app… 4/12/2013
… … …
79262 Precise movements of a &quot;craft&quot; appar… 9/15/2005
79668 Two lights of alternating color traveling and … 10/31/2003
80101 saucers in a line over Phila Pa. 12/12/2009
80102 Aprox. 30 Disk shaped UFOs fell out of clouds … 1/17/2004
80103 Unidentified Object Hovering over Interstate 2… 2/24/2007

latitude longitude
2 53.2 -2.916667
3 28.9783333 -96.645833
480 33.590355 130.401716
481 39.0997222 -94.578333
482 38.7230556 -80.650000
… … …
79262 42.886389 -78.878611
79668 47.104167 -97.217778
80101 39.952222 -75.164167
80102 30.085833 -94.101667
80103 36.846667 -76.285556

[444 rows x 11 columns]
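An equivalent filter uses `Series.between`, which is inclusive of both endpoints by default — a minimal sketch on a tiny synthetic `Date_time` column (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date_time': pd.to_datetime(
    ['1949-10-10', '1955-10-10', '1956-10-10', '1972-06-01'])})

# between() is inclusive on both ends by default
mask = df['Date_time'].between('1950-10-10', '1960-10-10')
print(df[mask])
```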

[ ]:

pandas-assignment-day-4

November 8, 2024

import the Diamond dataset


[24]: import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')
df

[24]: Unnamed: 0 carat cut color clarity depth table price x \


0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89
2 3 0.23 Good E VS1 56.9 65.0 327 4.05
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
… … … … … … … … … …
53935 53936 0.72 Ideal D SI1 60.8 57.0 2757 5.75
53936 53937 0.72 Good D SI1 63.1 55.0 2757 5.69
53937 53938 0.70 Very Good D SI1 62.8 60.0 2757 5.66
53938 53939 0.86 Premium H SI2 61.0 58.0 2757 6.15
53939 53940 0.75 Ideal D SI2 62.2 55.0 2757 5.83

y z
0 3.98 2.43
1 3.84 2.31
2 4.07 2.31
3 4.23 2.63
4 4.35 2.75
… … …
53935 5.76 3.50
53936 5.75 3.61
53937 5.68 3.56
53938 6.12 3.74
53939 5.87 3.64

[53940 rows x 11 columns]

1. Write a Pandas program to find the number of rows and columns and data type of each column
of diamonds Dataframe.

[4]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Get the number of rows and columns


num_rows, num_columns = df.shape

# Get the data type of each column


column_data_types = df.dtypes

# Display the results


print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
print("\nData type of each column:")
print(column_data_types)

Number of rows: 53940


Number of columns: 11

Data type of each column:


Unnamed: 0 int64
carat float64
cut object
color object
clarity object
depth float64
table float64
price int64
x float64
y float64
z float64
dtype: object

2. Write a Pandas program to summarize only the ‘object’ columns of the diamonds DataFrame.
[5]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Summarize only the 'object' columns


object_columns_summary = df.select_dtypes(include=['object']).describe()

# Display the summary of object columns
print("Summary of object columns:")
print(object_columns_summary)

Summary of object columns:


cut color clarity
count 53940 53940 53940
unique 5 7 8
top Ideal G SI1
freq 21551 11292 13065
3. Write a Pandas program to remove the second column of the diamonds Dataframe. (don’t use
original dataset)

[6]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Create a copy of the original DataFrame


df_copy = df.copy()

# Remove the second column (index 1)


df_copy.drop(df_copy.columns[1], axis=1, inplace=True)

# Display the first few rows of the modified DataFrame


print("DataFrame after removing the second column:")
print(df_copy.head())

DataFrame after removing the second column:


Unnamed: 0 cut color clarity depth table price x y z
0 1 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
4. Write a Pandas program to remove multiple rows at once (axis=0 refers to rows) from the diamonds DataFrame. (don’t use the original dataset)

[7]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file

df = pd.read_csv('diamonds.csv')

# Create a copy of the original DataFrame


df_copy = df.copy()

# Define the indices of the rows to be removed (e.g., removing rows 0, 1, and 2)
rows_to_remove = [0, 1, 2]

# Remove the specified rows


df_copy.drop(rows_to_remove, axis=0, inplace=True)

# Display the first few rows of the modified DataFrame


print("DataFrame after removing specified rows:")
print(df_copy.head())

DataFrame after removing specified rows:


Unnamed: 0 carat cut color clarity depth table price x \
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
5 6 0.24 Very Good J VVS2 62.8 57.0 336 3.94
6 7 0.24 Very Good I VVS1 62.3 57.0 336 3.95
7 8 0.26 Very Good H SI1 61.9 55.0 337 4.07

y z
3 4.23 2.63
4 4.35 2.75
5 3.96 2.48
6 3.98 2.47
7 4.11 2.53
5. Write a Pandas program to sort the ‘cut’ Series of the diamonds DataFrame in ascending order (returns a Series).
[8]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sort the 'cut' Series in ascending order and return it as a Series


sorted_cut_series = df['cut'].sort_values()

# Display the sorted Series


print("Sorted 'cut' Series in ascending order:")
print(sorted_cut_series)

Sorted 'cut' Series in ascending order:

3850 Fair
51464 Fair
51466 Fair
10237 Fair
10760 Fair

7402 Very Good
43101 Very Good
16893 Very Good
16898 Very Good
21164 Very Good
Name: cut, Length: 53940, dtype: object
6. Write a Pandas program to sort the entire diamonds DataFrame by the ‘carat’ Series in ascending
and descending order.
[9]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sort the DataFrame by the 'carat' Series in ascending order


sorted_df_ascending = df.sort_values(by='carat', ascending=True)

# Sort the DataFrame by the 'carat' Series in descending order


sorted_df_descending = df.sort_values(by='carat', ascending=False)

# Display the first few rows of the sorted DataFrames


print("DataFrame sorted by 'carat' in ascending order:")
print(sorted_df_ascending.head())

print("\nDataFrame sorted by 'carat' in descending order:")


print(sorted_df_descending.head())

DataFrame sorted by 'carat' in ascending order:


Unnamed: 0 carat cut color clarity depth table price x \
31593 31594 0.2 Premium E VS2 61.1 59.0 367 3.81
31597 31598 0.2 Ideal D VS2 61.5 57.0 367 3.81
31596 31597 0.2 Premium F VS2 62.6 59.0 367 3.73
31595 31596 0.2 Ideal E VS2 59.7 55.0 367 3.86
31594 31595 0.2 Premium E VS2 59.7 62.0 367 3.84

y z
31593 3.78 2.32
31597 3.77 2.33
31596 3.71 2.33

31595 3.84 2.30
31594 3.80 2.28

DataFrame sorted by 'carat' in descending order:


Unnamed: 0 carat cut color clarity depth table price x \
27415 27416 5.01 Fair J I1 65.5 59.0 18018 10.74
27630 27631 4.50 Fair J I1 65.8 58.0 18531 10.23
27130 27131 4.13 Fair H I1 64.8 61.0 17329 10.00
25999 26000 4.01 Premium J I1 62.5 62.0 15223 10.02
25998 25999 4.01 Premium I I1 61.0 61.0 15223 10.14

y z
27415 10.54 6.98
27630 10.16 6.72
27130 9.85 6.43
25999 9.94 6.24
25998 10.10 6.17
7. Write a Pandas program to filter the DataFrame rows to only show carat weight at least 0.3.
[10]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Filter the DataFrame for rows where the carat weight is at least 0.3
filtered_df = df[df['carat'] >= 0.3]

# Display the filtered DataFrame


print("Filtered DataFrame with carat weight at least 0.3:")
print(filtered_df)

Filtered DataFrame with carat weight at least 0.3:


Unnamed: 0 carat cut color clarity depth table price x \
4 5 0.31 Good J SI2 63.3 58.0 335 4.34
10 11 0.30 Good J SI1 64.0 55.0 339 4.25
13 14 0.31 Ideal J SI2 62.2 54.0 344 4.35
15 16 0.32 Premium E I1 60.9 58.0 345 4.38
16 17 0.30 Ideal I SI2 62.0 54.0 348 4.31
… … … … … … … … … …
53935 53936 0.72 Ideal D SI1 60.8 57.0 2757 5.75
53936 53937 0.72 Good D SI1 63.1 55.0 2757 5.69
53937 53938 0.70 Very Good D SI1 62.8 60.0 2757 5.66
53938 53939 0.86 Premium H SI2 61.0 58.0 2757 6.15
53939 53940 0.75 Ideal D SI2 62.2 55.0 2757 5.83

y z
4 4.35 2.75
10 4.28 2.73
13 4.37 2.71
15 4.42 2.68
16 4.34 2.68
… … …
53935 5.76 3.50
53936 5.75 3.61
53937 5.68 3.56
53938 6.12 3.74
53939 5.87 3.64

[52341 rows x 11 columns]


8. Write a Pandas program to find the details of the diamonds where length>5, width>5 and
depth>5.
[11]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Filter the DataFrame for diamonds where length, width, and depth are all␣
↪greater than 5

# Note: The column names in the diamonds dataset for length, width, and depth␣
↪are 'x', 'y', and 'z'

filtered_diamonds = df[(df['x'] > 5) & (df['y'] > 5) & (df['z'] > 5)]

# Display the filtered DataFrame


print("Diamonds where length > 5, width > 5, and depth > 5:")
print(filtered_diamonds)

Diamonds where length > 5, width > 5, and depth > 5:


Unnamed: 0 carat cut color clarity depth table price x \
11778 11779 1.83 Fair J I1 70.0 58.0 5083 7.34
13002 13003 2.14 Fair J I1 69.4 57.0 5405 7.74
13118 13119 2.15 Fair J I1 65.5 57.0 5430 8.01
13562 13563 1.96 Fair F I1 66.6 60.0 5554 7.59
13757 13758 2.22 Fair J I1 66.7 56.0 5607 8.04
… … … … … … … … … …
27748 27749 2.00 Very Good G SI1 63.5 56.0 18818 7.90
27749 27750 2.29 Premium I VS2 60.8 60.0 18823 8.50
48410 48411 0.51 Very Good E VS1 61.8 54.7 1970 5.12
49189 49190 0.51 Ideal E VS1 61.8 55.0 2075 5.15
49905 49906 0.50 Very Good G VVS1 63.7 58.0 2180 5.01

y z
11778 7.28 5.12
13002 7.70 5.36
13118 7.95 5.23
13562 7.56 5.04
13757 8.02 5.36
… … …
27748 7.97 5.04
27749 8.47 5.16
48410 5.15 31.80
49189 31.80 5.12
49905 5.04 5.06

[1457 rows x 11 columns]


9. Write a Pandas program to calculate the mean of each row of diamonds DataFrame.
[12]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the mean of each row


# You may want to exclude non-numeric columns (like 'cut', 'color', 'clarity')␣
↪before calculating the mean

row_means = df.select_dtypes(include='number').mean(axis=1)

# Add the row means to the DataFrame (optional)


df['Row_Mean'] = row_means

# Display the original DataFrame with the new Row_Mean column


print("Diamonds DataFrame with Row Means:")
print(df[['carat', 'x', 'y', 'z', 'Row_Mean']].head()) # Displaying relevant␣
↪columns for clarity

Diamonds DataFrame with Row Means:


carat x y z Row_Mean
0 0.23 3.95 3.98 2.43 56.76125
1 0.21 3.89 3.84 2.31 57.38125
2 0.23 4.05 4.07 2.31 57.82000
3 0.29 4.20 4.23 2.63 58.71875
4 0.31 4.34 4.35 2.75 59.13125
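One caveat: `select_dtypes(include='number')` also picks up the `Unnamed: 0` index column and `price`, so the row means above are dominated by those large values. Restricting the mean to the physical measurement columns gives more interpretable numbers — a sketch with illustrative sample values:

```python
import pandas as pd

# Illustrative values modeled on the first two diamonds rows
df = pd.DataFrame({'Unnamed: 0': [1, 2], 'carat': [0.23, 0.21],
                   'price': [326, 326],
                   'x': [3.95, 3.89], 'y': [3.98, 3.84], 'z': [2.43, 2.31]})

# Mean over just the measurement columns, not the index or price
dim_mean = df[['carat', 'x', 'y', 'z']].mean(axis=1)
print(dim_mean)
```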
10. Write a Pandas program to calculate the mean price for each cut and find the top 3 cuts with the highest mean price in the diamonds DataFrame.

[13]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the mean price for each cut


mean_price_by_cut = df.groupby('cut')['price'].mean()

# Find the top 3 cuts with the highest mean price


top_3_cuts = mean_price_by_cut.nlargest(3)

# Display the results


print("Top 3 cuts with the highest mean price:")
print(top_3_cuts)

Top 3 cuts with the highest mean price:


cut
Premium 4584.257704
Fair 4358.757764
Very Good 3981.759891
Name: price, dtype: float64

11. Write a Pandas program to calculate count, minimum, maximum price for each
cut of diamonds DataFrame.
[14]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate count, minimum, and maximum price for each cut


cut_price_stats = df.groupby('cut')['price'].agg(['count', 'min', 'max'])

# Display the results


print("Count, minimum, and maximum price for each cut:")
print(cut_price_stats)

Count, minimum, and maximum price for each cut:


count min max
cut
Fair 1610 337 18574
Good 4906 327 18788
Ideal 21551 326 18806
Premium 13791 326 18823

Very Good 12082 336 18818
12. Write a Pandas program to display and count the unique values in cut series of diamonds
DataFrame.
[15]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Display unique values in the 'cut' series


unique_cuts = df['cut'].unique()
print("Unique values in the 'cut' series:")
print(unique_cuts)

# Count unique values in the 'cut' series


cut_counts = df['cut'].value_counts()
print("\nCount of unique values in the 'cut' series:")
print(cut_counts)

Unique values in the 'cut' series:


['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']

Count of unique values in the 'cut' series:


cut
Ideal 21551
Premium 13791
Very Good 12082
Good 4906
Fair 1610
Name: count, dtype: int64
13. Write a Pandas program to count the number of missing values in each Series of diamonds
DataFrame.
[16]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Count the number of missing values in each Series (column)


missing_values_count = df.isnull().sum()

# Display the count of missing values


print("Number of missing values in each column:")

print(missing_values_count)

Number of missing values in each column:


Unnamed: 0 0
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
price 0
x 0
y 0
z 0
dtype: int64
14. Write a Pandas program to calculate the product of x, y and z for each cut of the diamonds DataFrame.
[17]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Calculate the product of x, y, and z and create a new column 'xyz_product'


df['xyz_product'] = df['x'] * df['y'] * df['z']

# Group by 'cut' and calculate the sum of the 'xyz_product' for each cut
cut_xyz_product = df.groupby('cut')['xyz_product'].sum()

# Display the results


print("Total product of x, y, and z for each cut:")
print(cut_xyz_product)

Total product of x, y, and z for each cut:


cut
Fair 2.655704e+05
Good 6.684782e+05
Ideal 2.486876e+06
Premium 2.000414e+06
Very Good 1.582739e+06
Name: xyz_product, dtype: float64

15. Write a Pandas program to read rows 0 through 2 (inclusive), columns ‘color’ and
‘price’ of diamonds DataFrame.

[18]: ## code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Select rows 0 through 2 (inclusive) and columns 'color' and 'price'


selected_data = df.loc[0:2, ['color', 'price']]

# Display the selected data


print("Selected rows and columns from the diamonds DataFrame:")
print(selected_data)

Selected rows and columns from the diamonds DataFrame:


color price
0 E 326
1 E 326
2 E 327
16. Write a Pandas program to read rows in positions 0 and 1, columns in positions 0 and 3 of
diamonds DataFrame.
[19]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Select rows in positions 0 and 1 and columns in positions 0 and 3


selected_data = df.iloc[0:2, [0, 3]] # 0:2 selects rows 0 and 1; [0, 3]␣
↪selects columns 0 and 3

# Display the selected data


print("Selected rows and columns from the diamonds DataFrame:")
print(selected_data)

Selected rows and columns from the diamonds DataFrame:


Unnamed: 0 color
0 1 E
1 2 E

17. Write a Pandas program to get randomly sample rows from diamonds DataFrame.
[20]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file
df = pd.read_csv('diamonds.csv')

# Get a random sample of 5 rows from the diamonds DataFrame


random_sample = df.sample(n=5, random_state=1) # Set random_state for␣
↪reproducibility

# Display the random sample


print("Random sample of rows from the diamonds DataFrame:")
print(random_sample)

Random sample of rows from the diamonds DataFrame:


Unnamed: 0 carat cut color clarity depth table price x \
2714 2715 0.33 Ideal H SI1 61.7 55.0 564 4.43
14653 14654 1.20 Ideal I VS2 62.1 57.0 5914 6.78
52760 52761 0.62 Ideal D VS1 61.0 57.0 2562 5.51
48658 48659 0.34 Good H VS2 63.1 56.0 537 4.41
14812 14813 1.20 Ideal E SI2 62.5 55.0 5964 6.77

y z
2714 4.46 2.74
14653 6.71 4.19
52760 5.54 3.37
48658 4.46 2.80
14812 6.84 4.25
18. Write a Pandas program to sample 75% of the diamonds DataFrame’s rows without replacement and store the remaining 25% of the rows in another DataFrame.
[21]: ### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Sample 75% of the rows without replacement


sampled_df = df.sample(frac=0.75, random_state=1) # frac=0.75 means 75%

# Store the remaining 25% of the rows in another DataFrame


remaining_df = df.drop(sampled_df.index)

# Display the shapes of the sampled and remaining DataFrames


print("Shape of the sampled DataFrame (75%):", sampled_df.shape)
print("Shape of the remaining DataFrame (25%):", remaining_df.shape)

# Optional: Display the first few rows of each DataFrame


print("\nSampled DataFrame (75%):")

print(sampled_df.head())

print("\nRemaining DataFrame (25%):")


print(remaining_df.head())

Shape of the sampled DataFrame (75%): (40455, 11)


Shape of the remaining DataFrame (25%): (13485, 11)

Sampled DataFrame (75%):


Unnamed: 0 carat cut color clarity depth table price x \
2714 2715 0.33 Ideal H SI1 61.7 55.0 564 4.43
14653 14654 1.20 Ideal I VS2 62.1 57.0 5914 6.78
52760 52761 0.62 Ideal D VS1 61.0 57.0 2562 5.51
48658 48659 0.34 Good H VS2 63.1 56.0 537 4.41
14812 14813 1.20 Ideal E SI2 62.5 55.0 5964 6.77

y z
2714 4.46 2.74
14653 6.71 4.19
52760 5.54 3.37
48658 4.46 2.80
14812 6.84 4.25

Remaining DataFrame (25%):


Unnamed: 0 carat cut color clarity depth table price x \
9 10 0.23 Very Good H VS1 59.4 61.0 338 4.00
14 15 0.20 Premium E SI2 60.2 62.0 345 3.79
15 16 0.32 Premium E I1 60.9 58.0 345 4.38
18 19 0.30 Good J SI1 63.8 56.0 351 4.23
20 21 0.30 Good I SI2 63.3 56.0 351 4.26

y z
9 4.05 2.39
14 3.75 2.27
15 4.42 2.68
18 4.26 2.71
20 4.30 2.71
19. Write a Pandas program to read the diamonds DataFrame and detect duplicate values in the color column.
[22]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Check for duplicates in the 'color' column

duplicates = df[df['color'].duplicated(keep=False)]  # keep=False marks all duplicates as True

# Display the duplicate colors


print("Duplicate colors in the diamonds DataFrame:")
print(duplicates[['color']].drop_duplicates())

Duplicate colors in the diamonds DataFrame:


color
0 E
3 I
4 J
7 H
12 F
25 G
28 D
20. Write a Pandas program to count the duplicate rows of the diamonds DataFrame.
[23]: #### code here

import pandas as pd

# Load the diamonds dataset from CSV file


df = pd.read_csv('diamonds.csv')

# Count duplicate rows in the DataFrame


duplicate_count = df.duplicated().sum() # Returns the count of duplicate rows

# Display the count of duplicate rows


print("Number of duplicate rows in the diamonds DataFrame:", duplicate_count)

Number of duplicate rows in the diamonds DataFrame: 0

[ ]:

