Data Science using Pandas
The name "Pandas" is derived from "panel data" (PANel DAta).
It has become a popular tool for data analysis.
It provides highly optimized performance, with back-end source code written in C or Python.
It makes data analysis simple and easy.
Pandas offers two basic data structures:
1. Series
2. DataFrame
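To make the two structures concrete, here is a minimal sketch. The names and values (medal_counts, athletes, "A", "B", "C") are made up for illustration and do not come from the worksheet's dataset.

```python
import pandas as pd

# A Series is a one-dimensional labeled array (illustrative values)
medal_counts = pd.Series([3, 1, 2], index=["Gold", "Silver", "Bronze"])

# A DataFrame is a two-dimensional table; each column is a Series
athletes = pd.DataFrame({
    "name": ["A", "B", "C"],
    "medal_type": ["Gold", "Silver", "Gold"],
})

print(medal_counts["Gold"])      # access a Series value by its label
print(athletes["medal_type"])    # selecting one column returns a Series
print(athletes.shape)            # (number of rows, number of columns)
```

A Series carries one column of values with an index, while a DataFrame lines up several such columns on a shared index.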
We will be working with Jaipur weather data obtained from Kaggle, a platform where data enthusiasts gather and
share knowledge. The data has been cleaned and simplified so that we can focus on data visualization instead of
data cleaning. Our data is stored in the file named JaipurFinalCleanData.csv (CSV, or Comma-Separated Values, is a
file format that stores a set of data with the values separated by commas). This file contains weather information
for Jaipur and is saved at the same location as the notebook.
Today, we will learn how to use Python to open CSV files.
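As a rough illustration of what a CSV file looks like to Pandas, the sketch below reads a tiny CSV from an in-memory string instead of a file on disk. The content of csv_text is hypothetical; only pd.read_csv and io.StringIO are real library calls.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first line is the header, each later line is one row
csv_text = """name,country_code,medal_type
Athlete A,IND,Bronze
Athlete B,USA,Gold
"""

# read_csv accepts a file path or a file-like object; StringIO stands in for a file here
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())   # column names taken from the header line
print(len(sample))               # number of data rows
```

With a real file, the only change is to pass the file name to pd.read_csv instead of the StringIO object.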
import pandas as pd
# Save the CSV file into a variable, which we will call dataframe
dataframe = pd.read_csv("medallists.csv")
# dataframe.head() returns the first 5 rows of data
dataframe.head()
### Display the first 10 rows of data by modifying the function above
dataframe.head(10)
### If you have a large DataFrame with many rows, Pandas will
### only print the first 5 rows and the last 5 rows
print(dataframe)
### Use to_string() to print the entire DataFrame.
print(dataframe.to_string())
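The difference between head() and to_string() can be tried without the CSV file. This is a small sketch on a made-up 100-row DataFrame; the column name "value" is an assumption for illustration.

```python
import pandas as pd

# A made-up DataFrame with 100 rows
big = pd.DataFrame({"value": range(100)})

first_rows = big.head()        # first 5 rows by default
first_ten = big.head(10)       # first 10 rows
full_text = big.to_string()    # text containing every row, never truncated

print(len(first_rows), len(first_ten))
print(len(full_text.splitlines()))   # one header line plus one line per row
```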
### Sort values with the sort_values() function.
medallists = dataframe.sort_values(by='medal_code', ascending=False)
print(medallists.head(5))
### Sort the values in ascending order of country code and print the first 5 rows
medallists = dataframe.sort_values(by='country_code', ascending=True)
print(medallists.head(5))
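The sorting steps above can be sketched on a tiny in-memory table. The DataFrame scores and its values are made up; the column names simply mirror the ones used above.

```python
import pandas as pd

scores = pd.DataFrame({
    "country_code": ["USA", "IND", "FRA", "IND"],
    "medal_code": [1, 3, 2, 2],
})

# Sort by one column, largest value first
by_medal = scores.sort_values(by="medal_code", ascending=False)

# Sort by several columns: country_code A-Z, then medal_code ascending within each country
by_country_then_medal = scores.sort_values(by=["country_code", "medal_code"])

# sort_values returns a new DataFrame; the original row order is unchanged
print(by_medal["medal_code"].tolist())
print(by_country_then_medal["country_code"].tolist())
print(scores["medal_code"].tolist())
```

Passing a list to by is a convenient way to break ties when the first column has repeated values.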
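### Pandas provides an easy way for us to drop columns using the drop() method.
### axis=0 (or 'index') drops rows; axis=1 (or 'columns') drops columns.
### Note: max_dew_pt_2 is a column of the Jaipur weather data; drop() raises a KeyError if the column is not in the DataFrame.
dataframe = dataframe.drop(["max_dew_pt_2"], axis=1)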
### Drop the following columns: min_dew_pt_2, max_pressure_2, min_pressure_2
dataframe = dataframe.drop(["min_dew_pt_2", "max_pressure_2", "min_pressure_2"], axis=1)
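Dropping columns can be tried on a small made-up table. In this sketch the DataFrame weather, its values, and the columns date and mean_temperature are assumptions for illustration; only max_dew_pt_2 is taken from the text above.

```python
import pandas as pd

weather = pd.DataFrame({
    "date": ["2016-05-01", "2016-05-02"],
    "mean_temperature": [34, 35],
    "max_dew_pt_2": [10, 12],
})

# columns=[...] is equivalent to drop([...], axis=1)
trimmed = weather.drop(columns=["max_dew_pt_2"])
print(trimmed.columns.tolist())

# errors="ignore" skips labels that are not present instead of raising a KeyError
safe = weather.drop(columns=["not_a_column"], errors="ignore")
print(safe.columns.tolist())

# drop returns a new DataFrame; weather itself still has all three columns
print(weather.columns.tolist())
```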
Exercise
Question 1:
Using Pandas, calculate the total number of medals won by each country. Create a bar chart to
visualize the top 10 countries with the most medals.
Hint:
Group the data by the country column and count the number of medals.
Use the plot.bar() function from Pandas or Matplotlib to create the bar chart.
Question 2:
Analyze the distribution of medal types (Gold, Silver, Bronze) across all countries. Create a pie chart
that shows the proportion of each medal type.
Hint:
Use the medal_type column to count the occurrences of each medal type.
Use the plot.pie() function to create the pie chart.
Question 3:
Using Pandas, find out how the number of medals awarded changes over time. Create a line chart that
plots the number of medals awarded per day.
Hint:
Group the data by medal_date and count the number of medals awarded each day.
Use the plot.line() function to create the line chart.
Question 4:
Compare the performance of male and female athletes by counting the total number of medals won by
each gender. Create a bar chart to visualize this comparison.
Hint:
Group the data by gender and count the number of medals.
Use the plot.bar() function to create the bar chart.
Question 5:
Analyze the performance in a specific event, such as the "Men's Individual Time Trial." Count the
number of each type of medal won in this event and create a bar chart to visualize the results.
Hint:
Filter the data for the desired event using the event column.
Group by medal_type to count the number of each type of medal.
Use the plot.bar() function to create the bar chart.
Solution
Question 1
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Group by country and count the number of medals
medals_by_country = df.groupby('country')['medal_type'].count().sort_values(ascending=False)
# Select the top 10 countries
top_10_countries = medals_by_country.head(10)
# Create a bar chart
top_10_countries.plot(kind='bar', color='skyblue')
plt.title('Top 10 Countries with Most Medals')
plt.xlabel('Country')
plt.ylabel('Number of Medals')
plt.xticks(rotation=45)
plt.show()
Question 2
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Count the occurrences of each medal type
medal_distribution = df['medal_type'].value_counts()
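# Create a pie chart, choosing each slice's color from its label
# (value_counts() orders the labels by count, not as Gold, Silver, Bronze)
colors = ['gold' if 'Gold' in m else 'silver' if 'Silver' in m else '#cd7f32' for m in medal_distribution.index]
medal_distribution.plot(kind='pie', autopct='%1.1f%%', colors=colors)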
plt.title('Distribution of Medal Types')
plt.ylabel('') # Hide the y-label
plt.show()
Question 3
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Convert medal_date to datetime format
df['medal_date'] = pd.to_datetime(df['medal_date'])
# Group by medal_date and count the number of medals awarded each day
medals_by_date = df.groupby('medal_date')['medal_type'].count()
# Create a line chart
medals_by_date.plot(kind='line', color='green', marker='o')
plt.title('Number of Medals Awarded Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Medals')
plt.grid(True)
plt.show()
Question 4
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Group by gender and count the number of medals
medals_by_gender = df.groupby('gender')['medal_type'].count()
# Create a bar chart
medals_by_gender.plot(kind='bar', color=['blue', 'pink'])
plt.title('Medals Won by Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Medals')
plt.xticks(rotation=0)
plt.show()
Question 5
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Filter the data for the "Men's Individual Time Trial" event
event_data = df[df['event'] == "Men's Individual Time Trial"]
# Group by medal_type and count the number of each type of medal
medals_in_event = event_data.groupby('medal_type')['medal_type'].count()
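# Create a bar chart, choosing each bar's color from its label
# (groupby() sorts the labels alphabetically, not as Gold, Silver, Bronze)
colors = ['gold' if 'Gold' in m else 'silver' if 'Silver' in m else '#cd7f32' for m in medals_in_event.index]
medals_in_event.plot(kind='bar', color=colors)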
plt.title("Medals in Men's Individual Time Trial")
plt.xlabel('Medal Type')
plt.ylabel('Number of Medals')
plt.xticks(rotation=0)
plt.show()
Bonus
Using the provided dataset of Olympic medalists, write a Python script to display a bar chart showing the
number of Gold, Silver, and Bronze medals won by India.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("medallists.csv")
# Filter the data for medals won by India
india_medals = df[df['country'] == 'India']
# Group by medal_type and count the number of medals
india_medals_count = india_medals.groupby('medal_type')['medal_type'].count()
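# Create a bar chart, choosing each bar's color from its label
# (groupby() sorts the labels alphabetically, and a medal type may be missing)
colors = ['gold' if 'Gold' in m else 'silver' if 'Silver' in m else '#cd7f32' for m in india_medals_count.index]
india_medals_count.plot(kind='bar', color=colors)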
plt.title('Medals Won by India')
plt.xlabel('Medal Type')
plt.ylabel('Number of Medals')
plt.xticks(rotation=0)
plt.show()
HOT:
Display the same chart for a country entered by the user.
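One possible starting point is sketched below. It only covers the filtering and counting step. The helper medal_counts_for and the toy DataFrame are hypothetical and stand in for the real medallists.csv data; the column names country and medal_type are the ones used in the Bonus solution.

```python
import pandas as pd


def medal_counts_for(df, country):
    """Return the number of medals of each type won by `country` (case-insensitive match)."""
    wanted = country.strip().lower()
    rows = df[df["country"].str.strip().str.lower() == wanted]
    return rows.groupby("medal_type")["medal_type"].count()


# Toy stand-in for medallists.csv (values are made up)
toy = pd.DataFrame({
    "country": ["India", "India", "France", "india "],
    "medal_type": ["Bronze", "Silver", "Gold", "Bronze"],
})

# In the real script the name would come from the user:
# country = input("Enter a country: ")
counts = medal_counts_for(toy, "  INDIA ")
print(counts)
```

The returned Series can be plotted with plot(kind='bar') in the same way as in the Bonus solution. Checking counts.empty first lets the script print a message when the entered country has no medals in the data.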