Machine Learning Part 03
What is Matplotlib?
Matplotlib is a popular data visualization library in Python. It provides a wide range of tools
for creating various types of plots and charts, making it a valuable tool for data analysis,
scientific research, and data presentation. Matplotlib allows you to create high-quality,
customizable plots and figures for a variety of purposes, including line plots, bar charts,
scatter plots, histograms, and more.
Matplotlib is highly customizable and can be used to control almost every aspect of your
plots, from the colors and styles to labels and legends. It provides both a functional and an
object-oriented interface for creating plots, making it suitable for a wide range of users,
from beginners to advanced data scientists and researchers.
Matplotlib can be used in various contexts, including Jupyter notebooks, standalone Python
scripts, and integration with web applications and GUI frameworks. It also works well with
other Python libraries commonly used in data analysis and scientific computing, such as
NumPy and Pandas.
To use Matplotlib, you typically need to import the library in your Python code, create the
desired plot or chart, and then display or save it as needed. Here's a simple example of
creating a basic line plot using Matplotlib:
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 10]
# Draw the line plot and display it
plt.plot(x, y)
plt.show()
This is just a basic introduction to Matplotlib. The library is quite versatile, and you can
explore its documentation and tutorials to learn more about its capabilities and how to
create various types of visualizations for your data.
Line Plot
A 2D line plot is one of the most common types of plots in Matplotlib. It's used to visualize
data with two continuous variables, typically representing one variable on the x-axis and
another on the y-axis, and connecting the data points with lines. This type of plot is useful
for showing trends, relationships, or patterns in data over a continuous range.
Bivariate Analysis
categorical -> numerical and numerical -> numerical
Use case - Time series data
In [ ]: price = [48000,54000,57000,49000,47000,45000]
year = [2015,2016,2017,2018,2019,2020]
plt.plot(year,price)
[<matplotlib.lines.Line2D at 0x28b7c3742e0>]
Out[ ]:
Real-world Dataset
In [ ]: batsman = pd.read_csv(r'Data\Day45\sharma-kohli.csv')  # raw string keeps the backslashes literal
In [ ]: batsman.head()
[<matplotlib.lines.Line2D at 0x28b7c40ca60>]
Out[ ]:
Multiple Plots:
It's possible to create multiple lines on a single plot, making it easy to compare multiple
datasets or variables. In the example, both Rohit Sharma's and Virat Kohli's career runs are
plotted on the same graph.
[<matplotlib.lines.Line2D at 0x28b7d4e2610>]
Out[ ]:
In [ ]: # labels title
plt.plot(batsman['index'],batsman['V Kohli'])
plt.plot(batsman['index'],batsman['RG Sharma'])
In [ ]: #colors
plt.plot(batsman['index'],batsman['V Kohli'],color='Red')
plt.plot(batsman['index'],batsman['RG Sharma'],color='Purple')
You can specify different colors for each line in the plot. In the example, colors like 'Red' and
'Purple' are used to differentiate the lines.
You can change the style and width of the lines with the linestyle and linewidth parameters.
Common line styles include 'solid', 'dotted', 'dashed', and 'dashdot'. In the example, 'solid'
and 'dashdot' line styles are used.
In [ ]: # Marker
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
Markers are used to highlight data points on the line plot. You can customize markers' style
and size. In the example, markers like 'D' and 'o' are used with different colors.
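The styling options described above can be combined in a single call. The following is a minimal sketch on made-up numbers: the season list, the run totals, and the player names are illustrative assumptions and are not taken from the sharma-kohli.csv file.

```python
import matplotlib.pyplot as plt

# Illustrative values only (not real career figures)
seasons = [2008, 2009, 2010, 2011, 2012]
player_a = [100, 250, 300, 550, 350]
player_b = [400, 350, 400, 370, 430]

# color, linestyle, linewidth, marker and markersize are all keyword arguments of plt.plot
line_a, = plt.plot(seasons, player_a, color='#D9F10F', linestyle='solid',
                   linewidth=3, marker='D', markersize=10, label='Player A')
line_b, = plt.plot(seasons, player_b, color='#FC00D6', linestyle='dashdot',
                   linewidth=2, marker='o', label='Player B')

plt.title('Runs per season (illustrative data)')
plt.xlabel('Season')
plt.ylabel('Runs')
plt.xticks(seasons)   # show every season as a tick
plt.legend()          # uses the label= values given above
plt.show()
```

The label arguments only become visible once plt.legend() is called, which is why the two usually appear together.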
In [ ]: # grid
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
plt.grid()
Adding a grid to the plot can make it easier to read and interpret the data. The grid helps in
aligning the data points with the tick marks on the axes.
In [ ]: # show
plt.plot(batsman['index'],batsman['V Kohli'],color='#D9F10F',linestyle='solid',line
plt.plot(batsman['index'],batsman['RG Sharma'],color='#FC00D6',linestyle='dashdot',
plt.grid()
plt.show()
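After customizing your plot, use plt.show() to display it. In a standalone Python script this
call is what opens the figure window. In Jupyter notebooks figures are usually rendered
automatically at the end of a cell, and plt.show() mainly suppresses the text representation
of the last returned object (such as [<matplotlib.lines.Line2D ...>]).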
2D line plots are valuable for visualizing time series data, comparing trends in multiple
datasets, and exploring the relationship between two continuous variables. Customization
options in Matplotlib allow you to create visually appealing and informative plots for data
analysis and presentation.
Scatter Plot
A scatter plot, also known as a scatterplot or scatter chart, is a type of data visualization used
in statistics and data analysis. It's used to display the relationship between two variables by
representing individual data points as points on a two-dimensional graph. Each point on the
plot corresponds to a single data entry with values for both variables, making it a useful tool
for identifying patterns, trends, clusters, or outliers in data.
Bivariate Analysis
numerical vs numerical
Use case - Finding correlation
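In [ ]: x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
In [ ]: # x needs 50 values to match the 50 random offsets added below;
# the linspace range is an assumption, since the original definition is missing
x = np.linspace(-10,10,50)
y = 10*x + 3 + np.random.randint(0,300,50)
y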
In [ ]: plt.scatter(x,y)
<matplotlib.collections.PathCollection at 0x264627ccc70>
Out[ ]:
In [ ]: import numpy as np
import pandas as pd
In [ ]: # marker
plt.scatter(df['avg'],df['strike_rate'],color='red',marker='+')
plt.title('Avg and SR analysis of Top 50 Batsman')
plt.xlabel('Average')
plt.ylabel('SR')
Scatter plots are particularly useful for visualizing the distribution of data, identifying
correlations or relationships between variables, and spotting outliers. You can adjust the
appearance and characteristics of the scatter plot to suit your needs, including marker size,
color, and transparency. This makes scatter plots a versatile tool for data exploration and
analysis.
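As a rough illustration of the marker size, color, and transparency options mentioned above, the sketch below uses synthetic data. The random generator, seed, and noise level are assumptions chosen only to produce a visible positive correlation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)        # fixed seed so the figure is reproducible
x = rng.uniform(0, 10, 50)
y = 3 * x + rng.normal(0, 4, 50)      # roughly linear relationship plus noise

# s sets the marker area, c colors each point by a value, alpha sets the opacity
points = plt.scatter(x, y, s=40, c=y, cmap='viridis', alpha=0.6, marker='o')
plt.colorbar(points, label='y value')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot with size, color, and transparency')
plt.show()

# Pearson correlation coefficient, a numeric check of what the plot suggests
corr = np.corrcoef(x, y)[0, 1]
print(round(corr, 2))
```

Lower alpha values help when many points overlap, because dense regions then appear darker than sparse ones.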
Bar plot
A bar plot, also known as a bar chart or bar graph, is a type of data visualization that is used
to represent categorical data with rectangular bars. Each bar's length or height is
proportional to the value it represents. Bar plots are typically used to compare and display
the relative sizes or quantities of different categories or groups.
Bivariate Analysis
Numerical vs Categorical
Use case - Aggregate analysis of groups
plt.bar(colors,children,color='Purple')
In [ ]: plt.bar(np.arange(df.shape[0]) - 0.2,df['2015'],width=0.2,color='yellow')
plt.bar(np.arange(df.shape[0]),df['2016'],width=0.2,color='red')
plt.bar(np.arange(df.shape[0]) + 0.2,df['2017'],width=0.2,color='blue')
plt.xticks(np.arange(df.shape[0]), df['batsman'])
plt.show()
Bar plots are useful for comparing the values of different categories and for showing the
distribution of data within each category. They are commonly used in various fields,
including business, economics, and data analysis, to make comparisons and convey
information about categorical data. You can customize bar plots to make them more visually
appealing and informative.
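The grouped bar cell above works by shifting each series left or right of the category position by one bar width. A self-contained sketch of that idea follows; the player names and run totals are hypothetical and stand in for the DataFrame used in the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-season totals for three players
players = ['Player A', 'Player B', 'Player C']
runs_2015 = [500, 480, 510]
runs_2016 = [970, 580, 680]
runs_2017 = [300, 330, 210]

positions = np.arange(len(players))   # one integer slot per category: 0, 1, 2
width = 0.2                           # width of a single bar

# Shift each year's bars by one bar width so they sit side by side
bars_2015 = plt.bar(positions - width, runs_2015, width=width, color='yellow', label='2015')
bars_2016 = plt.bar(positions, runs_2016, width=width, color='red', label='2016')
bars_2017 = plt.bar(positions + width, runs_2017, width=width, color='blue', label='2017')

plt.xticks(positions, players)        # replace the numeric ticks with category names
plt.ylabel('Runs')
plt.legend()
plt.show()
```

Keeping the product of the number of series and the bar width below 1 leaves a gap between neighboring groups.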
Histogram
A histogram is a chart that shows the distribution of numerical data. The data is grouped
into contiguous number ranges, called bins or class intervals, and each range is drawn as a
vertical bar. The horizontal axis shows the ranges, and the vertical axis (frequency) shows
how many data points fall into each range.
More formally, a histogram is a set of rectangles whose bases lie on the intervals between
class boundaries and whose areas are proportional to the frequencies of the corresponding
classes. When all bins have the same width, the bar heights are proportional to the
frequencies as well. Histograms provide insight into the shape, central tendency, and
spread of a dataset.
Univariate Analysis
Numerical col
Use case - Frequency Count
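To make the idea of bins concrete, the sketch below counts how many values fall into each interval using np.histogram, which applies the same binning rule as plt.hist. It uses the small list and bin edges from the first plt.hist cell in this section.

```python
import numpy as np

data = [32, 45, 56, 10, 15, 27, 61]
bin_edges = [10, 25, 40, 55, 70]

# counts[i] is the number of values with edges[i] <= value < edges[i + 1];
# the last bin also includes its right edge
counts, edges = np.histogram(data, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left} to {right}: {count} value(s)')
```

Here the counts are 2, 2, 1, and 2, so the histogram has four bars, three of them with height 2.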
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: # simple data
data = [32,45,56,10,15,27,61]
plt.hist(data,bins=[10,25,40,55,70])
In [ ]: # on some data
df = pd.read_csv(r'Data\Day48\Vk.csv')  # raw string keeps the backslashes literal
df
0 12 62
1 17 28
2 20 64
3 27 0
4 30 10
136 624 75
138 632 54
139 633 0
140 636 54
In [ ]: plt.hist(df['batsman_runs'])
plt.show()
In [ ]: # handling bins
plt.hist(df['batsman_runs'],bins=[0,10,20,30,40,50,60,70,80,90,100,110,120],color='
plt.show()
Pie Chart
A pie chart is a circular graph divided into slices to illustrate numerical proportions. The
arc length of each slice, and consequently its central angle and area, is proportional to the
quantity it represents.
All slices together make up the whole, that is, 100 percent and 360 degrees. Pie charts are
often used to show how a total splits across categories: each category is drawn as a
“slice of the pie” whose size is directly proportional to the category's share of the total.
Univariate/Bivariate Analysis
Categorical vs numerical
Use case - To find contribution on a standard scale
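The proportionality rule can be checked by hand. The sketch below converts each value into its share of the total and the matching central angle, then draws the pie; it reuses the marks list from the simple example in this section.

```python
import matplotlib.pyplot as plt

data = [23, 45, 100, 20, 49]
subjects = ['eng', 'science', 'maths', 'sst', 'hindi']

total = sum(data)
shares = [value / total for value in data]    # fraction of the whole per slice
angles = [share * 360 for share in shares]    # central angle in degrees per slice

for subject, share, angle in zip(subjects, shares, angles):
    print(f'{subject}: {share:.1%} of the pie, {angle:.1f} degrees')

# autopct prints the same percentages inside the slices
wedges, texts, autotexts = plt.pie(data, labels=subjects, autopct='%0.1f%%')
plt.show()
```

Because the shares always sum to 1, the angles always sum to 360 degrees regardless of the raw values.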
In [ ]: # simple data
data = [23,45,100,20,49]
subjects = ['eng','science','maths','sst','hindi']
plt.pie(data,labels=subjects)
plt.show()
In [ ]: # dataset
df = pd.read_csv(r'Data\Day48\Gayle-175.csv')  # raw string keeps the backslashes literal
df
          batsman  batsman_runs
0  AB de Villiers            31
1        CH Gayle           175
2       R Rampaul             0
3       SS Tiwary             2
4      TM Dilshan            33
5         V Kohli            11
In [ ]: plt.pie(df['batsman_runs'],labels=df['batsman'],autopct='%0.1f%%')
plt.show()
In [ ]: # explode shadow
plt.pie(df['batsman_runs'],labels=df['batsman'],autopct='%0.1f%%',explode=[0.3,0,0,
plt.show()
Advanced Matplotlib (Part 1)
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Colored Scatterplots
In [ ]: iris = pd.read_csv(r'Data\Day49\iris.csv')  # raw string keeps the backslashes literal
iris.sample(5)
In [ ]: iris['Species'] = iris['Species'].replace({'Iris-setosa':0,'Iris-versicolor':1,'Iri
iris.sample(5)
In [ ]: plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'])
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76de6880>
Out[ ]:
In [ ]: # cmap
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec75d12f10>
Out[ ]:
In [ ]: # alpha
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76e54790>
Out[ ]:
In [ ]: # plot size
plt.figure(figsize=(15,7))
plt.scatter(iris['SepalLengthCm'],iris['PetalLengthCm'],c=iris['Species'],cmap='jet
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1ec76f3bf40>
Out[ ]:
Annotations
In [ ]: batters = pd.read_csv(r'Data\Day49\Batter.csv')  # raw string keeps the backslashes literal
In [ ]: sample_df = batters.head(100).sample(25,random_state=5)
In [ ]: sample_df
In [ ]: plt.figure(figsize=(18,10))
plt.scatter(sample_df['avg'],sample_df['strike_rate'],s=sample_df['runs'])
for i in range(sample_df.shape[0]):
    plt.text(sample_df['avg'].values[i],sample_df['strike_rate'].values[i],sample_df[
In [ ]: x = [1,2,3,4]
y = [5,6,7,8]
plt.scatter(x,y)
plt.text(1,5,'Point 1')
plt.text(2,6,'Point 2')
plt.text(3,7,'Point 3')
plt.text(4,8,'Point 4',fontdict={'size':12,'color':'brown'})
plt.figure(figsize=(18,10))
plt.scatter(sample_df['avg'],sample_df['strike_rate'],s=sample_df['runs'])
plt.axvline(30,color='red')
for i in range(sample_df.shape[0]):
    plt.text(sample_df['avg'].values[i],sample_df['strike_rate'].values[i],sample_df[
Subplots
In [ ]: # A diff way to plot graphs
batters.head()
In [ ]: plt.figure(figsize=(15,6))
plt.scatter(batters['avg'],batters['strike_rate'])
plt.title('Something')
plt.xlabel('Avg')
plt.ylabel('Strike Rate')
plt.show()
In [ ]: fig,ax = plt.subplots(figsize=(15,6))
ax.scatter(batters['avg'],batters['strike_rate'],color='red',marker='+')
ax.set_title('Something')
ax.set_xlabel('Avg')
ax.set_ylabel('Strike Rate')
fig.show()
In [ ]: fig, ax = plt.subplots(nrows=2,ncols=1,sharex=True,figsize=(10,6))
ax[0].scatter(batters['avg'],batters['strike_rate'],color='red')
ax[1].scatter(batters['avg'],batters['runs'])
ax[1].set_title('Avg Vs Runs')
ax[1].set_ylabel('Runs')
ax[1].set_xlabel('Avg')
Text(0.5, 0, 'Avg')
Out[ ]:
In [ ]: fig, ax = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
ax[0,0].scatter(batters['avg'],batters['strike_rate'],color='red')
ax[0,1].scatter(batters['avg'],batters['runs'])
ax[1,0].hist(batters['avg'])
ax[1,1].hist(batters['runs'])
Out[ ]: (array([499., 40., 19., 19., 9., 6., 4., 4., 3., 2.]),
array([ 0. , 663.4, 1326.8, 1990.2, 2653.6, 3317. , 3980.4, 4643.8,
5307.2, 5970.6, 6634. ]),
<BarContainer object of 10 artists>)
In [ ]: fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.scatter(batters['avg'],batters['strike_rate'],color='red')
ax2 = fig.add_subplot(2,2,2)
ax2.hist(batters['runs'])
ax3 = fig.add_subplot(2,2,3)
ax3.hist(batters['avg'])
Out[ ]: (array([102., 125., 103., 82., 78., 43., 22., 14., 2., 1.]),
array([ 0. , 5.56666667, 11.13333333, 16.7 , 22.26666667,
27.83333333, 33.4 , 38.96666667, 44.53333333, 50.1 ,
55.66666667]),
<BarContainer object of 10 artists>)
Advanced Matplotlib (Part 2)
3D Scatter Plot
A 3D scatter plot is used to represent data points in a three-dimensional space.
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: batters = pd.read_csv(r'Data\Day49\Batter.csv')  # raw string keeps the backslashes literal
batters.head()
In [ ]: fig = plt.figure()
ax = plt.subplot(projection='3d')
ax.scatter3D(batters['runs'],batters['avg'],batters['strike_rate'],marker='+')
ax.set_title('IPL batsman analysis')
ax.set_xlabel('Runs')
ax.set_ylabel('Avg')
ax.set_zlabel('SR')
Text(0.5, 0, 'SR')
Out[ ]:
In the example, you created a 3D scatter plot to analyze IPL batsmen based on runs,
average (avg), and strike rate (SR).
The ax.scatter3D function was used to create the plot, where the three variables were
mapped to the x, y, and z axes.
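For readers without the Batter.csv file, the same kind of plot can be reproduced on synthetic data. In the sketch below the three arrays are randomly generated stand-ins for runs, average, and strike rate; the value ranges are assumptions chosen only to look plausible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
runs = rng.integers(0, 5000, 30)          # stand-in for batters['runs']
avg = rng.uniform(10, 50, 30)             # stand-in for batters['avg']
strike_rate = rng.uniform(90, 160, 30)    # stand-in for batters['strike_rate']

fig = plt.figure()
ax = fig.add_subplot(projection='3d')     # projection='3d' creates a 3D Axes

# The three arrays map to the x, y, and z axes in that order
ax.scatter(runs, avg, strike_rate, marker='+')
ax.set_title('Synthetic batsman data')
ax.set_xlabel('Runs')
ax.set_ylabel('Avg')
ax.set_zlabel('SR')
plt.show()
```

On a 3D Axes, ax.scatter and ax.scatter3D refer to the same method, so either spelling works.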
3D Line Plot
A 3D line plot represents data as a line in three-dimensional space.
In [ ]: x = [0,1,5,25]
y = [0,10,13,0]
z = [0,13,20,9]
fig = plt.figure()
ax = plt.subplot(projection='3d')
ax.scatter3D(x,y,z,s=[100,100,100,100])
ax.plot3D(x,y,z,color='red')
[<mpl_toolkits.mplot3d.art3d.Line3D at 0x23ec4988340>]
Out[ ]:
In the given example, the lists x, y, and z hold the coordinates of four points in 3D
space. The ax.scatter3D call draws the points, and the ax.plot3D call connects them, in
order, with a red line.
3D Surface Plots
3D surface plots are used to visualize functions of two variables as surfaces in three-
dimensional space.
In [ ]: x = np.linspace(-10,10,100)
y = np.linspace(-10,10,100)
In [ ]: xx, yy = np.meshgrid(x,y)
In [ ]: z = xx**2 + yy**2
z.shape
(100, 100)
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec66ca9a0>
Out[ ]:
In [ ]: z = np.sin(xx) + np.cos(yy)
fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec33aa520>
Out[ ]:
Both examples create a 3D surface plot using the ax.plot_surface function. The first plots a
parabolic surface (z = x² + y²), and the second plots a surface built from sine and cosine
functions.
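The role of np.meshgrid in these cells is easier to see on a tiny grid. The sketch below uses three x values and two y values (chosen arbitrarily for illustration) and shows that the result is a pair of 2D arrays holding every (x, y) combination.

```python
import numpy as np

x = np.linspace(-1, 1, 3)   # three x values: -1, 0, 1
y = np.linspace(0, 1, 2)    # two y values: 0, 1

# xx repeats x along each row, yy repeats y down each column;
# together they give the coordinates of every point on the grid
xx, yy = np.meshgrid(x, y)
print(xx)
print(yy)

# Evaluating a function on the grid gives one height per grid point
z = xx**2 + yy**2
print(z.shape)   # (2, 3): one row per y value, one column per x value
print(z)
```

With 100 values on each axis, as in the notebook cells, the same call yields the (100, 100) arrays that plot_surface expects.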
Contour Plots
Contour plots are used to visualize 3D data in 2D, representing data as contours on a
2D plane.
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot(projection='3d')
p = ax.plot_surface(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec616e0d0>
Out[ ]:
In [ ]: fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contour(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec56e4be0>
Out[ ]:
In [ ]: z = np.sin(xx) + np.cos(yy)
fig = plt.figure(figsize=(12,8))
ax = plt.subplot()
p = ax.contourf(xx,yy,z,cmap='viridis')
fig.colorbar(p)
<matplotlib.colorbar.Colorbar at 0x23ec8865f40>
Out[ ]:
You created both filled contour plots (ax.contourf) and contour line plots (ax.contour) in 2D
space. These plots are useful for representing functions over a grid.
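The difference between the two functions is easiest to see side by side. The following sketch rebuilds the sine and cosine surface from scratch and draws both variants; the number of levels and the figure size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
xx, yy = np.meshgrid(x, y)
z = np.sin(xx) + np.cos(yy)

fig, (ax_lines, ax_filled) = plt.subplots(1, 2, figsize=(12, 5))

# contour draws only the level curves; clabel writes each curve's z value on it
lines = ax_lines.contour(xx, yy, z, levels=8, cmap='viridis')
ax_lines.clabel(lines, inline=True, fontsize=8)
ax_lines.set_title('contour (lines)')

# contourf fills the regions between consecutive levels with color
filled = ax_filled.contourf(xx, yy, z, levels=8, cmap='viridis')
fig.colorbar(filled, ax=ax_filled)
ax_filled.set_title('contourf (filled)')

plt.show()
```

Contour lines suit plots where exact level values matter, while filled contours read more like a heatmap of the function.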
Heatmap
A heatmap is a graphical representation of data in a 2D grid, where individual values are
represented as colors.
In [ ]: delivery = pd.read_csv(r'Data\Day50\IPL_Ball_by_Ball_2008_2022.csv')  # raw string keeps the backslashes literal
delivery.head()
Out[ ]:
        ID  innings  overs  ballnumber       batter          bowler  non-striker extra_type  batsman_run  ...
0  1312200        1      0           1  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0  ...
1  1312200        1      0           2  YBK Jaiswal  Mohammed Shami   JC Buttler    legbyes            0  ...
2  1312200        1      0           3   JC Buttler  Mohammed Shami  YBK Jaiswal        NaN            1  ...
3  1312200        1      0           4  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0  ...
4  1312200        1      0           5  YBK Jaiswal  Mohammed Shami   JC Buttler        NaN            0  ...
In [ ]: grid = temp_df.pivot_table(index='overs',columns='ballnumber',values='batsman_run',
In [ ]: plt.figure(figsize=(20,10))
plt.imshow(grid)
plt.yticks(delivery['overs'].unique(), list(range(1,21)))
plt.xticks(np.arange(0,6), list(range(1,7)))
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x23ec9384820>
Out[ ]:
In the given example, we used the imshow function to create a heatmap of IPL
deliveries.
The grid represented ball-by-ball data with the number of sixes (batsman_run=6) in
each over and ball number.
Heatmaps are effective for visualizing patterns and trends in large datasets.
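The pivot step that builds the grid can be tried without the IPL file. The sketch below generates random ball-by-ball records with the same three column names and counts the sixes per over and ball number. The data is synthetic, and the use of aggfunc='count' is an assumption about how the grid was built, since the notebook cell is cut off.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 5000
# Synthetic deliveries: over 0-19, ball 1-6, runs drawn from typical outcomes
balls = pd.DataFrame({
    'overs': rng.integers(0, 20, n),
    'ballnumber': rng.integers(1, 7, n),
    'batsman_run': rng.choice([0, 1, 2, 4, 6], size=n),
})

# Keep only the sixes, then count them for every (over, ball) combination
sixes = balls[balls['batsman_run'] == 6]
grid = sixes.pivot_table(index='overs', columns='ballnumber',
                         values='batsman_run', aggfunc='count', fill_value=0)
# Make sure every over and ball number has a cell, even if no six was hit there
grid = grid.reindex(index=range(20), columns=range(1, 7), fill_value=0)

plt.figure(figsize=(8, 8))
plt.imshow(grid, aspect='auto')               # each cell's color encodes its count
plt.yticks(np.arange(20), np.arange(1, 21))   # show overs as 1-20
plt.xticks(np.arange(6), np.arange(1, 7))     # show balls as 1-6
plt.xlabel('Ball number')
plt.ylabel('Over')
plt.colorbar(label='Number of sixes')
plt.show()
```

Since imshow only sees a 2D array of numbers, the tick calls are what tie the rows and columns back to overs and ball numbers.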
These techniques provide powerful tools for visualizing complex data in three dimensions
and for representing large datasets effectively. Each type of plot is suitable for different
types of data and can help in gaining insights from the data.
Pandas Plot Function
1. Line Plot:
In [ ]: import pandas as pd
# Create a DataFrame
data = {'Year': [2010, 2011, 2012, 2013, 2014],
'Sales': [100, 120, 150, 200, 180]}
df = pd.DataFrame(data)
# Create a line plot
df.plot(x='Year', y='Sales', kind='line')
2. Bar Plot:
In [ ]: import pandas as pd
# Create a DataFrame
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
'Population': [8398748, 3980408, 2716000, 2326006]}
df = pd.DataFrame(data)
# Create a bar plot
df.plot(x='City', y='Population', kind='bar')
3. Histogram:
In [ ]: import pandas as pd
# Create a DataFrame
data = {'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]}
df = pd.DataFrame(data)
# Create a histogram
df.plot(y='Age', kind='hist', bins=5, title='Age Distribution')
4. Scatter Plot:
In [ ]: import pandas as pd
# Create a DataFrame
data = {'X': [1, 2, 3, 4, 5],
'Y': [2, 4, 1, 3, 5]}
df = pd.DataFrame(data)
# Create a scatter plot
df.plot(x='X', y='Y', kind='scatter')
5. Pie Chart:
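In [ ]: import pandas as pd
# Create a DataFrame
data = {'Category': ['A', 'B', 'C', 'D'],
'Value': [30, 40, 20, 10]}
df = pd.DataFrame(data)
# Create a pie chart, labeling each slice with its category
df.plot(y='Value', kind='pie', labels=df['Category'])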
The plot function provides various customization options for labels, titles, colors, and more,
and it can handle different plot types by specifying the kind parameter. While Pandas' plot
function is useful for quick and simple visualizations, more complex and customized
visualizations may require using libraries like Matplotlib or Seaborn in combination with
Pandas.
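Because the plot function returns a Matplotlib Axes, Pandas plotting and Matplotlib customization can be mixed freely. The sketch below draws the same sales data twice by changing only the kind parameter, then adjusts the result with ordinary Matplotlib calls. The titles and colors are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Year': [2010, 2011, 2012, 2013, 2014],
                   'Sales': [100, 120, 150, 200, 180]})

# Two Matplotlib Axes; the ax parameter tells Pandas where to draw
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Same data, different kind
df.plot(x='Year', y='Sales', kind='line', ax=ax_line, title='Line', marker='o')
df.plot(x='Year', y='Sales', kind='bar', ax=ax_bar, title='Bar',
        color='purple', legend=False)

# Further customization uses the regular Matplotlib API
ax_line.set_ylabel('Sales')
ax_bar.set_ylabel('Sales')
fig.tight_layout()
plt.show()
```

Extra keyword arguments that Pandas does not handle itself, such as marker here, are passed through to the underlying Matplotlib call.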
Introduction to Seaborn
1. Statistical Plots: Seaborn simplifies the process of creating statistical plots by providing
functions for common statistical visualizations such as scatter plots, line plots, bar plots,
box plots, violin plots, and more.
2. Themes and Color Palettes: Seaborn allows you to easily customize the look of your
visualizations by providing different themes and color palettes. This makes it simple to
create professional-looking plots without having to manually tweak every detail.
4. Matrix Plots: Seaborn provides functions for visualizing matrices of data, such as
heatmaps. These can be useful for exploring relationships and patterns in large datasets.
5. Time Series Plots: Seaborn supports the visualization of time series data, allowing you
to create informative plots for temporal analysis.
While Seaborn is built on top of Matplotlib, it abstracts away much of the complexity and
provides a more concise and visually appealing syntax for creating statistical visualizations. It
is a popular choice among data scientists and analysts for quickly generating exploratory
data visualizations.
Seaborn offers several other plot types for various visualization needs:
1. Categorical Plots:
violinplot : Combines aspects of a box plot and a kernel density plot, providing
insights into the distribution of a variable for different categories.
2. Distribution Plots:
rugplot : Adds small vertical lines (rug) to a plot to indicate the distribution of
data points along the x or y-axis.
3. Regression Plots:
regplot : Creates a scatter plot with a linear regression line fitted to the data.
4. Time Series Plots:
tsplot : Formerly used for time series data. It was deprecated in favor of the more
flexible lineplot and is no longer available in current Seaborn releases.
5. Facet Grids:
FacetGrid : Although not a specific plot type, FacetGrid is a powerful tool that
allows you to create a grid of subplots based on the values of one or more
categorical variables. It can be used with various plot types to create a matrix of
visualizations.
6. Relational Plots:
relplot : Combines aspects of scatter plots and line plots to visualize the
relationship between two variables across different levels of a third variable. It's a
flexible function that can create scatter plots, line plots, or other types of relational
plots.
scatterplot : Creates a simple scatter plot to show the relationship between two
variables.
lineplot : Generates line plots to depict the trend between two variables.
jointplot : Combines scatter plots for two variables along with histograms for
each variable.
7. Matrix Plots:
heatmap : Displays a matrix where the values are represented by colors. Heatmaps
are often used to visualize correlation matrices or other two-dimensional datasets.
These functions provide a range of tools for exploring relationships in data, whether you're
interested in visualizing the distribution of individual variables, the relationship between two
variables, or patterns within a matrix of data.
These are just a selection of Seaborn's capabilities. The library is designed to make it easy to
create a wide range of statistical graphics for data exploration and presentation. The choice
of which plot to use depends on the nature of your data and the specific insights you want
to extract.
In Seaborn, the concepts of "axes-level" and "figure-level" functions refer to the level at
which the functions operate and the structure of the resulting plots.
1. Axes-Level Functions:
Operate at the level of a single subplot (a Matplotlib Axes).
Produce a single plot by default.
Examples include functions like sns.scatterplot() , sns.lineplot() ,
sns.boxplot() , etc.
Accept the ax parameter to specify the Axes where the plot will be drawn. If
not specified, the plot is drawn on the currently active Axes.
Return the Matplotlib Axes they drew on.
2. Figure-Level Functions:
Operate at the level of the entire figure, potentially creating multiple
subplots.
Create their own figure, so they do not accept an ax parameter.
Examples include functions like sns.relplot() , sns.catplot() ,
sns.pairplot() , etc.
Return a grid object (a FacetGrid for relplot and catplot, a PairGrid for
pairplot), allowing for easy creation of subplots based on additional
categorical variables.
The choice between axes-level and figure-level functions depends on your specific needs.
Axes-level functions are often more straightforward for simple plots, while figure-level
functions are powerful for creating complex visualizations with multiple subplots or facets
based on categorical variables.
Keep in mind that figure-level functions return objects like FacetGrid, and you can customize
the resulting plots further using the methods and attributes of these objects. The
documentation for each function provides details on how to use and customize the output.
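The difference between the two kinds of function shows up in what they accept and return. The sketch below uses a tiny hand-made DataFrame (the values are illustrative, loosely modeled on the tips dataset) so that it runs without downloading anything.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small illustrative dataset
data = pd.DataFrame({
    'total_bill': [10.0, 20.5, 15.2, 30.1, 25.0, 12.3],
    'tip': [1.5, 3.0, 2.2, 5.0, 3.5, 2.0],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
})

# sns.scatterplot works on a single Axes: it draws on the Axes passed in
# and returns that same Axes
fig, ax = plt.subplots()
returned_ax = sns.scatterplot(data=data, x='total_bill', y='tip', ax=ax)
returned_ax.set_title('Single Axes')

# sns.relplot works on a whole figure: it creates its own figure
# and returns a FacetGrid
grid = sns.relplot(data=data, x='total_bill', y='tip', col='sex', kind='scatter')
grid.set_axis_labels('Total bill', 'Tip')   # FacetGrid method applied to every subplot

print(type(grid).__name__)
print(grid.axes.shape)    # one row, one column per value of 'sex'
plt.show()
```

The col parameter is what produces one subplot per category here; it is available on relplot but not on scatterplot.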
1. Relational Plots
Used to see the statistical relationship between two or more variables.
Bivariate Analysis
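In [ ]: import pandas as pd
import seaborn as sns
import plotly.express as px  # used only for the gapminder dataset
import matplotlib.pyplot as plt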
In [ ]: tips = sns.load_dataset('tips')
tips
<seaborn.axisgrid.FacetGrid at 0x7f686180b340>
Out[ ]:
In [ ]: # line plot
gap = px.data.gapminder()
temp_df = gap[gap['country'] == 'India']
temp_df
In [ ]: # using relplot
sns.relplot(data=temp_df, x='year', y='lifeExp', kind='line')
<seaborn.axisgrid.FacetGrid at 0x7f686180bf70>
Out[ ]:
In [ ]: temp_df = gap[gap['country'].isin(['India','Brazil','Germany'])]
temp_df
<seaborn.axisgrid.FacetGrid at 0x7f68616088e0>
Out[ ]:
In [ ]: # facet plot -> figure level function -> work with relplot
# it will not work with scatterplot and lineplot
sns.relplot(data=tips, x='total_bill', y='tip', kind='scatter', col='sex')
<seaborn.axisgrid.FacetGrid at 0x7f68630464d0>
Out[ ]:
2. Distribution Plots
displot is the figure-level function for distributions. For univariate data, it draws a
histogram by default; passing kde=True overlays a line representing the kernel density
estimate (KDE), a smoothed version of the histogram that can make the underlying
distribution easier to see.
For bivariate data, displot draws a two-dimensional histogram by default, in which color
encodes the count in each bin. With kind='kde' it draws contours of the estimated joint
density instead. Both views can be used to identify clusters of data points or to see how
the variables are related.
The corresponding axes-level functions are:
histplot
kdeplot
rugplot
In [ ]: tips = sns.load_dataset('tips')
tips
In [ ]: # displot
sns.displot(data=tips, x='total_bill', kind='hist')
<seaborn.axisgrid.FacetGrid at 0x7b831cbeaef0>
Out[ ]:
In [ ]: # countplot
sns.displot(data=tips, x='day', kind='hist')
<seaborn.axisgrid.FacetGrid at 0x7b83171a4a90>
Out[ ]:
In [ ]: # hue parameter
sns.displot(data=tips, x='tip', kind='hist',hue='sex')
<seaborn.axisgrid.FacetGrid at 0x7b8317081f90>
Out[ ]:
In [ ]: titanic = sns.load_dataset('titanic')
In [ ]: titanic.head()
Out[ ]: survived pclass sex age sibsp parch fare embarked class who adult_male dec
<seaborn.axisgrid.FacetGrid at 0x7b831a9e2440>
Out[ ]:
In [ ]: # faceting using col and row -> this does not work with the histplot function
<seaborn.axisgrid.FacetGrid at 0x7b8316eddf30>
Out[ ]:
kdeplot
Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian
kernel, producing a continuous density estimate.
In [ ]: sns.kdeplot(data=tips,x='total_bill')
In [ ]: sns.displot(data=tips,x='total_bill',kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7b8316ba3250>
Out[ ]:
Rugplot
Plot marginal distributions by drawing ticks along the x and y axes.
This function is intended to complement other plots by showing the location of individual
observations in an unobtrusive way.
In [ ]: sns.kdeplot(data=tips,x='total_bill')
sns.rugplot(data=tips,x='total_bill')
Bivariate histogram
A bivariate histogram bins the data within rectangles that tile the plot and then shows the
count of observations within each rectangle with the fill color.
In [ ]: # Bivariate Kdeplot
# a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian kernel
sns.kdeplot(data=tips, x='total_bill', y='tip')
Matrix Plot
In Seaborn, both heatmap and clustermap functions are used for visualizing matrices,
but they serve slightly different purposes.
1. Heatmap:
The heatmap function is used to plot rectangular data as a color-encoded matrix.
It is essentially a 2D representation of the data where each cell is colored based on
its value.
Heatmaps are useful for visualizing relationships and patterns in the data, making
them suitable for tasks such as correlation matrices or any other situation where
you want to visualize the magnitude of a phenomenon.
2. Clustermap:
The clustermap function, on the other hand, not only visualizes the matrix but
also performs hierarchical clustering on both rows and columns to reorder them
based on similarity.
It is useful when you want to identify patterns not only in the individual values of
the matrix but also in the relationships between rows and columns.
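The contrast between the two functions can be seen on a small matrix whose rows deliberately alternate between two groups. The matrix below is synthetic, and the row and column names are made up. Note that sns.clustermap relies on SciPy for the hierarchical clustering, so SciPy is assumed to be installed.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
low = rng.normal(0, 1, size=(3, 4))    # three rows with values around 0
high = rng.normal(5, 1, size=(3, 4))   # three rows with values around 5

# Interleave the two groups so that similar rows are NOT adjacent
rows = [low[0], high[0], low[1], high[1], low[2], high[2]]
matrix = pd.DataFrame(rows,
                      index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5'],
                      columns=['c0', 'c1', 'c2', 'c3'])

# heatmap keeps the rows and columns in their original order
plt.figure()
ax = sns.heatmap(matrix, annot=True, fmt='.1f')

# clustermap reorders rows and columns so that similar ones sit together
grid = sns.clustermap(matrix)
row_order = grid.dendrogram_row.reordered_ind
print(row_order)   # positions of the original rows after clustering
plt.show()
```

In the heatmap the light and dark rows alternate, while in the clustermap the reordering groups them into two blocks with a dendrogram drawn alongside.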
In [ ]: gap = px.data.gapminder()
In [ ]: # Heatmap
In [ ]: # annot
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country',columns='year',va
plt.figure(figsize=(15,15))
sns.heatmap(temp_df,annot=True)
In [ ]: # linewidth
# annot
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country',columns='year',va
plt.figure(figsize=(15,15))
sns.heatmap(temp_df,annot=True,linewidths=0.5)
In [ ]: # cmap
# annot
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country',columns='year',va
plt.figure(figsize=(15,15))
sns.heatmap(temp_df,annot=True,linewidths=0.5, cmap='summer')