DATA VISUALIZATION
Presenting information in visualizations is one of the most important tasks in data analysis. It
may be part of the exploratory process, for example, to help identify outliers or needed data
transformations, or a way of generating ideas for models.
Python has many libraries for making static or dynamic visualizations. matplotlib is a desktop
plotting package designed for creating publication-quality plots. matplotlib supports various GUI
backends on all operating systems and can export visualizations to all the common
graphics formats (PDF, SVG, JPG, PNG, BMP, GIF, etc.). matplotlib has several add-on toolkits
for data visualization that use matplotlib for their underlying plotting. One of these is seaborn.
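As a small illustration of the export support described above, the sketch below renders one figure into three of the listed formats. It writes to in-memory buffers rather than files, and the choice of formats and the use of the non-interactive Agg backend are assumptions made so the example runs anywhere.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(10))

# Render the same figure once per format into an in-memory buffer.
# The format list is illustrative; other formats may need extra libraries.
exports = {}
for fmt in ("png", "svg", "pdf"):
    buf = io.BytesIO()
    fig.savefig(buf, format=fmt)
    exports[fmt] = buf.getvalue()

plt.close(fig)
```

Passing a file name such as 'figure.pdf' to savefig instead of a buffer lets matplotlib infer the format from the extension.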
The matplotlib API in action
import matplotlib.pyplot as plt
import numpy as np
data = np.arange(10)
data
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)
[<matplotlib.lines.Line2D object at 0x000001F6BD392F10>]
plt.show()
Simple Line Figure
Figures and Subplots
Plots in matplotlib reside within a Figure object. You can create a new figure with plt.figure:
fig = plt.figure()
plt.figure has a number of options; figsize, for example, guarantees that the figure has a certain
size and aspect ratio if saved to disk.
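The effect of figsize can be checked with a short sketch like the one below. The 8 x 4 inch size and dpi=100 are arbitrary values chosen for illustration, and the pixel size is read back from the PNG header of the saved image.

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt

# figsize is (width, height) in inches, so this figure has a 2:1 aspect ratio.
fig = plt.figure(figsize=(8, 4))
fig.add_subplot(1, 1, 1).plot([0, 1, 2, 3])

buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=100)
png_bytes = buf.getvalue()

# The PNG header stores width and height as big-endian 32-bit integers at bytes 16..24.
width_px, height_px = struct.unpack(">II", png_bytes[16:24])
# 8 in x 100 dpi = 800 px wide, 4 in x 100 dpi = 400 px high
plt.close(fig)
```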
You can’t make a plot with a blank figure. You must create one or more subplots using
add_subplot:
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)  # 2 x 2 grid of subplots; select the first one
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
ax3.plot(np.random.randn(50).cumsum(), 'k--')  # plot a random walk as a dashed black line
fig.show()  # display on the screen
fig.savefig('figpath.svg')  # save the figure
Figure with a 2 x 2 grid of subplots, three of which are drawn
plt.plot([1.5, 3.5, -2, 1.6])  # plt.plot draws on the most recently used figure and subplot
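As a complement to add_subplot, plt.subplots (used later in these notes for the bar plots) creates the figure and the whole grid of subplots in one call. The sketch below repeats the three plots above in that style; the seed, the figsize and the decision to hide the fourth panel are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# One call creates the figure and a 2 x 2 array of Axes objects.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)
axes[0, 1].scatter(np.arange(30), np.arange(30) + 3 * rng.standard_normal(30))
axes[1, 0].plot(rng.standard_normal(50).cumsum(), "k--")
axes[1, 1].set_visible(False)  # hide the unused fourth panel

grid_shape = axes.shape
plt.close(fig)
```

Indexing the returned array as axes[row, column] avoids keeping track of the running subplot number that add_subplot needs.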
You can use matplotlib together with pandas and NumPy:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'], index=np.arange(0, 100, 10))
df.plot()
plt.show()
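One way to see how the two libraries fit together is that df.plot() returns the matplotlib Axes it drew on, so ordinary matplotlib calls can be applied afterwards. The following is a minimal sketch with seeded random data; the label and title text are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(axis=0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))

ax = df.plot()  # pandas returns the matplotlib Axes it drew on
ax.set_xlabel("index")  # so plain matplotlib methods still apply
ax.set_title("Cumulative sums of random data")

n_lines = len(ax.get_lines())  # one line per DataFrame column
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
plt.close(ax.figure)
```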
Bar Plots
The plot.bar() and plot.barh() methods create vertical and horizontal bar plots, respectively. In
this case, the Series or DataFrame index will be used as the x (bar) or y (barh) ticks:
fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
data.plot.bar(ax=axes[0], color='k', alpha=0.7)
data.plot.barh(ax=axes[1], color='k', alpha=0.7)
plt.show()
The options color='k' and alpha=0.7 set the bar color to black and the opacity to 0.7 (slightly
transparent), respectively.
With a DataFrame, bar plots group the values in each row together in a group of bars, side by
side, one bar for each column:
df = pd.DataFrame(np.random.rand(6, 4), index=['one', 'two', 'three', 'four', 'five', 'six'],
columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df
Genus A B C D
one 0.036514 0.306275 0.020181 0.385821
two 0.800127 0.268534 0.056382 0.447547
three 0.878521 0.375050 0.025206 0.566525
four 0.235946 0.230388 0.730691 0.734806
five 0.914254 0.701028 0.626693 0.878269
six 0.076666 0.765472 0.211579 0.176230
df.plot.bar()
plt.show()
DataFrame bar plot
Note that the name “Genus” on the DataFrame’s columns is used to title the legend.
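The legend title can be checked programmatically. The sketch below rebuilds a DataFrame of the same shape with seeded random values (not the values printed above) and reads the title back from the Axes that plot.bar returns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.random((6, 4)),
                  index=["one", "two", "three", "four", "five", "six"],
                  columns=pd.Index(["A", "B", "C", "D"], name="Genus"))

ax = df.plot.bar()
legend_title = ax.get_legend().get_title().get_text()  # taken from df.columns.name
n_bars = len(ax.patches)  # 6 rows x 4 columns, one bar each
plt.close(ax.figure)
```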
You can create stacked bar plots from a DataFrame by passing stacked=True, resulting in the
value in each row being stacked together.
df.plot.barh(stacked=True, alpha=0.5)
plt.show()
Stacked bar plot created from a DataFrame of random data
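A common variation on the stacked plot, shown here only as an illustrative extension, is to normalize each row to sum to 1 first, so that every stacked bar has the same total length and the segments read as proportions. The data are seeded random values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.random((6, 4)),
                  index=["one", "two", "three", "four", "five", "six"],
                  columns=pd.Index(["A", "B", "C", "D"], name="Genus"))

# Divide every row by its own total so each row sums to 1.
shares = df.div(df.sum(axis=1), axis=0)

ax = shares.plot.barh(stacked=True, alpha=0.5)
row_totals = shares.sum(axis=1)
plt.close(ax.figure)
```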
Histograms and Density Plots
A histogram is a kind of bar plot that gives a discretized display of value frequency. The data
points are split into discrete, evenly spaced bins, and the number of data points in each bin is
plotted. Using the current CGPA of students at a given level in SAZU, we can make a histogram of
CGPA values grouped into bins of width 0.5 using the plot.hist method on the Series:
import pandas as pd
import numpy as np
import matplotlib.pyplot as ppl
students = pd.read_csv('C:/Users/adamn/Downloads/STATISTICAL_DATA/main.csv')
data = students[['cgpa', 'score']]
# bin edges from 0.5 to 5.0 in steps of 0.5
cgpa_hist = data['cgpa'].plot.hist(bins=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
ppl.show()
Histogram showing students’ CGPA distribution
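To make the binning step concrete, the sketch below counts a handful of made-up CGPA values (hypothetical, not taken from the students file) with np.histogram, using the same bin edges, and compares the counts with the bar heights that plot.hist draws.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical CGPA values, used only to illustrate the binning.
cgpa = pd.Series([1.2, 1.8, 2.1, 2.4, 2.6, 2.9, 3.1, 3.3, 3.4, 3.8, 4.2, 4.7])

edges = np.arange(0.5, 5.5, 0.5)  # bin edges 0.5, 1.0, ..., 5.0
counts, _ = np.histogram(cgpa, bins=edges)
# counts: [0, 1, 1, 2, 2, 3, 1, 1, 1]

fig, ax = plt.subplots()
cgpa.plot.hist(bins=edges, ax=ax)
bar_heights = [int(p.get_height()) for p in ax.patches]  # one bar per bin
plt.close(fig)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.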
A related plot type is a density plot, which is formed by computing an estimate of a continuous
probability distribution that might have generated the observed data. The usual procedure is to
approximate this distribution as a mixture of “kernels”—that is, simpler distributions like the
normal distribution. Thus, density plots are also known as kernel density estimate (KDE) plots.
Using plot.kde (plot.density is an alias for it) makes a density plot using the conventional mixture-of-normals estimate:
cgpa_density = data['cgpa'].plot.density()
ppl.show()
Density Plot for the same students’ CGPA scores
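The mixture-of-kernels idea can be sketched directly in NumPy: place one normal curve on every observation and average them. The data below are synthetic, and the bandwidth follows Scott's rule of thumb, which is an assumption made for this sketch; it is an illustration of the idea, not the exact code path pandas uses.

```python
import numpy as np

rng = np.random.default_rng(0)
cgpa = rng.normal(loc=3.0, scale=0.8, size=200)  # synthetic stand-in for the CGPA column

# Kernel width (bandwidth) from Scott's rule of thumb; other rules exist.
h = cgpa.std(ddof=1) * len(cgpa) ** (-1 / 5)


def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)


grid = np.linspace(-2.0, 8.0, 1001)

# One normal kernel of width h per observation, averaged at every grid point.
density = normal_pdf((grid[:, None] - cgpa[None, :]) / h).mean(axis=1) / h

# A density should integrate to about 1; check with a simple Riemann sum.
dx = grid[1] - grid[0]
area = density.sum() * dx
```

Plotting density against grid with a line plot gives a curve of the same kind as the one in the figure above.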
Scatter or Point Plots
Point plots or scatter plots can be a useful way of examining the relationship between two
one-dimensional data series. For example, here we load a dataset containing, among other
variables, students’ JAMB scores and current CGPA, select the two variables, then plot a scatter
diagram of the CGPA against the JAMB score:
import pandas as pd
import numpy as np
import matplotlib.pyplot as ppl
students = pd.read_csv('C:/Users/adamn/Downloads/STATISTICAL_DATA/main_data.csv')
data=students[['cgpa', 'score']]
data
cgpa score
0 4.76 87
1 3.36 47
2 1.89 13
3 2.92 12
4 3.29 53
.. ... ...
734 3.16 26
735 1.49 36
736 1.87 14
737 1.68 17
738 2.69 22
We can then use seaborn’s regplot function to draw the scatter plot and, at the same time, fit a
linear regression line:
import seaborn as sns
sns.regplot(x='score', y='cgpa', data=data)
ppl.title('JAMB Score vs CGPA')
ppl.xlabel('JAMB Score')
ppl.ylabel('CGPA')
ppl.show()  # display the plot
JAMB score vs CGPA regression plots (panels: University-wide, Bauchi State Students, Other States)
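The regression line in such a plot is an ordinary least-squares fit of CGPA on JAMB score. The sketch below reproduces that idea with np.polyfit and plain matplotlib on synthetic data; the true slope 0.03, the intercept 1.5 and the noise level are made-up values, and no confidence band is drawn.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
score = rng.uniform(0, 100, size=100)  # synthetic JAMB scores
cgpa = 1.5 + 0.03 * score + rng.normal(0, 0.3, size=100)  # synthetic CGPA with noise

# Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(score, cgpa, deg=1)

fig, ax = plt.subplots()
ax.scatter(score, cgpa, alpha=0.5)
xs = np.array([score.min(), score.max()])
ax.plot(xs, intercept + slope * xs, "k-")  # fitted regression line
ax.set_title("JAMB Score vs CGPA")
ax.set_xlabel("JAMB Score")
ax.set_ylabel("CGPA")
plt.close(fig)
```

The fitted slope and intercept should land close to the values used to generate the data, which is a quick check that the line drawn is the least-squares line.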
In exploratory data analysis it’s helpful to be able to look at all the scatter plots among a group of
variables; this is known as a pairs plot or scatter plot matrix. Making such a plot from scratch is a
bit of work, so seaborn has a convenient pairplot function, which supports placing histograms or
density estimates of each variable along the diagonal:
sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.2})
ppl.show()
Check out the seaborn.pairplot docstring for more granular configuration options.
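If seaborn is not available, pandas offers a comparable scatter plot matrix through pandas.plotting.scatter_matrix. The sketch below uses that function in place of pairplot, on synthetic data; the "age" column is a hypothetical third variable added only so the grid is larger than 2 x 2.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 150
score = rng.uniform(0, 100, size=n)
frame = pd.DataFrame({
    "score": score,
    "cgpa": 1.5 + 0.03 * score + rng.normal(0, 0.3, size=n),
    "age": rng.integers(17, 30, size=n),  # hypothetical extra variable
})

# Scatter plots off the diagonal, histograms of each variable on the diagonal.
axes = pd.plotting.scatter_matrix(frame, alpha=0.2, diagonal="hist", figsize=(6, 6))
grid_shape = axes.shape  # one row and one column of panels per variable
plt.close("all")
```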