Data Visualisation

The document discusses the importance of data visualization in data analysis, highlighting Python's matplotlib library for creating various types of plots. It covers how to create figures, subplots, bar plots, histograms, density plots, and scatter plots using matplotlib and pandas. Additionally, it introduces seaborn for enhanced visualizations, including pair plots for exploratory data analysis.

Uploaded by

hacksbank.net

DATA VISUALIZATION

Making informative visualizations is one of the most important tasks in data analysis. It
may be part of the exploratory process, for example, to help identify outliers or needed data
transformations, or a way of generating ideas for models.

Python has many libraries for making static or dynamic visualizations. matplotlib is a desktop
plotting package designed for creating publication-quality plots. It supports various GUI
backends on all operating systems and can export visualizations to the common vector and raster
graphics formats (PDF, SVG, JPG, PNG, BMP, GIF, etc.). Several add-on toolkits for data
visualization use matplotlib for their underlying plotting. One of these is seaborn.

The matplotlib API in action

import matplotlib.pyplot as plt
import numpy as np

data = np.arange(10)  # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.plot(data)        # returns a list holding one Line2D object
plt.show()

Simple Line Figure
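
The export claim made earlier can be checked with a small sketch. It draws the same line and writes it in two formats with savefig. The in-memory buffers stand in for file paths, and the Agg backend is an assumption made only so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to use a GUI backend
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(10))

# Write the same figure twice; the format argument selects the output type.
png_buffer = io.BytesIO()
svg_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
fig.savefig(svg_buffer, format="svg")

png_bytes = png_buffer.getvalue()
svg_bytes = svg_buffer.getvalue()
plt.close(fig)
```

When saving to a real path, matplotlib infers the format from the file extension, so fig.savefig('plot.pdf') would write a PDF.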


Figures and Subplots

Plots in matplotlib reside within a Figure object. You can create a new figure with plt.figure:
fig = plt.figure()

plt.figure has many options. For example, figsize (width and height in inches) guarantees that
the figure has a certain size and aspect ratio when saved to disk.
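
As a minimal sketch of the figsize option, the snippet below creates an 8 x 4 inch figure at 100 dots per inch, saves it as a PNG, and reads the image back to inspect its pixel dimensions. The specific size and dpi values are illustrative, and the Agg backend is an assumption for running without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt

# 8 x 4 inches at 100 dpi should render as 800 x 400 pixels.
fig = plt.figure(figsize=(8, 4), dpi=100)
fig.add_subplot(1, 1, 1).plot([0, 1, 2])

width_in, height_in = fig.get_size_inches()
aspect_ratio = width_in / height_in

# Save to an in-memory PNG and load it back as an array.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
buffer.seek(0)
image = plt.imread(buffer, format="png")  # shape: (height_px, width_px, channels)
plt.close(fig)
```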

You can’t make a plot with a blank figure. You must create one or more subplots using
add_subplot:

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
ax3.plot(np.random.randn(50).cumsum(), 'k--')  # plot a random walk as a dashed line
fig.savefig('figpath.svg')  # save the figure
plt.show()  # display on the screen
Figure with a 2 x 2 grid of subplots, three of which are drawn

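A plotting command such as plt.plot draws on the most recently used figure and subplot,
creating them if necessary:
plt.plot([1.5, 3.5, -2, 1.6])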
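
Creating a grid of subplots one add_subplot call at a time can be shortened. The sketch below uses plt.subplots, which returns the figure together with a NumPy array of subplot objects. The seeded random generator and the figure size are illustrative choices, not part of the example above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# One call creates the figure and a 2 x 2 array of Axes.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(rng.standard_normal(100), bins=20, color="k", alpha=0.3)
axes[0, 1].scatter(np.arange(30), np.arange(30) + 3 * rng.standard_normal(30))
axes[1, 0].plot(rng.standard_normal(50).cumsum(), "k--")
# axes[1, 1] is left empty, as in the add_subplot version.

grid_shape = axes.shape
```

Indexing the array as axes[row, column] avoids keeping a separate variable for each subplot.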


You can use matplotlib with pandas and numpy

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()
plt.show()
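
The pandas plot method returns the matplotlib subplot it drew on, so the result can be customized with ordinary matplotlib calls. The following is a small sketch of that idea; the title and axis labels are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(axis=0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))

# df.plot() draws one line per column and returns the Axes object.
ax = df.plot()
ax.set_title("Cumulative sums of random steps")  # illustrative label
ax.set_xlabel("index")
ax.set_ylabel("value")

line_count = len(ax.get_lines())
```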

Bar Plots

The plot.bar() and plot.barh() methods create vertical and horizontal bar plots, respectively.
In this case, the Series or DataFrame index is used as the x (bar) or y (barh) tick labels:

fig, axes = plt.subplots(2, 1)


data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
data.plot.bar(ax=axes[0], color='k', alpha=0.7)
data.plot.barh(ax=axes[1], color='k', alpha=0.7)
plt.show()
The options color='k' and alpha=0.7 set the bar color to black and the opacity to 0.7 (slightly
transparent), respectively.
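
To see how the index ends up on the axis, the sketch below plots a short Series and then reads the tick labels and the bar opacity back from the subplot. The four-element Series is a shortened, hypothetical stand-in for the one above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(rng.random(4), index=list("abcd"))

fig, ax = plt.subplots()
data.plot.bar(ax=ax, color="k", alpha=0.7)
fig.canvas.draw()  # make sure tick labels are laid out before reading them

# The Series index becomes the x tick labels; alpha is stored on each bar.
tick_labels = [label.get_text() for label in ax.get_xticklabels()]
bar_alpha = ax.patches[0].get_alpha()
```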

With a DataFrame, bar plots place the values of each row in a group of bars, side by side, with
one bar per column.

df = pd.DataFrame(np.random.rand(6, 4),
                  index=['one', 'two', 'three', 'four', 'five', 'six'],
                  columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
df
Genus         A         B         C         D
one    0.036514  0.306275  0.020181  0.385821
two    0.800127  0.268534  0.056382  0.447547
three  0.878521  0.375050  0.025206  0.566525
four   0.235946  0.230388  0.730691  0.734806
five   0.914254  0.701028  0.626693  0.878269
six    0.076666  0.765472  0.211579  0.176230
df.plot.bar()
plt.show()
DataFrame bar plot

Note that the name “Genus” on the DataFrame’s columns is used to title the legend.

You can create stacked bar plots from a DataFrame by passing stacked=True, so that the values
in each row are stacked into a single bar.

df.plot.barh(stacked=True, alpha=0.5)
plt.show()

Stacked bar plot created from the random-data DataFrame
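
Stacked bars are often easier to compare when every row is first rescaled to sum to 1, so each bar shows proportions rather than raw values. The snippet below is a sketch of that variation on a DataFrame shaped like the one above; the normalization step is an addition for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 4)),
                  index=["one", "two", "three", "four", "five", "six"],
                  columns=pd.Index(["A", "B", "C", "D"], name="Genus"))

# Divide each row by its total so the stacked segments are shares of 1.
shares = df.div(df.sum(axis=1), axis=0)
ax = shares.plot.barh(stacked=True, alpha=0.5)

row_totals = shares.sum(axis=1)
```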


Histograms and Density Plots

A histogram is a kind of bar plot that gives a discretized display of value frequency. The data
points are split into discrete, evenly spaced bins, and the number of data points in each bin is
plotted. Using the current CGPA of students at a given level in SAZU, we can make a histogram of
CGPA values grouped into bins of width 0.5 using the plot.hist method on the Series:

import pandas as pd
import numpy as np
import matplotlib.pyplot as ppl

# Load the student records and keep the two columns of interest
students = pd.read_csv('C:/Users/adamn/Downloads/STATISTICAL_DATA/main.csv')
data = students[['cgpa', 'score']]
# Bin edges run from 0.5 to 5.0 in steps of 0.5
cgpa_hist = data['cgpa'].plot.hist(bins=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
ppl.show()

Histogram showing students’ CGPA distribution
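
The binning behind a histogram can be reproduced with np.histogram. The sketch below uses synthetic CGPA-like values, since the CSV file above is not available here; the data, the seed and the sample size are all hypothetical. It counts the points per bin and compares the counts with the bar heights that plot.hist draws.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real CGPA column: 200 values in [0.5, 5.0].
cgpa = pd.Series(rng.uniform(0.5, 5.0, size=200).round(2), name="cgpa")

bin_edges = np.arange(1, 11) * 0.5  # 0.5, 1.0, ..., 5.0
counts, edges = np.histogram(cgpa, bins=bin_edges)

fig, ax = plt.subplots()
cgpa.plot.hist(bins=bin_edges, ax=ax)
heights = [patch.get_height() for patch in ax.patches]
```

Ten bin edges define nine bins, and each bar height equals the number of values that fall in the corresponding bin.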

A related plot type is a density plot, which is formed by computing an estimate of a continuous
probability distribution that might have generated the observed data. The usual procedure is to
approximate this distribution as a mixture of “kernels”—that is, simpler distributions like the
normal distribution. Thus, density plots are also known as kernel density estimate (KDE) plots.
Calling plot.density (an alias of plot.kde) makes a density plot using the conventional
mixture-of-normals estimate:
cgpa_density = data['cgpa'].plot.density()
ppl.show()
Density Plot for the same students’ CGPA scores
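
The mixture-of-normals idea can be written out directly: place one normal curve on every observation and average them. The sketch below does this in NumPy on synthetic data. The sample, the rule-of-thumb bandwidth and the helper name gaussian_mixture_density are assumptions for illustration, not the exact computation pandas performs.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=0.8, size=200)  # hypothetical CGPA-like values

# Rule-of-thumb bandwidth (Scott-style, one dimension); an assumption.
bandwidth = sample.std(ddof=1) * sample.size ** (-1 / 5)


def gaussian_mixture_density(x, points, h):
    """Average of normal densities with standard deviation h centred on each point."""
    z = (x[:, None] - points[None, :]) / h
    kernels = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 500)
density = gaussian_mixture_density(grid, sample, bandwidth)

# A density should integrate to roughly 1; approximate with a Riemann sum.
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Passing grid and density to a line plot gives a curve comparable to the one drawn by plot.density.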

Scatter or Point Plots


Point plots or scatter plots can be a useful way of examining the relationship between two
one-dimensional data series. For example, here we load a dataset containing, among other
variables, students' JAMB scores and current CGPA, select the two variables, then plot a scatter
diagram of the CGPA against the JAMB score:

import pandas as pd
import numpy as np
import matplotlib.pyplot as ppl

students = pd.read_csv('C:/Users/adamn/Downloads/STATISTICAL_DATA/main_data.csv')
data = students[['cgpa', 'score']]
data
     cgpa  score
0    4.76     87
1    3.36     47
2    1.89     13
3    2.92     12
4    3.29     53
..    ...    ...
734  3.16     26
735  1.49     36
736  1.87     14
737  1.68     17
738  2.69     22
We can then use seaborn's regplot function to draw the scatter plot and, at the same time, fit
a linear regression line:
import seaborn as sns

sns.regplot(x='score', y='cgpa', data=data)
ppl.title('JAMB Score vs CGPA')
ppl.xlabel('JAMB Score')
ppl.ylabel('CGPA')
ppl.show()  # display the plot

[Figure: JAMB score vs CGPA regression plots. Panels: University-wide, Bauchi State Students, Other States]
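
The regression line in such a plot can be approximated by hand with a least-squares fit. The sketch below uses np.polyfit on synthetic data; the scores, the assumed slope of 0.03 and the noise level are invented for illustration, and the fit is a plain first-degree polynomial rather than seaborn's own code path.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: CGPA rises by 0.03 per score point, plus noise.
score = rng.uniform(0, 100, size=200)
cgpa = 1.0 + 0.03 * score + rng.normal(scale=0.3, size=200)

# Degree-1 polynomial fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(score, cgpa, deg=1)

fig, ax = plt.subplots()
ax.scatter(score, cgpa, alpha=0.4)
line_x = np.array([score.min(), score.max()])
ax.plot(line_x, intercept + slope * line_x, "k-")
ax.set_xlabel("JAMB Score")
ax.set_ylabel("CGPA")
```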

In exploratory data analysis it’s helpful to be able to look at all the scatter plots among a group of
variables; this is known as a pairs plot or scatter plot matrix. Making such a plot from scratch is a
bit of work, so seaborn has a convenient pairplot function, which supports placing histograms or
density estimates of each variable along the diagonal:

sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.2})
ppl.show()


Check out the seaborn.pairplot docstring for more granular configuration options.
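
For comparison, pandas ships a similar helper, pandas.plotting.scatter_matrix. This is a different function from seaborn's pairplot, swapped in here only to show the grid structure without requiring seaborn. The three-column DataFrame is synthetic and the column name units is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
score = rng.uniform(0, 100, size=150)
frame = pd.DataFrame({
    "score": score,
    "cgpa": 1.0 + 0.03 * score + rng.normal(scale=0.3, size=150),
    "units": rng.integers(12, 25, size=150),  # hypothetical third variable
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(frame, alpha=0.2, diagonal="hist")
grid_shape = axes.shape
```

With n variables the result is an n x n grid, so three columns give nine panels.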
