The Next Level of Data Visualization in Python
Will Koehrsen
Jan 9 · 8 min read
The sunk-cost fallacy is one of many harmful cognitive biases to which humans fall
prey. It refers to our tendency to continue to devote time and resources to a lost cause
because we have already spent — sunk — so much time in the pursuit. The sunk-cost
fallacy applies to staying in bad jobs longer than we should, slaving away at a project
https://towardsdatascience.com/the-next-level-of-data-visualization-in-python-dd6e99039d5e 1/17
8/11/2019 The Next Level of Data Visualization in Python - Towards Data Science
even when it’s clear it won’t work, and yes, continuing to use a tedious, outdated
plotting library — matplotlib — when more efficient, interactive, and better-looking
alternatives exist.
Over the past few months, I’ve realized the only reason I use matplotlib is the
hundreds of hours I’ve sunk into learning its convoluted syntax. This complexity
leads to hours of frustration on StackOverflow figuring out how to format dates or add
a second y-axis. Fortunately, this is a great time for Python plotting, and after exploring
the options, a clear winner in terms of ease of use, documentation, and
functionality is the plotly Python library. In this article, we’ll dive right into plotly,
learning how to make better plots in less time, often with one line of code.
All of the code for this article is available on GitHub. The charts are all interactive and
can be viewed on NBViewer here.
The plotly Python package is an open-source library built on plotly.js, which in turn
is built on d3.js. We’ll be using a wrapper on plotly called cufflinks, designed to work
with Pandas dataframes. So, our entire stack is cufflinks > plotly > plotly.js > d3.js,
which means we get the efficiency of coding in Python with the incredible interactive
graphics capabilities of d3.
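To make the stack a little more concrete: what ultimately reaches plotly.js is a plain description of the figure, a list of traces plus a layout. Here is a minimal sketch using only the standard library. The sample values are made up, and the helper name histogram_figure is hypothetical, not part of plotly or cufflinks.

```python
import json


def histogram_figure(values, title):
    """Build a plotly.js-style figure description as plain Python data."""
    return {
        # One trace of type "histogram"; plotly.js bins the raw x values itself
        "data": [{"type": "histogram", "x": list(values)}],
        "layout": {"title": {"text": title}},
    }


# Made-up sample values, for illustration only
fig = histogram_figure([1, 2, 2, 3, 3, 3], "Example Distribution")

# The serialized form is what a plotly.js front end consumes
fig_json = json.dumps(fig)
print(fig_json)
```

If plotly is installed, a dictionary shaped like this can be handed to plotly.graph_objects.Figure; cufflinks builds a comparable structure for us from a dataframe.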
(Plotly itself is a graphics company with several products and open-source tools. The
Python library is free to use, and we can make unlimited charts in offline mode plus up
to 25 charts in online mode to share with the world.)
All the work in this article was done in a Jupyter Notebook with plotly + cufflinks
running in offline mode. After installing plotly and cufflinks with pip install
. . .
df['claps'].iplot(kind='hist', xTitle='claps',
                  yTitle='count', title='Claps Distribution')
For those used to matplotlib, all we have to do is add one more letter (iplot instead
of plot) and we get a much better-looking and interactive chart! We can click on the
data to get more details, zoom into sections of the plot, and as we’ll see later, select
different categories to highlight.
df[['time_started', 'time_published']].iplot(
    kind='hist',
    histnorm='percent',
    barmode='overlay',
    xTitle='Time of Day',
    yTitle='(%) of Articles',
    title='Time Started and Time Published')
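The histnorm='percent' option rescales each bar so that the bar heights of a series sum to 100, which is what lets two overlaid series of different sizes be compared. As a rough sketch of that arithmetic in plain Python (the hours are made-up values and percent_histogram is a hypothetical helper, not plotly’s implementation):

```python
from collections import Counter


def percent_histogram(values, bin_width):
    """Return {bin_start: percent of all values that fall in that bin}."""
    # Map each value to the left edge of its bin, then count per bin
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: 100 * n / total for start, n in sorted(counts.items())}


# Hypothetical hours of the day (0-23) at which articles were started
hours_started = [8, 9, 9, 10, 14, 15, 20, 21, 21, 22]
percents = percent_histogram(hours_started, bin_width=6)
print(percents)  # {6: 40.0, 12: 20.0, 18: 40.0}
```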
As we saw, we can combine the power of pandas with plotly + cufflinks. For a boxplot of
the fans per story by publication, we use a pivot and then plot:
df.pivot(columns='publication', values='fans').iplot(
    kind='box',
    yTitle='fans',
    title='Fans Distribution by Publication')
The benefits of interactivity are that we can explore and subset the data as we like.
There’s a lot of information in a boxplot, and without the ability to see the numbers,
we’ll miss most of it!
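For reference, the numbers behind each box are the five-number summary of that group. Below is a small sketch with the standard library; the fan counts are invented and five_number_summary is a hypothetical helper. Quartile conventions differ between tools, so the values may not match plotly’s hover labels exactly.

```python
import statistics


def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list of numbers."""
    # n=4 yields the three quartile cut points; "inclusive" treats the data
    # as the whole population rather than a sample from a larger one
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical fan counts per story, grouped by publication
fans_by_publication = {
    "Publication A": [10, 20, 30, 40, 50],
    "Publication B": [5, 5, 15, 100, 400],
}

summaries = {
    name: five_number_summary(fans)
    for name, fans in fans_by_publication.items()
}
for name, summary in summaries.items():
    print(name, summary)
```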
. . .
Scatterplots
The scatterplot is the heart of most analyses. It allows us to see the evolution of a
variable over time or the relationship between two (or more) variables.
Time-Series
A considerable portion of real-world data has a time element. Luckily, plotly +
cufflinks was designed with time-series visualizations in mind. Let’s make a dataframe
of my TDS articles and look at how the trends have changed.
Here we are doing quite a few different things all in one line:
For more information, we can also add in text annotations quite easily:
tds_monthly_totals.iplot(
    mode='lines+markers+text',
    text=text,
    y='word_count',
    opacity=0.8,
    xTitle='Date',
    yTitle='Word Count',
    title='Total Word Count by Month')
df.iplot(
    x='read_time',
    y='read_ratio',
    # Specify the category
    categories='publication',
    xTitle='Read Time',
    yTitle='Reading Percent',
    title='Reading Percent vs Read Time by Publication')
Let’s get a little more sophisticated by using a log axis, specified as a plotly layout
(see the Plotly documentation for the layout specifics), and sizing the bubbles by a
numeric variable:
tds.iplot(
    x='word_count',
    y='reads',
    size='read_ratio',
    text=text,
    mode='markers',
    # Log xaxis
    layout=dict(
        xaxis=dict(type='log', title='Word Count'),
        yaxis=dict(title='Reads'),
        title='Reads vs Log Word Count Sized by Read Ratio'))
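Two transformations are implied by this call: a log axis places points by the logarithm of their value, and bubble sizing maps a numeric column onto a range of marker sizes. Here is a sketch of both in plain Python. The sample numbers are invented, and scale_sizes with its pixel range is an assumption for illustration, not how cufflinks actually scales markers.

```python
import math


def log_positions(values):
    """Positions on a log10 axis: equal ratios become equal distances."""
    return [math.log10(v) for v in values]


def scale_sizes(values, min_px=6, max_px=30):
    """Linearly map values onto a marker-size range in pixels (assumed range)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: use the middle of the range for every marker
        return [(min_px + max_px) / 2 for _ in values]
    return [min_px + (v - lo) / (hi - lo) * (max_px - min_px) for v in values]


word_counts = [100, 1000, 10000]  # hypothetical word counts
read_ratios = [20, 35, 50]        # hypothetical read ratios (%)

x_positions = log_positions(word_counts)
marker_sizes = scale_sizes(read_ratios)
print(x_positions, marker_sizes)
```

On the log axis, the jump from 100 to 1,000 words takes up the same distance as the jump from 1,000 to 10,000, which keeps a handful of very long articles from squashing everything else against the left edge.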
With a little more work (see notebook for details), we can even put four variables (this
is not advised) on one graph!
df.pivot_table(
    values='views', index='published_date',
    columns='publication').cumsum().iplot(
        mode='markers+lines',
        size=8,
        symbol=[1, 2, 3, 4, 5],
        layout=dict(
            xaxis=dict(title='Date'),
            yaxis=dict(type='log', title='Total Views'),
            title='Total Views over Time by Publication'))
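The cumsum() step is what turns per-date views into running totals, one column per publication. A tiny sketch of the same idea with the standard library, using made-up view counts and placeholder publication names:

```python
from itertools import accumulate

# Hypothetical views per date for two publications, already in date order
views_by_publication = {
    "Publication A": [120, 0, 45, 300],
    "Publication B": [10, 10, 10, 10],
}

# Running total per publication, analogous to a column-wise cumulative sum
total_views = {
    name: list(accumulate(views))
    for name, views in views_by_publication.items()
}
print(total_views["Publication A"])  # [120, 120, 165, 465]
```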
See the notebook or the documentation for more examples of added functionality. We
can add in text annotations, reference lines, and best-fit lines to our plots with a single
line of code, and still with all the interaction.
. . .
Advanced Plots
Now we’ll get into a few plots that you probably won’t use all that often, but which can
be quite impressive. We’ll use the plotly figure_factory to keep even these incredible
Scatter Matrix
When we want to explore relationships among many variables, a scatter matrix (also
called a splom) is a great option:
import plotly.figure_factory as ff

figure = ff.create_scatterplotmatrix(
    df[['claps', 'publication', 'views',
        'read_ratio', 'word_count']],
    diag='histogram',
    index='publication')
Correlation Heatmap
To visualize the correlations between numeric variables, we calculate the correlations
and then make an annotated heatmap:
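# Pearson correlations; on pandas >= 2.0, non-numeric columns such as
# 'publication' must be excluded first (e.g. df.corr(numeric_only=True))
corrs = df.corr()
figure = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.index),
    annotation_text=corrs.round(2).values,
    showscale=True)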
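The z values handed to the heatmap are just a square matrix of pairwise correlations (Pearson, which is the pandas default). As a sketch of what that matrix holds, here is the same calculation in plain Python on made-up columns; pearson is a hypothetical helper written for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numeric columns, chosen so the correlations are exactly +1 or -1
columns = {
    "claps": [10, 20, 30, 40],
    "views": [100, 200, 300, 400],
    "read_ratio": [40, 30, 20, 10],
}

names = list(columns)
# Row i, column j holds the correlation between names[i] and names[j]
z = [[round(pearson(columns[a], columns[b]), 2) for b in names] for a in names]
for name, row in zip(names, z):
    print(name, row)
```

The diagonal is always 1 and the matrix is symmetric, so only one triangle of the heatmap carries new information.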
The list of plots goes on and on. Cufflinks also has several themes we can use to get
completely different styling with no effort. For example, below we have a ratio plot in
the “space” theme and a spread plot in “ggplot”:
For those who are so inclined, you can even make a pie chart:
With everything mentioned here, we are still not exploring the full capabilities of the
library! I’d encourage you to check out both the plotly and the cufflinks documentation
for more incredible graphics.
Conclusions
The worst part about the sunk-cost fallacy is that you only realize how much time you’ve
wasted after you’ve quit the endeavor. Fortunately, now that I’ve made the mistake of
sticking with matplotlib for too long, you don’t have to!
When thinking about plotting libraries, there are a few things we want:
As of right now, the best option for doing all of these in Python is plotly. Plotly allows us
to make visualizations quickly and helps us get better insight into our data through
interactivity. Also, let’s admit it, plotting should be one of the most enjoyable parts of
data science! With other libraries, plotting turned into a tedious task, but with plotly,
there is again joy in making a great figure!
Now that it’s 2019, it is time to upgrade your Python plotting library for better
efficiency, functionality, and aesthetics in your data science visualizations.
. . .