Pandas Tutorial 1: Pandas Basics (Reading Data Files, Dataframes, Data Selection)
Pandas is one of the most popular Python libraries for Data Science and
Analytics. I like to say it’s the “SQL of Python.” Why? Because pandas helps you
to manage two-dimensional data tables in Python. Of course, it has many more
features. In this pandas tutorial series, I’ll show you the most important (that is,
the most often used) things that you have to know as an Analyst or a Data
Scientist. This is the first episode and we will start from the basics!
Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!
Before we start
If you haven’t done so yet, I recommend going through these articles first:
2. Next step: log in to your server and fire up Jupyter. Then open a new
Jupyter Notebook in your favorite browser. (If you don’t know how to do
that, I really do recommend going through the articles I linked in the “Before
we start” section.)
Note: I’ll also rename my Jupyter Notebook to “pandas_tutorial_1”.
Firing up Jupyter Notebook
3. Import numpy and pandas to your Jupyter Notebook by running these two
lines in a cell:
import numpy as np
import pandas as pd
Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the
end of your import statement, your Jupyter Notebook understands that from this
point on every time you type pd , you are actually referring to the pandas library.
Okay, now we have everything! Let’s start with this pandas tutorial!
The first question is: what kinds of data structures does pandas use? There are two main ones: the Series and the DataFrame.
Series: a pandas Series is a one-dimensional data structure (a "one-dimensional ndarray") that can store values, and for every value it holds a unique index, too.
Pandas Series example
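If you want to see what a Series looks like in practice, here's a minimal sketch (the values below are made up just for illustration): pandas automatically assigns a unique index (0, 1, 2, 3) to each value.
pd.Series([500, 600, 550, 300])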
DataFrame: a pandas DataFrame is a two- (or more) dimensional data structure, basically a table with rows and columns.
In this pandas tutorial, I'll focus mostly on DataFrames. The reason is simple: most of the analytical methods I will talk about will make more sense in a 2D data table than in a 1D array.
Let's start with a simple demo data set called zoo! This time – for the sake of practicing – you will create a .csv file for yourself! Here's the raw data:
animal,uniq_id,water_need
elephant,1001,500
elephant,1002,600
elephant,1003,550
tiger,1004,300
tiger,1005,320
tiger,1006,330
tiger,1007,290
tiger,1008,310
zebra,1009,200
zebra,1010,220
zebra,1011,240
zebra,1012,230
zebra,1013,220
zebra,1014,100
zebra,1015,80
lion,1016,420
lion,1017,600
lion,1018,500
lion,1019,390
kangaroo,1020,410
kangaroo,1021,430
kangaroo,1022,410
Go back to your Jupyter Home tab and create a new text file…
…then copy-paste the above zoo data into this text file and save (rename) it as zoo.csv…
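Back in your notebook, you can then load it into pandas with the read_csv function. A minimal sketch (the comma delimiter matches the raw data above):
pd.read_csv('zoo.csv', delimiter=',')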
And there you go! This is the zoo.csv data file, brought to pandas. This nice 2D
table? Well, this is a pandas dataframe. The numbers on the left are the indexes.
And the column names on the top are picked up from the first row of our
zoo.csv file.
To be honest, though, you will probably never create a .csv data file for yourself,
like we just did… you will use pre-existing data files. So you have to learn how to
download .csv files to your server!
If you are here from the Junior Data Scientist’s First Month video course then
you have already dealt with downloading your .txt or .csv data files to your data
server, so you must be pretty proficient in it… But if you are not here from the
course (or if you want to learn another way to download a .csv file to your
server and to get another exciting dataset), follow these steps:
I’ve uploaded a small sample dataset here: DATASET
(Link: 46.101.230.157/dilan/pandas_tutorial_read.csv)
If you click the link, the data file will be downloaded to your computer. But you
don’t want to download this data file to your computer, right? You want to
download it to your server and then load it to your Jupyter Notebook. It only
takes two steps.
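STEP 1) Download the file to your server. The exact command isn't shown here, so take this as an assumption on my part: one common way is to run wget straight from a Jupyter cell (the ! tells Jupyter to run it as a shell command):
!wget 46.101.230.157/dilan/pandas_tutorial_read.csv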
STEP 2) Now, go back again to your Jupyter Notebook and use the same
read_csv function that we have used before (but don’t forget to change the file
name and the delimiter value):
pd.read_csv('pandas_tutorial_read.csv', delimiter=';')
And with that, we have finally loaded our .csv data into a pandas DataFrame!
Note 1: Just so you know, there is an alternative method. (I don’t prefer it though.)
You can load the .csv data using the URL directly. In this case the data won’t be
downloaded to your data server.
read the .csv directly from the server (using its URL)
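A sketch of that approach (the http:// prefix is my assumption; read_csv accepts URLs as well as local file names):
pd.read_csv('http://46.101.230.157/dilan/pandas_tutorial_read.csv', delimiter=';')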
Note 2: If you are wondering what’s in this data set – this is the data log of a travel
blog. This is a log of one day only (if you are a JDS course participant, you will get
much more of this data set on the last week of the course ;-)). I guess the names of the
columns are fairly self-explanatory.
The most basic way to get to know your data is to print your whole dataframe to your screen. Of course, you don't have to run the pd.read_csv() function again and again and again. Just store its output the first time you run it!
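For example, you can store it in a variable. I'll use the name article_read, since that's the name used in the rest of this article (the file name and delimiter are the ones from the read_csv call above):
article_read = pd.read_csv('pandas_tutorial_read.csv', delimiter=';')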
After that, you can call this article_read value anytime to print your DataFrame!
Sometimes, it's handy not to print the whole dataframe and flood your screen
with data. When a few lines are enough, you can print only the first 5 lines – by
typing:
article_read.head()
This one is a bit tricky! Let’s say you want to print the ‘country’ and the ‘user_id’
columns only.
You should use this syntax:
article_read[['country', 'user_id']]
Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer bracket
frames tell pandas that you want to select columns, and the inner brackets are
for the list (remember? Python lists go between bracket frames) of the column
names.
By the way, if you change the order of the column names, the order of the
returned columns will change, too:
article_read[['user_id', 'country']]
This is the DataFrame of your selected columns.
Note: Sometimes (especially in predictive analytics projects), you want to get Series
objects instead of DataFrames. You can get a Series using any of these two syntaxes
(and selecting only one column):
article_read.user_id
article_read['user_id']
If the previous one was a bit tricky, this one will be really tricky!
Let’s say, you want to see a list of only the users who came from the ‘SEO’
source. In this case you have to filter for the ‘SEO’ value in the ‘source’ column:
article_read[article_read.source == 'SEO']
STEP 1) First, between the bracket frames it evaluates every row: is the
article_read.source column's value 'SEO' or not? The results are boolean values
( True or False ). (You can see these values on their own in the small example right after these two steps.)
STEP 2) Then from the article_read table, it returns (and prints) every row where this value is
True and leaves out every row where it's False .
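If you're curious about those boolean values, you can run the inner expression on its own; it returns a pandas Series of True/False values, one per row:
article_read.source == 'SEO'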
Does it look over-complicated? Maybe. But this is the way it is, so let’s just learn
it because you will use this a lot!
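You can also chain these methods after each other. For instance, a line that does what the next paragraph describes (take the first five rows, then keep two columns) would look like this:
article_read.head()[['country', 'user_id']]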
This line first selects the first 5 rows of our data set. And then it takes only the
‘country’ and the ‘user_id’ columns.
Could you get the same result with a different chain of functions? Of course
you can:
article_read[['country', 'user_id']].head()
In this version, you select the columns first, then take the first five rows. The
result is the same – the order of the functions (and the execution) is different.
One more thing: what happens if you replace the article_read value with the original read_csv() function?
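Something like this, for instance (a sketch, reusing the file name and delimiter from above):
pd.read_csv('pandas_tutorial_read.csv', delimiter=';')[['country', 'user_id']].head()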
This will work, too – only it's ugly (and inefficient). But it's really important that
you understand that working with pandas is nothing but applying the right
functions and methods, one by one.
Test yourself!
As always, here’s a short assignment to test yourself! Solve it, so the content of
this article can sink in better!
Select the user_id, the country and the topic columns for the users who are
from country_2! Print the first five rows only!
.
.
.
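And here's one possible solution, written as a single chained line (it matches the step-by-step explanation at the end of this section):
article_read[article_read.country == 'country_2'][['user_id', 'topic', 'country']].head()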
Or, to be more transparent, you can break this into more lines:
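A sketch of that version, using intermediate variable names of my own choosing:
step_1 = article_read[article_read.country == 'country_2']
step_2 = step_1[['user_id', 'topic', 'country']]
step_2.head()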
Either way, the logic is the same. First you take your original dataframe
( article_read ), then you filter for the rows where the country value is country_2
( [article_read.country == 'country_2'] ), then you take the three columns that were
required ( [['user_id','topic', 'country']] ) and finally you take the first five rows
only ( .head() ).
Conclusion
You are done with the first episode of my pandas tutorial series! Great job! In
the next article, you can learn more about the different aggregation methods
(e.g. sum, mean, max, min) and about grouping (so basically about
segmentation). Stay with me: Pandas Tutorial, Episode 2!
If you want to learn more about how to become a data scientist, take my 50-
minute video course: How to Become a Data Scientist. (It’s free!)
Also check out my 6-week online course: The Junior Data Scientist’s First
Month video course.
Cheers,
Tomi Mester