An Introduction To Data Analysis Free Version
1 Introduction
There is a lot of hype about data analysts in the media today, and rightfully so. Many companies
are looking to recruit people who can analyse large quantities of data. Over the past years, the data
that these companies have managed to collect has grown very quickly. Hidden deep in these vast
amounts of data are insights that can help managers make better decisions. As such, companies
are turning to data analysts to help them uncover these insights.
Data analysts need two skills. First, they need to write code. This does not mean that they need to
be programmers. A common mistake is thinking that data analysts need to come from a computer
science background. This is not true. Today, there are many libraries and packages that make the
job of the analyst much easier when it comes to writing code. Most courses on data analysis start
by providing the student with a background in programming. This course is different. The student
will learn the coding as we go along. The course will focus on analytical skills. The coding will be
introduced when needed, and even then, we will rely on libraries instead of writing complicated
code.
Another skill that data analysts need is statistics. Unfortunately, it has been my experience that
many data analysts do not have a good background in statistics. To them, data analysis is just
about plotting graphs and training models. The result is that the work done by these analysts
does not make much sense. The only way to know which tool to use is to have a good grasp of the tools themselves. This requires an understanding of statistics. Fortunately, you do not need to be
a mathematician to understand statistics, at least not at the level required of a data analyst. Just
like the coding part, this course will introduce the students to the various statistical concepts as
we move along. In other words, as the course progresses, we will be talking about coding and
statistics at the same time while introducing new concepts when they are needed.
The goal of this course is to provide the beginning students with the necessary foundation that will
allow them to start analysing data on their own using python. This course assumes no previous
knowledge of python, statistics, or data structures. This course will not make you an expert. No
single course can take you from A to Z, although many courses claim to do that. Instead, if you
are thinking about pursuing a career as a data analyst, then this course is a good place to start.
Concepts will be explained in a simple way by relying on examples. Concepts will be introduced
on a need-to-know basis. There is no separate chapter about python. There is no separate chapter
about statistics. Instead, the student will learn both at the same time. The further you go in this
course, the more you will learn about python and statistics.
2 What You Need to Install
In order to follow this course you will need to be able to execute the code. You can do that using
Anaconda. You have two options: either install Anaconda on your computer and work locally, or work in the cloud without installing anything. Working in the cloud is the easiest option. You do not need to install or configure anything. All you have to do is go to the website
https://www.anaconda.com/code-in-the-cloud, create an account, and start coding. That’s all.
If you would rather download what you need and work on your computer, you can also do that
easily. Go to https://www.anaconda.com/products/distribution and download and install Ana-
conda. Once you do that, start Anaconda by searching for “Anaconda” in your computer search
bar. The symbol is a green hollow circle. Click on it and wait a little as your computer launches it.
You should then see a screen similar to the one shown in the figure below.
In this navigator, launch JupyterLab and wait a few seconds. The notebook should open in your
web browser.
3 Libraries
On your local computer, you can install libraries by going to the Environments tab in Anaconda
navigator, as shown in the figure below.
Here you can see a list of installed libraries. Notice that Anaconda comes with a long list of
libraries that are already installed. In the event that you want to use a library that is not installed,
you can simply choose the Not installed selection from the drop down box and then search for the
library that you wish to install using the search bar on the right. Select that library and then click
on Apply. This will install the library. Once a library is installed you can import it and use it in
your notebook. We will be importing the pandas library soon since it is the main library that we will use. To see that this library is already installed, select the Installed selection in the drop down
box in Anaconda navigator and search for pandas. You should see pandas in the list.
In python, whenever you want to use a library, you will have to import it. This is accomplished using the import command. In the command below, we tell python that we want to import the pandas library and that we want to call it pd. We could have just typed import pandas, but since whenever we use a library we will have to type its name, it is much easier to use a shorter name:
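[1]: import pandas as pd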
Once you execute the above code (press control+enter), you will have access to all the useful functions of this library. Let us use one of these useful functions to read a data set.
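We read the data set with a command of the following form, where the input to the function is the web address of the CSV file (abbreviated here):
[2]: first_data_set = pd.read_csv("https://data.cityofchicago.org/...")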
The read_csv() function is a very useful function that is part of the pandas library. Notice that to
use a function in a library, you will have to type the name of the library (or the short alias that we
chose for it), followed by a ‘.’, and then followed by the function that you want to use. In our case
the function is read_csv(). From the name, we know that this function allows us to read a data set
that is in CSV format. The data set is stored in the web address that we provided to the function
as input (inside the brackets). So basically, we are using the pandas library to download a data set
that is stored as a CSV file.
In python, to save something we simply assign it to a variable. In the above command, the
read_csv() function is going to return a data set. We saved the data set into the variable
first_data_set. We now use this variable to refer to the data set. The data set has been stored as a pandas dataframe that is called first_data_set. We can now use many functions associated with pandas dataframes in order to look at and analyse the data. One of these functions is the
head() function.
[3]: first_data_set.head()
Updated On Latitude Longitude Location
0 01/03/2023 03:46:28 PM 41.990846 -87.666096 (41.990846423, -87.666096144)
1 01/03/2023 03:46:28 PM 41.780331 -87.684892 (41.780330681, -87.684891779)
2 01/03/2023 03:46:28 PM 41.722303 -87.623745 (41.722303228, -87.623745129)
3 01/03/2023 03:46:28 PM 41.805347 -87.725961 (41.805347066, -87.725961264)
4 01/03/2023 03:46:28 PM 41.860250 -87.646715 (41.860249838, -87.64671467)
[5 rows x 22 columns]
The head() function displays the first five rows of the dataframe. This is useful because as of
yet we do not know what the data set contains. Looking at the output above, we see that the
data set has 22 columns. We also see the names of each column. Take a look at the column
named Primary Type. You can see that the data stored in this column describes crimes (ROBBERY,
THEFT, ASSAULT, etc.). It seems that this data set has something to do with crimes. We also see a column called FBI Code. There is also a column named Location. Perhaps this column provides the location where the crime took place.
A very important thing to notice here is that some of the columns have numerical values while others have text values. This is important because different columns have different types, and the kind of analysis that we can perform depends on the types of the columns. We can look
at the column names and types using the dtypes attribute:
[4]: first_data_set.dtypes
[4]: ID int64
Case Number object
Date object
Block object
IUCR object
Primary Type object
Description object
Location Description object
Arrest bool
Domestic bool
Beat int64
District int64
Ward float64
Community Area int64
FBI Code object
X Coordinate float64
Y Coordinate float64
Year int64
Updated On object
Latitude float64
Longitude float64
Location object
dtype: object
Notice that we called dtypes an attribute. As you can see from the command, there are no brackets after dtypes. Function names are followed by brackets, while attribute names are not. The dtypes
attribute of the dataframe lists the names of all columns and the type of each column. ID for ex-
ample is of type int64. This simply means an integer (a number with no decimal points). The Case
Number column is of type object which means that it is composed of text. The Arrest column is of
type bool which means that it takes on two values, True or False. The X Coordinate column is of
type float64 which is to say a number with a decimal point.
Let us take another look at the data set, but this time instead of displaying the top five rows, let us
display the last five rows so that we can introduce a new function:
[5]: first_data_set.tail()
Location
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
[5 rows x 22 columns]
We see that the ID number is an integer, the Case Number is a combination of letters and numbers,
the Arrest column takes on the values of True and False, and the X Coordinate column is numerical
with a decimal point. We can now understand what different data types are stored in each column.
So far we have looked at the first and last five rows. We now know that the dataframe contains
data about crimes. Actually, the data is about crimes in Chicago (https://data.cityofchicago.org),
so let us give the dataframe a more meaningful name.
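[6]: chicago_crimes = first_data_set.copy()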
Here, we used the copy() function in order to create a new copy of the dataframe and save it to the
variable chicago_crimes. From now on we will use this new variable since the name makes more
sense.
[7]: chicago_crimes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238317 entries, 0 to 238316
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 238317 non-null int64
1 Case Number 238317 non-null object
2 Date 238317 non-null object
3 Block 238317 non-null object
4 IUCR 238317 non-null object
5 Primary Type 238317 non-null object
6 Description 238317 non-null object
7 Location Description 237540 non-null object
8 Arrest 238317 non-null bool
9 Domestic 238317 non-null bool
10 Beat 238317 non-null int64
11 District 238317 non-null int64
12 Ward 238307 non-null float64
13 Community Area 238317 non-null int64
14 FBI Code 238317 non-null object
15 X Coordinate 233118 non-null float64
16 Y Coordinate 233118 non-null float64
17 Year 238317 non-null int64
18 Updated On 238317 non-null object
19 Latitude 233118 non-null float64
20 Longitude 233118 non-null float64
21 Location 233118 non-null object
dtypes: bool(2), float64(5), int64(5), object(10)
memory usage: 36.8+ MB
Let us look at the output of the info() function. At the top we see that the RangeIndex is 238317 entries. This is the number of rows in the data set. Below that we see the number of columns (22). Below that we see a list of the columns. In this list, we see the number of the column (#), the name of the column (Column), and the type of the column (Dtype). We also see something called Non-Null Count. This is an interesting column because it gives us the number of entries in each column that have a non-null value. But what is a null value? A null value is simply an empty value. It is like having an empty cell in an Excel sheet. It is usually the case that the data that we are dealing with is not complete. Not every single row will have all the information for every single column. Sometimes the information is missing, and the value is just empty; it is a null value. Therefore, the Non-Null Count is the number of entries in the column that are not null, i.e. not missing. For the column ID, for example, we see that there are 238317 non-null values. We know that the total number of rows in the dataframe is 238317. This means that for this column there are no missing values. Now look at the entry for the column X Coordinate. We see that there are 233118 non-null values. This means that there are 238317 - 233118 = 5199 null values.
As data analysts, it is very important that we understand how many missing values exist in our data. The above output tells us the number of non-null values. We can calculate the number of null values by subtracting the number of non-null values from the total number of rows, as we did for the column X Coordinate. To make things simpler, we can use pandas functions to calculate the number of missing values for each column directly, instead of calculating the number of non-missing values. To do that, we should first understand the isna() method:
[8]: chicago_crimes.isna()
... ... ... ... ... ... ... ...
238312 False False False False False False False
238313 False False False False False False False
238314 False False False False False False False
238315 False False False False False False False
238316 False False False False False False False
Longitude Location
0 False False
1 False False
2 False False
3 False False
4 False False
... ... ...
238312 False False
238313 False False
238314 False False
238315 False False
238316 False False
Looking at the output above, we see that all cells are now either True or False. This is because the
isna() method displays a False when the cell is not missing (it has a value) and a True if the cell is
null, i.e. empty. But how is this useful? Well, it is not useful by itself, but it becomes useful when
we combine it with another function:
[9]: chicago_crimes.isna().sum()
[9]: ID 0
Case Number 0
Date 0
Block 0
IUCR 0
Primary Type 0
Description 0
Location Description 777
Arrest 0
Domestic 0
Beat 0
District 0
Ward 10
Community Area 0
FBI Code 0
X Coordinate 5199
Y Coordinate 5199
Year 0
Updated On 0
Latitude 5199
Longitude 5199
Location 5199
dtype: int64
The sum() function is used to simply add all values in a column together. True is evaluated as 1
and False is evaluated as 0. So when we add the True and False values, we get the number of True
values, and True represents a null-value (missing value). Looking at the output we see that for the
ID column the sum is 0. This means that all values were False, so we were just adding zeros. This
column has no null values. Now look at the column X Coordinate. The sum is 5199. This means that there were 5199 True values after using the isna() function. Each one of these True values was treated as a 1. So when we added them, we got the total number of missing values.
As a summary, the isna() function returns the dataframe with True for missing values and False
for non-missing values. The sum() method adds all values in each column together, with True
being 1 and False being 0. Therefore, the sum for each column gives us the total number of null
values in that column. In our case, some columns have no null values, while other columns have
some null values.
So now we know how many rows are in the dataframe. We also know how many null values are
in each of the columns. Let us now start looking at specific rows or specific columns.
5.1 Zooming in to a specific row
We know by now that the head() function displays the first five rows while the tail() function
displays the last five rows. What if I wanted to look at a specific row? Let us look at the tenth row
for example.
[10]: chicago_crimes.loc[10]
[10]: ID 12798495
Case Number JF361883
Date 08/18/2022 06:00:00 PM
Block 030XX E 80TH ST
IUCR 2820
Primary Type OTHER OFFENSE
Description TELEPHONE THREAT
Location Description RESIDENCE
Arrest False
Domestic True
Beat 422
District 4
Ward 7.0
Community Area 46
FBI Code 08A
X Coordinate 1197715.0
Y Coordinate 1852504.0
Year 2022
Updated On 01/03/2023 03:46:28 PM
Latitude 41.750117
Longitude -87.551051
Location (41.75011688, -87.551050773)
Name: 10, dtype: object
The loc[] method allows us to look at a specific row. Here we used it to retrieve the information in row number 10. Notice that pandas displays the output vertically because this makes looking at the values much easier. It is important to note that pandas starts numbering rows
with zero. So the first row is assigned the number 0, not 1. This means that when we look at the
row with number 10, we are actually looking at the 11th row. To look at the 10th row we would
use the following:
[11]: chicago_crimes.loc[9]
[11]: ID 12793378
Case Number JF355848
Date 08/13/2022 10:45:00 PM
Block 118XX S SANGAMON ST
IUCR 0560
Primary Type ASSAULT
Description SIMPLE
Location Description STREET
Arrest False
Domestic True
Beat 524
District 5
Ward 34.0
Community Area 53
FBI Code 08A
X Coordinate 1172121.0
Y Coordinate 1826361.0
Year 2022
Updated On 01/03/2023 03:46:28 PM
Latitude 41.678977
Longitude -87.645603
Location (41.678976937, -87.645603032)
Name: 9, dtype: object
We can even use this method to look at more than one row. Using the “:” we can display the rows
numbered 10, 11, and 12:
[12]: chicago_crimes.loc[10:12]
Location
10 (41.75011688, -87.551050773)
11 (41.81834934, -87.622623461)
12 (41.922325648, -87.75285662)
[3 rows x 22 columns]
Here we used the loc[] method to display the three consecutive rows which are numbered 10, 11,
and 12. The “:” is used to tell the method that we want to display a series of rows starting with
the number to the left of the “:” and ending with the number to the right of the “:”.
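5.2 Zooming in to a specific column
We can also zoom in to a specific column instead of a row. To display a single column, we type the name of the column between square brackets:
[13]: chicago_crimes["Primary Type"]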
[13]: 0 ROBBERY
1 THEFT
2 THEFT
3 ASSAULT
4 THEFT
...
238312 DECEPTIVE PRACTICE
238313 CRIMINAL DAMAGE
238314 BURGLARY
238315 THEFT
238316 THEFT
Name: Primary Type, Length: 238317, dtype: object
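What if we wanted to display more than one column at the same time? For example, we can display the columns Primary Type and Description together (this pair of columns is just an illustration):
[14]: chicago_crimes[["Primary Type", "Description"]]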
Notice that we used double square brackets. This is a very important point because it has to do with an important data structure in python: arrays.
5.2.1 Arrays
An array is a collection of items. Certain methods require an input, and this input can be a single value or a collection of values. An array is how we group items together so that we can use them in a method. When we wanted to display only the column Primary Type, we just typed "Primary Type" between the brackets. But when we want to display more than one column, we have to send a collection of the column names. This collection is the array. To collect items inside an array, we use the square brackets [] (in python, this structure is known as a list). So to create an array that contains the names of two columns, we write both names between square brackets:
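For example, sticking with the same two columns:
[15]: my_array = ["Primary Type", "Description"]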
We can then use this as the input to display the columns specified above:
[16]: chicago_crimes[my_array]
If we wanted to do this in one step, i.e. without creating an array, we can do the following:
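[17]: chicago_crimes[["Primary Type", "Description"]]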
Let us take another example. If we wanted to display only the column Date, then we can use the
following:
[18]: chicago_crimes["Date"]
If we also wanted to display the column Block as well, then we need to combine both columns in
an array and use that array as the input:
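[19]: chicago_crimes[["Date", "Block"]]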
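We can also combine column selection with the loc[] method. For instance, the following sketch displays the rows numbered 15 through 20 of the columns Date and Description:
[21]: chicago_crimes[["Date", "Description"]].loc[15:20]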
[21]: Date Description
15 08/18/2022 02:25:00 PM OVER $500
16 08/18/2022 10:30:00 PM SIMPLE
17 08/10/2022 10:55:00 PM ARMED - HANDGUN
18 08/14/2022 01:45:00 PM RETAIL THEFT
19 08/14/2022 10:00:00 PM OVER $500
20 08/13/2022 01:30:00 AM AUTOMOBILE
Now that we know how to zoom into a column, let us count the number of null values in a specific
column. Previously, we saw how to count the number of null values for all columns, like so:
[22]: chicago_crimes.isna().sum()
[22]: ID 0
Case Number 0
Date 0
Block 0
IUCR 0
Primary Type 0
Description 0
Location Description 777
Arrest 0
Domestic 0
Beat 0
District 0
Ward 10
Community Area 0
FBI Code 0
X Coordinate 5199
Y Coordinate 5199
Year 0
Updated On 0
Latitude 5199
Longitude 5199
Location 5199
dtype: int64
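To count the null values in one specific column, we can zoom in to that column before applying the two functions:
[23]: chicago_crimes["Location Description"].isna().sum()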
[23]: 777
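We can do the same for a subset of columns at once:
[24]: chicago_crimes[["Date", "Location Description", "Ward"]].isna().sum()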
[24]: Date 0
Location Description 777
Ward 10
dtype: int64
This brings us to a very important point. When we choose a subset of columns using the
square brackets [], we are still dealing with a dataframe. This means that the functions that
we used on the whole dataframe can also be used on a subset of the dataframe. This is why
the isna() function worked here. The variable chicago_crimes is a pandas dataframe. The
command chicago_crimes[["Date", "Location Description", "Ward"]] also returns a pan-
das dataframe. If you want to double check this, then create a new variable that we will use
to store the subset of the data:
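The variable name below is just an example:
[25]: my_subset = chicago_crimes[["Date", "Location Description", "Ward"]]
type(my_subset)
Checking the type confirms that the subset is itself a pandas dataframe.
Let us now see how to delete rows from a dataframe. We can do that using the drop() method: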
[26]: chicago_crimes.drop(3)
2 0810 THEFT OVER $500
4 0820 THEFT $500 AND UNDER
5 0810 THEFT OVER $500
... ... ... ...
238312 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238313 1320 CRIMINAL DAMAGE TO VEHICLE
238314 0620 BURGLARY UNLAWFUL ENTRY
238315 0810 THEFT OVER $500
238316 0810 THEFT OVER $500
238316 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
Notice that we no longer have a row that is numbered 3. Let us now use the head() function to
display the first five rows:
[27]: chicago_crimes.head()
4 01/03/2023 03:46:28 PM 41.860250 -87.646715 (41.860249838, -87.64671467)
[5 rows x 22 columns]
Row number 3 is still there. What happened? What happened is that we did not “save”
the drop action. We did not overwrite the chicago_crimes variable. When we used the
chicago_crimes.drop(3) command, the result was the dataframe without row number 3, but
the variable chicago_crimes is still the same. If we wanted to save our work, then we have two options: we can tell drop() to modify the dataframe in place, or we can assign the result back to the variable ourselves. The first option is the following:
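[28]: chicago_crimes.drop(3, inplace=True)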
In the above command, we are telling drop() to remove row number 3, and we are also telling it
to set the parameter inplace to True, which means that we want to save the returned dataframe in
place of the old one. Let is now look at the dataframe:
[30]: chicago_crimes.head()
[5 rows x 22 columns]
We now see that the row which is numbered 3 is no longer part of the dataframe.
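The second option is to assign the result of drop() back to the same variable ourselves, with a command along the lines of chicago_crimes = chicago_crimes.drop(3). Both options overwrite the old dataframe with the new one.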
We can also use the drop() method to delete more than one row:
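[31]: chicago_crimes.drop([4, 5])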
238312 12936285 JF526139 06/27/2022 10:05:00 AM
238313 12936301 JF526810 12/22/2022 06:00:00 PM
238314 12936397 JF526745 12/19/2022 02:00:00 PM
238315 12935341 JF525383 12/20/2022 06:45:00 AM
238316 12938501 JF523997 12/26/2022 10:30:00 PM
Description \
0 VEHICULAR HIJACKING
1 OVER $500
2 OVER $500
6 SIMPLE
7 OVER $500
... ...
238312 FINANCIAL IDENTITY THEFT OVER $ 300
238313 TO VEHICLE
238314 UNLAWFUL ENTRY
238315 OVER $500
238316 OVER $500
1 66 06 1161110.0 1863210.0 2022
2 49 06 1177962.0 1842197.0 2022
6 8 08B 1174910.0 1908582.0 2022
7 42 06 1187557.0 1864569.0 2022
... ... ... ... ... ...
238312 7 11 1170513.0 1917030.0 2022
238313 5 14 1161968.0 1923233.0 2022
238314 4 05 1158237.0 1929586.0 2022
238315 29 06 1158071.0 1894595.0 2022
238316 2 06 1160681.0 1942466.0 2022
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
6 (41.904538325, -87.632942313)
7 (41.783471704, -87.587890396)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
Notice that we now use the square brackets []. As you recall, when we want to pass more than one value, we need to combine the values into an array. We can see from the output that the rows numbered 4 and 5 are no longer there (in addition to the row numbered 3, which we had previously deleted). Remember, the dataframe chicago_crimes still contains rows 4 and 5, since we did not save this drop action by using the inplace parameter.
We can also delete columns using the drop() function:
[32]: chicago_crimes.drop(columns="ID")
[32]: Case Number Date Block IUCR \
0 JF350580 08/09/2022 04:07:00 PM 014XX W ELMDALE AVE 0325
1 JF352712 08/10/2022 04:00:00 PM 062XX S ARTESIAN AVE 0810
2 JF352659 08/11/2022 10:00:00 AM 094XX S STATE ST 0810
4 JF359058 08/16/2022 04:10:00 PM 015XX S HALSTED ST 0820
5 JF359932 08/15/2022 03:00:00 PM 075XX S PHILLIPS AVE 0810
... ... ... ... ...
238312 JF526139 06/27/2022 10:05:00 AM 025XX N HALSTED ST 1153
238313 JF526810 12/22/2022 06:00:00 PM 020XX W CORNELIA AVE 1320
238314 JF526745 12/19/2022 02:00:00 PM 044XX N ROCKWELL ST 0620
238315 JF525383 12/20/2022 06:45:00 AM 027XX W ROOSEVELT RD 0810
238316 JF523997 12/26/2022 10:30:00 PM 021XX W DEVON AVE 0810
238313 32.0 5 14 1161968.0 1923233.0 2022
238314 47.0 4 05 1158237.0 1929586.0 2022
238315 28.0 29 06 1158071.0 1894595.0 2022
238316 50.0 2 06 1160681.0 1942466.0 2022
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
Notice that we instructed the function to drop columns rather than rows by using the columns parameter. We can also drop more than one column. As you should know by now, to do that we need to pass the names as an array:
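[33]: chicago_crimes.drop(columns=["ID", "Date", "Block"])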
238314 12936397 JF526745 0620 BURGLARY
238315 12935341 JF525383 0810 THEFT
238316 12938501 JF523997 0810 THEFT
Description \
0 VEHICULAR HIJACKING
1 OVER $500
2 OVER $500
4 $500 AND UNDER
5 OVER $500
... ...
238312 FINANCIAL IDENTITY THEFT OVER $ 300
238313 TO VEHICLE
238314 UNLAWFUL ENTRY
238315 OVER $500
238316 OVER $500
4 2022 01/03/2023 03:46:28 PM 41.860250 -87.646715
5 2022 01/03/2023 03:46:28 PM 41.758253 -87.565147
... ... ... ... ...
238312 2022 01/03/2023 03:46:28 PM 41.927817 -87.648846
238313 2022 01/03/2023 03:46:28 PM 41.945022 -87.680072
238314 2022 01/03/2023 03:46:28 PM 41.962532 -87.693611
238315 2022 01/03/2023 03:46:28 PM 41.866517 -87.695179
238316 2022 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
The above commands have not been saved. Remember, we were just viewing the result after each drop; we did not save these actions. To make sure, display the first five rows of the dataframe:
[34]: chicago_crimes.head()
4 ... 11.0 28 06 1171290.0 1892413.0 2022
5 ... 7.0 43 06 1193842.0 1855434.0 2022
[5 rows x 22 columns]
The columns ID, Date, and Block are still there. If we wanted to drop these columns and overwrite the dataframe, then we need to use the inplace parameter:
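[35]: chicago_crimes.drop(columns=["ID", "Date", "Block"], inplace=True)
chicago_crimes.head()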
Longitude Location
0 -87.666096 (41.990846423, -87.666096144)
1 -87.684892 (41.780330681, -87.684891779)
2 -87.623745 (41.722303228, -87.623745129)
4 -87.646715 (41.860249838, -87.64671467)
5 -87.565147 (41.758252785, -87.565147001)
At this point I would like to reload the original data set, since we were only deleting rows and columns as an illustration. How do we do that? We can simply re-download the data set using the read_csv() function from the pandas package (the address below stands for the same CSV file we loaded earlier):
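[36]: chicago_crimes = pd.read_csv("https://data.cityofchicago.org/...")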
chicago_crimes.head()
[5 rows x 22 columns]
A more useful thing to do would be to look at rows that satisfy a certain condition. This is perhaps one of the most common tasks that a data analyst performs.
As an example, let us consider the type of crime. We know that in our data set there is a column called Primary Type:
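[37]: chicago_crimes["Primary Type"]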
[37]: 0 ROBBERY
1 THEFT
2 THEFT
3 ASSAULT
4 THEFT
...
238312 DECEPTIVE PRACTICE
238313 CRIMINAL DAMAGE
238314 BURGLARY
238315 THEFT
238316 THEFT
Name: Primary Type, Length: 238317, dtype: object
We see that the types of crimes vary. What if we just wanted to look at the records that were of
type THEFT? This is possible using the following command:
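[38]: chicago_crimes[chicago_crimes["Primary Type"]=="THEFT"]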
238315 027XX W ROOSEVELT RD 0810 THEFT OVER $500
238316 021XX W DEVON AVE 0810 THEFT OVER $500
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
7 (41.783471704, -87.587890396)
... ...
238307 (41.884476226, -87.631310613)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
There are several things to explain about the above command. First, the double “=” sign. In
programming languages, we use a double “=” sign to check if two things are equal. So to check if
2 is equal to 2 we type:
[39]: 2==2
[39]: True
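The same comparison can be applied to an entire column. In that case, the check is performed for every row of that column:
[40]: chicago_crimes["Primary Type"]=="THEFT"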
[40]: 0 False
1 True
2 True
3 False
4 True
...
238312 False
238313 False
238314 False
238315 True
238316 True
Name: Primary Type, Length: 238317, dtype: bool
Notice the output. We have a True or a False for each row. True is for rows that have a value of
“THEFT” in the column Primary Type, and False is for rows that have some other values. To see
this, let us look at the first five rows:
[41]: chicago_crimes.head()
Primary Type Description Location Description Arrest Domestic \
0 ROBBERY VEHICULAR HIJACKING STREET True False
1 THEFT OVER $500 STREET False False
2 THEFT OVER $500 STREET False True
3 ASSAULT SIMPLE RESIDENCE False False
4 THEFT $500 AND UNDER SIDEWALK False False
[5 rows x 22 columns]
See the primary type values? We have ROBBERY, THEFT, THEFT, ASSAULT, THEFT, which results in False, True, True, False, True.
We then use this to pick the subset of our dataframe. When the value is True, then the correspond-
ing row will be part of the subset. When the value is False then the row will not be part of the
subset:
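[42]: chicago_crimes[chicago_crimes["Primary Type"]=="THEFT"]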
4 015XX S HALSTED ST 0820 THEFT $500 AND UNDER
5 075XX S PHILLIPS AVE 0810 THEFT OVER $500
7 061XX S PARK SHORE EAST CT 0810 THEFT OVER $500
... ... ... ... ...
238307 001XX W RANDOLPH ST 0860 THEFT RETAIL THEFT
238310 042XX N MARINE DR 0810 THEFT OVER $500
238311 064XX W Irving Park Rd 0860 THEFT RETAIL THEFT
238315 027XX W ROOSEVELT RD 0810 THEFT OVER $500
238316 021XX W DEVON AVE 0810 THEFT OVER $500
238316 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
7 (41.783471704, -87.587890396)
... ...
238307 (41.884476226, -87.631310613)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
We can save this result in a new dataframe that contains only theft crimes. Remember, a subset of
a dataframe is a dataframe:
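Here, theft_crimes is just an example name for the new variable:
[43]: theft_crimes = chicago_crimes[chicago_crimes["Primary Type"]=="THEFT"]
theft_crimes.head()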
5 01/03/2023 03:46:28 PM 41.758253 -87.565147 (41.758252785, -87.565147001)
7 01/03/2023 03:46:28 PM 41.783472 -87.587890 (41.783471704, -87.587890396)
[5 rows x 22 columns]
Let us look at another example. What if we wanted a dataframe that contained only those crimes that resulted in an arrest? To do that, we might type the following:
[44]: chicago_crimes[chicago_crimes["Arrest"]=="True"]
[0 rows x 22 columns]
This seems strange. Is it possible that the data set does not contain any crime that resulted in an arrest? Actually, we have made a simple but serious mistake. In the condition above, we are checking to see if the value of Arrest is equal to the text “True”, but if you remember well, this column is of type bool (which means it is either True or False). Let us see this:
[45]: chicago_crimes.dtypes
[45]: ID int64
Case Number object
Date object
Block object
IUCR object
Primary Type object
Description object
Location Description object
Arrest bool
Domestic bool
Beat int64
District int64
Ward float64
Community Area int64
FBI Code object
X Coordinate float64
Y Coordinate float64
Year int64
Updated On object
Latitude float64
Longitude float64
Location object
dtype: object
Since we are just interested in the Arrest column, we could have typed the following:
[46]: chicago_crimes["Arrest"].dtypes
[46]: dtype('bool')
In either case, notice that the type is bool. This is not an object. This is why the condition should
have been written the following way:
[47]: chicago_crimes[chicago_crimes["Arrest"]==True]
237967 APPLIANCE STORE True False ... 12.0 61
237999 DRUG STORE True False ... 2.0 8
238187 BAR OR TAVERN True False ... 27.0 28
Notice the difference in the commands? The difference is that the word True is not in quotation
marks. When the column is of type object, then we use quotation marks. When the column is of
type bool then we do not use the quotation marks.
Let us try another example:
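[48]: chicago_crimes[chicago_crimes["Location Description"]=="STREET"]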
238305 12935459 JF525819 12/22/2022 12:01:00 AM
238310 12935276 JF524804 12/21/2022 11:05:00 PM
238313 12936301 JF526810 12/22/2022 06:00:00 PM
238315 12935341 JF525383 12/20/2022 06:45:00 AM
2 01/03/2023 03:46:28 PM 41.722303 -87.623745
7 01/03/2023 03:46:28 PM 41.783472 -87.587890
8 01/03/2023 03:46:28 PM 41.811420 -87.600731
... ... ... ...
238295 01/03/2023 03:46:28 PM 41.993116 -87.697050
238305 01/03/2023 03:46:28 PM 41.961746 -87.692365
238310 01/03/2023 03:46:28 PM 41.958736 -87.646526
238313 01/03/2023 03:46:28 PM 41.945022 -87.680072
238315 01/03/2023 03:46:28 PM 41.866517 -87.695179
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
7 (41.783471704, -87.587890396)
8 (41.811419728, -87.600731459)
... ...
238295 (41.993115987, -87.697049774)
238305 (41.961745665, -87.692365237)
238310 (41.958736385, -87.646526104)
238313 (41.945021752, -87.680071764)
238315 (41.866517317, -87.695178701)
Here, we are getting the rows where the value of Location Description is STREET. Notice that we
used quotation marks. This is because this column contains text, i.e. it is of type object. What if
we wanted to get the records where the value of Domestic was False? Is the Domestic column of
type bool or object? Let’s check:
[49]: chicago_crimes["Domestic"].dtype
[49]: dtype('bool')
[50]: chicago_crimes[chicago_crimes["Domestic"]==False]
238315 12935341 JF525383 12/20/2022 06:45:00 AM 027XX W ROOSEVELT RD
238316 12938501 JF523997 12/26/2022 10:30:00 PM 021XX W DEVON AVE
5 01/03/2023 03:46:28 PM 41.758253 -87.565147
... ... ... ...
238312 01/03/2023 03:46:28 PM 41.927817 -87.648846
238313 01/03/2023 03:46:28 PM 41.945022 -87.680072
238314 01/03/2023 03:46:28 PM 41.962532 -87.693611
238315 01/03/2023 03:46:28 PM 41.866517 -87.695179
238316 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
As a final example, let us look at crimes that took place in Community Area 29:
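chicago_crimes[chicago_crimes["Community Area"]==29]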
237976 013XX S INDEPENDENCE BLVD 1310 CRIMINAL DAMAGE
238110 014XX S SPAULDING AVE 1153 DECEPTIVE PRACTICE
238188 012XX S AVERS AVE 1153 DECEPTIVE PRACTICE
238315 027XX W ROOSEVELT RD 0810 THEFT
Location
22 (41.864525633, -87.709784193)
35 (41.851393305, -87.723903515)
37 (41.866358918, -87.709806763)
86 (41.851670934, -87.714240208)
144 (41.863777237, -87.7263227)
... ...
237936 (41.868005372, -87.724105561)
237976 (41.864070063, -87.719458499)
238110 (41.861840145, -87.707974761)
238188 (41.865241697, -87.721484832)
238315 (41.866517317, -87.695178701)
Why did we not put the number 29 in quotation marks? You guessed it. The column is not of type
object. What type is it?
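[51]: chicago_crimes["Community Area"].dtype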
[51]: dtype('int64')
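Before we continue, we need a way to combine two conditions so that both must hold. In python, this is done with the “&” (and) operator; the combined expression is True only when both sides are True. For example:
[53]: (2==2) & (1<2)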
[53]: True
Let us now use this syntax to get a subset of the crimes data where the crime was of type THEFT
and an arrest was made:
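[54]: chicago_crimes[(chicago_crimes["Primary Type"]=="THEFT") & (chicago_crimes["Arrest"]==True)]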
237674 12928904 JF517643 12/20/2022 04:30:00 PM 0000X E GRAND AVE
237846 12933440 JF523488 12/26/2022 08:37:00 PM 045XX S WESTERN AVE
237868 12931059 JF520508 12/23/2022 03:40:00 AM 020XX N MILWAUKEE AVE
237967 12925155 JF513324 12/16/2022 08:12:00 PM 045XX S WESTERN BLVD
Location
140 (41.853890639, -87.683855655)
461 (41.93859561, -87.649576033)
654 (41.946212132, -87.646579678)
849 (41.829929212, -87.724198877)
857 (41.911970795, -87.76309048)
... ...
237663 (41.706192826, -87.700593626)
237674 (41.891692496, -87.626214622)
237846 (41.810768138, -87.684507864)
237868 (41.917656022, -87.688750258)
237967 (41.810693516, -87.683929054)
An important point to note here is that the two conditions must be placed in separate parentheses.
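To require instead that at least one of two conditions holds, we use the “|” (or) operator:
[55]: (2==2) | (2<0)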
[55]: True
Notice that the expression evaluated to True even though 2 is not less than 0. This is because we only need the first or the second condition to be true. When one of the conditions is true, the result will be true.
Going to our dataframe, let us get the crimes that resulted in an arrest or that were domestic:
[56]: chicago_crimes[(chicago_crimes["Arrest"]==True) | (chicago_crimes["Domestic"]==True)]
13 051A ASSAULT
... ... ...
238262 0560 ASSAULT
238263 0497 BATTERY
238288 1310 CRIMINAL DAMAGE
238308 0498 BATTERY
238309 2820 OTHER OFFENSE
Description \
0 VEHICULAR HIJACKING
2 OVER $500
9 SIMPLE
10 TELEPHONE THREAT
13 AGGRAVATED - HANDGUN
... ...
238262 SIMPLE
238263 AGGRAVATED DOMESTIC BATTERY - OTHER DANGEROUS ...
238288 TO PROPERTY
238308 AGG. DOMESTIC BATTERY - HANDS, FISTS, FEET, SE...
238309 TELEPHONE THREAT
Latitude Longitude Location
0 41.990846 -87.666096 (41.990846423, -87.666096144)
2 41.722303 -87.623745 (41.722303228, -87.623745129)
9 41.678977 -87.645603 (41.678976937, -87.645603032)
10 41.750117 -87.551051 (41.75011688, -87.551050773)
13 41.760795 -87.652790 (41.76079497, -87.65279005)
... ... ... ...
238262 41.757342 -87.635685 (41.757341587, -87.63568495)
238263 41.863461 -87.652316 (41.863461031, -87.652315509)
238288 41.696216 -87.624175 (41.696215599, -87.624174764)
238308 41.879779 -87.724254 (41.879779001, -87.724254458)
238309 41.899957 -87.722557 (41.899956643, -87.722556723)
Looking at the result, we see that in each of these records, either Arrest is True or Domestic is True.
Let us now get the crimes that were committed in Community Area 23 or that were of type BATTERY:
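chicago_crimes[(chicago_crimes["Community Area"]==23) | (chicago_crimes["Primary Type"]=="BATTERY")]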
Examining these records, we see that all of them are either of type BATTERY, or they have been committed in Community Area 23.
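We can also use comparison operators with numerical columns. For example, we can ask for the crimes that took place in a community area numbered at least 23; a sketch of such a condition is:
chicago_crimes[chicago_crimes["Community Area"]>=23]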
2 0810 THEFT
3 0560 ASSAULT
4 0820 THEFT
5 0810 THEFT
... ... ...
238300 1365 CRIMINAL TRESPASS
238301 0610 BURGLARY
238306 0890 THEFT
238308 0498 BATTERY
238315 0810 THEFT
Description \
1 OVER $500
2 OVER $500
3 SIMPLE
4 $500 AND UNDER
5 OVER $500
... ...
238300 TO RESIDENCE
238301 FORCIBLE ENTRY
238306 FROM BUILDING
238308 AGG. DOMESTIC BATTERY - HANDS, FISTS, FEET, SE...
238315 OVER $500
238308 26 04B 1150117.0 1899370.0 2022
238315 29 06 1158071.0 1894595.0 2022
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238300 (41.737573664, -87.666064301)
238301 (41.755923803, -87.65743376)
238306 (41.750321833, -87.666210811)
238308 (41.879779001, -87.724254458)
238315 (41.866517317, -87.695178701)
The phrase “at least” means greater than or equal, which is written using the operator “>=”. The condition “less than” is written using the operator “<”. If we wanted to say “less than or equal”, we would use the operator “<=”. As an example, let us get the crimes where the value of X Coordinate is less than or equal to 1177962:
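chicago_crimes[chicago_crimes["X Coordinate"]<=1177962]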
238315 12935341 JF525383 12/20/2022 06:45:00 AM 027XX W ROOSEVELT RD
238316 12938501 JF523997 12/26/2022 10:30:00 PM 021XX W DEVON AVE
4 01/03/2023 03:46:28 PM 41.860250 -87.646715
... ... ... ...
238312 01/03/2023 03:46:28 PM 41.927817 -87.648846
238313 01/03/2023 03:46:28 PM 41.945022 -87.680072
238314 01/03/2023 03:46:28 PM 41.962532 -87.693611
238315 01/03/2023 03:46:28 PM 41.866517 -87.695179
238316 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
1 0810 THEFT OVER $500
2 0810 THEFT OVER $500
3 0560 ASSAULT SIMPLE
4 0820 THEFT $500 AND UNDER
... ... ... ...
238312 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238313 1320 CRIMINAL DAMAGE TO VEHICLE
238314 0620 BURGLARY UNLAWFUL ENTRY
238315 0810 THEFT OVER $500
238316 0810 THEFT OVER $500
238315 01/03/2023 03:46:28 PM 41.866517 -87.695179
238316 03/22/2023 04:47:43 PM 41.997825 -87.684267
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
... ...
238312 (41.927817456, -87.648845932)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
238310 0810 THEFT OVER $500
238311 0860 THEFT RETAIL THEFT
238315 0810 THEFT OVER $500
238316 0810 THEFT OVER $500
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
6 (41.904538325, -87.632942313)
... ...
238308 (41.879779001, -87.724254458)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
What about getting the crimes that are either THEFT, or BATTERY, or ASSAULT? Again, we can just use the same logic as above:
chicago_crimes[(chicago_crimes["Primary Type"]=="THEFT") | (chicago_crimes["Primary Type"]=="BATTERY") | (chicago_crimes["Primary Type"]=="ASSAULT")]
3 RESIDENCE False False ... 14.0
4 SIDEWALK False False ... 11.0
5 RESIDENCE False False ... 7.0
... ... ... ... ... ...
238308 APARTMENT False True ... 28.0
238310 STREET False False ... 46.0
238311 SMALL RETAIL STORE False False ... 38.0
238315 STREET False False ... 28.0
238316 PARKING LOT / GARAGE (NON RESIDENTIAL) False False ... 50.0
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238308 (41.879779001, -87.724254458)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
We can put as many conditions as we want. However, this is getting tedious. Luckily, the pandas library has a very useful function for this: the isin() function. As the name suggests, it checks whether a value is in a certain set. Let us take a look at it:
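chicago_crimes[chicago_crimes["Primary Type"].isin(["THEFT", "BATTERY", "ASSAULT"])]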
238316 PARKING LOT / GARAGE (NON RESIDENTIAL) False False ... 50.0
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238308 (41.879779001, -87.724254458)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
Here, we are just telling pandas to return the records where the value of the column Primary Type is in the given array. Notice that we use the square brackets. This is because, as you remember, we are passing more than one value to the function. We could have performed the above in two steps:
[63]: crime_types = ["THEFT", "BATTERY", "ASSAULT"]
chicago_crimes[chicago_crimes["Primary Type"].isin(crime_types)]
5 43 06 1193842.0 1855434.0 2022
... ... ... ... ... ...
238308 26 04B 1150117.0 1899370.0 2022
238310 3 06 1171054.0 1928302.0 2022
238311 17 06 1132626.0 1925898.0 2022
238315 29 06 1158071.0 1894595.0 2022
238316 2 06 1160681.0 1942466.0 2022
Location
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
5 (41.758252785, -87.565147001)
... ...
238308 (41.879779001, -87.724254458)
238310 (41.958736385, -87.646526104)
238311 (41.952897791, -87.787860507)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)
We can combine the isin() function with other operators. For example, let us get all crimes of types THEFT, BATTERY, and ASSAULT that have been committed in Community Area 29:
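chicago_crimes[(chicago_crimes["Primary Type"].isin(["THEFT", "BATTERY", "ASSAULT"])) & (chicago_crimes["Community Area"]==29)]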
... ... ... ...
237679 12934815 JF525187 12/15/2022 12:00:00 AM
237734 12925997 JF514296 12/17/2022 06:28:00 PM
237877 12923943 JF511793 12/15/2022 03:58:00 PM
237904 12930839 JF520302 12/22/2022 06:59:00 PM
238315 12935341 JF525383 12/20/2022 06:45:00 AM
22 01/03/2023 03:46:28 PM 41.864526 -87.709784
35 01/03/2023 03:46:28 PM 41.851393 -87.723904
37 01/03/2023 03:46:28 PM 41.866359 -87.709807
86 01/03/2023 03:46:28 PM 41.851671 -87.714240
144 01/03/2023 03:46:28 PM 41.863777 -87.726323
... ... ... ...
237679 01/03/2023 03:46:28 PM 41.861081 -87.693790
237734 01/03/2023 03:46:28 PM 41.854460 -87.715388
237877 01/03/2023 03:46:28 PM 41.867038 -87.710564
237904 01/03/2023 03:46:28 PM 41.867663 -87.715459
238315 01/03/2023 03:46:28 PM 41.866517 -87.695179
Location
22 (41.864525633, -87.709784193)
35 (41.851393305, -87.723903515)
37 (41.866358918, -87.709806763)
86 (41.851670934, -87.714240208)
144 (41.863777237, -87.7263227)
... ...
237679 (41.861081457, -87.693790049)
237734 (41.854459796, -87.715387915)
237877 (41.867038112, -87.710563803)
237904 (41.8676628, -87.715459036)
238315 (41.866517317, -87.695178701)
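Another use of the isna() function is to filter the rows themselves. For example, we can get the rows where the value of Location Description is missing:
chicago_crimes[chicago_crimes["Location Description"].isna()]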
IUCR Primary Type Description \
749 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
1258 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
1305 1154 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT $300 AND UNDER
1316 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
1373 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
... ... ... ...
238182 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238196 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238204 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238259 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238312 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300
238196 41.882214 -87.654375 (41.882213693, -87.654374823)
238204 41.814734 -87.622509 (41.814734427, -87.622509357)
238259 41.995446 -87.812465 (41.995445746, -87.812464832)
238312 41.927817 -87.648846 (41.927817456, -87.648845932)
Notice that the value of Location Description for all returned rows is null. This is because the isna() function returns True when the value is null (missing), and False otherwise. Rows whose value is True satisfy the condition and are therefore shown.
What if we wanted to return the rows where the value was not missing? Once again, pandas has
what we need:
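For this we can use the notna() function, which is the mirror image of isna():
chicago_crimes[chicago_crimes["Location Description"].notna()]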
... ... ... ... ... ...
238311 SMALL RETAIL STORE False False ... 38.0
238313 STREET False False ... 32.0
238314 APARTMENT False False ... 47.0
238315 STREET False False ... 28.0
238316 PARKING LOT / GARAGE (NON RESIDENTIAL) False False ... 50.0
Location
0 (41.990846423, -87.666096144)
1 (41.780330681, -87.684891779)
2 (41.722303228, -87.623745129)
3 (41.805347066, -87.725961264)
4 (41.860249838, -87.64671467)
... ...
238311 (41.952897791, -87.787860507)
238313 (41.945021752, -87.680071764)
238314 (41.962531969, -87.693611152)
238315 (41.866517317, -87.695178701)
238316 (41.997824802, -87.684266677)