Pandas for Data Science Students
Pandas for Data Science Students
October 9, 2022
import sys
print("=== Python version ===\n{}".format(sys.version))
The main object that pandas implements is the DataFrame, which is essentially a 2-D table. It’s
an ideal target for holding the tibbles of Topic+Notebook 7, and its design derives in part from
data frame objects in the R language.
In addition to DataFrame, another important component of pandas is the Series, which is es-
sentially one column of a DataFrame object (and, therefore, corresponds to variables and responses
in a tibble).
2 Series objects
A pandas Series object is a column-oriented object that we will use to store a variable of a tibble.
1
In [3]: obj = Series([-1, 2, -3, 4, -5])
print(f"`obj` has type `{type(obj)}`:\n\n{obj}")
0 -1
1 2
2 -3
3 4
4 -5
dtype: int64
Observe the common base type (dtype: int64) and index (element numbers).
Regarding the base type, a Series differs from a Python list in that the types of its elements
are assumed to be the same. Doing so allows many operations on a Series to be faster than their
counterparts for list objects, as in this search example.
print(f"""
Creating random `list` and `Series` objects:
- Length: {n_ints} elements
- Range: [{-max_value}, {max_value}]
""")
a_list = [randint(-max_value, max_value) for _ in range(n_ints)]
a_series = Series(a_list)
2
==> (`list` time) divided by `Series` time is roughly 9.3x
If you create a Series with "mixed types," the dtype will become the most generic Python type,
object. (A deeper understanding of what this fact means requires some knowledge of object-
oriented programming, but that won’t be necessary for our course.)
Out[5]: 0 -1
1 2
2 -3
3 4
4 -5
dtype: object
In [6]: print(obj.dtype)
print(obj2.dtype)
int64
object
Regarding the index, it provides a convenient way to reference individual elements of the
Series.
By default, a Series has an index that is akin to range() in standard Python, and effectively
numbers the entries from 0 to n-1, where n is the length of the Series. A Series object also
becomes list-like in how you reference its elements.
==> obj[2]:
-3
==> obj[3]:
4
==> obj[1:3]:
3
1 2
2 -3
3 4
dtype: int64
You can also use more complex index objects, like lists of integers and conditional masks.
In [9]: I = [0, 2, 3]
obj[I] # Also: obj[[0, 2, 3]]
Out[9]: 0 -1
2 -3
3 4
dtype: int64
In [11]: print(obj[I_pos])
1 2
3 4
dtype: int64
However, the index can be a more general structure, which effectively turns a Series object
into something that is "dictionary-like."
Out[12]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64
4
In [13]: print("* obj3['bob']: {}\n".format(obj3['bob']))
print("* obj3['carol']: {}\n".format(obj3['carol']))
* obj3['bob']: -2
* obj3['carol']: 3
In [14]: peeps = {'alice': 1, 'carol': 3, 'esther': 5, 'bob': -2, 'dave': -4, 'frank': -6}
obj4 = Series(peeps)
print(obj4)
alice 1
carol 3
esther 5
bob -2
dave -4
frank -6
dtype: int64
5
alice False
bob True
carol False
dave True
esther False
frank True
dtype: bool
In [18]: print(obj3[I_neg])
bob -2
dave -4
frank -6
dtype: int64
Because of the dictionary-like naming of Series elements, you can use the Python in operator
in the same way you would a dictionary.
Note: In the timing experiment comparing list search and Series search, you may
have noticed that the benchmark does not use in, but rather, Series.isin. Why is
that?
True
True
alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64
alice 6
bob 3
carol 8
6
dave 1
esther 10
frank -1
dtype: int64
alice True
bob True
carol True
dave True
esther True
frank False
dtype: bool
alice 3.5
bob 8.0
carol 0.5
dave 11.0
esther -2.5
frank 14.0
dtype: float64
A Series object also supports vector-style operations with automatic alignment based on in-
dex values.
In [21]: print(obj3)
alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64
Out[22]: alice 1
carol 3
esther 5
dtype: int64
7
dave NaN
esther 10.0
frank NaN
dtype: float64
Observe what happened with undefined elements. If you are familiar with relational
databases, this behavior is akin to an outer-join.
Another useful transformation is the .apply(fun) method. It returns a copy of the Series
where the function fun has been applied to each element. For example:
Out[24]: 5
Out[25]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64
In [26]: obj3.apply(abs)
Out[26]: alice 1
bob 2
carol 3
dave 4
esther 5
frank 6
dtype: int64
Out[27]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64
In [28]: print(obj3.name)
None
8
In [29]: obj3.name = 'peep'
obj3
Out[29]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
Name: peep, dtype: int64
3 DataFrame objects
A pandas DataFrame object is a table whose columns are Series objects, all keyed on the same
index. It’s the perfect container for what we have been referring to as a tibble.
In [30]: cafes = DataFrame({'name': ['east pole', 'chrome yellow', 'brash', 'taproom', '3heart'
'zip': [30324, 30312, 30318, 30317, 30306, 30308, 30303, 30318],
'poc': ['jared', 'kelly', 'matt', 'jonathan', 'nhan', 'dale', 'kitt
print("type:", type(cafes))
print(cafes)
9
The DataFrame has named columns, which are stored as an Index (more later!):
In [32]: cafes.columns
Out[33]: pandas.core.series.Series
As you might expect, these Series objects should all have the same index.
In [34]: cafes.index
Out[35]: array([ True, True, True, True, True, True, True, True])
Out[36]: array([ True, True, True, True, True, True, True, True])
In [38]: cafes[1::2]
10
The index above is, by default, an integer range.
In [39]: cafes.index
Like Series, DataFrame has special attributes loc and iloc for label-based and integer-based
indexing, respectively. Since loc operator indexes exclusively with labels, there is also an iloc oper-
ator that indexes exclusively with integers to work consistently whether or not the index contains
integers. While you can select data by label this way, the preferred way to select index values is
with the special loc operator. The reason to prefer loc is because of the different treatment of in-
tegers when indexing with []. Regular []-based indexing will treat integers as labels if the index
contains integers, so the behavior differs depending on the data type of the index.
df.loc[rows] -- Select single row or subset of rows from the DataFrame by label
df.loc[:, cols] -- Select single column or subset of columns by label
df.loc[rows, cols] -- Select both row(s) and column(s) by label
df.iloc[rows] -- Select single row or subset of rows from the DataFrame by integer position
df.iloc[:, cols] -- Select single column or subset of columns by integer position
df.iloc[rows, cols] -- Select both row(s) and column(s) by integer position
You can access subsets of rows using the .loc field and index values:
Alternatively, you can use integer offsets via the .iloc field, which is 0-based.
11
Adding columns is easy. Suppose every cafe has a 4-star rating on Yelp! and a two-dollar-sign
cost:
Because the columns are Series objects, there is an implicit matching that is happening on the
indexes. In the preceding example, it works because all the Series objects involved have identical
indexes.
However, the following will not work as intended because referencing rows yields copies.
12
For instance, suppose there is a price hike of one more '$' for being in the 30306 and 30308 zip
codes. (If you are in Atlanta, you may know that these are the zip codes that place you close to, or
in, Ponce City Market and the Eastside Beltline Trail!) Let’s increase the price there, on a copy of
the dataframe, cafes3.
In [48]: cafes3[is_fancy]
$$$
13
<ipython-input-50-245dbd93431d>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
What does that error message mean? Let’s see if anything changed.
In [51]: cafes3
Nope! When you slice horizontally, you get copies of the original data, not references to subsets
of the original data. Therefore, we’ll need different strategy.
Observe that the error message suggests a way!
A different approach. For pedagogical purposes, let’s see if we can go about solving this
problem in other ways to see what might or might not work.
14
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$ 2.0
spiller park pcm dale 30308 4.0 $$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0
Based on the earlier discussion, a well-educated first attempt might be to construct a Series
with a named index, where the index values for fancy neighborhoods have an additional '$', and
then use string concatentation.
Out[55]: 3heart $
spiller park pcm $
dtype: object
15
Out[58]: poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$$ 2.0
spiller park pcm dale 30308 4.0 $$$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0
In [59]: cafes4.apply(lambda x: repr(type(x))) # What does this do? What does the output tell y
In [60]: cafes4.apply(lambda x: repr(type(x)), axis=1) # What does this do? What does the outpu
In [61]: cafes4.apply(lambda x: print('==> ' + x.name + '\n' + repr(x)) if x.name == 'east pole
16
Exercise. Use DataFrame.apply() to update the 'value' column in cafes4, which is out of
date given the update of the prices.
In [62]: cafes4 # Verify visually that `'value'` is out of date
Out[62]: poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$$ 2.0
spiller park pcm dale 30308 4.0 $$$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0
In [63]: def calc_value(row):
return row['rating'] / len(row['price'])
display(cafes_cheap)
display(cafes_pricey)
poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0
17
poc zip rating price value
3heart nhan 30306 4.0 $$$ 1.333333
spiller park pcm dale 30308 4.0 $$$ 1.333333
In [67]: cafes4.index
In [69]: cafes4.index.union(['chattahoochee'])
If you need to change the index of a DataFrame, here is one way to do it.
18
In [71]: cafes5 = cafes4.reindex(Index(['3heart', 'east pole', 'brash', 'starbucks']))
display(cafes4)
display(cafes5)
Observe that this reindexing operation matches the supplied index values against the existing
ones. (What happens to index values you leave out? What happens with new index values?)
Another useful operation is dropping the index (and replacing it with the default, integers).
Fin! That’s the end of these notes. With this information as background, you should be able to
complete Notebook 7.
19
part1
October 9, 2022
1.4.0
2 Setup
The following code cell imports a local Python module (stored in cse6040utils.py) and uses one
of its utility functions to open the sample dataset.
Note: Due to the size of the data file, we are not making it available for download. You
will need to run this notebook on Vocareum.
'./resource/asnlib/publicdata/P00000001-ALL.csv' is ready!
<ipython-input-3-91c3b0ffd817>:2: DtypeWarning: Columns (6) have mixed types. Specify dtype opt
fecdata = pd.read_csv(local_data['P00000001-ALL.csv'])
1
Out[3]: cmte_id cand_id cand_nm contbr_nm \
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD
In [4]: fecdata.head()
2
3 RETIRED 250.0 01-AUG-11 NaN
4 RETIRED 300.0 20-JUN-11 NaN
In [5]: fecdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 1001731 non-null object
1 cand_id 1001731 non-null object
2 cand_nm 1001731 non-null object
3 contbr_nm 1001731 non-null object
4 contbr_city 1001712 non-null object
5 contbr_st 1001727 non-null object
6 contbr_zip 1001620 non-null object
7 contbr_employer 988002 non-null object
8 contbr_occupation 993301 non-null object
9 contb_receipt_amt 1001731 non-null float64
10 contb_receipt_dt 1001731 non-null object
11 receipt_desc 14166 non-null object
12 memo_cd 92482 non-null object
13 memo_text 97770 non-null object
14 form_tp 1001731 non-null object
15 file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
In [6]: fecdata.sample(5)
3
907172 EASTPOINT FL 323280644.0 NaN
117243 BUCKEYE AZ 853966956 RETIRED
273563 JACKSONVILLE FL 32250 MORGAN & MORGAN
547504 LARCHMONT NY 105382302 SELF-EMPLOYED
809326 ZANESVILLE OH 437012222 NONE
In [7]: fecdata.describe()
Assign party affiliations (they are all Republicans except for Barack Obama):
In [9]: party_affiliations = {name: 'D' if name == 'Obama, Barack' else 'R' for name in unique_
party_affiliations
4
Out[9]: {'Bachmann, Michelle': 'R',
'Romney, Mitt': 'R',
'Obama, Barack': 'D',
"Roemer, Charles E. 'Buddy' III": 'R',
'Pawlenty, Timothy': 'R',
'Johnson, Gary Earl': 'R',
'Paul, Ron': 'R',
'Santorum, Rick': 'R',
'Cain, Herman': 'R',
'Gingrich, Newt': 'R',
'McCotter, Thaddeus G': 'R',
'Huntsman, Jon': 'R',
'Perry, Rick': 'R'}
In [10]: aff = {name: "D" if name == "Obama, Barack" else "R" for name in unique_candidates}
aff
In [12]: candidate_sample.map(party_affiliations)
Out[12]: 973734 R
404526 D
271902 D
279449 D
553562 D
Name: cand_nm, dtype: object
5
In [13]: fecdata['party'] = fecdata['cand_nm'].map(party_affiliations)
In [14]: fecdata.sample(5)
contbr_employer contbr_occupation \
177713 KAISER RN
539733 WEBSTER CENTRAL SCHOOLS SPECIAL EDUCATION TEACHER
377571 CAPE COD NATIONAL GOLF CLUB GOLF COURSE MANAGER
704096 SELF-EMPLOYED CORPORATE FINANCIAL CONSULTANT
391007 RETIRED RETIRED
Out[15]: 298.75139514000006
Which party got more individual donations (transactions, not total dollars)?
6
In [16]: fecdata['party'].value_counts()
Out[16]: D 593746
R 407985
Name: party, dtype: int64
In [17]: fecdata.groupby('party')['contb_receipt_amt'].sum()*1e-6
Out[17]: party
D 133.502591
R 165.248804
Name: contb_receipt_amt, dtype: float64
Filter all the data to include only the two main candidates, Romney and Obama.
7
687400 RETIRED 15.0 22-APR-12
65495 ATTORNEY 250.0 09-APR-12
cand_nm
Obama, Barack 133.502591
Romney, Mitt 85.219249
Name: contb_receipt_amt, dtype: float64
5 Who contributes?
Get a list of top occupations:
In [21]: len(fecmain['contbr_occupation'].unique())
Out[21]: 33606
In [22]: fecmain['contbr_occupation'].value_counts()
In [24]: fecmain['contbr_occupation'].map(occ_mapping)
8
Out[24]: 411 NaN
412 NaN
413 NaN
414 NaN
415 NaN
...
701381 NaN
701382 NaN
701383 NaN
701384 NaN
701385 NaN
Name: contbr_occupation, Length: 700975, dtype: object
In [25]: # .get()!
print(occ_mapping.get('PROFESSOR'))
print(occ_mapping.get('PROFESSOR', 'PROFESSOR'))
None
PROFESSOR
In [27]: fecmain['contbr_occupation']
9
In [29]: fecmain['contbr_employer'] = fecmain['contbr_employer'].map(lambda x: emp_mapping.get(
In [30]: emp_mapping.get('prof','pro')
Out[30]: 'pro'
Create a "pivot table" that shows occupations as rows and party affiliation as columns, sum-
ming the individual contributions.
Out[31]: party D R
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100.0 NaN
AREA VICE PRESIDENT 250.0 NaN
RESEARCH ASSOCIATE 100.0 NaN
TEACHER 500.0 NaN
THERAPIST 3900.0 NaN
... ... ..
ZEPPOS AND ASSOCIATES 1000.0 NaN
ZONE MANAGER 135.0 NaN
ZOOKEEPER 35.0 NaN
ZOOLOGIST 400.0 NaN
ZOOLOGY EDUCATION 25.0 NaN
Determine which occupations account for $1 million or more in contributions. Compare the
amounts between the two party affiliations. (Bonus: Make a plot to compare these visually.)
Out[32]: 25
In [33]: over_1mil
Out[33]: party D R
contbr_occupation
ATTORNEY 11.126933 5.302579
BANKER 0.224084 0.993536
BUSINESS OWNER 0.449979 0.601530
CEO 2.071475 2.247242
CHAIRMAN 0.495547 0.752587
CONSULTANT 2.459813 1.404577
ENGINEER 0.950426 0.317249
EXECUTIVE 1.355161 2.230654
FINANCE 0.294031 1.118567
10
HOMEMAKER 4.243394 8.037251
INVESTOR 0.884133 1.494725
LAWYER 3.159392 0.007705
MANAGER 0.762693 0.565826
NOT EMPLOYED 1.708108 NaN
NOT PROVIDED 4.849802 11.173375
OWNER 0.998867 0.845396
PHYSICIAN 3.732387 1.332996
PRESIDENT 1.878010 2.403440
PROFESSOR 2.163571 0.160362
REAL ESTATE 0.528752 1.033896
RETIRED 25.270507 11.266949
SELF-EMPLOYED 0.667393 1.063738
STUDENT 0.627450 0.488743
TEACHER 1.250969 0.138838
WRITER 1.084089 0.090363
In [35]: over_1mil.sum(axis=1).sort_values()
Out[35]: contbr_occupation
BUSINESS OWNER 1.051509
STUDENT 1.116193
WRITER 1.174452
BANKER 1.217621
CHAIRMAN 1.248134
ENGINEER 1.267675
MANAGER 1.328519
TEACHER 1.389808
FINANCE 1.412599
REAL ESTATE 1.562648
NOT EMPLOYED 1.708108
SELF-EMPLOYED 1.731132
OWNER 1.844263
PROFESSOR 2.323933
INVESTOR 2.378858
LAWYER 3.167097
EXECUTIVE 3.585815
CONSULTANT 3.864390
PRESIDENT 4.281450
CEO 4.318717
PHYSICIAN 5.065384
HOMEMAKER 12.280645
NOT PROVIDED 16.023177
ATTORNEY 16.429512
RETIRED 36.537456
dtype: float64
11
In [36]: over_1mil_sorted = over_1mil.loc[sorted_occ.index]
over_1mil_sorted.plot(kind='barh', stacked=True, figsize=(10, 6));
6 Simple ranking
Determine largest donors:
In [38]: fecmain.loc[largest_donors.index]
12
257270 C00431445 P80003338 Obama, Barack
Out[39]: cand_nm
Obama, Barack 325136 2014490.51
326651 1944042.43
344539 1679114.65
Romney, Mitt 65131 12700.00
41888 10000.00
90076 10000.00
Name: contb_receipt_amt, dtype: float64
In [40]: type(grouped)
Out[40]: pandas.core.groupby.generic.DataFrameGroupBy
13
Out[41]: cand_nm
Obama, Barack <class 'pandas.core.frame.DataFrame'>
Romney, Mitt <class 'pandas.core.frame.DataFrame'>
dtype: object
Use .apply() to get DataFrame objects showing the largest donors, grouped by candidate and
occupation:
top_amounts_by_occupation(fecmain)
Out[42]: contbr_occupation
RETIRED 36537456.46
ATTORNEY 16429511.79
NOT PROVIDED 16023176.80
HOMEMAKER 12280645.16
PHYSICIAN 5065383.78
Name: contb_receipt_amt, dtype: float64
14
7 Big vs. small donations
For each of the leading two candidates, did most of their money come from large or small dona-
tions?
Fin!
15