0% found this document useful (0 votes)

142 views34 pages

Pandas for Data Science Students

This document provides additional notes on pandas Series objects and how they can be used to store and manipulate data. Series are column-oriented objects that act like both lists and dictionaries. They support fast vectorized operations and automatic alignment of data based on index values.

Uploaded by

Alexandra Prokhorova

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

142 views34 pages

Pandas for Data Science Students

Uploaded by

Alexandra Prokhorova

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

part0

October 9, 2022

1 Supplemental notes on Pandas

The pandas library is a Python module for representing what we call "tibbles" in Topic 7. Beyond
what you see there, this notebook has additional notes to help you understand how to manipulate
objects in Pandas. These notes adapt those found in the recommended text, Python for Data
Analysis (2nd ed.), which is written by the createor of pandas, Wes McKinney.
Versions. The state of pandas is a bit in-flux, so it’s important to be flexible and accommodate
differences in functionality that might vary by version. The following code shows you how to
check what version of pandas you have.

In [1]: import pandas as pd # Standard idiom for loading pandas

print("=== pandas version: {} ===\n".format(pd.version))

import sys
print("=== Python version ===\n{}".format(sys.version))

=== pandas version: 1.4.0 ===

=== Python version ===

3.8.7 (default, Jan 25 2021, 11:14:52)
[GCC 5.5.0 20171010]

The main object that pandas implements is the DataFrame, which is essentially a 2-D table. It’s
an ideal target for holding the tibbles of Topic+Notebook 7, and its design derives in part from
data frame objects in the R language.
In addition to DataFrame, another important component of pandas is the Series, which is es-
sentially one column of a DataFrame object (and, therefore, corresponds to variables and responses
in a tibble).

In [2]: from pandas import DataFrame, Series

2 Series objects
A pandas Series object is a column-oriented object that we will use to store a variable of a tibble.

1
In [3]: obj = Series([-1, 2, -3, 4, -5])
print(f"`obj` has type `{type(obj)}`:\n\n{obj}")

`obj` has type `<class 'pandas.core.series.Series'>`:

0 -1
1 2
2 -3
3 4
4 -5
dtype: int64

Observe the common base type (dtype: int64) and index (element numbers).
Regarding the base type, a Series differs from a Python list in that the types of its elements
are assumed to be the same. Doing so allows many operations on a Series to be faster than their
counterparts for list objects, as in this search example.

In [4]: from random import randint

n_ints = 10000000
max_value = 5*n_ints

print(f"""
Creating random `list` and `Series` objects:
- Length: {n_ints} elements
- Range: [{-max_value}, {max_value}]
""")
a_list = [randint(-max_value, max_value) for _ in range(n_ints)]
a_series = Series(a_list)

print("==> Estimating time to search the `list`:")

t_list_search = %timeit -o randint(-max_value, max_value) in a_list

print("\n==> Estimating time to search the `Series`:")

t_series_search = %timeit -o a_series.isin([randint(-max_value, max_value)])

print(f"\n==> (`list` time) divided by `Series` time is roughly {t_list_search.average

Creating random `list` and `Series` objects:

- Length: 10000000 elements
- Range: [-50000000, 50000000]

==> Estimating time to search the `list`:

189 ms ś 8.64 ms per loop (mean ś std. dev. of 7 runs, 10 loops each)

==> Estimating time to search the `Series`:

20.4 ms ś 184 ţs per loop (mean ś std. dev. of 7 runs, 10 loops each)

2
==> (`list` time) divided by `Series` time is roughly 9.3x

If you create a Series with "mixed types," the dtype will become the most generic Python type,
object. (A deeper understanding of what this fact means requires some knowledge of object-
oriented programming, but that won’t be necessary for our course.)

In [5]: obj2 = Series([-1, '2', -3, '4', -5])

obj2

Out[5]: 0 -1
1 2
2 -3
3 4
4 -5
dtype: object

If you want to query the base type, use:

In [6]: print(obj.dtype)
print(obj2.dtype)

int64
object

Regarding the index, it provides a convenient way to reference individual elements of the
Series.
By default, a Series has an index that is akin to range() in standard Python, and effectively
numbers the entries from 0 to n-1, where n is the length of the Series. A Series object also
becomes list-like in how you reference its elements.

In [7]: print("obj.index: {}".format(obj.index))

print("range(0, 5): {}".format(range(0, 5)))

obj.index: RangeIndex(start=0, stop=5, step=1)

range(0, 5): range(0, 5)

In [8]: print("==> obj[2]:\n{}\n".format(obj[2]))

print("==> obj[3]:\n{}\n".format(obj[3]))
print("==> obj[1:3]:\n{}\n".format(obj[1:4]))

==> obj[2]:
-3

==> obj[3]:
4

==> obj[1:3]:

3
1 2
2 -3
3 4
dtype: int64

You can also use more complex index objects, like lists of integers and conditional masks.

In [9]: I = [0, 2, 3]
obj[I] # Also: obj[[0, 2, 3]]

Out[9]: 0 -1
2 -3
3 4
dtype: int64

In [10]: I_pos = obj > 0

print(type(I_pos), I_pos)

<class 'pandas.core.series.Series'> 0 False

1 True
2 False
3 True
4 False
dtype: bool

In [11]: print(obj[I_pos])

1 2
3 4
dtype: int64

However, the index can be a more general structure, which effectively turns a Series object
into something that is "dictionary-like."

In [12]: obj3 = Series([ 1, -2, 3, -4, 5, -6],

['alice', 'bob', 'carol', 'dave', 'esther', 'frank'])
obj3

Out[12]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64

4
In [13]: print("* obj3['bob']: {}\n".format(obj3['bob']))
print("* obj3['carol']: {}\n".format(obj3['carol']))

* obj3['bob']: -2

* obj3['carol']: 3

In fact, you can construct a Series from a dictionary directly:

In [14]: peeps = {'alice': 1, 'carol': 3, 'esther': 5, 'bob': -2, 'dave': -4, 'frank': -6}
obj4 = Series(peeps)
print(obj4)

alice 1
carol 3
esther 5
bob -2
dave -4
frank -6
dtype: int64

In [15]: mujeres = [0, 2, 4] # list of integer offsets

print("* las mujeres of `obj3` at offsets {}:\n{}\n".format(mujeres, obj3[mujeres]))

* las mujeres of `obj3` at offsets [0, 2, 4]:

alice 1
carol 3
esther 5
dtype: int64

In [16]: hombres = ['bob', 'dave', 'frank'] # list of index values

print("* hombres, by their names, {}:\n{}".format(hombres, obj3[hombres]))

* hombres, by their names, ['bob', 'dave', 'frank']:

bob -2
dave -4
frank -6
dtype: int64

In [17]: I_neg = obj3 < 0

print(I_neg)

5
alice False
bob True
carol False
dave True
esther False
frank True
dtype: bool

In [18]: print(obj3[I_neg])

bob -2
dave -4
frank -6
dtype: int64

Because of the dictionary-like naming of Series elements, you can use the Python in operator
in the same way you would a dictionary.

Note: In the timing experiment comparing list search and Series search, you may
have noticed that the benchmark does not use in, but rather, Series.isin. Why is
that?

In [19]: print('carol' in peeps)

print('carol' in obj3)

True
True

Basic arithmetic works on Series as vector-like operations.

In [20]: print(obj3, "\n")

print(obj3 + 5, "\n")
print(obj3 + 5 > 0, "\n")
print((-2.5 * obj3) + (obj3 + 5))

alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64

alice 6
bob 3
carol 8

6
dave 1
esther 10
frank -1
dtype: int64

alice True
bob True
carol True
dave True
esther True
frank False
dtype: bool

alice 3.5
bob 8.0
carol 0.5
dave 11.0
esther -2.5
frank 14.0
dtype: float64

A Series object also supports vector-style operations with automatic alignment based on in-
dex values.

In [21]: print(obj3)

alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64

In [22]: obj_l = obj3[mujeres]

obj_l

Out[22]: alice 1
carol 3
esther 5
dtype: int64

In [23]: obj3 + obj_l

Out[23]: alice 2.0

bob NaN
carol 6.0

7
dave NaN
esther 10.0
frank NaN
dtype: float64

Observe what happened with undefined elements. If you are familiar with relational
databases, this behavior is akin to an outer-join.
Another useful transformation is the .apply(fun) method. It returns a copy of the Series
where the function fun has been applied to each element. For example:

In [24]: abs(-5) # Python built-in function

Out[24]: 5

In [25]: obj3 # Recall

Out[25]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64

In [26]: obj3.apply(abs)

Out[26]: alice 1
bob 2
carol 3
dave 4
esther 5
frank 6
dtype: int64

In [27]: obj3 # Note: `.apply()` returned a copy, so the original is untouched

Out[27]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
dtype: int64

A Series may be named, too.

In [28]: print(obj3.name)

None

8
In [29]: obj3.name = 'peep'
obj3

Out[29]: alice 1
bob -2
carol 3
dave -4
esther 5
frank -6
Name: peep, dtype: int64

When we move on to DataFrame objects, you’ll see why names matter.

3 DataFrame objects
A pandas DataFrame object is a table whose columns are Series objects, all keyed on the same
index. It’s the perfect container for what we have been referring to as a tibble.

In [30]: cafes = DataFrame({'name': ['east pole', 'chrome yellow', 'brash', 'taproom', '3heart'
'zip': [30324, 30312, 30318, 30317, 30306, 30308, 30303, 30318],
'poc': ['jared', 'kelly', 'matt', 'jonathan', 'nhan', 'dale', 'kitt
print("type:", type(cafes))
print(cafes)

type: <class 'pandas.core.frame.DataFrame'>

name zip poc
0 east pole 30324 jared
1 chrome yellow 30312 kelly
2 brash 30318 matt
3 taproom 30317 jonathan
4 3heart 30306 nhan
5 spiller park pcm 30308 dale
6 refuge 30303 kitti
7 toptime 30318 nolan

In [31]: display(cafes) # Or just `cafes` as the last line of a cell

name zip poc

0 east pole 30324 jared
1 chrome yellow 30312 kelly
2 brash 30318 matt
3 taproom 30317 jonathan
4 3heart 30306 nhan
5 spiller park pcm 30308 dale
6 refuge 30303 kitti
7 toptime 30318 nolan

9
The DataFrame has named columns, which are stored as an Index (more later!):

In [32]: cafes.columns

Out[32]: Index(['name', 'zip', 'poc'], dtype='object')

Each column is a named Series:

In [33]: type(cafes['zip']) # Aha!

Out[33]: pandas.core.series.Series

As you might expect, these Series objects should all have the same index.

In [34]: cafes.index

Out[34]: RangeIndex(start=0, stop=8, step=1)

In [35]: cafes.index == cafes['zip'].index

Out[35]: array([ True, True, True, True, True, True, True, True])

In [36]: cafes['zip'].index == cafes['poc'].index

Out[36]: array([ True, True, True, True, True, True, True, True])

You may use complex indexing of columns.

In [37]: target_fields = ['zip', 'poc']

cafes[target_fields]

Out[37]: zip poc

0 30324 jared
1 30312 kelly
2 30318 matt
3 30317 jonathan
4 30306 nhan
5 30308 dale
6 30303 kitti
7 30318 nolan

But slices apply to rows.

In [38]: cafes[1::2]

Out[38]: name zip poc

1 chrome yellow 30312 kelly
3 taproom 30317 jonathan
5 spiller park pcm 30308 dale
7 toptime 30318 nolan

10
The index above is, by default, an integer range.

In [39]: cafes.index

Out[39]: RangeIndex(start=0, stop=8, step=1)

In [40]: cafes2 = cafes[['poc', 'zip']]

cafes2.index = cafes['name']
cafes2.index.name = None
cafes2

Out[40]: poc zip

east pole jared 30324
chrome yellow kelly 30312
brash matt 30318
taproom jonathan 30317
3heart nhan 30306
spiller park pcm dale 30308
refuge kitti 30303
toptime nolan 30318

Like Series, DataFrame has special attributes loc and iloc for label-based and integer-based
indexing, respectively. Since loc operator indexes exclusively with labels, there is also an iloc oper-
ator that indexes exclusively with integers to work consistently whether or not the index contains
integers. While you can select data by label this way, the preferred way to select index values is
with the special loc operator. The reason to prefer loc is because of the different treatment of in-
tegers when indexing with []. Regular []-based indexing will treat integers as labels if the index
contains integers, so the behavior differs depending on the data type of the index.

df.loc[rows] -- Select single row or subset of rows from the DataFrame by label
df.loc[:, cols] -- Select single column or subset of columns by label
df.loc[rows, cols] -- Select both row(s) and column(s) by label
df.iloc[rows] -- Select single row or subset of rows from the DataFrame by integer position
df.iloc[:, cols] -- Select single column or subset of columns by integer position
df.iloc[rows, cols] -- Select both row(s) and column(s) by integer position

You can access subsets of rows using the .loc field and index values:

In [41]: cafes2.loc[['chrome yellow', '3heart']]

Out[41]: poc zip

chrome yellow kelly 30312
3heart nhan 30306

Alternatively, you can use integer offsets via the .iloc field, which is 0-based.

In [42]: cafes2.iloc[[1, 3]]

Out[42]: poc zip

chrome yellow kelly 30312
taproom jonathan 30317

11
Adding columns is easy. Suppose every cafe has a 4-star rating on Yelp! and a two-dollar-sign
cost:

In [43]: cafes2['rating'] = 4.0

cafes2['price'] = '$$'
cafes2

Out[43]: poc zip rating price

east pole jared 30324 4.0 $$
chrome yellow kelly 30312 4.0 $$
brash matt 30318 4.0 $$
taproom jonathan 30317 4.0 $$
3heart nhan 30306 4.0 $$
spiller park pcm dale 30308 4.0 $$
refuge kitti 30303 4.0 $$
toptime nolan 30318 4.0 $$

And vector arithmetic should work on columns as expected.

In [44]: prices_as_ints = cafes2['price'].apply(lambda s: len(s))

prices_as_ints

Out[44]: east pole 2

chrome yellow 2
brash 2
taproom 2
3heart 2
spiller park pcm 2
refuge 2
toptime 2
Name: price, dtype: int64

In [45]: cafes2['value'] = cafes2['rating'] / prices_as_ints

cafes2

Out[45]: poc zip rating price value

east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$ 2.0
spiller park pcm dale 30308 4.0 $$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0

Because the columns are Series objects, there is an implicit matching that is happening on the
indexes. In the preceding example, it works because all the Series objects involved have identical
indexes.
However, the following will not work as intended because referencing rows yields copies.

12
For instance, suppose there is a price hike of one more '$' for being in the 30306 and 30308 zip
codes. (If you are in Atlanta, you may know that these are the zip codes that place you close to, or
in, Ponce City Market and the Eastside Beltline Trail!) Let’s increase the price there, on a copy of
the dataframe, cafes3.

In [46]: cafes3 = cafes2.copy()

cafes3

Out[46]: poc zip rating price value

In [47]: is_fancy = cafes3['zip'].isin({30306, 30308})

# Alternative:
#is_fancy = cafes3['zip'].apply(lambda z: z in {30306, 30308})
is_fancy

Out[47]: east pole False

chrome yellow False
brash False
taproom False
3heart True
spiller park pcm True
refuge False
toptime False
Name: zip, dtype: bool

In [48]: cafes3[is_fancy]

Out[48]: poc zip rating price value

3heart nhan 30306 4.0 $$ 2.0
spiller park pcm dale 30308 4.0 $$ 2.0

In [49]: # Recall: Legal Python for string concatenation

s = '$$'
s += '$'
print(s)

$$$

In [50]: cafes3[is_fancy]['price'] += '$'

13
<ipython-input-50-245dbd93431d>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/i

cafes3[is_fancy]['price'] += '$'

What does that error message mean? Let’s see if anything changed.

In [51]: cafes3

Out[51]: poc zip rating price value

Nope! When you slice horizontally, you get copies of the original data, not references to subsets
of the original data. Therefore, we’ll need different strategy.
Observe that the error message suggests a way!

In [52]: cafes3.loc[is_fancy, 'price'] += '$'

cafes3

Out[52]: poc zip rating price value

east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$$ 2.0
spiller park pcm dale 30308 4.0 $$$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0

A different approach. For pedagogical purposes, let’s see if we can go about solving this
problem in other ways to see what might or might not work.

In [53]: cafes4 = cafes2.copy() # Start over

cafes4

Out[53]: poc zip rating price value

east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0

14
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$ 2.0
spiller park pcm dale 30308 4.0 $$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0

Based on the earlier discussion, a well-educated first attempt might be to construct a Series
with a named index, where the index values for fancy neighborhoods have an additional '$', and
then use string concatentation.

In [54]: fancy_shops = cafes4.index[is_fancy]

fancy_shops

Out[54]: Index(['3heart', 'spiller park pcm'], dtype='object')

In [55]: fancy_markup = Series(['$'] * len(fancy_shops), index=fancy_shops)

fancy_markup

Out[55]: 3heart $
spiller park pcm $
dtype: object

In [56]: cafes4['price'] + fancy_markup

Out[56]: 3heart $$$

brash NaN
chrome yellow NaN
east pole NaN
refuge NaN
spiller park pcm $$$
taproom NaN
toptime NaN
dtype: object

Close! Remember that missing values are treated as NaN objects.

Exercise. Develop an alternative scheme.

In [57]: # Preliminary observation:

print("False * '$' == '{}'".format(False * '$'))
print("True * '$' == '{}'".format(True * '$'))

False * '$' == ''

True * '$' == '$'

In [58]: cafes4 = cafes2.copy()

cafes4['price'] += Series([x * '$' for x in is_fancy.tolist()], index=is_fancy.index)
cafes4

15
Out[58]: poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$$ 2.0
spiller park pcm dale 30308 4.0 $$$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0

More on apply() for DataFrame objects. As with a Series, there is a DataFrame.apply()

procedure. However, it’s meaning is a bit more nuanced because a DataFrame is generally 2-D
rather than 1-D.

In [59]: cafes4.apply(lambda x: repr(type(x))) # What does this do? What does the output tell y

Out[59]: poc <class 'pandas.core.series.Series'>

zip <class 'pandas.core.series.Series'>
rating <class 'pandas.core.series.Series'>
price <class 'pandas.core.series.Series'>
value <class 'pandas.core.series.Series'>
dtype: object

A useful parameter is axis:

In [60]: cafes4.apply(lambda x: repr(type(x)), axis=1) # What does this do? What does the outpu

Out[60]: east pole <class 'pandas.core.series.Series'>

chrome yellow <class 'pandas.core.series.Series'>
brash <class 'pandas.core.series.Series'>
taproom <class 'pandas.core.series.Series'>
3heart <class 'pandas.core.series.Series'>
spiller park pcm <class 'pandas.core.series.Series'>
refuge <class 'pandas.core.series.Series'>
toptime <class 'pandas.core.series.Series'>
dtype: object

And just to quickly verify what you get when axis=1:

In [61]: cafes4.apply(lambda x: print('==> ' + x.name + '\n' + repr(x)) if x.name == 'east pole

==> east pole

poc jared
zip 30324
rating 4.0
price $$
value 2.0
Name: east pole, dtype: object

16
Exercise. Use DataFrame.apply() to update the 'value' column in cafes4, which is out of
date given the update of the prices.
In [62]: cafes4 # Verify visually that `'value'` is out of date
Out[62]: poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
3heart nhan 30306 4.0 $$$ 2.0
spiller park pcm dale 30308 4.0 $$$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0
In [63]: def calc_value(row):
return row['rating'] / len(row['price'])

cafes4['value'] = cafes4.apply(calc_value, axis=1)

cafes4
Out[63]: poc zip rating price value
east pole jared 30324 4.0 $$ 2.000000
chrome yellow kelly 30312 4.0 $$ 2.000000
brash matt 30318 4.0 $$ 2.000000
taproom jonathan 30317 4.0 $$ 2.000000
3heart nhan 30306 4.0 $$$ 1.333333
spiller park pcm dale 30308 4.0 $$$ 1.333333
refuge kitti 30303 4.0 $$ 2.000000
toptime nolan 30318 4.0 $$ 2.000000
Another useful operation is gluing DataFrame objects together. There are several helpful op-
erations covered in Notebook 7; one not mentioned there, but useful in one of its exercises, is
.concat().
In [64]: # Split based on price
is_cheap = cafes4['price'] <= '$$'
cafes_cheap = cafes4[is_cheap]
cafes_pricey = cafes4[~is_cheap]

display(cafes_cheap)
display(cafes_pricey)
poc zip rating price value
east pole jared 30324 4.0 $$ 2.0
chrome yellow kelly 30312 4.0 $$ 2.0
brash matt 30318 4.0 $$ 2.0
taproom jonathan 30317 4.0 $$ 2.0
refuge kitti 30303 4.0 $$ 2.0
toptime nolan 30318 4.0 $$ 2.0

17
poc zip rating price value
3heart nhan 30306 4.0 $$$ 1.333333
spiller park pcm dale 30308 4.0 $$$ 1.333333

In [65]: # Never mind; recombine

pd.concat([cafes_cheap, cafes_pricey])

Out[65]: poc zip rating price value

east pole jared 30324 4.0 $$ 2.000000
chrome yellow kelly 30312 4.0 $$ 2.000000
brash matt 30318 4.0 $$ 2.000000
taproom jonathan 30317 4.0 $$ 2.000000
refuge kitti 30303 4.0 $$ 2.000000
toptime nolan 30318 4.0 $$ 2.000000
3heart nhan 30306 4.0 $$$ 1.333333
spiller park pcm dale 30308 4.0 $$$ 1.333333

3.1 More on index objects

A pandas Index, used by Series and DataFrame, is "list-like." It has a number of useful operations,
including set-like operations (e.g., testing for membership, intersection, union, difference):

In [66]: from pandas import Index

In [67]: cafes4.index

Out[67]: Index(['east pole', 'chrome yellow', 'brash', 'taproom', '3heart',

'spiller park pcm', 'refuge', 'toptime'],
dtype='object')

In [68]: cafes4.index.isin(['brash', '3heart'])

Out[68]: array([False, False, True, False, True, False, False, False])

In [69]: cafes4.index.union(['chattahoochee'])

Out[69]: Index(['3heart', 'brash', 'chattahoochee', 'chrome yellow', 'east pole',

'refuge', 'spiller park pcm', 'taproom', 'toptime'],
dtype='object')

In [70]: cafes4.index.difference(['chattahoochee', 'starbucks', 'bar crema'])

Out[70]: Index(['3heart', 'brash', 'chrome yellow', 'east pole', 'refuge',

'spiller park pcm', 'taproom', 'toptime'],
dtype='object')

If you need to change the index of a DataFrame, here is one way to do it.

18
In [71]: cafes5 = cafes4.reindex(Index(['3heart', 'east pole', 'brash', 'starbucks']))

display(cafes4)
display(cafes5)

poc zip rating price value

east pole jared 30324 4.0 $$ 2.000000
chrome yellow kelly 30312 4.0 $$ 2.000000
brash matt 30318 4.0 $$ 2.000000
taproom jonathan 30317 4.0 $$ 2.000000
3heart nhan 30306 4.0 $$$ 1.333333
spiller park pcm dale 30308 4.0 $$$ 1.333333
refuge kitti 30303 4.0 $$ 2.000000
toptime nolan 30318 4.0 $$ 2.000000

poc zip rating price value

3heart nhan 30306.0 4.0 $$$ 1.333333
east pole jared 30324.0 4.0 $$ 2.000000
brash matt 30318.0 4.0 $$ 2.000000
starbucks NaN NaN NaN NaN NaN

Observe that this reindexing operation matches the supplied index values against the existing
ones. (What happens to index values you leave out? What happens with new index values?)
Another useful operation is dropping the index (and replacing it with the default, integers).

In [72]: cafes6 = cafes4.reset_index(drop=True)

cafes6['name'] = cafes4.index
cafes6

Out[72]: poc zip rating price value name

0 jared 30324 4.0 $$ 2.000000 east pole
1 kelly 30312 4.0 $$ 2.000000 chrome yellow
2 matt 30318 4.0 $$ 2.000000 brash
3 jonathan 30317 4.0 $$ 2.000000 taproom
4 nhan 30306 4.0 $$$ 1.333333 3heart
5 dale 30308 4.0 $$$ 1.333333 spiller park pcm
6 kitti 30303 4.0 $$ 2.000000 refuge
7 nolan 30318 4.0 $$ 2.000000 toptime

Fin! That’s the end of these notes. With this information as background, you should be able to
complete Notebook 7.

19
part1

October 9, 2022

1 Pandas walk-through: Federal Election Commission dataset

This walk-through is adapted from Chapter 14.5 of Wes McKinney’s book, Python for Data Anal-
ysis (3rd edition).

In [1]: %matplotlib inline

import pandas as pd
print(pd.__version__)

Matplotlib is building the font cache; this may take a moment.

1.4.0

2 Setup
The following code cell imports a local Python module (stored in cse6040utils.py) and uses one
of its utility functions to open the sample dataset.

Note: Due to the size of the data file, we are not making it available for download. You
will need to run this notebook on Vocareum.

In [2]: from cse6040utils import download_dataset

local_data = download_dataset({'P00000001-ALL.csv': '31df639d0b5dbd3b6d755f91d6bf6fb4'}

'./resource/asnlib/publicdata/P00000001-ALL.csv' is ready!

3 Initial exploration and clean-up

In [3]: # Load CSV file
fecdata = pd.read_csv(local_data['P00000001-ALL.csv'])
fecdata.head()

<ipython-input-3-91c3b0ffd817>:2: DtypeWarning: Columns (6) have mixed types. Specify dtype opt
fecdata = pd.read_csv(local_data['P00000001-ALL.csv'])

1
Out[3]: cmte_id cand_id cand_nm contbr_nm \
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD

contbr_city contbr_st contbr_zip contbr_employer \

0 MOBILE AL 366010290.0 RETIRED
1 MOBILE AL 366010290.0 RETIRED
2 LANETT AL 368633403.0 INFORMATION REQUESTED
3 PIGGOTT AR 724548253.0 NONE
4 HOT SPRINGS NATION AR 719016467.0 NONE

contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc \

0 RETIRED 250.0 20-JUN-11 NaN
1 RETIRED 50.0 23-JUN-11 NaN
2 INFORMATION REQUESTED 250.0 05-JUL-11 NaN
3 RETIRED 250.0 01-AUG-11 NaN
4 RETIRED 300.0 20-JUN-11 NaN

memo_cd memo_text form_tp file_num

0 NaN NaN SA17A 736166
1 NaN NaN SA17A 736166
2 NaN NaN SA17A 749073
3 NaN NaN SA17A 749073
4 NaN NaN SA17A 736166

In [4]: fecdata.head()

Out[4]: cmte_id cand_id cand_nm contbr_nm \

0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD

contbr_city contbr_st contbr_zip contbr_employer \

0 MOBILE AL 366010290.0 RETIRED
1 MOBILE AL 366010290.0 RETIRED
2 LANETT AL 368633403.0 INFORMATION REQUESTED
3 PIGGOTT AR 724548253.0 NONE
4 HOT SPRINGS NATION AR 719016467.0 NONE

contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc \

0 RETIRED 250.0 20-JUN-11 NaN
1 RETIRED 50.0 23-JUN-11 NaN
2 INFORMATION REQUESTED 250.0 05-JUL-11 NaN

2
3 RETIRED 250.0 01-AUG-11 NaN
4 RETIRED 300.0 20-JUN-11 NaN

memo_cd memo_text form_tp file_num

0 NaN NaN SA17A 736166
1 NaN NaN SA17A 736166
2 NaN NaN SA17A 749073
3 NaN NaN SA17A 749073
4 NaN NaN SA17A 736166

In [5]: fecdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 1001731 non-null object
1 cand_id 1001731 non-null object
2 cand_nm 1001731 non-null object
3 contbr_nm 1001731 non-null object
4 contbr_city 1001712 non-null object
5 contbr_st 1001727 non-null object
6 contbr_zip 1001620 non-null object
7 contbr_employer 988002 non-null object
8 contbr_occupation 993301 non-null object
9 contb_receipt_amt 1001731 non-null float64
10 contb_receipt_dt 1001731 non-null object
11 receipt_desc 14166 non-null object
12 memo_cd 92482 non-null object
13 memo_text 97770 non-null object
14 form_tp 1001731 non-null object
15 file_num 1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB

Get a random sample:

In [6]: fecdata.sample(5)

Out[6]: cmte_id cand_id cand_nm contbr_nm \

907172 C00496067 P00003608 Cain, Herman HUTCHINSON, DAVID
117243 C00431445 P80003338 Obama, Barack TAFARODI, MAHMOUD
273563 C00431445 P80003338 Obama, Barack MOWRY, HENRY
547504 C00431445 P80003338 Obama, Barack NATHAN, AMY
809326 C00495820 P80000748 Paul, Ron HARTSOOK, ANDREW MR.

contbr_city contbr_st contbr_zip contbr_employer \

3
907172 EASTPOINT FL 323280644.0 NaN
117243 BUCKEYE AZ 853966956 RETIRED
273563 JACKSONVILLE FL 32250 MORGAN & MORGAN
547504 LARCHMONT NY 105382302 SELF-EMPLOYED
809326 ZANESVILLE OH 437012222 NONE

contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc \

907172 RETIRED 99.0 08-NOV-11 NaN
117243 RETIRED 88.0 31-MAR-12 NaN
273563 ATTORNEY 500.0 06-OCT-11 NaN
547504 WRITER 50.0 26-MAR-12 NaN
809326 RETIRED 100.0 06-MAR-12 NaN

memo_cd memo_text form_tp file_num

907172 NaN NaN SA17A 776884
117243 X * OBAMA VICTORY FUND 2012 SA18 780234
273563 X * OBAMA VICTORY FUND 2012 SA18 772372
547504 NaN NaN SA17A 780234
809326 NaN NaN SA17A 780012

Summarize numerical data (.describe()):

In [7]: fecdata.describe()

Out[7]: contb_receipt_amt file_num

count 1.001731e+06 1.001731e+06
mean 2.982352e+02 7.744948e+05
std 3.749667e+03 1.059822e+04
min -3.080000e+04 7.235110e+05
25% 3.500000e+01 7.719270e+05
50% 1.000000e+02 7.792250e+05
75% 2.500000e+02 7.802340e+05
max 2.014491e+06 7.878030e+05

Get a list of the unique candidates (unique_candidates):

In [8]: unique_candidates = fecdata['cand_nm'].unique()

unique_candidates

Out[8]: array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',

"Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
'Huntsman, Jon', 'Perry, Rick'], dtype=object)

Assign party affiliations (they are all Republicans except for Barack Obama):

In [9]: party_affiliations = {name: 'D' if name == 'Obama, Barack' else 'R' for name in unique_
party_affiliations

4
Out[9]: {'Bachmann, Michelle': 'R',
'Romney, Mitt': 'R',
'Obama, Barack': 'D',
"Roemer, Charles E. 'Buddy' III": 'R',
'Pawlenty, Timothy': 'R',
'Johnson, Gary Earl': 'R',
'Paul, Ron': 'R',
'Santorum, Rick': 'R',
'Cain, Herman': 'R',
'Gingrich, Newt': 'R',
'McCotter, Thaddeus G': 'R',
'Huntsman, Jon': 'R',
'Perry, Rick': 'R'}

In [10]: aff = {name: "D" if name == "Obama, Barack" else "R" for name in unique_candidates}
aff

Out[10]: {'Bachmann, Michelle': 'R',

'Romney, Mitt': 'R',
'Obama, Barack': 'D',
"Roemer, Charles E. 'Buddy' III": 'R',
'Pawlenty, Timothy': 'R',
'Johnson, Gary Earl': 'R',
'Paul, Ron': 'R',
'Santorum, Rick': 'R',
'Cain, Herman': 'R',
'Gingrich, Newt': 'R',
'McCotter, Thaddeus G': 'R',
'Huntsman, Jon': 'R',
'Perry, Rick': 'R'}

In [11]: candidate_sample = fecdata['cand_nm'].sample(5)

candidate_sample

Out[11]: 973734 Bachmann, Michelle

404526 Obama, Barack
271902 Obama, Barack
279449 Obama, Barack
553562 Obama, Barack
Name: cand_nm, dtype: object

In [12]: candidate_sample.map(party_affiliations)

Out[12]: 973734 R
404526 D
271902 D
279449 D
553562 D
Name: cand_nm, dtype: object

5
In [13]: fecdata['party'] = fecdata['cand_nm'].map(party_affiliations)

In [14]: fecdata.sample(5)

Out[14]: cmte_id cand_id cand_nm \

177713 C00431445 P80003338 Obama, Barack
539733 C00431445 P80003338 Obama, Barack
377571 C00431445 P80003338 Obama, Barack
704096 C00493692 P20002523 Roemer, Charles E. 'Buddy' III
391007 C00431445 P80003338 Obama, Barack

contbr_nm contbr_city contbr_st contbr_zip \

177713 BURRESS, SVETLANA WEST SACRAMENTO CA 956912544
539733 BARBER, MARSHA FAIRPORT NY 144503128
377571 CADENELLI, STEPHEN G HARWICH MA 26451567
704096 NAQUIN, GREG BATON ROUGE LA 708193916
391007 ROBINSON, ALICE B. WELLESLEY MA 24827843

contbr_employer contbr_occupation \
177713 KAISER RN
539733 WEBSTER CENTRAL SCHOOLS SPECIAL EDUCATION TEACHER
377571 CAPE COD NATIONAL GOLF CLUB GOLF COURSE MANAGER
704096 SELF-EMPLOYED CORPORATE FINANCIAL CONSULTANT
391007 RETIRED RETIRED

contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text \

177713 25.0 01-FEB-12 NaN NaN NaN
539733 50.0 31-MAR-12 NaN NaN NaN
377571 100.0 28-FEB-12 NaN NaN NaN
704096 100.0 02-APR-11 NaN NaN NaN
391007 150.0 11-JUL-11 NaN NaN NaN

form_tp file_num party

177713 SA17A 787803 D
539733 SA17A 780234 D
377571 SA17A 787803 D
704096 SA17A 736468 R
391007 SA17A 756218 D

4 Total contributions by party and candidate

What was the total amount of contributions (in millions of dollars)?

In [15]: fecdata['contb_receipt_amt'].sum()*1e-6 # millions of dollars

Out[15]: 298.75139514000006

Which party got more individual donations (transactions, not total dollars)?

6
In [16]: fecdata['party'].value_counts()

Out[16]: D 593746
R 407985
Name: party, dtype: int64

Which party got more total dollars?

In [17]: fecdata.groupby('party')['contb_receipt_amt'].sum()*1e-6

Out[17]: party
D 133.502591
R 165.248804
Name: contb_receipt_amt, dtype: float64

Filter all the data to include only the two main candidates, Romney and Obama.

In [18]: keep_candidates = {'Obama, Barack', 'Romney, Mitt'}

In [19]: matches = fecdata['cand_nm'].apply(lambda x: x in keep_candidates)

fecdata[matches].shape

Out[19]: (700975, 17)

In [20]: fecmain = fecdata[fecdata['cand_nm'].isin(keep_candidates)].copy()

print(fecmain['cand_nm'].unique())
display(fecmain.sample(5))
display(fecmain.groupby('cand_nm')['contb_receipt_amt'].sum()*1e-6)

['Romney, Mitt' 'Obama, Barack']

cmte_id cand_id cand_nm contbr_nm \

646094 C00431445 P80003338 Obama, Barack DELOACH, MAURICE
514901 C00431445 P80003338 Obama, Barack ROSENBLUM, PAMELA
636830 C00431445 P80003338 Obama, Barack TOWLER, MARTIN HARRIS
687400 C00431445 P80003338 Obama, Barack PASCHALL, DAN
65495 C00431171 P80003353 Romney, Mitt LINDSEY, DANIEL R.

contbr_city contbr_st contbr_zip contbr_employer \

646094 CHESAPEAKE VA 233254349 CTR
514901 NEW YORK NY 10024 SELF-EMPLOYED
636830 LLANO TX 786432033 INFORMATION REQUESTED
687400 SPOKANE VALLEY WA 992122545 RETIRED
65495 CLOVIS NM 881015816 LINDSEY LAW OFFICE

contbr_occupation contb_receipt_amt contb_receipt_dt \

646094 PIPEFITTER 100.0 19-MAR-12
514901 PSYCH ANALYST 25.0 13-OCT-11
636830 INFORMATION REQUESTED 100.0 05-APR-12

7
687400 RETIRED 15.0 22-APR-12
65495 ATTORNEY 250.0 09-APR-12

receipt_desc memo_cd memo_text form_tp file_num party

646094 NaN NaN NaN SA17A 780234 D
514901 NaN NaN NaN SA17A 772372 D
636830 NaN NaN NaN SA17A 785239 D
687400 NaN NaN NaN SA17A 785239 D
65495 NaN NaN NaN SA17A 785689 R

cand_nm
Obama, Barack 133.502591
Romney, Mitt 85.219249
Name: contb_receipt_amt, dtype: float64

5 Who contributes?
Get a list of top occupations:

In [21]: len(fecmain['contbr_occupation'].unique())

Out[21]: 33606

In [22]: fecmain['contbr_occupation'].value_counts()

Out[22]: RETIRED 177897

ATTORNEY 30219
INFORMATION REQUESTED 24785
HOMEMAKER 19742
PHYSICIAN 17259
...
ASSSISTANT PROFESSOR 1
C.C.D.O. 1
SENIOR INTERNATIONAL CORRECTIONS ADVIS 1
OPERATOR TECHNICIAN/UNION OFFICAL 1
COMPLIANCE ANALYST, STATE GOVERNMENT 1
Name: contbr_occupation, Length: 33605, dtype: int64

Replace synonyms: (also: dict.get())

In [23]: occ_mapping = {'INFORMATION REQUESTED': 'NOT PROVIDED',

'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
'C.E.O.': 'CEO'}

In [24]: fecmain['contbr_occupation'].map(occ_mapping)

8
Out[24]: 411 NaN
412 NaN
413 NaN
414 NaN
415 NaN
...
701381 NaN
701382 NaN
701383 NaN
701384 NaN
701385 NaN
Name: contbr_occupation, Length: 700975, dtype: object

In [25]: # .get()!
print(occ_mapping.get('PROFESSOR'))
print(occ_mapping.get('PROFESSOR', 'PROFESSOR'))

None
PROFESSOR

In [26]: fecmain['contbr_occupation'] = fecmain['contbr_occupation'].map(lambda x: occ_mapping.

In [27]: fecmain['contbr_occupation']

Out[27]: 411 FOREIGN SERVICE OFFICER

412 FOREIGN SERVICE OFFICER
413 INTELLIGENCE ANALYST
414 ENGINEER
415 COMPUTER SYSTEMS ENGINEER
...
701381 GS-15 INTERNATIONAL RELATIONS OFFICER
701382 RETIRED
701383 EDUCATOR
701384 GS-15 INTERNATIONAL RELATIONS OFFICER
701385 TEACHER
Name: contbr_occupation, Length: 700975, dtype: object

Synonymous employer mappings:

In [28]: emp_mapping = occ_mapping.copy()

emp_mapping['SELF'] = 'SELF-EMPLOYED'
emp_mapping['SELF EMPLOYED'] = 'SELF-EMPLOYED'
emp_mapping

Out[28]: {'INFORMATION REQUESTED': 'NOT PROVIDED',

'INFORMATION REQUESTED PER BEST EFFORTS': 'NOT PROVIDED',
'INFORMATION REQUESTED (BEST EFFORTS)': 'NOT PROVIDED',
'C.E.O.': 'CEO',
'SELF': 'SELF-EMPLOYED',
'SELF EMPLOYED': 'SELF-EMPLOYED'}

9
In [29]: fecmain['contbr_employer'] = fecmain['contbr_employer'].map(lambda x: emp_mapping.get(

In [30]: emp_mapping.get('prof','pro')

Out[30]: 'pro'

Create a "pivot table" that shows occupations as rows and party affiliation as columns, sum-
ming the individual contributions.

In [31]: by_occ = fecmain.pivot_table('contb_receipt_amt', index='contbr_occupation', columns='

by_occ

Out[31]: party D R
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER 100.0 NaN
AREA VICE PRESIDENT 250.0 NaN
RESEARCH ASSOCIATE 100.0 NaN
TEACHER 500.0 NaN
THERAPIST 3900.0 NaN
... ... ..
ZEPPOS AND ASSOCIATES 1000.0 NaN
ZONE MANAGER 135.0 NaN
ZOOKEEPER 35.0 NaN
ZOOLOGIST 400.0 NaN
ZOOLOGY EDUCATION 25.0 NaN

[33603 rows x 2 columns]

Determine which occupations account for $1 million or more in contributions. Compare the
amounts between the two party affiliations. (Bonus: Make a plot to compare these visually.)

In [32]: over_1mil = by_occ[by_occ.sum(axis=1) > 1e6]*1e-6

len(over_1mil)

Out[32]: 25

In [33]: over_1mil

Out[33]: party D R
contbr_occupation
ATTORNEY 11.126933 5.302579
BANKER 0.224084 0.993536
BUSINESS OWNER 0.449979 0.601530
CEO 2.071475 2.247242
CHAIRMAN 0.495547 0.752587
CONSULTANT 2.459813 1.404577
ENGINEER 0.950426 0.317249
EXECUTIVE 1.355161 2.230654
FINANCE 0.294031 1.118567

10
HOMEMAKER 4.243394 8.037251
INVESTOR 0.884133 1.494725
LAWYER 3.159392 0.007705
MANAGER 0.762693 0.565826
NOT EMPLOYED 1.708108 NaN
NOT PROVIDED 4.849802 11.173375
OWNER 0.998867 0.845396
PHYSICIAN 3.732387 1.332996
PRESIDENT 1.878010 2.403440
PROFESSOR 2.163571 0.160362
REAL ESTATE 0.528752 1.033896
RETIRED 25.270507 11.266949
SELF-EMPLOYED 0.667393 1.063738
STUDENT 0.627450 0.488743
TEACHER 1.250969 0.138838
WRITER 1.084089 0.090363

In [34]: sorted_occ = over_1mil.sum(axis=1).sort_values()

In [35]: over_1mil.sum(axis=1).sort_values()

Out[35]: contbr_occupation
BUSINESS OWNER 1.051509
STUDENT 1.116193
WRITER 1.174452
BANKER 1.217621
CHAIRMAN 1.248134
ENGINEER 1.267675
MANAGER 1.328519
TEACHER 1.389808
FINANCE 1.412599
REAL ESTATE 1.562648
NOT EMPLOYED 1.708108
SELF-EMPLOYED 1.731132
OWNER 1.844263
PROFESSOR 2.323933
INVESTOR 2.378858
LAWYER 3.167097
EXECUTIVE 3.585815
CONSULTANT 3.864390
PRESIDENT 4.281450
CEO 4.318717
PHYSICIAN 5.065384
HOMEMAKER 12.280645
NOT PROVIDED 16.023177
ATTORNEY 16.429512
RETIRED 36.537456
dtype: float64

11
In [36]: over_1mil_sorted = over_1mil.loc[sorted_occ.index]
over_1mil_sorted.plot(kind='barh', stacked=True, figsize=(10, 6));

6 Simple ranking
Determine largest donors:

In [37]: largest_donors = fecmain['contb_receipt_amt'].nlargest(7)

largest_donors

Out[37]: 325136 2014490.51

326651 1944042.43
344539 1679114.65
344419 1511192.17
319478 526246.17
335187 512710.91
257270 451726.00
Name: contb_receipt_amt, dtype: float64

In [38]: fecmain.loc[largest_donors.index]

Out[38]: cmte_id cand_id cand_nm \

325136 C00431445 P80003338 Obama, Barack
326651 C00431445 P80003338 Obama, Barack
344539 C00431445 P80003338 Obama, Barack
344419 C00431445 P80003338 Obama, Barack
319478 C00431445 P80003338 Obama, Barack
335187 C00431445 P80003338 Obama, Barack

12
257270 C00431445 P80003338 Obama, Barack

contbr_nm contbr_city contbr_st contbr_zip \

325136 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680
326651 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680
344539 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680.0
344419 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680.0
319478 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680
335187 OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680.0
257270 OBAMA VICTORY FUND 2012 - UNITEMIZED WASHINGTON DC 20003

contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt \

325136 NaN NaN 2014490.51 29-FEB-12
326651 NaN NaN 1944042.43 31-DEC-11
344539 NaN NaN 1679114.65 31-MAR-12
344419 NaN NaN 1511192.17 30-SEP-11
319478 NaN NaN 526246.17 30-APR-12
335187 NaN NaN 512710.91 31-JAN-12
257270 NaN NaN 451726.00 30-JUN-11

receipt_desc memo_cd memo_text form_tp file_num party

325136 NaN X * SA18 787803 D
326651 NaN X * SA18 772372 D
344539 NaN X * SA18 780234 D
344419 NaN X * SA18 756218 D
319478 NaN X * SA18 785239 D
335187 NaN X * SA18 775668 D
257270 NaN X * SA18 756214 D

Display largest donors, grouped by candidate:

In [39]: grouped = fecmain.groupby('cand_nm')

grouped['contb_receipt_amt'].nlargest(3)

Out[39]: cand_nm
Obama, Barack 325136 2014490.51
326651 1944042.43
344539 1679114.65
Romney, Mitt 65131 12700.00
41888 10000.00
90076 10000.00
Name: contb_receipt_amt, dtype: float64

In [40]: type(grouped)

Out[40]: pandas.core.groupby.generic.DataFrameGroupBy

.apply() for groups:

In [41]: grouped.apply(lambda x: type(x))

13
Out[41]: cand_nm
Obama, Barack <class 'pandas.core.frame.DataFrame'>
Romney, Mitt <class 'pandas.core.frame.DataFrame'>
dtype: object

Use .apply() to get DataFrame objects showing the largest donors, grouped by candidate and
occupation:

In [42]: def top_amounts_by_occupation(df, n=5):

# Fill me in!
totals = df.groupby('contbr_occupation')['contb_receipt_amt'].sum()
return totals.nlargest(n)

top_amounts_by_occupation(fecmain)

Out[42]: contbr_occupation
RETIRED 36537456.46
ATTORNEY 16429511.79
NOT PROVIDED 16023176.80
HOMEMAKER 12280645.16
PHYSICIAN 5065383.78
Name: contb_receipt_amt, dtype: float64

In [43]: grouped.apply(top_amounts_by_occupation, n=10)

Out[43]: cand_nm contbr_occupation

Obama, Barack RETIRED 25270507.23
ATTORNEY 11126932.97
NOT PROVIDED 4849801.96
HOMEMAKER 4243394.30
PHYSICIAN 3732387.44
LAWYER 3159391.87
CONSULTANT 2459812.71
PROFESSOR 2163571.08
CEO 2071474.79
PRESIDENT 1878009.95
Romney, Mitt RETIRED 11266949.23
NOT PROVIDED 11173374.84
HOMEMAKER 8037250.86
ATTORNEY 5302578.82
PRESIDENT 2403439.77
CEO 2247242.03
EXECUTIVE 2230653.79
INVESTOR 1494725.12
CONSULTANT 1404576.94
PHYSICIAN 1332996.34
Name: contb_receipt_amt, dtype: float64

14
7 Big vs. small donations
For each of the leading two candidates, did most of their money come from large or small dona-
tions?

In [44]: bins = [0] + [10**k for k in range(0, 8)]

bins

Out[44]: [0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]

In [45]: labels = pd.cut(fecmain['contb_receipt_amt'], bins, right=False)

labels[:5]

Out[45]: 411 [10, 100)

412 [100, 1000)
413 [100, 1000)
414 [10, 100)
415 [100, 1000)
Name: contb_receipt_amt, dtype: category
Categories (8, interval[int64, left]): [[0, 1) < [1, 10) < [10, 100) < [100, 1000) < [

In [46]: grouped = fecmain.groupby(['cand_nm', labels])

grouped.size()

Out[46]: cand_nm contb_receipt_amt

Obama, Barack [0, 1) 824
[1, 10) 12028
[10, 100) 300624
[100, 1000) 239811
[1000, 10000) 36440
[10000, 100000) 2
[100000, 1000000) 3
[1000000, 10000000) 4
Romney, Mitt [0, 1) 0
[1, 10) 1407
[10, 100) 22575
[100, 1000) 42991
[1000, 10000) 38177
[10000, 100000) 5
[100000, 1000000) 0
[1000000, 10000000) 0
dtype: int64

Fin!

NB 13
No ratings yet
NB 13
27 pages
Week 4 Homework: This Is A Preview of The Published Version of The Quiz
No ratings yet
Week 4 Homework: This Is A Preview of The Published Version of The Quiz
7 pages
NB 14
No ratings yet
NB 14
15 pages
Python Cheat Sheet
No ratings yet
Python Cheat Sheet
26 pages
Python Assignment
No ratings yet
Python Assignment
7 pages
Manipulating and Analyzing Data With Pandas
No ratings yet
Manipulating and Analyzing Data With Pandas
50 pages
NB 9
No ratings yet
NB 9
29 pages
In All The Regression Models That We Have Considered So
100% (1)
In All The Regression Models That We Have Considered So
52 pages
Unit 3 - Week 1 Quiz
No ratings yet
Unit 3 - Week 1 Quiz
3 pages
Sta 242-Bivariate Analysis-2-Joint MGF
No ratings yet
Sta 242-Bivariate Analysis-2-Joint MGF
9 pages
Tidy Data with Pandas: A Comprehensive Guide
No ratings yet
Tidy Data with Pandas: A Comprehensive Guide
25 pages
Python Programming Assignment Tasks
No ratings yet
Python Programming Assignment Tasks
12 pages
Logistic Regression in R
No ratings yet
Logistic Regression in R
19 pages
Basic Python
No ratings yet
Basic Python
111 pages
NB 15
No ratings yet
NB 15
20 pages
NB 10
0% (1)
NB 10
24 pages
Python NumPy Course: Multi-dimensional Data
No ratings yet
Python NumPy Course: Multi-dimensional Data
36 pages
Machine Learning Strategies
No ratings yet
Machine Learning Strategies
59 pages
ISYE 8803 - Kamran - M1 - Intro To HD and Functional Data - Updated
No ratings yet
ISYE 8803 - Kamran - M1 - Intro To HD and Functional Data - Updated
87 pages
Data Science Lab
No ratings yet
Data Science Lab
28 pages
C Programming Arrays
No ratings yet
C Programming Arrays
34 pages
Python Sets Presentation
No ratings yet
Python Sets Presentation
44 pages
Probability Statistics
No ratings yet
Probability Statistics
125 pages
1-PPS Python Lab Manual CSEL202
No ratings yet
1-PPS Python Lab Manual CSEL202
86 pages
Digital Filter Structures Guide
No ratings yet
Digital Filter Structures Guide
32 pages
Python Assignment
No ratings yet
Python Assignment
2 pages
f19 HW03 Module02b Solns PDF
No ratings yet
f19 HW03 Module02b Solns PDF
8 pages
Classification & Regression Models
No ratings yet
Classification & Regression Models
32 pages
Unit I
100% (1)
Unit I
27 pages
ROS Cheat Sheet v1.01 1
No ratings yet
ROS Cheat Sheet v1.01 1
1 page
Cheat Sheet For Competitive Programming With Python 3 - by Utsav Chokshi - Cheat Sheets - Medium
No ratings yet
Cheat Sheet For Competitive Programming With Python 3 - by Utsav Chokshi - Cheat Sheets - Medium
10 pages
NumPy for Data Science Enthusiasts
No ratings yet
NumPy for Data Science Enthusiasts
119 pages
Beta & Gamma Functions
No ratings yet
Beta & Gamma Functions
16 pages
KNN Algorithm for Students
100% (1)
KNN Algorithm for Students
18 pages
Lists and Tuples Part 1 - ch07 - Part1
No ratings yet
Lists and Tuples Part 1 - ch07 - Part1
27 pages
Hennig 2021 Probabilistic Machine Learning
No ratings yet
Hennig 2021 Probabilistic Machine Learning
189 pages
Data Science Codealpha
No ratings yet
Data Science Codealpha
11 pages
Data Structures and Algorithms (DSA) in Python - Self Paced
No ratings yet
Data Structures and Algorithms (DSA) in Python - Self Paced
4 pages
CMOS Power System Design Guide
No ratings yet
CMOS Power System Design Guide
16 pages
A1 PDF
No ratings yet
A1 PDF
3 pages
Probability Distributions in Data Science - Towards Data Science
No ratings yet
Probability Distributions in Data Science - Towards Data Science
15 pages
Linked List PDF
No ratings yet
Linked List PDF
49 pages
Bma3201 Operations Research I
No ratings yet
Bma3201 Operations Research I
190 pages
Linear Regression Models Guide
100% (1)
Linear Regression Models Guide
61 pages
Linked List A5
100% (1)
Linked List A5
196 pages
Discrete Struc Ture: Counting, Peigonhole
No ratings yet
Discrete Struc Ture: Counting, Peigonhole
50 pages
Aim Training for Gamers
No ratings yet
Aim Training for Gamers
56 pages
Grupo 7 Build A Geospatial Dashboard in Python Using Greppo by Adithya Krishnan Towards Data Science
100% (1)
Grupo 7 Build A Geospatial Dashboard in Python Using Greppo by Adithya Krishnan Towards Data Science
13 pages
ML MU Unit 2
100% (2)
ML MU Unit 2
42 pages
Assignment-Questions-Python 2 Feb24
No ratings yet
Assignment-Questions-Python 2 Feb24
13 pages
Probability-The Science of Uncertainty and Data
No ratings yet
Probability-The Science of Uncertainty and Data
4 pages
k-Means Clustering Guide
No ratings yet
k-Means Clustering Guide
2 pages
Daa Project File
No ratings yet
Daa Project File
49 pages
Lecture 3 Mathematics For Machine Learning
No ratings yet
Lecture 3 Mathematics For Machine Learning
97 pages
About The Classification and Regression Supervised Learning Problems
No ratings yet
About The Classification and Regression Supervised Learning Problems
3 pages
Time Series Forecasting Business Report
No ratings yet
Time Series Forecasting Business Report
62 pages
LatestPythonLabManual-2023 Batch
No ratings yet
LatestPythonLabManual-2023 Batch
15 pages
Ip-Data Handling With Pandas
No ratings yet
Ip-Data Handling With Pandas
14 pages
Ip-Data Handling With Pandas
No ratings yet
Ip-Data Handling With Pandas
13 pages
Python Pandas Series
No ratings yet
Python Pandas Series
30 pages
Nb3 (Optional)
No ratings yet
Nb3 (Optional)
35 pages
Python Algorithm Exercises & Solutions
No ratings yet
Python Algorithm Exercises & Solutions
16 pages
Linear Regression for Data Science
No ratings yet
Linear Regression for Data Science
34 pages
NB 1
No ratings yet
NB 1
17 pages

Pandas for Data Science Students

Uploaded by

Pandas for Data Science Students

Uploaded by

part0

1 Supplemental notes on Pandas

In [1]: import pandas as pd # Standard idiom for loading pandas

print("=== pandas version: {} ===\n".format(pd.__version__))

=== pandas version: 1.4.0 ===

=== Python version ===

In [2]: from pandas import DataFrame, Series

`obj` has type `<class 'pandas.core.series.Series'>`:

In [4]: from random import randint

print("==> Estimating time to search the `list`:")

print("\n==> Estimating time to search the `Series`:")

print(f"\n==> (`list` time) divided by `Series` time is roughly {t_list_search.average

Creating random `list` and `Series` objects:

==> Estimating time to search the `list`:

==> Estimating time to search the `Series`:

In [5]: obj2 = Series([-1, '2', -3, '4', -5])

If you want to query the base type, use:

In [7]: print("obj.index: {}".format(obj.index))

obj.index: RangeIndex(start=0, stop=5, step=1)

In [8]: print("==> obj[2]:\n{}\n".format(obj[2]))

In [10]: I_pos = obj > 0

<class 'pandas.core.series.Series'> 0 False

In [12]: obj3 = Series([ 1, -2, 3, -4, 5, -6],

In fact, you can construct a Series from a dictionary directly:

In [15]: mujeres = [0, 2, 4] # list of integer offsets

* las mujeres of `obj3` at offsets [0, 2, 4]:

In [16]: hombres = ['bob', 'dave', 'frank'] # list of index values

* hombres, by their names, ['bob', 'dave', 'frank']:

In [17]: I_neg = obj3 < 0

In [19]: print('carol' in peeps)

Basic arithmetic works on Series as vector-like operations.

In [20]: print(obj3, "\n")

In [22]: obj_l = obj3[mujeres]

In [23]: obj3 + obj_l

Out[23]: alice 2.0

In [24]: abs(-5) # Python built-in function

In [25]: obj3 # Recall

In [27]: obj3 # Note: `.apply()` returned a copy, so the original is untouched

A Series may be named, too.

When we move on to DataFrame objects, you’ll see why names matter.

type: <class 'pandas.core.frame.DataFrame'>

In [31]: display(cafes) # Or just `cafes` as the last line of a cell

name zip poc

Out[32]: Index(['name', 'zip', 'poc'], dtype='object')

Each column is a named Series:

In [33]: type(cafes['zip']) # Aha!

Out[34]: RangeIndex(start=0, stop=8, step=1)

In [35]: cafes.index == cafes['zip'].index

In [36]: cafes['zip'].index == cafes['poc'].index

You may use complex indexing of columns.

In [37]: target_fields = ['zip', 'poc']

Out[37]: zip poc

But slices apply to rows.

Out[38]: name zip poc

Out[39]: RangeIndex(start=0, stop=8, step=1)

In [40]: cafes2 = cafes[['poc', 'zip']]

Out[40]: poc zip

In [41]: cafes2.loc[['chrome yellow', '3heart']]

Out[41]: poc zip

In [42]: cafes2.iloc[[1, 3]]

Out[42]: poc zip

In [43]: cafes2['rating'] = 4.0

Out[43]: poc zip rating price

And vector arithmetic should work on columns as expected.

In [44]: prices_as_ints = cafes2['price'].apply(lambda s: len(s))

Out[44]: east pole 2

In [45]: cafes2['value'] = cafes2['rating'] / prices_as_ints

Out[45]: poc zip rating price value

In [46]: cafes3 = cafes2.copy()

Out[46]: poc zip rating price value

In [47]: is_fancy = cafes3['zip'].isin({30306, 30308})

Out[47]: east pole False

Out[48]: poc zip rating price value

In [49]: # Recall: Legal Python for string concatenation

In [50]: cafes3[is_fancy]['price'] += '$'

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/i

print("=== pandas version: {} ===\n".format(pd.version))