2.1 Pandas Objects
This section introduces three fundamental Pandas data structures: the Series, DataFrame, and Index.
We will start our code sessions with the standard NumPy and Pandas imports:
In [1]: import numpy as np
import pandas as pd
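The indexing examples in this section operate on a Series named data. The following is a minimal sketch of how such a Series might be created, assuming it holds the same four values used later in this section together with the default integer index:

```python
import pandas as pd

# Assumed definition: four float values with the default integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values   # the underlying NumPy array of values
index = data.index     # the index object, here a default integer range
```

The values and index attributes expose the two halves of a Series: the data itself and the labels attached to it.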
As with a NumPy array, data in a Series can be accessed by the associated index via the familiar
Python square-bracket notation:
In [5]: data[1]
Out[5]: 0.5
In [6]: data[1:3]
Out[6]:
1 0.50
2 0.75
dtype: float64
Series as Generalized NumPy Array
The Series object may appear to be basically interchangeable with a one-dimensional
NumPy array.
The essential difference is that while the NumPy array has an implicitly defined integer
index used to access the values, the Pandas Series has an explicitly defined index associated
with the values.
This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of any desired type.
So, if we wish, we can use strings as an index:
In [7]: data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Out[7]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
And the item access works as expected:
In [8]: data['b']
Out[8]: 0.5
We can even use noncontiguous or nonsequential indices:
In [9]: data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
Out[9]:
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
In [10]: data[5]
Out[10]: 0.5
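A Series can also be thought of as a specialization of a Python dictionary, mapping index labels to values. As a minimal sketch, the population Series used in the following examples could be built directly from a dictionary. The name population_dict is an assumption made for illustration; the values are those shown in the output below:

```python
import pandas as pd

# Values taken from the output shown in the text; population_dict is an
# illustrative name, not necessarily the one used originally
population_dict = {'California': 39538223, 'Texas': 29145505,
                   'Florida': 21538187, 'New York': 20201249,
                   'Pennsylvania': 13002700}

# The dictionary keys become the index labels of the Series
population = pd.Series(population_dict)
```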
California 39538223
Texas 29145505
Florida 21538187
New York 20201249
Pennsylvania 13002700
dtype: int64
From here, typical dictionary-style item access can be performed:
In [12]:
population['California']
Out[12]:
39538223
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
In [13]:
population['California':'Florida']
Out[13]:
California 39538223
Texas 29145505
Florida 21538187
dtype: int64
A Series is constructed with a call of the following form:
pd.Series(data, index=index)
where index is an optional argument, and data can be one of several kinds of object.
For example, data can be a list or NumPy array, in which case index defaults to an integer
sequence:
In [14]:
pd.Series([2, 4, 6])
Out[14]:
0 2
1 4
2 6
dtype: int64
Or data can be a scalar, which is repeated to fill the specified index:
In [15]:
pd.Series(5, index=[100, 200, 300])
Out[15]:
100 5
200 5
300 5
dtype: int64
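data can also be a dictionary, in which case the index defaults to the dictionary keys, and an explicitly supplied index selects only the matching keys. The following is a minimal sketch with illustrative values that do not come from the text above:

```python
import pandas as pd

# With no explicit index, the dictionary keys become the index
s_all = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# An explicit index keeps only the requested keys, in the requested order
s_subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2])
```

Here s_subset contains only the entries for keys 1 and 2; the entry for key 3 is dropped because it is not in the requested index.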
To demonstrate this, let's first construct a new Series listing the area of each of the five states
discussed in the previous section (in square kilometers):
In [18]:
area_dict = {'California': 423967, 'Texas': 695662, 'Florida': 170312,
             'New York': 141297, 'Pennsylvania': 119280}
area = pd.Series(area_dict)
area
Out[18]:
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
dtype: int64
Now that we have this along with the population Series from before, we can use a dictionary to
construct a single two-dimensional object containing this information:
In [19]:
states = pd.DataFrame({'population': population,
'area': area})
states
Out[19]:
              population    area
California      39538223  423967
Texas           29145505  695662
Florida         21538187  170312
New York        20201249  141297
Pennsylvania    13002700  119280
Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
In [20]:
states.index
Out[20]:
Index(['California', 'Texas', 'Florida', 'New York',
'Pennsylvania'], dtype='object')
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column
labels:
In [21]:
states.columns
Out[21]:
Index(['population', 'area'], dtype='object')
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array,
where both the rows and columns have a generalized index for accessing the data.
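To make the array analogy concrete, the following minimal sketch rebuilds the states DataFrame from the values given in this section, converts it to a plain two-dimensional NumPy array, and then looks up a single entry by row and column label. The variable names array_2d and texas_area are illustrative:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})
area = pd.Series({'California': 423967, 'Texas': 695662, 'Florida': 170312,
                  'New York': 141297, 'Pennsylvania': 119280})
states = pd.DataFrame({'population': population, 'area': area})

# The underlying data as a plain 2D array: one row per state, one column per field
array_2d = states.to_numpy()

# Label-based lookup: row label first, then column label
texas_area = states.loc['Texas', 'area']
```

The array holds the same numbers, but only the DataFrame keeps the row and column labels needed for a lookup such as states.loc['Texas', 'area'].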
DataFrame as Specialized Dictionary
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary
maps a key to a value, a DataFrame maps a column name to a Series of column data. For example,
asking for the 'area' column returns the Series object containing the areas we saw earlier:
In [22]:
states['area']
Out[22]:
California 423967
Texas 695662
Florida 170312
New York 141297
Pennsylvania 119280
Name: area, dtype: int64
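A DataFrame can also be constructed from a single Series, giving a one-column table. The following is a minimal sketch, assuming the population Series from earlier in this section; the name population_df is illustrative:

```python
import pandas as pd

population = pd.Series({'California': 39538223, 'Texas': 29145505,
                        'Florida': 21538187, 'New York': 20201249,
                        'Pennsylvania': 13002700})

# A single Series becomes a one-column DataFrame; columns names that column
population_df = pd.DataFrame(population, columns=['population'])
```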
              population
California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
From a list of dicts
Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to
create some data:
In [24]:
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Out[24]:
a b
0 0 0
1 1 2
2 2 4
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN values (i.e., "Not
a Number"; see Handling Missing Data):
In [25]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Out[25]:
a b c
0 1.0 2 NaN
1 NaN 3 4.0
        foo       bar
a  0.471098  0.317396
b  0.614766  0.305971
c  0.533596  0.512377
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
Index objects are immutable. This immutability makes it safer to share indices between multiple
DataFrames and arrays, without the potential for side effects from inadvertent index modification.
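As a minimal sketch of what immutability means in practice, the following attempts to assign to one element of an Index and records whether Pandas rejects it. The values and the names ind and raised are illustrative:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Item assignment on an Index is not allowed and raises a TypeError
raised = False
try:
    ind[1] = 0
except TypeError:
    raised = True
```

After the failed assignment, ind still holds its original values.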
Index as Ordered Set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on
many aspects of set arithmetic. The Index object follows many of the conventions used by Python's
built-in set data structure, so that unions, intersections, differences, and other combinations can be
computed in a familiar way:
In [35]: indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
In [36]: indA.intersection(indB)
Out[36]: Int64Index([3, 5, 7], dtype='int64')
In [37]: indA.union(indB)
Out[37]: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
In [38]: indA.symmetric_difference(indB)
Out[38]: Int64Index([1, 2, 9, 11], dtype='int64')
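The difference operation mentioned above follows the same pattern. The following is a minimal sketch reusing indA and indB; the result names are illustrative, and the results are converted to lists so that they do not depend on how a given Pandas version prints an Index:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels present in one index but not in the other
only_in_A = indA.difference(indB).tolist()
only_in_B = indB.difference(indA).tolist()
```

Unlike intersection and union, difference is not symmetric: swapping the two indices changes the result.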