Unit 2
Fast vectorized array operations for data munging and cleaning, subsetting and
filtering, transformation, and any other kinds of computations
Common array algorithms like sorting, unique, and set operations
Efficient descriptive statistics and aggregating/summarizing data
Data alignment and relational data manipulations for merging and joining together
heterogeneous data sets
Expressing conditional logic as array expressions instead of loops with if-elif-else
branches
Group-wise data manipulations (aggregation, transformation, function application).
Much more on this in Chapter 5
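As a rough illustration of the "conditional logic as array expressions" item above, the sketch below compares a loop with an if-else branch to a single vectorized expression. The use of np.where and the sample values are my own choices for illustration, not something taken from this section.

```python
import numpy as np

values = np.array([-2.0, 0.5, 3.0, -1.5])

# Loop version with an if-else branch
clipped_loop = []
for v in values:
    if v < 0:
        clipped_loop.append(0.0)
    else:
        clipped_loop.append(v)

# Vectorized version: the condition is itself an array expression
clipped_vec = np.where(values < 0, 0.0, values)

print(clipped_vec)
```

Both versions produce the same values; the vectorized form avoids the Python-level loop.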
While NumPy provides the computational foundation for these operations, you will
likely want to use pandas as your basis for most kinds of data analysis (especially for
structured or tabular data) as it provides a rich, high-level interface making most common
data tasks very concise and simple. pandas also provides some more domain-specific
functionality like time series manipulation, which is not present in NumPy.
In this chapter and throughout the book, I use the standard NumPy
convention of always using import numpy as np. You are, of course,
welcome to put from numpy import * in your code to avoid having to
write np., but I would caution you against making a habit of this.
www.it-ebooks. info
In [12]: data.dtype
Out[12]: dtype('float64')
This chapter will introduce you to the basics of using NumPy arrays, and should be
sufficient for following along with the rest of the book. While it's not necessary to have
a deep understanding of NumPy for many data analytical applications, becoming proficient
in array-oriented programming and thinking is a key step along the way to becoming
a scientific Python guru.
Creating ndarrays
The easiest way to create an array is to use the array function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing
the passed data. For example, a list is a good candidate for conversion:
In [13]: data1 = [6, 7.5, 8, 0, 1]
In [14]: arr1 = np.array(data1)
In [15]: arr1
Out[15]: array([ 6. ,  7.5,  8. ,  0. ,  1. ])
Nested sequences, like a list of equal-length lists, will be converted into a
multidimensional array:
In [16]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
In [22]: arr2.dtype
Out[22]: dtype('int64')
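A minimal sketch of the nested-list conversion, assuming arr2 is built from data2 with np.array (the lines that create it are not part of this excerpt). The integer width reported by dtype can differ between platforms, so the comment below is deliberately non-specific.

```python
import numpy as np

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(arr2.ndim)   # number of dimensions
print(arr2.shape)  # (rows, columns)
print(arr2.dtype)  # an integer dtype; the exact width depends on the platform
```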
In addition to np.array, there are a number of other functions for creating new arrays.
As examples, zeros and ones create arrays of 0's or 1's, respectively, with a given length
or shape. empty creates an array without initializing its values to any particular value.
To create a higher dimensional array with these methods, pass a tuple for the shape:
In [23]: np.zeros(10)
Out[23]: array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
4.94065646e-324, 4.9
[[ 1.90723115e+083, 5.73293533e-053],
[ -2.33568637e+124, -6.70608105e-012],
[ 4.42786966e+160, 1.27100354e+025]]])
It's not safe to assume that np.empty will return an array of all zeros. In
many cases, as previously shown, it will return uninitialized garbage
values.
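The following sketch shows the three constructors side by side. The variable names and shapes are illustrative assumptions; the point is that an np.empty result should be filled explicitly before it is read.

```python
import numpy as np

zeros_1d = np.zeros(10)          # length-10 array of 0.0
ones_2d = np.ones((3, 6))        # shape passed as a tuple
scratch = np.empty((2, 3, 2))    # contents are arbitrary until assigned

scratch[:] = 0.0                 # initialize explicitly before reading
```

Using np.empty only pays off when every element is going to be overwritten anyway; otherwise np.zeros is the safer default.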
Function             Description
empty, empty_like    Create new arrays by allocating new memory, but do not populate with any values like ones and zeros
eye, identity        Create a square N x N identity matrix (1's on the diagonal and 0's elsewhere)
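A short, hedged example of the functions in the rows above. The array named template is a made-up input used only to give empty_like something to mirror.

```python
import numpy as np

eye3 = np.eye(3)            # 3 x 3 identity matrix
ident3 = np.identity(3)     # same result via identity

template = np.arange(6).reshape((2, 3))
buf = np.empty_like(template)   # same shape and dtype as template, values arbitrary
buf[:] = template               # fill before use
```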
Type                                 Type Code      Description
float128                             f16 or g       Extended-precision floating point
complex64, complex128, complex256    c8, c16, c32   Complex numbers represented by two 32, 64, or 128 floats, respectively
You can explicitly convert or cast an array from one dtype to another using ndarray's
astype method:
In [31]: arr = np.array([1, 2, 3, 4, 5])
In [32]: arr.dtype
Out[32]: dtype('int64')
In [33]: float_arr = arr.astype(np.float64)
In [34]: float_arr.dtype
Out[34]: dtype('float64')
In this example, integers were cast to floating point. If I cast some floating point
numbers to be of integer dtype, the decimal part will be truncated:
In [35]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
In [36]: arr
Out[36]: array([  3.7,  -1.2,  -2.6,   0.5,  12.9,  10.1])
In [37]: arr.astype(np.int32)
Out[37]: array([ 3, -1, -2,  0, 12, 10], dtype=int32)
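To make "truncated" concrete, this sketch contrasts astype with two alternatives. np.floor and np.rint are not discussed in this section; they are included here only as points of comparison.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)            # drops the fractional part (toward zero)
floored = np.floor(arr).astype(np.int32)    # rounds down instead
rounded = np.rint(arr).astype(np.int32)     # rounds to the nearest integer first
```

Note how the negative entries differ: truncation turns -1.2 into -1, while flooring gives -2.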
Should you have an array of strings representing numbers, you can use astype to convert
them to numeric form:
In [38]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
In [39]: numeric_strings.astype(float)
Out[39]: array([  1.25,  -9.6 ,  42.  ])
If casting were to fail for some reason (like a string that cannot be converted to
float64), a ValueError will be raised. Note that I was a bit lazy and wrote float instead of
np.float64; NumPy is smart enough to alias the Python types to the equivalent dtypes.
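A small sketch of both outcomes, under the assumption that the strings are stored with NumPy's default string dtype rather than an explicit one. In the NumPy versions I am aware of, an unparseable string surfaces as a ValueError, which is what the except clause below catches; bad_strings is a made-up input.

```python
import numpy as np

numeric_strings = np.array(['1.25', '-9.6', '42'])
as_float = numeric_strings.astype(float)     # float is aliased to np.float64

bad_strings = np.array(['1.25', 'not a number'])   # hypothetical bad input
try:
    bad_strings.astype(float)
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```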
You can also use another array's dtype attribute:
In [40]: int_array = np.arange(10)
In [41]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
In [42]: int_array.astype(calibers.dtype)
Out[42]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
There are shorthand type code strings you can also use to refer to a dtype:
In [43]: empty_uint32 = np.empty(8, dtype='u4')
In [44]: empty_uint32
Out[44]:
array([ 0, 65904672, 0, 64856792, 0,
39438163, 0], dtype=uint32)
Calling astype always creates a new array (a copy of the data), even if
the new dtype is the same as the old dtype.
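A quick way to see the copy semantics is to cast to the unchanged dtype and modify the result. The names here are illustrative.

```python
import numpy as np

arr = np.arange(5)
same_dtype = arr.astype(arr.dtype)   # dtype unchanged, but still a new array

same_dtype[0] = 99
print(arr[0])                              # the original is untouched
print(np.shares_memory(arr, same_dtype))
```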
It's worth keeping in mind that floating point numbers, such as those
in float64 and float32 arrays, are only capable of approximating fractional
quantities. In complex computations, you may accrue some
floating point error, making comparisons only valid up to a certain number
of decimal places.
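One common way to cope with this is to compare with a tolerance instead of exact equality. The sketch below uses np.isclose and np.allclose with their default tolerances; that choice is an assumption on my part rather than something this section prescribes.

```python
import numpy as np

total = np.float64(0.1) + np.float64(0.2)
exact_match = (total == 0.3)             # exact comparison is unreliable here
close_match = np.isclose(total, 0.3)     # comparison up to a small tolerance

a = np.array([0.1, 0.2, 0.3]).cumsum()
b = np.array([0.1, 0.3, 0.6])
all_close = np.allclose(a, b)
```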
Operations between differently sized arrays are called broadcasting and will be discussed
in more detail in Chapter 12. Having a deep understanding of broadcasting is not
necessary for most of this book.
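For orientation only, here is a minimal sketch of the simplest cases of broadcasting: a scalar and a 1D array combined with a 2D array. The arrays are made up for illustration.

```python
import numpy as np

table = np.arange(6).reshape((2, 3))    # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)

shifted = table + row      # row is added to each row of table
scaled = table * 2         # a scalar is applied to every element
```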
If you want a copy of a slice of an ndarray instead of a view, you will
need to explicitly copy the array; for example arr[5:8].copy().
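The difference between a view and an explicit copy can be checked directly. This is a small sketch; the name snapshot is illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]              # a view: shares memory with arr
snapshot = arr[5:8].copy()   # an independent copy

view[:] = 12                 # shows up in arr
snapshot[:] = -1             # does not
print(arr)
```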
With higher dimensional arrays, you have many more options. In a two-dimensional
array, the elements at each index are no longer scalars but rather one-dimensional
arrays:
In [62]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [63]: arr2d[2]
Out[63]: array([7, 8, 9])
Thus, individual elements can be accessed recursively. But that is a bit too much work,
so you can pass a comma-separated list of indices to select individual elements. So these
are equivalent:
In [64]: arr2d[0][2]
Out[64]: 3
In [65]: arr2d[0, 2]
Out[65]: 3
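The two spellings can be compared step by step. In this sketch the intermediate variable row is added only to show what the recursive form does first.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = arr2d[0]            # first step of the recursive form: a 1D array
recursive = arr2d[0][2]   # index that row again
direct = arr2d[0, 2]      # one indexing operation with a pair of indices
```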
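See Figure 4-1 for an illustration of indexing on a 2D array.
[Figure 4-1: indexing elements of a 2D array; each cell is labeled by its pair of indices, for example 2,0  2,1  2,2 along axis 1 in the last row]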
In multidimensional arrays, if you omit later indices, the returned object will be a
lower-dimensional ndarray consisting of all the data along the higher dimensions. So in the
2 x 2 x 3 array arr3d
In [66]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [67]: arr3d
Out[67]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 7,  8,  9],
        [10, 11, 12]]])
arr3d[0] is a 2 x 3 array:
In [68]: arr3d[0]
Out[68]:
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]:
In [69]: old_values = arr3d[0].copy()
In [70]: arr3d[0] = 42
In [71]: arr3d
Out[71]:
array([[[42, 42, 42],
        [42, 42, 42]],
       [[ 7,  8,  9],
        [10, 11, 12]]])
In [72]: arr3d[0] = old_values
In [73]: arr3d
Out[73]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 7,  8,  9],
        [10, 11, 12]]])
Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
forming a 1-dimensional array:
In [74]: arr3d[1, 0]
Out[74]: array([7, 8, 9])
Note that in all of these cases where subsections of the array have been selected, the
returned arrays are views.
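Because the selection is a view, writing to it changes the parent array. A minimal sketch, with sub as an illustrative name:

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

sub = arr3d[1, 0]     # 1D view of the values whose indices start with (1, 0)
sub[0] = 70           # writes through to arr3d
print(arr3d[1, 0, 0])
```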
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
array([[1, 2, 3],
       [4, 5, 6]])
As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range
of elements along an axis. You can pass multiple slices just like you can pass multiple
indexes:
In [78]: arr2d[:2, 1:]
Out[78]:
array([[2, 3],
       [5, 6]])
When slicing like this, you always obtain array views of the same number of dimensions.
By mixing integer indexes and slices, you get lower dimensional slices:
In [79]: arr2d[1, :2]
Out[79]: array([4, 5])
In [80]: arr2d[2, :1]
Out[80]: array([7])
See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire
axis, so you can slice only higher dimensional axes by doing:
In [81]: arr2d[:, :1]
Out[81]:
array([[1],
       [4],
       [7]])
Of course, assigning to a slice expression assigns to the whole selection:
In [82]: arr2d[:2, 1:] = 0
Boolean Indexing
Let's consider an example where we have some data in an array and an array of names
with duplicates. Here I use the randn function in numpy.random to generate
some random normally distributed data:
In [83]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
[0.1913, 0.4544, 0.4519, 0.5535],
[0.5994, 0.8174, -0.9297, -1.2564]])
Expression Shape
arr[:2, 1:] (2, 2)
arr[2] (3,)
arr[2, :] (3,)
arr[2:, :] (1, 3)
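The shapes listed above can be checked mechanically. This sketch assumes arr is a 3 x 3 array, which is what the listed shapes imply; the contents are arbitrary.

```python
import numpy as np

arr = np.arange(1, 10).reshape((3, 3))   # assumed 3 x 3 input

shapes = {
    "arr[:2, 1:]": arr[:2, 1:].shape,
    "arr[2]": arr[2].shape,
    "arr[2, :]": arr[2, :].shape,
    "arr[2:, :]": arr[2:, :].shape,
}
for expr, shape in shapes.items():
    print(expr, shape)
```

An integer index removes an axis, while a slice keeps it even when it selects a single row.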
Suppose each name corresponds to a row in the data array and we wanted to select all
the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons
(such as ==) with arrays are also vectorized. Thus, comparing names with the string
'Bob' yields a boolean array:
In [87]: names == 'Bob'
Out[87]: array([ True, False, False, True, False, False, False], dtype=bool)
This boolean array can be passed when indexing the array:
In [88]: data[names == 'Bob']
Out[88]:
array([[-0.048 ,  0.5433, -0.2349,  1.2792],
       [ 2.1452,  0.8799, -0.0523,  0.0672]])
The boolean array must be of the same length as the axis it's indexing. You can even
mix and match boolean arrays with slices or integers (or sequences of integers, more
on this later):
In [89]: data[names == 'Bob', 2:]
Out[89]:
array([[-0.2349,  1.2792],
       [-0.0523,  0.0672]])
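The same selections can be reproduced with deterministic data. In this sketch np.arange stands in for the random values so the result is predictable; the names mask, bob_rows, and bob_tail are illustrative.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape((7, 4))   # stand-in for the random data

mask = names == 'Bob'        # one boolean per row of data
bob_rows = data[mask]        # rows 0 and 3
bob_tail = data[mask, 2:]    # same rows, columns 2 onward
```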
Setting values with boolean arrays works in a common-sense way. To set all of the
negative values in data to 0 we need only do:
In [96]: data[data < 0] = 0
In [97]: data
Out[97]:
array([[ 0.    ,  0.5433,  0.    ,  1.2792],
       [ 0.    ,  0.5465,  0.0939,  0.    ],
       [ 0.    ,  0.    ,  0.7719,  0.3103],
       [ 2.1452,  0.8799,  0.    ,  0.0672],
       [ 0.    ,  0.    ,  1.1503,  1.7289],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174,  0.    ,  0.    ]])
Setting whole rows or columns using a 1D boolean array is also easy:
In [98]: data[names != 'Joe'] = 7
In [99]: data
Out[99]:
array([[ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 0.    ,  0.5465,  0.0939,  0.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174,  0.    ,  0.    ]])
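Both kinds of assignment can be tried on predictable input. The data array below is a deterministic stand-in chosen for illustration, not the random data used in the session above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(-14, 14, dtype=float).reshape((7, 4))   # stand-in values

data[data < 0] = 0           # element-wise mask with the same shape as data
data[names != 'Joe'] = 7     # 1D mask selects whole rows
```

After both steps only the 'Joe' rows (1, 5, and 6) keep their own values.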
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
Suppose we had an 8 x 4 array:
In [100]: arr = np.empty((8, 4))
In [101]: for i in range(8):
   .....:     arr[i] = i
In [102]: arr
Out[102]:
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])
To select out a subset of the rows in a particular order, you can simply pass a list or
ndarray of integers specifying the desired order:
In [103]: arr[[4, 3, 0, 6]]
Out[103]:
array([[ 4.,  4.,  4.,  4.],
       [ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 6.,  6.,  6.,  6.]])
Hopefully this code did what you expected! Using negative indices selects rows from
the end:
In [104]: arr[[-3, -5, -7]]
Out[104]:
array([[ 5.,  5.,  5.,  5.],
       [ 3.,  3.,  3.,  3.],
       [ 1.,  1.,  1.,  1.]])
Passing multiple index arrays does something slightly different; it selects a 1D array of
elements corresponding to each tuple of indices:
# more on reshape in a later chapter
In [105]: arr = np.arange(32).reshape((8, 4))
In [106]: arr
Out[106]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
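With this 8 x 4 array, the different flavors of fancy indexing can be compared directly. The index lists below are illustrative choices; the last line shows one way (among others) to get a rectangular block rather than a 1D selection.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

rows = arr[[4, 3, 0, 6]]                     # whole rows, in the requested order
picked = arr[[1, 5, 7, 2], [0, 3, 1, 2]]     # elements (1,0), (5,3), (7,1), (2,2)
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]   # rows first, then reordered columns
```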
Out[111]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
Out[112]:
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])
When doing matrix computations, you will do this very often, for example when
computing the inner matrix product X'X using np.dot:
In [113]: arr = np.random.randn(6, 3)
In [114]: np.dot(arr.T, arr)
Out[114]:
array([[ 2.584 ,  1.8753,  0.8888],
       [ 1.8753,  6.6636,  0.3884],
       [ 0.8888,  0.3884,  3.9781]])
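A deterministic variant makes the properties of this product easy to check. In this sketch X is a made-up 6 x 3 matrix standing in for the random one, and gram is an illustrative name.

```python
import numpy as np

X = np.arange(18, dtype=float).reshape((6, 3))   # stand-in for randn(6, 3)

gram = np.dot(X.T, X)     # (3, 6) times (6, 3) gives (3, 3)
print(gram)
```

The result is always square and symmetric, which matches the mirrored off-diagonal values in the output above.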
For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute
the axes (for extra mind bending):
In [115]: arr = np.arange(16).reshape((2, 2, 4))
In [116]: arr
Out[116]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],
       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])
In [117]: arr.transpose((1, 0, 2))
Out[117]:
array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],
       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])
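One way to read the axis tuple: with (1, 0, 2), element [i, j, k] of the result is element [j, i, k] of the original. The sketch below checks that on the same array; permuted is an illustrative name.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
permuted = arr.transpose((1, 0, 2))   # swap the first two axes, keep the last

# Element [i, j, k] of the result is element [j, i, k] of the original
print(permuted[0, 1], arr[1, 0])
```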
Simple transposing with .T is just a special case of swapping axes. ndarray has the
method swapaxes which takes a pair of axis numbers:
In [118]: arr
Out[118]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],
       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])
In [119]: arr.swapaxes(1, 2)
Out[119]:
array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],
       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])
swapaxes similarly returns a view on the data without making a copy.
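The view behavior can be confirmed by writing through the swapped array. A small sketch, with swapped as an illustrative name:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.swapaxes(1, 2)   # shape becomes (2, 4, 2)

swapped[0, 3, 1] = -1          # same element as arr[0, 1, 3]
print(arr[0, 1, 3])
```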