Pandas Shan Ver2

The document provides an overview of the Pandas library, highlighting its capabilities for data manipulation with DataFrames and Series, which are built on top of NumPy. It covers installation, basic operations, data indexing, and selection methods, as well as the creation of pivot tables for data summarization. Additionally, it illustrates how to construct DataFrames from various data structures and the use of the Index object in Pandas.

Uploaded by

amalamargret.cse

Pandas

➢ Pandas is a newer package built on top of NumPy
that provides an efficient implementation of a
DataFrame.
➢ Pandas implements a number of powerful
data operations familiar to users of both
database frameworks and spreadsheet
programs.
Installing and Starting
import pandas
pandas.__version__
Output: '0.18.1'

import numpy as np
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Output: 0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
print(data.values)
Output: [ 0.25, 0.5 , 0.75, 1. ]
print(data.index)
Output: RangeIndex(start=0, stop=4, step=1)
print(data[1])
Output: 0.5
print(data[1:3])
Output: 1 0.50
2 0.75
dtype: float64
Series as generalized NumPy array:
The essential difference is the presence of the
index: while the NumPy array has an implicitly
defined integer index used to access the values, the
Pandas Series has an explicitly defined index
associated with the values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
print(data)
Output: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
print(data['b'])
Output: 0.5
We can even use noncontiguous or nonsequential
indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
print(data)
Output: 2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
print(data[5])
Output: 0.5
Series as specialized dictionary:
A dictionary is a structure that maps arbitrary keys to a set of
arbitrary values, and a Series is a structure that maps typed keys to
a set of typed values.

population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)

Output: California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
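Unlike a plain dictionary, a Series also supports array-style operations such as slicing by label. The following is a minimal sketch reusing the population figures above; the `sort_index()` call is an added assumption so that the slice does not depend on the key order, which differs between pandas versions.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style lookup by key
california = population['California']

# Sort the labels so the slice below is independent of insertion order
population = population.sort_index()

# Array-style slicing by label; the end label is included in the result
subset = population['California':'Illinois']
print(subset)
```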
The Pandas DataFrame Object
• The next fundamental structure in Pandas is
the DataFrame.
• Like the Series object discussed in the
previous section, the DataFrame can be
thought of either as a generalization of a
NumPy array or as a specialization of a
Python dictionary.
• DataFrame as a generalized NumPy array:
A DataFrame is an analog of a two-dimensional
array with both flexible row indices and flexible column
names.
area_dict = {'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
print(area)
Output: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
states = pd.DataFrame({'population': population, 'area':
area})
print(states)
Output: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

print(states.index)
Output:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'],
dtype='object')

print(states.columns)
Output: Index(['area', 'population'], dtype='object')
DataFrame as specialized dictionary:
We can also think of a DataFrame as a
specialization of a dictionary. Where a dictionary
maps a key to a value, a DataFrame maps a column
name to a Series of column data.
print(states['area'])
Output: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
Constructing DataFrame objects:
From a single Series object:
print(pd.DataFrame(population, columns=['population']))
Output: population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
From a list of dicts:
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Output: a b
0 0 0
1 1 2
2 2 4
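When the dictionaries in the list do not share all keys, Pandas fills the missing entries with NaN. A small sketch with made-up records (the name `records` is only illustrative):

```python
import pandas as pd

# Two records with partially overlapping keys
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
frame = pd.DataFrame(records)

# Columns are the union of the keys; absent values become NaN
print(frame)
```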
From a dictionary of Series objects:
print(pd.DataFrame({'population': population,
'area': area}))
Output: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
From a two-dimensional NumPy array:
print(pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'],
index=['a', 'b', 'c']))
Output: foo bar
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718
• From a NumPy structured array:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(A)
Output: array([(0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('A', '<i8'), ('B', '<f8')])
print(pd.DataFrame(A))
Output: A B
0 0 0.0
1 0 0.0
2 0 0.0
The Pandas Index Object
• This Index object is an interesting structure in
itself, and it can be thought of either as an
immutable array or as an ordered set
(technically a multiset, as Index objects may
contain repeated values).
An Index from a list of integers:
ind = pd.Index([2, 3, 5, 7, 11])
print(ind)
Output: Int64Index([2, 3, 5, 7, 11], dtype='int64')
Index as immutable array:
print(ind[1])
Output: 3
print(ind[::2])
Output: Int64Index([2, 5, 11], dtype='int64')
print(ind.size, ind.shape, ind.ndim, ind.dtype)
Output: 5 (5,) 1 int64
ind[1] = 0
Output: TypeError: Index does not support mutable operations
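The two properties described above, immutability and the possibility of repeated values, can be checked directly. This is a minimal sketch; the variable names `message` and `repeated` are illustrative only.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Assigning to an element of an Index raises TypeError
try:
    ind[1] = 0
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# An Index may hold repeated values, which is why it behaves like a multiset
repeated = pd.Index([2, 3, 3, 5])
print(repeated.is_unique)
```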
Index as ordered set:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
print(indA.intersection(indB)) # intersection
Output: Int64Index([3, 5, 7], dtype='int64')
print(indA.union(indB)) # union
Output: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
print(indA.symmetric_difference(indB)) # symmetric difference
Output: Int64Index([1, 2, 9, 11], dtype='int64')
Data Indexing and Selection
➢ We looked in detail at methods and tools to
access, set, and modify values in NumPy arrays.
➢ These included indexing, slicing, masking,
fancy indexing and combinations.
➢ Here we’ll look at similar means of accessing
and modifying values in Pandas Series and
DataFrame objects.
Data Selection in Series:
A Series object acts in many ways like a one-dimensional NumPy
array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us
to understand the patterns of data indexing and selection in these arrays.
a) Series as dictionary:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
print(data)
Output: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
print(data['b'])
Output: 0.5
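The dictionary analogy extends to membership tests and to iterating over keys and items. A short sketch on the same Series (the names `has_a`, `keys`, and `items` are illustrative):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Membership tests the index labels, not the values
has_a = 'a' in data

# keys() returns the index; items() yields (label, value) pairs
keys = list(data.keys())
items = list(data.items())
print(has_a, keys, items)
```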
b) Series as one-dimensional array:
# slicing by explicit index
data['a':'c']
Output: a 0.25
b 0.50
c 0.75
dtype: float64
# slicing by implicit integer index
data[0:2]
Output: a 0.25
b 0.50
dtype: float64
# masking
data[(data > 0.3) & (data < 0.8)]
Output: b 0.50
c 0.75
dtype: float64
# fancy indexing
data['e'] = 1.25 # add the key 'e' first; it is not in the Series defined above
data[['a', 'e']]
Output: a 0.25
e 1.25
dtype: float64
c) Indexers: loc, iloc, and ix
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data)
Output: 1 a
3 b
5 c
dtype: object
The loc attribute allows indexing and slicing that always references
the explicit index:
print(data.loc[1])
Output: 'a'
print(data.loc[1:3])
Output: 1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references
the implicit Python-style index:
print(data.iloc[1])
Output: 'b'
print(data.iloc[1:3])
Output: 3 b
5 c
dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for
Series objects is equivalent to standard []-based indexing.
The purpose of the ix indexer will become more apparent in
the context of DataFrame objects, which we will discuss in
a moment. Note that ix was deprecated in pandas 0.20 and removed
in pandas 1.0; in current versions, use loc or iloc instead.
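The difference between the explicit and implicit index is easiest to see when the labels are themselves integers. The sketch below contrasts loc and iloc on such a Series and then on a small DataFrame built from two of the state figures used earlier; it relies only on loc and iloc, so it does not depend on ix.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc uses the explicit labels; the end label 3 is included
by_label = data.loc[1:3]

# iloc uses implicit positions; the end position 3 is excluded
by_position = data.iloc[1:3]

states = pd.DataFrame({'area': [423967, 695662],
                       'population': [38332521, 26448193]},
                      index=['California', 'Texas'])

# Row and column by label
texas_area = states.loc['Texas', 'area']

# Row and column by position
first_population = states.iloc[0, 1]
print(by_label, by_position, texas_area, first_population, sep='\n')
```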
Pivot Tables
• A pivot table is an operation similar to GroupBy that is
commonly seen in spreadsheets and other programs that
operate on tabular data.
• The pivot table takes simple columnwise data as
input, and groups the entries into a two-dimensional
table that provides a multidimensional
summarization of the data.
• The difference between pivot tables and GroupBy
can sometimes cause confusion; it helps me to think
of pivot tables as essentially a multidimensional
version of GroupBy aggregation.
• That is, we split-apply-combine, but both the split and
the combine happen not across a one-dimensional
index, but across a two-dimensional grid.
Motivating Pivot Tables:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
Pivot Tables by Hand
Using the vocabulary of GroupBy, we might proceed using
something like this: we group by class and gender, select
survival, apply a mean aggregate, combine the resulting
groups, and then unstack the hierarchical index to reveal
the hidden multidimensionality.
Example:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Pivot Table Syntax
titanic.pivot_table('survived', index='sex',
columns='class')
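That the by-hand GroupBy version and pivot_table give the same table can be checked on a small example. The data below is made up for illustration and only borrows the column names of the titanic dataset, so the sketch runs without seaborn.

```python
import pandas as pd

# Made-up sample rows standing in for the titanic dataset
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

# Pivot table by hand: group, select, aggregate, unstack
by_hand = df.groupby(['sex', 'class'])['survived'].mean().unstack()

# The same summary with pivot_table (the default aggregate is the mean)
pivoted = df.pivot_table(values='survived', index='sex', columns='class')
print(pivoted)
```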

Multilevel pivot
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
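The pd.cut call assigns each age to a half-open interval, here (0, 18] and (18, 80], and those intervals then act as a second grouping level. A minimal sketch on a few made-up ages:

```python
import pandas as pd

# Made-up ages, including both bin edges
ages = pd.Series([5, 18, 30, 80])

# Bins are open on the left and closed on the right by default
age_group = pd.cut(ages, [0, 18, 80])
print(age_group)
```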
Birthrate Data:
#This data can be found at
#https://raw.githubusercontent.com
births = pd.read_csv('births.csv')
births.head()
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade',
columns='gender', aggfunc='sum')
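The decade column comes from integer division: `year // 10` drops the last digit, and multiplying by 10 restores the scale. The sketch below applies the same steps to a few made-up rows, since births.csv itself is not reproduced here.

```python
import pandas as pd

# Made-up rows with the same column names as births.csv
births = pd.DataFrame({
    'year': [1969, 1969, 1970, 1971],
    'gender': ['F', 'M', 'F', 'M'],
    'births': [100, 110, 120, 130],
})

# 1969 // 10 == 196, and 10 * 196 == 1960
births['decade'] = 10 * (births['year'] // 10)

# Total births per decade and gender
table = births.pivot_table(values='births', index='decade',
                           columns='gender', aggfunc='sum')
print(table)
```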
