Introducing Pandas Objects
import numpy as np
import pandas as pd
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([5,14,99,888])
data
0 5
1 14
2 99
3 888
dtype: int64
data[3]
888
data.values
array([ 5, 14, 99, 888])
data.index
RangeIndex(start=0, stop=4, step=1)
As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
14
data[1:3]
1 14
2 99
dtype: int64
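Series as generalized NumPy array
The essential difference between a NumPy array and a Pandas Series is the index: a NumPy array has an implicitly defined integer index, while a
Series has an explicitly defined index associated with its values, and that index need not consist of integers: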
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c','d'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
And the item access works as expected:
data['b']
0.5
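The explicit index also does not have to be contiguous or sorted. A minimal sketch of that point, where the name prices, the values, and the index labels are made up for illustration:

```python
import pandas as pd

# Illustrative Series with a non-contiguous integer index (hypothetical values)
prices = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# With an integer index, square brackets look up the label 5, not position 5
value_at_5 = prices[5]
print(value_at_5)
```

Here 5 is treated as a label, so the lookup succeeds even though the Series has only four elements.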
Series as specialized dictionary
The Series-as-dictionary analogy can be made even clearer by constructing a Series object directly from a Python dictionary:
student_dict = {'Ram': 123,
'Shyam': 124,
'Arun': 125}
students = pd.Series(student_dict)
students
Ram 123
Shyam 124
Arun 125
dtype: int64
By default, the index is drawn from the dictionary keys in insertion order (older pandas versions sorted the keys instead). From here, typical
dictionary-style item access can be performed:
# Access Ram's roll number
students['Ram']
123
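The dictionary analogy extends to membership tests and key listing, while label-based slicing is something a plain dict cannot do. A small sketch using the same roll numbers as above:

```python
import pandas as pd

students = pd.Series({'Ram': 123, 'Shyam': 124, 'Arun': 125})

# Membership tests and key listing work as they do for a dict
has_ram = 'Ram' in students
names = list(students.keys())

# Unlike a dict, a Series supports label-based slicing; both endpoints are included
first_two = students['Ram':'Shyam']
print(has_ram, names)
print(first_two)
```

Note that the label slice includes its end label, unlike ordinary Python slicing.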
The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be
thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these
perspectives.
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
# print dataframe.
df
Name Age
0 tom 10
1 nick 15
2 juli 14
df.index
RangeIndex(start=0, stop=3, step=1)
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
df.columns
Index(['Name', 'Age'], dtype='object')
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a
generalized index for accessing the data.
df['Name']
0 tom
1 nick
2 juli
Name: Name, dtype: object
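The dictionary perspective can be sketched as follows: a DataFrame maps each column name to a Series, while still reporting a two-dimensional shape like a NumPy array. This is only an illustration built on the same small table:

```python
import pandas as pd

df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'])

# Like a dict, a DataFrame maps column names to Series
ages = df['Age']
has_age = 'Age' in df      # membership tests check column labels, not cell values
shape = df.shape           # (rows, columns), as for a 2-D NumPy array
print(type(ages).__name__, has_age, shape)
```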
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.
From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
pd.DataFrame(students, columns=['rollno'])
rollno
Ram 123
Shyam 124
Arun 125
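Since a DataFrame is a collection of Series, several Series can also be combined through a dictionary. A minimal sketch, in which the marks Series and the column name 'mark' are hypothetical additions for illustration:

```python
import pandas as pd

students = pd.Series({'Ram': 123, 'Shyam': 124, 'Arun': 125})
# Hypothetical second Series sharing the same index labels
marks = pd.Series({'Ram': 80, 'Shyam': 70, 'Arun': 35})

# Each dict key becomes a column name; rows are aligned on the index labels
table = pd.DataFrame({'rollno': students, 'mark': marks})
print(table)
```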
From a list of dicts
Any list of dictionaries can be made into a DataFrame.
Even if some keys are missing from a dictionary, Pandas will fill them in with NaN (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a b c
0 1.0 2 NaN
1 NaN 3 4.0
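The output above shows 1.0 and 4.0 rather than 1 and 4 because a column that has to hold NaN is stored as floating point. A short sketch that inspects the dtypes and counts the gaps:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are upcast to float so they can hold NaN
print(df.dtypes)

# Count missing values per column
missing = df.isnull().sum()
print(missing)
```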
From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will
be used for each:
np.random.rand(3, 2)
array([[0.48925761, 0.81202557],
[0.37526746, 0.9834642 ],
[0.10226165, 0.37402615]])
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo bar
a 0.965321 0.512423
b 0.969355 0.437354
c 0.196705 0.719428
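To illustrate the "if omitted" case, the sketch below builds a DataFrame without column or index names. The seeded generator is an assumption made only so the example is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
values = rng.random((3, 2))

# No columns or index given: both default to integer ranges
frame = pd.DataFrame(values)
print(frame.columns)
print(frame.index)
```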
From a NumPy structured array
We covered structured arrays in Structured Data: NumPy's Structured Arrays. A Pandas DataFrame operates much like a structured array, and
can be created directly from one:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
A B
0 0 0.0
1 0 0.0
2 0 0.0
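The same conversion with non-zero records makes the field-to-column mapping easier to see. The field names id and score and the values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical field names and values for illustration
records = np.array([(1, 0.5), (2, 1.5), (3, 2.5)],
                   dtype=[('id', 'i8'), ('score', 'f8')])

frame = pd.DataFrame(records)
print(frame.dtypes)   # each field becomes a column and keeps its NumPy dtype
```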
Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The
Index object supports unions, intersections, differences, and other set combinations through named methods. Older pandas versions also
exposed these through the &, |, and ^ operators, but since pandas 2.0 those operators act elementwise (bitwise on integers), so the
named methods are the reliable choice:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB) # intersection: common elements
Index([3, 5, 7], dtype='int64')
indA.union(indB) # union: all elements
Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA.symmetric_difference(indB) # symmetric difference: elements in exactly one index
Index([1, 2, 9, 11], dtype='int64')
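One place where this set arithmetic shows up in practice is arithmetic between two Series with different indexes. A minimal sketch with made-up values:

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=[1, 3, 5])
b = pd.Series([1, 2, 3], index=[3, 5, 7])

# Arithmetic aligns on the union of the two indexes;
# labels present in only one operand produce NaN
total = a + b
print(total)
```

The result is indexed by the union of both indexes, and only the labels in their intersection carry a numeric sum.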
DATA INDEXING AND SELECTION
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
# masking
data[data == 0.5]
b 0.5
dtype: float64
# fancy indexing
data[['a', 'd']]
a 0.25
d 1.00
dtype: float64
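Masks can also combine several conditions. The sketch below uses thresholds chosen purely for illustration:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
middle = data[(data > 0.3) & (data < 0.8)]
print(middle)
```

The parentheses matter because & binds more tightly than the comparison operators.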
Indexers: loc and iloc
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
# explicit index when indexing - user defined index
data[1]
'a'
# implicit index when slicing
data[1:3]
3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose
certain indexing schemes.
First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1] # loc always uses the explicit (label) index
'a'
data.loc[1:3]
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1:3] # iloc always uses the implicit (positional) index
3 b
5 c
dtype: object
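The two indexers also differ in how they treat the end of a slice. A short side-by-side sketch on the same Series:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[3:5]      # labels 3 through 5, end label included
by_position = data.iloc[0:2]  # positions 0 and 1, end position excluded
print(by_label.tolist(), by_position.tolist())
```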
student = [['Ram', 123, 80, 85], ['Shyam', 124, 70, 75],
['Arun', 125, 35, 60], ['Gopal', 235, 95, 70]]
data = pd.DataFrame(student, columns=['Name', 'Rollno', 'FDS_Mark', 'DS_Mark'])
data
Name Rollno FDS_Mark DS_Mark
0 Ram 123 80 85
1 Shyam 124 70 75
2 Arun 125 35 60
3 Gopal 235 95 70
#Select all roll numbers
data['Rollno']
0 123
1 124
2 125
3 235
Name: Rollno, dtype: int64
data['name_dept'] = data['Name'] + "_CSE A"
data
Name Rollno FDS_Mark DS_Mark name_dept
0 Ram 123 80 85 Ram_CSE A
1 Shyam 124 70 75 Shyam_CSE A
2 Arun 125 35 60 Arun_CSE A
3 Gopal 235 95 70 Gopal_CSE A
#Select first two rows
data[:2]
Name Rollno FDS_Mark DS_Mark name_dept
0 Ram 123 80 85 Ram_CSE A
1 Shyam 124 70 75 Shyam_CSE A
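The loc and iloc indexers work on DataFrames as well, taking a row selector followed by an optional column selector. A sketch on the same student table, where the threshold of 60 is an arbitrary value chosen for illustration:

```python
import pandas as pd

student = [['Ram', 123, 80, 85], ['Shyam', 124, 70, 75],
           ['Arun', 125, 35, 60], ['Gopal', 235, 95, 70]]
data = pd.DataFrame(student, columns=['Name', 'Rollno', 'FDS_Mark', 'DS_Mark'])

first_two = data.iloc[:2]   # rows by position, equivalent to data[:2]

# Rows where FDS_Mark exceeds 60 (illustrative threshold), restricted to two columns
passed = data.loc[data['FDS_Mark'] > 60, ['Name', 'FDS_Mark']]
print(passed)
```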
#Operating on Pandas Data
#Dividing mark column by 100
data['FDS_Mark']/100
0 0.80
1 0.70
2 0.35
3 0.95
Name: FDS_Mark, dtype: float64
data['DS_Mark']-15
0 70
1 60
2 45
3 55
Name: DS_Mark, dtype: int64
data['Total_mark']=data['FDS_Mark']+data['DS_Mark']
data
Name Rollno FDS_Mark DS_Mark name_dept Total_mark
0 Ram 123 80 85 Ram_CSE A 165
1 Shyam 124 70 75 Shyam_CSE A 145
2 Arun 125 35 60 Arun_CSE A 95
3 Gopal 235 95 70 Gopal_CSE A 165
data['Total_mark'].mean()
142.5
data['Total_mark'].median()
155.0
data['Total_mark'].mode()
0 165
dtype: int64
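Aggregations can also run across columns instead of down them, and mode() returns a Series rather than a scalar because several values can tie for most frequent. A brief sketch, where the Average column name and the tie example are illustrative additions:

```python
import pandas as pd

student = [['Ram', 123, 80, 85], ['Shyam', 124, 70, 75],
           ['Arun', 125, 35, 60], ['Gopal', 235, 95, 70]]
data = pd.DataFrame(student, columns=['Name', 'Rollno', 'FDS_Mark', 'DS_Mark'])

# Row-wise mean of the two mark columns (axis=1 aggregates across columns)
data['Average'] = data[['FDS_Mark', 'DS_Mark']].mean(axis=1)
print(data[['Name', 'Average']])

# Two values occur equally often here, so mode() returns both
ties = pd.Series([1, 1, 2, 2]).mode()
print(ties.tolist())
```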
#Handling Missing Data
isnull(): Generate a boolean mask indicating missing values
notnull(): Opposite of isnull()
dropna(): Return a filtered version of the data
fillna(): Return a copy of the data with missing values filled or imputed
import numpy as np
data = pd.Series([1, np.nan, 'hello', None])
data
0 1
1 NaN
2 hello
3 None
dtype: object
data.isnull()
0 False
1 True
2 False
3 True
dtype: bool
data.dropna() # returns a new Series; the original is unchanged unless inplace=True is passed
0 1
2 hello
dtype: object
data
0 1
1 NaN
2 hello
3 None
dtype: object
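Because dropna() returns a new object, the result has to be assigned to a name to be kept. A minimal sketch, where cleaned and renumbered are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])

cleaned = data.dropna()   # new Series; data still has four entries
print(len(data), len(cleaned))

# Kept rows retain their original labels; reset_index(drop=True) renumbers them
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```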
#Filling Null Values
data.fillna(0)
0 1
1 0
2 hello
3 0
dtype: object
# forward-fill
data.ffill() # fillna(method='ffill') is deprecated in recent pandas versions
0 1
1 1
2 hello
3 hello
dtype: object
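Two other common fills are back-fill and filling with a summary statistic. The sketch below uses a hypothetical numeric Series, since a mean is not defined for the mixed-type data above:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric Series with gaps
marks = pd.Series([80.0, np.nan, 70.0, np.nan], index=list('abcd'))

back_filled = marks.bfill()               # propagate the next valid value backward
mean_filled = marks.fillna(marks.mean())  # replace gaps with the mean of observed values
print(back_filled.tolist())
print(mean_filled.tolist())
```

Back-fill leaves the final gap in place because no later value exists to copy from, whereas the mean fill replaces every gap.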