Python - Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on
statistical theory.
Pandas can clean messy data sets and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
Pandas can also delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
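The questions above can be sketched in a few lines. This is a minimal illustration only: the column names and numbers are made up for the example, and None stands in for a missing measurement.

```python
import pandas as pd

# Hypothetical workout data (not from the original text); None marks a missing value
df = pd.DataFrame({
    "duration": [60, 45, 30, 60, None],
    "calories": [400.0, 300.0, 200.0, 410.0, 250.0],
})

# Cleaning: drop every row that contains an empty (NaN) value
clean = df.dropna()

# Questions Pandas can answer about the cleaned data
correlation = clean["duration"].corr(clean["calories"])
average = clean["calories"].mean()
maximum = clean["calories"].max()
minimum = clean["calories"].min()

print(clean)
print("Correlation:", round(correlation, 3))
print("Average:", average, "Max:", maximum, "Min:", minimum)
```

Here dropna() removes the last row because its duration is missing, so the statistics are computed on the remaining four rows.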
Where is the Pandas Codebase?
The source code for Pandas is located in this GitHub repository:
https://github.com/pandas-dev/pandas
1. Pandas Getting Started
1.1 Installation of Pandas
If you have Python and PIP already installed on a system, then installation
of Pandas is very easy.
Install it using this command:
C:\Users\Your Name>pip install pandas
1.2 Import Pandas
Once Pandas is installed, import it in your applications by adding the
"import" keyword:
Syntax : import pandas
Now Pandas is imported and ready to use.
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
OUTPUT :
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
1.3 Pandas as pd
Pandas is usually imported under the pd alias.
Alias: In Python, an alias is an alternate name for referring to the same thing.
Create an alias with the "as" keyword while importing:
Syntax : import pandas as pd
Now the Pandas package can be referred to as pd instead of pandas.
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
OUTPUT :
cars passings
0 BMW 3
1 Volvo 7
2 Ford 2
1.4 Checking Pandas Version
The version string is stored in the __version__ attribute.
import pandas as pd
print(pd.__version__)
OUTPUT :
1.2.2
2. Pandas Series
2.1 What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example 2.1 : Create a simple Pandas Series from a list - int, float, string
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0 1
1 2
2 3
dtype: int64
The data type (dtype) of the elements in this Series is int64.
Pandas decides the dtype of a Series based on the values it holds.
import pandas as pd
a = [1.1, 2.2, 3.3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0 1.1
1 2.2
2 3.3
dtype: float64
import pandas as pd
a = ["apple", "banana", "orange"]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0 apple
1 banana
2 orange
dtype: object
import pandas as pd
a = [1, "banana", 3]
myvar = pd.Series(a)
print(myvar)
OUTPUT :
0 1
1 banana
2 3
dtype: object
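The dtype chosen for each Series can also be inspected directly through its dtype attribute. The sketch below repeats the kinds of lists used above; the variable names are only illustrative.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # only integers
floats = pd.Series([1.1, 2.2, 3.3])  # only floats
upcast = pd.Series([1, 2.5])         # integers mixed with floats
mixed = pd.Series([1, "banana", 3])  # numbers mixed with text

# Print the dtype Pandas inferred for each Series
print(ints.dtype)
print(floats.dtype)
print(upcast.dtype)
print(mixed.dtype)
```

Integers mixed with floats are stored as float64, while numbers mixed with text fall back to the general-purpose object dtype.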
2.2 Labels
If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.
These labels can be used to access a specific value.
Example 2.2 : Return the second value of the Series:
import pandas as pd
a = [1, 2, 3, 4, 5, 6, 7]
myvar = pd.Series(a)
print(myvar[1])
OUTPUT :
2
import pandas as pd
a = [1, 2, 3, 4, 5, 6, 7]
myvar = pd.Series(a)
print(myvar[1:4])
OUTPUT :
1 2
2 3
3 4
dtype: int64
2.3 Create Labels
With the "index" argument, you can name your own labels.
Example 2.3 : Create your own labels
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
OUTPUT :
x 1
y 2
z 3
dtype: int64
When you have created labels, you can access an item by referring to the label.
Example 2.4 : Return the value of "y":
import pandas as pd
a = [1, 2, 3]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar["y"])
OUTPUT :
2
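Once a Series has custom labels, it can help to state explicitly whether you mean a label or a position. The sketch below uses the loc and iloc accessors for that purpose. They are not used elsewhere in this text and are shown here as one possible way to write the lookup.

```python
import pandas as pd

myvar = pd.Series([1, 2, 3], index=["x", "y", "z"])

by_label = myvar.loc["y"]         # look up by label
by_position = myvar.iloc[1]       # look up by integer position
label_slice = myvar.loc["x":"y"]  # label slices include the end label
pos_slice = myvar.iloc[0:2]       # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Both slices return the values for "x" and "y", but the label slice reaches them by including its end label, whereas the position slice stops before position 2.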
3. Pandas DataFrames
3.1 What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional
array, or a table with rows and columns.
In the Pandas module, DataFrame is one of the most basic and important types.
To create a DataFrame from different sources of data or from other Python
data types, we can use the "DataFrame()" constructor.
Syntax of DataFrame() class :
DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
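As a rough illustration of how the constructor parameters fit together, the sketch below passes data, index, columns, and dtype by keyword. The labels and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4]],  # two rows of values
    index=["r1", "r2"],     # row labels
    columns=["a", "b"],     # column labels
    dtype="float64",        # force every column to float64
)

print(df)
print(df.dtypes)
```

Passing the arguments by keyword keeps the call readable and avoids mixing up the order of index and columns.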
Example 3.1 Create an Empty DataFrame
To create an empty DataFrame, pass no arguments to pandas.DataFrame() class.
In this example, we create an empty DataFrame and print it to the console
output.
import pandas as pd
df = pd.DataFrame()
print(df)
OUTPUT :
Empty DataFrame
Columns: []
Index: []
Example 3.2 : Create a simple Pandas DataFrame
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
OUTPUT :
calories duration
0 420 50
1 380 40
2 390 45
Example 3.3 : Create a simple Pandas DataFrame with Labels - Index
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
OUTPUT :
calories duration
day1 420 50
day2 380 40
day3 390 45
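Row labels such as "day1" can then be used to look up rows. The sketch below uses the loc accessor for this. It is one possible approach, shown with the same data as the example above.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]               # one row, returned as a Series
cell = df.loc["day2", "calories"]  # one value: row label, then column label
subset = df.loc[["day1", "day3"]]  # several rows, returned as a DataFrame

print(row)
print(cell)
print(subset)
```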
3.2 Create Pandas DataFrame from List of Lists
To create a Pandas DataFrame from a list of lists, pass the list of
lists as the data argument to "pandas.DataFrame()".
Each inner list becomes one row in the resulting
DataFrame.
Example 3.4: Create DataFrame from List of Lists
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1'],
['a2', 'b2', 'c2'],
['a3', 'b3', 'c3']]
df = pd.DataFrame(data)
print(df)
OUTPUT :
0 1 2
0 a1 b1 c1
1 a2 b2 c2
2 a3 b3 c3
Example 3.5: Create DataFrame from List of Lists with Column Names & Index
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1'],
['a2', 'b2', 'c2'],
['a3', 'b3', 'c3']]
columns = ['C1', 'C2', 'C3']
index = ['R1', 'R2', 'R3']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
OUTPUT :
C1 C2 C3
R1 a1 b1 c1
R2 a2 b2 c2
R3 a3 b3 c3
Example 3.6: Create DataFrame from List of Lists with Different List Lengths
When the inner lists have different lengths, the shorter rows are padded with
missing values (None).
import pandas as pd
#list of lists
data = [['a1', 'b1', 'c1', 'd1'],
['a2', 'b2', 'c2'],
['a3', 'b3', 'c3']]
df = pd.DataFrame(data)
print(df)
OUTPUT :
0 1 2 3
0 a1 b1 c1 d1
1 a2 b2 c2 None
2 a3 b3 c3 None
3.3 Create Pandas DataFrame from Python Dictionary
You can create a DataFrame from a dictionary by passing the dictionary as the
data argument to the DataFrame() class.
Example 3.7: Create DataFrame from Dictionary
import pandas as pd
mydictionary = {'names': ['raju', 'ramu', 'ravi', 'akash'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]}
#create dataframe using dictionary
df_marks = pd.DataFrame(mydictionary)
print(df_marks)
OUTPUT :
names physics chemistry algebra
0 raju 68 84 78
1 ramu 74 56 88
2 ravi 77 73 82
3 akash 78 69 87
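Each dictionary key becomes a column, and a single column taken from the DataFrame is a Series. The sketch below reuses the marks data to show this and to compute a per-subject average. The variable names physics and averages are only illustrative.

```python
import pandas as pd

mydictionary = {'names': ['raju', 'ramu', 'ravi', 'akash'],
                'physics': [68, 74, 77, 78],
                'chemistry': [84, 56, 73, 69],
                'algebra': [78, 88, 82, 87]}
df_marks = pd.DataFrame(mydictionary)

# Selecting one column by its key returns a Series
physics = df_marks['physics']
print(type(physics))

# Selecting several columns returns a DataFrame; mean() averages each column
averages = df_marks[['physics', 'chemistry', 'algebra']].mean()
print(averages)
```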
Shape or Dimensions of Pandas DataFrame
To get the shape of a Pandas DataFrame, use the "DataFrame.shape" property.
It returns a tuple representing the dimensionality of the
DataFrame, in the format (rows, columns).
Example: DataFrame Shape
In the following example, we find the shape of a DataFrame.
You can also get the number of rows or the number of columns by indexing the
shape tuple.
import pandas as pd
data = [['a1', 'b1', 'c1'],
['a2', 'b2', 'c2'],
['a3', 'b3', 'c3'],
['a4', 'b4', 'c4']]
columns = ['C1', 'C2', 'C3']
index = ['R1', 'R2', 'R3', 'R4']
df = pd.DataFrame(data, index=index, columns=columns)
print('The DataFrame is :\n', df)
#get dataframe shape
shape = df.shape
print('\nDataFrame Shape :', shape)
print('\nNumber of rows :', shape[0])
print('\nNumber of columns :', shape[1])
OUTPUT :
The DataFrame is :
C1 C2 C3
R1 a1 b1 c1
R2 a2 b2 c2
R3 a3 b3 c3
R4 a4 b4 c4
DataFrame Shape : (4, 3)
Number of rows : 4
Number of columns : 3
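Because shape is an ordinary tuple, it can also be unpacked into two names in one step. The sketch below shows this, together with len() and the size attribute as related ways to count rows and cells.

```python
import pandas as pd

data = [['a1', 'b1', 'c1'],
        ['a2', 'b2', 'c2'],
        ['a3', 'b3', 'c3'],
        ['a4', 'b4', 'c4']]
df = pd.DataFrame(data, columns=['C1', 'C2', 'C3'])

# Unpack the (rows, columns) tuple
rows, cols = df.shape
print(rows, cols)

print(len(df))   # number of rows
print(df.size)   # total number of cells (rows * columns)
```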
Print Information of Pandas DataFrame
To print a summary of a Pandas DataFrame, call the DataFrame.info() method.
The DataFrame.info() method returns None; it only prints information about
the DataFrame.
Example : Print DataFrame Information
In the following program, we create a DataFrame and
print its information using the DataFrame.info() method.
import pandas as pd
df = pd.DataFrame(
[['abc', 22],
['xyz', 25],
['pqr', 31]],
columns=['name', 'age'])
print(df)
df.info()
OUTPUT :
name age
0 abc 22
1 xyz 25
2 pqr 31
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 3 non-null object
1 age 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
import pandas as pd
data = [['a1', 'b1', 'c1'],
['a2', 'b2', 'c2'],
['a3', 'b3', 'c3'],
['a4', 'b4', 'c4']]
columns = ['C1', 'C2', 'C3']
index = ['R1', 'R2', 'R3', 'R4']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
df.info()
OUTPUT :
C1 C2 C3
R1 a1 b1 c1
R2 a2 b2 c2
R3 a3 b3 c3
R4 a4 b4 c4
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, R1 to R4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C1 4 non-null object
1 C2 4 non-null object
2 C3 4 non-null object
dtypes: object(3)
memory usage: 128.0+ bytes
import pandas as pd
mydictionary = {'names': ['raju', 'ramu', 'ravi', 'akash'],
'physics': [68, 74, 77, 78],
'chemistry': [84, 56, 73, 69],
'algebra': [78, 88, 82, 87]}
#create dataframe using dictionary
df_marks = pd.DataFrame(mydictionary)
print(df_marks)
df_marks.info()
OUTPUT :
names physics chemistry algebra
0 raju 68 84 78
1 ramu 74 56 88
2 ravi 77 73 82
3 akash 78 69 87
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 names 4 non-null object
1 physics 4 non-null int64
2 chemistry 4 non-null int64
3 algebra 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 256.0+ bytes
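Since info() prints rather than returns its summary, capturing the text takes an extra step. One possible approach, sketched below, is to pass a text buffer through the buf parameter. The names buffer, result, and summary are only illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame(
    [['abc', 22],
     ['xyz', 25],
     ['pqr', 31]],
    columns=['name', 'age'])

# Write the summary into an in-memory text buffer instead of the console
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

print(result)   # info() itself returns None
print(summary)  # the captured summary text
```

The captured string can then be written to a log file or inspected in code.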
Pandas Read CSV
A simple way to store large data sets is to use CSV files (comma-separated
values files).
CSV files contain plain text and are a well-known format that can be read by
almost any tool, including Pandas.
import pandas as pd
#load dataframe from csv
df = pd.read_csv("pandas.csv")
#print dataframe
print(df)
OUTPUT :
Name maths physics chemisry
0 a 11 21 31
1 b 12 22 32
2 c 13 23 32
3 d 14 24 34
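The example above depends on a file named pandas.csv being present. To try read_csv() without any file, the sketch below feeds it CSV text from an in-memory buffer. The CSV content here is made up for illustration and is not the content of pandas.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text; the first line holds the column names
csv_text = """Name,maths,physics
a,11,21
b,12,22
"""

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text becomes the column names, and the numeric columns are parsed as integers.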
Note : If a DataFrame has more rows than the display limit (the display.max_rows option,
60 by default), printing it shows only the first 5 rows and the last 5 rows:
import pandas as pd
#load dataframe from csv
df = pd.read_csv("pandas1.csv")
#print dataframe
print(df)
OUTPUT :
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.0
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4
[169 rows x 4 columns]
Tip: use to_string() to print the entire DataFrame.
import pandas as pd
#load dataframe from csv
df = pd.read_csv("pandas1.csv")
#print dataframe
print(df.to_string())
OUTPUT :
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.0
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
11 60 100 120 250.7
…
…..
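The difference between the two print styles can be checked without a CSV file. The sketch below builds a hypothetical 100-row DataFrame and sets the display options explicitly to their usual defaults, so that the result does not depend on local settings.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows
df = pd.DataFrame({"n": range(100)})

# With max_rows=60, a 100-row DataFrame is printed in shortened form
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    short = str(df)

# to_string() always renders every row
full = df.to_string()

print(short)
print(len(short.splitlines()), "lines versus", len(full.splitlines()), "lines")
```

The shortened form ends with a line giving the total number of rows and columns, while to_string() produces one header line followed by one line per row.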
Pandas DataFrame to Excel
Example : Write DataFrame to Excel File
You can write the DataFrame to an Excel file without specifying a sheet name.
The step-by-step process is given below:
1. Have your DataFrame ready. In this example we initialize a DataFrame
with some rows and columns.
2. Create an Excel Writer with the name of the output Excel file to which
you would like to write your DataFrame.
3. Call the to_excel() method on the DataFrame with the Excel Writer passed as
the argument.
4. Save the Excel file by closing the Excel Writer. Using the writer in a
"with" block closes it automatically. Older Pandas versions also offered
writer.save(), but it was removed in Pandas 2.0.
Note : Writing .xlsx files requires an Excel engine such as openpyxl to be installed.
import pandas as pd
# create dataframe
df_marks = pd.DataFrame({'name': ['raju', 'ramu', 'ravi', 'akash'],
                         'physics': [68, 74, 77, 78],
                         'chemistry': [84, 56, 73, 69],
                         'algebra': [78, 88, 82, 87]})
# create excel writer object; the "with" block saves and closes the file
with pd.ExcelWriter('output2.xlsx') as writer:
    # write dataframe to excel (the default sheet name is used)
    df_marks.to_excel(writer)
print('DataFrame is written successfully to Excel File.')
print(df_marks)
OUTPUT :
DataFrame is written successfully to Excel File.
name physics chemistry algebra
0 raju 68 84 78
1 ramu 74 56 88
2 ravi 77 73 82
3 akash 78 69 87
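Example : Plot a DataFrame
A DataFrame can be plotted with Matplotlib by calling its plot() method, which
draws the numeric columns.
import pandas as pd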
import matplotlib.pyplot as plt
df = pd.read_csv('pandas.csv')
print(df)
df.plot()
plt.show()
OUTPUT :
Name maths physics chemisry
0 a 11 21 31
1 b 12 22 32
2 c 13 23 32
3 d 14 24 34