Pandas Introduction: What Is Python Pandas Used For?
Pandas is built on top of the NumPy library, which means that many
NumPy structures are used or replicated in Pandas.
The data produced by Pandas is often used as input for plotting functions
in Matplotlib, statistical analysis in SciPy, and machine learning algorithms
in Scikit-learn.
Why should you use the Pandas library? It is one of the most widely used
Python tools for analyzing, cleaning, and manipulating data.
Data Visualization.
Installing Pandas
The first step in working with Pandas is to check whether it is installed on
the system. If it is not, install it using the pip command.
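The check described above can also be done from Python itself. The following is a minimal sketch, assuming a standard Python 3 interpreter; the helper name is_installed is made up for this illustration and is not part of Pandas or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("pandas"):
    import pandas as pd
    print("pandas version:", pd.__version__)
else:
    # The usual way to install it from a terminal
    print("pandas is not installed; run: pip install pandas")
```

If the message says that Pandas is missing, run the pip command in a terminal and then run the check again.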
Step 2: Use the cd command to go to the folder where the python-pip file
has been installed.
Importing Pandas
After Pandas has been installed on the system, you need to import
the library. It is generally imported as follows:
import pandas as pd
Indexing and Selecting Data with Pandas
Indexing in Pandas:
Indexing in pandas means simply selecting particular rows and
columns of data from a DataFrame. Indexing could mean selecting
all the rows and some of the columns, some of the rows and all of
the columns, or some of each of the rows and columns. Indexing
can also be known as Subset Selection.
Label-based indexing (.loc):
Uses labels or names to select data. You can specify row labels
and column names.
Example:
df.loc['row_label', 'column_name']
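Position-based indexing (.iloc):
Uses integer positions to select data. You can specify row positions
and column positions.
Example: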
df.iloc[integer_row_position, integer_column_position]
Boolean indexing:
Uses a True/False condition to select the rows for which the condition holds.
Example:
df[df['column_name'] > 0]
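The three selection styles can be compared side by side on one small table. This is a minimal sketch; the DataFrame, its column names (score, group) and its row labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame(
    {"score": [-1, 4, 0, 7], "group": ["x", "y", "x", "y"]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b", "score"]    # row label 'b', column name 'score'
by_position = df.iloc[1, 0]        # second row, first column (positions start at 0)
positive = df[df["score"] > 0]     # keep only rows where score is greater than 0

print(by_label)
print(by_position)
print(positive)
```

Here .loc and .iloc reach the same cell by two different routes, while the boolean filter keeps the rows labelled 'b' and 'd'.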
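CODE:-
import pandas as pd
# Name and Age values are taken from the output below; the City column is
# left out because the output shows only some of its values.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
}
df = pd.DataFrame(data)
# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'Age']])
# Conditional selection
print(df[df['Age'] > 30])  # Select rows where Age is greater than 30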
OUTPUT:-
0 Alice
Name Age
0 Alice 25
1 Bob 30
Name City
2 Charlie Chicago
4 Eve Phoenix
Name City
3 David Houston
Name Age
2 Charlie 35
3 David 40
4 Eve 45
### 1. Operating on Data in Pandas
#### Syntax and Explanation of Common Operations:
- *Creating a DataFrame:*
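import pandas as pd
# Example data (values taken from the output below)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 75000, 80000, 55000],
}
# Creating a DataFrame
df = pd.DataFrame(data)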
- *Selecting Data:*
- *Aggregating Data:*
mean_salary = df['Salary'].mean()
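The selection step is listed above without code. One way to write all three operations together is sketched below, using the values shown in the OUTPUT for this section. The filter condition (Age greater than 30) is an assumption; it is one condition that yields the three rows shown there.

```python
import pandas as pd

# Values as shown in the OUTPUT for this section
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 75000, 80000, 55000],
})

name_age = df[["Name", "Age"]]       # select two columns by name
older = df[df["Age"] > 30]           # assumed filter: rows where Age is greater than 30
mean_salary = df["Salary"].mean()    # average of the Salary column

print(name_age)
print(older)
print(mean_salary)
```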
OUTPUT
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   75000
3    David   40   80000
4    Emily   45   55000

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4    Emily   45

      Name  Age  Salary
2  Charlie   35   75000
3    David   40   80000
4    Emily   45   55000

64000.0
### 2. Handling Missing Data
### Explanation of Functions:
- *.isnull()*:
- *Syntax:* DataFrame.isnull()
- *Explanation:* Returns a boolean DataFrame indicating where
values are NaN (missing).
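- *.dropna()*:
- *Syntax:* DataFrame.dropna()
- *Explanation:* Returns a new DataFrame without the rows that contain
missing values.
- *.fillna()*:
- *Syntax:* DataFrame.fillna(value)
- *Explanation:* Returns a new DataFrame in which missing values are
replaced by the given value.
- *.MultiIndex.from_tuples()*:
- *Syntax:* pd.MultiIndex.from_tuples(tuples, names=None)
- *Explanation:* Builds a hierarchical (multi-level) index from a list of
tuples.
- *.loc[]*:
- *Syntax:* DataFrame.loc[row_label, column_label]
- *Explanation:* Selects rows and columns by label.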
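import pandas as pd
import numpy as np
data_missing = {'A': [1, 2, np.nan, 4, 5], 'B': [10, np.nan, 30, 40, 50],
                'C': ['a', 'b', np.nan, 'd', 'e']}  # values taken from the output below
df_missing = pd.DataFrame(data_missing)
is_null = df_missing.isnull()
df_cleaned = df_missing.dropna()
df_filled = df_missing.fillna(value=0)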
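Filling every gap with 0 is not always appropriate, so it can help to treat columns individually. The sketch below is illustrative only: the data is hypothetical, and using the column mean as the fill value for B is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two numeric columns
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [10.0, np.nan, 30.0, 40.0],
})

# Count the missing values in each column
missing_per_column = df.isnull().sum()

# Drop a row only when column A is missing
rows_with_a = df.dropna(subset=["A"])

# Use a different fill value for each column
filled = df.fillna({"A": 0.0, "B": df["B"].mean()})

print(missing_per_column)
print(rows_with_a)
print(filled)
```

The subset argument of dropna limits the check to the named columns, and passing a dictionary to fillna maps each column to its own replacement value.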
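OUTPUT
     A     B    C
0  1.0  10.0    a
1  2.0   NaN    b
2  NaN  30.0  NaN
3  4.0  40.0    d
4  5.0  50.0    e

       A      B      C
0  False  False  False
1  False   True  False
2   True  False   True
3  False  False  False
4  False  False  False

     A     B  C
0  1.0  10.0  a
3  4.0  40.0  d
4  5.0  50.0  e

     A     B  C
0  1.0  10.0  a
1  2.0   0.0  b
2  0.0  30.0  0
3  4.0  40.0  d
4  5.0  50.0  e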
### 3. Hierarchical Indexing
#### Syntax and Explanation of Hierarchical Indexing:
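No code accompanies this section, so the following is a sketch of one way to build and query a two-level index. The level names (Letter, Number) and the values follow the OUTPUT shown for this section; the exact selection calls are assumptions.

```python
import pandas as pd

# Two index levels: an outer letter and an inner number
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1), ("B", 2)],
    names=["Letter", "Number"],
)
df = pd.DataFrame({"Values": [10, 20, 30, 40]}, index=index)
print(df)

# Select a single value by giving one label per index level
print(df.loc[("A", 1), "Values"])
print(df.loc[("B", 1), "Values"])

# Select every row under the outer label 'A'
print(df.loc["A"])
```

A tuple passed to .loc addresses both levels at once, while a single outer label returns all the rows stored under it.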
OUTPUT
Values
Letter Number
A 1 10
2 20
B 1 30
2 40
10
30
## Vectorized String Operations
Vectorized string operations in pandas allow you to efficiently
apply string methods to entire columns or Series of data. This is
particularly useful when you need to clean or transform text data
in bulk. For example:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie']}  # any further columns are cut off in the source
df = pd.DataFrame(data)
df['name_upper'] = df['name'].str.upper()
df['first_name'] = df['name'].str.extract(r'^(\w+)')
OUTPUT
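Several other methods are available through the same .str accessor and can be chained to clean text in bulk. The sketch below uses a hypothetical Series of untidy names; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical, slightly messy text data
s = pd.Series(["  alice smith", "BOB JONES ", "Charlie Brown"])

cleaned = s.str.strip().str.title()           # trim whitespace, then capitalise each word
has_b = cleaned.str.contains("B")             # True where the cleaned name contains 'B'
lengths = cleaned.str.len()                   # number of characters in each name
last_names = cleaned.str.split(" ").str[-1]   # last space-separated token of each name

print(cleaned)
print(has_b)
print(lengths)
print(last_names)
```

Each call works on the whole Series at once, so no explicit Python loop is needed.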
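import pandas as pd
import numpy as np
# Ten daily timestamps, matching the dates in the output below
dates = pd.date_range('2024-01-01', periods=10)
ts = pd.Series(np.random.randn(10), index=dates)
# Monthly mean; pandas 2.2 and later spell the month-end alias 'ME' instead of 'M'
ts_monthly = ts.resample('M').mean()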
3. *Time Zone Handling*: Pandas supports time zone localization
and conversion operations.
ts_utc = ts.tz_localize('UTC')
ts_ny = ts_utc.tz_convert('America/New_York')
import matplotlib.pyplot as plt  # needed for plt.show()
ts.plot()
plt.show()
OUTPUT
2024-01-01 -0.432560
2024-01-02 -0.173636
2024-01-03 0.293211
2024-01-04 0.047759
2024-01-05 0.991461
2024-01-06 0.914069
2024-01-07 0.281746
2024-01-08 0.647789
2024-01-09 0.151357
2024-01-10 0.443611
2024-01-31 0.234511
2024-02-29 0.434195
ts_utc:
ts_ny:
[Figure: line plot of ts produced by ts.plot()]
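Because the series above is built from random numbers, its values change on every run. The following sketch repeats the monthly averaging and time zone steps with fixed, hypothetical values so the results can be checked by hand. It groups by calendar month with to_period instead of resample, which is a substitution made here to sidestep the month-end alias that pandas 2.2 deprecated in favour of 'ME'.

```python
import pandas as pd

# Hypothetical daily series: Jan 30, Jan 31, Feb 1, Feb 2 of 2024
dates = pd.date_range("2024-01-30", periods=4, freq="D")
ts = pd.Series([1.0, 3.0, 10.0, 20.0], index=dates)

# Average the values that fall in the same calendar month
monthly_mean = ts.groupby(ts.index.to_period("M")).mean()

# Attach UTC to the naive timestamps, then convert to another time zone
ts_utc = ts.tz_localize("UTC")
ts_ny = ts_utc.tz_convert("America/New_York")

print(monthly_mean)
print(ts_utc.index[0])
print(ts_ny.index[0])
```

tz_localize only labels the existing timestamps with a zone, whereas tz_convert shifts the wall-clock time to the target zone while the underlying instant stays the same.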
Example:
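import pandas as pd
# Data reconstructed from the output below
data = {'column1': [10, 20, 30, 40, 50], 'column2': [50, 40, 30, 20, 10]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Add a row-wise sum (the column name 'total' is an assumption; the source omits it)
df['total'] = df['column1'] + df['column2']
print(df)
# Keep rows 2 and 3 as in the output (this condition is an assumption; the source omits it)
filtered_df = df[(df['column1'] > 20) & (df['column2'] > 10)]
print(filtered_df)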
OUTPUT
Original DataFrame:
column1 column2
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
0 10 50 60
1 20 40 60
2 30 30 60
3 40 20 60
4 50 10 60
2 30 30 60
3 40 20 60
### 1. Concat and Append
*Concatenation (pd.concat)*:
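import pandas as pd
# Example DataFrames (values taken from the output below)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})
result = pd.concat([df1, df2])
print(result)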
Output:
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
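*Appending (df.append)*:
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat instead.
import pandas as pd
# Example DataFrames (values taken from the output below)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})
appended = pd.concat([df1, df2])  # replaces df1.append(df2)
print(appended)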
Output:
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
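### 2. Merge and Join
*Merge (pd.merge)*:
import pandas as pd
# Example DataFrames (values taken from the output below; the names left and right are assumed)
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'B': ['B0', 'B1', 'B2', 'B3']})
merged = pd.merge(left, right, on='key')
print(merged)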
Output:
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
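In the merge output above every key appears in both tables, so the type of join does not matter. When the keys only partly overlap, the how parameter of pd.merge decides which rows survive. A minimal sketch with hypothetical tables:

```python
import pandas as pd

# Hypothetical tables whose keys only partly overlap
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

inner = pd.merge(left, right, on="key", how="inner")   # keys present in both tables
outer = pd.merge(left, right, on="key", how="outer")   # keys present in either table

print(inner)
print(outer)
```

With how="outer", cells that have no matching row on the other side are filled with missing values.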
*Join (df.join)*:
Joining is used to combine columns of two DataFrames based on
index.
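import pandas as pd
# Example DataFrames (values taken from the output below; the split of columns between left and right is assumed)
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}, index=['K0', 'K1', 'K2'])
# Joining on index
joined = left.join(right)
print(joined)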
Output:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
### 3. Aggregation and Grouping
*Aggregation (df.agg or groupby)*:
import pandas as pd
# Example DataFrame
df = pd.DataFrame(data)
print(grouped)
Output:
Value
Category
A 30
B 30
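The data and the grouping call are not shown above, so the following is only a sketch of how such a result could be produced. The values are hypothetical, chosen so that each category sums to 30, and using sum as the aggregation is an assumption.

```python
import pandas as pd

# Hypothetical data chosen so that each category sums to 30
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "Value": [10, 5, 20, 25],
})

# One row per category, holding the sum of Value
grouped = df.groupby("Category").sum()

# Several statistics at once for the same groups
summary = df.groupby("Category")["Value"].agg(["sum", "mean", "max"])

print(grouped)
print(summary)
```

groupby splits the rows by category, and the aggregation is then applied to each group separately.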
### 4. Pivot Tables
*Pivot (df.pivot_table)*:
import pandas as pd
# Example DataFrame
df = pd.DataFrame(data)
print(pivot_table)
Output:
Category A B
Date
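The pivot table code and most of its output are missing above. As an illustration only, here is a sketch that yields a table with Date as the index and Category as the columns. The data, the Value column name and the choice of sum as the aggregation are all assumptions.

```python
import pandas as pd

# Hypothetical data; the pair (2024-01-02, B) appears twice on purpose
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-02"],
    "Category": ["A", "B", "A", "B", "B"],
    "Value": [10, 20, 30, 40, 60],
})

# Rows come from Date, columns from Category; repeated pairs are combined with aggfunc
pivot_table = df.pivot_table(values="Value", index="Date", columns="Category", aggfunc="sum")
print(pivot_table)
```

Because one pair occurs twice, the corresponding cell holds the sum of both values, which is the main difference between pivot_table and a plain reshape.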