Pandas Introduction: What Is Python Pandas Used For?

Pandas is an open-source Python library designed for data manipulation and analysis, built on top of NumPy. It provides powerful tools for data cleaning, merging, handling missing data, and supports operations like grouping and visualization. The document also covers installation, indexing, selecting data, handling missing values, hierarchical indexing, vectorized string operations, and working with time series data.

Pandas Introduction

Pandas is a powerful, open-source Python library used for data manipulation and analysis. It provides data structures and functions for performing efficient operations on data.

What is Python Pandas used for?


The Pandas library is widely used for data science because it works well in conjunction with the other libraries commonly used for data science.

It is built on top of the NumPy library which means that a lot of the
structures of NumPy are used or replicated in Pandas.

The data produced by Pandas is often used as input for plotting functions
in Matplotlib, statistical analysis in SciPy, and machine learning algorithms
in Scikit-learn.

Why should you use the Pandas library? Python's Pandas library is an excellent tool for analyzing, cleaning, and manipulating data.

Here is a list of things that we can do using Pandas.

Cleaning, merging, and joining data sets.

Easy handling of missing data (represented as NaN) in both floating point and non-floating point data.

Inserting and deleting columns in DataFrames and higher-dimensional objects.

Powerful group-by functionality for performing split-apply-combine operations on data sets.

Data visualization.
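As a quick taste of several of these capabilities, here is a minimal sketch (the column names and data are made up for illustration) that fills a missing value and performs a split-apply-combine aggregation:

```python
import pandas as pd
import numpy as np

# Hypothetical sales data with one missing value
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'units': [10, np.nan, 30, 40],
})

# Missing data is represented as NaN and is easy to fill
sales['units'] = sales['units'].fillna(0)

# Split-apply-combine: total units per region
totals = sales.groupby('region')['units'].sum()
print(totals)
```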

Getting Started with Pandas


Let's see how to start working with the Python Pandas library:

Installing Pandas
The first step in working with Pandas is to check whether it is already installed on the system. If not, we need to install it using the pip command.

Follow these steps to install Pandas:

Step 1: Type 'cmd' in the search box and open it.

Step 2: Use the cd command to navigate to the folder where pip is installed.

Step 3: After locating it, type the command:

pip install pandas

Importing Pandas
After Pandas has been installed on the system, you need to import the library. It is conventionally imported as follows:

import pandas as pd
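Once imported, you can confirm the installation works by printing the installed version:

```python
import pandas as pd

# Print the installed pandas version to confirm the import succeeded
print(pd.__version__)
```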

Indexing and Selecting Data with Pandas
Indexing in Pandas:
Indexing in pandas means simply selecting particular rows and
columns of data from a DataFrame. Indexing could mean selecting
all the rows and some of the columns, some of the rows and all of
the columns, or some of each of the rows and columns. Indexing
can also be known as Subset Selection.

Indexing Methods in Pandas:

Label-based indexing using .loc[]:

Uses labels or names to select data. You can specify row labels
and column names.

Example:

df.loc['row_label', 'column_name']

Integer-based indexing using .iloc[]:

Uses integer positions to select data. It's similar to Python's typical 0-based indexing.

Example:

df.iloc[integer_row_position, integer_column_position]

Boolean indexing:

Uses boolean vectors to filter data. It selects rows or columns where the condition is True.

Example:

df[df['column_name'] > 0]

CODE:-

import pandas as pd

# Create a sample DataFrame

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [50000, 60000, 75000, 80000, 70000]
}

df = pd.DataFrame(data)

# Selecting a single column

print(df['Name'])

# Selecting multiple columns

print(df[['Name', 'Age']])

# Selecting rows by index label

print(df.loc[1]) # Select row with index label 1

# Selecting specific rows and columns

print(df.loc[[0, 2, 4], ['Name', 'City']])  # Rows 0, 2, 4 and columns 'Name' and 'City'

# Selecting rows by index position

print(df.iloc[3])  # Select row with index position 3 (zero-indexed)

# Selecting specific rows and columns by index position

print(df.iloc[[1, 3], [0, 2]])  # Rows 1 and 3, columns 0 and 2 (Name and City)

# Conditional selection
print(df[df['Age'] > 30])  # Select rows where Age is greater than 30

OUTPUT:-

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: Name, dtype: object

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4      Eve   45

Name              Bob
Age                30
City      Los Angeles
Salary          60000
Name: 1, dtype: object

      Name      City
0    Alice  New York
2  Charlie   Chicago
4      Eve   Phoenix

Name       David
Age           40
City     Houston
Salary     80000
Name: 3, dtype: object

    Name         City
1    Bob  Los Angeles
3  David      Houston

      Name  Age     City  Salary
2  Charlie   35  Chicago   75000
3    David   40  Houston   80000
4      Eve   45  Phoenix   70000
1. Operating on Data in Pandas
#### Syntax and Explanation of Common Operations:

- *Creating a DataFrame:*

import pandas as pd

# Example data

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 75000, 80000, 55000]
}

# Creating a DataFrame

df = pd.DataFrame(data)

- *Selecting Data:*

# Selecting specific columns

selected_columns = df[['Name', 'Age']]


- *Filtering Data:*

# Filtering based on conditions (Age > 30)

filtered_data = df[df['Age'] > 30]

- *Adding a New Column:*

# Adding a new column (e.g., Bonus)

df['Bonus'] = df['Salary'] * 0.1

- *Aggregating Data:*

# Calculating mean salary

mean_salary = df['Salary'].mean()

OUTPUT
Name Age Salary

0 Alice 25 50000

1 Bob 30 60000

2 Charlie 35 75000

3 David 40 80000

4 Emily 45 55000

Name Age

0 Alice 25
1 Bob 30

2 Charlie 35

3 David 40

4 Emily 45

Name Age Salary

2 Charlie 35 75000

3 David 40 80000

4 Emily 45 55000

Name Age Salary Bonus

0 Alice 25 50000 5000.0

1 Bob 30 60000 6000.0

2 Charlie 35 75000 7500.0

3 David 40 80000 8000.0

4 Emily 45 55000 5500.0

64000.0

### 2. Handling Missing Data
### Explanation of Functions:

- *.isnull()*:

- *Syntax:* DataFrame.isnull()
- *Explanation:* Returns a boolean DataFrame indicating where
values are NaN (missing).

- *.dropna()*:

- *Syntax:* DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

- *Explanation:* Drops rows or columns with missing values (NaN).

- *.fillna()*:

- *Syntax:* DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

- *Explanation:* Fills missing values (NaN) using specified methods or values.

- *.MultiIndex.from_tuples()*:

- *Syntax:* pd.MultiIndex.from_tuples(tuples, names=None)

- *Explanation:* Creates a multi-level index object from tuples.

- *.loc[]*:

- *Syntax:* DataFrame.loc[label] or DataFrame.loc[row_label, column_label]

- *Explanation:* Accesses a group of rows and columns by label(s) or a boolean array.

- *Creating a DataFrame with Missing Values:*


data_missing = {
    'A': [1, 2, None, 4, 5],
    'B': [10, None, 30, 40, 50],
    'C': ['a', 'b', None, 'd', 'e']
}

df_missing = pd.DataFrame(data_missing)

- *Detecting Missing Values:*

# Detecting missing values

is_null = df_missing.isnull()

- *Dropping Rows with Any Missing Values:*

# Dropping rows with any missing values

df_cleaned = df_missing.dropna()

- *Filling Missing Values with a Specified Value:*

# Filling missing values with 0

df_filled = df_missing.fillna(value=0)

OUTPUT
A B C

0 1.0 10.0 a

1 2.0 NaN b

2 NaN 30.0 None

3 4.0 40.0 d

4 5.0 50.0 e

A B C

0 False False False

1 False True False

2 True False True

3 False False False

4 False False False

A B C

0 1.0 10.0 a

3 4.0 40.0 d

4 5.0 50.0 e

A B C

0 1.0 10.0 a

1 2.0 0.0 b

2 0.0 30.0 0

3 4.0 40.0 d

4 5.0 50.0 e
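Besides filling with a constant, missing values can also be filled from neighbouring rows. A minimal sketch using forward fill on the same kind of data:

```python
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, None, 4, 5],
    'B': [10, None, 30, 40, 50],
})

# Forward fill: propagate the last valid observation downward
df_ffilled = df_missing.ffill()
print(df_ffilled)
```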

### 3. Hierarchical Indexing
#### Syntax and Explanation of Hierarchical Indexing:

- *Creating a DataFrame with Hierarchical Indexing:*

# Creating a DataFrame with hierarchical indexing

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['Letter', 'Number'])

df_hierarchical = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=index)

- *Accessing Data with Hierarchical Indexing:*

# Accessing data with hierarchical indexing

value_A1 = df_hierarchical.loc[('A', 1), 'Values']

- *Aggregating Data with Hierarchical Indexing:*

# Aggregating data with hierarchical indexing

sum_A_values = df_hierarchical.loc['A', 'Values'].sum()

OUTPUT
Values

Letter Number
A 1 10

2 20

B 1 30

2 40

10

30

These explanations and syntaxes should give you a clear understanding of how to use these Pandas functions and methods effectively for data manipulation, handling missing data, and working with hierarchical indexing.
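Another useful way to slice a hierarchical index is the cross-section helper .xs(), which selects by a value on a named level. A minimal sketch on the same MultiIndex DataFrame:

```python
import pandas as pd

# Rebuild the hierarchical DataFrame from the section above
index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['Letter', 'Number'])
df_hierarchical = pd.DataFrame({'Values': [10, 20, 30, 40]}, index=index)

# Select every row where the 'Number' level equals 1
number_one = df_hierarchical.xs(1, level='Number')
print(number_one)
```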

## Vectorized String Operations
Vectorized string operations in pandas allow you to efficiently
apply string methods to entire columns or Series of data. This is
particularly useful when you need to clean or transform text data
in bulk. Here are some key points:

1. *Series.str Methods*: Pandas provides a str accessor that exposes a set of string methods, similar to Python's built-in string methods. These methods can be accessed through Series.str.

import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],

'city': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

# Example: Convert names to uppercase

df['name_upper'] = df['name'].str.upper()

2. *Handling Missing Values*: Vectorized string operations gracefully handle missing values (NaN) without raising errors, which simplifies data cleaning workflows. Note that a missing value is NaN, not the string 'NaN', so it is filled with fillna() rather than str.replace().

# Example: Replace missing cities with a default value

df['city'] = df['city'].fillna('Unknown')

3. *Regular Expressions*: String methods in pandas support regular expressions (regex=True), enabling complex pattern matching and extraction tasks.

# Example: Extract the first name using regular expressions

df['first_name'] = df['name'].str.extract(r'^(\w+)')

OUTPUT

      name         city name_upper first_name
0    Alice     New York      ALICE      Alice
1      Bob  Los Angeles        BOB        Bob
2  Charlie      Chicago    CHARLIE    Charlie
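The same str accessor also supports boolean filtering with methods such as str.contains. A small sketch assuming the DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'city': ['New York', 'Los Angeles', 'Chicago']})

# Keep only the rows whose city contains the word 'New'
new_cities = df[df['city'].str.contains('New')]
print(new_cities)
```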

### Working with Time Series
Pandas provides robust support for working with time series
data, including powerful tools for manipulating, analyzing, and
visualizing time-indexed data. Here are some key features:

1. *DateTime Index*: Pandas has a specialized DatetimeIndex to efficiently handle time-series data.

import pandas as pd

import numpy as np

# Create a time series with a datetime index

dates = pd.date_range('2024-01-01', periods=10)

ts = pd.Series(np.random.randn(10), index=dates)

2. *Resampling and Frequency Conversion*: Easily change the frequency of your time series data using resample().

# Resample to monthly frequency

ts_monthly = ts.resample('M').mean()
3. *Time Zone Handling*: Pandas supports time zone localization
and conversion operations.

# Convert to a different time zone

ts_utc = ts.tz_localize('UTC')

ts_ny = ts_utc.tz_convert('America/New_York')

4. *Time Series Plotting*: Pandas integrates with Matplotlib to provide convenient plotting methods for time series data.

import matplotlib.pyplot as plt

# Plot the time series

ts.plot()

plt.show()

OUTPUT
2024-01-01 -0.432560

2024-01-02 -0.173636

2024-01-03 0.293211

2024-01-04 0.047759

2024-01-05 0.991461

2024-01-06 0.914069

2024-01-07 0.281746

2024-01-08 0.647789

2024-01-09 0.151357

2024-01-10 0.443611

Freq: D, dtype: float64


ts_monthly:

2024-01-31    0.316481
Freq: M, dtype: float64

ts_utc:

2024-01-01 00:00:00+00:00 -0.432560

2024-01-02 00:00:00+00:00 -0.173636

2024-01-03 00:00:00+00:00 0.293211

2024-01-04 00:00:00+00:00 0.047759

2024-01-05 00:00:00+00:00 0.991461

2024-01-06 00:00:00+00:00 0.914069

2024-01-07 00:00:00+00:00 0.281746

2024-01-08 00:00:00+00:00 0.647789

2024-01-09 00:00:00+00:00 0.151357

2024-01-10 00:00:00+00:00 0.443611

Freq: D, dtype: float64

ts_ny:

2023-12-31 19:00:00-05:00   -0.432560
2024-01-01 19:00:00-05:00   -0.173636
2024-01-02 19:00:00-05:00    0.293211
2024-01-03 19:00:00-05:00    0.047759
2024-01-04 19:00:00-05:00    0.991461
2024-01-05 19:00:00-05:00    0.914069
2024-01-06 19:00:00-05:00    0.281746
2024-01-07 19:00:00-05:00    0.647789
2024-01-08 19:00:00-05:00    0.151357
2024-01-09 19:00:00-05:00    0.443611
Freq: D, dtype: float64

[Line plot of the ten daily time series values from 2024-01-01 to 2024-01-10]
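Rolling-window calculations are another common time-series tool. A minimal sketch (using a simple increasing series rather than random data, so the result is predictable) computing a 3-day moving average:

```python
import pandas as pd
import numpy as np

# A simple daily series: 0.0, 1.0, ..., 9.0
dates = pd.date_range('2024-01-01', periods=10)
ts = pd.Series(np.arange(10, dtype=float), index=dates)

# 3-day rolling mean; the first two entries are NaN because the
# window is not yet full
rolling_mean = ts.rolling(window=3).mean()
print(rolling_mean)
```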
### High-Performance Pandas: eval() and query()
Pandas provides the eval() and query() functions for high-performance operations, especially useful for large datasets. These functions leverage the numexpr library under the hood for efficient computation.

1. *eval()*: Performs expression evaluation on DataFrame columns.

# Example: Compute a new column using eval()

df.eval('total = column1 + column2', inplace=True)

- Supports arithmetic operations (+, -, *, /), comparison operators, and function calls.

- Useful for complex calculations involving large datasets where performance is critical.

2. *query()*: Filters rows based on a boolean expression.

# Example: Filter rows using query()

filtered_df = df.query('column1 > 0 and column2 < 100')


- Provides a more readable and expressive syntax for filtering
compared to traditional boolean indexing.

- Can significantly improve performance for large DataFrames.

Example:

import pandas as pd

# Create a sample DataFrame

data = {'column1': [10, 20, 30, 40, 50],

'column2': [50, 40, 30, 20, 10]}

df = pd.DataFrame(data)

print("Original DataFrame:")

print(df)

# Compute a new column using eval()

df.eval('total = column1 + column2', inplace=True)

print("\nDataFrame after adding a new column using eval():")

print(df)

# Filter rows using query()

filtered_df = df.query('column1 > 20 and column2 < 40')

print("\nFiltered DataFrame using query():")

print(filtered_df)

OUTPUT

Original DataFrame:

column1 column2

0 10 50
1 20 40

2 30 30

3 40 20

4 50 10

DataFrame after adding a new column using eval():

column1 column2 total

0 10 50 60

1 20 40 60

2 30 30 60

3 40 20 60

4 50 10 60

Filtered DataFrame using query():

   column1  column2  total
2       30       30     60
3       40       20     60
4       50       10     60
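query() can also reference local Python variables with the @ prefix. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'column1': [10, 20, 30, 40, 50],
                   'column2': [50, 40, 30, 20, 10]})

threshold = 25
# The @ prefix injects the local variable into the query expression
above = df.query('column1 > @threshold')
print(above)
```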

## 1. Concat and Append
*Concatenation (pd.concat)*:

Concatenation is used to combine DataFrames along a particular axis (either rows or columns).

import pandas as pd
# Example DataFrames

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],

'B': ['B3', 'B4', 'B5']})

# Concatenate along rows (axis=0)

result = pd.concat([df1, df2])

print(result)

Output:

A B

0 A0 B0

1 A1 B1

2 A2 B2

0 A3 B3

1 A4 B4

2 A5 B5
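Concatenation along columns works the same way with axis=1; a minimal sketch (df3 is a hypothetical second frame with different columns):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df3 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# Concatenate side by side, aligning on the row index
wide = pd.concat([df1, df3], axis=1)
print(wide)
```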

*Appending (df.append)*:

Appending adds the rows of one DataFrame to another. Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on current versions use pd.concat([df1, df2]) instead.

import pandas as pd
# Example DataFrames

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Append df2 to df1 (works only on pandas < 2.0)
appended = df1.append(df2)

print(appended)

Output:

A B

0 A0 B0

1 A1 B1

2 A2 B2

0 A3 B3

1 A4 B4

2 A5 B5

### 2. Merge and Join
*Merge (pd.merge)*:

Merging is used to combine DataFrames using one or more keys.

import pandas as pd

# Example DataFrames

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],

'A': ['A0', 'A1', 'A2', 'A3']})

df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],

'B': ['B0', 'B1', 'B2', 'B3']})

# Merge on 'key' column

merged = pd.merge(df1, df2, on='key')

print(merged)

Output:

key A B

0 K0 A0 B0

1 K1 A1 B1

2 K2 A2 B2

3 K3 A3 B3
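By default pd.merge performs an inner join; the how parameter selects other join types. A small sketch with partially overlapping keys:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
df2 = pd.DataFrame({'key': ['K1', 'K2'], 'B': ['B1', 'B2']})

# Outer join keeps keys from both frames; missing cells become NaN
outer = pd.merge(df1, df2, on='key', how='outer')
print(outer)
```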

*Join (df.join)*:
Joining is used to combine columns of two DataFrames based on
index.

import pandas as pd

# Example DataFrames

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],

'B': ['B0', 'B1', 'B2']},

index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C1', 'C2'],

'D': ['D0', 'D1', 'D2']},

index=['K0', 'K1', 'K2'])

# Joining on index

joined = left.join(right)

print(joined)

Output:

A B C D

K0 A0 B0 C0 D0

K1 A1 B1 C1 D1

K2 A2 B2 C2 D2
### 3. Aggregation and Grouping
*Aggregation (df.agg or groupby)*:

Aggregation allows you to perform calculations on groups of data.

import pandas as pd

# Example DataFrame

data = {'Category': ['A', 'B', 'A', 'B', 'A'],

'Value': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data)

# Group by 'Category' and calculate mean of 'Value'

grouped = df.groupby('Category').agg({'Value': 'mean'})

print(grouped)

Output:

          Value
Category
A          30.0
B          30.0
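agg() also accepts several functions at once, computing multiple statistics per group in one pass. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'],
                   'Value': [10, 20, 30, 40, 50]})

# Mean, sum, and count of 'Value' for each category
stats = df.groupby('Category')['Value'].agg(['mean', 'sum', 'count'])
print(stats)
```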
### 4. Pivot Tables
*Pivot (df.pivot_table)*:

Pivot tables allow you to summarize and aggregate data inside a DataFrame.

import pandas as pd

# Example DataFrame

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'Category': ['A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40]}

df = pd.DataFrame(data)

# Create a pivot table

pivot_table = df.pivot_table(index='Date', columns='Category',
                             values='Value', aggfunc='sum')

print(pivot_table)

Output:

Category A B

Date

2023-01-01 10.0 20.0

2023-01-02 30.0 40.0
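pivot_table can also append row and column totals via margins=True; a small sketch on the same data:

```python
import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'Category': ['A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40]}
df = pd.DataFrame(data)

# margins=True appends an 'All' row and column holding the totals
pivot = df.pivot_table(index='Date', columns='Category',
                       values='Value', aggfunc='sum', margins=True)
print(pivot)
```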


These examples cover the basic usage of each concept in Pandas.
They should help you get started with manipulating and analyzing
data using these powerful tools.
