14oct Pandas 2024
1. What is Pandas?
Pandas is a powerful open-source library used for data manipulation and analysis.
It provides easy-to-use data structures and functions to work with structured data
like tables (rows and columns). Pandas is widely used in data science and machine
learning for handling large datasets.
2. Installation
To install pandas, you can use the package installer pip:
bash
pip install pandas
This command installs pandas, allowing you to start using its features in your
projects.
3. Data Structures
Pandas provides two main data structures:
- Series: A one-dimensional array-like object (similar to a list or array). It
is labeled and can hold any type of data (e.g., integers, strings).
- DataFrame: A two-dimensional table with rows and columns, similar to a
spreadsheet or SQL table. It can hold multiple data types and is the primary data
structure in pandas.
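As a minimal sketch of how the two structures relate, the example below builds a small DataFrame and pulls one column out of it. The names and values are illustrative and not taken from a real dataset.

```python
import pandas as pd

# Illustrative data; the names and values are made up for this sketch.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # each column keeps its own data type

# Selecting one column of a DataFrame returns a Series
# that shares the DataFrame's row labels (its index).
age_column = df['Age']
print(type(age_column))
print(age_column.index.equals(df.index))
```

In other words, a DataFrame can be read as a collection of Series that share one index.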
4. Basic Operations
With pandas, you can perform a variety of operations:
- Creating DataFrames: You can create a DataFrame from lists or dictionaries, or by
reading data from files (like CSV or Excel).
- Viewing Data: You can view parts of the DataFrame using methods
like .head(), .tail(), and .info() to understand the structure of the data.
- Manipulating Data: This includes selecting specific rows or columns, filtering
data, adding or removing columns, and performing operations on the data.
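The viewing and manipulating operations listed above can be sketched in a few lines. The data is illustrative, and the threshold of 25 is an arbitrary choice for the example.

```python
import pandas as pd

# Illustrative data for this sketch.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 35, 28, 22]})

print(df.head(2))  # first 2 rows
print(df.tail(2))  # last 2 rows
df.info()          # prints column names, non-null counts, and dtypes

# Manipulating: keep only the rows where Age is above 25, then pick one column.
older = df[df['Age'] > 25]
print(older['Name'].tolist())
```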
1. Install Pandas
(Run this in your terminal or command prompt)
bash
pip install pandas
2. Create a Series
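A minimal sketch of creating a Series, assuming illustrative values and labels that are not taken from these notes:

```python
import pandas as pd

# Illustrative values and labels, chosen only for this sketch.
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages)

# Values can be looked up by label or by position.
print(ages['Bob'])   # by label
print(ages.iloc[0])  # by position
```

If no `index` is passed, pandas labels the values 0, 1, 2, and so on.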
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 28, 22]}
df = pd.DataFrame(data)
print(df.head())  # View the first 5 rows of the DataFrame
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df['City'] = ['New York', 'Los Angeles', 'Chicago']  # Add a new column 'City'
print(df)
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35],
        'City': ['NY', 'LA', 'CHI']}
df = pd.DataFrame(data)
df = df.drop('City', axis=1)  # Drop the 'City' column (axis=1 means columns)
print(df)
1. Reading Data
Data often lives in CSV, Excel, or JSON files, in SQL databases, or behind web APIs,
and it needs to be loaded into Python for analysis. Pandas provides easy-to-use
functions for this.
- Loading CSV Files (pd.read_csv()): This function reads data from CSV files
into a DataFrame.
- Loading Excel Files (pd.read_excel()): This function reads data from Excel
files into a DataFrame.
- Loading JSON Files (pd.read_json()): This function reads JSON data into a
DataFrame.
- Loading Data from SQL Databases (pd.read_sql()): This function reads data from
SQL databases into a DataFrame.
- Loading Data from APIs: Data can also be loaded from web APIs by making HTTP
requests and converting the response into a DataFrame.
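The loaders above can be tried without any files on disk. The sketch below is a minimal example under a few assumptions: the CSV and JSON text are held in memory, the SQL source is a throwaway in-memory SQLite database, and the API response is represented by an already parsed list of dictionaries (the HTTP request itself is left out). The table name `people` and all values are hypothetical. `pd.read_excel()` is omitted because it needs an extra Excel engine such as openpyxl to be installed.

```python
import io
import sqlite3

import pandas as pd

# CSV: read_csv accepts a path or a file-like object; StringIO stands in for a file.
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'
df_json = pd.read_json(io.StringIO(json_text))

# SQL: read_sql runs a query on an open connection (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (Name TEXT, Age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Alice", 25), ("Bob", 30)])
df_sql = pd.read_sql("SELECT Name, Age FROM people", conn)
conn.close()

# API: once a response is parsed into a list of dicts, pass it to DataFrame.
api_records = [{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]
df_api = pd.DataFrame(api_records)

for frame in (df_csv, df_json, df_sql, df_api):
    print(frame)
```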
2. Writing Data
After analyzing or manipulating data, it’s often saved back to a file or sent to
another system. Pandas can write data to multiple formats like CSV, Excel, or JSON.
- Saving Data as CSV (DataFrame.to_csv()): This function saves DataFrame data to
a CSV file.
- Saving Data as Excel (DataFrame.to_excel()): This function saves DataFrame
data to an Excel file.
- Saving Data as JSON (DataFrame.to_json()): This function saves DataFrame data
to a JSON file.
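A minimal round-trip sketch of the writers above: the DataFrame is saved as CSV and JSON and then read back. The file names are hypothetical, and a temporary directory is used so the example leaves nothing behind. `to_excel()` is left out because it needs an extra Excel engine such as openpyxl.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'people.csv')    # hypothetical file name
    json_path = os.path.join(tmp_dir, 'people.json')  # hypothetical file name

    # index=False leaves the row labels out of the file.
    df.to_csv(csv_path, index=False)
    # orient='records' writes one JSON object per row.
    df.to_json(json_path, orient='records')

    # Read the files back to confirm the round trip.
    df_from_csv = pd.read_csv(csv_path)
    df_from_json = pd.read_json(json_path)

print(df_from_csv)
print(df_from_json)
```

Without `index=False`, `to_csv()` also writes the row labels as an extra first column, which then shows up as an unnamed column when the file is read back.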
1. DataFrame Indexing
Indexing allows selecting specific rows or columns from a DataFrame. It's important
for working with specific parts of your data.
- Selecting Rows/Columns with .loc[]: Used to select data by label (index names
or column names).
- Selecting Rows/Columns with .iloc[]: Used to select data by position (row or
column numbers).
- Conditional Filtering: Used to filter rows based on a condition (e.g., select
rows where age > 30).
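The difference between label-based and position-based selection is easiest to see when the row labels are not plain numbers. The sketch below assumes a DataFrame whose index holds names; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [28, 22, 35], 'City': ['New York', 'Boston', 'Chicago']},
    index=['John', 'Anna', 'Peter'],  # row labels
)

# .loc selects by label: the row labelled 'Anna', column 'City'.
print(df.loc['Anna', 'City'])

# .iloc selects by position: second row (1), second column (1).
print(df.iloc[1, 1])

# Label slices include the end label; position slices exclude the end position.
rows_by_label = df.loc['John':'Anna']  # two rows
rows_by_position = df.iloc[0:1]        # one row
print(rows_by_label)
print(rows_by_position)
```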
2. Adding/Deleting Columns
We can easily modify data by adding new columns or removing unwanted ones.
- Adding Columns: New columns can be added by assigning values to a new column
name.
- Deleting Columns: Columns can be removed using the .drop() function.
3. Conditional Filtering
python
import pandas as pd
5. Delete a Column
python
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35],
        'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
df = df.drop('City', axis=1)  # Remove the 'City' column
print(df)
6. Rename Columns
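A minimal sketch of renaming columns with `DataFrame.rename()`. The new column names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]})

# Map old column names to new ones; columns that are not listed stay unchanged.
# rename() returns a new DataFrame and leaves the original as it was.
df_renamed = df.rename(columns={'Name': 'FirstName', 'Age': 'AgeYears'})
print(df_renamed.columns.tolist())
```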
python
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35],
        'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
selected_columns = df.loc[:, ['Name', 'City']]  # Select the 'Name' and 'City' columns by label
print(selected_columns)
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35],
        'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
selected_columns = df.iloc[:, [0, 2]]  # Select the first and third columns by position
print(selected_columns)
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35],
        'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
filtered_data = df[(df['Age'] > 25) & (df['City'] == 'New York')]  # Age > 25 and lives in New York
print(filtered_data)
2. Duplicates
Sometimes data may have duplicate rows, which can lead to inaccurate results.
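- Identifying Duplicates: We can find duplicate rows using the .duplicated()
function (it returns True for each row that repeats an earlier row; the first
occurrence is marked False).
- Removing Duplicates: We can remove duplicates using the .drop_duplicates()
function, which keeps the first occurrence of each row by default.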
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
print(df.isna())  # True marks a missing value
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_filled = df.fillna(0)  # Replace every missing value with 0
print(df_filled)
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_filled = df.fillna('Unknown')  # Replace every missing value with the text 'Unknown'
print(df_filled)
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_cleaned = df.dropna()  # Drop every row that contains a missing value
print(df_cleaned)
5. Identify Duplicates
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'John'], 'Age': [28, 22, 35, 28]}
df = pd.DataFrame(data)
print(df.duplicated())
6. Remove Duplicates
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'John'], 'Age': [28, 22, 35, 28]}
df = pd.DataFrame(data)
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Lucy'], 'Age': [28, 22, None, 30]}
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill the missing age with the column mean
print(df)
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'Location':
[None, 'NY', 'CA']}
df = pd.DataFrame(data)
df_cleaned = df.dropna(axis=1)  # Drop every column that contains a missing value
print(df_cleaned)
- **Sum of Values (.sum()):** This function returns the sum of the values in a
specified column, giving the column's overall total.
- **Mean Value (.mean()):** This function calculates the average of the values in a
specified column, providing insight into the central tendency.
- **Median Value (.median()):** This function calculates the median, which is the
middle value when data is sorted, offering a robust measure of central tendency.
- **Sorting Values (.sort_values()):** This function sorts the data based on the
values in a specified column, making it easier to identify trends and outliers.
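The aggregation and sorting methods above can be sketched on one small table. The ages match the describe() example in these notes; the names are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Lucy', 'Mark'],
                   'Age': [28, 22, 35, 30, 25]})

total_age = df['Age'].sum()       # sum of all values in the column
mean_age = df['Age'].mean()       # arithmetic average
median_age = df['Age'].median()   # middle value of the sorted column
print(total_age, mean_age, median_age)

# sort_values returns a new DataFrame; ascending=False puts the largest value first.
oldest_first = df.sort_values('Age', ascending=False)
print(oldest_first)
```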
3. Group By
- **Grouping Data (.groupby()):** This function allows you to group the data based
on one or more columns and apply aggregate functions to summarize the data. For
example, it can compute the mean or count for each group, providing insights into
various segments.
A streaming service such as Spotify, for example, could use groupby on listening
data to find the top artists by region or time period. Grouping the data this way
helps reveal user preferences, trends, and areas for targeted marketing.
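A minimal sketch of the "top artists by region" idea, using made-up listening data. The `Region` column and all values are hypothetical; this shows one way to apply several aggregates per group, not how any particular company does it.

```python
import pandas as pd

# Hypothetical listening data.
df = pd.DataFrame({
    'Region': ['EU', 'EU', 'US', 'US', 'US'],
    'Artist': ['A', 'B', 'A', 'C', 'B'],
    'Streams': [100, 150, 200, 300, 250],
})

# Several aggregates per group in one call.
summary = df.groupby('Region')['Streams'].agg(['sum', 'mean', 'count'])
print(summary)

# Top artist per region: the row with the highest stream count in each group.
top_rows = df.loc[df.groupby('Region')['Streams'].idxmax()]
print(top_rows[['Region', 'Artist', 'Streams']])
```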
1. **Descriptive Statistics**
python
import pandas as pd
data = {'Age': [28, 22, 35, 30, 25]}
df = pd.DataFrame(data)
print(df.describe())
2. **Sum of Values**
python
import pandas as pd
3. **Mean Value**
python
import pandas as pd
4. **Count of Values**
python
import pandas as pd
5. **Median Value**
python
import pandas as pd
6. **Value Counts**
python
import pandas as pd
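A minimal sketch of counting values, assuming an illustrative `City` column with one missing entry. It shows both `.count()` and `.value_counts()`.

```python
import pandas as pd

# Illustrative data with one missing value.
df = pd.DataFrame({'City': ['NY', 'LA', 'NY', None, 'NY']})

# count() returns the number of non-missing values in the column.
non_missing = df['City'].count()
print(non_missing)

# value_counts() returns how often each distinct value occurs, most frequent first.
city_counts = df['City'].value_counts()
print(city_counts)
```

By default `value_counts()` leaves missing values out of the result.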
7. **Sorting Values**
python
import pandas as pd
data = {
    'Artist': ['A', 'B', 'A', 'C', 'B'],
    'Streams': [100, 150, 200, 300, 250]
}
df = pd.DataFrame(data)
grouped = df.groupby('Artist')['Streams'].sum()  # Total streams per artist
print(grouped)