Pandas

Chapter 1: Introduction to Pandas

- What is Pandas? Overview of the pandas library.
- Installation: Installing pandas (pip install pandas).
- Data Structures: Introduction to pandas Series and DataFrame.
- Basic Operations: Creating, viewing, and manipulating DataFrames.
- Use in Companies: Top companies like Google, Facebook, and Netflix use pandas for data analysis, loading and exploring data from CSV, Excel, or SQL databases.

1. What is Pandas?
Pandas is a powerful open-source library used for data manipulation and analysis.
It provides easy-to-use data structures and functions to work with structured data
like tables (rows and columns). Pandas is widely used in data science and machine
learning for handling large datasets.

2. Installation
To install pandas, you can use the package installer pip:
bash
pip install pandas

This command installs pandas, allowing you to start using its features in your
projects.

3. Data Structures
Pandas provides two main data structures:
- Series: A one-dimensional array-like object (similar to a list or array). It
is labeled and can hold any type of data (e.g., integers, strings).
- DataFrame: A two-dimensional table with rows and columns, similar to a spreadsheet or SQL table. It can hold multiple data types and is the primary data structure in pandas (a short sketch of both structures follows this list).
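
A minimal sketch of how the two structures relate (made-up data, not from the examples below): each column of a DataFrame is itself a Series.

import pandas as pd

# A Series: a single labeled column of values
ages = pd.Series([25, 30, 35], name='Age')

# A DataFrame: a table built from one or more columns
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Selecting one column of a DataFrame returns a Series
print(type(df['Age']))  # <class 'pandas.core.series.Series'>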

4. Basic Operations
With pandas, you can perform a variety of operations:
- Creating DataFrames: You can create a DataFrame from lists, dictionaries, or
reading data from files (like CSV or Excel).
- Viewing Data: You can view parts of the DataFrame using methods like .head(), .tail(), and .info() to understand the structure of the data (a .tail() sketch follows this list).
- Manipulating Data: This includes selecting specific rows or columns, filtering
data, adding or removing columns, and performing operations on the data.
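
The numbered examples later in this chapter cover .head() and .info(); as a quick sketch with made-up data, .tail() works the same way but shows the last rows:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'Age': [25, 30, 35, 28, 22]}
df = pd.DataFrame(data)
print(df.tail(2))  # Viewing the last 2 rows of the DataFrame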

Real-World Use in Companies:


Top companies like Google, Facebook, and Netflix use pandas for data analysis. They
load data from sources like CSV, Excel, or SQL databases and use pandas to explore,
clean, and manipulate the data before building machine learning models or
generating insights. For instance, Netflix might load user behavior data to analyze
and improve their recommendation system using pandas.

1. Install Pandas
(Run this in your terminal or command prompt)
bash
pip install pandas
2. Create a Series

import pandas as pd

data = [10, 20, 30, 40]

series = pd.Series(data)  # Creating a Series
print(series)

3. Create a DataFrame from a Dictionary

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

df = pd.DataFrame(data)  # Creating a DataFrame from a dictionary
print(df)

4. View First 5 Rows of a DataFrame

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'Age': [25, 30, 35, 28, 22]}
df = pd.DataFrame(data)
print(df.head())  # Viewing the first 5 rows of the DataFrame

5. View DataFrame Information

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

df = pd.DataFrame(data)
df.info()  # Viewing information about the DataFrame (info() prints its summary directly)

6. Select a Column from DataFrame

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

df = pd.DataFrame(data)
print(df['Name'])  # Selecting the 'Name' column

7. Filter Rows Based on a Condition

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

df = pd.DataFrame(data)
filtered_df = df[df['Age'] > 25]  # Filtering rows where Age > 25
print(filtered_df)

8. Add a New Column to DataFrame

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
df['City'] = ['New York', 'Los Angeles', 'Chicago']  # Adding a new column 'City'
print(df)

9. Drop a Column from DataFrame

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CHI']}
df = pd.DataFrame(data)
df = df.drop('City', axis=1)  # Dropping the 'City' column
print(df)

10. Save DataFrame to CSV

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)  # Saving the DataFrame to a CSV file

Chapter 2: Loading Data


- Reading Data: Loading data from CSV, Excel, JSON, SQL, and APIs using functions like pd.read_csv(), pd.read_excel(), pd.read_sql(), etc.
- Writing Data: Saving DataFrames to different formats (CSV, Excel, JSON).
- Use in Companies: Data ingestion is crucial for firms like Amazon, which process large datasets from different formats and sources in their analytics pipelines.

1. Reading Data
Data is often stored in different formats like CSV, Excel, JSON, SQL, or APIs, and
it needs to be loaded into Python for analysis. Pandas provides easy-to-use
functions for this.
- Loading CSV Files (pd.read_csv()): This function reads data from CSV files
into a DataFrame.
- Loading Excel Files (pd.read_excel()): This function reads data from Excel
files into a DataFrame.
- Loading JSON Files (pd.read_json()): This function reads JSON data into a
DataFrame.
- Loading Data from SQL Databases (pd.read_sql()): This function reads data from
SQL databases into a DataFrame.
- Loading Data from APIs: Data can also be loaded from web APIs by making HTTP
requests and converting the response into a DataFrame.

2. Writing Data
After analyzing or manipulating data, it’s often saved back to a file or sent to
another system. Pandas can write data to multiple formats like CSV, Excel, or JSON.
- Saving Data as CSV (DataFrame.to_csv()): This function saves DataFrame data to
a CSV file.
- Saving Data as Excel (DataFrame.to_excel()): This function saves DataFrame
data to an Excel file.
- Saving Data as JSON (DataFrame.to_json()): This function saves DataFrame data
to a JSON file.

Real-World Use in Companies:


Companies like Amazon need to ingest and process large datasets from various
formats such as CSV, Excel, and JSON for analytics. For example, sales data may
come from CSV files, inventory data from Excel, and customer feedback data from a
JSON API. Data ingestion is a critical part of their analytics pipeline to gather
insights and make data-driven decisions.

1. Read Data from CSV File


python
import pandas as pd

df = pd.read_csv('data.csv')  # Reading a CSV file
print(df.head())  # Display the first 5 rows

2. Read Data from Excel File


python
import pandas as pd

df = pd.read_excel('data.xlsx')  # Reading an Excel file
print(df.head())  # Display the first 5 rows

3. Read Data from JSON File


python
import pandas as pd

df = pd.read_json('data.json')  # Reading a JSON file
print(df.head())  # Display the first 5 rows

4. Read Data from SQL Database


python
import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')  # Connect to the SQL database

df = pd.read_sql('SELECT * FROM users', conn)  # Reading from SQL
print(df.head())  # Display the first 5 rows

5. Read Data from API


python
import pandas as pd
import requests

response = requests.get('https://api.example.com/data')  # Make the API request

df = pd.DataFrame(response.json())  # Convert the API response to a DataFrame
print(df.head())  # Display the first 5 rows
6. Save Data to CSV File
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}

df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)  # Save the DataFrame to a CSV file

7. Save Data to Excel File


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}

df = pd.DataFrame(data)
df.to_excel('output.xlsx', index=False)  # Save the DataFrame to an Excel file

8. Save Data to JSON File


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}

df = pd.DataFrame(data)
df.to_json('output.json')  # Save the DataFrame to a JSON file

9. Read Specific Columns from CSV File


python
import pandas as pd

df = pd.read_csv('data.csv', usecols=['Name', 'Age'])  # Load only the 'Name' and 'Age' columns
print(df.head())  # Display the first 5 rows

10. Load Data from Excel with Specific Sheet


python
import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet2')  # Load data from 'Sheet2'
print(df.head())  # Display the first 5 rows

Chapter 3: DataFrame Manipulation


- DataFrame Indexing: Selecting rows/columns using .loc[], .iloc[], and conditional
filtering.
- Adding/Deleting Columns: Modifying data by adding or dropping columns.
- Renaming Columns and Index: Using .rename().
- Use in Companies: Companies like Airbnb use this for cleaning and transforming data from user listings to analyze trends and improve services.

1. DataFrame Indexing
Indexing allows selecting specific rows or columns from a DataFrame. It's important
for working with specific parts of your data.
- Selecting Rows/Columns with .loc[]: Used to select data by label (index names or column names); a combined row-and-column selection is sketched after this list.
- Selecting Rows/Columns with .iloc[]: Used to select data by position (row or
column numbers).
- Conditional Filtering: Used to filter rows based on a condition (e.g., select
rows where age > 30).
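
As a small sketch (same kind of made-up data as the numbered examples below), .loc[] can also select rows and columns in a single call by label:

python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}
df = pd.DataFrame(data)

# Rows with index labels 0 through 1, and only the 'Name' column
subset = df.loc[0:1, ['Name']]
print(subset)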

2. Adding/Deleting Columns
We can easily modify data by adding new columns or removing unwanted ones.
- Adding Columns: New columns can be added by assigning values to a new column
name.
- Deleting Columns: Columns can be removed using the .drop() function.

3. Renaming Columns and Index


Renaming is useful when column names or index labels need to be changed for clarity
or consistency.
- Using .rename(): This function allows you to rename column names or row index
labels.

Real-World Use in Companies:


Companies like Airbnb use DataFrame manipulation to clean and transform user
listing data. For example, Airbnb might rename columns, filter listings based on
location, or add new columns to calculate metrics like price per night. These steps
help in analyzing trends and improving services for hosts and guests.

1. Select Rows by Index Name (.loc[])


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
selected_row = df.loc[1]  # Select the row with index label 1 (Anna)
print(selected_row)

2. Select Rows by Position (.iloc[])


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
selected_row = df.iloc[2]  # Select the row at position 2 (Peter)
print(selected_row)

3. Conditional Filtering
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
filtered_data = df[df['Age'] > 30]  # Select rows where Age > 30
print(filtered_data)
4. Add a New Column
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
df['City'] = ['New York', 'Boston', 'Chicago']  # Adding a new column
print(df)

5. Delete a Column
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
df = df.drop('City', axis=1)  # Remove the 'City' column
print(df)

6. Rename Columns
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
df = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})  # Renaming columns
print(df)

7. Select Multiple Columns by Label (.loc[])


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
selected_columns = df.loc[:, ['Name', 'City']]  # Select the 'Name' and 'City' columns
print(selected_columns)

8. Select Multiple Columns by Position (.iloc[])


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
selected_columns = df.iloc[:, [0, 2]]  # Select the first and third columns
print(selected_columns)

9. Filter Rows Based on Multiple Conditions


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'City': ['New York', 'Boston', 'Chicago']}
df = pd.DataFrame(data)
filtered_data = df[(df['Age'] > 25) & (df['City'] == 'New York')]  # Age > 25 and lives in New York
print(filtered_data)

10. Rename Index Labels


python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
df = df.rename(index={0: 'A', 1: 'B', 2: 'C'})  # Renaming row index labels
print(df)

Chapter 4: Data Cleaning


- Handling Missing Data: Detecting and filling missing values with isna(),
fillna(), and dropna().
- Duplicates: Identifying and removing duplicate rows.
- Data Type Conversion: Converting data types with .astype().
- Use in Companies: Data cleaning is essential for data-driven firms like Uber to
ensure the quality of their datasets (e.g., handling missing data from rider or
driver inputs).

1. Handling Missing Data


When data is incomplete, it often has missing values that need to be managed. This
is important because missing data can affect the results of analysis.
- Detecting Missing Values (isna()): This function helps to find if any data is
missing (shows True where data is missing).
- Filling Missing Values (fillna()): This function allows you to fill missing
data with a specific value (e.g., fill with 0 or the average).
- Removing Missing Values (dropna()): This function removes rows or columns with
missing data.

2. Duplicates
Sometimes data may have duplicate rows, which can lead to inaccurate results.
- Identifying Duplicates: We can find duplicate rows using the .duplicated()
function (shows True for duplicates).
- Removing Duplicates: We can remove duplicates using the .drop_duplicates() function (a subset-based variation is sketched after this list).
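
As a sketch of a common variation (the subset and keep arguments are standard pandas options; the data here is made up), duplicates can be judged on selected columns only:

import pandas as pd

data = {'Name': ['John', 'Anna', 'John'], 'Age': [28, 22, 30]}
df = pd.DataFrame(data)

# Treat rows as duplicates when 'Name' matches, keeping the first occurrence
df_unique = df.drop_duplicates(subset=['Name'], keep='first')
print(df_unique)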

3. Data Type Conversion


Data in different columns may not always be in the correct type (e.g., numbers
stored as text). Converting data types is important to ensure calculations and
operations work as expected.
- Converting Data Types (astype()): This function is used to change the data
type of a column (e.g., convert text to numbers or dates).

Real-World Use in Companies:


Companies like Uber need to ensure data quality for better decision-making. For
example, rider and driver data may have missing or incorrect entries (like missing
pickup locations or incorrect driver ratings). Data cleaning helps fix these issues
so that analysis and predictions are accurate.
1. Detect Missing Values

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
print(df.isna())

2. Fill Missing Values with 0

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_filled = df.fillna(0)
print(df_filled)

3. Fill Missing Values with a Specific Value

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_filled = df.fillna('Unknown')
print(df_filled)

4. Remove Rows with Missing Values

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', None], 'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
df_cleaned = df.dropna()
print(df_cleaned)

5. Identify Duplicates

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'John'], 'Age': [28, 22, 35, 28]}
df = pd.DataFrame(data)
print(df.duplicated())

6. Remove Duplicates

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'John'], 'Age': [28, 22, 35, 28]}
df = pd.DataFrame(data)
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

7. Convert Data Type (Text to Numbers)


import pandas as pd

data = {'Name': ['John', 'Anna'], 'Age': ['28', '22']}


df = pd.DataFrame(data)
df['Age'] = df['Age'].astype(int)
print(df)

8. Convert Data Type (Text to Date)

import pandas as pd

data = {'Date': ['2023-01-01', '2023-02-01']}


df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
print(df)

9. Fill Missing Values with Column Average

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'Lucy'], 'Age': [28, 22, None, 30]}
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

10. Remove Columns with Missing Values

import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 22, 35], 'Location': [None, 'NY', 'CA']}
df = pd.DataFrame(data)
df_cleaned = df.dropna(axis=1)
print(df_cleaned)

Chapter 5: Data Exploration


- Descriptive Statistics: Summarizing data
with .describe(), .sum(), .mean(), .count(), .median(), etc.
- Value Counts and Sorting: Using .value_counts() and .sort_values().
- Group By: Grouping data with .groupby() for aggregating statistics.
- Use in Companies: Firms like Spotify analyze user data using groupby to compute insights like top artists by region or time.

Data exploration is a critical step in understanding your dataset. It involves summarizing and analyzing the data to derive insights, detect patterns, and identify potential issues before proceeding to modeling or further analysis.
1. Descriptive Statistics

Descriptive statistics provide a summary of the central tendency, dispersion, and shape of a dataset's distribution. They help in understanding the characteristics of the data.

- **Summarizing Data (.describe()):** This function generates descriptive statistics for numerical columns, including count, mean, standard deviation, min and max values, and quartiles.

- **Sum of Values (.sum()):** This function returns the sum of the values in a specified column, useful for computing totals (e.g., total sales).

- **Mean Value (.mean()):** This function calculates the average of the values in a
specified column, providing insight into the central tendency.

- **Count of Values (.count()):** This function counts the number of non-null entries in a specified column, useful for understanding how much data is present.

- **Median Value (.median()):** This function calculates the median, which is the
middle value when data is sorted, offering a robust measure of central tendency.

2. Value Counts and Sorting

Understanding the frequency of unique values in a column is essential for categorical data analysis.

- **Value Counts (.value_counts()):** This function counts the occurrences of each unique value in a specified column, helping identify the distribution of categorical data.

- **Sorting Values (.sort_values()):** This function sorts the data based on the values in a specified column, making it easier to identify trends and outliers (a descending sort is sketched after this list).
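
For example, a descending sort (a minimal sketch with made-up data) uses the ascending parameter:

python
import pandas as pd

data = {'Age': [28, 22, 35, 30]}
df = pd.DataFrame(data)
print(df.sort_values(by='Age', ascending=False))  # Largest ages first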

3. Group By

Grouping data is crucial for aggregating statistics and performing operations on subsets of the dataset.

- **Grouping Data (.groupby()):** This function allows you to group the data based
on one or more columns and apply aggregate functions to summarize the data. For
example, it can compute the mean or count for each group, providing insights into
various segments.

Real-World Use in Companies

Companies like Spotify analyze user data using groupby to compute insights like top
artists by region or time. By grouping data, they can identify user preferences,
trends, and areas for targeted marketing.
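
A minimal sketch along those lines (the 'Region' and 'Artist' columns and their values are made up for illustration, not real Spotify data): grouping by more than one column and averaging within each group.

python
import pandas as pd

data = {
    'Region': ['US', 'US', 'EU', 'EU'],
    'Artist': ['A', 'B', 'A', 'B'],
    'Streams': [100, 150, 200, 250]
}
df = pd.DataFrame(data)

# Average streams per artist within each region
grouped = df.groupby(['Region', 'Artist'])['Streams'].mean()
print(grouped)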

1. **Descriptive Statistics**
python
import pandas as pd
data = {'Age': [28, 22, 35, 30, 25]}
df = pd.DataFrame(data)
print(df.describe())

2. **Sum of Values**
python
import pandas as pd

data = {'Sales': [200, 150, 300]}


df = pd.DataFrame(data)
print(df['Sales'].sum())

3. **Mean Value**
python
import pandas as pd

data = {'Age': [28, 22, 35, 30]}


df = pd.DataFrame(data)
print(df['Age'].mean())

4. **Count of Values**
python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter']}


df = pd.DataFrame(data)
print(df['Name'].count())

5. **Median Value**
python
import pandas as pd

data = {'Age': [28, 22, 35, 30]}


df = pd.DataFrame(data)
print(df['Age'].median())

6. **Value Counts**
python
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Red', 'Green']}


df = pd.DataFrame(data)
print(df['Color'].value_counts())

7. **Sorting Values**
python
import pandas as pd

data = {'Age': [28, 22, 35, 30]}


df = pd.DataFrame(data)
print(df.sort_values(by='Age'))
8. **Group By Example**
python
import pandas as pd

data = {
'Artist': ['A', 'B', 'A', 'C', 'B'],
'Streams': [100, 150, 200, 300, 250]
}
df = pd.DataFrame(data)
grouped = df.groupby('Artist')['Streams'].sum()
print(grouped)
