DevOps Session 3 Pandas

The document provides an overview of Python libraries, specifically focusing on Pandas for data engineering. It covers key concepts such as DataFrames, Series, data selection, filtering, cleaning, transformation, and common operations for data manipulation. Additionally, it discusses performance optimization techniques for handling large datasets and best practices for efficient data processing.


Data Engineering

Professional Training
Libraries in Python
Libraries in Python are collections of modules and packages that provide pre-written code to perform various tasks. They
help simplify coding by providing reusable functions and classes for specific functionalities, such as data analysis, machine
learning, web development, and more.
Python Library – Pandas
Key Concepts

DataFrame: We use DataFrames like a virtual spreadsheet. Imagine we have a table with columns like 'Cars' and 'In stock'. Each column can hold a different type of information, such as the car brand and the amount in stock. We can easily perform operations on this table, like finding the total number of cars in stock.

Series: A Series is like a single column in our table. For example, the 'Cars' column is a Series containing names like 'VW', 'Ford' and 'Toyota'. We can work with Series individually or as part of a DataFrame.
Key Concepts

Index: Think of an index as a label for each row or column. It helps us identify specific data points. For instance, if we have a list of car brands, the index could be their assigned number.

Selection and Slicing: To get specific data, we use selection and slicing. We can say, "Give me all VW cars in stock", and Pandas will fetch them.

Data Cleaning: Pandas helps us tidy up data. If there are missing values (like empty cells), we can either fill them or remove them. We can also drop columns we don't need.

Data Transformation: Sometimes we need to reshape or merge data to make it suitable for analysis. For instance, if we have data on sales per month and want to see the total sales for the year, Pandas makes that transformation easy.
Key Concepts

Filtering and Sorting: If we want to find all brands with more than 20 cars in stock, Pandas can do it. It can also sort data by any other criteria.

Aggregation and Grouping: Let's say we want to find the average number of cars in stock on a specific day. We can group data by day and calculate the averages.

Data Input/Output: We can load data from files (like CSV or Excel) and save our work for later. This way, we can share our analyses with others.

Visualization: Although Pandas doesn't create fancy charts, it works with visualization libraries like Matplotlib to show our data in graphs and plots. A short sketch tying these concepts together follows below.
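
A minimal sketch of the cars-in-stock example above (all values are made up for illustration):

import pandas as pd

# Build a small DataFrame matching the cars-in-stock example
df = pd.DataFrame({
    'Cars': ['VW', 'Ford', 'Toyota'],
    'In stock': [25, 12, 30],
})

# Selection: "Give me all VW cars in stock"
vw = df[df['Cars'] == 'VW']

# Filtering and sorting: brands with more than 20 cars, highest first
well_stocked = df[df['In stock'] > 20].sort_values('In stock', ascending=False)

# Aggregation: total number of cars in stock
total = df['In stock'].sum()  # 67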
Dataframes in Pandas
pandas.DataFrame
• Two-dimensional, size-mutable, potentially heterogeneous tabular data.

• Data structure also contains labeled axes (rows and columns).

• Arithmetic operations align on both row and column labels.

• Can be thought of as a dict-like container for Series objects.

• The primary pandas data structure.


Parameters

data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order.

index: Index or array-like
Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.

columns: Index or array-like
Column labels to use for the resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.

dtype: dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.

copy: bool or None, default None
Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.
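
A brief sketch exercising these constructor parameters; the names (r1, r2, a, b, x, y) are arbitrary illustrations:

import numpy as np
import pandas as pd

# 2d ndarray data with explicit index, columns, and a forced dtype
df = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                  index=['r1', 'r2'],
                  columns=['a', 'b'],
                  dtype='float32')

# Dict of Series: each Series is aligned by its own index, so the
# frame's index is the union ('r1', 'r2', 'r3'), with NaN where missing
s1 = pd.Series([10, 20], index=['r1', 'r2'])
s2 = pd.Series([30, 40], index=['r2', 'r3'])
df2 = pd.DataFrame({'x': s1, 'y': s2})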
Creating a Dataframe
To create a DataFrame, one common approach is to first create a dictionary.

A dictionary is a collection of values linked to keys.

The keys are separated from their values with colons, and the whole mapping is enclosed in curly braces, as shown in the example below.

In this case, the dictionary keys become the column names of the DataFrame. The key would be "Grades" and the values would be "A, B, C, D, F".
Creating a Dataframe
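
A minimal sketch of the Grades example described above:

import pandas as pd

# The dictionary key becomes the column name;
# its values become the rows of that column
grades = {'Grades': ['A', 'B', 'C', 'D', 'F']}

df = pd.DataFrame(grades)
print(df)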
Dictionary Methods

Method        Usage

values()      Returns a list of all values in the dictionary

update()      Updates the dictionary with the specified key-value pairs

setdefault()  Returns the value of the specified key; if the key does not exist, inserts the key with the specified value

clear()       Removes all the elements from the dictionary

keys()        Returns a list containing the keys of the dictionary

pop()         Removes the element with the specified key


Dictionary Methods

Method      Usage

popitem()   Removes the last inserted key-value pair

get()       Returns the value of the specified key

items()     Returns a list containing a tuple for each key-value pair

copy()      Returns a copy of the dictionary

fromkeys()  Returns a dictionary with the specified keys and value
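
A quick sketch exercising a few of these methods on the Grades dictionary from earlier:

grades = {'Grades': ['A', 'B', 'C', 'D', 'F']}

print(grades.keys())      # dict_keys(['Grades'])
print(grades.values())    # dict_values([['A', 'B', 'C', 'D', 'F']])

grades.update({'Passing': True})   # add a key-value pair
grades.setdefault('Scale', 'US')   # inserted, since 'Scale' was missing
removed = grades.pop('Passing')    # removes the key and returns True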


Common Dataframe Operations
Selecting Data: .loc[], .iloc[], []

Filtering Data: Conditional filters, .query()

Sorting: .sort_values()

Modifying Columns: Adding new columns, .apply()

Handling Missing Data: .isnull(), .fillna(), .dropna()

Aggregating Data: .groupby(), .agg()

Merging DataFrames: .merge(), .concat()

Reshaping Data: .pivot(), .melt()

Time Series Operations: .to_datetime(), .resample()

Optimization: Using vectorized operations & category dtype


Advanced Data Wrangling and Manipulation using Pandas
Loading and Inspecting Data
Reading Data:
• pd.read_csv('file.csv')

• pd.read_excel('file.xlsx')

• pd.read_json('file.json')

• pd.read_sql(query, conn)

Inspecting Data:
• .head(), .tail()

• .info(), .describe()

• .shape, .columns, .dtypes
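
A minimal sketch of loading and inspecting a file; 'sales.csv' is a hypothetical file used only for illustration:

import pandas as pd

df = pd.read_csv('sales.csv')

print(df.head())       # first 5 rows
print(df.shape)        # (n_rows, n_columns)
print(df.dtypes)       # column data types
df.info()              # dtypes plus non-null counts and memory usage
print(df.describe())   # summary statistics for numeric columns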


Data Cleaning
Handling Missing Values:
• Detect: .isnull().sum()

• Drop: .dropna()

• Fill: .fillna(value)

Handling Duplicates:
• Check: .duplicated()

• Remove: .drop_duplicates()

Changing Data Types:


• Convert: .astype(type)

Renaming Columns:

• .rename(columns={'old': 'new'})
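
A short sketch of these cleaning steps, assuming hypothetical 'price' and 'qty' columns in the sales data:

print(df.isnull().sum())            # missing values per column

df = df.dropna(subset=['price'])    # drop rows with no price
df['qty'] = df['qty'].fillna(0)     # fill missing quantities with 0

print(df.duplicated().sum())        # count duplicate rows
df = df.drop_duplicates()           # then remove them

df['qty'] = df['qty'].astype(int)            # change a column's dtype
df = df.rename(columns={'qty': 'quantity'})  # rename a column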
Data Transformation & Manipulation
Selecting & Filtering Data:
• .loc[row_label, col_label]

• .iloc[row_index, col_index]

• Boolean indexing: df[df['column'] > value]

Advanced Filtering Techniques:


• Multi-condition filtering: df[(df['col1'] > value) & (df['col2'] < value)]

• Query method: df.query('col1 > value & col2 < value')

Sorting Data:
• .sort_values(by='column', ascending=True)

Adding & Modifying Columns:


• Creating new columns: df['new_col'] = df['col1'] + df['col2']

• Applying functions: .apply(function), .map(dictionary)
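
A sketch of these selection, filtering, and column operations on the same hypothetical data (assumes a row labeled 0 exists):

first_price = df.loc[0, 'price']   # label-based: row label 0, column 'price'
first_cell = df.iloc[0, 0]         # position-based: first row, first column

cheap = df[df['price'] < 10]                        # boolean indexing
mid = df[(df['price'] > 10) & (df['price'] < 100)]  # multi-condition
same = df.query('price > 10 & price < 100')         # equivalent via .query()

df = df.sort_values(by='price', ascending=False)
df['revenue'] = df['price'] * df['quantity']        # derived column
df['band'] = df['price'].apply(lambda p: 'high' if p > 50 else 'low')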


Data Transformation & Manipulation
Grouping & Aggregation:
• .groupby('column').agg({'col': 'sum'})

Complex Aggregation:
• Using multiple functions: .groupby('column').agg({'col1': ['sum', 'mean'], 'col2': 'max'})

• Custom aggregation functions using lambda

Merging & Joining DataFrames:


• .merge(df1, df2, on='key')

• .concat([df1, df2], axis=0)

Advanced Merging & Joining:


• Merging on multiple keys: .merge(df1, df2, on=['key1', 'key2'])

• Handling different join types: left, right, outer, inner
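
A minimal sketch of grouping and merging with two small made-up frames:

import pandas as pd

orders = pd.DataFrame({'key': [1, 1, 2], 'amount': [100, 50, 250]})
users = pd.DataFrame({'key': [1, 2], 'name': ['Ana', 'Bo']})

totals = orders.groupby('key').agg({'amount': 'sum'})
stats = orders.groupby('key').agg({'amount': ['sum', 'mean']})
spread = orders.groupby('key')['amount'].agg(lambda s: s.max() - s.min())

merged = pd.merge(orders, users, on='key', how='left')  # left join
stacked = pd.concat([orders, orders], axis=0)           # row-wise concat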


Data Reshaping & Advanced Indexing
Melting & Pivoting:
• .melt(id_vars=['id'], value_vars=['A', 'B'])

• .pivot(index='id', columns='var', values='value')

MultiIndexing:
• Setting multiple indexes: .set_index(['col1', 'col2'])

• Accessing data using .loc[] with MultiIndex

Stacking & Unstacking:


• .stack() & .unstack() for reshaping data
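
A short sketch of reshaping a small made-up frame both ways:

import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'A': [10, 20], 'B': [30, 40]})

# Melt: wide -> long
long = wide.melt(id_vars=['id'], value_vars=['A', 'B'],
                 var_name='var', value_name='value')

# Pivot: long -> wide again
back = long.pivot(index='id', columns='var', values='value')

# MultiIndex access, then unstack back to the wide shape
mi = long.set_index(['id', 'var'])
print(mi.loc[(1, 'A')])
unstacked = mi['value'].unstack()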
Working with Time Series Data
Converting to DateTime:
• pd.to_datetime(df['date_column'])

Extracting Date Components:


• .dt.year, .dt.month, .dt.day

Resampling & Rolling Windows:


• .resample('M').sum()

• .rolling(window=3).mean()

Time Zone Handling

• .tz_localize('UTC') & .tz_convert('US/Eastern')
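
A minimal time-series sketch with made-up dates and sales figures:

import pandas as pd

ts = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-10'],
    'sales': [100, 150, 200],
})

ts['date'] = pd.to_datetime(ts['date'])   # strings -> datetimes
ts['month'] = ts['date'].dt.month         # extract a date component

ts = ts.set_index('date')
monthly = ts['sales'].resample('M').sum()       # monthly totals
rolling = ts['sales'].rolling(window=2).mean()  # 2-point rolling mean

utc = ts.tz_localize('UTC')               # attach a time zone
eastern = utc.tz_convert('US/Eastern')    # convert to another zone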


Performance Optimization
• Use category dtype for memory efficiency

• Vectorized operations vs. loops (prefer vectorized operations, and .apply() over explicit loops)

• Use .apply() efficiently instead of row-wise iterations

• Optimizing with NumPy:

• Using .values for faster computations

• Leveraging numba for performance boosts
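
A sketch contrasting these techniques on a made-up frame; the size and column names are arbitrary:

import numpy as np
import pandas as pd

n = 300_000
df = pd.DataFrame({'city': np.random.choice(['NY', 'LA', 'SF'], n),
                   'amount': np.random.rand(n)})

# category dtype: large memory savings for low-cardinality strings
df['city'] = df['city'].astype('category')
print(df.memory_usage(deep=True))

# Vectorized arithmetic instead of a Python-level loop or .apply()
df['with_tax'] = df['amount'] * 1.2

# Drop to the underlying NumPy array for raw numeric speed
total = df['amount'].values.sum()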


Exporting Data
Saving Data:
• .to_csv('output.csv')

• .to_excel('output.xlsx')

• .to_sql('table', conn, if_exists='replace')


Handling Large Datasets and Optimizing Performance in Pandas
1. Selective Column Loading
The most straightforward way to improve performance is to load only the columns you need:

import pandas as pd

# Instead of reading all columns

df_full = pd.read_csv('large_dataset.csv')

# Specify only the columns you need

selected_columns = ['user_id', 'age', 'income', 'purchase_amount']

df_optimized = pd.read_csv('large_dataset.csv', usecols=selected_columns)


2. Chunking Large Datasets
For extremely large files, use the chunksize parameter to read data in manageable chunks:

# Read 10,000 rows at a time

chunk_size = 10000

for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):

    # Process each chunk; process_data is a placeholder for your own logic
    processed_chunk = process_data(chunk)

    # Operations run per chunk, without loading the entire file into memory


3. Deep Dive: usecols
The usecols parameter is a powerful tool in pandas for optimizing data loading. It allows you to select specific columns,
dramatically reducing memory usage and loading time.

Basic Usage

# Load only specific columns

df = pd.read_csv('large_dataset.csv',
                 usecols=['user_id', 'age', 'income'])

Advanced usecols Techniques

• Using Column Indices

# Select columns by index
df = pd.read_csv('large_dataset.csv',
                 usecols=[0, 2, 5])  # first, third, and sixth columns
3. Deep Dive: usecols
• Dynamic Column Selection

# Select columns dynamically with a callable
def column_selector(column_name):
    return column_name.startswith('user_') or \
           column_name in ['age', 'income']

df = pd.read_csv('large_dataset.csv',
                 usecols=column_selector)
3. Deep Dive: usecols
• Combining with Dtype Optimization

# Specify both usecols and dtypes
dtypes = {
    'user_id': 'int32',
    'age': 'int8',
    'income': 'float32',
}

df = pd.read_csv('large_dataset.csv',
                 usecols=dtypes.keys(),
                 dtype=dtypes)
4. Leveraging dask for Larger-than-Memory Datasets
For datasets too large to fit in memory, consider dask:

import dask.dataframe as dd

# Read large CSV using dask

ddf = dd.read_csv('large_dataset.csv',
                  usecols=['user_id', 'age', 'income'])

# Operations are built lazily; .compute() triggers execution
result = ddf.groupby('age').income.mean().compute()
Best Practices

Profile Your Data: Use df.info() to understand your dataset's structure.

Memory Matters: Monitor memory usage with df.memory_usage().

Choose Wisely: Select only the columns crucial for your analysis.

Consider Alternative Libraries: Use dask or vaex for extremely large datasets.
Let’s move on to the lab

• Open Databricks Environment

• Login to Your Account

• Open your workspace

• Open the "Lab#3: Python for Data Engineers – Part 2" notebook
