DevOps Session 3 Pandas

The document provides an overview of Python libraries, specifically focusing on Pandas for data engineering. It covers key concepts such as DataFrames, Series, data selection, filtering, cleaning, transformation, and common operations for data manipulation. Additionally, it discusses performance optimization techniques for handling large datasets and best practices for efficient data processing.


Data Engineering

Professional Training
Libraries in Python
Libraries in Python are collections of modules and packages that provide pre-written code to perform various tasks. They
help simplify coding by providing reusable functions and classes for specific functionalities, such as data analysis, machine
learning, web development, and more.
Python Library – Pandas
Key Concepts

DataFrame: We use DataFrames like a virtual spreadsheet. Imagine we have a table with columns like 'Cars' and 'In stock'. Each column can hold a different type of information, such as the car brand and the amount in stock. We can easily perform operations on this table, like finding the total number of cars in stock.

Series: A Series is like a single column in our table. For example, the 'Cars' column is a Series containing names like 'VW', 'Ford' and 'Toyota'. We can work with Series individually or as part of a DataFrame.
Key Concepts

Index: Think of an index as a label for each row or column. It helps us identify specific data points. For instance, if we have a list of car brands, the index could be their assigned number.

Selection and Slicing: To get specific data, we use selection and slicing. We can say, "Give me all VW cars in stock", and Pandas will fetch them.

Data Cleaning: Pandas helps us tidy up data. If there are missing values (like empty cells), we can either fill them or remove them. We can also drop columns we don't need.

Data Transformation: Sometimes we need to reshape or merge data to make it suitable for analysis. For instance, if we have data on sales per month and want to see the total sales for the year, Pandas makes that transformation easy.
Key Concepts

Filtering and Sorting: If we want to find all brands with more than 20 cars in stock, Pandas can do it. It can also sort data by any other criteria.

Aggregation and Grouping: Let's say we want to find the average number of cars in stock on a specific day. We can group data by day and calculate the averages.

Data Input/Output: We can load data from files (like CSV or Excel) and save our work for later. This way, we can share our analyses with others.

Visualization: Although Pandas doesn't create fancy charts, it works with visualization libraries like Matplotlib to show our data in graphs and plots. A short sketch tying these concepts together follows below.
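
A minimal sketch of the cars-in-stock example above (all values are made up for illustration):

import pandas as pd

# Build a small DataFrame matching the cars-in-stock example
df = pd.DataFrame({
    'Cars': ['VW', 'Ford', 'Toyota'],
    'In stock': [25, 12, 30],
})

# Selection: "Give me all VW cars in stock"
vw = df[df['Cars'] == 'VW']

# Filtering and sorting: brands with more than 20 cars, highest first
well_stocked = df[df['In stock'] > 20].sort_values('In stock', ascending=False)

# Aggregation: total number of cars in stock
total = df['In stock'].sum()  # 67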
Dataframes in Pandas
pandas.DataFrame
• Two-dimensional, size-mutable, potentially heterogeneous tabular data.

• Data structure also contains labeled axes (rows and columns).

• Arithmetic operations align on both row and column labels.

• Can be thought of as a dict-like container for Series objects.

• The primary pandas data structure.


Parameters

data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs. If data is a list of dicts, column order follows insertion-order.

index: Index or array-like
Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.

columns: Index or array-like
Column labels to use for the resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.

dtype: dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.

copy: bool or None, default None
Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.
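
A brief sketch exercising these constructor parameters; the names (r1, r2, a, b, x, y) are arbitrary illustrations:

import numpy as np
import pandas as pd

# 2d ndarray data with explicit index, columns, and a forced dtype
df = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                  index=['r1', 'r2'],
                  columns=['a', 'b'],
                  dtype='float32')

# Dict of Series: each Series is aligned by its own index, so the
# frame's index is the union ('r1', 'r2', 'r3'), with NaN where missing
s1 = pd.Series([10, 20], index=['r1', 'r2'])
s2 = pd.Series([30, 40], index=['r2', 'r3'])
df2 = pd.DataFrame({'x': s1, 'y': s2})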
Creating a Dataframe
To create a DataFrame, one common approach is to first create a dictionary.

A dictionary is a collection of values linked to keys.

The keys are separated from their values with colons, and the whole mapping is enclosed in curly braces, as shown in the example below.

In this case, the dictionary keys become the column names of the DataFrame. The key would be "Grades" and the values would be "A, B, C, D, F".
Creating a Dataframe
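
A minimal sketch of the Grades example described above:

import pandas as pd

# The dictionary key becomes the column name;
# its values become the rows of that column
grades = {'Grades': ['A', 'B', 'C', 'D', 'F']}

df = pd.DataFrame(grades)
print(df)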
Dictionary Methods

Method        Usage

values()      Returns a list of all values in the dictionary

update()      Updates the dictionary with the specified key-value pairs

setdefault()  Returns the value of the specified key; if the key does not exist, inserts the key with the specified value

clear()       Removes all the elements from the dictionary

keys()        Returns a list containing the keys of the dictionary

pop()         Removes the element with the specified key


Dictionary Methods

Method      Usage

popitem()   Removes the last inserted key-value pair

get()       Returns the value of the specified key

items()     Returns a list containing a tuple for each key-value pair

copy()      Returns a copy of the dictionary

fromkeys()  Returns a dictionary with the specified keys and value
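
A quick sketch exercising a few of these methods on the Grades dictionary from earlier:

grades = {'Grades': ['A', 'B', 'C', 'D', 'F']}

print(grades.keys())      # dict_keys(['Grades'])
print(grades.values())    # dict_values([['A', 'B', 'C', 'D', 'F']])

grades.update({'Passing': True})   # add a key-value pair
grades.setdefault('Scale', 'US')   # inserted, since 'Scale' was missing
removed = grades.pop('Passing')    # removes the key and returns True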


Common Dataframe Operations
Selecting Data: .loc[], .iloc[], []

Filtering Data: Conditional filters, .query()

Sorting: .sort_values()

Modifying Columns: Adding new columns, .apply()

Handling Missing Data: .isnull(), .fillna(), .dropna()

Aggregating Data: .groupby(), .agg()

Merging DataFrames: .merge(), .concat()

Reshaping Data: .pivot(), .melt()

Time Series Operations: .to_datetime(), .resample()

Optimization: Using vectorized operations & category dtype


Advanced Data Wrangling and Manipulation using Pandas
Loading and Inspecting Data
Reading Data:
• pd.read_csv('file.csv')

• pd.read_excel('file.xlsx')

• pd.read_json('file.json')

• pd.read_sql(query, conn)

Inspecting Data:
• .head(), .tail()

• .info(), .describe()

• .shape, .columns, .dtypes
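
A minimal sketch of loading and inspecting a file; 'sales.csv' is a hypothetical file used only for illustration:

import pandas as pd

df = pd.read_csv('sales.csv')

print(df.head())       # first 5 rows
print(df.shape)        # (n_rows, n_columns)
print(df.dtypes)       # column data types
df.info()              # dtypes plus non-null counts and memory usage
print(df.describe())   # summary statistics for numeric columns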


Data Cleaning
Handling Missing Values:
• Detect: .isnull().sum()

• Drop: .dropna()

• Fill: .fillna(value)

Handling Duplicates:
• Check: .duplicated()

• Remove: .drop_duplicates()

Changing Data Types:


• Convert: .astype(type)

Renaming Columns:

• .rename(columns={'old': 'new'})
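
A short sketch of these cleaning steps, assuming hypothetical 'price' and 'qty' columns in the sales data:

print(df.isnull().sum())            # missing values per column

df = df.dropna(subset=['price'])    # drop rows with no price
df['qty'] = df['qty'].fillna(0)     # fill missing quantities with 0

print(df.duplicated().sum())        # count duplicate rows
df = df.drop_duplicates()           # then remove them

df['qty'] = df['qty'].astype(int)            # change a column's dtype
df = df.rename(columns={'qty': 'quantity'})  # rename a column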
Data Transformation & Manipulation
Selecting & Filtering Data:
• .loc[row_label, col_label]

• .iloc[row_index, col_index]

• Boolean indexing: df[df['column'] > value]

Advanced Filtering Techniques:


• Multi-condition filtering: df[(df['col1'] > value) & (df['col2'] < value)]

• Query method: df.query('col1 > value & col2 < value')

Sorting Data:
• .sort_values(by='column', ascending=True)

Adding & Modifying Columns:


• Creating new columns: df['new_col'] = df['col1'] + df['col2']

• Applying functions: .apply(function), .map(dictionary)
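
A sketch of these selection, filtering, and column operations on the same hypothetical data (assumes a row labeled 0 exists):

first_price = df.loc[0, 'price']   # label-based: row label 0, column 'price'
first_cell = df.iloc[0, 0]         # position-based: first row, first column

cheap = df[df['price'] < 10]                        # boolean indexing
mid = df[(df['price'] > 10) & (df['price'] < 100)]  # multi-condition
same = df.query('price > 10 & price < 100')         # equivalent via .query()

df = df.sort_values(by='price', ascending=False)
df['revenue'] = df['price'] * df['quantity']        # derived column
df['band'] = df['price'].apply(lambda p: 'high' if p > 50 else 'low')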


Data Transformation & Manipulation
Grouping & Aggregation:
• .groupby('column').agg({'col': 'sum'})

Complex Aggregation:
• Using multiple functions: .groupby('column').agg({'col1': ['sum', 'mean'], 'col2': 'max'})

• Custom aggregation functions using lambda

Merging & Joining DataFrames:


• .merge(df1, df2, on='key')

• .concat([df1, df2], axis=0)

Advanced Merging & Joining:


• Merging on multiple keys: .merge(df1, df2, on=['key1', 'key2'])

• Handling different join types: left, right, outer, inner
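
A minimal sketch of grouping and merging with two small made-up frames:

import pandas as pd

orders = pd.DataFrame({'key': [1, 1, 2], 'amount': [100, 50, 250]})
users = pd.DataFrame({'key': [1, 2], 'name': ['Ana', 'Bo']})

totals = orders.groupby('key').agg({'amount': 'sum'})
stats = orders.groupby('key').agg({'amount': ['sum', 'mean']})
spread = orders.groupby('key')['amount'].agg(lambda s: s.max() - s.min())

merged = pd.merge(orders, users, on='key', how='left')  # left join
stacked = pd.concat([orders, orders], axis=0)           # row-wise concat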


Data Reshaping & Advanced Indexing
Melting & Pivoting:
• .melt(id_vars=['id'], value_vars=['A', 'B'])

• .pivot(index='id', columns='var', values='value')

MultiIndexing:
• Setting multiple indexes: .set_index(['col1', 'col2'])

• Accessing data using .loc[] with MultiIndex

Stacking & Unstacking:


• .stack() & .unstack() for reshaping data
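
A short sketch of reshaping a small made-up frame both ways:

import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'A': [10, 20], 'B': [30, 40]})

# Melt: wide -> long
long = wide.melt(id_vars=['id'], value_vars=['A', 'B'],
                 var_name='var', value_name='value')

# Pivot: long -> wide again
back = long.pivot(index='id', columns='var', values='value')

# MultiIndex access, then unstack back to the wide shape
mi = long.set_index(['id', 'var'])
print(mi.loc[(1, 'A')])
unstacked = mi['value'].unstack()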
Working with Time Series Data
Converting to DateTime:
• pd.to_datetime(df['date_column'])

Extracting Date Components:


• .dt.year, .dt.month, .dt.day

Resampling & Rolling Windows:


• .resample('M').sum()

• .rolling(window=3).mean()

Time Zone Handling

• .tz_localize('UTC') & .tz_convert('US/Eastern')
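
A minimal time-series sketch with made-up dates and sales figures:

import pandas as pd

ts = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-20', '2024-02-10'],
    'sales': [100, 150, 200],
})

ts['date'] = pd.to_datetime(ts['date'])   # strings -> datetimes
ts['month'] = ts['date'].dt.month         # extract a date component

ts = ts.set_index('date')
monthly = ts['sales'].resample('M').sum()       # monthly totals
rolling = ts['sales'].rolling(window=2).mean()  # 2-point rolling mean

utc = ts.tz_localize('UTC')               # attach a time zone
eastern = utc.tz_convert('US/Eastern')    # convert to another zone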


Performance Optimization
• Use category dtype for memory efficiency

• Vectorized operations vs. loops (prefer vectorized operations, and .apply() over explicit loops)

• Use .apply() efficiently instead of row-wise iterations

• Optimizing with NumPy:

• Using .values for faster computations

• Leveraging numba for performance boosts
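
A sketch contrasting these techniques on a made-up frame; the size and column names are arbitrary:

import numpy as np
import pandas as pd

n = 300_000
df = pd.DataFrame({'city': np.random.choice(['NY', 'LA', 'SF'], n),
                   'amount': np.random.rand(n)})

# category dtype: large memory savings for low-cardinality strings
df['city'] = df['city'].astype('category')
print(df.memory_usage(deep=True))

# Vectorized arithmetic instead of a Python-level loop or .apply()
df['with_tax'] = df['amount'] * 1.2

# Drop to the underlying NumPy array for raw numeric speed
total = df['amount'].values.sum()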


Exporting Data
Saving Data:
• .to_csv('output.csv')

• .to_excel('output.xlsx')

• .to_sql('table', conn, if_exists='replace')


Handling Large Datasets and Optimizing Performance in Pandas
1. Selective Column Loading
The most straightforward way to improve performance is to load only the columns you need:

import pandas as pd

# Instead of reading all columns

df_full = pd.read_csv('large_dataset.csv')

# Specify only the columns you need

selected_columns = ['user_id', 'age', 'income', 'purchase_amount']

df_optimized = pd.read_csv('large_dataset.csv', usecols=selected_columns)


2. Chunking Large Datasets
For extremely large files, use the chunksize parameter to read data in manageable chunks:

# Read 10,000 rows at a time

chunk_size = 10000

for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):

    # Process each chunk; process_data is a placeholder for your own logic
    processed_chunk = process_data(chunk)

    # Operations run per chunk, without loading the entire file into memory


3. Deep Dive: usecols
The usecols parameter is a powerful tool in pandas for optimizing data loading. It allows you to select specific columns,
dramatically reducing memory usage and loading time.

Basic Usage

# Load only specific columns

df = pd.read_csv('large_dataset.csv',
                 usecols=['user_id', 'age', 'income'])

Advanced usecols Techniques

• Using Column Indices

# Select columns by index
df = pd.read_csv('large_dataset.csv',
                 usecols=[0, 2, 5])  # first, third, and sixth columns
3. Deep Dive: usecols
• Dynamic Column Selection

# Select columns dynamically with a callable
def column_selector(column_name):
    return column_name.startswith('user_') or \
           column_name in ['age', 'income']

df = pd.read_csv('large_dataset.csv',
                 usecols=column_selector)
3. Deep Dive: usecols
• Combining with Dtype Optimization

# Specify both usecols and dtypes
dtypes = {
    'user_id': 'int32',
    'age': 'int8',
    'income': 'float32',
}

df = pd.read_csv('large_dataset.csv',
                 usecols=dtypes.keys(),
                 dtype=dtypes)
4. Leveraging dask for Larger-than-Memory Datasets
For datasets too large to fit in memory, consider dask:

import dask.dataframe as dd

# Read large CSV using dask

ddf = dd.read_csv('large_dataset.csv',
                  usecols=['user_id', 'age', 'income'])

# Operations are built lazily; .compute() triggers execution
result = ddf.groupby('age').income.mean().compute()
Best Practices

Profile Your Data: Use df.info() to understand your dataset's structure.

Memory Matters: Monitor memory usage with df.memory_usage().

Choose Wisely: Select only the columns crucial for your analysis.

Consider Alternative Libraries: Use dask or vaex for extremely large datasets.
Let’s move on to the lab

• Open Databricks Environment

• Login to Your Account

• Open your workspace

• Open the "Lab#3: Python for Data Engineers – Part 2" notebook
