DevOps Session 3 Pandas
DevOps Session 3 Pandas
Professional Training
Libraries in Python
Libraries in Python
Libraries in Python are collections of modules and packages that provide pre-written code to perform various tasks. They
help simplify coding by providing reusable functions and classes for specific functionalities, such as data analysis, machine
learning, web development, and more.
Python Library –
Pandas
Key Concepts
A Series is like a single column in our table. For example, the ‘Cars’ column is a
Series Series containing names like ‘VW’, ‘Ford’ and ‘Toyota.’ We can work with Series
individually or as part of a DataFrame.
Key Concepts
Think of an index as a label for each row or column. It helps us identify specific data
Index points. For instance, if we have a list of car brands, the index could be their assigned
number.
Selection and To get specific data, we use selection and slicing. We can say, “Give me all VW cars
Slicing in stock”, and Pandas will fetch them.
Pandas helps us tidy up data. If there are missing values (like empty cells), we can
Data Cleaning either fill them or remove them. We can also drop columns we don’t need.
Data Sometimes, we need to reshape or merge data to make it suitable for analysis. For
instance, if we have data on sales per month and want to see the total sales for the
Transformation year, Pandas makes that transformation easy.
Key Concepts
Filtering and If we want to find all which brand has more cars in stock than 20, Pandas can do it. It
Sorting can also sort data by any other criteria.
Aggregation Let’s say we want to find the average number of cars in stock on a specific day. We
and Grouping can group data by day and calculate the averages.
Data We can load data from files (like CSV or Excel) and save our work for later. This
Input/Output way, we can share our analyses with others.
Although Pandas doesn’t create fancy charts, it can work with visualization libraries
Visualization like Matplotlib to show our data in graphs and plots.
Dataframes in Pandas
pandas.DataFrame
• Two-dimensional, size-mutable, potentially heterogeneous tabular data.
The keys are separated from their values with colons and brackets as shown below.
In this case, the dictionary keys will become the column names for the DataFrame. The key would be “Grades” and the
values would be “A, B, C, D, F”.
Creating a Dataframe
Dictionary Methods
Methods Usage
Returns the value of the specified key. If the key does not exist insert the key, with the
setdefault() specified value
Methods Usage
items() Returns a list containing a tuple for each key value pair
Sorting: .sort_values()
• pd.read_excel('file.xlsx')
• pd.read_json('file.json')
• pd.read_sql(query, conn)
Inspecting Data:
• .head(), .tail()
• .info(), .describe()
• Drop: .dropna()
• Fill: .fillna(value)
Handling Duplicates:
• Check: .duplicated()
• Remove: .drop_duplicates()
Renaming Columns:
• .rename(columns={'old': 'new'})
Data Transformation & Manipulation
Selecting & Filtering Data:
• .loc[row_label, col_label]
• .iloc[row_index, col_index]
Sorting Data:
• .sort_values(by='column', ascending=True)
Complex Aggregation:
• Using multiple functions: .groupby('column').agg({'col1': ['sum', 'mean'], 'col2': 'max'})
MultiIndexing:
• Setting multiple indexes: .set_index(['col1', 'col2'])
• .rolling(window=3).mean()
• .to_excel('output.xlsx')
import pandas as pd
df_full = pd.read_csv('large_dataset.csv')
chunk_size = 10000
processed_chunk = process_data(chunk)
Basic Usage
df = pd.read_csv('large_dataset.csv',
def column_selector(column_name):
return column_name.startswith('user_') or \
df = pd.read_csv('large_dataset.csv',
usecols=column_selector)
3. Deep Dive: usecols
• Combining with Dtype Optimization
dtypes = {
user_id': 'int32',
'age': 'int8',
'income': 'float32'
df = pd.read_csv('large_dataset.csv',
usecols=dtypes.keys(),
dtype=dtypes)
4. Leveraging dask for Larger-than-Memory Datasets
For datasets too large to fit in memory, consider dask:
import dask.dataframe as dd
ddf = dd.read_csv('large_dataset.csv',
result = ddf.groupby('age').income.mean().compute()
Best Practices