PUNJAB COLLEGE FAISALABAD
(A CONSTITUENT COLLEGE FAISALABAD)
Assignment # 1
Course Title:
Programming of Big Data
Assignment Title:
Data Wrangling & Data Manipulation
with Pandas
Submitted to:
Prof. Muhammad Ali
Submitted by:
Muhammad Abdullah (082)
Class:
BSCS(6A)
Date of Submission: 13-May-2024
Data Wrangling & Data Manipulation with Pandas
Data Wrangling:
Data wrangling, also known as data munging, is the process of cleaning, structuring, and enriching
raw data into a desired format for better decision-making in less time. It involves transforming and
mapping data from its raw form into another format with the intent of making it more appropriate
and valuable for a variety of downstream purposes such as analytics. Data wrangling often
includes dealing with missing or inconsistent data, handling outliers, and converting data types.
Steps in Data Wrangling:
1. Data Loading:
o This entails extracting data from varied sources like CSV files, Excel spreadsheets,
SQL databases, JSON, HTML, etc., and importing them into Pandas DataFrame
objects.
import pandas as pd
# Load data from a JSON file into a DataFrame
df = pd.read_json('file_path.json')
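The file path above is a placeholder. As a runnable sketch of the same loading step, an in-memory CSV (via `io.StringIO`) can stand in for a real file on disk; `read_csv` accepts either:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a real file on disk;
# pd.read_csv works identically when given a file path.
csv_text = "name,score\nAli,90\nSara,85\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```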
2. Data Cleaning:
o Handling Missing Values: Identifying and dealing with missing data using methods
like dropping missing values (dropna()), filling missing values (fillna()), or
imputation techniques.
o Removing Duplicates: Detecting and removing duplicate rows from the dataset
using drop_duplicates().
o Correcting Errors: Identifying and correcting errors in data entries, such as typos or
inconsistencies.
o Dealing with Inconsistencies: Ensuring consistency in data format, units, and
conventions throughout the dataset.
df.dropna(inplace=True) # Drop rows with missing values
df.fillna(value, inplace=True) # Fill missing values with a specified value
df.drop_duplicates(inplace=True) # Remove duplicate rows
df['column'] = df['column'].apply(correct_error) # Apply a user-defined correction function
df['column'] = df['column'].str.lower() # Convert text to lowercase for consistency
3. Data Transformation:
o Converting Data Types: Changing the data type of columns to the appropriate
format using astype() or specific parsing functions.
o Normalizing Data: Scaling numerical data to a common scale to facilitate
comparison.
o Creating Derived Variables: Generating new features or variables from existing
ones through mathematical operations, transformations, or feature engineering.
o Handling Text Data: Processing and cleaning text data by removing punctuation,
stop words, or performing stemming/lemmatization.
o Handling Categorical Data: Encoding categorical variables into numerical
representations using techniques like one-hot encoding or label encoding.
# Converting Data Types
df['column'] = df['column'].astype('float64') # Convert data type of a column
# Normalizing Data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized_column'] = scaler.fit_transform(df[['column']])
# Creating Derived Variables
df['new_column'] = df['column1'] * df['column2'] # Create a new column based on existing columns
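The "Handling Categorical Data" step above has no snippet; a minimal one-hot encoding sketch using `pd.get_dummies` (with an illustrative `city` column) looks like this:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Lahore', 'Karachi', 'Lahore']})
# One-hot encode the categorical column into 0/1 indicator columns,
# one per distinct category value
encoded = pd.get_dummies(df, columns=['city'])
print(list(encoded.columns))  # ['city_Karachi', 'city_Lahore']
```

For ordinal categories, label encoding (e.g. mapping categories to integers with `astype('category').cat.codes`) is an alternative to one-hot encoding.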
4. Data Integration:
o Combining Datasets: Merging or concatenating multiple datasets with similar
structures but different observations or variables using merge() or concat().
o Resolving Schema Differences: Handling schema differences between datasets,
such as mismatched column names or data types, through renaming or type
conversion.
o Handling Data Redundancy: Identifying and removing redundant or irrelevant data
to streamline the dataset.
# Combining Datasets
merged_df = pd.merge(df1, df2, on='key_column', how='inner') # Merge two DataFrames on a common column
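The `concat()` option mentioned above can be sketched with two small illustrative DataFrames that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'id': [3, 4], 'value': [30, 40]})
# Stack datasets with identical columns vertically;
# ignore_index=True rebuilds a clean 0..n-1 index
combined = pd.concat([df1, df2], ignore_index=True)
print(len(combined))  # 4
```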
5. Data Reduction:
o Aggregating Data: Summarizing data by grouping it based on certain attributes and
calculating aggregate statistics using groupby() and aggregation functions like
sum(), mean(), count(), etc.
o Sampling Data: Selecting a representative subset of data for analysis using
techniques like random sampling or stratified sampling.
o Dimensionality Reduction: Reducing the number of features or variables in the
dataset through techniques like principal component analysis (PCA) or feature
selection.
# Aggregating Data
aggregated_df = df.groupby('category')['value'].sum() # Aggregate data by category
# Sampling Data
sample_df = df.sample(n=100, random_state=42) # Sample 100 random rows
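The dimensionality-reduction step above has no snippet. A minimal PCA sketch, assuming scikit-learn is available (it is already used for `MinMaxScaler` earlier) and using made-up feature columns:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({'f1': [1.0, 2.0, 3.0, 4.0],
                   'f2': [2.0, 4.0, 6.0, 8.0],
                   'f3': [5.0, 3.0, 1.0, -1.0]})
# Project three correlated features down to two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
print(reduced.shape)  # (4, 2)
```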
Data Manipulation:
Data manipulation refers to the process of transforming and reshaping data to extract insights or
prepare it for further analysis. This involves various operations such as filtering, sorting,
aggregating, merging, and pivoting data to derive meaningful conclusions. Data manipulation aims
to organize and structure data in a way that facilitates analysis and decision-making.
Steps in Data Manipulation:
1. Filtering Data:
o Selecting Rows: Filtering rows based on specific conditions using boolean indexing
or the query() method.
o Selecting Columns: Extracting specific columns from the dataset based on their
names or indices.
# Selecting Rows
filtered_df = df[df['column'] > 10] # Filter rows based on a condition
# Selecting Columns
selected_columns_df = df[['column1', 'column2']] # Select specific columns
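The `query()` method mentioned above expresses the same boolean filter as a string expression; a small sketch with illustrative columns:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 32, 19, 41], 'city': ['A', 'B', 'A', 'B']})
# query() evaluates a string expression against the columns,
# equivalent to df[(df['age'] > 20) & (df['city'] == 'A')]
adults_in_a = df.query("age > 20 and city == 'A'")
print(len(adults_in_a))  # 1
```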
2. Sorting Data:
o Sorting Rows: Arranging rows of the dataset in ascending or descending order
based on one or more columns using sort_values().
# Sorting Rows
sorted_df = df.sort_values(by='column', ascending=False) # Sort DataFrame by a column in descending order
3. Grouping and Aggregating:
o Grouping Data: Grouping data based on one or more categorical variables using
groupby().
o Aggregating Data: Computing summary statistics (e.g., sum, mean, count) within
each group using aggregation functions.
# Grouping Data
grouped_df = df.groupby('category')
# Aggregating Data
aggregated_df = grouped_df['value'].mean() # Compute mean value per category
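Several summary statistics can be computed per group at once with `agg()`; a runnable sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['x', 'x', 'y'], 'value': [10, 20, 30]})
# agg() applies several aggregation functions per group in one call,
# producing one column per statistic
summary = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])
print(summary.loc['x', 'mean'])  # 15.0
```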
4. Merging and Joining:
o Combining DataFrames: Merging or joining multiple DataFrames based on
common columns or indices using merge() or join().
o Handling Join Types: Specifying the type of join (e.g., inner join, outer join, left join,
right join) to control how rows are combined.
# Combining DataFrames
merged_df = pd.merge(df1, df2, on='key_column', how='inner') # Merge two DataFrames based on a common column
5. Reshaping Data:
o Pivoting Data: Restructuring data from long to wide format or vice versa using
pivot_table(), melt(), stack(), and unstack().
# Pivoting Data
pivoted_df = df.pivot_table(index='date', columns='category', values='value', aggfunc='sum') # Pivot DataFrame
# Melting Data
melted_df = pd.melt(df, id_vars=['id'], value_vars=['variable1', 'variable2']) # Melt DataFrame from wide to long format
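The `stack()` and `unstack()` pair mentioned above can be sketched with a small illustrative long-format table:

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                   'category': ['a', 'b', 'a'],
                   'value': [1, 2, 3]})
# unstack() moves the inner index level into columns (long -> wide);
# stack() reverses the operation (wide -> long)
wide = df.set_index(['date', 'category'])['value'].unstack()
long_again = wide.stack()
print(wide.shape)  # (2, 2)
```

Note that the wide table gains a missing value (`NaN`) for the `d2`/`b` combination that was absent from the long data.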
6. Applying Functions:
o Element-wise Operations: Applying functions or operations to individual elements
or rows/columns of the dataset using apply() or vectorized operations.
o Group-wise Operations: Applying functions to groups of data within each group
using apply() or transform().
# Element-wise Operations
df['new_column'] = df['column'].apply(custom_function) # Apply a user-defined function to each element in a column
# Group-wise Operations
grouped_df = df.groupby('category')
aggregated_df = grouped_df['value'].apply(custom_aggregation_function) # Apply a user-defined aggregation function to each group
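Unlike `apply()`, `transform()` returns a result aligned with the original rows, which makes it convenient for adding group-level statistics back onto the DataFrame. A runnable sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b'], 'value': [1.0, 3.0, 10.0]})
# transform() broadcasts the group-wise result back to the original shape:
# here, each value minus its own group's mean
df['demeaned'] = df.groupby('category')['value'].transform(lambda v: v - v.mean())
print(df['demeaned'].tolist())  # [-1.0, 1.0, 0.0]
```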
Data Wrangling vs. Data Manipulation with Pandas