EDA LAB ASSIGNMENT2

The document outlines an assignment focused on exploratory data analytics using the Pandas library in Python, covering tasks such as merging datasets, reshaping data with melt() and pivot(), and aggregating total sales by region. It also introduces Matplotlib for creating various visualizations, including line plots, scatter plots, histograms, and bar charts, along with customization techniques. The assignment provides step-by-step instructions and sample code for practical application of these concepts.

Uploaded by

vinaynaidu6872

Name: Kola Vinay Kumar
Subject: EXPLORATORY DATA ANALYTICS (EDA)
HT. No: 2403B05107
Semester: M. Tech (I/II)

ASSIGNMENT 02:

Data Transformation and Aggregation


Objective: Practice transforming and reshaping datasets.
Tasks:
• Merge multiple datasets (e.g., orders and customers datasets).
• Perform reshaping using melt() and pivot() functions in Pandas.
• Aggregate data (e.g., calculate the total sales for each region).
• Tools: Pandas.

Data Transformation and Aggregation with Pandas


The objective of this exercise is to practice transforming and reshaping datasets using the
Pandas library in Python. We will focus on merging datasets, reshaping data
using melt() and pivot(), and aggregating data to calculate total sales for each region.
1. Merging Datasets
Merging is the process of combining two or more datasets based on a common key. In Pandas, this is
typically done using the merge() function.
2. Reshaping Data
Reshaping refers to changing the layout of a DataFrame. This can be done using:
• Melt: Converts a DataFrame from wide format to long format.
• Pivot: Converts a DataFrame from long format to wide format.
3. Aggregating Data
Aggregation involves summarizing data, such as calculating totals, averages, or counts. In Pandas,
this can be done using functions like groupby() and agg().
Tools
Pandas: A powerful data manipulation and analysis library for Python.
Step 1: Import Libraries and Create Sample Datasets
import pandas as pd

# Sample customers dataset


customers_data = {
'customer_id': [1, 2, 3],
'customer_name': ['Alice', 'Bob', 'Charlie'],
'region': ['North', 'South', 'East']
}

customers_df = pd.DataFrame(customers_data)

# Sample orders dataset


orders_data = {
'order_id': [101, 102, 103, 104],
'customer_id': [1, 2, 1, 3],
'order_amount': [250, 150, 300, 200],
'order_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
}
orders_df = pd.DataFrame(orders_data)

print("Customers DataFrame:")
print(customers_df)
print("\nOrders DataFrame:")
print(orders_df)

OUTPUT:

Customers DataFrame:
customer_id customer_name region
0 1 Alice North
1 2 Bob South
2 3 Charlie East

Orders DataFrame:
order_id customer_id order_amount order_date
0 101 1 250 2023-01-01
1 102 2 150 2023-01-02
2 103 1 300 2023-01-03
3 104 3 200 2023-01-04

Step 2: Merge Datasets


We will merge the orders_df and customers_df on the customer_id column.
# Merging datasets
merged_df = pd.merge(orders_df, customers_df, on='customer_id', how='inner')

print("\nMerged DataFrame:")
print(merged_df)

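OUTPUT:

Merged DataFrame:
order_id customer_id order_amount order_date customer_name region
0 101 1 250 2023-01-01 Alice North
1 103 1 300 2023-01-03 Alice North
2 102 2 150 2023-01-02 Bob South
3 104 3 200 2023-01-04 Charlie East

(Row order can vary with the pandas version; recent releases keep the row order of orders_df.)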
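The how parameter decides which rows survive a merge. The sketch below contrasts an inner join with a left join, using the Step 1 sample data plus one hypothetical order (order_id 105 for customer_id 4) that is not in the original data and is added only so that one row has no match. The indicator=True option adds a _merge column that shows where each row came from.

```python
import pandas as pd

# Step 1 sample data, plus a hypothetical order 105 whose customer_id (4)
# does not exist in the customers table. Order 105 is an assumption.
customers_df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "customer_name": ["Alice", "Bob", "Charlie"],
    "region": ["North", "South", "East"],
})
orders_df = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 2, 1, 3, 4],
    "order_amount": [250, 150, 300, 200, 400],
})

# Inner join: only orders with a matching customer are kept.
inner_df = pd.merge(orders_df, customers_df, on="customer_id", how="inner")

# Left join: every order is kept; indicator=True adds a "_merge" column
# that says whether the row matched ("both") or not ("left_only").
left_df = pd.merge(orders_df, customers_df, on="customer_id",
                   how="left", indicator=True)

unmatched = left_df[left_df["_merge"] == "left_only"]
print(len(inner_df), len(left_df))
print(unmatched[["order_id", "customer_id"]])
```

Checking for left_only rows is a quick way to find orders that would silently disappear in an inner join.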

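Reshaping Using melt()


The melt() function converts a wide-format dataset into long format: the columns listed in id_vars stay as identifiers, and each column listed in value_vars is unpivoted into a pair of variable and value columns.

Note: from this point on, the code and outputs assume a different version of merged_df than the one built in Step 2: one with name, product, quantity, price, and region columns and customer IDs such as C1. That dataset is not shown in this document, so the calls raise a KeyError if they are run against the Step 2 DataFrame.

# Reshaping the merged data using melt
# (assumes merged_df has the columns name, product, quantity, price, region)
melted_df = pd.melt(merged_df, id_vars=['order_id', 'customer_id', 'name'],
                    value_vars=['product', 'quantity', 'price', 'region'])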

print(melted_df)
OUTPUT:

order_id customer_id name variable value


0 101 C1 Alice product Laptop
1 104 C1 Alice product Monitor
2 102 C2 Bob product Phone
3 105 C2 Bob product Keyboard
4 103 C3 Charlie product Tablet
...
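As a runnable counterpart, here is a minimal sketch of melt() applied to the columns that the Step 2 merge actually produces. The DataFrame is rebuilt by hand from the Step 1 sample data (order_date and customer_id are left out for brevity), and the choice of id_vars and value_vars is an assumption made for illustration.

```python
import pandas as pd

# Merged frame rebuilt from the Step 1 sample data.
merged_df = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_name": ["Alice", "Bob", "Alice", "Charlie"],
    "order_amount": [250, 150, 300, 200],
    "region": ["North", "South", "North", "East"],
})

# order_id and customer_name stay as identifier columns; order_amount and
# region are stacked into (variable, value) pairs, one row per original cell.
melted_df = pd.melt(
    merged_df,
    id_vars=["order_id", "customer_name"],
    value_vars=["order_amount", "region"],
)

print(melted_df.shape)  # 4 rows x 2 melted columns give 8 rows and 4 columns
print(melted_df)
```

Because order_amount is numeric and region is text, the resulting value column holds mixed types.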
Reshaping Using pivot()
The pivot() function converts long format back to wide format.

# pivot() example (requires product and price columns in merged_df)
pivot_df = merged_df.pivot(index='order_id', columns='product', values='price')

print(pivot_df)

OUTPUT:
product Keyboard Laptop Monitor Phone Tablet
order_id
101 NaN 1000.0 NaN NaN NaN
102 NaN NaN NaN 500.0 NaN
103 NaN NaN NaN NaN 300.0
104 NaN NaN 200.0 NaN NaN
105 50.0 NaN NaN NaN NaN

The pivot() function rearranges the data so that each order_id becomes a row and each product becomes a
column. Cells with no matching order and product combination are filled with NaN.
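The same idea can be tried on the Step 1 sample data. The sketch below rebuilds the merged DataFrame by hand and pivots it with customer_name as rows and order_date as columns, a layout chosen only for illustration. It also shows pivot_table(), which is the usual alternative when an index and column pair occurs more than once, a case in which pivot() raises a ValueError.

```python
import pandas as pd

# Merged frame rebuilt from the Step 1 sample data.
merged_df = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_name": ["Alice", "Bob", "Alice", "Charlie"],
    "order_amount": [250, 150, 300, 200],
    "order_date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
    "region": ["North", "South", "North", "East"],
})

# pivot(): every (customer_name, order_date) pair occurs once, so this works.
wide_df = merged_df.pivot(index="customer_name", columns="order_date",
                          values="order_amount")
print(wide_df)

# (region, customer_name) repeats for Alice in North, so pivot() would fail
# here; pivot_table() aggregates the duplicates instead.
totals_df = merged_df.pivot_table(index="region", columns="customer_name",
                                  values="order_amount", aggfunc="sum")
print(totals_df)
```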

Aggregating Data
We will calculate total sales per region.

# Calculate sales per order (requires quantity and price columns in merged_df)
merged_df['total_sales'] = merged_df['quantity'] * merged_df['price']

# Group by region and sum the sales
sales_per_region = merged_df.groupby('region')['total_sales'].sum().reset_index()

print(sales_per_region)

OUTPUT:
region total_sales
0 East 1400
1 North 300
2 West 1150

• We multiply quantity and price to get total_sales.


• Then we group by region and sum the total_sales.
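The overview mentions agg() alongside groupby(), but the steps only use sum(). The sketch below shows one way to compute several summaries at once with named aggregation, using the order_amount column from the Step 1 sample data. The output column names (total_sales, average_order, order_count) are chosen here for illustration.

```python
import pandas as pd

# Merged frame rebuilt from the Step 1 sample data.
merged_df = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "order_amount": [250, 150, 300, 200],
    "region": ["North", "South", "North", "East"],
})

# Named aggregation: each keyword becomes an output column, defined as
# (input column, aggregation function).
sales_per_region = (
    merged_df.groupby("region")
    .agg(
        total_sales=("order_amount", "sum"),
        average_order=("order_amount", "mean"),
        order_count=("order_id", "count"),
    )
    .reset_index()
)

print(sales_per_region)
```

With this data, North has two orders totalling 550, while South and East have one order each (150 and 200).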
✔ Merged datasets on customer_id.
✔ Reshaped data using melt() (long format) and pivot() (wide format).
✔ Aggregated total sales by region.
This provides a complete workflow of merging, reshaping, and aggregating datasets using
Pandas.

2. Creating and Customizing Plots


Objective: Use Matplotlib to create various types of visualizations.
Tasks:
• Create line plots, scatter plots, histograms, and bar charts.
• Customize plots with legends, annotations, titles, and axis labels.
• Generate subplots and explore 3D plotting.
• Tools: Matplotlib.
Matplotlib is a powerful library in Python used for creating static, animated, and interactive
visualizations. It provides functionalities to create various types of plots, such as line plots, scatter
plots, histograms, and bar charts.
With Matplotlib, we can:
• Customize plots with titles, labels, legends, and annotations.
• Generate multiple subplots in one figure.
• Explore 3D visualizations using mpl_toolkits.mplot3d.

Importing Required Libraries


Before creating plots, we need to import Matplotlib and NumPy.

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # Registers the '3d' projection; only required before Matplotlib 3.2
Creating Various Types of Plots
Line Plot
A line plot is used to visualize trends over time or over any other continuous variable.
# Data for the line plot
x = np.linspace(0, 10, 100) # 100 points from 0 to 10
y = np.sin(x) # Sine function

# Creating the line plot


plt.figure(figsize=(8, 5)) # Set figure size
plt.plot(x, y, label='sin(x)', color='blue', linestyle='--', linewidth=2)

# Customization
plt.title("Line Plot of sin(x)", fontsize=14)
plt.xlabel("X values", fontsize=12)
plt.ylabel("Y values", fontsize=12)
plt.legend()
plt.grid(True)

# Show plot
plt.show()
OUTPUT:
A dashed blue line plot of sin(x) for x from 0 to 10, with a legend and grid.

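To connect the "trends over time" use case with the first part of the assignment, the sketch below plots the order amounts from the Step 1 sample data against their dates. It uses Matplotlib's object-oriented interface (fig, ax) rather than the plt functions, and it selects the non-interactive Agg backend on the assumption that the script may run without a display; both are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Order dates and amounts from the Step 1 sample data.
orders_df = pd.DataFrame({
    "order_date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
    "order_amount": [250, 150, 300, 200],
})
# Convert the date strings to datetime so the x-axis is a real time axis.
orders_df["order_date"] = pd.to_datetime(orders_df["order_date"])

fig, ax = plt.subplots(figsize=(8, 5))
(line,) = ax.plot(orders_df["order_date"], orders_df["order_amount"],
                  marker="o", label="order_amount")
ax.set_title("Order Amount over Time")
ax.set_xlabel("Order date")
ax.set_ylabel("Order amount")
ax.legend()
ax.grid(True)
fig.autofmt_xdate()  # rotate the date labels so they do not overlap

fig.canvas.draw()  # render in memory; fig.savefig("orders.png") would write a file
```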
Scatter Plot

A scatter plot is used to show relationships between two variables.

# Generate random data


np.random.seed(42)
x = np.random.rand(50) * 10 # Random x values
y = np.random.rand(50) * 10 # Random y values

# Creating the scatter plot


plt.figure(figsize=(8, 5))
plt.scatter(x, y, color='red', marker='o', edgecolors='black', alpha=0.75)

# Customization
plt.title("Scatter Plot of Random Data", fontsize=14)
plt.xlabel("X Axis", fontsize=12)
plt.ylabel("Y Axis", fontsize=12)
plt.grid(True)

# Show plot
plt.show()
Output:
A scatter plot with red circles representing random data points.
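The random data above has no real relationship between x and y, so the plot shows an unstructured cloud. As a contrast, the sketch below generates made-up data with a linear trend plus noise and reports the Pearson correlation in the title. The slope, noise level, and seed are arbitrary assumptions, and the Agg backend is selected on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)            # 50 x values between 0 and 10
y = 2 * x + rng.normal(0, 1, 50)      # linear trend plus noise (synthetic data)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, color="red", marker="o", edgecolors="black", alpha=0.75)
ax.set_title(f"Scatter Plot of Correlated Data (r = {r:.2f})")
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.grid(True)

fig.canvas.draw()  # render in memory; fig.savefig("scatter.png") would write a file
```

A correlation close to 1 shows up as points clustered along a rising line.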

Histogram
A histogram is used to represent the distribution of a dataset.
# Generate random data
data = np.random.randn(1000) # 1000 samples from the standard normal distribution

# Creating the histogram


plt.figure(figsize=(8, 5))
plt.hist(data, bins=30, color='green', edgecolor='black', alpha=0.7)

# Customization
plt.title("Histogram of Normally Distributed Data", fontsize=14)
plt.xlabel("Data Bins", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.grid(True)

# Show plot
plt.show()

OUTPUT:

A histogram with 30 bins showing an approximately bell-shaped (normal) distribution centered near 0.
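To see the numbers behind the bars, the following sketch computes the bin counts with np.histogram() and compares them with the values returned by hist(). The seed and the use of np.random.default_rng() are assumptions for this example, so the data differs from the run above; the Agg backend is selected on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal(1000)  # 1000 samples from the standard normal

# np.histogram returns the count per bin and the bin edges.
counts, edges = np.histogram(data, bins=30)
print(counts.sum())  # 1000: every sample falls into exactly one bin
print(len(edges))    # 31: 30 bins need 31 edges

# hist() returns the same counts and edges, plus the drawn bar patches.
fig, ax = plt.subplots(figsize=(8, 5))
n, bins, patches = ax.hist(data, bins=30, color="green",
                           edgecolor="black", alpha=0.7)
print(np.array_equal(n, counts))
```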


Bar Chart
A bar chart is used to compare categorical data.
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [10, 25, 15, 30]

# Creating the bar chart


plt.figure(figsize=(8, 5))
plt.bar(categories, values, color=['blue', 'orange', 'green', 'red'])

# Customization
plt.title("Bar Chart Example", fontsize=14)
plt.xlabel("Categories", fontsize=12)
plt.ylabel("Values", fontsize=12)
plt.grid(axis='y', linestyle='--')

# Show plot
plt.show()

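OUTPUT:
A bar chart with four bars (A, B, C, D) in blue, orange, green, and red, with dashed horizontal grid lines.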
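A bar chart is also a natural way to display the aggregated totals from the first part of the assignment. The sketch below sums order_amount per region from the Step 1 sample data and labels each bar with its value. It assumes Matplotlib 3.4 or newer, where Axes.bar_label() is available, and selects the Agg backend on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Region and order_amount columns from the Step 1 sample data.
merged_df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "order_amount": [250, 150, 300, 200],
})
totals = merged_df.groupby("region")["order_amount"].sum()

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(totals.index, totals.values, color="orange")
ax.bar_label(bars)  # write each bar's height above it (Matplotlib >= 3.4)
ax.set_title("Total Sales per Region")
ax.set_xlabel("Region")
ax.set_ylabel("Total sales")
ax.grid(axis="y", linestyle="--")

fig.canvas.draw()  # render in memory; fig.savefig("sales.png") would write a file
```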

Customizing Plots: Adding Legends, Titles, and Annotations


Customization makes a plot easier to interpret.
# Creating a simple line plot with annotations
x = np.linspace(0, 10, 100)
y = np.cos(x)

plt.figure(figsize=(8, 5))
plt.plot(x, y, label='cos(x)', color='purple')

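# Adding a text annotation with an arrow pointing at the peak cos(0) = 1
plt.annotate('Peak', xy=(0, 1), xytext=(2, 1.2),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.ylim(-1.2, 1.4)  # Extend the y-axis so the annotation text at y = 1.2 fits inside the axes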

# Customization
plt.title("Customized Line Plot", fontsize=14)
plt.xlabel("X values", fontsize=12)
plt.ylabel("Y values", fontsize=12)
plt.legend()
plt.grid(True)

# Show plot
plt.show()

OUTPUT:
A purple line plot of cos(x) with an arrow labeled "Peak" pointing at the point (0, 1).
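The annotation above hard-codes the peak position. A small variation, sketched below, locates the maximum with np.argmax() and places the annotation relative to it, so the label follows the data if the function changes. The text offsets and axis limits are arbitrary choices, and the Agg backend is selected on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.cos(x)

# Find the highest point of the curve instead of hard-coding it.
peak_idx = np.argmax(y)
x_peak, y_peak = x[peak_idx], y[peak_idx]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, label="cos(x)", color="purple")
ax.annotate(f"Peak ({x_peak:.1f}, {y_peak:.1f})",
            xy=(x_peak, y_peak),                   # point the arrow targets
            xytext=(x_peak + 2, y_peak + 0.3),     # where the label is drawn
            arrowprops=dict(facecolor="black", shrink=0.05))
ax.set_ylim(-1.2, 1.6)  # leave room above the curve for the label
ax.legend()
ax.grid(True)

fig.canvas.draw()  # render in memory; fig.savefig("peak.png") would write a file
```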

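Subplots
We can create multiple plots in a single figure. This example reuses x and y (the cosine data), data, categories, and values defined in the previous sections.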

fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# First subplot: Line plot


axs[0, 0].plot(x, y, color='blue')
axs[0, 0].set_title("Line Plot")

# Second subplot: Scatter plot


axs[0, 1].scatter(x, y, color='red')
axs[0, 1].set_title("Scatter Plot")

# Third subplot: Histogram


axs[1, 0].hist(data, bins=20, color='green')
axs[1, 0].set_title("Histogram")

# Fourth subplot: Bar Chart


axs[1, 1].bar(categories, values, color='orange')
axs[1, 1].set_title("Bar Chart")

# Adjust layout
plt.tight_layout()
plt.show()
OUTPUT:
A 2×2 grid of subplots: line plot, scatter plot, histogram, and bar chart.
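When the panels show related curves, sharing an axis keeps them comparable. The sketch below stacks sin(x) and cos(x) in two rows with sharex=True and fills the panels in a loop. The layout and the choice of functions are assumptions made for this example, and the Agg backend is selected on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Two rows, one column; sharex=True links the x-axis of both panels.
fig, axs = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Fill each panel in a loop instead of repeating the plotting code.
for ax, func, name in zip(axs, (np.sin, np.cos), ("sin(x)", "cos(x)")):
    ax.plot(x, func(x))
    ax.set_title(name)
    ax.grid(True)

axs[-1].set_xlabel("X values")  # only the bottom panel needs an x label
fig.tight_layout()

fig.canvas.draw()  # render in memory; fig.savefig("subplots.png") would write a file
```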

3D Plotting
Matplotlib supports 3D visualization through the mplot3d toolkit. Passing projection='3d' to add_subplot() returns an Axes3D object.

fig = plt.figure(figsize=(8, 6))


ax = fig.add_subplot(111, projection='3d')

# Generate 3D data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Creating the 3D surface plot


ax.plot_surface(X, Y, Z, cmap='viridis')

# Customization
ax.set_title("3D Surface Plot")
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.set_zlabel("Z Axis")

# Show plot
plt.show()
OUTPUT:
A 3D surface plot of Z = sin(sqrt(X² + Y²)) colored with the viridis colormap.
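The surface plot depends on np.meshgrid(), which turns two 1D coordinate arrays into 2D grids. The sketch below first prints the shape of a tiny grid to make that step visible, then draws the same function as a wireframe on a coarser grid. The grid sizes and styling are assumptions for this example, and the Agg backend is selected on the assumption that no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Tiny grid first, to see what meshgrid returns.
x_small = np.linspace(-5, 5, 3)  # 3 x-coordinates
y_small = np.linspace(-5, 5, 2)  # 2 y-coordinates
X_small, Y_small = np.meshgrid(x_small, y_small)
print(X_small.shape)  # (2, 3): one row per y value, one column per x value

# Same function as the surface plot, on a 30 x 30 grid, drawn as a wireframe.
x = np.linspace(-5, 5, 30)
y = np.linspace(-5, 5, 30)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.plot_wireframe(X, Y, Z, color="steelblue", linewidth=0.5)
ax.set_title("3D Wireframe Plot")
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.set_zlabel("Z Axis")

fig.canvas.draw()  # render in memory; fig.savefig("wireframe.png") would write a file
```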

Plot Type      Purpose
Line Plot      Visualize trends over time
Scatter Plot   Show relationships between two variables
Histogram      Display data distribution
Bar Chart      Compare categorical data
Subplots       Show multiple plots in one figure
3D Plot        Visualize complex 3D data

Key Customizations
✔ Titles (title())
✔ Axis labels (xlabel(), ylabel())
✔ Legends (legend())
✔ Annotations (annotate())
✔ Grid (grid(True))

Matplotlib provides a flexible and powerful way to visualize data. By mastering these concepts, you
can create highly customized and informative plots for data analysis and presentation.
