
Exploratory Data Analysis (EDA) Report

1. Introduction
The goal of this Exploratory Data Analysis (EDA) is to conduct an in-depth examination
of a publicly available dataset focused on business transactions, logistics, and customer
behavior. For this purpose, the Superstore Sales dataset has been selected for its diverse, multidimensional features, which span sales figures, profit margins, shipping timelines, customer demographics, and product categories.

Through this EDA, the objective is to identify key patterns, uncover hidden trends, and
detect potential anomalies that may affect the reliability of business insights derived from
the data. It also aims to evaluate the integrity and consistency of the data, which are
essential prerequisites for making data-driven business decisions.

This process sets the groundwork for a comprehensive data quality assessment. By
identifying inconsistencies such as outliers, duplicate records, missing values, or skewed
distributions, businesses can better understand the risks and opportunities inherent in their
operations. The findings from this EDA can directly influence strategic initiatives such as
sales forecasting, customer segmentation, inventory optimization, and profitability
enhancement.

2. Methodology
The methodology followed for the EDA process includes:

1. Data Loading and Inspection
This step involves importing the dataset into the Python environment using libraries such as Pandas. Initial inspection is conducted by checking the structure, shape, column names, and data types to understand the dataset's dimensions and attributes. Functions like head(), info(), and describe() help identify any immediate anomalies, missing values, or incorrect data types.
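
A minimal sketch of this inspection step, assuming the dataset has been exported to a CSV file named "Superstore.csv" as in Section 4:

import pandas as pd

df = pd.read_csv("Superstore.csv")

# Dimensions and column overview
print(df.shape)          # (rows, columns)
print(df.head())         # first five records
df.info()                # column names, dtypes, non-null counts
print(df.describe())     # summary statistics for numeric columns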

2. Data Cleaning
This stage focuses on correcting inaccuracies in the dataset. Common cleaning operations
include removing duplicate entries, handling missing values either by deletion or
imputation, converting date columns into proper datetime formats, and standardizing
inconsistent data. Cleaning ensures that the dataset is reliable and free of obvious data
quality issues before analysis begins.
3. Descriptive Statistics
This involves summarizing the dataset using measures such as mean, median, standard
deviation, minimum, maximum, and interquartile range. These metrics provide insight
into the central tendencies and variability of numerical features, forming the basis for
further analysis.
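
For illustration, these measures can also be computed column by column; a short sketch, assuming the numeric 'Sales' column of the loaded DataFrame df:

# Central tendency and spread for a single numeric column
sales = df["Sales"]
print(sales.mean(), sales.median(), sales.std())
print(sales.min(), sales.max())

# Interquartile range
iqr = sales.quantile(0.75) - sales.quantile(0.25)
print("IQR:", iqr)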

4. Distribution Analysis
Distribution plots like histograms, box plots, and KDE plots are used to visualize the
frequency and spread of the data. This step helps identify skewed distributions, potential
data entry errors, and unusual trends that may require deeper investigation.
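
A brief sketch of such distribution plots, continuing from the loaded DataFrame df (the full plotting script appears in Section 5):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with KDE overlay and a boxplot for the same column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["Profit"], kde=True, ax=axes[0])
sns.boxplot(x=df["Profit"], ax=axes[1])
plt.tight_layout()
plt.show()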

5. Outlier Detection
Outliers can significantly skew the results of an analysis. Techniques like the
Interquartile Range (IQR) method and Z-score analysis are applied to detect and flag
these values. Outliers are visualized through box plots or scatter plots and may be
removed or separately treated.
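
Both techniques can be sketched as follows; the 1.5 × IQR multiplier and the |z| > 3 cutoff are conventional choices rather than values fixed by this report:

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df["Sales"] < q1 - 1.5 * iqr) | (df["Sales"] > q3 + 1.5 * iqr)]

# Z-score method: flag values more than 3 standard deviations from the mean
z = (df["Sales"] - df["Sales"].mean()) / df["Sales"].std()
z_outliers = df[z.abs() > 3]

print(len(iqr_outliers), "IQR outliers;", len(z_outliers), "z-score outliers")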

6. Correlation Analysis
To understand relationships between numeric variables, a correlation matrix is computed.
Heatmaps help visualize strong or weak correlations, which inform the selection of
relevant variables for modelling and highlight potential multicollinearity issues.
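
A minimal version of this step, restricted to the numeric columns used later in Section 5:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric columns
corr = df[["Sales", "Quantity", "Discount", "Profit"]].corr()
print(corr.round(2))

# Visualize as a heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()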

7. Data Quality Insights
During the EDA, quality concerns such as missing entries, inconsistent formats, incorrect data types, and implausible values are documented. Identifying these issues early helps prevent errors in downstream analytics and supports robust decision-making.
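
A short sketch of the checks that typically surface these concerns; the 'Quantity' and 'Discount' sanity bounds below are illustrative, not rules taken from the report:

# Missing values per column
print(df.isnull().sum())

# Duplicate rows
print("Duplicates:", df.duplicated().sum())

# Data types (e.g., dates accidentally stored as object/string)
print(df.dtypes)

# Implausible values (illustrative checks)
print("Non-positive quantities:", (df["Quantity"] <= 0).sum())
print("Discounts outside [0, 1]:", ((df["Discount"] < 0) | (df["Discount"] > 1)).sum())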

8. Business Impact Mapping
The final step links technical findings to real-world implications. For instance, missing customer data might affect personalization efforts, while duplicated records could distort revenue reporting. This mapping highlights the significance of EDA in a business context.

3. Analysis and Results


3.1 Data Overview

The dataset was imported using the pandas library, which is widely used for data
manipulation in Python. The initial inspection was conducted using the df.info() and df.head()
functions. This helped reveal the overall structure, total number of records (9,994),
column data types, and the presence of mixed data types (e.g., numerical, categorical, and
datetime). The dataset consists of 21 columns, each representing various aspects of sales,
customer information, and order logistics. Understanding this structure laid the
foundation for more in-depth exploration.

3.2 Missing Values

A check for missing values was conducted using df.isnull().sum(). It was observed that
the dataset contains minimal missing entries, primarily in the 'Postal Code' and 'Customer
Name' fields. Although the proportion is low, even a small number of missing values in
customer-related data can impact customer segmentation and delivery optimization.
These values were handled using appropriate cleaning methods to ensure the dataset's
completeness.
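
One possible way to reproduce this check and handle the affected fields; dropping rows without a 'Customer Name' and filling 'Postal Code' with a placeholder is one reasonable policy, not the only option:

# Count missing values in the affected columns
print(df[["Postal Code", "Customer Name"]].isnull().sum())

# Drop rows with no customer name; keep rows with a missing postal code
df = df.dropna(subset=["Customer Name"])
df["Postal Code"] = df["Postal Code"].fillna("Unknown")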

3.3 Duplicates

Duplicate records were identified using the df.duplicated().sum() function, which found
17 redundant entries. Duplicate rows can cause significant distortions in sales metrics and
customer behaviour analysis. These rows were removed to prevent overestimation of
revenue and to maintain the integrity of the dataset during further analysis.
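
The corresponding check and removal, as used in the cleaning script in Section 4:

# Count fully identical rows, then drop them
print("Duplicate rows:", df.duplicated().sum())   # 17 in this dataset
df = df.drop_duplicates()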

3.4 Descriptive Statistics

Descriptive statistical analysis was conducted using df.describe(). It revealed important insights such as:

• The Sales column had a wide range, with several high-value entries causing the
data to be positively skewed.
• The Profit column also exhibited skewness, with some entries having negative
profit, indicating loss-making transactions.
• The Discount field showed that discounts go up to 80%, which could be a
contributing factor to the negative profits.
This analysis provided the groundwork for detecting irregularities and interpreting
business performance.
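
These observations can be verified directly; a small sketch, assuming the cleaned DataFrame from Section 4:

# Skewness of the monetary columns (positive values indicate a right-skewed distribution)
print(df[["Sales", "Profit"]].skew())

# Maximum discount offered and the number of loss-making transactions
print("Max discount:", df["Discount"].max())
print("Loss-making rows:", (df["Profit"] < 0).sum())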

3.5 Outlier Detection

Outliers were detected using box plots and statistical techniques like the Interquartile
Range (IQR) method. Significant outliers were found in both the Sales and Profit
columns. These extreme values, while sometimes valid, can heavily influence averages
and skew analytical models. Identifying them helps in deciding whether they represent
real business cases or data entry errors that need to be corrected or excluded.

3.6 Correlation Matrix


Correlation analysis was performed using a correlation matrix and heatmap visualizations
(sns.heatmap). The following relationships were observed:

• A negative correlation between Discount and Profit, indicating that higher discounts may result in lower or negative profit margins.
• A weak correlation between Sales and Profit, suggesting that higher sales volume does not necessarily translate into higher profitability.
These findings are crucial for guiding pricing and discounting strategies.

3.7 Category-wise Insights

A segmented analysis was performed based on key categorical variables such as Segment, Region, and Category. Key insights include:

• The Consumer segment generates the highest volume of sales, indicating strong
B2C performance.
• The Technology category, although not the most frequently sold, delivers higher
margins, making it a strategic area for focused growth.
• The Western region contributes the most to overall revenue, while some regions like the South show comparatively lower profit margins, possibly due to higher shipping costs or discount rates.
These insights provide a clear direction for region-specific and category-specific business strategies.
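
The segmented figures behind these insights can be reproduced with simple group-bys; a sketch, assuming the cleaned dataset:

# Total sales by customer segment
print(df.groupby("Segment")["Sales"].sum().sort_values(ascending=False))

# Average profit margin by product category (profit as a share of sales)
category = df.groupby("Category")[["Sales", "Profit"]].sum()
print((category["Profit"] / category["Sales"]).sort_values(ascending=False))

# Total sales and profit by region
print(df.groupby("Region")[["Sales", "Profit"]].sum())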

4. Data Cleaning Techniques – Pseudocode


import pandas as pd

# Load the dataset (update the path to your actual CSV file)
df = pd.read_csv("Superstore.csv")

# Remove duplicate rows
df = df.drop_duplicates()

# Handle missing values: drop rows without a customer name
df = df.dropna(subset=['Customer Name'])

# Convert 'Order Date' to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%y')

# Detect and remove outliers in 'Sales' using the IQR rule
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Sales'] >= (Q1 - 1.5 * IQR)) & (df['Sales'] <= (Q3 + 1.5 * IQR))]

# Optional: save the cleaned dataset
df.to_csv("Superstore_cleaned.csv", index=False)

print("Data cleaning complete. Cleaned file saved as 'Superstore_cleaned.csv'.")

Output: "Data cleaning complete. Cleaned file saved as 'Superstore_cleaned.csv'."
5. Visualizations
Code:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset
df = pd.read_csv("Superstore_cleaned.csv")

# Convert 'Order Date' to datetime
df["Order Date"] = pd.to_datetime(df["Order Date"])

# Create a folder to store visualizations
output_dir = "eda_visuals"
os.makedirs(output_dir, exist_ok=True)

# Set Seaborn style
sns.set(style="whitegrid")

# 1a: Sales distribution (histogram + boxplot)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df["Sales"], kde=True, color="skyblue")
plt.title("Sales Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(x=df["Sales"], color="lightgreen")
plt.title("Sales Boxplot")
plt.tight_layout()
plt.savefig(f"{output_dir}/sales_distribution.png")
plt.close()

# 1a: Profit distribution (histogram + boxplot)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df["Profit"], kde=True, color="orange")
plt.title("Profit Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(x=df["Profit"], color="salmon")
plt.title("Profit Boxplot")
plt.tight_layout()
plt.savefig(f"{output_dir}/profit_distribution.png")
plt.close()

# 1b: Average profit by category
plt.figure(figsize=(8, 5))
avg_profit = df.groupby("Category")["Profit"].mean().sort_values()
sns.barplot(x=avg_profit.values, y=avg_profit.index, palette="Set2")
plt.title("Average Profit by Category")
plt.xlabel("Average Profit")
plt.savefig(f"{output_dir}/profit_by_category.png")
plt.close()

# 1c: Sales contribution by segment (pie chart)
plt.figure(figsize=(6, 6))
segment_sales = df.groupby("Segment")["Sales"].sum()
segment_sales.plot(kind="pie", autopct="%1.1f%%", startangle=140,
                   colors=["#66b3ff", "#99ff99", "#ffcc99"])
plt.title("Sales Contribution by Segment")
plt.ylabel("")
plt.savefig(f"{output_dir}/sales_by_segment.png")
plt.close()

# 1d: Discount vs Profit scatter plot, coloured by category
plt.figure(figsize=(7, 5))
sns.scatterplot(x="Discount", y="Profit", data=df, hue="Category", palette="Set1")
plt.title("Discount vs Profit")
plt.savefig(f"{output_dir}/discount_vs_profit.png")
plt.close()

# 1e: Correlation heatmap of numeric columns
plt.figure(figsize=(6, 5))
corr = df[["Sales", "Quantity", "Discount", "Profit"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.savefig(f"{output_dir}/correlation_heatmap.png")
plt.close()
5.1 Correlation Heatmap

5.2 Discount vs Profit


5.3 Average Profit by Category

5.4 Profit Distribution and Profit Boxplot


5.5 Sales Contribution by Segment

5.6 Sales Distribution and Sales Boxplot


6. Findings Summary
• High Discount Levels Reduce Profit Margins:
The analysis clearly shows a strong negative correlation between discount levels and
profit margins. Transactions involving heavy discounts often resulted in losses, indicating
that over-discounting may not be a sustainable strategy. This insight can guide pricing
and promotional decisions to maintain profitability.

• Duplicate Records Inflate Transaction Counts:
Initial inspection revealed the presence of duplicate entries, which, if not removed, could
lead to misleading metrics such as inflated sales volume or customer frequency. Cleaning
these records ensures the accuracy of key performance indicators used in business
reporting.

• South Region Underperforms in Profitability:
Regional analysis indicates that the South region contributes less to overall profit
compared to other regions. This may be due to a combination of lower sales, higher
returns, or logistic inefficiencies. Targeted marketing or operational strategies could help
improve performance in this region.

• Outliers Can Distort Analytical Insights:
Significant outliers were observed in both the Sales and Profit columns. These extreme
values can skew averages and affect the reliability of statistical models. Identifying and
treating these outliers is crucial for making data-driven decisions and for building robust
predictive models.

7. Linking EDA to Data Quality and Business Impact
• Duplicate Records:
  o Can inflate overall sales figures and transaction counts.
  o Lead to misleading performance reports and analytics.
  o Affect inventory management and marketing campaign decisions due to overestimated demand.

• Missing Values:
  o Missing customer names or postal codes reduce the ability to personalize services or offers.
  o Hinder accurate customer segmentation and targeting.
  o Cause difficulties in region-wise analysis and reporting.

• Outliers in Sales/Profit:
  o Extreme values distort statistical summaries like mean or standard deviation.
  o Can hide real trends or trigger misleading KPIs.
  o Affect forecasting models, budgeting, and performance analysis.

• Wrong Data Types:
  o Data stored in incorrect formats (e.g., dates as text) can break automation and analytics pipelines.
  o Lead to incorrect grouping, filtering, or computations in dashboards and ETL processes.
  o Impact consistency and reliability of business intelligence systems.

• Negative Profits:
  o Indicate potential financial losses from faulty pricing, excessive discounting, or high fulfilment costs.
  o Highlight inefficiencies that need strategic intervention.
  o Affect overall business profitability and require close monitoring of discount policies.

8. Strategic Recommendations
• Validate and Monitor Discount Thresholds:
Based on the observed negative correlation between high discount rates and profit margins, it is essential to set and enforce appropriate discount thresholds. Implementing dynamic discount strategies tailored to product categories and customer segments can help maintain profitability without compromising competitiveness (a minimal monitoring sketch follows this list).

• Establish Regular Data Cleaning Routines:
A consistent data cleaning schedule should be adopted to detect and remove duplicate records, handle missing values, and ensure data integrity. Automating this process within the data pipeline will enhance the reliability of analytics and reporting, and support more accurate business decisions.

• Deploy Automated Anomaly Detection Systems:
Introduce machine learning or statistical methods for real-time detection of outliers and anomalies in key metrics such as sales, profit, or customer behaviour. Early detection allows for proactive resolution of issues, minimizing the impact of data errors on operations and decision-making.

• Leverage EDA Insights for Predictive Modelling:
Use the patterns and trends identified through EDA to inform and improve forecasting models related to sales, demand, inventory, and customer behaviour. Clean, structured, and well-understood data increases the accuracy and reliability of predictive analytics, driving better business outcomes.
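
As referenced above, a minimal monitoring sketch for the discount recommendation; the 30% threshold is purely illustrative and would need to be tuned per category or segment:

# Flag transactions whose discount exceeds an illustrative threshold
DISCOUNT_THRESHOLD = 0.30
flagged = df[df["Discount"] > DISCOUNT_THRESHOLD]

# Share of flagged transactions that are loss-making
loss_share = (flagged["Profit"] < 0).mean()
print(f"{len(flagged)} high-discount orders; {loss_share:.0%} of them are unprofitable")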

9. Conclusion
This Exploratory Data Analysis (EDA) provided a comprehensive understanding of the
Superstore sales dataset, uncovering key insights into both data quality and business
performance. Through systematic analysis, critical issues such as duplicate records,
inconsistent data types, missing values, and unprofitable discount patterns were
identified.

The presence of data anomalies and outliers highlighted the need for continuous
monitoring and cleaning processes. The strong relationship between discount levels and
profitability further emphasized the importance of aligning promotional strategies with
financial goals.

Overall, the findings from this EDA not only support the refinement of data quality
management practices but also contribute to more informed and data-driven business
decisions. These insights lay a strong foundation for implementing forecasting models,
improving operational efficiency, and enhancing customer strategies moving forward.
