
Exploratory Data Analysis (EDA) Report

1. Introduction
The goal of this Exploratory Data Analysis (EDA) is to conduct an in-depth examination
of a publicly available dataset focused on business transactions, logistics, and customer
behavior. For this purpose, the Superstore Sales dataset has been selected for its diverse, multidimensional features, which span sales figures, profit margins, shipping timelines, customer demographics, and product categories.

Through this EDA, the objective is to identify key patterns, uncover hidden trends, and
detect potential anomalies that may affect the reliability of business insights derived from
the data. It also aims to evaluate the integrity and consistency of the data, which are
essential prerequisites for making data-driven business decisions.

This process sets the groundwork for a comprehensive data quality assessment. By
identifying inconsistencies such as outliers, duplicate records, missing values, or skewed
distributions, businesses can better understand the risks and opportunities inherent in their
operations. The findings from this EDA can directly influence strategic initiatives such as
sales forecasting, customer segmentation, inventory optimization, and profitability
enhancement.

2. Methodology
The methodology followed for the EDA process includes:

1. Data Loading and Inspection
This step involves importing the dataset into the Python environment using libraries such as Pandas. Initial inspection is conducted by checking the structure, shape, column names, and data types to understand the dataset's dimensions and attributes. Functions like head(), info(), and describe() help identify any immediate anomalies, missing values, or incorrect data types.
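
A minimal sketch of this inspection step, assuming the dataset has been exported to a CSV file named "Superstore.csv" as in Section 4:

import pandas as pd

df = pd.read_csv("Superstore.csv")

# Dimensions and column overview
print(df.shape)          # (rows, columns)
print(df.head())         # first five records
df.info()                # column names, dtypes, non-null counts
print(df.describe())     # summary statistics for numeric columns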

2. Data Cleaning
This stage focuses on correcting inaccuracies in the dataset. Common cleaning operations
include removing duplicate entries, handling missing values either by deletion or
imputation, converting date columns into proper datetime formats, and standardizing
inconsistent data. Cleaning ensures that the dataset is reliable and free of obvious data
quality issues before analysis begins.
3. Descriptive Statistics
This involves summarizing the dataset using measures such as mean, median, standard
deviation, minimum, maximum, and interquartile range. These metrics provide insight
into the central tendencies and variability of numerical features, forming the basis for
further analysis.
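
For illustration, these measures can also be computed column by column; a short sketch, assuming the numeric 'Sales' column of the loaded DataFrame df:

# Central tendency and spread for a single numeric column
sales = df["Sales"]
print(sales.mean(), sales.median(), sales.std())
print(sales.min(), sales.max())

# Interquartile range
iqr = sales.quantile(0.75) - sales.quantile(0.25)
print("IQR:", iqr)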

4. Distribution Analysis
Distribution plots like histograms, box plots, and KDE plots are used to visualize the
frequency and spread of the data. This step helps identify skewed distributions, potential
data entry errors, and unusual trends that may require deeper investigation.
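
A brief sketch of such distribution plots, continuing from the loaded DataFrame df (the full plotting script appears in Section 5):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with KDE overlay and a boxplot for the same column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["Profit"], kde=True, ax=axes[0])
sns.boxplot(x=df["Profit"], ax=axes[1])
plt.tight_layout()
plt.show()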

5. Outlier Detection
Outliers can significantly skew the results of an analysis. Techniques like the
Interquartile Range (IQR) method and Z-score analysis are applied to detect and flag
these values. Outliers are visualized through box plots or scatter plots and may be
removed or separately treated.
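
Both techniques can be sketched as follows; the 1.5 × IQR multiplier and the |z| > 3 cutoff are conventional choices rather than values fixed by this report:

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Sales"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df["Sales"] < q1 - 1.5 * iqr) | (df["Sales"] > q3 + 1.5 * iqr)]

# Z-score method: flag values more than 3 standard deviations from the mean
z = (df["Sales"] - df["Sales"].mean()) / df["Sales"].std()
z_outliers = df[z.abs() > 3]

print(len(iqr_outliers), "IQR outliers;", len(z_outliers), "z-score outliers")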

6. Correlation Analysis
To understand relationships between numeric variables, a correlation matrix is computed.
Heatmaps help visualize strong or weak correlations, which inform the selection of
relevant variables for modelling and highlight potential multicollinearity issues.
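
A minimal version of this step, restricted to the numeric columns used later in Section 5:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric columns
corr = df[["Sales", "Quantity", "Discount", "Profit"]].corr()
print(corr.round(2))

# Visualize as a heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()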

7. Data Quality Insights
During the EDA, quality concerns such as missing entries, inconsistent formats, incorrect data types, and implausible values are documented. Identifying these issues early helps prevent errors in downstream analytics and supports robust decision-making.
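
A short sketch of the checks that typically surface these concerns; the 'Quantity' and 'Discount' sanity bounds below are illustrative, not rules taken from the report:

# Missing values per column
print(df.isnull().sum())

# Duplicate rows
print("Duplicates:", df.duplicated().sum())

# Data types (e.g., dates accidentally stored as object/string)
print(df.dtypes)

# Implausible values (illustrative checks)
print("Non-positive quantities:", (df["Quantity"] <= 0).sum())
print("Discounts outside [0, 1]:", ((df["Discount"] < 0) | (df["Discount"] > 1)).sum())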

8. Business Impact Mapping
The final step links technical findings to real-world implications. For instance, missing customer data might affect personalization efforts, while duplicated records could distort revenue reporting. This mapping highlights the significance of EDA in a business context.

3. Analysis and Results


3.1 Data Overview

The dataset was imported using the pandas library, which is widely used for data
manipulation in Python. The initial inspection was conducted using the df.info() and df.head()
functions. This helped reveal the overall structure, total number of records (9,994),
column data types, and the presence of mixed data types (e.g., numerical, categorical, and
datetime). The dataset consists of 21 columns, each representing various aspects of sales,
customer information, and order logistics. Understanding this structure laid the
foundation for more in-depth exploration.

3.2 Missing Values

A check for missing values was conducted using df.isnull().sum(). It was observed that
the dataset contains minimal missing entries, primarily in the 'Postal Code' and 'Customer
Name' fields. Although the proportion is low, even a small number of missing values in
customer-related data can impact customer segmentation and delivery optimization.
These values were handled using appropriate cleaning methods to ensure the dataset's
completeness.
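
One possible way to reproduce this check and handle the affected fields; dropping rows without a 'Customer Name' and filling 'Postal Code' with a placeholder is one reasonable policy, not the only option:

# Count missing values in the affected columns
print(df[["Postal Code", "Customer Name"]].isnull().sum())

# Drop rows with no customer name; keep rows with a missing postal code
df = df.dropna(subset=["Customer Name"])
df["Postal Code"] = df["Postal Code"].fillna("Unknown")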

3.3 Duplicates

Duplicate records were identified using the df.duplicated().sum() function, which found
17 redundant entries. Duplicate rows can cause significant distortions in sales metrics and
customer behaviour analysis. These rows were removed to prevent overestimation of
revenue and to maintain the integrity of the dataset during further analysis.
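
The corresponding check and removal, as used in the cleaning script in Section 4:

# Count fully identical rows, then drop them
print("Duplicate rows:", df.duplicated().sum())   # 17 in this dataset
df = df.drop_duplicates()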

3.4 Descriptive Statistics

Descriptive statistical analysis was conducted using df.describe(). It revealed important insights such as:

• The Sales column had a wide range, with several high-value entries causing the
data to be positively skewed.
• The Profit column also exhibited skewness, with some entries having negative
profit, indicating loss-making transactions.
• The Discount field showed that discounts go up to 80%, which could be a
contributing factor to the negative profits.
This analysis provided the groundwork for detecting irregularities and interpreting
business performance.
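
These observations can be verified directly; a small sketch, assuming the cleaned DataFrame from Section 4:

# Skewness of the monetary columns (positive values indicate a right-skewed distribution)
print(df[["Sales", "Profit"]].skew())

# Maximum discount offered and the number of loss-making transactions
print("Max discount:", df["Discount"].max())
print("Loss-making rows:", (df["Profit"] < 0).sum())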

3.5 Outlier Detection

Outliers were detected using box plots and statistical techniques like the Interquartile
Range (IQR) method. Significant outliers were found in both the Sales and Profit
columns. These extreme values, while sometimes valid, can heavily influence averages
and skew analytical models. Identifying them helps in deciding whether they represent
real business cases or data entry errors that need to be corrected or excluded.

3.6 Correlation Matrix


Correlation analysis was performed using a correlation matrix and heatmap visualizations
(sns.heatmap). The following relationships were observed:

• A negative correlation between Discount and Profit, indicating that higher discounts may result in lower or negative profit margins.
• A weak correlation between Sales and Profit, suggesting that higher sales volume does not necessarily translate into higher profitability.
These findings are crucial for guiding pricing and discounting strategies.

3.7 Category-wise Insights

A segmented analysis was performed based on key categorical variables such as Segment, Region, and Category. Key insights include:

• The Consumer segment generates the highest volume of sales, indicating strong
B2C performance.
• The Technology category, although not the most frequently sold, delivers higher
margins, making it a strategic area for focused growth.
• The Western region contributes the most to overall revenue, while some regions like the South show comparatively lower profit margins, possibly due to higher shipping costs or discount rates.
These insights provide a clear direction for region-specific and category-specific business strategies.
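
The segmented figures behind these insights can be reproduced with simple group-bys; a sketch, assuming the cleaned dataset:

# Total sales by customer segment
print(df.groupby("Segment")["Sales"].sum().sort_values(ascending=False))

# Average profit margin by product category (profit as a share of sales)
category = df.groupby("Category")[["Sales", "Profit"]].sum()
print((category["Profit"] / category["Sales"]).sort_values(ascending=False))

# Total sales and profit by region
print(df.groupby("Region")[["Sales", "Profit"]].sum())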

4. Data Cleaning Techniques – Pseudocode


import pandas as pd

# Load the dataset (update the path to your actual CSV file)
df = pd.read_csv("Superstore.csv")

# Remove duplicate rows
df = df.drop_duplicates()

# Handle missing values: drop rows without a customer name
df = df.dropna(subset=['Customer Name'])

# Convert 'Order Date' to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%y')

# Detect and remove outliers in 'Sales' using the IQR rule
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Sales'] >= (Q1 - 1.5 * IQR)) & (df['Sales'] <= (Q3 + 1.5 * IQR))]

# Optional: save the cleaned dataset
df.to_csv("Superstore_cleaned.csv", index=False)

print("Data cleaning complete. Cleaned file saved as 'Superstore_cleaned.csv'.")

Output: "Data cleaning complete. Cleaned file saved as 'Superstore_cleaned.csv'."
5. Visualizations
Code:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset
df = pd.read_csv("Superstore_cleaned.csv")

# Convert 'Order Date' to datetime
df["Order Date"] = pd.to_datetime(df["Order Date"])

# Create a folder to store visualizations
output_dir = "eda_visuals"
os.makedirs(output_dir, exist_ok=True)

# Set Seaborn style
sns.set(style="whitegrid")

# 1a: Sales distribution (histogram + boxplot)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df["Sales"], kde=True, color="skyblue")
plt.title("Sales Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(x=df["Sales"], color="lightgreen")
plt.title("Sales Boxplot")
plt.tight_layout()
plt.savefig(f"{output_dir}/sales_distribution.png")
plt.close()

# 1a: Profit distribution (histogram + boxplot)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(df["Profit"], kde=True, color="orange")
plt.title("Profit Distribution")
plt.subplot(1, 2, 2)
sns.boxplot(x=df["Profit"], color="salmon")
plt.title("Profit Boxplot")
plt.tight_layout()
plt.savefig(f"{output_dir}/profit_distribution.png")
plt.close()

# 1b: Average profit by category
plt.figure(figsize=(8, 5))
avg_profit = df.groupby("Category")["Profit"].mean().sort_values()
sns.barplot(x=avg_profit.values, y=avg_profit.index, palette="Set2")
plt.title("Average Profit by Category")
plt.xlabel("Average Profit")
plt.savefig(f"{output_dir}/profit_by_category.png")
plt.close()

# 1c: Sales contribution by segment (pie chart)
plt.figure(figsize=(6, 6))
segment_sales = df.groupby("Segment")["Sales"].sum()
segment_sales.plot(kind="pie", autopct="%1.1f%%", startangle=140,
                   colors=["#66b3ff", "#99ff99", "#ffcc99"])
plt.title("Sales Contribution by Segment")
plt.ylabel("")
plt.savefig(f"{output_dir}/sales_by_segment.png")
plt.close()

# 1d: Discount vs Profit scatter plot, coloured by category
plt.figure(figsize=(7, 5))
sns.scatterplot(x="Discount", y="Profit", data=df, hue="Category", palette="Set1")
plt.title("Discount vs Profit")
plt.savefig(f"{output_dir}/discount_vs_profit.png")
plt.close()

# 1e: Correlation heatmap of numeric columns
plt.figure(figsize=(6, 5))
corr = df[["Sales", "Quantity", "Discount", "Profit"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.savefig(f"{output_dir}/correlation_heatmap.png")
plt.close()
5.1 Correlation Heatmap

5.2 Discount vs Profit


5.3 Average Profit by Category

5.4 Profit Distribution and Profit Boxplot


5.5 Sales Contribution by Segment

5.6 Sales Distribution and Sales Boxplot


6. Findings Summary
• High Discount Levels Reduce Profit Margins:
The analysis clearly shows a strong negative correlation between discount levels and
profit margins. Transactions involving heavy discounts often resulted in losses, indicating
that over-discounting may not be a sustainable strategy. This insight can guide pricing
and promotional decisions to maintain profitability.

• Duplicate Records Inflate Transaction Counts:
Initial inspection revealed the presence of duplicate entries, which, if not removed, could
lead to misleading metrics such as inflated sales volume or customer frequency. Cleaning
these records ensures the accuracy of key performance indicators used in business
reporting.

• South Region Underperforms in Profitability:
Regional analysis indicates that the South region contributes less to overall profit
compared to other regions. This may be due to a combination of lower sales, higher
returns, or logistic inefficiencies. Targeted marketing or operational strategies could help
improve performance in this region.

• Outliers Can Distort Analytical Insights:
Significant outliers were observed in both the Sales and Profit columns. These extreme
values can skew averages and affect the reliability of statistical models. Identifying and
treating these outliers is crucial for making data-driven decisions and for building robust
predictive models.

7. Linking EDA to Data Quality and Business Impact
• Duplicate Records:
  o Can inflate overall sales figures and transaction counts.
  o Lead to misleading performance reports and analytics.
  o Affect inventory management and marketing campaign decisions due to overestimated demand.

• Missing Values:
  o Missing customer names or postal codes reduce the ability to personalize services or offers.
  o Hinder accurate customer segmentation and targeting.
  o Cause difficulties in region-wise analysis and reporting.

• Outliers in Sales/Profit:
  o Extreme values distort statistical summaries like mean or standard deviation.
  o Can hide real trends or trigger misleading KPIs.
  o Affect forecasting models, budgeting, and performance analysis.

• Wrong Data Types:
  o Data stored in incorrect formats (e.g., dates as text) can break automation and analytics pipelines.
  o Lead to incorrect grouping, filtering, or computations in dashboards and ETL processes.
  o Impact consistency and reliability of business intelligence systems.

• Negative Profits:
  o Indicate potential financial losses from faulty pricing, excessive discounting, or high fulfilment costs.
  o Highlight inefficiencies that need strategic intervention.
  o Affect overall business profitability and require close monitoring of discount policies.

8. Strategic Recommendations
• Validate and Monitor Discount Thresholds:
Based on the observed negative correlation between high discount rates and profit margins, it is essential to set and enforce appropriate discount thresholds. Implementing dynamic discount strategies tailored to product categories and customer segments can help maintain profitability without compromising competitiveness (a minimal monitoring sketch follows this list).

• Establish Regular Data Cleaning Routines:
A consistent data cleaning schedule should be adopted to detect and remove duplicate records, handle missing values, and ensure data integrity. Automating this process within the data pipeline will enhance the reliability of analytics and reporting, and support more accurate business decisions.

• Deploy Automated Anomaly Detection Systems:
Introduce machine learning or statistical methods for real-time detection of outliers and anomalies in key metrics such as sales, profit, or customer behaviour. Early detection allows for proactive resolution of issues, minimizing the impact of data errors on operations and decision-making.

• Leverage EDA Insights for Predictive Modelling:
Use the patterns and trends identified through EDA to inform and improve forecasting models related to sales, demand, inventory, and customer behaviour. Clean, structured, and well-understood data increases the accuracy and reliability of predictive analytics, driving better business outcomes.
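
As referenced above, a minimal monitoring sketch for the discount recommendation; the 30% threshold is purely illustrative and would need to be tuned per category or segment:

# Flag transactions whose discount exceeds an illustrative threshold
DISCOUNT_THRESHOLD = 0.30
flagged = df[df["Discount"] > DISCOUNT_THRESHOLD]

# Share of flagged transactions that are loss-making
loss_share = (flagged["Profit"] < 0).mean()
print(f"{len(flagged)} high-discount orders; {loss_share:.0%} of them are unprofitable")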

9. Conclusion
This Exploratory Data Analysis (EDA) provided a comprehensive understanding of the
Superstore sales dataset, uncovering key insights into both data quality and business
performance. Through systematic analysis, critical issues such as duplicate records,
inconsistent data types, missing values, and unprofitable discount patterns were
identified.

The presence of data anomalies and outliers highlighted the need for continuous
monitoring and cleaning processes. The strong relationship between discount levels and
profitability further emphasized the importance of aligning promotional strategies with
financial goals.

Overall, the findings from this EDA not only support the refinement of data quality
management practices but also contribute to more informed and data-driven business
decisions. These insights lay a strong foundation for implementing forecasting models,
improving operational efficiency, and enhancing customer strategies moving forward.
