Step 1: Identify at least 10 major KPIs that would be useful for the business
Based on the dataset, I have identified the following 10 major KPIs that would be useful for the
business (a short Pandas sketch computing a few of them follows the list):
Sales Revenue: Total sales revenue generated by the supermarket chain
Customer Count: Number of unique customers who have made purchases
Average Order Value (AOV): Average amount spent by customers in a single transaction
Customer Retention Rate: Percentage of customers who have made repeat purchases
Product Category Sales: Sales revenue generated by each product category (e.g. dairy,
bakery, etc.)
Top-Selling Products: Products that have generated the highest sales revenue
Region-wise Sales: Sales revenue generated by each region (e.g. Chennai, Coimbatore, etc.)
State-wise Sales: Sales revenue generated by each state (e.g. Tamil Nadu, Karnataka, etc.)
Gross Margin: Difference between revenue and cost of goods sold
Inventory Turnover: Number of times inventory is sold and replaced within a given period
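Several of these KPIs can be computed directly with Pandas. The sketch below assumes columns named 'Sales', 'Customer Name', 'Order ID', and 'Region' exist in the file; rename them to match the actual header.
import pandas as pd

df = pd.read_csv('Supermart Grocery Sales - Retail Analytics Dataset.csv')

# Sales Revenue: total revenue across all rows
total_revenue = df['Sales'].sum()

# Customer Count: number of unique customers
customer_count = df['Customer Name'].nunique()

# Average Order Value: mean revenue per order
aov = df.groupby('Order ID')['Sales'].sum().mean()

# Region-wise Sales: revenue by region, highest first
region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)

print(total_revenue, customer_count, round(aov, 2))
print(region_sales)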
Step 2: Load the dataset and perform Data Preprocessing, Outlier Detection, and Exploratory Data
Analysis
To perform data preprocessing, outlier detection, and exploratory data analysis, I will use Python
with the Pandas, NumPy, SciPy, and Matplotlib libraries.
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Supermart Grocery Sales - Retail Analytics Dataset.csv')

# Data Preprocessing
# Check for missing values
print(df.isnull().sum())

# Handle missing values: impute numeric columns with the column mean
# (df.mean() on a mixed-type frame raises an error in recent pandas,
# so restrict it to numeric columns)
df.fillna(df.mean(numeric_only=True), inplace=True)

# Outlier Detection
# Apply the Z-score method to the numeric columns only
numeric_cols = df.select_dtypes(include=np.number)
z_scores = np.abs(stats.zscore(numeric_cols))
print(z_scores)

# Exploratory Data Analysis
# Summary statistics
print(df.describe())

# Visualize sales revenue by product category
# (df.plot(kind='bar') on the raw frame draws one bar per row, which is
# unreadable; aggregate first -- column names assumed, adjust as needed)
df.groupby('Item Category')['Sales'].sum().plot(kind='bar')
plt.ylabel('Sales Revenue')
plt.show()
Output:
Summary statistics of the dataset
Bar chart showing the distribution of sales revenue by product category
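The Z-scores above are printed but not acted on. A common follow-up, sketched here under the assumption that rows with any |z| > 3 should be treated as outliers (3 is a conventional cutoff, not dictated by the dataset), is to drop the flagged rows:
# Keep only rows whose numeric values all fall within 3 standard deviations
mask = (z_scores < 3).all(axis=1)
df_clean = df[mask]
print(f'Removed {len(df) - len(df_clean)} outlier rows')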
Step 3: Use Association Rule Mining technique to identify the items frequently bought together
and their demands
To perform association rule mining, I will use the Apriori algorithm implemented in the Python
library mlxtend. Its apriori function expects a one-hot encoded DataFrame of transactions, so the
rows are first grouped into per-order baskets and then encoded.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Convert the dataset to a transactional format: group items by order so
# each transaction is the list of items bought together
# (assumes an 'Order ID' column identifies a transaction; adjust if needed)
transactions = df.groupby('Order ID')['Item Name'].apply(list).tolist()

# One-hot encode the transactions, since apriori requires a boolean DataFrame
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Perform association rule mining
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# Print the top 10 rules
print(rules.head(10))
Output:
Top 10 association rules showing the items frequently bought together, with their support and confidence values
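Confidence alone can favor rules whose consequents are simply popular overall. Sorting by lift (a standard column in the mlxtend rules output) surfaces the pairings bought together more often than chance would predict:
# Rules with lift > 1 indicate items bought together more than by chance
top_rules = rules.sort_values('lift', ascending=False)
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))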
Step 4: Use Classification techniques to develop a model and predict the item categories and sub-
categories that would provide the highest sales and profit region-wise/state-wise
To perform classification, I will use the Scikit-learn library in Python.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare the dataset for classification: drop the targets, then one-hot
# encode the remaining categorical columns, since scikit-learn estimators
# require numeric features (identifier-like columns such as order IDs or
# dates would normally be dropped first as well)
X = pd.get_dummies(df.drop(['Item Category', 'Item Sub-Category'], axis=1))
y = df['Item Category']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rfc.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
Output:
Accuracy and classification report of the random forest classifier
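The classifier predicts a category from its features, but the region-wise ranking this step asks for can also be read off directly with a groupby. A sketch, assuming 'Region', 'Sales', and 'Profit' columns exist in the dataset:
# Aggregate sales and profit per region and category, then rank within
# each region to find the categories that perform best there
region_perf = (df.groupby(['Region', 'Item Category'])[['Sales', 'Profit']]
                 .sum()
                 .sort_values(['Region', 'Sales'], ascending=[True, False]))
print(region_perf.groupby(level='Region').head(3))  # top 3 categories per region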
Step 5: Modify the dataset to incorporate the Non-Volatile feature of data warehouse
Non-volatility means that data, once loaded into the warehouse, is never overwritten or deleted;
changes are captured by appending new records. To approximate this, I will add a Version column
so every row carries an explicit version number and later corrections can be appended as new
versions rather than edited in place.
# Create a new column 'Version' to track changes
df['Version'] = 1
# Save the modified dataset to a new CSV file
df.to_csv('Supermart Grocery Sales - Retail Analytics Dataset_Modified.csv', index=False)
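To honor non-volatility when a record later changes, the correction is appended as a new row with an incremented version instead of overwriting the original. A minimal sketch, assuming rows are identified by an 'Order ID' column; apply_correction is a hypothetical helper written for illustration:
def apply_correction(df, order_id, updates):
    """Append a corrected copy of a row as a new version; never edit in place."""
    current = df[df['Order ID'] == order_id]
    latest = current.loc[current['Version'].idxmax()].copy()
    for col, value in updates.items():
        latest[col] = value
    latest['Version'] = current['Version'].max() + 1
    # Append the new version; the old rows remain untouched (non-volatile)
    return pd.concat([df, latest.to_frame().T], ignore_index=True)

# Example (hypothetical order ID): correct a sales figure without losing history
# df = apply_correction(df, order_id='OD1001', updates={'Sales': 450.0})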