Experiment No. - 05
Aim: To develop Python code for basic and advanced data analysis on a given dataset.
Software: Python 3.13 as Interpreter and PyCharm as Integrated Development Environment.
Theory:
Data analysis is the process of transforming raw data into meaningful insights. It helps in understanding
patterns, drawing conclusions and supporting decision-making. In this experiment, data analysis is
performed on employee records using core Python constructs such as lists and dictionaries, without
relying on external libraries.
The dataset includes employee attributes such as name, age, salary and department. Such data can be
represented in Python as a list of dictionaries, where each dictionary corresponds to one record, or as
parallel lists with one list per attribute, as in the program below. Either structure lets the records be
iterated over so that various data analysis techniques can be applied.
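The list-of-dictionaries representation can be sketched as follows. This is a minimal illustration only: the three records are borrowed from the dataset used in the program, and the key names (`name`, `age`, `salary`, `department`) are assumptions chosen for readability.

```python
# Each dictionary is one employee record (key names are assumed for illustration).
employees = [
    {"name": "Alice", "age": 24, "salary": 50000, "department": "HR"},
    {"name": "Bob", "age": 27, "salary": 60000, "department": "IT"},
    {"name": "Charlie", "age": 22, "salary": 48000, "department": "HR"},
]

# Iterate over the records and collect one attribute from each.
salaries = [record["salary"] for record in employees]
total_salary = sum(salaries)

# Select the records that satisfy a condition on another attribute.
hr_names = [record["name"] for record in employees if record["department"] == "HR"]

print(f"Total salary: {total_salary}")
print(f"HR employees: {hr_names}")
```

Keeping all attributes of one employee together in a single dictionary avoids the index bookkeeping that parallel lists require.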
Mean, also called the average, is the sum of all values divided by the total number of values. For
instance, the average salary of employees can reveal overall compensation trends and help HR plan
future salary structures. Median is the middle value in an ordered dataset. It is less affected by
extreme values and is useful for understanding the central tendency when the data contains outliers.
Maximum and minimum values help identify the range of a dataset.
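A small sketch may help show why the median is less sensitive to extreme values than the mean. The helper names `mean` and `median` and the sample salaries (including the extreme value 250000) are hypothetical and are not part of the experiment's dataset.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


# Hypothetical salaries; the last value is deliberately extreme.
sample = [40000, 42000, 45000, 47000, 250000]

mean_value = mean(sample)
median_value = median(sample)
value_range = max(sample) - min(sample)

print(f"Mean:   {mean_value:.2f}")   # pulled upward by the extreme value
print(f"Median: {median_value}")     # stays near the typical salaries
print(f"Range:  {value_range}")
```

Here the single extreme salary raises the mean well above every other value in the list, while the median remains at 45000.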
Outliers are data points that significantly deviate from other observations. Identifying outliers is
crucial in fraud detection, performance analysis or sensor error checking. Outliers can be detected by
calculating the interquartile range (IQR), which is the difference between the third quartile (Q3) and
the first quartile (Q1). The first quartile (Q1) is the median of the lower half of the dataset and the
third quartile (Q3) is the median of the upper half.
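The quartile definitions above can be sketched as follows. This is a minimal illustration, assuming the commonly used fences of 1.5 × IQR below Q1 and above Q3 (the same multiplier the program in this experiment applies). The function name `iqr_outliers` and the sample values, including the extreme salary 250000, are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


def iqr_outliers(values):
    """Return (outliers, (lower, upper)) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[:n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])    # median of the upper half
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr                 # assumed fence multiplier: 1.5
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical salaries with one extreme value.
sample = [48000, 50000, 52000, 59000, 60000, 62000, 68000, 70000, 250000]
outliers, bounds = iqr_outliers(sample)

print(f"Bounds: {bounds}")
print(f"Outliers: {outliers}")
```

For an odd number of values, this variant leaves the middle value out of both halves. Other quartile conventions exist and can give slightly different bounds.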
Data can also be grouped by categories such as department to compute department-wise averages,
which are useful in assessing salary parity and age demographics within functional teams. Sorting
data based on salary or age helps prioritize records.
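Grouping and sorting can be sketched as below. The five records are taken from the program's dataset, but the list-of-dictionaries layout, the key names and the variable names are assumptions made for this illustration.

```python
employees = [
    {"name": "Alice", "age": 24, "salary": 50000, "department": "HR"},
    {"name": "Bob", "age": 27, "salary": 60000, "department": "IT"},
    {"name": "Charlie", "age": 22, "salary": 48000, "department": "HR"},
    {"name": "David", "age": 32, "salary": 75000, "department": "Finance"},
    {"name": "Eva", "age": 29, "salary": 62000, "department": "IT"},
]

# Group ages by department: department name -> list of ages.
ages_by_dept = {}
for record in employees:
    ages_by_dept.setdefault(record["department"], []).append(record["age"])

# Department-wise average age.
avg_age_by_dept = {dept: sum(ages) / len(ages) for dept, ages in ages_by_dept.items()}

# Sort records by salary, highest first, and keep the top three names.
by_salary = sorted(employees, key=lambda record: record["salary"], reverse=True)
top_names = [record["name"] for record in by_salary[:3]]

print(avg_age_by_dept)
print(top_names)
```

`sorted` returns a new list and leaves `employees` unchanged, so the original record order is still available for later steps.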
Applications of these basic data analysis techniques include HR analytics (e.g., determining average
salary per department), finance (e.g., identifying highest-paid roles), operations (e.g., identifying
departments with younger employees) and general business intelligence.
Python Programming Lab Data Analysis
Program:
# ------------------------------
# Basic and Advanced Data Analysis in Python
# ------------------------------
# Data stored in parallel lists for Name, Age, Salary and Department
names = ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace',
         'Hannah', 'Ian', 'Julia']
ages = [24, 27, 22, 32, 29, 24, 30, 28, 26, 31]
salaries = [50000, 60000, 48000, 75000, 62000, 52000, 70000, 68000,
            59000, 72000]
departments = ['HR', 'IT', 'HR', 'Finance', 'IT', 'HR', 'Finance', 'IT',
               'IT', 'Finance']
# ------------------------------
# BASIC DATA ANALYSIS
# ------------------------------
print("----- BASIC DATA ANALYSIS -----\n")
# Print table header
print("Name\tAge\tSalary\tDepartment")
# Display the first 5 employee records in tabular format
for i in range(5):
    print(f"{names[i]}\t{ages[i]}\t{salaries[i]}\t{departments[i]}")
# Calculate the average (mean) age and salary
mean_age = sum(ages) / len(ages)
mean_salary = sum(salaries) / len(salaries)
print(f"\nMean Age: {mean_age:.2f}")
print(f"Mean Salary: {mean_salary:.2f}")
Dr. D. K. Singh 2 National Fire Service College, Nagpur
# Find and display minimum and maximum values for age and salary
print(f"Minimum Age: {min(ages)}")
print(f"Maximum Age: {max(ages)}")
print(f"Minimum Salary: {min(salaries)}")
print(f"Maximum Salary: {max(salaries)}")
# Count number of employees in each department using a dictionary
dept_count = {}
for dept in departments:
    if dept in dept_count:
        dept_count[dept] += 1
    else:
        dept_count[dept] = 1
# Print department-wise employee counts
print("\nDepartment-wise Employee Count:")
for dept, count in dept_count.items():
    print(f"{dept}: {count}")
# ------------------------------
# ADVANCED DATA ANALYSIS
# ------------------------------
print("\n----- ADVANCED DATA ANALYSIS -----\n")
# Function to calculate the Pearson correlation coefficient between two lists
def correlation(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of product of deviations
    num = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n))
    # Denominator: product of the square roots of the summed squared deviations
    den_x = sum((x[i] - mean_x) ** 2 for i in range(n)) ** 0.5
    den_y = sum((y[i] - mean_y) ** 2 for i in range(n)) ** 0.5
    return num / (den_x * den_y)
# Compute correlation between age and salary
corr = correlation(ages, salaries)
print(f"Correlation between Age and Salary: {corr:.4f}")
# Calculate average salary for each department
dept_sums = {} # To store total salary per department
dept_counts = {} # To store employee count per department
for i in range(len(departments)):
    dept = departments[i]
    salary = salaries[i]
    if dept in dept_sums:
        dept_sums[dept] += salary
        dept_counts[dept] += 1
    else:
        dept_sums[dept] = salary
        dept_counts[dept] = 1
# Display department-wise average salary
print("\nAverage Salary by Department:")
for dept in dept_sums:
    avg_salary = dept_sums[dept] / dept_counts[dept]
    print(f"{dept}: {avg_salary:.2f}")
# ------------------------------
# Outlier Detection Using IQR
# ------------------------------
# Sort salary data to compute quartiles
sorted_salaries = sorted(salaries)
n = len(sorted_salaries)
# Function to compute median of an already sorted list
def median(data):
    mid = len(data) // 2
    if len(data) % 2 == 0:
        return (data[mid - 1] + data[mid]) / 2
    else:
        return data[mid]
# Calculate Q1 (lower quartile) and Q3 (upper quartile)
Q1 = median(sorted_salaries[:n // 2])
Q3 = median(sorted_salaries[(n + 1) // 2:])
IQR = Q3 - Q1 # Interquartile Range
# Calculate lower and upper bounds for detecting outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(f"\nIQR = {IQR}, Lower Bound = {lower}, Upper Bound = {upper}")
print("Outliers in Salary:")
# Identify and display salaries outside the lower and upper bounds
for i in range(n):
    if salaries[i] < lower or salaries[i] > upper:
        print(f"{names[i]}: {salaries[i]}")
Program Output: Data Analysis
----- BASIC DATA ANALYSIS -----

Name    Age     Salary  Department
Alice   24      50000   HR
Bob     27      60000   IT
Charlie 22      48000   HR
David   32      75000   Finance
Eva     29      62000   IT

Mean Age: 27.30
Mean Salary: 61600.00
Minimum Age: 22
Maximum Age: 32
Minimum Salary: 48000
Maximum Salary: 75000

Department-wise Employee Count:
HR: 3
IT: 4
Finance: 3

----- ADVANCED DATA ANALYSIS -----

Correlation between Age and Salary: 0.9701

Average Salary by Department:
HR: 50000.00
IT: 62250.00
Finance: 72333.33

IQR = 18000, Lower Bound = 25000.0, Upper Bound = 97000.0
Outliers in Salary: