[Approved by AICTE, Govt. of India & Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow, U.P., India]
Department of Computer Science & Engineering (AI)
Lab File
Data Analytics Lab
(BADS651)
ACADEMIC SESSION 2024-25
COURSE: B. TECH (CSE-AI)
SEM: VI
Submitted to:                          Submitted by:
Mr. Piyush Kushwaha                    Yugank Singh
Assistant Professor                    2201921520200
CSE(AI) Department
INDEX

S. No.  List of Programs                                                        Date of Experiment    Date of Submission    Signature

1.   To get the input from the user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in Python.
2.   To perform data import/export (.CSV, .XLS, .TXT) operations using data frames in Python.
3.   To get the input matrix from the user and perform matrix addition, subtraction, multiplication, inverse, transpose and division operations using the vector concept in Python.
4.   To perform statistical operations (Mean, Median, Mode and Standard deviation) using Python.
5.   To perform data pre-processing operations i) Handling Missing data ii) Min-Max normalization.
6.   To perform dimensionality reduction operation using PCA for Houses Data Set.
7.   To perform Simple Linear Regression with Python.
8.   To perform K-Means clustering operation and visualize it for the Iris data set.
9.   Write a Python script to diagnose any disease using KNN classification and plot the results.
10.  To perform market basket analysis using Association Rules (Apriori).
Program – 1
Aim: To get the input from the user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in Python.
Program:
import math
# Function to perform all the operations
def perform_operations():
    # Get a list of numbers from the user (space-separated)
    user_input = input("Enter numbers separated by space: ")
    # Convert the input string into a list of numbers
    numbers = list(map(float, user_input.split()))
    # Perform the operations
    max_value = max(numbers)
    min_value = min(numbers)
    sum_value = sum(numbers)
    avg_value = sum_value / len(numbers) if len(numbers) > 0 else 0
    sqrt_values = [math.sqrt(num) for num in numbers]
    rounded_values = [round(num, 2) for num in numbers]
    # Display the results
    print(f"Max Value: {max_value}")
    print(f"Min Value: {min_value}")
    print(f"Sum: {sum_value}")
    print(f"Average: {avg_value}")
    print(f"Square Root of each number: {sqrt_values}")
    print(f"Rounded values (to 2 decimal places): {rounded_values}")

# Call the function
perform_operations()
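Note that float() raises a ValueError if the user types a non-numeric token, so the program stops on bad input. A small defensive-input sketch follows; this helper is hypothetical and is not called by perform_operations() above.
# Sketch: re-prompt until the user supplies at least one valid number (hypothetical helper).
def read_numbers():
    while True:
        raw = input("Enter numbers separated by space: ")
        try:
            values = list(map(float, raw.split()))
            if values:  # require at least one number
                return values
        except ValueError:
            pass
        print("Invalid input, please enter numbers only.")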
Output:
Program – 2
Aim: To perform data import/export (.CSV, .XLS, .TXT) operations using data
frames in Python.
Program:
import pandas as pd
# Correct file paths using raw string (r"") or double backslashes (\\)
csv_path = r"D:\GL BAJAJ\DAata Analytics\customers-100.csv"
excel_path = r"D:\GL BAJAJ\DAata Analytics\Project-Management-Sample-Data.xlsx"
txt_path = r"D:\GL BAJAJ\DAata Analytics\sample-1.txt"
# Load CSV File
try:
    csv_data = pd.read_csv(csv_path)
    print("\nCSV Data:\n", csv_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading CSV file:", e)
# Load Excel File
try:
    excel_data = pd.read_excel(excel_path)
    print("\nExcel Data:\n", excel_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading Excel file:", e)
# Load TXT File (Tab-Separated)
try:
    txt_data = pd.read_csv(txt_path, sep="\t", engine="python", on_bad_lines="skip")  # Tab-separated values
    print("\nTXT Data:\n", txt_data.head())  # Show first 5 rows
except Exception as e:
    print("Error loading TXT file:", e)
Output:
Program – 3
Aim: To get the input matrix from the user and perform matrix addition, subtraction, multiplication, inverse, transpose and division operations using the vector concept in Python.
Program:
import numpy as np
# Function to get a matrix input from the user
def get_matrix_input():
    rows = int(input("Enter number of rows for the matrix: "))
    cols = int(input("Enter number of columns for the matrix: "))
    print(f"Enter the elements of the {rows}x{cols} matrix (row by row):")
    matrix = []
    for i in range(rows):
        row = list(map(float, input(f"Enter elements for row {i+1} separated by space: ").split()))
        matrix.append(row)
    return np.array(matrix)
# Function to perform matrix operations
def perform_operations(matrix1, matrix2):
    try:
        # Matrix Addition
        matrix_addition = matrix1 + matrix2
        print("Matrix Addition:\n", matrix_addition)
        # Matrix Subtraction
        matrix_subtraction = matrix1 - matrix2
        print("Matrix Subtraction:\n", matrix_subtraction)
        # Matrix Multiplication
        matrix_multiplication = np.dot(matrix1, matrix2)
        print("Matrix Multiplication:\n", matrix_multiplication)
        # Matrix Inverse (if square matrix)
        if matrix1.shape[0] == matrix1.shape[1]:
            matrix_inverse = np.linalg.inv(matrix1)
            print("Matrix Inverse:\n", matrix_inverse)
        else:
            print("Matrix 1 is not square, so inverse cannot be computed.")
        # Matrix Transpose
        matrix_transpose = np.transpose(matrix1)
        print("Matrix Transpose:\n", matrix_transpose)
        # Matrix Division (element-wise division)
        matrix_division = np.divide(matrix1, matrix2)
        print("Matrix Division (element-wise):\n", matrix_division)
    except Exception as e:
        print(f"Error during matrix operations: {e}")
# Main driver code
def main():
    print("Matrix Operations")
    # Get user input for two matrices
    print("Enter the first matrix:")
    matrix1 = get_matrix_input()
    print("Enter the second matrix:")
    matrix2 = get_matrix_input()
    # Perform the operations
    perform_operations(matrix1, matrix2)

# Run the program
main()
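np.divide above is element-wise only. For matrix division in the linear-algebra sense (solving A·X = B), a small self-contained sketch with example matrices assumed here:
# Sketch: matrix division as solving A @ X = B; assumes A is square and non-singular.
import numpy as np
A = np.array([[4.0, 7.0], [2.0, 6.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.linalg.solve(A, B)  # equivalent to inv(A) @ B, but numerically more stable
print("Solution of A @ X = B:\n", X)
print("Check A @ X:\n", A @ X)  # should reproduce B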
Output:
Program – 4
Aim: To perform statistical operations (Mean, Median, Mode and Standard
deviation) using Python.
Program:
import statistics
# Function to perform statistical operations
def perform_statistical_operations():
    # Get user input for the data
    data = list(map(float, input("Enter numbers separated by space: ").split()))
    # Mean
    mean_value = statistics.mean(data)
    print(f"Mean: {mean_value}")
    # Median
    median_value = statistics.median(data)
    print(f"Median: {median_value}")
    # Mode
    try:
        mode_value = statistics.mode(data)
        print(f"Mode: {mode_value}")
    except statistics.StatisticsError:
        print("Mode: No unique mode (multiple modes or no mode)")
    # Standard Deviation
    stdev_value = statistics.stdev(data)
    print(f"Standard Deviation: {stdev_value}")
# Call the function
perform_statistical_operations()
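statistics.stdev() returns the sample standard deviation (division by n - 1). A small cross-check sketch with NumPy on an assumed fixed data list, where the equivalent call needs ddof=1:
# Sketch: cross-check the statistics module against NumPy on a fixed list.
import numpy as np
import statistics
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print("Mean  :", statistics.mean(data), "vs", np.mean(data))
print("Median:", statistics.median(data), "vs", np.median(data))
print("Stdev :", statistics.stdev(data), "vs", np.std(data, ddof=1))  # sample standard deviation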
Output:
Program – 5
Aim: To perform data pre-processing operations i) Handling Missing data ii)
Min-Max normalization.
Program:
i) Handling Missing Data in Python:
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing data
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [5, np.nan, 7, 8, 9],
'C': [10, 11, 12, np.nan, 14]
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Data:")
print(df)
# i. Remove rows with any missing values
df_dropna = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(df_dropna)
# ii. Fill missing values with the mean of the column
df_fill_mean = df.fillna(df.mean())
print("\nDataFrame after filling missing values with column mean:")
print(df_fill_mean)
# iii. Fill missing values with a specific value (e.g., 0)
df_fill_zero = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_fill_zero)
# iv. Forward fill missing values (using the previous value)
df_fill_forward = df.ffill()
print("\nDataFrame after forward filling missing values:")
print(df_fill_forward)
# v. Backward fill missing values (using the next value)
df_fill_backward = df.bfill()
print("\nDataFrame after backward filling missing values:")
print(df_fill_backward)
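Mean imputation can also be done with scikit-learn, which is convenient when the same learned fill values must be reused on new data. A short sketch continuing with the df defined above (scikit-learn assumed to be installed):
# Sketch: mean imputation with scikit-learn's SimpleImputer as an alternative to fillna.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')  # learns the column means on fit
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print("\nDataFrame after SimpleImputer (mean strategy):")
print(df_imputed)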
Output:
ii) Min-Max Normalization in Python:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Create a sample DataFrame
data = {
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50],
'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Using pandas to perform Min-Max Normalization
df_min_max = (df - df.min()) / (df.max() - df.min())
print("\nDataFrame after Min-Max Normalization (using pandas):")
print(df_min_max)
# Alternatively, using scikit-learn's MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("\nDataFrame after Min-Max Normalization (using scikit-learn):")
print(df_scaled)
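As a quick hand check of the formula x' = (x - min) / (max - min): in column A, min = 1 and max = 5, so the value 3 maps to (3 - 1) / (5 - 1) = 0.5.
# Sketch: verify one normalized value of column 'A' by hand (min = 1, max = 5).
x, x_min, x_max = 3, 1, 5
print((x - x_min) / (x_max - x_min))  # 0.5, matching df_min_max['A'] at row index 2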
Output:
Program – 6
Aim: To perform dimensionality reduction operation using PCA for Houses Data
Set.
Program:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load the dataset (Ensure the correct file path)
file_path = r"D:\GL BAJAJ\DAata Analytics\House price data .xlsx"  # Raw string so backslashes are not treated as escapes
df = pd.read_excel(file_path, engine="openpyxl") # Ensure openpyxl is installed
# Display the first few rows
print("Original Dataset:\n", df.head())
# Step 1: Select numerical features for PCA
numeric_features = df.select_dtypes(include=[np.number]) # Select only numeric columns
numeric_features = numeric_features.dropna() # Drop rows with missing values
# Step 2: Standardize the Data (PCA works better with scaled data)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_features)
# Step 3: Apply PCA (Reduce to 2 principal components)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)
# Step 4: Analyze Explained Variance
explained_variance = pca.explained_variance_ratio_ * 100
print("\nExplained Variance by Each Principal Component:", explained_variance)
# Step 5: Create a DataFrame for PCA results
pca_df = pd.DataFrame(data=pca_result, columns=["PC1", "PC2"])
print("\nPCA Transformed Data (First 5 Rows):\n", pca_df.head())
# Step 6: Plot the PCA Components
plt.figure(figsize=(8, 5))
plt.scatter(pca_result[:, 0], pca_result[:, 1], c="blue", alpha=0.5)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA on House Prices Dataset")
plt.grid()
plt.show()
# Step 7: Check cumulative explained variance for all components
pca_full = PCA().fit(scaled_data)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_) * 100
# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker="o",
linestyle="--", color="red")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance (%)")
plt.title("Cumulative Explained Variance vs. Number of Components")
plt.grid()
plt.show()
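Instead of fixing two components, PCA can also be asked to keep as many components as needed to reach a variance target. A short sketch continuing with the scaled_data array prepared above:
# Sketch: let PCA pick the number of components that explains about 95% of the variance.
pca_95 = PCA(n_components=0.95)
reduced_95 = pca_95.fit_transform(scaled_data)
print("Components needed for 95% variance:", pca_95.n_components_)
print("Reduced data shape:", reduced_95.shape)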
Output:
Program – 7
Aim: To perform Simple Linear Regression with Python.
Program:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# 1. Prepare the dataset (for this example, let's generate some data)
# Generate a simple linear dataset
np.random.seed(0)
X = 2 * np.random.rand(100, 1) # Feature: 100 random values between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1) # Target: y = 4 + 3*X + random noise
# Convert to pandas DataFrame (optional)
data = pd.DataFrame({'X': X.flatten(), 'y': y.flatten()})
# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Create the Linear Regression model
model = LinearRegression()
# 4. Train the model
model.fit(X_train, y_train)
# 5. Make predictions
y_pred = model.predict(X_test)
# 6. Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
# 7. Visualize the results
plt.scatter(X_test, y_test, color='blue', label='Actual data')
plt.plot(X_test, y_pred, color='red', label='Predicted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()
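Since the data was generated as y = 4 + 3x + noise, the fitted parameters should land close to those values; a short check using the trained model above:
# Sketch: inspect the fitted parameters; they should be near the true intercept (4) and slope (3).
print("Intercept:", model.intercept_[0])
print("Slope    :", model.coef_[0][0])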
Output:
Program – 8
Aim: To perform K-Means clustering operation and visualize it for the Iris data set.
Program:
!pip install faiss-cpu
# Setup below is assumed: FAISS K-Means on the scaled Iris data, matching the
# plot title and the cluster output shown further down.
import numpy as np
import faiss
from scipy.stats import mode
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Step 1: Load the Iris dataset and standardize the features
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data).astype('float32')  # FAISS expects float32
# Step 2: Train K-Means with k = 3 clusters using FAISS
d = X_scaled.shape[1]  # number of features
kmeans = faiss.Kmeans(d, 3, niter=20, verbose=False)
kmeans.train(X_scaled)
# Step 3: Assign each sample to its nearest centroid
D, I = kmeans.index.search(X_scaled, 1)  # I holds one cluster index per sample
print("Cluster Assignments:", I.flatten())
unique, counts = np.unique(I, return_counts=True)
print("Cluster Distribution:", dict(zip(unique, counts)))
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Reduce dimensions using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Scatter plot of clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=I.flatten(), cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('FAISS K-Means Clustering on Iris Dataset')
plt.colorbar(label="Cluster")
plt.show()
from sklearn.metrics import accuracy_score
true_labels = iris.target # Actual labels from dataset
print("Accuracy (approximate):", accuracy_score(true_labels, I.flatten()))
true_labels = iris.target
# Create a mapping between predicted clusters and true labels
mapping = {}
for cluster in range(3):
    mask = (I.flatten() == cluster)  # Find all data points in this cluster
    if np.sum(mask) > 0:  # Ensure the mask is not empty
        most_common_label = mode(true_labels[mask], keepdims=True).mode[0]
        mapping[cluster] = most_common_label
# Map the predicted clusters to corrected labels
mapped_clusters = np.array([mapping[label] for label in I.flatten()])
# Compute accuracy
accuracy = accuracy_score(true_labels, mapped_clusters)
print("Corrected Accuracy:", accuracy)
Output:
Cluster Assignments: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0
0 2 2 2 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 0 2 2
0 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 0 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]
Cluster Distribution: {0: 56, 1: 50, 2: 44}
Accuracy (approximate): 0.22
Corrected Accuracy: 0.8133333333333334
Program – 9
Aim: Write a Python script to diagnose any disease using KNN classification and plot the results.
Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
df = pd.read_csv('diabetes.csv') # replace with your file path if needed
# Features and target
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25,
random_state=42)
# Hyperparameter tuning for KNN
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
# Plot accuracy vs. k
plt.figure(figsize=(10, 6))
plt.plot(k_range, cv_scores, marker='o')
plt.title('KNN Hyperparameter Tuning')
plt.xlabel('Number of Neighbors K')
plt.ylabel('Cross-Validated Accuracy')
plt.grid()
plt.show()
# Best k
best_k = k_range[cv_scores.index(max(cv_scores))]
print(f"Best K value: {best_k}")
# Train with best K
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)
y_pred_knn = knn_best.predict(X_test)
# Evaluation
print("\n✅ KNN Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Classification Report:\n", classification_report(y_test, y_pred_knn))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Disease', 'Disease'],
yticklabels=['No Disease', 'Disease'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - KNN')
plt.tight_layout()
plt.show()
# Optional: Compare with Random Forest and SVM
models = {
"Random Forest": RandomForestClassifier(random_state=42),
"SVM": SVC(),
"KNN": knn_best
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n🔍 {name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(classification_report(y_test, y_pred))
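Once tuned, the same scaler and KNN model can score a new patient record. A sketch with made-up feature values, assuming the file has the standard 8 Pima diabetes columns in the usual order (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age):
# Sketch: classify one new (made-up) patient record with the tuned KNN model.
# The 8 values below are illustrative and assume the standard Pima column order.
new_patient = pd.DataFrame([[2, 130, 70, 25, 80, 28.5, 0.45, 40]], columns=X.columns)
new_patient_scaled = scaler.transform(new_patient)
print("Predicted outcome (1 = disease):", knn_best.predict(new_patient_scaled)[0])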
Output:
Best K value: 7
✅ KNN Model Performance:
Accuracy: 0.6875
Classification Report:
precision recall f1-score support
0 0.74 0.78 0.76 123
1 0.57 0.52 0.55 69
accuracy 0.69 192
macro avg 0.66 0.65 0.65 192
weighted avg 0.68 0.69 0.68 192
🔍 Random Forest Accuracy: 0.73
precision recall f1-score support
0 0.80 0.78 0.79 123
1 0.62 0.65 0.64 69
accuracy 0.73 192
macro avg 0.71 0.72 0.71 192
weighted avg 0.74 0.73 0.74 192
🔍 SVM Accuracy: 0.73
precision recall f1-score support
0 0.77 0.82 0.80 123
1 0.64 0.57 0.60 69
accuracy 0.73 192
macro avg 0.71 0.69 0.70 192
weighted avg 0.72 0.73 0.73 192
🔍 KNN Accuracy: 0.69
precision recall f1-score support
...
accuracy 0.69 192
macro avg 0.66 0.65 0.65 192
weighted avg 0.68 0.69 0.68 192
Program – 10
Aim: To perform market basket analysis using Association Rules (Apriori).
Program:
!pip install mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Step 1: Define the dataset (list of transactions)
dataset = [
['milk', 'bread', 'nuts', 'apple'],
['milk', 'bread', 'nuts'],
['milk', 'bread'],
['milk', 'bread', 'apple'],
['milk', 'bread', 'apple']
]
# Step 2: Convert the list of transactions into one-hot encoded DataFrame
te = TransactionEncoder()
te_data = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_data, columns=te.columns_)
print("🧾 Transaction Data (One-Hot Encoded):")
print(df)
# Step 3: Apply Apriori to find frequent itemsets
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print("\n📦 Frequent Itemsets (Support >= 0.6):")
print(frequent_itemsets)
# Step 4: Derive Association Rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print("\n🔗 Association Rules (Confidence >= 0.7):")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
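The rules table can be ranked to surface the strongest associations first; a short sketch continuing from the rules DataFrame above:
# Sketch: rank the mined rules by lift, then confidence, strongest first.
top_rules = rules.sort_values(['lift', 'confidence'], ascending=False)
print("\nTop rules by lift:")
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())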
Output:
antecedents      consequents    support  confidence  lift
(milk)           (bread)            1.0         1.0   1.0
(apple, bread)   (milk)             0.6         1.0   1.0
(apple, milk)    (bread)            0.6         1.0   1.0
(apple)          (bread, milk)      0.6         1.0   1.0