Machine Learning - Practical Manual
Programs List:
1. Develop a Python program to import and export data using Pandas library functions.
2. Regression Models:
A) Implement a simple linear regression model.
B) Implement a multiple linear regression model.
3. Apply linear regression on a given dataset and evaluate its performance using
error metrics such as MSE and R-squared.
4. Demonstrate data pre-processing techniques, including feature scaling and
normalization, on a suitable dataset.
5. Build a Decision Tree Classification model for a given dataset and use it to predict a
new sample.
6. Implement classification using a Random Forest model for a sparse dataset and assess
its performance using a confusion matrix and classification report.
7. Implement classification using Support Vector Machines (SVM) and evaluate its
efficiency.
8. Write a Python program to implement the K-Means clustering algorithm.
9. Perform dimensionality reduction using the Principal Component Analysis (PCA)
method.
10. Develop a Python program to showcase various data visualization techniques.
11. Construct an Artificial Neural Network (ANN) / Convolutional Neural Network
(CNN) model with backpropagation for a given dataset.
12. Ensemble Learning Methods:
A) Implement the Random Forest ensemble method on a given dataset.
B) Implement a Boosting ensemble method on a given dataset.
1. Import and Export Data Using Pandas
import pandas as pd
# Create a sample dataframe
data = {'Name': ['Anita', 'Bobby', 'Mohammed'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Export to CSV
df.to_csv('sample_data.csv', index=False)
# Import from CSV
df_imported = pd.read_csv('sample_data.csv')
print("Imported Data:\n", df_imported)
Output:
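Pandas handles several other formats with the same read/write pattern. As a minimal sketch (not part of the original program), the same round trip using JSON:

import pandas as pd
# Create a sample dataframe
data = {'Name': ['Anita', 'Bobby', 'Mohammed'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Export to JSON (one record per row)
df.to_json('sample_data.json', orient='records')
# Import from JSON
df_json = pd.read_json('sample_data.json')
print("Imported JSON Data:\n", df_json)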
2. Linear Regression (Simple & Multiple)
(A) Simple Linear Regression
Type 1:
from sklearn.linear_model import LinearRegression
import numpy as np
# Original data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
# Reshape X to be a 2D column vector
X = X.reshape(-1, 1)
# Initialize and fit the model
model = LinearRegression()
model.fit(X, y)
# Predict the model with an input
y_pred = model.predict([[6]])
print(y_pred)
Output:
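The fitted parameters can also be inspected directly. A small self-contained sketch on the same data; since y = 2x here, the slope should come out near 2 and the intercept near 0:

from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression().fit(X, y)
# The learned line is y = coef * x + intercept
print("Coefficient (slope):", model.coef_[0])  # expected near 2.0
print("Intercept:", model.intercept_)          # expected near 0.0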
Type 2:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Example Dataset
X = [[1], [2], [3], [4]] # Features
y = [3, 6, 9, 12] # Labels
# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Training Model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
print(y_pred)
Output:
(B) Multiple Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Data
X = [[1,10],[2,20],[3,30],[4,40],[5,50],[6,60]]
y = [1,2,3,4,5,6]
# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
# Training Model
model = LinearRegression()
model.fit(X_train, y_train)
# Prediction
y_pred = model.predict(X_test)
print("X_test:", X_test)
print("Y_test:", y_test)
print("Y_Predicted:", y_pred)
Output:
3. Regression with Performance Metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Predefined linear dataset: Study Hours vs Exam Scores
study_hours = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11],
               [12], [13]]  # Study hours (1 to 13)
exam_scores = [30, 40, 50, 65, 75, 80, 90, 100, 115, 120, 130, 140,
               145]  # Exam scores, roughly y = 10x + 20
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(study_hours,
exam_scores, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict the target values
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# Calculate R-squared
r_squared = r2_score(y_test, y_pred)
print("R-squared:", r_squared)
Output:
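MSE and R-squared are not the only regression metrics. A short sketch of MAE and RMSE; the y_test/y_pred numbers below are made up purely for demonstration:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
y_test = [65, 120]      # example actual scores (illustrative)
y_pred = [63.5, 118.2]  # example predicted scores (illustrative)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)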
4. Feature Scaling and Normalization
Type 1:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
# Step 1: Create a Sample Dataset with Missing Values
data = {
    'Age': [22, np.nan, 30, 35, 40],                  # NaN introduced in Age
    'Salary': [25000, 50000, np.nan, 100000, 125000]  # NaN introduced in Salary
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Save dataset as CSV file
csv_filename = "salary_age.csv" # Adjust path if needed
df.to_csv(csv_filename, index=False)
print(f"Dataset saved as {csv_filename}")
# Step 2: Load CSV Dataset
df = pd.read_csv(csv_filename)
print("\n Original Data with Missing Values:\n", df)
# Step 3: Impute Missing Values (Using Mean)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df),
columns=df.columns)
print("\n Data After Imputation:\n", df_imputed)
# Step 4: Apply Standardization (Z-score normalization)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df_imputed),
columns=df.columns)
print("\n Standardized Data (Z-score Normalization):\n",
df_standardized)
# Step 5: Apply Min-Max Normalization (Scaling between 0 and 1)
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df_imputed),
columns=df.columns)
print("\n Min-Max Normalized Data (0 to 1 Scaling):\n", df_normalized)
Output:
Type 2:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# Sample data
data = np.array([[100, 0.001],
[200, 0.005],
[300, 0.002],
[400, 0.010]])
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(data)
# Standardization
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)
print("Standardized Data:\n", standardized_data)
Output:
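Both scalers can map scaled values back to the original units. A minimal sketch, reusing the Type 2 array with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[100, 0.001], [200, 0.005], [300, 0.002], [400, 0.010]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
# inverse_transform recovers the original values from the scaled ones
restored = scaler.inverse_transform(scaled)
print("Restored Data:\n", restored)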
5. Decision Tree Classification
Type 1:
For the following program, upload the “data.csv” file before running it. (Create the CSV
file by entering the data below in a spreadsheet and saving it as “data” in CSV format.)
The column names in the file must match the feature names used in the program. Note that
the target here (house price) is continuous, so Type 1 uses a Decision Tree Regressor;
Type 2 demonstrates a Decision Tree Classifier.

Square Footage | Number of Bedrooms | Neighborhood Rating | House Price (₹)
1200           | 2                  | 1                   | 1250000
2000           | 3                  | 2                   | 2100000
2800           | 4                  | 3                   | 3300000
3500           | 5                  | 3                   | 4250000
1200           | 2                  | 4                   | 1250000
2000           | 3                  | 4                   | 2100000
1000           | 2                  | 3                   | 1000000
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Load dataset from CSV file
data = pd.read_csv('data.csv')  # Replace 'data.csv' with the actual filename
# Selecting features and target variable
X = data[['Square Footage', 'Number of Bedrooms', 'Neighborhood Rating']].values  # Features
y = data['House Price (₹)'].values  # Target variable
# Splitting dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train Decision Tree Regressor
model = DecisionTreeRegressor(criterion='squared_error',
random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
print("X_test: \n", X_test)
print("Actual Y: \n", y_test)
print("Predictions:\n", y_pred)
Output:
Type 2:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
# Train model
model = DecisionTreeClassifier()
model.fit(X, y)
# Predict new sample
sample = np.array([[4, 5]])
prediction = model.predict(sample)
print("Prediction:", prediction)
Output:
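The splits a decision tree has learned can be printed as plain-text rules with sklearn's export_text. A short sketch on the same toy data; the feature names 'f1' and 'f2' are made up for display:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
model = DecisionTreeClassifier().fit(X, y)
# Print the decision rules of the trained tree
print(export_text(model, feature_names=['f1', 'f2']))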
6. Random Forest Classification
Type 1:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Sample dataset (Binary Classification)
X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [6, 8], [7, 8], [8, 9],
[9, 10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1]) # Labels (0 or 1)
# Splitting dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees in the forest
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Classification Report
class_report = classification_report(y_test, y_pred)
# Display Results
print(" Random Forest Classification Performance:")
print("\n Confusion Matrix:\n", conf_matrix)
print("\n Classification Report:\n", class_report)
Output:
Type 2:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import numpy as np
# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
# Train model
model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)
# Predict
y_pred = model.predict(X)
# Printing values
print("X: ",X)
print("Y: ",y)
print("Prediction: ",y_pred)
# Evaluate
cm = confusion_matrix(y, y_pred)
print("Confusion Matrix:\n", cm)
Output:
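A trained random forest also reports how much each feature contributed to its splits. A minimal sketch on the Type 2 data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
# Relative importance of each feature (the values sum to 1)
print("Feature importances:", model.feature_importances_)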
7. Support Vector Machines (SVM)
Type 1:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Sample dataset (Binary Classification)
X = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [6, 8], [7, 8], [8, 9],
[9, 10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1]) # Labels (0 or 1)
# Splitting dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train SVM Model (Using a Linear Kernel)
model = SVC(kernel='linear', random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Classification Report
class_report = classification_report(y_test, y_pred)
# Display Results
print("SVM Classification Performance:")
print("Accuracy Score:", accuracy)
print("\n Confusion Matrix:\n", conf_matrix)
print("\n Classification Report:\n", class_report)
Output:
Type 2:
from sklearn.svm import SVC
# Data
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 1, 0]  # Interleaved labels: not linearly separable, so a linear kernel cannot fit this perfectly
# Model
clf = SVC(kernel='linear')
clf.fit(X, y)
# Prediction
print("Predicted:", clf.predict([[2.5, 3.5]]))
Output:
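Since the Type 2 labels are not linearly separable, a non-linear kernel is worth trying. A sketch with the RBF kernel, which can often fit such interleaved patterns better:

from sklearn.svm import SVC
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 1, 0]
clf_rbf = SVC(kernel='rbf', gamma='scale')
clf_rbf.fit(X, y)
# Compare training accuracy against the linear kernel above
print("Training accuracy:", clf_rbf.score(X, y))
print("Predicted:", clf_rbf.predict([[2.5, 3.5]]))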
8. K-Means Clustering
Type 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
# Step 1: Create and Save a Sample Dataset
data = {
'Salary': [25000, 50000, 75000, 100000, 125000, 15000, 80000,
30000, 90000, 110000],
'Age': [22, 25, 30, 35, 40, 20, 32, 24, 37, 45]
}
df = pd.DataFrame(data)
# Save dataset as CSV file
csv_filename = "/content/salary_age.csv"  # Adjust path if running locally
df.to_csv(csv_filename, index=False)
print(f"Dataset saved as {csv_filename}")
# Step 2: Load CSV Dataset
df = pd.read_csv(csv_filename)
print("🔹 Original Data:\n", df)
# Step 3: Apply Feature Scaling - Standardization (Z-score normalization)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Step 4: Apply K-Means Clustering on Standardized Data
kmeans_standardized = KMeans(n_clusters=3, random_state=42)
df_standardized['Cluster'] = kmeans_standardized.fit_predict(df_standardized)
print("\n🔹 Clustering Results on Standardized Data:\n", df_standardized)
Output:
Type 2:
from sklearn.cluster import KMeans
import numpy as np
# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6], [8, 8]])
# Train K-Means
model = KMeans(n_clusters=2, random_state=42)
model.fit(X)
# Predictions
labels = model.predict(X)
print("Cluster Labels:", labels)
Output:
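After fitting, K-Means exposes the learned centres and the within-cluster sum of squares (inertia), which is useful for elbow-method tuning. A minimal sketch on the Type 2 data; n_init is set explicitly to avoid version-dependent defaults:

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [8, 8]])
model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
print("Cluster Centers:\n", model.cluster_centers_)
print("Inertia (within-cluster SSE):", model.inertia_)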
9. Principal Component Analysis (PCA)
Type 1:
from sklearn.decomposition import PCA
import pandas as pd
# Sample Dataset (House Features)
data = {
'Square Footage': [1200, 1500, 1800, 2100, 2500],
'Number of Bedrooms': [2, 3, 3, 4, 5],
'Number of Bathrooms': [1, 2, 2, 3, 3],
'Price (₹)': [5000000, 6000000, 7500000, 9000000, 12000000]
}
# Convert to DataFrame
X = pd.DataFrame(data)
# Apply PCA to reduce to 2 components
# (the features here are on very different scales; in practice, standardize before PCA)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("X_reduced: \n", X_reduced)
# Find the most important feature for each principal component
pc1_feature = X.columns[abs(pca.components_[0]).argmax()]
pc2_feature = X.columns[abs(pca.components_[1]).argmax()]
# Display chosen features
print("Principal Component 1 is mostly influenced by:", pc1_feature)
print("Principal Component 2 is mostly influenced by:", pc2_feature)
Output:
Type 2:
from sklearn.decomposition import PCA
import numpy as np
# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6]])
# Reduce to 1 principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print("Reduced Data:\n", X_reduced)
Output:
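How much information a PCA projection keeps can be read from explained_variance_ratio_. A minimal sketch on the Type 2 data; because the three points lie on a straight line, one component captures essentially all the variance:

from sklearn.decomposition import PCA
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6]])
pca = PCA(n_components=1)
pca.fit(X)
# Fraction of total variance captured by each retained component
print("Explained variance ratio:", pca.explained_variance_ratio_)  # expected close to 1.0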
10. Data Visualization
Type 1:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Sample dataset
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# Visualization
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Simple Linear Regression")
plt.show()
Output:
Type 2:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
# Step 1: Create and Save a Sample Dataset
data = {
'Salary': [25000, 50000, 75000, 100000, 125000, 15000, 80000,
30000, 90000, 110000],
'Age': [22, 25, 30, 35, 40, 20, 32, 24, 37, 45]
}
df = pd.DataFrame(data)
# Save dataset as CSV file
csv_filename = "/content/salary_age.csv"  # Adjust path if running locally
df.to_csv(csv_filename, index=False)
print(f"Dataset saved as {csv_filename}")
# Step 2: Load CSV Dataset
df = pd.read_csv(csv_filename)
print("🔹 Original Data:\n", df)
# Step 3: Apply Feature Scaling - Standardization (Z-score normalization)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Step 4: Apply K-Means Clustering on Standardized Data
kmeans_standardized = KMeans(n_clusters=3, random_state=42)
df_standardized['Cluster'] = kmeans_standardized.fit_predict(df_standardized)
# Step 5: Visualize Clusters (Standardized Data)
plt.figure(figsize=(10, 5))
plt.scatter(df_standardized['Salary'], df_standardized['Age'],
c=df_standardized['Cluster'], cmap='viridis')
plt.xlabel("Salary (Standardized)")
plt.ylabel("Age (Standardized)")
plt.title("K-Means Clustering on Standardized Data")
plt.colorbar(label="Cluster")
plt.show()
print("\n🔹 Clustering Results on Standardized Data:\n",
df_standardized)
Output:
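Program 10 asks for various visualization techniques; the two types above cover line and scatter plots. A small sketch of three more common plot types on made-up data (the values below are illustrative only):

import matplotlib.pyplot as plt
# Illustrative data
values = [5, 12, 9, 20, 14]
labels = ['A', 'B', 'C', 'D', 'E']
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(labels, values)         # Bar chart
axes[0].set_title("Bar Chart")
axes[1].hist(values, bins=5)        # Histogram
axes[1].set_title("Histogram")
axes[2].pie(values, labels=labels)  # Pie chart
axes[2].set_title("Pie Chart")
plt.tight_layout()
plt.show()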
11. Artificial Neural Network (ANN)/ Convolutional Neural Network (CNN)
Type 1: Artificial Neural Network (ANN)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Simple Dataset (Predicting Exam Score from Study Hours)
X = np.array([[1], [2], [3], [4], [5]], dtype=np.float32)       # Study Hours
y = np.array([[50], [55], [60], [65], [70]], dtype=np.float32)  # Exam Score
# Normalize Data (Scaling improves learning)
X = X / 10 # Scale study hours
y = y / 100 # Scale exam scores
# Define a Simple ANN Model
model = Sequential([
Dense(8, activation='relu', input_shape=(1,)),
Dense(1) # Output layer
])
# Compile and Train the Model
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=500, verbose=0)  # Increased epochs for better learning
# Test Prediction
test_input = np.array([[6]], dtype=np.float32) / 10 # Scale input
predicted_output = model.predict(test_input) * 100 # Rescale output
print("Predicted Exam Score for 6 study hours:", predicted_output[0][0])
Output:
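Keras's fit() returns a History object whose loss-per-epoch record can be plotted to check convergence. A self-contained sketch in the same style as the ANN above; the data here is a scaled-down stand-in for the study-hours example:

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]], dtype=np.float32)
y = np.array([[0.5], [0.55], [0.6], [0.65], [0.7]], dtype=np.float32)
model = Sequential([Dense(8, activation='relu', input_shape=(1,)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
history = model.fit(X, y, epochs=500, verbose=0)
# Loss for each epoch is stored in history.history['loss']
plt.plot(history.history['loss'])
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("ANN Training Loss")
plt.show()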
Type 2: Convolutional Neural Network (CNN)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
# Step 1: Create a Tiny Dataset (Black = 0, White = 1)
X = np.array([
np.zeros((5, 5)), # Black image (all 0s)
np.ones((5, 5)) # White image (all 1s)
]).reshape(-1, 5, 5, 1) # Reshape for CNN input
y = np.array([0, 1]) # Labels: 0 = Black, 1 = White
# Step 2: Define the Smallest CNN Model
model = Sequential([
Conv2D(2, (2,2), activation='relu', input_shape=(5,5,1)), # Very small Conv layer
Flatten(), # Convert to 1D
Dense(1, activation='sigmoid') # Output layer for binary classification
])
# Step 3: Compile & Train the Model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=10, verbose=1) # Train for just 10 epochs
# Step 4: Test the Model
test_image = np.ones((5, 5)).reshape(1, 5, 5, 1) # Test with a white image
prediction = model.predict(test_image)
predicted_label = "White" if prediction[0][0] > 0.5 else "Black"
print("Predicted Class:", predicted_label)
Output:
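To inspect the CNN's layer structure and parameter counts, Keras provides model.summary(). A minimal sketch rebuilding the same model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
model = Sequential([
    Conv2D(2, (2, 2), activation='relu', input_shape=(5, 5, 1)),
    Flatten(),
    Dense(1, activation='sigmoid')
])
# Prints each layer with its output shape and number of parameters
model.summary()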
12. Ensemble Learning Methods
(A) Random Forest
Type 1:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# Improved Sparse Dataset (Predicting Disease Risk)
data = {
    'Age': [25, 47, 35, 50, 29, 60, 55, 33, 48, 52],
    'Blood Pressure': [120, 140, 130, 145, 125, 150, 135, 128, 142, 148],
    'Cholesterol': [180, 220, 200, 240, 190, 260, 230, 195, 225, 250],
    'Smoker': [0, 1, 0, 1, 0, 1, 0, 0, 1, 1],  # 0 = Non-smoker, 1 = Smoker
    'Risk': [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]     # 0 = Low risk, 1 = High risk
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Feature Engineering: Adding BP/Cholesterol Ratio
df['BP_Cholesterol_Ratio'] = df['Blood Pressure'] / df['Cholesterol']
# Selecting features and target variable
X = df[['Age', 'Blood Pressure', 'Cholesterol', 'Smoker',
'BP_Cholesterol_Ratio']]
y = df['Risk']
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train Optimized Random Forest Classifier
model = RandomForestClassifier(
n_estimators=200, # More trees
max_depth=5, # Prevent overfitting
min_samples_split=3, # Reduce noise
min_samples_leaf=2, # Avoid deep trees
class_weight='balanced', # Handle class imbalance
random_state=42
)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
cross_val_scores = cross_val_score(model, X, y, cv=5)
print("Random Forest Predictions:", y_pred)
print("Model Accuracy:", accuracy)
print("Mean Cross-Validation Accuracy:", cross_val_scores.mean())
Output:
Type 2:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Sample dataset (the labels below are treated as discrete classes by the
# classifier, not as a continuous regression target)
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([0, 1, 4, 9, 16])
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
print("Random Forest Prediction:", model.predict(X))
Output:
(B) Boosting (AdaBoost)
Type 1:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Simplified Dataset (Disease Risk Prediction)
data = {
'Age': [25, 47, 35, 50, 29, 60, 55, 33, 48, 52],
'Blood Pressure': [120, 140, 130, 145, 125, 150, 135, 128, 142, 148],
'Risk': [0, 1, 0, 1, 0, 1, 0, 0, 1, 1] # 0 = Low risk, 1 = High risk
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Splitting dataset into features (X) and target labels (y)
X = df[['Age', 'Blood Pressure']]
y = df['Risk']
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Train AdaBoost Classifier (the 'estimator' argument requires scikit-learn >= 1.2;
# older versions call it 'base_estimator')
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=42)
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluate Model Performance
accuracy = accuracy_score(y_test, y_pred)
print("AdaBoost Predictions:", y_pred)
print("Model Accuracy:", accuracy)
Output:
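AdaBoost is only one of several boosting methods in scikit-learn. A sketch of the same task with GradientBoostingClassifier, reusing the toy disease-risk data from Type 1:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data = {
    'Age': [25, 47, 35, 50, 29, 60, 55, 33, 48, 52],
    'Blood Pressure': [120, 140, 130, 145, 125, 150, 135, 128, 142, 148],
    'Risk': [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
}
df = pd.DataFrame(data)
X_train, X_test, y_train, y_test = train_test_split(
    df[['Age', 'Blood Pressure']], df['Risk'], test_size=0.2, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Gradient Boosting Predictions:", y_pred)
print("Model Accuracy:", accuracy_score(y_test, y_pred))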
Type 2:
from sklearn.ensemble import AdaBoostClassifier
import numpy as np
# Sample dataset (as in the Random Forest example, the labels are treated as discrete classes)
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([0, 1, 4, 9, 16])
model = AdaBoostClassifier(n_estimators=50)
model.fit(X, y)
print("AdaBoost Prediction:", model.predict(X))
Output: