
PART A

(PART A: TO BE REFERRED BY STUDENTS)

Experiment No. 4
A.1 Aim:
To implement ensemble learning algorithms.

A.2 Prerequisite:
Python Basic Concepts

A.3 Outcome:
Students will be able to implement ensemble learning algorithms.

A.4 Theory:

Ensemble Learning Techniques in Machine Learning: Machine learning models suffer from bias
and/or variance. Bias is the difference between the values predicted by the model and the
actual values. Bias is introduced when the model does not capture the variation in the data
and instead builds an overly simple model. Such a simple model does not follow the patterns
in the data, and hence it makes errors on both the training and the testing data, i.e. it is
a model with high bias (underfitting).
When the model treats even random quirks of the data as patterns, it may do very well on the
training dataset, i.e. it has low bias, but it fails on the test data and therefore has high
variance (overfitting).
Therefore, to improve the accuracy (estimate) of the model, ensemble learning methods were
developed. An ensemble is a machine learning concept in which several models are trained
using machine learning algorithms. It combines low-performing classifiers (also called weak
learners or base learners) and aggregates their individual predictions to produce the final
prediction.
On the basis of the type of base learners, ensemble methods can be categorized as homogeneous
and heterogeneous. If the base learners are all of the same type, it is a homogeneous
ensemble method; if the base learners are of different types, it is a heterogeneous ensemble
method.
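
As an illustration (not part of the prescribed experiment), the following minimal scikit-learn sketch contrasts a homogeneous ensemble (many decision trees combined by bagging) with a heterogeneous ensemble (a decision tree and a k-NN classifier combined by voting). The toy dataset, estimator choices and variable names are assumptions made only for demonstration, and the 'estimator' parameter name assumes a recent scikit-learn version.

# Hypothetical illustration: homogeneous vs. heterogeneous ensembles
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Homogeneous ensemble: every base learner is of the same type (a decision tree)
homogeneous = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=25, random_state=0)

# Heterogeneous ensemble: base learners are of different types
heterogeneous = VotingClassifier(estimators=[
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('knn', KNeighborsClassifier())
])

print("Homogeneous :", cross_val_score(homogeneous, X, y, cv=5).mean())
print("Heterogeneous:", cross_val_score(heterogeneous, X, y, cv=5).mean())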

Ensemble Learning Methods


Ensemble techniques are classified into three types:

1. Bagging
2. Boosting
3. Stacking

Bagging
Consider a scenario where you are looking at users' ratings for a product. Instead of relying
on a single user's good/bad rating, we consider the average rating given to the product. With
the average rating, we can be considerably more sure of the quality of the product. Bagging
makes use of this principle: instead of depending on one model, it runs the data through
multiple models in parallel and averages their outputs to obtain the final output of the model.

What is Bagging? How does it work?

 Bagging is an acronym for Bootstrap Aggregation (Bootstrapped Aggregating). Bootstrapping
means random selection of records with replacement from the training dataset. 'Random
selection with replacement' can be explained as follows:

a. Consider that there are 8 samples in the training dataset. Out of these 8 samples,
every weak learner gets 5 samples as training data for the model. These 5 samples
need not be unique or non-repetitive.
b. The model (weak learner) is allowed to receive the same sample multiple times. For example,
as shown in the figure, Rec5 is selected 2 times by the model. Therefore, weak
learner 1 gets Rec2, Rec5, Rec8, Rec5, Rec4 as training data.
c. All the samples remain available for selection by the next weak learners. Thus all 8
samples will be available to the next weak learner, and any sample can be selected
multiple times by the next weak learners.

 Bagging is a parallel method, which means several weak learners learn the data
pattern independently and simultaneously. This is best shown in the diagram below:
1. The output of each weak learner is averaged to generate the final output of the model.
2. Since the weak learners' outputs are averaged, this mechanism helps to reduce
variance or variability in the predictions. However, it does not help to reduce the bias
of the model.
3. Since the final prediction is an average of the outputs of the weak learners,
each weak learner has an equal say or weight in the final output.
To summarize:

1. Bagging is Bootstrap Aggregation (Bootstrapped Aggregating)
2. It is a parallel method
3. The final output is calculated by averaging the outputs produced by the individual weak
learners
4. Each weak learner has an equal say
5. Bagging reduces variance
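
To make the bootstrap sampling idea concrete, here is a minimal illustrative sketch (not part of the prescribed experiment) that draws bootstrap samples with replacement and averages the predictions of the resulting weak learners. The dataset, number of learners, tree depth and variable names are assumptions chosen only for demonstration.

# Hypothetical sketch of bagging: bootstrap sampling with replacement + averaging predictions
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_learners = 25
predictions = []

for _ in range(n_learners):
    # Bootstrap sample: draw indices with replacement (a record may appear multiple times)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Shallow tree as the weak learner (the depth is an arbitrary choice)
    learner = DecisionTreeClassifier(max_depth=3)
    learner.fit(X_train[idx], y_train[idx])
    predictions.append(learner.predict(X_test))

# Each weak learner has an equal say: average the 0/1 predictions and take the majority
y_pred = (np.mean(predictions, axis=0) >= 0.5).astype(int)
print("Hand-rolled bagging accuracy:", accuracy_score(y_test, y_pred))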

Boosting
We saw that in bagging every model is given equal preference; but if one model predicts the
data more correctly than another, then a higher weightage should be given to that model.
Also, the ensemble should attempt to reduce bias. These ideas are applied in the second
ensemble method that we are going to learn, namely Boosting.

What is Boosting?

1. To start with, boosting assigns equal weights to all data points, as all points are
equally important in the beginning. For example, if a training dataset has N
samples, it assigns weight = 1/N to each sample.
2. A weak learner classifies the data. The weak classifier classifies some samples
correctly while making mistakes on others.
3. After classification, the sample weights are changed: the weight of a correctly classified
sample is reduced, and the weight of an incorrectly classified sample is increased. Then
the next weak classifier is trained on the re-weighted data.
4. This process continues until the model as a whole gives strong predictions.
Note: AdaBoost (Adaptive Boosting) is a classic boosting algorithm; in its original form it is designed for binary classification, although multiclass extensions exist. A minimal weight-update sketch is shown below.
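
As an illustration of the weight-update idea described above (and not the exact algorithm prescribed in this experiment), the following minimal AdaBoost-style sketch trains decision stumps sequentially on re-weighted samples. The choice of base learner, number of rounds and variable names are assumptions made only for demonstration.

# Hypothetical AdaBoost-style sketch: sequential weak learners with sample re-weighting
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)  # use {-1, +1} labels for the weight-update formulas
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

n_rounds = 20
n = len(X_train)
weights = np.full(n, 1.0 / n)        # step 1: equal weights = 1/N for every sample
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)         # weak learner (decision stump)
    stump.fit(X_train, y_train, sample_weight=weights)  # step 2: classify the weighted data
    pred = stump.predict(X_train)
    err = np.sum(weights[pred != y_train]) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))     # better learners get a bigger say
    # step 3: increase weights of misclassified samples, decrease those classified correctly
    weights *= np.exp(-alpha * y_train * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# step 4: the strong prediction is the weighted vote of all weak learners
scores = sum(a * s.predict(X_test) for a, s in zip(alphas, stumps))
y_pred = np.where(scores >= 0, 1, -1)
print("Hand-rolled boosting accuracy:", accuracy_score(y_test, y_pred))
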
PART B
(PART B : TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments within two hours of the practical.
The soft copy must be uploaded on Blackboard or emailed to the concerned lab in-charge
faculty at the end of the practical in case there is no Blackboard access available.)

Roll. No. BE-A10 Name: Nishad Sutar


Class: BE-Comps A Batch: A1
Date of Experiment: 28/07/2025 Date of Submission: 04/08/2025
Grade:

B.1 Software Code written by student:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.datasets import load_breast_cancer

# Load a dataset (Breast Cancer dataset for classification)


data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# No missing values in this dataset, but including imputation for completeness


# Identify categorical and numerical features
# The breast cancer dataset only has numerical features, but steps for categorical
# features are included for completeness
numerical_features = X.columns
categorical_features = []  # No categorical features in this dataset

# Create transformers for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a preprocessing pipeline


preprocess_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Apply preprocessing to the data


X_processed = preprocess_pipeline.fit_transform(X)

# Convert the processed data back to a DataFrame (optional, but useful for inspection)
# If there were categorical features, the column names would be different after one-hot
# encoding. For this dataset, since only numerical features are present and scaled,
# we can keep the original column names.
X_processed_df = pd.DataFrame(X_processed, columns=numerical_features)

# Split the preprocessed data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
    X_processed_df, y, test_size=0.2, random_state=42
)
print("Original data shape:", X.shape)
print("Processed data shape:", X_processed_df.shape)
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)

display(X_train.head())
display(y_train.head())

from sklearn.ensemble import BaggingClassifier


from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Instantiate a DecisionTreeClassifier as the base estimator


dt_classifier = DecisionTreeClassifier(random_state=42)

# Instantiate a BaggingClassifier with the corrected parameter name


bagging_classifier = BaggingClassifier(
    estimator=dt_classifier, n_estimators=100, random_state=42
)

# Train the BaggingClassifier


bagging_classifier.fit(X_train, y_train)

# Predict on the test data


y_pred_bagging = bagging_classifier.predict(X_test)

# Calculate the accuracy


accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

print(f"Bagging Classifier Accuracy: {accuracy_bagging:.4f}")

from sklearn.ensemble import GradientBoostingClassifier


from sklearn.metrics import accuracy_score

# Instantiate a GradientBoostingClassifier
boosting_classifier = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)

# Train the Boosting Classifier


boosting_classifier.fit(X_train, y_train)

# Predict on the test data


y_pred_boosting = boosting_classifier.predict(X_test)

# Calculate the accuracy


accuracy_boosting = accuracy_score(y_test, y_pred_boosting)

print(f"Boosting Classifier Accuracy: {accuracy_boosting:.4f}")

from sklearn.ensemble import StackingClassifier


from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Define a list of base models (estimators)


estimators = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', KNeighborsClassifier())
]

# Define a meta-classifier
meta_classifier = LogisticRegression(random_state=42)

# Instantiate StackingClassifier
stacking_classifier = StackingClassifier(
    estimators=estimators,
    final_estimator=meta_classifier,
    cv=5  # Cross-validation for training base models
)

# Train the StackingClassifier


stacking_classifier.fit(X_train, y_train)

# Make predictions on the test data


y_pred_stacking = stacking_classifier.predict(X_test)

# Calculate the accuracy


accuracy_stacking = accuracy_score(y_test, y_pred_stacking)

# Print the calculated accuracy


print(f"Stacking Classifier Accuracy: {accuracy_stacking:.4f}")

from sklearn.metrics import accuracy_score


from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Calculate accuracy for Decision Tree base model


dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred_dt = dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Calculate accuracy for K-Nearest Neighbors base model


knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

# Store accuracies in a dictionary


model_accuracies = {
    'Bagging': accuracy_bagging,
    'Boosting': accuracy_boosting,
    'Stacking': accuracy_stacking,
    'Decision Tree (Base)': accuracy_dt,
    'K-Nearest Neighbors (Base)': accuracy_knn
}

# Print accuracies for comparison


print("Model Accuracies:")
for model, accuracy in model_accuracies.items():
    print(f"{model}: {accuracy:.4f}")

import matplotlib.pyplot as plt

# Model names and accuracies


models = [
    'Decision Tree (Base)',
    'K-Nearest Neighbors (Base)',
    'Bagging',
    'Boosting',
    'Stacking'
]

accuracies = [0.9474, 0.9474, 0.9561, 0.9561, 0.9649]

# Convert to percentages for display


accuracies_percent = [acc * 100 for acc in accuracies]

# Define colors (optional)


colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

# Create the bar chart


plt.figure(figsize=(10, 6))
bars = plt.bar(models, accuracies_percent, color=colors)

# Add value labels on top of each bar


for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, height + 0.2,
             f'{height:.2f}%', ha='center', va='bottom', fontsize=10)

# Chart formatting
plt.title('Model Accuracy Comparison', fontsize=14)
plt.ylabel('Accuracy (%)', fontsize=12)
plt.ylim(94, 98)
plt.xticks(rotation=15)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot


plt.tight_layout()
plt.show()
B.2 Input and Output:
B.3 Observations and learning:
In this experiment, I implemented Ensemble Learning techniques, an approach in machine learning
designed to enhance model accuracy. I observed that the core principle of ensemble methods is to
combine the predictions of several base models, or "weak learners," to produce a single, superior
"strong learner." This strategy directly addresses the fundamental trade-off between bias and
variance that affects individual models. I explored three ensemble methods: Bagging, Boosting,
and Stacking. I noted that Bagging, or Bootstrap Aggregating, works in parallel, training multiple
models on different random subsets of the data and averaging their outputs to reduce variance. In
contrast, I observed that Boosting works sequentially, with each new model focusing on correcting
the errors made by its predecessor by adjusting data point weights, thereby reducing bias. Stacking
combines different types of base models by training a meta-classifier on their predictions.
B.4 Conclusion:
In conclusion, this experiment successfully achieved its aim of implementing Ensemble algorithms. I
have learned that instead of relying on a single model, combining multiple models can significantly
improve predictive performance and robustness. The experiment demonstrated that Bagging is an
effective method for reducing variance and preventing overfitting, while Boosting is a powerful
technique for reducing bias and building highly accurate classifiers. This practical implementation
reinforces the understanding that ensemble learning is a critical concept in machine learning, proving
that the collective "wisdom" of multiple models is often more powerful and reliable than any single
model alone.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and learning/observations)
