
INPLANT TRAINING REPORT

FOR TRAINING AT

INSTITUTE OF FUTURISTIC TECHNOLOGIES
(GOREGAON, EAST)

SUBMITTED BY:
YASH SAMPAT SANGALE

ENROLLMENT NO: 2205710132

DIPLOMA IN COMPUTER ENGINEERING

KALA VIDYA MANDIR INSTITUTE OF TECHNOLOGY


(POLYTECHNIC)
Plot no. M-3, R.S.C 19, MAHADA-World bank project, Malwani,

INDUSTRIAL TRAINING COMPLETION CERTIFICATE
This is to certify that Mr. YASH SAMPAT SANGALE, Enrollment
No. 2205710171, a third-year student of KVMIT Mumbai, has
successfully completed the Inplant Training of 06 weeks at our
organization, IOFT – 104, Classic Heritage Building, Aarey Rd, near Udipi Hotel, Peru
Baug, Churi Wadi, Goregaon, Mumbai, Maharashtra 400063.

Training Start Date: 03/06/2024


Training Completion Date: 13/07/2024

The performance and conduct of the above student were good
during the complete training period.

Name and Sign:


Section/Industry Supervisor:
Date:

NO OBJECTION CERTIFICATE
This is to certify that Mr. YASH SAMPAT SANGALE, Enrollment
No. 2205710171, a third-year student of KVMIT Mumbai, has
successfully completed the Inplant Training of 06 weeks at our
organization, IOFT – 04, Classic Heritage Building, Aarey Rd, near Udipi Hotel, Peru
Baug, Churi Wadi, Goregaon, Mumbai, Maharashtra 400063

From: 03/06/2024 to 13/07/2024.


This report does not contain any confidential documents of the
company, such as designs, drawings, formulas, specifications,
procedures, etc., which may cause any type of loss to the
company.
Training Start Date: 03/06/2024
Training Completion Date: 13/07/2024
The performance and conduct of the above student were good during
the complete training period.

Name and Sign:


Section/Industry Supervisor:
Date:
KALA VIDYA MANDIR INSTITUTE OF TECHNOLOGY
MUMBAI
Plot No. M-3, R.S.C 19, Gaekwad Nagar, Malad (W),
MUMBAI – 400095
2022-2023
CERTIFICATE
This is to certify that Mr. Yash Sangale, Enrollment No. 2205710171,
a third-year student of the Diploma in Computer Engineering at
KVMIT Polytechnic, Mumbai, has successfully completed 06 weeks of
training at the “Institute of Futuristic Technologies (Goregaon),
Computer Engineering Department” in partial fulfilment of the
Diploma in Computer Engineering during the fifth semester. The
training report has been approved by the concerned supervisors and
satisfies the academic requirements as per the subject curriculum.

(Polytechnic Supervisor) (Examiner)

(Head of the Department) (Principal)

Day 1: Introduction to Python
Overview of Python: Python is a high-level, interpreted programming
language known for its simplicity and readability. It was created by
Guido van Rossum and first released in 1991. Python's features
include:
 Interpreted: Python code is executed line by line by the
Python interpreter.
 High-level: Python abstracts many low-level
programming details, making it easier to write and
understand code.
 Dynamically typed: Variables in Python don't have explicit
types and can change type during execution.
 Versatile: Python is used in various domains such as web
development, data science, artificial intelligence, etc.
Python Installation and Setup: To get started with Python, we
recommend installing Anaconda, a Python distribution that includes
popular libraries for data science and machine learning. Anaconda
also comes with Jupyter Notebooks, an interactive computing
environment perfect for experimenting with Python code.
Basic Syntax: Python syntax is straightforward and easy to learn. Here
are some basic concepts:
 Variables: Variables are used to store data. They can
hold different types of data such as numbers, strings,
lists, etc.
 Data Types: Python supports various data types
including integers, floats, strings, booleans, etc.
 Operators: Python provides arithmetic operators (+, -, *, /),
comparison operators (==, !=, <, >), logical operators (and,
or, not), etc.
Control Structures: Control structures allow us to control the flow of
execution in our programs:

 Conditionals: Python supports if, elif, and else statements
for decision-making.

 Loops: Python provides for and while loops for iteration (a
short loop sketch follows the example below).
For example, here is a simple program using a conditional:
# Example of a simple Python program
name = input("Enter your name: ")
if name == "Alice":
    print("Hello, Alice!")
else:
    print("Hello, " + name + "!")

Output:
Enter your name: Bob
Hello, Bob!

This program prompts the user for their name and greets them
accordingly.
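As referenced above, here is a minimal sketch of a for loop and a while loop (the values are made up for illustration):
# Example of for and while loops
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)          # prints each item of the list in order

count = 0
while count < 3:
    print("Count is", count)
    count += 1            # increment so the loop eventually stops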

Day 2: Python Data Structures and Functions
Lists, Tuples, Dictionaries, Sets: Python provides several built-in data
structures for organizing and storing data:
 Lists: Ordered collection of items. Mutable, meaning they
can be modified after creation.
 Tuples: Similar to lists but immutable, meaning they cannot
be modified after creation.
 Dictionaries: Collection of key-value pairs. Keys are unique
and immutable, values can be of any data type.
 Sets: Unordered collection of unique items. Useful
for mathematical operations like union, intersection,
etc.
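A short sketch creating one of each structure listed above (the values are made up for illustration):
# Example of lists, tuples, dictionaries, and sets
my_list = [1, 2, 3, 2]                   # ordered and mutable
my_tuple = (1, 2, 3)                     # ordered but immutable
my_dict = {"name": "Alice", "age": 25}   # key-value pairs
my_set = {1, 2, 3, 2}                    # duplicates are removed automatically
print(my_list, my_tuple, my_dict, my_set)
# Output: [1, 2, 3, 2] (1, 2, 3) {'name': 'Alice', 'age': 25} {1, 2, 3}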
List Comprehensions: List comprehensions provide a concise way to
create lists. They consist of an expression followed by a for clause,
then zero or more for or if clauses.
# Example of list comprehension
numbers = [1, 2, 3, 4, 5]
squared_numbers = [num ** 2 for num in numbers]
print(squared_numbers) # Output: [1, 4, 9, 16, 25]

Functions: Functions are blocks of reusable code that perform a
specific task. They improve code modularity and reusability.
# Example of defining and calling a function
def greet(name):
    return "Hello, " + name + "!"
print(greet("Alice")) # Output: Hello, Alice!

Lambda Functions: Lambda functions, also known as anonymous
functions, are small, single-expression functions without a name.
# Example of a lambda function
add = lambda x, y: x + y
print(add(3, 4)) # Output: 7

Basic Input/Output Operations: Python provides built-in functions
for taking input from the user and displaying output.
# Example of input/output operations
name = input("Enter your name: ")
print("Hello, " + name + "!")

Output:
Enter your name: Bob
Hello, Bob!

These are some of the foundational concepts of Python
programming that you'll use extensively in your journey as a Python
programmer.

Day 3: Advanced Python Concepts
Modules and Packages: Modules in Python are simply Python files
with the .py extension that contain Python code. Packages are
directories that contain multiple modules.
# Example of importing a module
import math
print(math.sqrt(25)) # Output: 5.0

Error and Exception Handling: Errors in Python can be handled
using try-except blocks, allowing graceful recovery from potential
exceptions.
# Example of error handling
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero!")

Introduction to NumPy: NumPy is a powerful library for numerical
computing in Python. It provides support for multidimensional arrays
and matrices.
# Example of using NumPy
import numpy as np
arr = np.array([1, 2, 3])
print(arr) # Output: [1 2 3]

1. Array Creation
 numpy.array(): Create an array from a list or tuple.
 numpy.zeros(): Create an array filled with zeros.
 numpy.ones(): Create an array filled with ones.
 numpy.arange(): Create an array with a range of values.
 numpy.linspace(): Create an array with linearly spaced values.

 numpy.random(): Functions to create arrays with
random numbers, such as numpy.random.rand(),
numpy.random.randn(), etc.
2. Array Properties
 ndarray.shape: Get the shape of an array.
 ndarray.ndim: Get the number of dimensions of an array.
 ndarray.size: Get the number of elements in an array.
 ndarray.dtype: Get the data type of the array elements.
3. Array Manipulation
 numpy.reshape(): Change the shape of an array.
 numpy.flatten(): Flatten a multi-dimensional array to a
one-dimensional array.
 numpy.transpose(): Transpose the array (swap axes).
 numpy.concatenate(): Concatenate two or more arrays along
a specified axis.
 numpy.split(): Split an array into multiple sub-arrays.
4. Indexing and Slicing
 Basic slicing: array[start:stop:step].
 Boolean indexing: array[array > value].
 Fancy indexing: array[[index1, index2, ...]].
5. Mathematical Operations
 Arithmetic operations: +, -, *, /, etc., performed element-wise.
 numpy.sum(), numpy.mean(), numpy.std():
Aggregation functions.
 numpy.dot(): Dot product of two arrays.
 numpy.matmul(): Matrix multiplication.
 numpy.linalg.inv(): Inverse of a matrix.
 numpy.linalg.eig(): Eigenvalues and eigenvectors.
6. Statistical Functions
 numpy.min(), numpy.max(): Minimum and maximum values.
 numpy.median(): Median value.
 numpy.percentile(): Percentiles of the array elements.

7. Broadcasting
 Understanding how NumPy handles operations on arrays
of different shapes.
8. File I/O
 numpy.loadtxt(), numpy.genfromtxt(): Load data from text files.
 numpy.savetxt(): Save an array to a text file.
 numpy.save(), numpy.load(): Save and load arrays in
binary format (NumPy .npy files).
9. Special Functions
 numpy.fft: Fast Fourier Transform.
 numpy.polynomial: Polynomial functions.
 numpy.random: Random number generation and
random distributions.
10. Universal Functions (ufuncs)
 Functions that operate element-wise on arrays, such
as numpy.sin(), numpy.exp(), numpy.log(), etc.
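A brief sketch touching a few of the functions listed above (the array values are chosen arbitrarily):
# Example of a few NumPy operations
import numpy as np
a = np.arange(6)                 # array([0, 1, 2, 3, 4, 5])
b = a.reshape(2, 3)              # reshape into 2 rows x 3 columns
print(b.shape, b.ndim, b.dtype)  # (2, 3) 2 and an integer dtype (platform-dependent)
print(b.sum(), b.mean())         # 15 2.5
print(b + 10)                    # broadcasting: 10 is added to every element
print(np.dot(b, b.T))            # matrix product of b with its transpose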
Basic Plotting with Matplotlib: Matplotlib is a plotting library for
Python. It enables you to create a wide variety of plots, graphs, and
charts.
# Example of basic plotting with Matplotlib
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()

These advanced Python concepts build upon the basics covered
earlier, providing you with more tools and techniques to write
efficient and robust code.

Day 4: Data Manipulation with Pandas
Introduction to Pandas: Pandas is a powerful Python library for
data manipulation and analysis. It provides data structures like
Series and DataFrame, which are ideal for handling structured data.
# Example of importing Pandas
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print(df)

DataFrames: Creation, Indexing, and Selection: DataFrames are two-
dimensional labelled data structures with columns of potentially
different types. Indexing and selection operations allow you to access
specific rows and columns of a DataFrame.
# Example of indexing and selection in Pandas DataFrame
print(df['Name']) # Selecting a single column
print(df[['Name', 'Age']]) # Selecting multiple columns
print(df.iloc[0]) # Selecting a single row by index
print(df.loc[df['City'] == 'New York']) # Selecting rows based on a condition

Data Cleaning: Handling Missing Data, Data Transformation: Pandas
provides methods for handling missing data, such as dropping or
filling missing values. It also supports various data transformation
operations like merging, reshaping, and aggregating data.
# Example of handling missing data and data transformation
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print(df.dropna()) # Drop rows with missing values
print(df.fillna(0)) # Fill missing values with a specified value

Output:
Name Age City
0 Alice 25 New York
2 Charlie 35 Chicago
3 David 40 Houston
Name Age City
0 Alice 25.0 New York
1 Bob 0.0 Los Angeles
2 Charlie 35.0 Chicago
3 David 40.0 Houston
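The merging and aggregating operations mentioned above could look like the following minimal sketch, using made-up data:
# Example of merging and aggregating DataFrames
import pandas as pd
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'City': ['New York', 'Chicago', 'Chicago']})
salaries = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                         'Salary': [70000, 65000, 80000]})
merged = pd.merge(people, salaries, on='Name')    # join the two tables on the common column
print(merged)
print(merged.groupby('City')['Salary'].mean())    # average salary per city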

Pandas is an essential tool for data manipulation and analysis in
Python, and mastering its usage is crucial for working with
structured datasets effectively.

Day 5: Data Visualization with Matplotlib and Seaborn
Plotting with Matplotlib: Line Plots, Bar Plots, Histograms: Matplotlib
is a widely used Python library for creating static, interactive, and
animated visualizations. It supports various plot types, including line
plots, bar plots, histograms, scatter plots, etc.
# Example of creating line plot, bar plot, and histogram using Matplotlib
import matplotlib.pyplot as plt
# Line plot
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
# Bar plot
plt.bar(['A', 'B', 'C', 'D'], [10, 20, 15, 25])
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot')
plt.show()
# Histogram
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
plt.hist(data, bins=5)
plt.xlabel('Bins')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

Introduction to Seaborn for Statistical Plots: Seaborn is built on top
of Matplotlib and provides a high-level interface for drawing
attractive and informative statistical graphics. It simplifies the process
of creating complex visualizations.
# Example of using Seaborn for statistical plots
import seaborn as sns
# Load example dataset
tips = sns.load_dataset("tips")
# Scatter plot with linear regression line
sns.lmplot(x="total_bill", y="tip", data=tips)
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.title('Scatter Plot with Linear Regression')
plt.show()
# Box plot
sns.boxplot(x="day", y="total_bill", data=tips)
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.title('Box Plot')
plt.show()
# Violin plot
sns.violinplot(x="day", y="total_bill", data=tips)
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.title('Violin Plot')
plt.show()

Combining Multiple Plots: Matplotlib and Seaborn allow combining
multiple plots in a single figure to create complex visualizations for
better data exploration and analysis.
# Example of combining multiple plots using Matplotlib
plt.subplot(1, 2, 1)
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'r--')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Subplot 1')
plt.subplot(1, 2, 2)
plt.bar(['A', 'B', 'C', 'D'], [10, 20, 15, 25])
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Subplot 2')
plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()

Matplotlib and Seaborn are powerful visualization libraries that
play a crucial role in exploratory data analysis and communicating
insights from data.

Day 6: Introduction to Machine Learning
What is Machine Learning? Types of ML: Machine Learning (ML) is a
subset of artificial intelligence (AI) that focuses on the development
of algorithms and statistical models to enable computers to perform
tasks without explicit instructions. There are three main types of
ML:
1. Supervised Learning: In supervised learning, the model is
trained on a labeled dataset, meaning that each input data
point is associated with a corresponding target variable.
The goal is to learn a mapping from input to output.
2. Unsupervised Learning: Unsupervised learning involves training
the model on an unlabeled dataset, where the algorithm tries
to find patterns or intrinsic structures in the data. It's often
used for clustering and dimensionality reduction tasks.
3. Reinforcement Learning: Reinforcement learning is a type of
ML where an agent learns to make decisions by interacting
with an environment. It receives feedback in the form of
rewards or penalties, allowing it to learn the optimal behavior
through trial and error.
The ML Pipeline: Data Collection, Preprocessing, Model Building,
Evaluation: The ML pipeline outlines the typical workflow of a
machine learning project:
1. Data Collection: Gathering relevant data from various
sources, ensuring data quality, and understanding the problem
domain.
2. Data Preprocessing: Cleaning the data by handling missing values,
encoding categorical variables, scaling features, and splitting the
data into training and testing sets.
3. Model Building: Selecting an appropriate machine learning
algorithm based on the problem type and dataset, training the model
on the training data, and tuning hyperparameters to optimize
performance.
4. Evaluation: Assessing the model's performance on unseen data
using evaluation metrics such as accuracy, precision, recall, F1-score,
etc. It involves comparing the model's predictions with the actual
labels to measure
its effectiveness.
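A minimal sketch of the preprocessing step described above, assuming a small made-up dataset with one numeric feature, one categorical feature, and a binary target:
# Example of basic preprocessing: missing values, encoding, scaling, splitting
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
data = pd.DataFrame({'age': [25, 30, None, 40],
                     'city': ['Mumbai', 'Pune', 'Mumbai', 'Delhi'],
                     'target': [0, 1, 0, 1]})
data['age'] = data['age'].fillna(data['age'].mean())   # fill the missing value
data = pd.get_dummies(data, columns=['city'])          # encode the categorical variable
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)         # fit the scaler only on training data
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)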
Introduction to Scikit-Learn: Scikit-Learn is a popular machine
learning library in Python that provides simple and efficient tools for
data mining and data analysis. It offers various algorithms for
classification, regression, clustering, dimensionality reduction, and
model selection.
# Example of using Scikit-Learn to build and evaluate a machine learning model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Logistic Regression model
model = LogisticRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the testing data
predictions = model.predict(X_test)
# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Scikit-Learn provides a user-friendly interface for implementing
machine learning algorithms, making it accessible to both beginners
and experts in the field.

Day 7: Linear Regression
Simple Explanation of Linear Regression:
Linear regression is like drawing a straight line through a cloud of
points on a graph. Imagine you have a bunch of data points
scattered on a graph, and you want to find a line that best
represents the
overall trend of those points. This line helps you make predictions
about future data points based on their position relative to the line.

For example, think of a scenario where you have data on the number
of hours students study and their corresponding scores on a test. You
can use linear regression to find a line that best fits these data points.
Once you have this line, you can predict the score of a student based
on how many hours they study.

Deep Explanation of Linear Regression:
Linear regression is a statistical technique used to model the
relationship between a dependent variable (target) and one or more
independent variables (features). The goal is to find the equation of a
straight line that best fits the observed data points.

Mathematically, linear regression aims to minimize the sum of the
squared differences between the observed and predicted values. This
is typically done using the method of least squares, which finds the
line that minimizes the sum of the squared residuals (the vertical
distances between the observed data points and the line).

The equation of a simple linear regression model with one
independent variable can be represented as:
y = mx + b

where:
 y is the dependent variable (target)
 x is the independent variable (feature)
 m is the slope of the line (coefficient)
 b is the y-intercept
The slope (m) represents the change in the dependent variable
for a one-unit change in the independent variable, while the
y-intercept (b) represents the value of the dependent variable
when the independent variable is zero.
Implementation of Linear Regression:
Let's implement linear regression using Python and Scikit-Learn:
import numpy as np
from sklearn.linear_model import LinearRegression
# Example data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Print the coefficients
print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)
In this example, we have input data X and target data y. We create a
LinearRegression model and fit it to the data. After fitting the model,
we print the slope (coef_) and intercept (intercept_) of the line.
This implementation demonstrates how we can use Scikit-Learn to
perform linear regression and obtain the coefficients of the resulting
line.
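Building on this, a short sketch of how the fitted model could be used to predict a new, hypothetical value of x:
# Example of predicting with the fitted linear regression model
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression()
model.fit(X, y)
# Predict the target for x = 6 (a hypothetical new data point)
prediction = model.predict(np.array([[6]]))
print("Predicted value for x = 6:", prediction[0])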

Regression and classification are two different types of supervised
learning tasks in
machine learning, each with distinct objectives and methodologies:
1. Objective:
 Regression: The goal of regression is to predict a continuous
value output based on input features. It is used when the target
variable is numerical.
 Classification: The objective of classification is to predict the
category or class label of a data point based on its features. It
is used when the target variable is categorical.
2. Output:
 Regression: The output of regression models is a
continuous value or a range of values.
 Classification: The output of classification models is a
discrete class label or category.
3. Algorithm Types:
 Regression: Regression algorithms include Linear Regression,
Polynomial Regression, Ridge Regression, Lasso Regression,
etc.
 Classification: Classification algorithms include Logistic
Regression, Decision Trees, Random Forests, Support Vector
Machines (SVM), k-Nearest Neighbors (k-NN), etc.
4. Evaluation Metrics:
 Regression: Common evaluation metrics for regression
models include Mean Squared Error (MSE), Root Mean
Squared Error
(RMSE), Mean Absolute Error (MAE), R-squared (coefficient of
determination), etc.
 Classification: Evaluation metrics for classification models
include Accuracy, Precision, Recall, F1-score, ROC-AUC
(Receiver Operating Characteristic - Area Under Curve),
Confusion Matrix, etc.

5. Use Cases:
 Regression: Regression is used for predicting numerical
values such as house prices, stock prices, temperature, etc.
 Classification: Classification is used for tasks like spam
email detection, sentiment analysis, medical diagnosis,
image
classification, etc.
In summary, while regression focuses on predicting continuous
values, classification deals with predicting discrete class labels or
categories.
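For reference, a minimal sketch computing the regression metrics listed above on made-up actual and predicted values:
# Example of common regression evaluation metrics
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])    # made-up actual values
y_pred = np.array([2.8, 5.4, 6.9, 9.3])    # made-up predicted values
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print("MSE:", mse, "RMSE:", rmse, "MAE:", mae, "R-squared:", r2)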

Day 8: Classification Algorithms
Introduction to Classification
Classification is a fundamental task in machine learning where the
goal is to predict the category or class of an input data point based
on its features. In this session, we will focus on logistic regression, a
widely used classification algorithm.
Key Concepts:
 Binary Classification: Classifying data into two classes
or categories.
 Multiclass Classification: Classifying data into more than
two classes.
Logistic Regression
Logistic regression is a statistical method used for binary
classification. Despite its name, logistic regression is a
classification algorithm, not a regression algorithm. It predicts the
probability of occurrence of an event by fitting data to a logistic
function. The
output of logistic regression is a probability value between 0 and 1,
which can be interpreted as the likelihood of the input belonging to a
particular class.
Implementation in Python:
# Importing the necessary libraries
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Synthetic binary-classification example data (an assumption; any labelled X and y would work here)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
# Creating a Logistic Regression model
model = LogisticRegression()
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training the model
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

Output:
Accuracy: 0.85
Confusion Matrix:
[[90 10]
[17 83]]
Classification Report:
precision recall f1-score support
0 0.84 0.90 0.87 100
1 0.89 0.83 0.86 100
accuracy 0.85 200
macro avg 0.86 0.85 0.85 200
weighted avg 0.86 0.85 0.85 200
Interpretation:
 Accuracy: The proportion of correctly classified instances.
 Confusion Matrix: A table showing the number of correct
and incorrect predictions.
 Classification Report: Provides precision, recall, F1-score,
and support for each class.
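Because logistic regression outputs probabilities, predict_proba can be used to inspect them directly; a minimal sketch on synthetic data (assumed for illustration):
# Example of inspecting predicted probabilities from logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
model = LogisticRegression()
model.fit(X, y)
probabilities = model.predict_proba(X[:3])   # probability of class 0 and class 1 for three samples
print(probabilities)                          # each row sums to 1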
In the next session, we will delve deeper into decision boundaries
and explore more classification algorithms.

Day 9: Advanced Classification Algorithms
Today, we delve into more advanced classification algorithms,
including k-Nearest Neighbors (k-NN), Decision Trees, and Random
Forests. We will also cover model evaluation techniques for
imbalanced datasets.
1. Introduction to k-Nearest Neighbors (k-NN)

Concept:
 k-NN is a simple, non-parametric, and lazy learning
algorithm used for classification and regression tasks.
 It classifies a data point based on the majority class among its
k-nearest neighbors in the feature space.
Algorithm Steps:
1. Choose the number of neighbors, k.
2. Calculate the distance between the test point and all
training points.
3. Sort the distances and select the k-nearest neighbors.
4. Assign the class with the majority vote among the k-
nearest neighbors.
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Create k-NN classifier

knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train, y_train)
# Predict the test set results
y_pred = knn.predict(X_test)
# Evaluation
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
Confusion Matrix:
[[16 0 0]
[ 0 13 1]
[ 0 1 14]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 16
1 0.93 0.93 0.93 14
2 0.93 0.93 0.93 15
accuracy 0.96 45
macro avg 0.95 0.95 0.95 45
weighted avg 0.96 0.96 0.96 45
2. Decision Trees and Random Forests
Decision Trees:
 A decision tree is a flowchart-like tree structure where an
internal node represents a feature (or attribute), the branch
represents a decision rule, and each leaf node represents
the outcome.
 The tree splits the feature space recursively based on
feature values, aiming to maximize the information gain (or,
equivalently, minimize the Gini impurity).

Example:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Create Decision Tree classifier
dtree = DecisionTreeClassifier()
# Train the model
dtree.fit(X_train, y_train)
# Predict the test set results
y_pred = dtree.predict(X_test)
# Evaluation
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Plot the tree
plt.figure(figsize=(12, 8))
tree.plot_tree(dtree, filled=True)
plt.show()
Random Forests:
 Random Forests is an ensemble learning method that
constructs multiple decision trees during training and outputs
the class that is the mode of the classes of the individual
trees.
 It introduces randomness in feature selection and samples
to reduce overfitting and improve generalization.
Example:
from sklearn.ensemble import RandomForestClassifier
# Create Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf.fit(X_train, y_train)
# Predict the test set results
y_pred = rf.predict(X_test)
# Evaluation
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

3. Model Evaluation Techniques for Imbalanced Datasets
Problem with Imbalanced Datasets:
 In imbalanced datasets, one class is significantly more
frequent than the others. Standard accuracy metrics can be
misleading.
Evaluation Metrics:
 Precision: True Positives / (True Positives + False Positives)
 Recall: True Positives / (True Positives + False Negatives)
 F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
 ROC-AUC: Area Under the Receiver Operating
Characteristic Curve, which plots the True Positive Rate
against the False Positive Rate at various threshold settings.
Example:
from sklearn.metrics import roc_auc_score, roc_curve
# Note: this snippet assumes a binary classification problem; for the
# three-class Iris data used above, roc_auc_score would need multi-class handling.
# Predict probabilities for the positive class
y_prob = rf.predict_proba(X_test)[:, 1]
# Compute ROC-AUC
roc_auc = roc_auc_score(y_test, y_prob)
print(f'ROC-AUC: {roc_auc:.2f}')
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='best')
plt.show()
By the end of today's session, you should have a thorough
understanding of advanced
classification algorithms and how to evaluate models effectively,
especially in the context of
imbalanced datasets.

Day 10: Clustering Algorithms
In this session, we will explore clustering algorithms, specifically K-
Means Clustering and Hierarchical Clustering. We will also discuss
how to evaluate the performance of clustering algorithms.
Introduction to Clustering
Concept Recap:
 Clustering is an unsupervised learning technique used to
group similar data points together based on certain
characteristics.
 Unlike classification, clustering does not rely on labeled data.
Use Cases:
 Customer segmentation
 Anomaly detection
 Image segmentation
K-Means Clustering
Concept:
 K-Means aims to partition n data points into k clusters in
which each point belongs to the cluster with the nearest
mean.
 It iteratively assigns each data point to the nearest
cluster center and then re-calculates the cluster centers.
Steps:
1. Initialize k cluster centers randomly.
2. Assign each data point to the nearest cluster center.
3. Recalculate the cluster centers as the mean of the assigned points.
4. Repeat steps 2 and 3 until convergence.

Example Implementation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# Generate synthetic dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
random_state=0)
# Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Synthetic Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
# Predict the cluster for each data point
y_kmeans = kmeans.predict(X)
# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
 A scatter plot showing the clustered data points with different
colors representing different clusters and red dots
representing the cluster centers.
K-Means Algorithm Evaluation:
 Inertia: Sum of squared distances of samples to their
closest cluster center.
 Silhouette Score: Measure of how similar a point is to its
own cluster compared to other clusters.

Example of Evaluation:
from sklearn.metrics import silhouette_score
# Inertia
print("Inertia:", kmeans.inertia_)
# Silhouette Score
silhouette_avg = silhouette_score(X, y_kmeans)
print("Silhouette Score:", silhouette_avg)
Output:
Inertia: 119.11923306699614
Silhouette Score: 0.5582475738591506
Hierarchical Clustering
Concept:
 Hierarchical clustering creates a tree of clusters, known as
a dendrogram.
 It can be agglomerative (bottom-up) or divisive (top-down).
Steps (Agglomerative):
1. Treat each data point as a single cluster.
2. Merge the two closest clusters.
3. Repeat until there is only one cluster.
Example Implementation:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
# Generate synthetic dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
random_state=0)
# Compute the linkage matrix
Z = linkage(X, 'ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Output:
 A dendrogram showing the hierarchical merging of clusters.
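To obtain explicit cluster labels rather than only a dendrogram, scikit-learn's AgglomerativeClustering can be used; a minimal sketch on the same kind of synthetic data:
# Example of agglomerative clustering with scikit-learn
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X)    # cluster label assigned to each data point
print(labels[:10])              # first ten cluster assignments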
Hierarchical Clustering Algorithm Evaluation:
 Cophenetic Correlation Coefficient: Measures the
correlation between the cophenetic distances of all pairs of
points in the dataset and their original distances.
Example of Evaluation:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
# Calculate the cophenetic correlation coefficient
c, coph_dists = cophenet(Z, pdist(X))
print("Cophenetic Correlation Coefficient:", c)
Output:
Cophenetic Correlation Coefficient: 0.8043672687602522
Evaluating Clustering Performance
Methods:
 Silhouette Score: Measures how similar each point is to its
own cluster compared to other clusters.
 Davies-Bouldin Index: Measures the average similarity ratio
of each cluster with the cluster that is most similar to it.
Example of Silhouette Score and Davies-Bouldin Index:
from sklearn.metrics import davies_bouldin_score
# Silhouette Score
silhouette_avg = silhouette_score(X, y_kmeans)
print("Silhouette Score:", silhouette_avg)
# Davies-Bouldin Index
db_index = davies_bouldin_score(X, y_kmeans)
print("Davies-Bouldin Index:", db_index)
Output:
Silhouette Score: 0.5582475738591506
Davies-Bouldin Index: 0.4830227548077521

Summary:
 In this session, we've explored K-Means and
Hierarchical Clustering algorithms.
 We've covered their concepts, implementation with
practical examples, and evaluation methods.
 Understanding these clustering techniques and their
evaluation helps in effectively segmenting and analyzing data
in various
domains.

Day 11: Dimensionality Reduction
Dimensionality reduction techniques are used to reduce the number
of features (variables) in a dataset while retaining as much
information as possible. This can help to simplify models, reduce
computational cost, and mitigate the curse of dimensionality.
Introduction to Dimensionality Reduction
What is Dimensionality Reduction?
 It refers to the process of reducing the number of
random variables under consideration.
 It can be divided into two types: Feature Selection and
Feature Extraction.
Why is Dimensionality Reduction Important?
 Simplifies models to make them easier to interpret.
 Reduces computation time and resources.
 Helps to mitigate overfitting.
 Can improve the performance of machine learning algorithms.
Principal Component Analysis (PCA)
What is PCA?
 PCA is a statistical technique that transforms data into a set
of orthogonal (uncorrelated) variables called principal
components.
 The first principal component accounts for the most variance
in the data, and each subsequent component accounts for the
remaining variance under the constraint that it is orthogonal
to the preceding components.
Steps in PCA:
1. Standardize the data.
2. Compute the covariance matrix.
3. Compute the eigenvalues and eigenvectors of the
covariance matrix.
4. Sort the eigenvalues and their corresponding eigenvectors.
5. Choose the top k eigenvectors to form a new matrix.

6. Transform the original data set using this new matrix to get
the reduced dimensionality data set.
Example: Applying PCA in Python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load the Iris dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                 header=None)
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
X = df.iloc[:, 0:4].values
y = pd.factorize(df.iloc[:, 4].values)[0]  # encode class names as integers so they can be used as plot colours
# Standardize the data
sc = StandardScaler()
X_std = sc.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
# Plot the results
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.show()
Output:
 A scatter plot showing the first two principal components of
the Iris dataset, effectively reducing the dataset from four
dimensions to two.
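To see how much of the total variance each principal component captures, the fitted PCA object exposes explained_variance_ratio_; a minimal sketch:
# Example of inspecting the explained variance of each principal component
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
pca.fit(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_)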

Applications of PCA
Visualization:
 PCA can be used to reduce the dimensions of high-
dimensional data to 2 or 3 dimensions for visualization.
Noise Reduction:
 By keeping only the principal components with the
highest variance, PCA can help reduce noise.
Feature Extraction:
 PCA can help to create new features that are combinations of
the original features, capturing the most important
information.
Other Dimensionality Reduction Techniques
t-SNE (t-Distributed Stochastic Neighbor Embedding):
 Mainly used for visualization of high-dimensional data.
 Unlike PCA, t-SNE is non-linear and focuses on preserving
the local structure of data.
Example of t-SNE in Python
from sklearn.manifold import TSNE
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_std)
# Plot the results
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', edgecolor='k',
s=50)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE of Iris Dataset')
plt.show()
Output:
 A scatter plot showing the t-SNE transformation of the
Iris dataset.

LDA (Linear Discriminant Analysis):
 A supervised dimensionality reduction technique
that maximizes the separation between multiple
classes.
Example of LDA in Python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Apply LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_std, y)
# Plot the results
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', edgecolor='k',
s=50)
plt.xlabel('LDA Component 1')
plt.ylabel('LDA Component 2')
plt.title('LDA of Iris Dataset')
plt.show()
Output:
 A scatter plot showing the LDA transformation of the
Iris dataset.
Summary
Dimensionality reduction techniques like PCA, t-SNE, and LDA are
essential tools in the data scientist's toolkit. They help to simplify
complex datasets, making them more manageable and interpretable,
while often improving the performance of machine learning models
by reducing overfitting and computational cost.

Day 12: Model Validation and Tuning
Topics:
 Train-Test Split, Cross-Validation
 Hyperparameter Tuning: Grid Search and Random Search
 Overfitting and Underfitting: Strategies to mitigate
Train-Test Split, Cross-Validation
Train-Test Split:
Train-test split is a technique for evaluating the performance of a
machine learning model. It involves splitting the data into two sets:
one for training the model and the other for testing it. This ensures
that the model's performance is evaluated on unseen data.
Example:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
print(f"Training data shape:
{X_train.shape}") print(f"Testing data shape:
{X_test.shape}")
Output:
Training data shape: (105, 4)
Testing data shape: (45, 4)

Cross-Validation:
Cross-validation is a robust technique for assessing the
generalizability of a model. One common method is k-fold cross-
validation, where the data is divided into k subsets, and the model
is trained and tested k times, each time using a different subset as
the test set and the remaining as the training set.
Example:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Load dataset
X, y = load_iris(return_X_y=True)
# Initialize model
model = LogisticRegression(max_iter=200)
# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average cross-validation score: {scores.mean()}")
Output:
Cross-validation scores: [1. 0.967 0.967 0.967 1. ]
Average cross-validation score: 0.980
Hyperparameter Tuning: Grid Search and Random Search
Grid Search:
Grid Search involves searching through a manually specified subset of
the hyperparameter space of a learning algorithm. It’s a brute-force
approach where every combination of hyperparameters is tried, and
the best combination is selected.

Example:
from sklearn.model_selection import GridSearchCV
# Initialize model
model = LogisticRegression(max_iter=200)
# Define parameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'solver': ['liblinear', 'lbfgs']
}
# Initialize Grid Search
grid_search = GridSearchCV(model, param_grid, cv=5)
# Fit model
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")
Output:
Best parameters: {'C': 1, 'solver': 'liblinear'}
Best cross-validation score: 0.98
Random Search:
Random Search involves sampling a fixed number of hyperparameter
settings from a specified distribution. It is generally more efficient
than grid search when dealing with a large hyperparameter space.
Example:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
# Define parameter distribution
param_dist = {
'C': uniform(loc=0, scale=4),
'solver': ['liblinear', 'lbfgs']
}
# Initialize Random Search
random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, random_state=42)
# Fit model
random_search.fit(X, y)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")
Output:
Best parameters: {'C': 1.592, 'solver': 'liblinear'}
Best cross-validation score: 0.98
Overfitting and Underfitting: Strategies to Mitigate
Overfitting:
Overfitting occurs when a model learns the training data too well,
capturing noise along with the underlying pattern. This results in
poor generalization to new, unseen data.
Strategies to Mitigate Overfitting:
1. Simplify the Model: Reduce the complexity of the model.
2. Regularization: Add a penalty for larger coefficients (e.g., L1,
L2 regularization).
3. Prune Decision Trees: Remove parts of the tree that do
not provide power in predicting target variables.
4. Cross-Validation: Use cross-validation to tune
model hyperparameters.
Example:
from sklearn.linear_model import Ridge
# Initialize Ridge regression model with regularization
ridge_model = Ridge(alpha=1.0)
# Fit model
ridge_model.fit(X_train, y_train)
# Evaluate model
print(f"Ridge regression score: {ridge_model.score(X_test, y_test)}")
Output:

Ridge regression score: 0.93
Underfitting:
Underfitting occurs when a model is too simple to capture the
underlying pattern in the data, leading to poor performance on both
training and testing data.
Strategies to Mitigate Underfitting:
1. Increase Model Complexity: Use a more complex model.
2. Remove Regularization: Reduce or remove regularization.
3. Feature Engineering: Add more relevant features to the model.
4. Longer Training Time: Allow the model to train for more
epochs (in neural networks).
Example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Create a pipeline with polynomial features and linear regression
model = Pipeline([
('poly', PolynomialFeatures(degree=2)),
('linear', LinearRegression())
])
# Fit model
model.fit(X_train, y_train)
# Evaluate model
print(f"Polynomial regression score: {model.score(X_test, y_test)}")
Output:
Polynomial regression score: 0.95
By following these methods and understanding the balance between
overfitting and underfitting, along with using proper model validation
techniques, you can build robust and generalizable machine learning
models.

Project:
Topic: Take the GDP data of two countries (e.g., India and Brazil) and
store it in a CSV file. Using linear regression, predict the future GDP of
both countries.
We use NumPy, Pandas, scikit-learn, and Matplotlib.
Code & CSV data:

Output:

