SUBJECT: Data Science and Its Applications (21AD62)

MODULE-1 MACHINE LEARNING

Syllabus: Modeling, What Is Machine Learning?, Overfitting and Underfitting,
Correctness, The Bias-Variance Tradeoff, Feature Extraction and Selection, k-Nearest
Neighbors, The Model, Example: The Iris Dataset, The Curse of Dimensionality, Naive Bayes,
A Really Dumb Spam Filter, A More Sophisticated Spam Filter, Implementation, Testing Our
Model, Using Our Model, Simple Linear Regression, The Model, Using Gradient Descent,
Maximum Likelihood Estimation, Multiple Regression, The Model, Further
Assumptions of the Least Squares Model, Fitting the Model, Interpreting the Model, Goodness
of Fit, Digression: The Bootstrap, Standard Errors of Regression Coefficients, Regularization,
Logistic Regression, The Problem, The Logistic Function, Applying the Model, Goodness of
Fit, Support Vector Machines.

Modeling
A model is essentially a simplified representation of reality that helps us to understand, predict, or
control some aspect of the world. It captures the key features and relationships of the phenomena.
The primary goal of a machine learning model is to make predictions or decisions based on input
data.
It is simply a specification of a mathematical (or probabilistic) relationship that exists between
different variables.
For example, if we want to raise money for a social networking site, we might build a business
model that takes inputs like "number of users," "ad revenue per user," and "number of
employees" and outputs annual profit for the next several years. A cookbook recipe entails a
model that relates inputs like "number of eaters" and "hungriness" to quantities of ingredients
needed.
The business model is probably based on simple mathematical relationships: profit is revenue minus
expenses, revenue is units sold times average price, and so on. The recipe model is probably based on
trial and error: someone went into a kitchen and tried different combinations of ingredients until they
found one they liked. And a model for playing poker would be based on probability theory, the rules of poker, and
some reasonably innocuous assumptions about the random process by which cards are dealt.
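As a minimal illustration (the function name, inputs, and figures below are invented for this example, not taken from any real business), such a simple mathematical business model could be written as a small Python function:

Code:
def projected_profit(num_users, ad_revenue_per_user, num_employees, cost_per_employee=50000):
    # Toy business model: profit = revenue - expenses
    revenue = num_users * ad_revenue_per_user
    expenses = num_employees * cost_per_employee
    return revenue - expenses

# Hypothetical inputs: 1,000,000 users, $2 ad revenue per user, 30 employees
print(projected_profit(1000000, 2.0, 30))  # 500000.0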

What Is Machine Learning?


Machine learning refers to creating and using models that are learned from data. In other
contexts this might be called predictive modeling or data mining. Typically, the goal is to use
existing data to develop models that we can use to predict various outcomes for new data, such as:
 Predicting whether an email message is spam or not
 Predicting whether a credit card transaction is fraudulent
 Predicting which advertisement a shopper is most likely to click on
 Predicting which football team is going to win the Super Bowl

Overfitting and Underfitting


Overfitting means producing a model that performs well on the training data but generalizes poorly to any
new data. This could involve learning noise in the data, or it could involve learning to identify
specific inputs rather than whatever factors are actually predictive for the desired output.

Underfitting means producing a model that doesn't perform well even on the training data, or a model that
fails to capture the relationships between the input features and the outcome.

Consider the figure below, which fits three polynomials to a sample of data.

The horizontal line shows the best fit degree 0 polynomial. It severely underfits the training data.
The best fit degree 9 polynomial goes through every training data point exactly, but it very severely
overfits: if we were to pick a few more data points it would quite likely miss them by a lot. And
the degree 1 line strikes a nice balance: it is pretty close to every point, and the line will likely be
close to new data points as well. Clearly, models that are too complex lead to overfitting and do not
generalize well beyond the data they were trained on. The most fundamental approach involves using
different data to train the model and to test the model.
Overfitting
Causes:
 Too many parameters relative to the number of observations.
 Model complexity is too high.
 Insufficient training data.
Symptoms:
 High accuracy on training data.
 Low accuracy on validation/test data.

Example: Consider a polynomial regression problem where we are trying to fit a polynomial to
data that has a quadratic relationship.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate some data


np.random.seed(0)
X = np.random.normal(0, 1, 100)
y = 2 * X ** 2 + 3 + np.random.normal(0, 0.5, 100)

# Split the data into training and test sets


X_train = X[:80]
y_train = y[:80]
X_test = X[80:]
y_test = y[80:]

# Reshape data for sklearn


X_train = X_train[:, np.newaxis]
X_test = X_test[:, np.newaxis]

# Fit polynomial regression with a high degree


poly = PolynomialFeatures(degree=10)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
model = LinearRegression()
model.fit(X_poly_train, y_train)
y_poly_pred_train = model.predict(X_poly_train)
y_poly_pred_test = model.predict(X_poly_test)

# Plot the data and the polynomial regression line


plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.plot(np.sort(X_train[:, 0]), y_poly_pred_train[np.argsort(X_train[:, 0])],
         color='green', label='Polynomial fit')
plt.legend()
plt.show()

# Calculate and print training and test errors


train_error = mean_squared_error(y_train, y_poly_pred_train)
test_error = mean_squared_error(y_test, y_poly_pred_test)
print(f'Training error: {train_error}')
print(f'Test error: {test_error}')

In this example, using a polynomial of degree 10 leads to overfitting. The model fits the training data
very well, capturing noise, but generalizes poorly to the test data.
Underfitting
Causes:
 Model complexity is too low.
 Not enough features.
 Excessive regularization.
Symptoms:
 Low accuracy on training data.
 Low accuracy on validation/test data.
Example: Continuing with the same data, consider fitting a linear regression model:

# Fit a linear regression model


model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Plot the data and the linear regression line


plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='red', label='Test data')
plt.plot(np.sort(X_train[:, 0]), y_pred_train[np.argsort(X_train[:, 0])],
         color='green', label='Linear fit')
plt.legend()
plt.show()

# Calculate and print training and test errors


train_error = mean_squared_error(y_train, y_pred_train)
test_error = mean_squared_error(y_test, y_pred_test)
print(f'Training error: {train_error}')
print(f'Test error: {test_error}')

Here, using a linear regression model leads to underfitting. The model is too simple to capture the
quadratic relationship in the data.


Correctness
Correctness refers to how accurately a model's predictions align with the actual outcomes. It can
be quantified using various evaluation metrics depending on the type of problem and the specific
goals of the model.
In binary classification problems, such as determining whether an email is spam or not, the
performance of the model can be evaluated using a confusion matrix. This matrix summarizes the
outcomes of the predictions made by the model compared to the actual outcomes.
A confusion matrix is a table that is used to describe the performance of a classification
model.
Let us consider some data for building a model to make such a judgment.

                  Predict "Spam"        Predict "Not Spam"
Actual Spam       True Positive (TP)    False Negative (FN)
Actual Not Spam   False Positive (FP)   True Negative (TN)

Given a set of labeled data and such a predictive model, every data point lies in one of four
categories:
• True Positive (TP): An email is actually spam, and the model correctly identifies it as spam.
• False Positive (FP): An email is not spam, but the model incorrectly identifies it as spam.
• False Negative (FN): An email is spam, but the model incorrectly identifies it as not spam.
• True Negative (TN): An email is not spam, and the model correctly identifies it as not spam.
Correctness can be measured by

Accuracy: The proportion of total predictions that are correct.

Code:
def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))  # 0.98114

Precision: The proportion of positive predictions that are actually correct.



Code:
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))  # 0.014

Recall (Sensitivity or True Positive Rate): The proportion of actual positives that are
correctly identified.

Code:
def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))  # 0.005

F1 Score: The harmonic mean of precision and recall, providing a balance between the two.

Code:
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
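Applying this to the same confusion matrix counts used above (70 TP, 4,930 FP, 13,930 FN, 981,070 TN) shows how a model with very high accuracy can still have a very poor F1 score:

print(f1_score(70, 4930, 13930, 981070))  # approximately 0.00736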

Usually the choice of a model involves a trade-off between precision and recall.
 A model that predicts "yes" when it is even a little bit confident will probably have a high
recall but a low precision.
 A model that predicts "yes" only when it is extremely confident is likely to have a low recall
and a high precision.

The Bias-Variance Trade-off


• The bias-variance tradeoff is a key concept in understanding the performance of machine
learning models.
• Bias is the error due to overly simplistic assumptions in the learning algorithm.
• High bias can cause an algorithm to miss relevant relations between features and target outputs
(under fitting).
• Variance is the error due to too much complexity in the learning algorithm.
• High variance can cause an algorithm to model the random noise in the training data rather
than the expected outputs (over fitting).
• The tradeoff is about finding a balance between bias and variance to minimize total error.
• Typically, increasing model complexity will decrease bias but increase variance, while
decreasing complexity will increase bias but decrease variance.
• The goal is to find the spot where the model generalizes well to new data.
Bias and variance both measure what would happen if the model were retrained many times on
different sets of training data.
For example, the degree 0 model in "Overfitting and Underfitting" will make a lot of mistakes for
pretty much any training set (drawn from the same population), which means that it has a high bias.
However, any two randomly chosen training sets should give pretty similar models (since any two
randomly chosen training sets should have pretty similar average values). So we say that it has a low
variance. High bias and low variance typically correspond to underfitting.
On the other hand, the degree 9 model fits the training set perfectly. It has very low bias but very high
variance (since any two training sets would likely give rise to very different models). This
corresponds to overfitting.
Thinking about model problems this way can help you figure out what to do when your model does not
work so well.
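The following sketch (illustrative only, using synthetic data and NumPy polynomial fitting; the true relationship and sample sizes are invented for the example) makes this concrete: it retrains degree 0, 1, and 9 polynomial models on many randomly drawn training sets, then estimates how far the average prediction is from the truth (bias) and how much the predictions vary across training sets (variance).

Code:
import numpy as np

rng = np.random.default_rng(0)
x_eval = np.linspace(-1, 1, 50)          # points where predictions are compared

for degree in (0, 1, 9):
    predictions = []
    for _ in range(200):                 # 200 different training sets
        x = rng.uniform(-1, 1, 20)
        y = 2 * x + rng.normal(0, 0.3, 20)   # true relationship is linear
        coeffs = np.polyfit(x, y, degree)    # fit a polynomial of the given degree
        predictions.append(np.polyval(coeffs, x_eval))
    predictions = np.array(predictions)
    bias_sq = np.mean((predictions.mean(axis=0) - 2 * x_eval) ** 2)
    variance = predictions.var(axis=0).mean()
    print(f'degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}')

The degree 0 model should show high bias and low variance, while the degree 9 model should show low bias but much higher variance, matching the discussion above.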

Adding More Features

If the model has high bias, it means the model is too simple to capture the underlying patterns in the
data. In such cases, adding more features can help improve the model by providing it with more
information. For example, in the context of polynomial regression:

o A degree 0 model (just a constant) is too simple.


o A degree 1 model (a straight line) is better because it can capture linear relationships.
o A higher-degree model can capture more complex relationships.

Reducing Features or Adding More Data

 If the model has high variance, it means the model is too complex and is overfitting the
training data. Removing some features can help by simplifying the model, thus reducing the
variance.
 Another effective way to reduce or prevent high variance is to gather more data.
More data can help the model generalize better because it provides more examples for the
model to learn from, reducing the risk of overfitting.


Figure 11-2 shows a degree 9 polynomial fit to samples of different sizes. When the model is trained on 100
data points, there is much less overfitting, and the model trained on 1,000 data points looks very
similar to the degree 1 model. Holding model complexity constant, the more data you have, the harder it is to
overfit.
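One quick way to see this effect (a sketch with synthetic quadratic data; the sample sizes and noise level are arbitrary choices for illustration) is to fit the same degree 9 polynomial to increasingly large training sets and watch the test error shrink:

Code:
import numpy as np

rng = np.random.default_rng(1)
x_test = rng.uniform(-1, 1, 200)
y_test = 2 * x_test ** 2 + rng.normal(0, 0.3, 200)

for n in (10, 100, 1000):
    x_train = rng.uniform(-1, 1, n)
    y_train = 2 * x_train ** 2 + rng.normal(0, 0.3, n)
    coeffs = np.polyfit(x_train, y_train, 9)              # same degree 9 model each time
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f'n = {n}: test MSE = {test_mse:.3f}')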

Feature Extraction and Selection


Feature selection involves selecting a subset of the most important features for use in model
construction. This can improve the model's performance by reducing overfitting, speeding up the
training process, and improving the model's interpretability.
When the data doesn't have enough features, the model is likely to underfit. And when the data has
too many features, it's easy to overfit.
Features are whatever inputs we provide to our model.
In the simplest case, for example, if we want to predict someone's salary based on her years of experience, then
years of experience is the only feature we have.
In a more complicated case, imagine trying to build a spam filter to predict whether an email is junk or not.
To most models an email is just a collection of text, from which features must be extracted.
For example:
 Does the email contain the word "lottery"?
 How many times does the letter d appear?
 What was the domain of the sender?

The first is simply a yes or no, which we typically encode as a 1 or 0. The second is a number. And the
third is a choice from a discrete set of options.
Features extracted from our data fall into one of these three categories:
• The Naive Bayes classifier is suited to yes-or-no features.
• Regression models require numeric features.
• Decision trees can deal with numeric or categorical data.
Features are chosen through a combination of experience and domain expertise.
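As a small illustration of these three feature types (the helper function, sample email text, and domain list below are made up for the example), the spam-filter features above might be encoded like this:

Code:
def extract_features(email_text, sender_domain):
    # Toy feature extraction for a spam filter (illustrative only)
    contains_lottery = 1 if 'lottery' in email_text.lower() else 0   # yes/no encoded as 1 or 0
    count_d = email_text.lower().count('d')                          # numeric feature
    known_domains = ['gmail.com', 'yahoo.com', 'company.com']
    domain_one_hot = [1 if sender_domain == d else 0 for d in known_domains]  # categorical, one-hot encoded
    return [contains_lottery, count_d] + domain_one_hot

print(extract_features('You won the lottery! Reply today.', 'gmail.com'))
# [1, 1, 1, 0, 0]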

K-Nearest Neighbors (KNN) Algorithm


The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method used to
tackle classification and regression problems.
The KNN algorithm operates on the principle of similarity: it predicts the label or value of a new
data point by considering the labels or values of its K nearest neighbors in the training dataset.

Working of KNN:
Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors to be considered while making a prediction.
Step 2: Calculating distance
 To measure the similarity between the target and training data points, Euclidean distance is used.
The distance is calculated between each data point in the dataset and the target point.

Step 3: Finding Nearest Neighbors


 The k data points with the smallest distances to the target point are the nearest neighbors.

Step 4: Voting for Classification or Taking Average for Regression

 In a classification problem, the class label is determined by majority voting. The class with the
most occurrences among the neighbors becomes the predicted class for the target data point.
 In a regression problem, the predicted value is calculated by taking the average of the target values
of the K nearest neighbors. This average becomes the predicted output for the target data point.
Let X be the training dataset with n data points, where each data point xi is represented by a d-dimensional
feature vector, and let Y be the corresponding labels or values for each data point in X. Given a new data
point x, the algorithm calculates the distance between x and each data point xi in X using a distance
metric, such as the Euclidean distance:

d(x, xi) = sqrt( (x1 − xi1)² + (x2 − xi2)² + ... + (xd − xid)² )

The algorithm selects the K data points from X that have the shortest distances to x. For classification
tasks, the algorithm assigns the label y that is most frequent among the K nearest neighbors to x. For
regression tasks, the algorithm calculates the average or weighted average of the values y of the K
nearest neighbors and assigns it as the predicted value for x.

Python program to build a nearest neighbor model that can predict the class
from the IRIS dataset
import numpy as np
from collections import Counter
# Sample dataset (Iris data: sepal length, sepal width, petal length, petal width, species)
data = [
    (5.1, 3.5, 1.4, 0.2, 'setosa'),
    (4.9, 3.0, 1.4, 0.2, 'setosa'),
    (5.0, 3.6, 1.4, 0.2, 'setosa'),
    (6.7, 3.0, 5.0, 1.7, 'versicolor'),
    (6.3, 3.3, 6.0, 2.5, 'virginica'),
    (5.8, 2.7, 5.1, 1.9, 'virginica')
]
# New data point (sepal length, sepal width, petal length, petal width)
new_point = (5.5, 3.4, 1.5, 0.2)

# Function to calculate Euclidean distance


def euclidean_distance(point1, point2):
    return np.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))

# Function to get the nearest neighbors


def get_nearest_neighbors(data, new_point, k):
    distances = [(euclidean_distance(point[:-1], new_point), point[-1]) for point in data]
    distances.sort(key=lambda x: x[0])
    return [label for _, label in distances[:k]]


# Function to predict the class


def predict(data, new_point, k):
    nearest_neighbors = get_nearest_neighbors(data, new_point, k)
    most_common = Counter(nearest_neighbors).most_common(1)
    return most_common[0][0]

# Predict the class for the new data point


k = 3
predicted_class = predict(data, new_point, k)
print(f'The predicted class for the new point is: {predicted_class}')

The Curse of Dimensionality


The Curse of Dimensionality is a concept that describes the challenges and issues that arise when
working with high-dimensional data. As the number of dimensions increases, the volume of the
space increases exponentially. This means that data points become sparse, and the distances between
them grow, making it difficult to find meaningful patterns.
The Curse of Dimensionality impacts various aspects of data analysis, including distance
calculations, data sparsity, and overfitting.
1. Distance Measures Become Less Meaningful:
In high-dimensional spaces, the distances between points tend to become similar, making it harder
to distinguish between near and far points. This is problematic for algorithms that rely on distance
measures, such as k-Nearest Neighbors (k-NN) and clustering algorithms.
2. Data Sparsity:
With more dimensions, the data points spread out more, leading to sparsity. In a high-dimensional
space, even a large dataset may have very few data points in any given region. This sparsity makes
it hard to find reliable patterns and can reduce the effectiveness of algorithms.
3. Overfitting:
High-dimensional datasets often contain many irrelevant or noisy features, which can lead to
overfitting. The model may capture noise instead of the underlying pattern, performing well on
training data but poorly on unseen data.
Example
Consider a simple example using a dataset with points uniformly distributed in a unit cube. We can
observe how the volume and distances change as the number of dimensions increases.
1. Volume of a Hypercube:
In a 1-dimensional space, the unit hypercube is simply a line segment of length 1.
In a 2-dimensional space (a square), the unit hypercube has an area of 1.
In a 3-dimensional space (a cube), the unit hypercube has a volume of 1.
However, in higher dimensions, the volume of the unit hypercube becomes negligible compared to
the space it occupies. For instance:

 In a 10-dimensional space, the unit hypercube has a volume of 1^10 = 1.


 In a 100-dimensional space, the unit hypercube still has a volume of 1^100 = 1, but the space
it occupies is vast.
2. Distance Calculations:
In a 1-dimensional space, consider two points at 0 and 1. The distance between them is 1.
In a 2-dimensional space, consider two points (0, 0) and (1, 1). The Euclidean distance is sqrt(2).
In a 3-dimensional space, consider two points (0, 0, 0) and (1, 1, 1). The Euclidean distance is sqrt(3).
As dimensions increase, the distances between points increase as well. However, the difference
between the maximum and minimum distances shrinks relative to the average distance, making distances less
discriminative.
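A short simulation (a sketch using random points in the unit hypercube; the point counts and dimensions are arbitrary) makes this concentration of distances visible: as the dimension grows, the gap between the nearest and farthest point shrinks relative to the average distance.

Code:
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 10, 100, 1000):
    points = rng.random((500, d))             # 500 random points in the unit hypercube
    query = rng.random(d)                     # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    relative_gap = (dists.max() - dists.min()) / dists.mean()
    print(f'd = {d}: min = {dists.min():.3f}, max = {dists.max():.3f}, relative gap = {relative_gap:.3f}')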

Dimensionality Reduction
Dimensionality reduction is a process used to reduce the number of features (dimensions) in a
dataset while retaining as much information as possible. This technique helps in simplifying
models, reducing computational costs, and mitigating issues related to the curse of dimensionality.
Principal Component Analysis (PCA)
PCA is a widely used technique for dimensionality reduction that transforms the original features
into a new set of uncorrelated features called principal components. The first principal component
captures the most variance in the data, and each subsequent component captures the remaining
variance under the constraint of being orthogonal to the previous components.
Example: Dimensionality Reduction Using PCA
1. Standardize the Data: Standardization ensures that each feature contributes equally to the
analysis by scaling the data to have a mean of 0 and a standard deviation of 1.
2. Compute the Covariance Matrix: The covariance matrix describes the variance and the
covariance between the features.
3. Compute the Eigenvalues and Eigenvectors: The eigenvectors determine the directions of the
new feature space, while the eigenvalues determine their magnitude (i.e., the amount of variance
captured by each principal component).
4. Sort Eigenvalues and Select Principal Components: The eigenvalues are sorted in descending
order, and the top k eigenvalues are selected. The corresponding eigenvectors form the new feature
space.
5. Transform the Data: The original data is projected onto the new feature space to obtain the
reduced dataset.
CODE:
Example using Python to illustrate PCA for dimensionality reduction:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Generate a synthetic dataset

np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples, 3 features


# Standardize the data


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Print the explained variance ratios


print('Explained variance ratios:', pca.explained_variance_ratio_)

# Visualize the original and reduced data


fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Original data
ax[0].scatter(X[:, 0], X[:, 1], c='blue', label='Original Data')
ax[0].set_xlabel('Feature 1')
ax[0].set_ylabel('Feature 2')
ax[0].set_title('Original Data')

# Reduced data

ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c='red', label='PCA Reduced Data')
ax[1].set_xlabel('Principal Component 1')
ax[1].set_ylabel('Principal Component 2')
ax[1].set_title('PCA Reduced Data')
plt.legend()
plt.show()

Using Gradient Descent


Gradient descent is a fundamental optimization algorithm used in machine learning to minimize a
cost function. Gradient descent aims to find the parameters (weights) of a model that minimize the
cost function, which measures how well the model fits the data.
The steps are as follows:
1. Initialize Parameters: Start with random initial values for the parameters (weights).
2. Compute the Gradient: Calculate the gradient of the cost function with respect to each parameter.
The gradient is a vector of partial derivatives, indicating the direction and rate of the steepest
increase in the cost function.
3. Update Parameters: Adjust the parameters in the opposite direction of the gradient by a small
amount, which is determined by the learning rate. This step is repeated iteratively:
θi = θi − α ∂J(θ)/∂θi


where θi is the i-th parameter, α is the learning rate, and ∂J(θ)/∂θi is the partial derivative of the cost function J(θ) with respect to θi.
4. Convergence Check: Repeat steps 2 and 3 until the change in the cost function is smaller than a
predefined threshold or a maximum number of iterations is reached.
Gradient Descent for Parameterized Models
When fitting parameterized models, the cost function depends on the difference between the
predicted and actual values. For example, in linear regression, the cost function J(θ) is typically the
mean squared error:

J(θ) = (1/m) Σi=1..m ( hθ(x(i)) − y(i) )²

where:
 m is the number of training examples.
 hθ(x(i)) is the predicted value for the ith training example.
 y(i) is the actual value for the ith training example.
The gradient descent algorithm for linear regression would involve computing the partial
derivatives of J(θ) with respect to each parameter θj:

∂J(θ)/∂θj = (2/m) Σi=1..m ( hθ(x(i)) − y(i) ) · xj(i)
Python program to illustrate gradient descent for a simple linear regression model

import numpy as np

# Example data: y = 4 + 3x with some noise


np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add x0 = 1 to each instance


X_b = np.c_[np.ones((100, 1)), X]

# Parameters
learning_rate = 0.1
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(2, 1)


# Gradient Descent
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - learning_rate * gradients

print('Fitted parameters:', theta)
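As a quick sanity check (not part of gradient descent itself), the result can be compared with the closed-form normal-equation solution; both should be close to the true parameters of roughly [4, 3] used to generate the data above.

# Closed-form solution via the normal equation: theta = (X^T X)^(-1) X^T y
theta_exact = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
print('Normal equation solution:', theta_exact)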

Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem, with the
assumption that the features are independent given the class label.
This model predicts the probability that an instance belongs to a class given a set of feature values.
It is a probabilistic classifier. It is called naive because it assumes that one feature in the model is
independent of the existence of any other feature; in other words, each feature contributes to the
prediction with no relation to the others. It uses Bayes' theorem for training and prediction.
The core assumption of Naive Bayes is conditional independence.
Mathematically, if Xi and Xj represent the events that the ith and jth words are present in the message, then the
assumption says:
P(Xi and Xj / spam) = P(Xi / spam) · P(Xj / spam)
Sophisticated Spam Filter
Imagine now a vocabulary of many words W1, W2, W3, ..., WN. To move this into
probability theory, let Xi be the event "a message contains the word Wi." Also imagine that
we have an estimate P(Xi/S) for the probability that a spam message contains the ith word, and a similar
estimate P(Xi/N) for the probability that a non-spam message contains the ith word.
The key to Naive Bayes is the assumption that the presence of each word is independent of
the others, conditional on a message being spam or not.
Intuitively, this assumption means that knowing whether a certain spam message contains the word
"lottery" gives us no information about whether that same message contains the word "rolex." In math
terms, this means that:
P(X1 = x1, ..., Xn = xn / S) = P(X1 = x1 / S) × ... × P(Xn = xn / S)
Imagine that the vocabulary consists only of the words "lottery" and "rolex," and that half of all spam
messages are for "cheap rolex" and the other half are for "authentic lottery." In this case, the
Naive Bayes estimate that a spam message contains both "lottery" and "rolex" is:

P(lottery and rolex / S) = P(lottery / S) × P(rolex / S) = 0.5 × 0.5 = 0.25

since the words "lottery" and "rolex" actually never occur together. Despite how unrealistic this
assumption is, this model often performs well and is used in actual spam filters.
The same Bayes' theorem reasoning used for the "lottery-only" spam filter tells us that we can
calculate the probability a message is spam using the equation:

P(S / X=x) = P(X=x / S) / [P(X=x / S) + P(X=x / N)]


The Naive Bayes assumption allows us to compute each of the probabilities on the right simply by
multiplying together the individual probability estimates for each vocabulary word.

Let us consider a vocabulary with the words "rolex," "lottery," "meeting," and "project," and the following
probabilities based on historical data.

For the training data, assume the following likelihoods:
• Rolex : P(R/N) = 0.1 & P(R/S) = 0.8
• Lottery : P(L/N)=0.05 & P(L/S) = 0.7
• Meeting : P(M/N)=0.9 & P(M/S) = 0.2
• Project: P(P/N)=0.7 & P(P/S) = 0.25
Prior Probabilities:
• P(spam) = 0.3 (30% of messages are spam)
• P(Normal) = 0.7 (70% of messages are non-spam)
New Message containing the words “rolex" and "lottery."
Let us classify it as spam or not spam using Naive Bayes.
• Calculate the probability of the message being spam:
P(spam/message) ∝ P(S) · P(R/S) · P(L/S) = 0.3 × 0.8 × 0.7 = 0.168
• Calculate the probability of the message being normal (not spam):
P(normal/message) ∝ P(N) · P(R/N) · P(L/N) = 0.7 × 0.1 × 0.05 = 0.0035
• Normalize the probabilities: to get the actual probabilities, we normalize these values so they sum to 1.
• P(message) = P(S/message) + P(N/message) = 0.168 + 0.0035 = 0.1715
• So, the normalized probabilities are:
• P(S/message) = 0.168 / 0.1715 ≈ 0.98
• P(N/message) = 0.0035 / 0.1715 ≈ 0.02
Given the message contains the words “rolex" and "lottery," there is a 98% chance it is spam and a
2% chance it is not spam (normal).
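A few lines of Python (simply repeating the arithmetic above with the same assumed likelihoods and priors) confirm the normalized posteriors:

Code:
p_spam = 0.3 * 0.8 * 0.7      # P(S) * P(R/S) * P(L/S) = 0.168
p_normal = 0.7 * 0.1 * 0.05   # P(N) * P(R/N) * P(L/N) = 0.0035
total = p_spam + p_normal     # 0.1715
print(p_spam / total)         # approximately 0.98
print(p_normal / total)       # approximately 0.02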

Python Code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
data = {
'message': [
'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)',
'Nah I don\'t think he goes to usf, he lives around here though',
'WINNER!! As a valued network customer you have been selected to receive a £900 prize reward!',
'I HAVE A work ON SUNDAY !!',
'Had your mobile 11 months or more? U R entitled to update to the latest colour mobiles with camera for free! Call The Mobile Update Co FREE on 08002986030'
],
'label': ['spam', 'ham', 'spam', 'ham', 'spam']
}
# Convert data to DataFrame
df = pd.DataFrame(data)

# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']

# Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes classifier


nb = MultinomialNB()
nb.fit(X_train, y_train)

# Make predictions
y_pred = nb.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)


print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)

# Function to classify a new message


def classify_message(message):
    message_transformed = vectorizer.transform([message])
    prediction = nb.predict(message_transformed)
    return prediction[0]

# Test the classifier


new_message = 'Congratulations! You have won a free ticket to Bahamas. Call now!'
print(f'The message: "{new_message}" is classified as {classify_message(new_message)}')

Simple linear Regression


• Regression is a statistical technique used to model and analyze the relationships between
variables.
• It helps in understanding how the dependent variable (Y) changes when any one of the
independent variables (X) is varied.
• The primary goal of regression is to predict or estimate the value of the dependent variable based
on the values of one or more independent variables.
• Simple Linear Regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:
• Independent Variable (X): Also known as the predictor or explanatory variable.
• Dependent Variable (Y): Also known as the response or outcome variable.
• The goal of Simple Linear Regression is to model the relationship between these two variables
by fitting a linear equation to the observed data.
• The linear equation for a Simple Linear Regression model is:

Yi = β0 + β1 Xi + ϵi
Y is the dependent variable.
X is the independent variable.
β0 is the intercept of the regression line. It is the value of Y when X = 0.
β1 is the slope of the regression line. It represents the change in Y for a one-unit change in X.
ϵ is the error term, which accounts for the variability in Y that cannot be explained by the linear
relationship with X.
• To fit the Simple Linear Regression model, estimate the parameters β0 and β1 using Ordinary Least Squares (OLS).
• OLS minimizes the sum of the squared differences between the observed values and the values
predicted by the linear model.

After fitting the model, we can evaluate it using:


• Mean Squared Error (MSE): The average of the squared differences between the observed and
predicted values.

• R-squared (R²): The proportion of the variance in the dependent variable that is predictable from
the independent variable.

Code
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 7, 11])

# Calculate means
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculate the parameters


beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean


print(f'Estimated parameters: beta_0 = {beta_0}, beta_1 = {beta_1}')

# Make predictions
y_pred = beta_0 + beta_1 * x

# Plot the data and the regression line


plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y_pred, color='red', label='Regression line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
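To evaluate the fit using the metrics mentioned above, the MSE and R² can be computed directly from the residuals (a small addition to the program above, reusing its variables):

# Evaluate the fit
mse = np.mean((y - y_pred) ** 2)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y_mean) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f'MSE = {mse}, R-squared = {r_squared}')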

Multiple Regression
Multiple regression is a statistical technique used to understand the relationship between one
dependent variable and two or more independent variables. It extends simple linear regression, which
involves only one independent variable, by incorporating multiple predictors to better capture the
complexity of real-world phenomena.

The Multiple Regression Equation

The general form of the multiple regression equation is:

Y = β0 + β1X1 + β2X2 + … + βnXn + ϵ
 Y: Dependent variable.
 β0: Intercept, the expected value of Y when all Xs are zero.
 β1,β2,…,βn : Coefficients representing the change in Y for a one-unit change in the
corresponding X, holding other variables constant.
 X1,X2,…,Xn : Independent variables.
 ϵ: Error term, representing the deviation of observed values from the predicted values.

Steps in Multiple Regression Analysis

1. Model Specification: Define the dependent variable and select the independent variables
based on theoretical understanding or empirical evidence.
2. Data Collection: Gather data for the dependent and independent variables. Ensure the data is
clean and suitable for analysis.
3. Estimation of Coefficients: Use statistical software to estimate the coefficients (β) of
the regression equation. This is typically done using the Ordinary Least Squares (OLS)
method, which minimizes the sum of the squared differences between observed and predicted
values.
4. Model Evaluation: Assess the model's performance using various metrics:
o R-squared (R²): Proportion of variance in the dependent variable explained by the
independent variables.
o Adjusted R-squared: Adjusts R² for the number of predictors in the model.
o F-test: Tests the overall significance of the model.


o t-tests: Assess the significance of individual coefficients.


5. Assumption Checking: Ensure that the model meets the assumptions of multiple regression:
o Linearity: The relationship between the dependent and independent variables is linear.
o Independence: Observations are independent of each other.
o Homoscedasticity: Constant variance of errors across all levels of the independent
variables.
o Normality: Errors are normally distributed.
6. Diagnostics and Refinement: Perform residual analysis to check for any patterns in the
residuals that might indicate model misspecification. Address issues like multicollinearity
(high correlation among predictors) if they arise.
7. Interpretation: Interpret the coefficients to understand the impact of each independent
variable on the dependent variable. For example, a coefficient of 2 for X1 means that a one-unit
increase in X1 results in an average increase of 2 units in Y, holding other variables constant.
8. Prediction: Use the fitted model to make predictions for new data points by plugging in values
for the independent variables into the regression equation.
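A minimal sketch of these steps in Python (synthetic data with invented coefficients, using scikit-learn's LinearRegression as the OLS estimator) might look like this:

Code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: Y = 5 + 2*X1 + 0.5*X2 + noise (coefficients chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 5 + 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)               # OLS estimation of the coefficients
print('Intercept (beta_0):', model.intercept_)
print('Coefficients (beta_1, beta_2):', model.coef_)
print('R-squared:', r2_score(y, model.predict(X)))

# Prediction for a new observation with X1 = 1 and X2 = 2
print('Predicted Y:', model.predict([[1.0, 2.0]]))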

Need for fitting the model in multiple regressions


Fitting a model in multiple regression is essential for several reasons, each contributing to the
robustness, accuracy, and interpretability of the analysis. The reasons are:

1. Understanding Relationships between Variables

Multiple regression allows for the examination of the relationship between a dependent variable and
multiple independent variables simultaneously. This helps in understanding how various factors
collectively influence the outcome.

2. Controlling for Confounding Variables

In many real-world scenarios, the effect of one independent variable on the dependent variable might
be influenced by the presence of other variables. Multiple regression helps to isolate the effect of each
independent variable by controlling for others, reducing potential confounding effects.

3. Improved Prediction Accuracy

By incorporating multiple predictors, the model can capture more information about the dependent
variable, leading to better predictive accuracy compared to simple regression models with a single
predictor.

4. Identifying Significant Predictors

Multiple regression helps in identifying which independent variables have a significant impact on the
dependent variable. This is particularly useful in fields like economics, medicine, and social sciences,
where understanding the importance of various factors is crucial.

5. Quantifying the Impact of Variables


The coefficients in a multiple regression model quantify the impact of each independent variable on
the dependent variable, providing valuable insights into the strength and direction of these
relationships.

6. Handling Multicollinearity


In multiple regression, it's important to assess and handle multicollinearity (when independent
variables are highly correlated). Properly fitting the model involves diagnosing and addressing
multicollinearity to ensure reliable and interpretable results.

7. Generalizability of Findings

A well-fitted multiple regression model that accounts for multiple factors is more likely to generalize
to new data, making the findings more robust and applicable in various contexts.

8. Model Diagnostics and Validation

Fitting the model involves checking for assumptions (linearity, independence, homoscedasticity,
normality) and performing diagnostics (residual analysis, influence analysis) to ensure the validity of
the model. This step is crucial for the reliability of the regression results.

9. Enabling Complex Analyses

Multiple regression serves as a foundation for more complex analyses like interaction effects,
polynomial regression, and hierarchical regression, expanding the analytical capabilities for research
and decision-making.

10. Policy and Decision-Making

In applied fields, such as business, public policy, and healthcare, multiple regression models provide
evidence-based insights that inform strategic decisions and policy-making by highlighting key factors
and their relative importance.
