SUPERVISED MACHINE LEARNING
LINEAR REGRESSION
Linear Regression is primarily used for predicting continuous outcomes. It models the
relationship between a dependent variable and one or more independent variables by
fitting a straight line to the observed data. With a single input feature, this line is
described by the simple equation y = mx + c, and the goal is to predict a continuous
target variable from the input features.
Mathematics behind Linear Regression
At the heart of Linear Regression lies the concept of the cost function and hypothesis, which
we'll break down below:
1. Hypothesis: This is the function that produces the regression line used to predict the
output from the inputs. If we're trying to predict wine quality based on certain properties,
this hypothesis captures the linear relationship between our selected properties and the
wine's quality. The hypothesis is represented as hθ(x) = θ0 + θ1·x, where θ0 and θ1 are
the model's parameters.
2. Cost Function (or Loss Function): This term quantifies how wrong our model's
predictions are relative to the actual truth. We aim to minimize this function to achieve
the most accurate prediction. It is based on the Mean Squared Error (MSE) and is
given by J(θ0, θ1) = (1/(2m)) · Σ (hθ(x_i) − y_i)², where m is the total count
of observations and the summation over the squared differences (errors) ensures that
the higher the error, the greater the cost. The cost function is minimized using
Gradient Descent, which iteratively adjusts θ0 and θ1 to minimize the
cost function and derive a line that gives us the lowest possible error or cost.
Linear Regression Model Assumptions
1. Linearity: The relationship between the independent and dependent variables is
linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of errors.
4. Normality: Errors are normally distributed.
Designing Linear Regression Model
Design sketch (pseudocode) — gradient descent from scratch:

def gd(features, target):
    prediction = features @ theta
    error = prediction - target
    cost = mean(error ** 2)
    theta = theta - alpha * (features.T @ error) / m   # gradient step

predictions = features_test @ final_theta
r2_test = r2_score(target_test, predictions)

OR, using scikit-learn:

model = LinearRegression().fit(features_train, target_train)   # TRAINING: x and y both shown
predictions = model.predict(features_test)                     # TESTING: x shown, y predicted
r2_test = r2_score(target_test, predictions)                   # TEST SCORE
Implementing Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn import metrics

model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
mse = metrics.mean_squared_error(target_test, predictions)
r2 = metrics.r2_score(target_test, predictions)
print(mse, r2)
Evaluating the Model Performance - Metrics
Key metrics include:
● Mean Absolute Error (MAE)
● Mean Squared Error (MSE)
● R-squared (R²)
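As a quick point of reference, here is a minimal, illustrative sketch of computing these metrics with scikit-learn; y_test and predictions are assumed names for the actual and predicted values.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes y_test (actual values) and predictions (model output) already exist
mae = mean_absolute_error(y_test, predictions)  # average absolute error
mse = mean_squared_error(y_test, predictions)   # average squared error; penalizes large errors more
r2 = r2_score(y_test, predictions)              # proportion of variance explained
print(mae, mse, r2)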
LOGISTIC REGRESSION
Logistic Regression is a classification algorithm used to estimate the
probabilities of a binary response based on one or more predictor (also known
as independent) variables. It is particularly beneficial for binary outcomes,
meaning situations with only two possible results.
Our goal is to predict wine quality, which, as you may remember, ranges from
0 to 10. To keep things simple and focus on a binary classification problem,
let's classify the wines as good (a quality rating of 7 or above) and not good (a
quality rating below 7). Therefore, we will be using Logistic Regression to
predict whether the quality of a specific type of wine is 'good' or 'not good'
based on its physicochemical features.
All of this is achieved by using a logistic function, which limits the unbounded
outcome of the linear equation to a number between 0 and 1. Also known as
the Sigmoid function, this logistic function is an S-shaped curve that maps
any real-valued number into a value falling within these bounds. The
function is defined as

σ(x) = 1 / (1 + e^(−x))

In this equation, x represents the output of a linear combination of feature
values and their corresponding coefficients,

x = β0 + β1·X1 + β2·X2 + … + βn·Xn

In this equation:
● β (Beta) terms are the model's parameters or weights, signifying the
influence of each input feature (denoted by X) on the predicted
outcome. X terms represent independent predictor variables.
Once we compute the predicted probability (p) using the Sigmoid function, we
can assign classes by defining a threshold (which is generally 0.5):
● If p≥0.5, the label for the example is 1 (or Good in our case).
● If p<0.5, the label for the example is 0 (or Not Good in our case).
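For illustration, here is a tiny NumPy sketch of the Sigmoid function and the 0.5 threshold rule described above; the sample values are made up for demonstration.

import numpy as np

def sigmoid(z):
    # Maps any real-valued number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # outputs of the linear combination β0 + β1·X1 + ...
p = sigmoid(z)                   # predicted probabilities
labels = (p >= 0.5).astype(int)  # 1 = 'Good', 0 = 'Not Good'
print(p, labels)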
The next critical component in Logistic Regression is the cost function. Unlike
in Linear Regression, we can't use Mean Squared Error as the cost function
because the logistic function would introduce a non-linear term into the cost
function, making it non-convex. In Logistic Regression, the cost function
(the cross-entropy, or log, loss) is defined as:

J(θ) = −(1/m) · Σ [ y · log(hθ(x)) + (1 − y) · log(1 − hθ(x)) ]

Where:
● θ represents the parameters we must determine using an
optimization algorithm to minimize the cost function.
● m is the number of samples.
● y and x represent the target and input of each sample, respectively.
● hθ(x) is the logistic function that computes the predicted probability
that y=1.
While discussing the cost function, it's crucial to consider optimization
algorithms like Gradient Descent used to find the parameters θ to minimize
this cost.
Disclaimer: in most scenarios, you don't have to remember or implement the
cost function yourself, as there are plenty of libraries (e.g., scikit-learn)
that provide a built-in implementation of Logistic Regression. However,
it's still essential to understand the high-level concepts and what's being
optimized.
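To make the cost function concrete, here is a small illustrative NumPy sketch of the cross-entropy loss for some hypothetical labels and predicted probabilities; it sketches the idea rather than scikit-learn's internal implementation.

import numpy as np

def cross_entropy(y, p):
    # y: actual labels (0 or 1); p: predicted probabilities hθ(x)
    m = y.size
    return -(1 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])  # hypothetical predicted probabilities
print(cross_entropy(y, p))     # lower is better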
Logistic Regression Model Assumptions
1. Each observation is independent of others: This means the outcome or
probability of success (p in our logistic function) for one example
neither influences nor is influenced by the outcomes of other examples.
2. There is no multicollinearity among explanatory variables: In simple
terms, the input variables should not be too highly correlated with each
other. Any correlation implies that they carry similar information to the
model, which is redundant.
3. The input variables have a linear relationship with the log odds:
Although the outcome in logistic regression is a binary variable, logistic
regression stipulates that the input variables are linearly related to the
log odds, log(p / (1 − p)), and hence to the logit of the probability, p.
Violating these assumptions may result in inaccurate models and
misinterpretations. Therefore, validating these assumptions while
modeling Logistic Regression is essential.
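One simple, hedged way to sanity-check the multicollinearity assumption is to inspect pairwise feature correlations; the DataFrame name features is an assumption here, not part of the course code.

# Assumes `features` is a pandas DataFrame of the input variables
corr = features.corr().abs()
print(corr.round(2))
# Pairs with correlation close to 1 (an illustrative threshold of 0.9 is used here)
# carry largely redundant information and may violate the assumption.
high = (corr > 0.9) & (corr < 1.0)
print(high[high].stack())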
Designing Logistic Regression Model
1. Specify the hypothesis or function the model should learn. In Logistic
Regression, this is the Sigmoid function.
2. Define an error, cost, or loss function we aim to minimize. For Logistic
Regression, the cost function is the Cross-Entropy Loss.
3. Define a learning algorithm that optimizes the parameters for the hypothesis to
fit the model to the training data. In our case, it's the Gradient Descent
algorithm.
Implementing Logistic Regression Model
Create a LogisticRegression object and use the fit function to train it on
the training sets, X_train and y_train. The learned parameters of the
logistic function can then be printed: the coef_ attribute gives the coefficients
for the different features (or X), while intercept_ provides the intercept
term (or β0), as sketched below.
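A minimal sketch of what this looks like, assuming the training split from earlier (X_train and y_train are assumed names):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # learn the β parameters from the training data
print(model.coef_)            # coefficients for the features (X)
print(model.intercept_)       # intercept term (β0)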
Interpreting the Logistic Regression Model
Now that we have our trained Logistic Regression model, we might wonder how to interpret the
output of our model. The output of the model includes the coefficients (also known as weights) of
each feature and a bias (also known as the intercept). The coefficients represent the log of the odds
ratio of the corresponding feature.
For example, if the coefficient of a feature, say pH (with log odds ratio = β), is 0.5, it indicates that
for each unit change in pH, keeping other features constant, the odds of our outcome (whether the
wine quality is good) would increase by a factor of e^0.5 ≈ 1.65.
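One possible (illustrative) way to turn the raw coefficients into odds ratios is to exponentiate them; lr is assumed to be the fitted LogisticRegression model and features_train a DataFrame of the inputs.

import numpy as np

# e^(coefficient) gives the multiplicative change in the odds per unit increase in that feature
odds_ratios = np.exp(lr.coef_[0])
for feature, ratio in zip(features_train.columns, odds_ratios):
    print(f"{feature}: odds multiplied by {ratio:.3f} per unit increase")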
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression(max_iter=5000)
# TODO: Train your logistic regression model using the training sets
lr.fit(features_train, target_train)
# TODO: Make predictions on the test dataset
predictions = lr.predict(features_test)
# TODO: Evaluate the model using different metrics, e.g. accuracy, precision, recall, or F1 score
print("Model Performance Metrics:")
print("Accuracy: ", metrics.accuracy_score(target_test, predictions))
print("Precision: ", metrics.precision_score(target_test, predictions))
print("Recall: ", metrics.recall_score(target_test, predictions))
print("F1 Score: ", metrics.f1_score(target_test, predictions))
print("AUC: ", metrics.roc_auc_score(target_test, predictions))
print("Model Coefficients: ", lr.coef_[0])
print("Intercept: ", lr.intercept_[0])
print("Predicted outcomes: ", predictions)
Evaluating the Model Performance - Metrics
Key metrics include:
1. Confusion Matrix: This table describes the performance of a
classification model. It's essentially a 2×2 matrix that visualizes the
classifier's performance, representing actual and predicted
classifications in terms of true positives, false positives, true negatives,
and false negatives.
2. Accuracy: This is the ratio of correctly predicted observations to total
observations. Accuracy = (True Positives + True Negatives) / Total
Observations.
3. Precision: This is the ratio of correctly predicted positive
observations to the total predicted positives. Precision = True
Positives / (True Positives + False Positives).
4. Recall (Sensitivity): This is the ratio of correctly predicted positive
observations to all observations in the actual class. Recall = True
Positives / (True Positives + False Negatives).
5. F1 Score: This is the harmonic mean of Precision and Recall. F1 Score
= 2 * Recall * Precision / (Recall + Precision).
6. ROC-AUC : This is the area under the Receiver Operating Characteristic
curve. It indicates how much the model can distinguish between
classes.
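As a compact, hedged illustration of these metrics, the snippet below assumes target_test and predictions from the earlier Logistic Regression code:

from sklearn import metrics

cm = metrics.confusion_matrix(target_test, predictions)
tn, fp, fn, tp = cm.ravel()   # unpack the 2x2 matrix
print("Confusion matrix:\n", cm)
print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))
print("Precision:", tp / (tp + fp))
print("Recall   :", tp / (tp + fn))
print("F1 Score :", metrics.f1_score(target_test, predictions))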
Understanding Model Overfitting and Underfitting
In machine learning, balance is crucial. If your model performs well on the
training data but poorly on unseen data (such as validation and test
datasets), it may be overfitting. This issue is similar to an attempt to ace a
specific test by learning to copy all the answers without understanding the
concepts, which leads to poor performance in other tests. This problem arises
because the model learns the noise in the training data rather than the signal.
Conversely, we have underfitting. An underfitted model performs poorly on
both training and unseen data because it hasn't learned the underlying pattern
of the data.
In subsequent lessons, we will explore these concepts deeper and examine
how to fine-tune our models to prevent overfitting and underfitting.
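One practical, hedged way to spot these issues is to compare training and test scores for the same model; the model and split names below are assumptions for illustration.

from sklearn.metrics import r2_score

train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
# A large gap (high train score, much lower test score) hints at overfitting;
# low scores on both hint at underfitting.
print(f"Train R^2: {train_score:.3f}, Test R^2: {test_score:.3f}")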
GRADIENT DESCENT SINGLE FEATURE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import r2_score, mean_squared_error
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# TODO: Load the Red Wine Quality dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine =pd.DataFrame(red_wine)
# TODO: Select 'sulphates' as the predictive feature and 'quality' as the target variable
x=pd.DataFrame(red_wine['sulphates'])
y=red_wine['quality']
# TODO: Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=42)
# TODO: Initialize the model parameters to all 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# TODO: Apply the Gradient Descent function to optimize the parameters
y_train = np.array(y_train).reshape(-1, 1)
alpha = 0.05
iters = 3000
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
y_pred_train = np.dot(x_train, g)
y_pred_test = np.dot(x_test, g)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
# TODO: Create a plot to display the cost function against iterations
plt.plot(cost_list)
plt.show()
GRADIENT DESCENT MULTIPLE FEATURE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.preprocessing import StandardScaler
start_time = pd.Timestamp.now()
# Load the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features (X) and target (y)
# Exclude 'quality' which is our target variable
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
# Print feature names and shape
print(f"Features used ({X.shape[1]} total):")
for i, feature in enumerate(X.columns, 1):
    print(f"{i}. {feature}")
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# Initialize parameters (theta) for all features
theta = np.zeros(X_train.shape[1]).reshape(-1, 1)
def gradient_descent(X, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    # Add progress tracking
    for i in range(iterations):
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        cost = 1/(2*m) * np.dot(error.T, error)
        theta = theta - (alpha * (1/m) * np.dot(X.T, error))
        cost_list.append(np.squeeze(cost))
        theta_list.append(theta.copy())
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Set parameters
alpha = 0.01 # Increased learning rate since we scaled the features
iters = 3000
# Run gradient descent
print("\nTraining model...")
final_theta, theta_history, cost_history = gradient_descent(
X_train, y_train, theta, alpha, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(range(1, iters + 1), cost_history, color='blue')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.grid(True)
plt.show()
# Print final parameters and their importance
print("\nFinal parameters for each feature:")
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Parameter Value': final_theta.flatten()
})
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(X_train, final_theta)
ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(X_test, final_theta)
ss_tot = np.sum((y_test.values.reshape(-1, 1) - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test.values.reshape(-1, 1) - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
stop_time = pd.Timestamp.now()
print('Time taken for optimization: ', stop_time - start_time, ' seconds')
Features used (11 total):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Training model...
Iteration 500/3000, Cost: 16.036170
Iteration 1000/3000, Cost: 16.035651
Iteration 1500/3000, Cost: 16.035434
Iteration 2000/3000, Cost: 16.035318
Iteration 2500/3000, Cost: 16.035256
Iteration 3000/3000, Cost: 16.035222
Final parameters for each feature:
Feature Parameter Value Absolute Importance
alcohol 0.366530 0.366530
sulphates 0.186969 0.186969
density 0.144561 0.144561
pH 0.101144 0.101144
citric acid 0.094403 0.094403
chlorides -0.087866 0.087866
fixed acidity -0.039480 0.039480
residual sugar -0.026653 0.026653
free sulfur dioxide 0.013299 0.013299
total sulfur dioxide -0.013174 0.013174
volatile acidity 0.005152 0.005152
Model Performance:
Training R-squared: -47.6866
Testing R-squared: -50.5587
Time taken for optimization: 0 days 00:00:27.383149 seconds
GRADIENT DESCENT REGULARIZED
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
import datasets
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error
start_time = pd.Timestamp.now()
# Load and prepare the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features and target
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
if red_wine.isnull().sum().any():
    print("Missing values found")
else:
    print("No missing values found")
# Create polynomial features (up to degree 2 for interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
feature_names = poly.get_feature_names_out(X.columns)
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3, random_state=10)
def gradient_descent_with_regularization(X, y, theta, alpha, lambda_reg, iterations, early_stop_rounds=50):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    best_cost = float('inf')
    rounds_without_improvement = 0
    for i in range(iterations):
        # Forward pass
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        # Calculate cost with L2 regularization (don't regularize the intercept)
        reg_term = (lambda_reg / (2*m)) * np.sum(theta[1:] ** 2)
        cost = 1/(2*m) * np.dot(error.T, error) + reg_term
        # Calculate gradients with regularization
        gradients = (1/m) * np.dot(X.T, error)
        gradients[1:] += (lambda_reg/m) * theta[1:]  # Add regularization term
        # Update parameters
        theta = theta - alpha * gradients
        cost_list.append(float(cost))
        theta_list.append(theta.copy())
        # Early stopping check
        if cost < best_cost:
            best_cost = cost
            rounds_without_improvement = 0
            best_theta = theta.copy()
        else:
            rounds_without_improvement += 1
        if rounds_without_improvement >= early_stop_rounds:
            print(f"Early stopping at iteration {i+1}")
            break
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {float(cost):.6f}")
    return best_theta, theta_list, cost_list
# Initialize parameters
n_features = X_train.shape[1]
theta = np.zeros((n_features, 1))
# Hyperparameters
alpha = 0.001 # Reduced learning rate
lambda_reg = 0.1 # Regularization strength
iters = 5000
print(f"\nTraining model with {n_features} features (including polynomial terms)")
print("First 5 feature names as example:")
for i, name in enumerate(feature_names[:5], 1):
    print(f"{i}. {name}")
# Train the model
final_theta, theta_history, cost_history = gradient_descent_with_regularization(
    X_train, y_train, theta, alpha, lambda_reg, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(cost_history, color='blue')
plt.xlabel('Iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of Gradient Descent with Regularization')
plt.grid(True)
plt.show()
# Compare with sklearn's Ridge regression
ridge = Ridge(alpha=lambda_reg)
ridge.fit(X_train, y_train)
# Calculate predictions and R-squared for both approaches
# Our implementation
y_pred_train = np.dot(X_train, final_theta)
y_pred_test = np.dot(X_test, final_theta)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
# Ridge implementation
ridge_pred_train = ridge.predict(X_train)
ridge_pred_test = ridge.predict(X_test)
ridge_r2_train = r2_score(y_train, ridge_pred_train)
ridge_r2_test = r2_score(y_test, ridge_pred_test)
# Print results
print("\nModel Performance:")
print("Gradient Descent with Regularization:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
print(f"Training MSE: {mean_squared_error(y_train, y_pred_train):.4f}")
print(f"Testing MSE: {mean_squared_error(y_test, y_pred_test):.4f}")
print("\nScikit-learn Ridge Implementation:")
print(f"Training R-squared: {ridge_r2_train:.4f}")
print(f"Testing R-squared: {ridge_r2_test:.4f}")
print(f"Training MSE: {mean_squared_error(y_train, ridge_pred_train):.4f}")
print(f"Testing MSE: {mean_squared_error(y_test, ridge_pred_test):.4f}")
# Feature importance analysis
feature_importance = pd.DataFrame({
'Feature': feature_names,
'Parameter Value': final_theta.flatten(),
'Ridge Parameter': ridge.coef_
})
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
stop_time = pd.Timestamp.now()
print('Time taken for optimization: ', stop_time - start_time, ' seconds')
Training model with 77 features (including polynomial terms)
First 5 feature names as example:
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
Iteration 500/5000, Cost: 16.005579
Iteration 1000/5000, Cost: 15.995963
Iteration 1500/5000, Cost: 15.988637
Iteration 2000/5000, Cost: 15.982492
Iteration 2500/5000, Cost: 15.977168
Iteration 3000/5000, Cost: 15.972465
Iteration 3500/5000, Cost: 15.968250
Iteration 4000/5000, Cost: 15.964428
Iteration 4500/5000, Cost: 15.960930
Iteration 5000/5000, Cost: 15.957702
Model Performance:
Gradient Descent with Regularization:
Training R-squared: -49.5259
Testing R-squared: -46.7613
Training MSE: 31.9154
Testing MSE: 33.3533
Scikit-learn Ridge Implementation:
Training R-squared: 0.4307
Testing R-squared: 0.3855
Training MSE: 0.3596
Testing MSE: 0.4291
Top 10 Most Important Features:
Feature Parameter Value Ridge Parameter Absolute Importance
chlorides total sulfur dioxide -0.188286 -0.056623 0.188286
volatile acidity total sulfur dioxide -0.160016 0.460917 0.160016
residual sugar total sulfur dioxide 0.155526 0.019077 0.155526
residual sugar^2 -0.152816 -0.175048 0.152816
residual sugar chlorides -0.144540 0.134031 0.144540
fixed acidity volatile acidity -0.118415 -0.037261 0.118415
total sulfur dioxide alcohol 0.108945 0.117779 0.108945
free sulfur dioxide sulphates -0.100814 -0.154157 0.100814
volatile acidity citric acid -0.097507 0.056614 0.097507
citric acid^2 -0.089492 0.114346 0.089492
Time taken for optimization: 0 days 00:00:18.955081 seconds
GRADIENT DESCENT THEORY
Gradient Descent Demystified
Have you ever hiked to the top of a hill and looked down to determine the best route of descent? One
potentially disastrous step off a steep cliff is dangerous, while cautiously descending the gentle slopes
might cause less harm. The concept of Gradient Descent mirrors this scenario — it, too, sees the value in
finding and taking the optimal path or, more precisely, reaching the minimum point.
In machine learning, Gradient Descent can be visualized as a careful navigation downwards until we find
the valley between hills. The 'hill' in this context is the cost function, which quantifies our model's error.
Through a series of small steps, Gradient Descent reduces the cost by 'walking' down the hill
in the direction of steepest descent until it reaches the lowest possible point, its optimal state.
At its core, Gradient Descent relies on two key mathematical mechanisms: the Cost Function and
the Learning Rate.
The Cost Function (or Loss Function) quantifies the disparity between predicted and expected values,
presenting it as a single float number. The type of cost function utilized depends on the challenge at hand. In
our Wine Quality dataset, we can define a cost function that computes the difference between our model's
predicted quality of wine and the actual quality.
The Learning Rate, symbolized by α, dictates the size of the steps we take downhill. A lower value of α results
in smaller, more precise steps, while a high value could cause drastic, potentially unstable steps. From our
previous analogy, imagine the hill is symbolized by a function of position, g(x). Starting at the hill's pinnacle
(x0), we revise our position (x) by moving a step proportional to the negative gradient at that location. The
gradient g′(x) is simply the derivative of g(x), pointing toward the steepest ascent. Conversely, −g′(x)
signifies the fastest descending path. We repeat this stepping process until the gradient becomes zero at the
minimum point, indicating no further downhill path, i.e., no additional optimization is required.
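To make the analogy concrete, here is a tiny illustrative sketch of gradient descent on a one-dimensional example function g(x) = x², whose minimum is at x = 0 (the function, starting point, and step count are assumptions for demonstration):

# g(x) = x**2, so g'(x) = 2*x points uphill and -g'(x) points downhill
alpha = 0.1   # learning rate: size of each downhill step
x = 5.0       # starting position x0 at the 'top of the hill'
for _ in range(50):
    gradient = 2 * x          # g'(x)
    x = x - alpha * gradient  # step along -g'(x)
print(x)  # ends up very close to 0, the minimum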
Advancements in Gradient Descent
Here, an interesting question arises, "Do we always use all data to calculate the gradient?" The
answer depends. Gradient Descent has evolved into various versions, depending on the amount of
data used in computing the gradient: batch, stochastic, and mini-batch gradient descent.
The original version, batch gradient descent, uses the complete dataset at every step. While this may
seem meticulous and comprehensive, it proves extremely inefficient when dealing with substantial
datasets housing millions of entries. Imagine watching a movie frame by frame at a snail's pace — it can
be painstakingly slow despite its precision.
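For contrast, a rough sketch of the mini-batch variant, which computes each update from a small random subset of the data rather than the full dataset; the batch size and array-based inputs are assumptions for illustration.

import numpy as np

def minibatch_gradient_descent(x, y, theta, alpha, iterations, batch_size=32):
    # x is assumed to be a NumPy array of shape (m, n_features);
    # y is assumed to be a column vector of shape (m, 1), matching the course code
    m = y.size
    for _ in range(iterations):
        idx = np.random.choice(m, batch_size, replace=False)  # sample a mini-batch
        xb, yb = x[idx], y[idx]
        error = np.dot(xb, theta) - yb
        theta = theta - alpha * (1 / batch_size) * np.dot(xb.T, error)
    return theta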
Implementing Gradient Descent
Now, let's implement Gradient Descent in Python. We start by assigning initial
values to our model’s parameters. Gradual adjustments to these parameters follow, in each instance
computing the cost function (our error) and taking a step down the steepest slope until our error is
minimal or the state is optimized.
Here’s a general outline of how we would implement gradient descent in Python:
def gradient_descent(x, y, theta, alpha, iterations):
    # x - input dataset/feature
    # y - target dataset/feature
    # theta - initial parameters
    # alpha - learning rate
    # iterations - no. of times the optimization algorithm executes to fine-tune the parameters
    m = y.size            # number of data points
    cost_list = []        # list to store the cost function value at each iteration
    theta_list = [theta]  # list to store the values of theta at each iteration
    for i in range(iterations):
        prediction = np.dot(x, theta)                         # our prediction based on our current theta
        error = prediction - y                                # error between our prediction and the actual values
        cost = 1 / (2*m) * np.dot(error.T, error)             # calculate the cost function
        cost_list.append(np.squeeze(cost))                    # append the cost to the cost_list
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))  # gradient descent update of theta
        theta_list.append(theta)                              # append the updated theta to the theta_list
    return theta, theta_list, cost_list                       # return the final values after the last iteration
# PROGRAM
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
white_wine = datasets.load_dataset('codesignal/wine-quality', split='white')
white_wine = pd.DataFrame(white_wine)
# Only consider the 'density' column as a predictive feature for now
x = pd.DataFrame(white_wine['density'])
y = white_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.001
iters = 2000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='green')
plt.rcParams["figure.figsize"] = (10,6)
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.show()
Let's break down the line theta = np.zeros(x_train.shape[1]).reshape(-1, 1) step by
step:
Since x is a matrix of input features (shape: n x 1) and theta is a vector of
parameters (shape: 1 x 1), their dot product, used later for prediction, results in
the predicted values prediction (shape: n x 1).
1. x_train.shape[1]:
● This retrieves the number of columns (features) in x_train.
● Example: If x_train has one feature (like density), x_train.shape[1]
would be 1.
2. np.zeros(x_train.shape[1]):
● Creates an array of zeros with a length equal to the number of
features.
● Example: If x_train.shape[1] is 1, this results in an array [0].
3. .reshape(-1, 1):
● Reshapes the array into a column vector. This reshapes theta into a
column vector (from an array of shape (1,) to (1, 1)).
● -1: Let NumPy automatically determine the number of rows based on the
data (in this case, it’s 1 because there’s one feature). 1: Specifies
that the output should have 1 column.
● Example: Reshaping [0] results in [[0]], a column vector with one row
and one column.
We reshape theta because later, we’ll perform matrix operations where theta must be
a column vector for matrix multiplication to work correctly.
In gradient descent, we will use matrix multiplication to calculate predictions and
update the weights. For matrix multiplication to work correctly in Python’s numpy
(or other numerical libraries), the dimensions of matrices must align.
● Specifically, the shape of the input feature matrix x will be (n_samples,
n_features) (i.e., number of rows = number of data points, number of columns =
number of features).
● theta needs to be a column vector (shape (n_features, 1)) to allow for the dot
product between the feature matrix and the weight vector.
For example, if we want to compute Prediction = X · θ:
● X: The feature matrix of shape (n_samples, n_features).
● theta: The weight vector, which must be shaped (n_features, 1) for this
multiplication to be valid.
Purpose of the Split:
The goal of splitting the data into training and testing sets is to train the model
on one portion of the data and then test its generalization on unseen data. This
helps in assessing the model’s real-world performance and avoiding overfitting (a
model performing well on training data but poorly on new data).
Why Split the Data?
1. Training:
○ You need enough data to train the model so it can learn the relationship between
the input feature (density) and the target (quality).
2. Testing:
○ Once the model is trained, the test set is used to evaluate the model’s
performance on data it has never seen before. This helps ensure that the model
will generalize well to new, unseen data in real-world scenarios.
3. Avoid Overfitting:
○ Overfitting happens when a model performs very well on the training data but
poorly on new data because it has "memorized" the training set rather than
learning general patterns.
○ The test set allows you to check if the model has overfitted by evaluating its
performance on data it has not been trained on.
By splitting the dataset into training and testing subsets, we ensure that the model can be
evaluated effectively for its predictive accuracy, and we get an estimate of how it will
perform in real-world applications.
Function Parameters:
● x: The input feature matrix (in this case, the density of the wine from x_train). It is
an n x 1 matrix (n = number of training examples, 1 feature).
● y: The target values (in this case, the quality of the wine from y_train). This is the
true label we are trying to predict. It is a column vector of shape (n, 1).
● theta: The initial parameter values (initialized as zeros in the previous step). This
is the weight vector (or coefficient) that will be updated iteratively to minimize the
error in predictions.
● alpha: The learning rate, which controls the size of the steps taken towards the
minimum of the cost function. A small value of alpha ensures that the gradient descent
algorithm converges more steadily, but it might take longer. If alpha is too large, it
can cause the algorithm to overshoot the minimum and fail to converge.
● iterations: The number of times to update theta. More iterations allow the model to
converge closer to the optimal solution, but at a cost of more computation.
Breakdown of the Function:
1. m = y.size:
● m represents the number of training examples (rows) in the dataset. This is used to
calculate the average cost and gradients.
● y.size gives the total number of elements in y, which is the number of training
samples.
2. cost_list = []:
● This initializes an empty list to store the cost values after each iteration. The cost
(also known as the loss) is a measure of how far off the model's predictions are from
the actual target values. In linear regression, the cost function is the mean squared
error.
3. theta_list = [theta]:
● This initializes a list to store the theta values after each iteration. It starts with
the initial value of theta (which was set to zeros in Step 5).
● Keeping track of theta values helps in visualizing how the parameters evolve during
training.
4. Loop over the Number of Iterations:
● loop runs for the specified number of iterations, updating the parameters theta in each
step.
5. Purpose of theta is Prediction:
● theta represents the weights or coefficients that will be used to predict the target
variable (quality of wine) based on the input feature (density of wine).
● In a linear regression model, the formula for predicting the target is ŷ = θ0 + θ1·x,
where:
○ ŷ is the predicted value (in this case, wine quality).
○ θ0 is the intercept (bias term).
○ θ1 is the weight for the feature (density).
○ x is the input feature (density values in this case).
6. Cost Calculation (Mean Squared Error):
The cost function in linear regression is the Mean Squared Error (MSE). The formula is
J(θ) = (1/(2m)) · Σ (hθ(x_i) − y_i)², where:
● hθ(x) is the predicted value (i.e., np.dot(x, theta)).
● y is the actual value.
● m is the number of data points.
In code, np.dot(error.T, error) computes the sum of squared errors. The 1 / (2*m) scales it
to give the mean squared error. The division by 2 is a mathematical convenience used in
gradient descent because it simplifies the derivative of the cost function.
The computed cost is appended to the cost_list to track how the cost changes over
iterations.
7. Gradient Descent Update for theta:
● This is the key step in gradient descent. The idea is to update theta in the direction
that minimizes the cost function. This update rule is derived from the partial
derivative of the cost function with respect to theta.
● The term alpha * (1/m) * np.dot(x.T, error) is the gradient of the cost function with
respect to theta. It tells us the direction and magnitude by which we need to adjust
theta to reduce the error.
● The formula for updating theta is θ := θ − α · ∇θJ(θ), where:
● α is the learning rate.
● ∇θJ(θ) is the gradient of the cost function (the term np.dot(x.T, error)
in code).
● theta is updated in the direction that minimizes the cost.
Each time the loop runs, this update is applied, moving the theta values closer to their
optimal solution.
Summary of Gradient Descent Process:
1. Start with an initial guess for theta or weights (initialized as zeros).
2. Make predictions using the current values of theta and the input features with a simple
linear regression model where density predicts wine quality.
3. Calculate the error between the predicted values and the actual target values.
4. Compute the cost function (mean squared error) to measure how far off the predictions
are.
5. Update theta by moving in the direction that reduces the error (the gradient descent
step).
6. Repeat this process for the specified number of iterations, tracking the cost and theta
values as you go.
7. Return the final optimized theta and the history of theta and cost values.
Gradient descent ensures that, with each iteration, the model gets better at making predictions
by continuously reducing the cost.
This is a basic introduction to linear regression with gradient descent, using only one feature
(density) and optimizing parameters iteratively.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
# Load White Wine Quality Dataset
white_wine = datasets.load_dataset('codesignal/wine-quality',
split='white')
white_wine = pd.DataFrame(white_wine)
# Only consider the 'density' column as a predictive feature for now
x = pd.DataFrame(white_wine['density'])
y = white_wine['quality']
print(f"Features used ({x.shape[1]} total):")
for i, feature in enumerate(x.columns, 1):
print(f"{i}. {feature}")
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=0)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.001
iters = 2000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
print("\nTraining model...")
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='green')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.show()
print("\nFinal parameters for each feature:")
final_theta = theta_list[-1].flatten()  # Access the last element and flatten it
feature_importance = pd.DataFrame({
'Feature': x.columns,
'Parameter Value': final_theta
})
print(feature_importance)
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(x_train, final_theta)
# ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
# ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
ss_res = np.sum((y_train - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(x_test, final_theta)
ss_tot = np.sum((y_test - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
Features used (1 total):
1. density
Training model...
Iteration 500/2000, Cost: 6.873248
Iteration 1000/2000, Cost: 2.803397
Iteration 1500/2000, Cost: 1.288870
Iteration 2000/2000, Cost: 0.725263
Final parameters for each feature:
Feature Parameter Value
0 density 5.110624
Feature Parameter Value Absolute Importance
density 5.110624 5.110624
Model Performance:
Training R-squared: -6393.4648
Testing R-squared: -0.7097
The parameter value of 5.110624 for the feature density
indicates the weight that the gradient descent algorithm has
learned for this feature in predicting wine quality. Here's
what it means:
● Positive Value: A positive parameter value suggests that
as the density of the wine increases, the predicted
quality also increases, according to the model.
● Magnitude: The magnitude of the parameter indicates the
strength of the relationship. A larger absolute value
would suggest a stronger influence of density on the
predicted quality.
In this context, the "Parameter Value" and "Absolute
Importance" are indeed the same because:
● Parameter Value: This is the weight that the model has
learned for the feature density.
● Absolute Importance: This is simply the absolute value of
the parameter. Since the parameter is positive, its
absolute value is the same as the parameter itself.
In cases with multiple features, the absolute importance helps
compare the relative influence of each feature on the model's
predictions, regardless of the sign of the parameter. Here,
with only one feature, they naturally match.
The "Model Performance" section in the output refers to the
evaluation of how well your linear regression model is fitting
the data. Specifically, it uses the R-squared (R2) metric to
assess performance:
● R-squared (R2): This metric indicates the proportion of
variance in the target variable that is predictable from
the input features. It ranges from 0 to 1, where:
● 1 means the model perfectly predicts the target
variable.
● 0 means the model does not explain any variability in
the target variable.
● Negative values, like in your output, suggest that
the model is performing worse than a simple
mean-based prediction.
In your output:
● Training R-squared: Indicates how well the model fits the
training data. A negative value suggests poor performance.
● Testing R-squared: Indicates how well the model
generalizes to unseen data. A negative value here also
suggests poor generalization.
The negative R-squared values imply that the model is not
capturing the relationship between density and quality
effectively. You might need to explore other features or model
types for better performance.
Your approach to calculating the R-squared value for the test
dataset is correct, but the negative R-squared value suggests
that the model is performing poorly. This could be due to
several reasons, such as:
1. Feature Scaling: Ensure that the feature (pH) is scaled
properly. Gradient descent can be sensitive to the scale
of the features.
2. Learning Rate: The learning rate (alpha) might be too high
or too low. Experiment with different values to see if the
model's performance improves.
3. Model Complexity: Using only one feature (pH) might not be
sufficient to capture the complexity of the data. Consider
adding more features to improve the model.
4. Data Quality: Check if there are any issues with the data,
such as outliers or missing values, that might affect the
model's performance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha,
iters)
# Test the performance of the model on the testing dataset
final_theta=theta_list[-1].flatten()
y_pred_test = np.dot(x_test, final_theta)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Testing R-squared: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
print(f'Mean Squared Error: {mse}')
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'Mean Absolute Error: {mae}')
# R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_test)
print(f'R-squared: {r2}')
plt.plot(range(1, iters + 1), cost_list, color='orange')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
The change in the "Testing R-squared" output is due to the
learner correctly using the final theta from the training phase
to calculate predictions for the test dataset. Here's what was
done:
1. Using Final Theta: The learner used theta_list[-1] to get
the final theta from the training phase, which is crucial
for making accurate predictions on the test dataset.
2. Calculating Predictions: They calculated y_pred_test using
the test dataset and the final theta.
3. R-squared Calculation: The R-squared value was calculated
using the formula R² = 1 − SS_res / SS_tot, where SS_res is the sum
of squares of residuals and SS_tot is the total sum of squares.
The negative R-squared value indicates that the model is not
performing well on the test dataset, which could be due to the
simplicity of using only one feature (pH) or other factors like
feature scaling or learning rate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='orange')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
# Test the performance of the model on the testing dataset
final_theta=theta_list[-1].flatten()
y_pred_test = np.dot(x_test, final_theta)
# R-squared (Coefficient of Determination) direct method from sklearn
r2 = r2_score(y_test, y_pred_test)
print(f'R-squared: {r2}')
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
CHATGPT R-square
Explanation of Performance Metrics:
1. Mean Squared Error (MSE):
○ MSE is a common metric for regression problems. It calculates the
average squared difference between the predicted and actual values.
○ Lower values of MSE indicate better performance (closer to zero means
the model's predictions are closer to the true values).
2. Formula: MSE = (1/n) · Σ (y_pred,i − y_true,i)²,
where y_pred,i is the predicted value, and y_true,i is
the actual value.
3. Mean Absolute Error (MAE):
○ MAE calculates the average absolute difference between predicted and
actual values. Unlike MSE, it doesn't square the errors, so it's less
sensitive to outliers.
○ Like MSE, a lower value of MAE indicates better model performance.
4. Formula: MAE = (1/n) · Σ |y_pred,i − y_true,i|.
5. R-squared (R²):
○ R² is a statistical measure that represents the proportion of the
variance for the target variable that is explained by the model. It's a
measure of how well the regression predictions approximate the real
data points.
○ The value of R² is at most 1, where 1 means perfect predictions
and 0 means the model explains none of the variability in the target
variable; it can even be negative when the model fits worse than
simply predicting the mean.
6. Formula: R² = 1 − Σ (y_true,i − y_pred,i)² / Σ (y_true,i − ȳ_true)²
Where:
○ y_pred are the predicted values.
○ y_true are the actual values.
○ ȳ_true is the mean of the actual values.
Summary of Steps:
1. Make Predictions: Use the trained model (optimized theta) to make predictions
on the test set (x_test).
2. Evaluate the Model: Compute metrics like MSE, MAE, or R² to understand how
well your model is performing on the test data.
3. Interpret the Results: Lower MSE and MAE values indicate better model
performance. For R², a value close to 1 indicates a good fit.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Reshape y_test to match the shape of predictions for matrix multiplication
y_test = np.array(y_test).reshape(-1, 1)
# 1. Use `theta` to predict the output (quality) for the test dataset
# Since x_test is a single feature (density), we use matrix multiplication
predictions = np.dot(x_test, theta)
# 2. Calculate the errors (residuals)
error = predictions - y_test
# 3. Calculate different performance metrics:
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Absolute Error: {mae}')
# R-squared (Coefficient of Determination)
r2 = r2_score(y_test, predictions)
print(f'R-squared: {r2}')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 10 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 100
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
#mse = mean_squared_error(y_test, y_pred_test)
#mae = mean_absolute_error(y_test, y_pred_test)
#print(f'MSE, MAE:')
#print(mse, mae)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
plt.plot(x)
plt.show()
#scaler = StandardScaler()
#x = scaler.fit_transform(x_raw)
#x = pd.DataFrame(x, columns=x_raw.columns)
print(x.describe())
#print(x_raw.describe())
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 3000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
FRESH BASIC
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
# Run gradient descent on the test split only to visualize its cost curve;
# the model itself is evaluated below with final_theta learned on the training data
y_test = np.array(y_test).reshape(-1, 1)
g_test, theta_test_list, cost_test_list = gradient_descent(x_test, y_test, theta, alpha, iters)
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
plt.figure(figsize=(10, 6))
plt.plot(range(1, iters + 1), cost_test_list, color='orange')
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
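The learning rate and iteration count above were chosen by hand. One way to see how sensitive gradient descent is to these choices is to rerun it for a few alpha values and overlay the cost curves. The sketch below is purely illustrative (the alpha grid is an arbitrary assumption) and reuses gradient_descent, x_train, y_train and iters defined above.
# Sketch: compare cost curves for a few hand-picked learning rates.
# The alpha values are illustrative only; reuses gradient_descent, x_train, y_train, iters.
for trial_alpha in (0.001, 0.01, 0.1):
    start_theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
    _, _, trial_costs = gradient_descent(x_train, y_train, start_theta, trial_alpha, iters)
    plt.plot(range(1, iters + 1), trial_costs, label=f'alpha={trial_alpha}')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.legend()
plt.title('Effect of the learning rate on convergence')
plt.show()
A larger alpha converges in fewer iterations as long as it stays small enough not to diverge; a very small alpha converges but needs many more iterations.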
# ALL FEATURES EXCEPT TARGET
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Previous single-feature (pH) setup, kept commented out for reference
#x = pd.DataFrame(red_wine['pH'])
#y = red_wine['quality']
#plt.plot(x)
#plt.show()
x_raw = red_wine.drop('quality', axis=1)
y = red_wine['quality']
print(x_raw.isnull().sum())
x_raw = x_raw.fillna(0) # Replace NaN with 0
scaler = StandardScaler()
x = scaler.fit_transform(x_raw)
x = pd.DataFrame(x, columns=x_raw.columns)
print(x.describe())
#print(x_raw.describe())
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 3000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
plt.figure(figsize=(10, 6))
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on training dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
dtype: int64
fixed acidity volatile acidity ... sulphates alcohol
count 1.599000e+03 1.599000e+03 ... 1.599000e+03 1.599000e+03
mean 3.554936e-16 1.733031e-16 ... 6.754377e-16 1.066481e-16
std 1.000313e+00 1.000313e+00 ... 1.000313e+00 1.000313e+00
min -2.137045e+00 -2.278280e+00 ... -1.936507e+00 -1.898919e+00
25% -7.007187e-01 -7.699311e-01 ... -6.382196e-01 -8.663789e-01
50% -2.410944e-01 -4.368911e-02 ... -2.251281e-01 -2.093081e-01
75% 5.057952e-01 6.266881e-01 ... 4.240158e-01 6.354971e-01
max 4.355149e+00 5.877976e+00 ... 7.918677e+00 4.202453e+00
[8 rows x 11 columns]
Iteration 500/3000, Cost: 16.014052
Iteration 1000/3000, Cost: 16.012998
Iteration 1500/3000, Cost: 16.012576
Iteration 2000/3000, Cost: 16.012360
Iteration 2500/3000, Cost: 16.012244
Iteration 3000/3000, Cost: 16.012180
R-squared train: -49.69847600642071
R-squared test: -46.29026820603956
MSE, MAE:
33.02437063055097 5.698079446215109
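The strongly negative R-squared values above are expected rather than a bug in the metrics: the design matrix has no intercept (bias) column, and after standardization every feature has mean 0, so the model's predictions are centered around 0 while wine quality averages roughly 5.6. That is also why the cost plateaus near 16 (about half the mean of quality squared) instead of approaching the variance of quality. A minimal fix is to prepend a column of ones so one parameter can learn the intercept. The sketch below is an added illustration and assumes x (the scaled features), y, alpha, iters, train_test_split and gradient_descent from the listing above are still in scope.
# Sketch: add an intercept (bias) column of ones before running gradient descent.
# Assumes x, y, alpha, iters, train_test_split and gradient_descent from above.
x_b = x.copy()
x_b.insert(0, 'bias', 1.0)   # column of ones so theta[0] acts as the intercept
xb_train, xb_test, yb_train, yb_test = train_test_split(x_b, y, test_size=0.3, random_state=10)
theta_b = np.zeros(xb_train.shape[1]).reshape(-1, 1)
yb_train = np.array(yb_train).reshape(-1, 1)
theta_b, _, costs_b = gradient_descent(xb_train, yb_train, theta_b, alpha, iters)
print('R-squared test with intercept:', r2_score(yb_test, np.dot(xb_test, theta_b.flatten())))
With the bias column in place, the test R-squared should move out of the strongly negative range and the cost curve should drop well below the plateau seen above.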
#CLAUDE all features
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.preprocessing import StandardScaler
# Load the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features (X) and target (y)
# Exclude 'quality' which is our target variable
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
# Print feature names and shape
print(f"Features used ({X.shape[1]} total):")
for i, feature in enumerate(X.columns, 1):
    print(f"{i}. {feature}")
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3, random_state=0)
# Initialize parameters (theta) for all features
theta = np.zeros(X_train.shape[1]).reshape(-1, 1)
def gradient_descent(X, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    # Add progress tracking
    for i in range(iterations):
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        cost = 1/(2*m) * np.dot(error.T, error)
        theta = theta - (alpha * (1/m) * np.dot(X.T, error))
        cost_list.append(np.squeeze(cost))
        theta_list.append(theta.copy())
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Set parameters
alpha = 0.01  # learning rate; feature scaling keeps this value stable
iters = 3000
# Run gradient descent
print("\nTraining model...")
final_theta, theta_history, cost_history = gradient_descent(
X_train, y_train, theta, alpha, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(range(1, iters + 1), cost_history, color='blue')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.grid(True)
plt.show()
# Print final parameters and their importance
print("\nFinal parameters for each feature:")
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Parameter Value': final_theta.flatten()
54
})
feature_importance['Absolute Importance'] =
abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute
Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(X_train, final_theta)
ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(X_test, final_theta)
ss_tot = np.sum((y_test.values.reshape(-1, 1) - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test.values.reshape(-1, 1) - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
Features used (11 total):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Training model...
Iteration 500/3000, Cost: 16.170979
Iteration 1000/3000, Cost: 16.167348
Iteration 1500/3000, Cost: 16.165631
Iteration 2000/3000, Cost: 16.164715
Iteration 2500/3000, Cost: 16.164220
Iteration 3000/3000, Cost: 16.163950
Final parameters for each feature:
Feature Parameter Value Absolute Importance
density -0.430942 0.430942
fixed acidity 0.426656 0.426656
sulphates 0.282146 0.282146
free sulfur dioxide 0.196943 0.196943
residual sugar 0.150299 0.150299
chlorides -0.129332 0.129332
total sulfur dioxide -0.070154 0.070154
alcohol 0.058984 0.058984
pH 0.045277 0.045277
citric acid 0.037427 0.037427
volatile acidity -0.034311 0.034311
Model Performance:
Training R-squared: -46.7194
Testing R-squared: -53.4278
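The same caveat applies to this variant: without a bias column the R-squared values stay strongly negative, so the "importance" ranking above should be read with care. As a reference point, the closed-form least-squares solution with an intercept can be computed directly from the normal equations. This sketch is an added comparison and assumes X_train, X_test, y_train and y_test from the listing above are still in scope.
# Sketch: closed-form least-squares solution with an intercept, for comparison.
# Assumes X_train, X_test, y_train, y_test from the listing above are in scope.
Xb_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train.values])
Xb_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test.values])
# Normal equation solved via least squares (numerically safer than inverting X^T X)
theta_ne, *_ = np.linalg.lstsq(Xb_train, y_train.values, rcond=None)
pred_test = Xb_test @ theta_ne
ss_res = np.sum((y_test.values - pred_test) ** 2)
ss_tot = np.sum((y_test.values - y_test.values.mean()) ** 2)
print('Closed-form R-squared test:', 1 - ss_res / ss_tot)
If the gradient-descent fit disagrees sharply with this closed-form fit, the usual culprits are the missing intercept column or too few iterations.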