SUPERVISED MACHINE LEARNING
LINEAR REGRESSION
Linear Regression is primarily used for predicting continuous outcomes. It models the
relationship between a dependent variable and one or more independent variables by
fitting a straight line to the observed data. With a single input feature, this line is
described by the simple equation y = mx + c, and the goal is to predict a continuous
target variable from the input features.
Mathematics behind Linear Regression
At the heart of Linear Regression lies the concept of the cost function and hypothesis, which
we'll break down below:
1. Hypothesis: This is the function that produces the regression line used to predict the
output from the inputs. If we're trying to predict wine quality based on certain properties,
this hypothesis captures the linear relationship between our selected properties and the
wine's quality. The hypothesis is represented as hθ(x) = θ0 + θ1·x, where θ0 and θ1 are
the model's parameters.
2. Cost Function (or Loss Function): This term quantifies how wrong our model's
predictions are relative to the actual truth. We aim to minimize this function to achieve
the most accurate prediction. It is based on the Mean Squared Error (MSE) and is
given by J(θ0, θ1) = (1/(2m)) · Σ (hθ(x_i) − y_i)², where m is the total count
of observations and the summation over the squared differences (errors) ensures that
the higher the error, the greater the cost. The cost function is minimized using
Gradient Descent, which iteratively adjusts θ0 and θ1 to minimize the
cost function and derive a line that gives us the lowest possible error or cost.
Linear Regression Model Assumptions
1. Linearity: The relationship between the independent and dependent variables is
linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of errors.
4. Normality: Errors are normally distributed.
Designing Linear Regression Model
Design sketch (pseudocode) — gradient descent from scratch:

def gd(features, target):
    prediction = features @ theta
    error = prediction - target
    cost = mean(error ** 2)
    theta = theta - alpha * (features.T @ error) / m   # gradient step

predictions = features_test @ final_theta
r2_test = r2_score(target_test, predictions)

OR, using scikit-learn:

model = LinearRegression().fit(features_train, target_train)   # TRAINING: x and y both shown
predictions = model.predict(features_test)                     # TESTING: x shown, y predicted
r2_test = r2_score(target_test, predictions)                   # TEST SCORE
Implementing Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn import metrics

model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
mse = metrics.mean_squared_error(target_test, predictions)
r2 = metrics.r2_score(target_test, predictions)
print(mse, r2)
Evaluating the Model Performance - Metrics
Key metrics include:
● Mean Absolute Error (MAE)
● Mean Squared Error (MSE)
● R-squared (R²)
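As a quick point of reference, here is a minimal, illustrative sketch of computing these metrics with scikit-learn; y_test and predictions are assumed names for the actual and predicted values.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes y_test (actual values) and predictions (model output) already exist
mae = mean_absolute_error(y_test, predictions)  # average absolute error
mse = mean_squared_error(y_test, predictions)   # average squared error; penalizes large errors more
r2 = r2_score(y_test, predictions)              # proportion of variance explained
print(mae, mse, r2)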
LOGISTIC REGRESSION
Logistic Regression is a classification algorithm used to estimate the
probabilities of a binary response based on one or more predictor (also known
as independent) variables. It is particularly beneficial for binary outcomes,
meaning situations with only two possible results.
Our goal is to predict wine quality, which, as you may remember, ranges from
0 to 10. To keep things simple and focus on a binary classification problem,
let's classify the wines as good (a quality rating of 7 or above) and not good (a
quality rating below 7). Therefore, we will be using Logistic Regression to
predict whether the quality of a specific type of wine is 'good' or 'not good'
based on its physicochemical features.
All of this is achieved by using a logistic function, which limits the unbounded
outcome of the linear equation to a number between 0 and 1. Also known as
the Sigmoid function, this logistic function is an S-shaped curve that maps
any real-valued number into a value falling within these bounds. The
function is defined as

σ(x) = 1 / (1 + e^(−x))

In this equation, x represents the output of a linear combination of feature
values and their corresponding coefficients,

x = β0 + β1·X1 + β2·X2 + … + βn·Xn

In this equation:
● β (Beta) terms are the model's parameters or weights, signifying the
influence of each input feature (denoted by X) on the predicted
outcome. X terms represent independent predictor variables.
Once we compute the predicted probability (p) using the Sigmoid function, we
can assign classes by defining a threshold (which is generally 0.5):
● If p≥0.5, the label for the example is 1 (or Good in our case).
● If p<0.5, the label for the example is 0 (or Not Good in our case).
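For illustration, here is a tiny NumPy sketch of the Sigmoid function and the 0.5 threshold rule described above; the sample values are made up for demonstration.

import numpy as np

def sigmoid(z):
    # Maps any real-valued number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # outputs of the linear combination β0 + β1·X1 + ...
p = sigmoid(z)                   # predicted probabilities
labels = (p >= 0.5).astype(int)  # 1 = 'Good', 0 = 'Not Good'
print(p, labels)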
The next critical component in Logistic Regression is the cost function. Unlike
in Linear Regression, we can't use Mean Squared Error as the cost function
because the logistic function would introduce a non-linear term into the cost
function, making it non-convex. In Logistic Regression, the cost function
(the cross-entropy, or log, loss) is defined as:

J(θ) = −(1/m) · Σ [ y · log(hθ(x)) + (1 − y) · log(1 − hθ(x)) ]

Where:
● θ represents the parameters we must determine using an
optimization algorithm to minimize the cost function.
● m is the number of samples.
● y and x represent the target and input of each sample, respectively.
● hθ(x) is the logistic function that computes the predicted probability
that y=1.
While discussing the cost function, it's crucial to consider optimization
algorithms like Gradient Descent used to find the parameters θ to minimize
this cost.
Disclaimer: in most scenarios, you don't have to remember or implement the
cost function yourself, as there are plenty of libraries (e.g., scikit-learn)
that provide a built-in implementation of Logistic Regression. However,
it's still essential to understand the high-level concepts and what's being
optimized.
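To make the cost function concrete, here is a small illustrative NumPy sketch of the cross-entropy loss for some hypothetical labels and predicted probabilities; it sketches the idea rather than scikit-learn's internal implementation.

import numpy as np

def cross_entropy(y, p):
    # y: actual labels (0 or 1); p: predicted probabilities hθ(x)
    m = y.size
    return -(1 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])  # hypothetical predicted probabilities
print(cross_entropy(y, p))     # lower is better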
Logistic Regression Model Assumptions
1. Each observation is independent of others: This means the outcome or
probability of success (p in our logistic function) for one example
neither influences nor is influenced by the outcomes of other examples.
2. There is no multicollinearity among explanatory variables: In simple
terms, the input variables should not be too highly correlated with each
other. Any correlation implies that they carry similar information to the
model, which is redundant.
3. The input variables have a linear relationship with the log odds:
Although the outcome in logistic regression is a binary variable, logistic
regression stipulates that the input variables are linearly related to the
log odds, log(p / (1 − p)), and hence to the logit of the probability, p.
Violating these assumptions may result in inaccurate models and
misinterpretations. Therefore, validating these assumptions while
modeling Logistic Regression is essential.
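One simple, hedged way to sanity-check the multicollinearity assumption is to inspect pairwise feature correlations; the DataFrame name features is an assumption here, not part of the course code.

# Assumes `features` is a pandas DataFrame of the input variables
corr = features.corr().abs()
print(corr.round(2))
# Pairs with correlation close to 1 (an illustrative threshold of 0.9 is used here)
# carry largely redundant information and may violate the assumption.
high = (corr > 0.9) & (corr < 1.0)
print(high[high].stack())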
Designing Logistic Regression Model
1. Specify the hypothesis or function the model should learn. In Logistic
Regression, this is the Sigmoid function.
2. Define an error, cost, or loss function we aim to minimize. For Logistic
Regression, the cost function is the Cross-Entropy Loss.
3. Define a learning algorithm that optimizes the parameters for the hypothesis to
fit the model to the training data. In our case, it's the Gradient Descent
algorithm.
Implementing Logistic Regression Model
Create a LogisticRegression object and use the fit function to train it on
the training sets, X_train and y_train. The learned parameters of the
logistic function can then be printed: the coef_ attribute gives the coefficients
for the different features (or X), while intercept_ provides the intercept
term (or β0), as sketched below.
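A minimal sketch of what this looks like, assuming the training split from earlier (X_train and y_train are assumed names):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)   # learn the β parameters from the training data
print(model.coef_)            # coefficients for the features (X)
print(model.intercept_)       # intercept term (β0)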
Interpreting the Logistic Regression Model
Now that we have our trained Logistic Regression model, we might wonder how to interpret the
output of our model. The output of the model includes the coefficients (also known as weights) of
each feature and a bias (also known as the intercept). The coefficients represent the log of the odds
ratio of the corresponding feature.
For example, if the coefficient of a feature, say pH (with log odds ratio = β), is 0.5, it indicates that
for each unit change in pH, keeping other features constant, the odds of our outcome (whether the
wine quality is good) would increase by a factor of e^0.5 ≈ 1.65.
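One possible (illustrative) way to turn the raw coefficients into odds ratios is to exponentiate them; lr is assumed to be the fitted LogisticRegression model and features_train a DataFrame of the inputs.

import numpy as np

# e^(coefficient) gives the multiplicative change in the odds per unit increase in that feature
odds_ratios = np.exp(lr.coef_[0])
for feature, ratio in zip(features_train.columns, odds_ratios):
    print(f"{feature}: odds multiplied by {ratio:.3f} per unit increase")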
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression(max_iter=5000)
# TODO: Train your logistic regression model using the training sets
lr.fit(features_train, target_train)
# TODO: Make predictions on the test dataset
predictions = lr.predict(features_test)
# TODO: Evaluate the model using different metrics, e.g. accuracy, precision, recall, or F1 score
print("Model Performance Metrics:")
print("Accuracy: ", metrics.accuracy_score(target_test, predictions))
print("Precision: ", metrics.precision_score(target_test, predictions))
print("Recall: ", metrics.recall_score(target_test, predictions))
print("F1 Score: ", metrics.f1_score(target_test, predictions))
print("AUC: ", metrics.roc_auc_score(target_test, predictions))
print("Model Coefficients: ", lr.coef_[0])
print("Intercept: ", lr.intercept_[0])
print("Predicted outcomes: ", predictions)
Evaluating the Model Performance - Metrics
Key metrics include:
1. Confusion Matrix: This table describes the performance of a
classification model. It's essentially a 2×2 matrix that visualizes the
classifier's performance, representing actual and predicted
classifications in terms of true positives, false positives, true negatives,
and false negatives.
2. Accuracy: This is the ratio of correctly predicted observations to total
observations. Accuracy = (True Positives + True Negatives) / Total
Observations.
3. Precision: This is the ratio of correctly predicted positive
observations to the total predicted positives. Precision = True
Positives / (True Positives + False Positives).
4. Recall (Sensitivity): This is the ratio of correctly predicted positive
observations to all observations in the actual class. Recall = True
Positives / (True Positives + False Negatives).
5. F1 Score: This is the harmonic mean of Precision and Recall. F1 Score
= 2 * Recall * Precision / (Recall + Precision).
6. ROC-AUC : This is the area under the Receiver Operating Characteristic
curve. It indicates how much the model can distinguish between
classes.
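As a compact, hedged illustration of these metrics, the snippet below assumes target_test and predictions from the earlier Logistic Regression code:

from sklearn import metrics

cm = metrics.confusion_matrix(target_test, predictions)
tn, fp, fn, tp = cm.ravel()   # unpack the 2x2 matrix
print("Confusion matrix:\n", cm)
print("Accuracy :", (tp + tn) / (tp + tn + fp + fn))
print("Precision:", tp / (tp + fp))
print("Recall   :", tp / (tp + fn))
print("F1 Score :", metrics.f1_score(target_test, predictions))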
Understanding Model Overfitting and Underfitting
In machine learning, balance is crucial. If your model performs well on the
training data but poorly on unseen data (such as validation and test
datasets), it may be overfitting. This issue is similar to an attempt to ace a
specific test by learning to copy all the answers without understanding the
concepts, which leads to poor performance in other tests. This problem arises
because the model learns the noise in the training data rather than the signal.
Conversely, we have underfitting. An underfitted model performs poorly on
both training and unseen data because it hasn't learned the underlying pattern
of the data.
In subsequent lessons, we will explore these concepts deeper and examine
how to fine-tune our models to prevent overfitting and underfitting.
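One practical, hedged way to spot these issues is to compare training and test scores for the same model; the model and split names below are assumptions for illustration.

from sklearn.metrics import r2_score

train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
# A large gap (high train score, much lower test score) hints at overfitting;
# low scores on both hint at underfitting.
print(f"Train R^2: {train_score:.3f}, Test R^2: {test_score:.3f}")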
GRADIENT DESCENT SINGLE FEATURE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import r2_score, mean_squared_error
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# TODO: Load the Red Wine Quality dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine =pd.DataFrame(red_wine)
# TODO: Select 'sulphates' as the predictive feature and 'quality' as the target variable
x=pd.DataFrame(red_wine['sulphates'])
y=red_wine['quality']
# TODO: Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=42)
# TODO: Initialize the model parameters to all 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# TODO: Apply the Gradient Descent function to optimize the parameters
y_train = np.array(y_train).reshape(-1, 1)
alpha = 0.05
iters = 3000
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
y_pred_train = np.dot(x_train, g)
y_pred_test = np.dot(x_test, g)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
# TODO: Create a plot to display the cost function against iterations
plt.plot(cost_list)
plt.show()
GRADIENT DESCENT MULTIPLE FEATURE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.preprocessing import StandardScaler
start_time = pd.Timestamp.now()
# Load the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features (X) and target (y)
# Exclude 'quality' which is our target variable
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
# Print feature names and shape
print(f"Features used ({X.shape[1]} total):")
for i, feature in enumerate(X.columns, 1):
    print(f"{i}. {feature}")
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# Initialize parameters (theta) for all features
theta = np.zeros(X_train.shape[1]).reshape(-1, 1)
def gradient_descent(X, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    # Add progress tracking
    for i in range(iterations):
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        cost = 1/(2*m) * np.dot(error.T, error)
        theta = theta - (alpha * (1/m) * np.dot(X.T, error))
        cost_list.append(np.squeeze(cost))
        theta_list.append(theta.copy())
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Set parameters
alpha = 0.01 # Increased learning rate since we scaled the features
iters = 3000
# Run gradient descent
print("\nTraining model...")
final_theta, theta_history, cost_history = gradient_descent(
X_train, y_train, theta, alpha, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(range(1, iters + 1), cost_history, color='blue')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.grid(True)
plt.show()
# Print final parameters and their importance
print("\nFinal parameters for each feature:")
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Parameter Value': final_theta.flatten()
})
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(X_train, final_theta)
ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(X_test, final_theta)
ss_tot = np.sum((y_test.values.reshape(-1, 1) - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test.values.reshape(-1, 1) - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
stop_time = pd.Timestamp.now()
print('Time taken for optimization: ', stop_time - start_time, ' seconds')
Features used (11 total):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Training model...
Iteration 500/3000, Cost: 16.036170
Iteration 1000/3000, Cost: 16.035651
Iteration 1500/3000, Cost: 16.035434
Iteration 2000/3000, Cost: 16.035318
Iteration 2500/3000, Cost: 16.035256
Iteration 3000/3000, Cost: 16.035222
Final parameters for each feature:
Feature Parameter Value Absolute Importance
alcohol 0.366530 0.366530
sulphates 0.186969 0.186969
density 0.144561 0.144561
pH 0.101144 0.101144
citric acid 0.094403 0.094403
chlorides -0.087866 0.087866
fixed acidity -0.039480 0.039480
residual sugar -0.026653 0.026653
free sulfur dioxide 0.013299 0.013299
total sulfur dioxide -0.013174 0.013174
volatile acidity 0.005152 0.005152
Model Performance:
Training R-squared: -47.6866
Testing R-squared: -50.5587
Time taken for optimization: 0 days 00:00:27.383149 seconds
GRADIENT DESCENT REGULARIZED
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import matplotlib.pyplot as plt
import datasets
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error
start_time = pd.Timestamp.now()
# Load and prepare the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features and target
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
if red_wine.isnull().sum().any():
    print("Missing values found")
else:
    print("No missing values found")
# Create polynomial features (up to degree 2 for interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
feature_names = poly.get_feature_names_out(X.columns)
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3, random_state=10)
def gradient_descent_with_regularization(X, y, theta, alpha, lambda_reg, iterations, early_stop_rounds=50):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    best_cost = float('inf')
    rounds_without_improvement = 0
    for i in range(iterations):
        # Forward pass
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        # Calculate cost with L2 regularization (don't regularize the intercept)
        reg_term = (lambda_reg / (2*m)) * np.sum(theta[1:] ** 2)
        cost = 1/(2*m) * np.dot(error.T, error) + reg_term
        # Calculate gradients with regularization
        gradients = (1/m) * np.dot(X.T, error)
        gradients[1:] += (lambda_reg/m) * theta[1:]  # Add regularization term
        # Update parameters
        theta = theta - alpha * gradients
        cost_list.append(float(cost))
        theta_list.append(theta.copy())
        # Early stopping check
        if cost < best_cost:
            best_cost = cost
            rounds_without_improvement = 0
            best_theta = theta.copy()
        else:
            rounds_without_improvement += 1
        if rounds_without_improvement >= early_stop_rounds:
            print(f"Early stopping at iteration {i+1}")
            break
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {float(cost):.6f}")
    return best_theta, theta_list, cost_list
# Initialize parameters
n_features = X_train.shape[1]
theta = np.zeros((n_features, 1))
# Hyperparameters
alpha = 0.001 # Reduced learning rate
lambda_reg = 0.1 # Regularization strength
iters = 5000
print(f"\nTraining model with {n_features} features (including polynomial terms)")
print("First 5 feature names as example:")
for i, name in enumerate(feature_names[:5], 1):
    print(f"{i}. {name}")
# Train the model
final_theta, theta_history, cost_history = gradient_descent_with_regularization(
    X_train, y_train, theta, alpha, lambda_reg, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(cost_history, color='blue')
plt.xlabel('Iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of Gradient Descent with Regularization')
plt.grid(True)
plt.show()
# Compare with sklearn's Ridge regression
ridge = Ridge(alpha=lambda_reg)
ridge.fit(X_train, y_train)
# Calculate predictions and R-squared for both approaches
# Our implementation
y_pred_train = np.dot(X_train, final_theta)
y_pred_test = np.dot(X_test, final_theta)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)
# Ridge implementation
ridge_pred_train = ridge.predict(X_train)
ridge_pred_test = ridge.predict(X_test)
ridge_r2_train = r2_score(y_train, ridge_pred_train)
ridge_r2_test = r2_score(y_test, ridge_pred_test)
# Print results
print("\nModel Performance:")
print("Gradient Descent with Regularization:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
print(f"Training MSE: {mean_squared_error(y_train, y_pred_train):.4f}")
print(f"Testing MSE: {mean_squared_error(y_test, y_pred_test):.4f}")
print("\nScikit-learn Ridge Implementation:")
print(f"Training R-squared: {ridge_r2_train:.4f}")
print(f"Testing R-squared: {ridge_r2_test:.4f}")
print(f"Training MSE: {mean_squared_error(y_train, ridge_pred_train):.4f}")
print(f"Testing MSE: {mean_squared_error(y_test, ridge_pred_test):.4f}")
# Feature importance analysis
feature_importance = pd.DataFrame({
'Feature': feature_names,
'Parameter Value': final_theta.flatten(),
'Ridge Parameter': ridge.coef_
})
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
stop_time = pd.Timestamp.now()
print('Time taken for optimization: ', stop_time - start_time, ' seconds')
Training model with 77 features (including polynomial terms)
First 5 feature names as example:
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
Iteration 500/5000, Cost: 16.005579
Iteration 1000/5000, Cost: 15.995963
Iteration 1500/5000, Cost: 15.988637
Iteration 2000/5000, Cost: 15.982492
Iteration 2500/5000, Cost: 15.977168
Iteration 3000/5000, Cost: 15.972465
Iteration 3500/5000, Cost: 15.968250
Iteration 4000/5000, Cost: 15.964428
Iteration 4500/5000, Cost: 15.960930
Iteration 5000/5000, Cost: 15.957702
Model Performance:
Gradient Descent with Regularization:
Training R-squared: -49.5259
Testing R-squared: -46.7613
Training MSE: 31.9154
Testing MSE: 33.3533
Scikit-learn Ridge Implementation:
Training R-squared: 0.4307
Testing R-squared: 0.3855
Training MSE: 0.3596
Testing MSE: 0.4291
Top 10 Most Important Features:
Feature Parameter Value Ridge Parameter Absolute Importance
chlorides total sulfur dioxide -0.188286 -0.056623 0.188286
volatile acidity total sulfur dioxide -0.160016 0.460917 0.160016
residual sugar total sulfur dioxide 0.155526 0.019077 0.155526
residual sugar^2 -0.152816 -0.175048 0.152816
residual sugar chlorides -0.144540 0.134031 0.144540
fixed acidity volatile acidity -0.118415 -0.037261 0.118415
total sulfur dioxide alcohol 0.108945 0.117779 0.108945
free sulfur dioxide sulphates -0.100814 -0.154157 0.100814
volatile acidity citric acid -0.097507 0.056614 0.097507
citric acid^2 -0.089492 0.114346 0.089492
Time taken for optimization: 0 days 00:00:18.955081 seconds
GRADIENT DESCENT THEORY
Gradient Descent Demystified
Have you ever hiked to the top of a hill and looked down to determine the best route of descent? One
potentially disastrous step off a steep cliff is dangerous, while cautiously descending the gentle slopes
might cause less harm. The concept of Gradient Descent mirrors this scenario — it, too, sees the value in
finding and taking the optimal path or, more precisely, reaching the minimum point.
In machine learning, Gradient Descent can be visualized as a careful navigation downwards until we find
the valley between hills. The 'hill' in this context is the cost function, which quantifies our model's error.
Through a series of small steps, Gradient Descent reduces the cost by 'walking' down the hill
in the direction of steepest descent until it reaches the lowest possible point, its optimal state.
At its core, Gradient Descent relies on two key mathematical mechanisms: the Cost Function and
the Learning Rate.
The Cost Function (or Loss Function) quantifies the disparity between predicted and expected values,
presenting it as a single float number. The type of cost function utilized depends on the challenge at hand. In
our Wine Quality dataset, we can define a cost function that computes the difference between our model's
predicted quality of wine and the actual quality.
The Learning Rate, symbolized by α, dictates the size of the steps we take downhill. A lower value of α results
in smaller, more precise steps, while a high value could cause drastic, potentially unstable steps. From our
previous analogy, imagine the hill is symbolized by a function of position, g(x). Starting at the hill's pinnacle
(x0), we revise our position (x) by moving a step proportional to the negative gradient at that location. The
gradient g′(x) is simply the derivative of g(x), pointing toward the steepest ascent. Conversely, −g′(x)
signifies the fastest descending path. We repeat this stepping process until the gradient becomes zero at the
minimum point, indicating no further downhill path, i.e., no additional optimization is required.
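To make the analogy concrete, here is a tiny illustrative sketch of gradient descent on a one-dimensional example function g(x) = x², whose minimum is at x = 0 (the function, starting point, and step count are assumptions for demonstration):

# g(x) = x**2, so g'(x) = 2*x points uphill and -g'(x) points downhill
alpha = 0.1   # learning rate: size of each downhill step
x = 5.0       # starting position x0 at the 'top of the hill'
for _ in range(50):
    gradient = 2 * x          # g'(x)
    x = x - alpha * gradient  # step along -g'(x)
print(x)  # ends up very close to 0, the minimum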
Advancements in Gradient Descent
Here, an interesting question arises, "Do we always use all data to calculate the gradient?" The
answer depends. Gradient Descent has evolved into various versions, depending on the amount of
data used in computing the gradient: batch, stochastic, and mini-batch gradient descent.
The original version, batch gradient descent, uses the complete dataset at every step. While this may
seem meticulous and comprehensive, it proves extremely inefficient when dealing with substantial
datasets housing millions of entries. Imagine watching a movie frame by frame at a snail's pace — it can
be painstakingly slow despite its precision.
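For contrast, a rough sketch of the mini-batch variant, which computes each update from a small random subset of the data rather than the full dataset; the batch size and array-based inputs are assumptions for illustration.

import numpy as np

def minibatch_gradient_descent(x, y, theta, alpha, iterations, batch_size=32):
    # x is assumed to be a NumPy array of shape (m, n_features);
    # y is assumed to be a column vector of shape (m, 1), matching the course code
    m = y.size
    for _ in range(iterations):
        idx = np.random.choice(m, batch_size, replace=False)  # sample a mini-batch
        xb, yb = x[idx], y[idx]
        error = np.dot(xb, theta) - yb
        theta = theta - alpha * (1 / batch_size) * np.dot(xb.T, error)
    return theta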
Implementing Gradient Descent
Now, let's implement Gradient Descent in Python. We start by assigning initial
values to our model’s parameters. Gradual adjustments to these parameters follow, in each instance
computing the cost function (our error) and taking a step down the steepest slope until our error is
minimal or the state is optimized.
Here’s a general outline of how we would implement gradient descent in Python:
def gradient_descent(x, y, theta, alpha, iterations):
    # x - input dataset/feature
    # y - target dataset/feature
    # theta - initial parameters
    # alpha - learning rate
    # iterations - no. of times the optimization algorithm executes to fine-tune the parameters
    m = y.size            # number of data points
    cost_list = []        # list to store the cost function value at each iteration
    theta_list = [theta]  # list to store the values of theta at each iteration
    for i in range(iterations):
        prediction = np.dot(x, theta)                         # our prediction based on our current theta
        error = prediction - y                                # error between our prediction and the actual values
        cost = 1 / (2*m) * np.dot(error.T, error)             # calculate the cost function
        cost_list.append(np.squeeze(cost))                    # append the cost to the cost_list
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))  # gradient descent update of theta
        theta_list.append(theta)                              # append the updated theta to the theta_list
    return theta, theta_list, cost_list                       # return the final values after the last iteration
# PROGRAM
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
white_wine = datasets.load_dataset('codesignal/wine-quality', split='white')
white_wine = pd.DataFrame(white_wine)
# Only consider the 'density' column as a predictive feature for now
x = pd.DataFrame(white_wine['density'])
y = white_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.001
iters = 2000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='green')
plt.rcParams["figure.figsize"] = (10,6)
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.show()
Let's break down the line theta = np.zeros(x_train.shape[1]).reshape(-1, 1) step by
step:
Since x is a matrix of input features (shape: n x 1) and theta is a vector of
parameters (shape: 1 x 1), their dot product, used later for prediction, results in
the predicted values prediction (shape: n x 1).
1. x_train.shape[1]:
● This retrieves the number of columns (features) in x_train.
● Example: If x_train has one feature (like density), x_train.shape[1]
would be 1.
2. np.zeros(x_train.shape[1]):
● Creates an array of zeros with a length equal to the number of
features.
● Example: If x_train.shape[1] is 1, this results in an array [0].
3. .reshape(-1, 1):
● Reshapes the array into a column vector. This reshapes theta into a
column vector (from an array of shape (1,) to (1, 1)).
● -1: Let NumPy automatically determine the number of rows based on the
data (in this case, it’s 1 because there’s one feature). 1: Specifies
that the output should have 1 column.
● Example: Reshaping [0] results in [[0]], a column vector with one row
and one column.
We reshape theta because later, we’ll perform matrix operations where theta must be
a column vector for matrix multiplication to work correctly.
In gradient descent, we will use matrix multiplication to calculate predictions and
update the weights. For matrix multiplication to work correctly in Python’s numpy
(or other numerical libraries), the dimensions of matrices must align.
● Specifically, the shape of the input feature matrix x will be (n_samples,
n_features) (i.e., number of rows = number of data points, number of columns =
number of features).
● theta needs to be a column vector (shape (n_features, 1)) to allow for the dot
product between the feature matrix and the weight vector.
For example, if we want to compute Prediction = X · θ:
● X: The feature matrix of shape (n_samples, n_features).
● theta: The weight vector, which must be shaped (n_features, 1) for this
multiplication to be valid.
Purpose of the Split:
The goal of splitting the data into training and testing sets is to train the model
on one portion of the data and then test its generalization on unseen data. This
helps in assessing the model’s real-world performance and avoiding overfitting (a
model performing well on training data but poorly on new data).
Why Split the Data?
1. Training:
○ You need enough data to train the model so it can learn the relationship between
the input feature (density) and the target (quality).
2. Testing:
○ Once the model is trained, the test set is used to evaluate the model’s
performance on data it has never seen before. This helps ensure that the model
will generalize well to new, unseen data in real-world scenarios.
3. Avoid Overfitting:
○ Overfitting happens when a model performs very well on the training data but
poorly on new data because it has "memorized" the training set rather than
learning general patterns.
○ The test set allows you to check if the model has overfitted by evaluating its
performance on data it has not been trained on.
By splitting the dataset into training and testing subsets, we ensure that the model can be
evaluated effectively for its predictive accuracy, and we get an estimate of how it will
perform in real-world applications.
Function Parameters:
● x: The input feature matrix (in this case, the density of the wine from x_train). It is
an n x 1 matrix (n = number of training examples, 1 feature).
● y: The target values (in this case, the quality of the wine from y_train). This is the
true label we are trying to predict. It is a column vector of shape (n, 1).
● theta: The initial parameter values (initialized as zeros in the previous step). This
is the weight vector (or coefficient) that will be updated iteratively to minimize the
error in predictions.
● alpha: The learning rate, which controls the size of the steps taken towards the
minimum of the cost function. A small value of alpha ensures that the gradient descent
algorithm converges more steadily, but it might take longer. If alpha is too large, it
can cause the algorithm to overshoot the minimum and fail to converge.
● iterations: The number of times to update theta. More iterations allow the model to
converge closer to the optimal solution, but at a cost of more computation.
Breakdown of the Function:
1. m = y.size:
● m represents the number of training examples (rows) in the dataset. This is used to
calculate the average cost and gradients.
● y.size gives the total number of elements in y, which is the number of training
samples.
2. cost_list = []:
● This initializes an empty list to store the cost values after each iteration. The cost
(also known as the loss) is a measure of how far off the model's predictions are from
the actual target values. In linear regression, the cost function is the mean squared
error.
3. theta_list = [theta]:
● This initializes a list to store the theta values after each iteration. It starts with
the initial value of theta (which was set to zeros in Step 5).
● Keeping track of theta values helps in visualizing how the parameters evolve during
training.
4. Loop over the Number of Iterations:
● loop runs for the specified number of iterations, updating the parameters theta in each
step.
5. Purpose of theta is Prediction:
● theta represents the weights or coefficients that will be used to predict the target
variable (quality of wine) based on the input feature (density of wine).
● In a linear regression model, the formula for predicting the target is ŷ = θ0 + θ1·x,
where:
○ ŷ is the predicted value (in this case, wine quality).
○ θ0 is the intercept (bias term).
○ θ1 is the weight for the feature (density).
○ x is the input feature (density values in this case).
6. Cost Calculation (Mean Squared Error):
The cost function in linear regression is the Mean Squared Error (MSE). The formula is
J(θ) = (1/(2m)) · Σ (hθ(x_i) − y_i)², where:
● hθ(x) is the predicted value (i.e., np.dot(x, theta)).
● y is the actual value.
● m is the number of data points.
In code, np.dot(error.T, error) computes the sum of squared errors. The 1 / (2*m) scales it
to give the mean squared error. The division by 2 is a mathematical convenience used in
gradient descent because it simplifies the derivative of the cost function.
The computed cost is appended to the cost_list to track how the cost changes over
iterations.
7. Gradient Descent Update for theta:
● This is the key step in gradient descent. The idea is to update theta in the direction
that minimizes the cost function. This update rule is derived from the partial
derivative of the cost function with respect to theta.
● The term alpha * (1/m) * np.dot(x.T, error) is the gradient of the cost function with
respect to theta. It tells us the direction and magnitude by which we need to adjust
theta to reduce the error.
● The formula for updating theta is θ := θ − α · ∇θJ(θ), where:
● α is the learning rate.
● ∇θJ(θ) is the gradient of the cost function (the term np.dot(x.T, error)
in code).
● theta is updated in the direction that minimizes the cost.
Each time the loop runs, this update is applied, moving the theta values closer to their
optimal solution.
Summary of Gradient Descent Process:
1. Start with an initial guess for theta or weights (initialized as zeros).
2. Make predictions using the current values of theta and the input features with a simple
linear regression model where density predicts wine quality.
3. Calculate the error between the predicted values and the actual target values.
4. Compute the cost function (mean squared error) to measure how far off the predictions
are.
5. Update theta by moving in the direction that reduces the error (the gradient descent
step).
6. Repeat this process for the specified number of iterations, tracking the cost and theta
values as you go.
7. Return the final optimized theta and the history of theta and cost values.
Gradient descent ensures that, with each iteration, the model gets better at making predictions
by continuously reducing the cost.
This is a basic introduction to linear regression with gradient descent, using only one feature
(density) and optimizing parameters iteratively.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
# Load White Wine Quality Dataset
white_wine = datasets.load_dataset('codesignal/wine-quality',
split='white')
white_wine = pd.DataFrame(white_wine)
# Only consider the 'density' column as a predictive feature for now
x = pd.DataFrame(white_wine['density'])
y = white_wine['quality']
print(f"Features used ({x.shape[1]} total):")
for i, feature in enumerate(x.columns, 1):
print(f"{i}. {feature}")
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=0)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.001
iters = 2000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
print("\nTraining model...")
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='green')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.show()
print("\nFinal parameters for each feature:")
final_theta = theta_list[-1].flatten()  # Access the last element and flatten it
feature_importance = pd.DataFrame({
'Feature': x.columns,
'Parameter Value': final_theta
})
print(feature_importance)
feature_importance['Absolute Importance'] = abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(x_train, final_theta)
# ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
# ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
ss_res = np.sum((y_train - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(x_test, final_theta)
ss_tot = np.sum((y_test - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
Features used (1 total):
1. density
Training model...
Iteration 500/2000, Cost: 6.873248
Iteration 1000/2000, Cost: 2.803397
Iteration 1500/2000, Cost: 1.288870
Iteration 2000/2000, Cost: 0.725263
Final parameters for each feature:
Feature Parameter Value
0 density 5.110624
Feature Parameter Value Absolute Importance
density 5.110624 5.110624
Model Performance:
Training R-squared: -6393.4648
Testing R-squared: -0.7097
The parameter value of 5.110624 for the feature density
indicates the weight that the gradient descent algorithm has
learned for this feature in predicting wine quality. Here's
what it means:
● Positive Value: A positive parameter value suggests that
as the density of the wine increases, the predicted
quality also increases, according to the model.
● Magnitude: The magnitude of the parameter indicates the
strength of the relationship. A larger absolute value
would suggest a stronger influence of density on the
predicted quality.
In this context, the "Parameter Value" and "Absolute
Importance" are indeed the same because:
● Parameter Value: This is the weight that the model has
learned for the feature density.
● Absolute Importance: This is simply the absolute value of
the parameter. Since the parameter is positive, its
absolute value is the same as the parameter itself.
In cases with multiple features, the absolute importance helps
compare the relative influence of each feature on the model's
predictions, regardless of the sign of the parameter. Here,
with only one feature, they naturally match.
The "Model Performance" section in the output refers to the
evaluation of how well your linear regression model is fitting
the data. Specifically, it uses the R-squared (R2) metric to
assess performance:
● R-squared (R2): This metric indicates the proportion of
variance in the target variable that is predictable from
the input features. It ranges from 0 to 1, where:
● 1 means the model perfectly predicts the target
variable.
● 0 means the model does not explain any variability in
the target variable.
● Negative values, like in your output, suggest that
the model is performing worse than a simple
mean-based prediction.
In your output:
● Training R-squared: Indicates how well the model fits the
training data. A negative value suggests poor performance.
● Testing R-squared: Indicates how well the model
generalizes to unseen data. A negative value here also
suggests poor generalization.
The negative R-squared values imply that the model is not
capturing the relationship between density and quality
effectively. You might need to explore other features or model
types for better performance.
Your approach to calculating the R-squared value for the test
dataset is correct, but the negative R-squared value suggests
that the model is performing poorly. This could be due to
several reasons, such as:
1. Feature Scaling: Ensure that the feature (pH) is scaled
properly. Gradient descent can be sensitive to the scale
of the features.
2. Learning Rate: The learning rate (alpha) might be too high
or too low. Experiment with different values to see if the
model's performance improves.
3. Model Complexity: Using only one feature (pH) might not be
sufficient to capture the complexity of the data. Consider
adding more features to improve the model.
4. Data Quality: Check if there are any issues with the data,
such as outliers or missing values, that might affect the
model's performance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha,
iters)
# Test the performance of the model on the testing dataset
final_theta=theta_list[-1].flatten()
y_pred_test = np.dot(x_test, final_theta)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Testing R-squared: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
print(f'Mean Squared Error: {mse}')
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'Mean Absolute Error: {mae}')
# R-squared (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_test)
print(f'R-squared: {r2}')
plt.plot(range(1, iters + 1), cost_list, color='orange')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
The change in the "Testing R-squared" output is due to the
learner correctly using the final theta from the training phase
to calculate predictions for the test dataset. Here's what was
done:
1. Using Final Theta: The learner used theta_list[-1] to get
the final theta from the training phase, which is crucial
for making accurate predictions on the test dataset.
2. Calculating Predictions: They calculated y_pred_test using
the test dataset and the final theta.
3. R-squared Calculation: The R-squared value was calculated
using the formula R² = 1 − SS_res / SS_tot, where SS_res is the sum
of squares of residuals and SS_tot is the total sum of squares.
The negative R-squared value indicates that the model is not
performing well on the test dataset, which could be due to the
simplicity of using only one feature (pH) or other factors like
feature scaling or learning rate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='orange')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
# Test the performance of the model on the testing dataset
final_theta=theta_list[-1].flatten()
y_pred_test = np.dot(x_test, final_theta)
# R-squared (Coefficient of Determination) direct method from sklearn
r2 = r2_score(y_test, y_pred_test)
print(f'R-squared: {r2}')
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
ss_res = np.sum((y_test - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
CHATGPT R-square
Explanation of Performance Metrics:
1. Mean Squared Error (MSE):
○ MSE is a common metric for regression problems. It calculates the
average squared difference between the predicted and actual values.
○ Lower values of MSE indicate better performance (closer to zero means
the model's predictions are closer to the true values).
2. Formula: MSE = (1/n) · Σ (y_pred,i − y_true,i)²,
where y_pred,i is the predicted value, and y_true,i is
the actual value.
3. Mean Absolute Error (MAE):
○ MAE calculates the average absolute difference between predicted and
actual values. Unlike MSE, it doesn't square the errors, so it's less
sensitive to outliers.
○ Like MSE, a lower value of MAE indicates better model performance.
4. Formula: MAE = (1/n) · Σ |y_pred,i − y_true,i|.
5. R-squared (R²):
○ R² is a statistical measure that represents the proportion of the
variance for the target variable that is explained by the model. It's a
measure of how well the regression predictions approximate the real
data points.
○ The value of R² is at most 1, where 1 means perfect predictions
and 0 means the model explains none of the variability in the target
variable; it can even be negative when the model fits worse than
simply predicting the mean.
6. Formula: R² = 1 − Σ (y_true,i − y_pred,i)² / Σ (y_true,i − ȳ_true)²
Where:
○ y_pred are the predicted values.
○ y_true are the actual values.
○ ȳ_true is the mean of the actual values.
Summary of Steps:
1. Make Predictions: Use the trained model (optimized theta) to make predictions
on the test set (x_test).
2. Evaluate the Model: Compute metrics like MSE, MAE, or R² to understand how
well your model is performing on the test data.
3. Interpret the Results: Lower MSE and MAE values indicate better model
performance. For R², a value close to 1 indicates a good fit.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Reshape y_test to match the shape of predictions for matrix multiplication
y_test = np.array(y_test).reshape(-1, 1)
# 1. Use `theta` to predict the output (quality) for the test dataset
# Since x_test is a single feature (density), we use matrix multiplication
predictions = np.dot(x_test, theta)
# 2. Calculate the errors (residuals)
error = predictions - y_test
# 3. Calculate different performance metrics:
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse}')
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Absolute Error: {mae}')
# R-squared (Coefficient of Determination)
r2 = r2_score(y_test, predictions)
print(f'R-squared: {r2}')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 10 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 100
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
#mse = mean_squared_error(y_test, y_pred_test)
#mae = mean_absolute_error(y_test, y_pred_test)
#print(f'MSE, MAE:')
#print(mse, mae)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
plt.plot(x)
plt.show()
#scaler = StandardScaler()
#x = scaler.fit_transform(x_raw)
#x = pd.DataFrame(x, columns=x_raw.columns)
print(x.describe())
#print(x_raw.describe())
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 3000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta, alpha, iters)
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.rcParams["figure.figsize"] = (10,6)
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
FRESH BASIC
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Only consider the 'pH' column as a predictive feature for now
x = pd.DataFrame(red_wine['pH'])
y = red_wine['quality']
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 1500
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
# Run gradient descent on the test split only to visualize its cost curve;
# the model itself is evaluated below with final_theta learned on the training data
y_test = np.array(y_test).reshape(-1, 1)
g_test, theta_test_list, cost_test_list = gradient_descent(x_test, y_test, theta, alpha, iters)
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
plt.figure(figsize=(10, 6))
plt.plot(range(1, iters + 1), cost_test_list, color='orange')
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on test dataset')
plt.show()
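The learning rate and iteration count above were chosen by hand. One way to see how sensitive gradient descent is to these choices is to rerun it for a few alpha values and overlay the cost curves. The sketch below is purely illustrative (the alpha grid is an arbitrary assumption) and reuses gradient_descent, x_train, y_train and iters defined above.
# Sketch: compare cost curves for a few hand-picked learning rates.
# The alpha values are illustrative only; reuses gradient_descent, x_train, y_train, iters.
for trial_alpha in (0.001, 0.01, 0.1):
    start_theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
    _, _, trial_costs = gradient_descent(x_train, y_train, start_theta, trial_alpha, iters)
    plt.plot(range(1, iters + 1), trial_costs, label=f'alpha={trial_alpha}')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.legend()
plt.title('Effect of the learning rate on convergence')
plt.show()
A larger alpha converges in fewer iterations as long as it stays small enough not to diverge; a very small alpha converges but needs many more iterations.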
# ALL FEATURES EXCEPT TARGET
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
# Load Red Wine Quality Dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Previous single-feature (pH) setup, kept commented out for reference
#x = pd.DataFrame(red_wine['pH'])
#y = red_wine['quality']
#plt.plot(x)
#plt.show()
x_raw = red_wine.drop('quality', axis=1)
y = red_wine['quality']
print(x_raw.isnull().sum())
x_raw = x_raw.fillna(0) # Replace NaN with 0
scaler = StandardScaler()
x = scaler.fit_transform(x_raw)
x = pd.DataFrame(x, columns=x_raw.columns)
print(x.describe())
#print(x_raw.describe())
# Splitting datasets into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state=10)
# We set our parameters to start at 0
theta = np.zeros(x_train.shape[1]).reshape(-1, 1)
# Gradient Descent Function
def gradient_descent(x, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)
        error = prediction - y
        cost = 1 / (2*m) * np.dot(error.T, error)
        cost_list.append(np.squeeze(cost))
        theta = theta - (alpha * (1/m) * np.dot(x.T, error))
        theta_list.append(theta)
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Define the number of iterations and alpha value
alpha = 0.01
iters = 3000
# Applying Gradient Descent
y_train = np.array(y_train).reshape(-1, 1)
g, theta_list, cost_list = gradient_descent(x_train, y_train, theta,
alpha, iters)
plt.figure(figsize=(10, 6))
plt.plot(range(1, iters + 1), cost_list, color='blue')
plt.grid()
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent on training dataset')
plt.show()
#Training Performance
final_theta=theta_list[-1].flatten()
y_pred_train=np.dot(x_train, final_theta)
r2_train=r2_score(y_train,y_pred_train)
print(f'R-squared train: {r2_train}')
# Test Performance
y_pred_test = np.dot(x_test, final_theta)
r2_test = r2_score(y_test, y_pred_test)
print(f'R-squared test: {r2_test}')
#Manual Test Performance just for illustration
#ss_tot = np.sum((y_test - y_test.mean()) ** 2)
#ss_res = np.sum((y_test - y_pred_test) ** 2)
#r2_test = 1 - (ss_res / ss_tot)
#print(f"R-squared manual calc: {r2_test:.4f}")
mse = mean_squared_error(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f'MSE, MAE:')
print(mse, mae)
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
dtype: int64
fixed acidity volatile acidity ... sulphates alcohol
count 1.599000e+03 1.599000e+03 ... 1.599000e+03 1.599000e+03
mean 3.554936e-16 1.733031e-16 ... 6.754377e-16 1.066481e-16
std 1.000313e+00 1.000313e+00 ... 1.000313e+00 1.000313e+00
min -2.137045e+00 -2.278280e+00 ... -1.936507e+00 -1.898919e+00
25% -7.007187e-01 -7.699311e-01 ... -6.382196e-01 -8.663789e-01
50% -2.410944e-01 -4.368911e-02 ... -2.251281e-01 -2.093081e-01
75% 5.057952e-01 6.266881e-01 ... 4.240158e-01 6.354971e-01
max 4.355149e+00 5.877976e+00 ... 7.918677e+00 4.202453e+00
[8 rows x 11 columns]
Iteration 500/3000, Cost: 16.014052
Iteration 1000/3000, Cost: 16.012998
Iteration 1500/3000, Cost: 16.012576
Iteration 2000/3000, Cost: 16.012360
Iteration 2500/3000, Cost: 16.012244
Iteration 3000/3000, Cost: 16.012180
R-squared train: -49.69847600642071
R-squared test: -46.29026820603956
MSE, MAE:
33.02437063055097 5.698079446215109
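The strongly negative R-squared values above are expected rather than a bug in the metrics: the design matrix has no intercept (bias) column, and after standardization every feature has mean 0, so the model's predictions are centered around 0 while wine quality averages roughly 5.6. That is also why the cost plateaus near 16 (about half the mean of quality squared) instead of approaching the variance of quality. A minimal fix is to prepend a column of ones so one parameter can learn the intercept. The sketch below is an added illustration and assumes x (the scaled features), y, alpha, iters, train_test_split and gradient_descent from the listing above are still in scope.
# Sketch: add an intercept (bias) column of ones before running gradient descent.
# Assumes x, y, alpha, iters, train_test_split and gradient_descent from above.
x_b = x.copy()
x_b.insert(0, 'bias', 1.0)   # column of ones so theta[0] acts as the intercept
xb_train, xb_test, yb_train, yb_test = train_test_split(x_b, y, test_size=0.3, random_state=10)
theta_b = np.zeros(xb_train.shape[1]).reshape(-1, 1)
yb_train = np.array(yb_train).reshape(-1, 1)
theta_b, _, costs_b = gradient_descent(xb_train, yb_train, theta_b, alpha, iters)
print('R-squared test with intercept:', r2_score(yb_test, np.dot(xb_test, theta_b.flatten())))
With the bias column in place, the test R-squared should move out of the strongly negative range and the cost curve should drop well below the plateau seen above.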
#CLAUDE all features
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import datasets
from sklearn.preprocessing import StandardScaler
# Load the dataset
red_wine = datasets.load_dataset('codesignal/wine-quality', split='red')
red_wine = pd.DataFrame(red_wine)
# Separate features (X) and target (y)
# Exclude 'quality' which is our target variable
X = red_wine.drop('quality', axis=1)
y = red_wine['quality']
# Print feature names and shape
print(f"Features used ({X.shape[1]} total):")
for i, feature in enumerate(X.columns, 1):
    print(f"{i}. {feature}")
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3, random_state=0)
# Initialize parameters (theta) for all features
theta = np.zeros(X_train.shape[1]).reshape(-1, 1)
def gradient_descent(X, y, theta, alpha, iterations):
    m = y.size
    cost_list = []
    theta_list = [theta.copy()]
    # Add progress tracking
    for i in range(iterations):
        prediction = np.dot(X, theta)
        error = prediction - y.values.reshape(-1, 1)
        cost = 1/(2*m) * np.dot(error.T, error)
        theta = theta - (alpha * (1/m) * np.dot(X.T, error))
        cost_list.append(np.squeeze(cost))
        theta_list.append(theta.copy())
        if (i+1) % 500 == 0:
            print(f"Iteration {i+1}/{iterations}, Cost: {np.squeeze(cost):.6f}")
    return theta, theta_list, cost_list
# Set parameters
alpha = 0.01  # learning rate; feature scaling keeps this value stable
iters = 3000
# Run gradient descent
print("\nTraining model...")
final_theta, theta_history, cost_history = gradient_descent(
X_train, y_train, theta, alpha, iters
)
# Plot cost convergence
plt.figure(figsize=(12, 6))
plt.plot(range(1, iters + 1), cost_history, color='blue')
plt.xlabel('Number of iterations')
plt.ylabel('Cost (J)')
plt.title('Convergence of gradient descent')
plt.grid(True)
plt.show()
# Print final parameters and their importance
print("\nFinal parameters for each feature:")
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Parameter Value': final_theta.flatten()
54
})
feature_importance['Absolute Importance'] =
abs(feature_importance['Parameter Value'])
feature_importance = feature_importance.sort_values('Absolute
Importance', ascending=False)
print(feature_importance.to_string(index=False))
# Calculate R-squared for training data
y_pred_train = np.dot(X_train, final_theta)
ss_tot = np.sum((y_train.values.reshape(-1, 1) - y_train.values.mean()) ** 2)
ss_res = np.sum((y_train.values.reshape(-1, 1) - y_pred_train) ** 2)
r2_train = 1 - (ss_res / ss_tot)
# Calculate R-squared for test data
y_pred_test = np.dot(X_test, final_theta)
ss_tot = np.sum((y_test.values.reshape(-1, 1) - y_test.values.mean()) ** 2)
ss_res = np.sum((y_test.values.reshape(-1, 1) - y_pred_test) ** 2)
r2_test = 1 - (ss_res / ss_tot)
print(f"\nModel Performance:")
print(f"Training R-squared: {r2_train:.4f}")
print(f"Testing R-squared: {r2_test:.4f}")
Features used (11 total):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
Training model...
Iteration 500/3000, Cost: 16.170979
Iteration 1000/3000, Cost: 16.167348
Iteration 1500/3000, Cost: 16.165631
Iteration 2000/3000, Cost: 16.164715
Iteration 2500/3000, Cost: 16.164220
Iteration 3000/3000, Cost: 16.163950
Final parameters for each feature:
Feature Parameter Value Absolute Importance
density -0.430942 0.430942
fixed acidity 0.426656 0.426656
sulphates 0.282146 0.282146
free sulfur dioxide 0.196943 0.196943
residual sugar 0.150299 0.150299
chlorides -0.129332 0.129332
total sulfur dioxide -0.070154 0.070154
alcohol 0.058984 0.058984
pH 0.045277 0.045277
citric acid 0.037427 0.037427
volatile acidity -0.034311 0.034311
Model Performance:
Training R-squared: -46.7194
Testing R-squared: -53.4278
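The same caveat applies to this variant: without a bias column the R-squared values stay strongly negative, so the "importance" ranking above should be read with care. As a reference point, the closed-form least-squares solution with an intercept can be computed directly from the normal equations. This sketch is an added comparison and assumes X_train, X_test, y_train and y_test from the listing above are still in scope.
# Sketch: closed-form least-squares solution with an intercept, for comparison.
# Assumes X_train, X_test, y_train, y_test from the listing above are in scope.
Xb_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train.values])
Xb_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test.values])
# Normal equation solved via least squares (numerically safer than inverting X^T X)
theta_ne, *_ = np.linalg.lstsq(Xb_train, y_train.values, rcond=None)
pred_test = Xb_test @ theta_ne
ss_res = np.sum((y_test.values - pred_test) ** 2)
ss_tot = np.sum((y_test.values - y_test.values.mean()) ** 2)
print('Closed-form R-squared test:', 1 - ss_res / ss_tot)
If the gradient-descent fit disagrees sharply with this closed-form fit, the usual culprits are the missing intercept column or too few iterations.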