Professional Machine Learning
6. Model Deployment and Maintenance: the model is deployed to real-world cases and is then retrained again and again as more data arrives.'''

The actual learning process happens in the fit method, which you call while providing the input data and labels in the form of X_train and y_train arrays. Predictors provide a predict method that takes the data to be predicted, applies the transformation with respect to the parameters learned by the fit method, and returns the predicted values or labels. Sklearn works on pandas DataFrames and numpy arrays.
Pipeline objects chain multiple estimators into a single one. Thus, you can encapsulate preprocessing and modeling steps in one object.
# Preprocess data with the relevant transformation
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)'''
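As a minimal sketch of the Pipeline idea mentioned above (the dataset and steps here are illustrative assumptions, not from the notebook):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
pipe.fit(X_train, y_train)          # fit runs each transformer, then fits the final estimator
print(pipe.score(X_test, y_test))   # new data goes through the same transformations automatically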
print('.')
df.head()
Y=df['Churn'].values #target
#.values is used to convert it to numpy array
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=10)
knn.fit(X,Y)
[75]: KNeighborsClassifier(n_neighbors=10)
[76]: import numpy as np
#let's test how it predicts:
y_pred=knn.predict(np.array([[128,25,265.1,197.4,244.7,10.01],[20,150,155,200,10,3]]))
print(f'Predictions: {y_pred}')
Predictions: [1 1]
y=df['Churn'].values #target
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Print the accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.8111067961165048
[78]: #Large K means a less complex model, which can cause under-fitting
#Small K means a more complex model, which can cause over-fitting, i.e. treating noise (messy data) as real signal
# Create neighbors
neighbors = np.arange(1,15)
train_accuracies = {}
test_accuracies = {}
for neighbor in neighbors:
# Set up a KNN Classifier
knn = KNeighborsClassifier(n_neighbors=neighbor)
# Fit the model
knn.fit(X_train, y_train)
# Compute accuracy
train_accuracies[neighbor] = knn.score(X_train, y_train)
test_accuracies[neighbor] = knn.score(X_test, y_test)
print(train_accuracies, '\n', test_accuracies)
{… 0.8639779413192489, 9: 0.8601526243228024, 10: 0.8580360783704538, 11: 0.8534534651158275, 12: 0.8532592865880891, 13: 0.8506378764636207, 14: 0.8498805802054409}
{1: 0.785242718446602, 2: 0.7761553398058253, 3: 0.8040388349514563, 4: 0.8034174757281554, 5: 0.8111067961165048, 6: 0.8150679611650485, 7: 0.817009708737864, 8: 0.8203495145631068, 9: 0.8199611650485437, 10: 0.8218252427184466, 11: 0.8213592233009709, 12: 0.8240776699029126, 13: 0.8225242718446601, 14: 0.8233009708737864}
[80]: #Accuracy is not always a good way to assess a classification model. It is the ratio of the number of correct predictions to the total number of predictions, so it only gives a general measure of how well the model performs.
#We use classification metrics to check other aspects of the model, e.g.:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
#Precision is the ratio of true positive predictions to the total number of predicted positive cases. It measures the accuracy of positive predictions, i.e. how many of the predicted positives are actually positive.
#Recall is the ratio of true positive predictions to the total number of actual positive cases. It measures how well the model identifies positive cases.
#The F1 score is the harmonic mean of precision and recall. It combines both metrics into a single score that balances their trade-offs, which is especially useful when you need a single metric to evaluate performance.
#Which metric to use depends on the problem:
#  - Use accuracy when there is no class imbalance and you care about the overall correctness of the model.
#  - Use precision when false positives are more costly than false negatives and you want to minimize false alarms, e.g. you want to minimize non-spam emails being labeled as spam.
#  - Use recall when false negatives are more costly than false positives and you want to capture as many positives as possible, e.g. you prefer to catch all possible cases of a disease, even if some non-diseased people are flagged.
#  - Use F1 when there is an imbalance between classes, you need a balance between precision and recall, and you are concerned with both false positives and false negatives, e.g. you want to show the most relevant results (high precision) but also not miss relevant ones (high recall).
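As a small, hedged illustration of these definitions, the same metrics can be computed with sklearn helpers or by hand from the confusion matrix (y_test and y_pred are assumed to be the binary arrays from the cell above):

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("precision:", precision_score(y_test, y_pred), tp / (tp + fp))  # fraction of predicted positives that are correct
print("recall   :", recall_score(y_test, y_pred), tp / (tp + fn))     # fraction of actual positives that are found
print("f1       :", f1_score(y_test, y_pred))                         # harmonic mean of precision and recall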
[81]: #ii. Logistic Regression for Classification: in its basic form it is applicable to a binary target variable.
print(y_prob)
[82]: #By default the probability threshold is >= 0.5 for class 1 (positive). If we vary this threshold, the ROC curve shows how it impacts the true positive rate versus the false positive rate.
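A hedged sketch of how this threshold sweep is usually visualized with roc_curve (y_prob is assumed to be the positive-class probabilities printed above):

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # one (FPR, TPR) point per threshold
plt.plot([0, 1], [0, 1], 'k--')                   # random-guessing diagonal
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()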
[83]: # Import roc_auc_score
'''The AUC value ranges from 0 to 1:
1.0: Perfect model (ideal performance with no errors).
0.5: Random guessing (the model is no better than chance).
0.0: Worst-case scenario (the model is predicting everything incorrectly).'''
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
# Calculate roc_auc_score
print(roc_auc_score(y_test, y_prob))
# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))
# Calculate the classification report
print(classification_report(y_test, y_pred))
#The ROC curve is mostly used to assess binary classification models.
0.8945414435900391
[[5545 1248]
[1103 4979]]
              precision    recall  f1-score   support
           1       0.80      0.82      0.81      6082
#i. Ridge (L2) regularization: minimizes the loss function while adding a penalty proportional to the sum of squared coefficients. Hence it shrinks the coefficients towards zero, but doesn't eliminate them; this helps reduce overfitting.
#ii. Lasso (L1) regularization: minimizes the loss function while adding a penalty proportional to the sum of the absolute values of the coefficients. In LogisticRegression the solver must support the chosen penalty ('liblinear' for L1; 'lbfgs' and others for L2), and C is the inverse of the regularization strength.
print(y_prob)
print(classification_report(y_test, y_pred))
#Note that we select an optimized value of C by hyperparameter tuning for best results.
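A minimal sketch of regularized logistic regression, assuming the same X_train/X_test split as above; the C values are illustrative only:

from sklearn.linear_model import LogisticRegression
# L2 (ridge-like) penalty; the default 'lbfgs' solver supports it. C is the inverse of the regularization strength.
logreg_l2 = LogisticRegression(penalty='l2', C=0.1).fit(X_train, y_train)
# L1 (lasso-like) penalty needs a solver that supports it, e.g. 'liblinear'.
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_train, y_train)
print(logreg_l2.score(X_test, y_test), logreg_l1.score(X_test, y_test))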
              precision    recall  f1-score   support
           0       0.83      0.82      0.83      6793
           1       0.80      0.82      0.81      6082
[85]: #iii. Multi-Class Logistic Regression: the idea of logistic regression can be extended to multi-class prediction as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr_ovr = LogisticRegression(multi_class="ovr")
lr_ovr.fit(X_train, y_train)
print("One-vs-Rest Logistic Regression")
print("Training Accuracy:", lr_ovr.score(X_train, y_train))
print("Test Accuracy :", lr_ovr.score(X_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, lr_ovr.
↪predict(X_test)))
lr_mn = LogisticRegression(multi_class="multinomial")
lr_mn.fit(X_train, y_train)
print("Multinomial Logistic Regression")
print("Training Accuracy:", lr_mn.score(X_train, y_train))
print("Test Accuracy :", lr_mn.score(X_test, y_test))
print("\nClassification Report:\n", classification_report(y_test, lr_mn.
↪predict(X_test)))
9
Classification Report:
precision recall f1-score support
accuracy 0.97 30
macro avg 0.97 0.96 0.97 30
weighted avg 0.97 0.97 0.97 30
Classification Report:
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
[86]: #iv. SVC (Support Vector Classifier): SVCs aim to find the optimal boundary (hyperplane) that best separates data points of different classes with the maximum possible margin, so that future data points are classified with higher confidence. Unlike logistic regression, they can produce decision boundaries other than linear ones (via kernels), so they also apply to non-linearly separable data. The boundary is determined for each class using the boundary data points (the support vectors), ensuring maximum class margins (margin: the distance between the hyperplane and the closest data points from each class). SVM aims to maximize this margin.
'''Linear Kernel (kernel='linear'): Suitable for linearly separable data. Faster to compute.
Sigmoid Kernel (kernel='sigmoid'): Less commonly used; can behave like a neural network.'''
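# The instantiation of `svm` was cut off in the export; a typical setup (kernel and C here are assumptions):
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0)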
svm.fit(X_train, y_train)
train_accuracy = svm.score(X_train, y_train)
test_accuracy = svm.score(X_test, y_test)
print(f"SVC Training Accuracy: {train_accuracy:.4f}")
print(f"SVC Test Accuracy : {test_accuracy:.4f}")
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
[92]: #2. Regression:
import pandas as pd
df = pd.read_csv("diabetes_clean.csv") #https://www.kaggle.com/datasets/
↪saurabh00007/diabetescsv
df.head()
diabetes
0 1
1 0
2 1
3 0
4 1
from sklearn.linear_model import LinearRegression
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
#Underneath, this performs OLS (ordinary least squares) regression, i.e. it minimizes the sum of squared residuals (the loss/cost/error function)
[94]: 0.28280468810375137
rmse
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\metrics\_regression.py:483:
FutureWarning: 'squared' is deprecated in version 1.4 and will be removed in 1.6. To calculate the root mean squared error, use the function 'root_mean_squared_error'.
  warnings.warn(
[95]: 26.34145958223226
#A single train/test split can give a biased performance estimate depending on which rows land in the test set; k-fold cross-validation handles this bias. For that purpose we split our data into k folds (k is up to us) and compute the metric k times, considering each fold as test data and the others as train data once.
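A hedged sketch of k-fold cross-validation for the regression above (reg_all and the split are assumed from the earlier cells):

from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=6, shuffle=True, random_state=5)
cv_scores = cross_val_score(reg_all, X_train, y_train, cv=kf)  # R^2 by default for regressors
print(cv_scores.mean(), cv_scores.std())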
[98]: #i. Ridge Regression (L2 Regularization): minimizes the sum of squared errors (SSE) while adding a penalty proportional to the sum of squared coefficients. Hence it shrinks the coefficients towards zero, but doesn't eliminate them.
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
# Create a Ridge regression model
ridge = Ridge(alpha=alpha)
# Fit the data
ridge.fit(X_train, y_train)
# Obtain R-squared
score = ridge.score(X_test, y_test)
ridge_scores.append(score)
print(ridge_scores) #hence, here even large values of alpha (over-penalizing the loss function) don't cause under-fitting! For practical modeling we don't need to simulate many values of alpha by hand; pick one (or tune it) and fit the model.
[99]: #ii. Lasso Regression (L1 Regularization): minimizes the sum of squared errors (SSE) while adding a penalty proportional to the sum of the absolute values of the coefficients. Lasso can shrink some coefficients exactly to zero, effectively performing feature selection.
# Import Lasso
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt
# Instantiate a lasso regression model
lasso = Lasso()
# Fit the model to the data
lasso.fit(X, y)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
df_features=df.drop('glucose',axis=1)
plt.bar(df_features.columns,lasso_coef)
plt.xticks(rotation=45)
plt.show()
[100]: #iii. Elastic Net (combination of L1 and L2 regularization): combines the Ridge (L2) and Lasso (L1) penalties, shrinking coefficients while still allowing feature selection.
'''Ridge: When you want to keep all features but shrink their coefficients to control overfitting.
Lasso: When you want to perform feature selection, especially when many features are irrelevant.
Elastic Net: When you need a balance between Ridge and Lasso, especially when you have correlated features.'''
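A minimal ElasticNet sketch, assuming the same train/test split used in the Ridge cell above; alpha and l1_ratio are illustrative:

from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)  # l1_ratio=0.5 gives an equal mix of the L1 and L2 penalties
enet.fit(X_train, y_train)
print(enet.score(X_test, y_test))
print(enet.coef_)  # some coefficients may still be shrunk exactly to zero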
print('.')
[102]: #5. Hyperparameter Tuning: searching for the best/optimized values of the parameters (k, alpha and other arguments of the model constructor) for our model.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV
# Set up the parameter grid i.e. which parameters to search
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}
kf = KFold(n_splits=6, shuffle=True, random_state=5) #for cross-validation to avoid over-fitting
# Instantiate lasso_cv
lasso = Lasso() #model instantiate
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)
# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))
# Instantiate lasso_cv
lasso = Lasso() #model instantiate
lasso_cv = RandomizedSearchCV(lasso, param_grid, cv=kf, n_iter=3) #n_iter is optional and specifies how many random parameter values to test
[104]: #Remember that we must apply data science, i.e. cleaning, analysis and visualisation to understand relations, plus preprocessing, before applying ML models.
[105]:
#6. Pipelining the ML project: handling categorical data, missing data, data preprocessing and more in one go. Categorical data must be converted to numbers before applying ML models (label/binary encoding, the pandas get_dummies method, etc.). For handling missing data we also know various data-analysis techniques, but sklearn has its own method of filling missing data using the sklearn.impute module. Here we also use ML pipelining, where we chain the preprocessing and modeling steps together.
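A hedged sketch of the two preprocessing ideas just mentioned (the toy columns are assumptions, not taken from music_genre.csv):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
toy = pd.DataFrame({"tempo": [120.0, np.nan, 98.0], "genre": ["Rock", "Jazz", "Rock"]})
toy = pd.get_dummies(toy, columns=["genre"], drop_first=True)  # categorical -> numeric dummy columns
imputer = SimpleImputer(strategy="mean")                       # fill missing values with the column mean
toy[["tempo"]] = imputer.fit_transform(toy[["tempo"]])
print(toy)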
import pandas as pd
df=pd.read_csv('music_genre.csv') #
df.head()
valence music_genre
0 0.759 Electronic
1 0.531 Electronic
2 0.333 Electronic
3 0.270 Electronic
4 0.323 Electronic
[106]: df.drop(['instance_id','artist_name','track_name','obtained_date','key'],axis=1,inplace=True)
df.head()
[106]: popularity acousticness danceability duration_ms energy \
0 27.0 0.00468 0.652 -1.0 0.941
1 31.0 0.01270 0.622 218293.0 0.890
2 28.0 0.00306 0.620 215613.0 0.755
3 34.0 0.02540 0.774 166875.0 0.700
4 32.0 0.00465 0.638 222369.0 0.587
df.head()
0 True False False
1 True False False
2 True False False
3 True False False
4 True False False
[5 rows x 23 columns]
df['tempo']=df['tempo'].astype('float')
df.dtypes
[109]: popularity float64
acousticness float64
danceability float64
duration_ms float64
energy float64
instrumentalness float64
liveness float64
loudness float64
speechiness float64
tempo float64
valence float64
music_genre_Alternative bool
music_genre_Anime bool
music_genre_Blues bool
music_genre_Classical bool
music_genre_Country bool
music_genre_Electronic bool
music_genre_Hip-Hop bool
music_genre_Jazz bool
music_genre_Rap bool
music_genre_Rock bool
mode_Major bool
mode_Minor bool
dtype: object
[110]: # The modeling below includes pipelining, cross-validation, hyper-parameter tuning and model assessment all in one
# Imports and pipeline components (SimpleImputer/StandardScaler/Lasso are assumed here; the original instantiation cell was truncated)
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
imputer = SimpleImputer()
SS = StandardScaler()
lasso = Lasso()
steps = [("imputer", imputer), ("scaler", SS), ("lasso", lasso)]
# Create the pipeline
pipeline = Pipeline(steps)
# Cross-validation setup
kf = KFold(n_splits=6, shuffle=True, random_state=5)
# Hyperparameter tuning setup
param = {"lasso__alpha": np.linspace(0.0001, 1, 20)} # Alpha range for Lasso
# Use RandomizedSearchCV with the pipeline
cv = RandomizedSearchCV(pipeline, param, cv=kf, n_iter=3, random_state=42)
# Fit the RandomizedSearchCV pipeline to the training data
cv.fit(X_train, y_train)
# Make predictions on the test set
y_pred = cv.predict(X_test)
# Evaluate the model using appropriate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print evaluation metrics
print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")
# Get the best hyperparameters found by RandomizedSearchCV
print(f"Best Parameters: {cv.best_params_}")
y=df['Churn'].values #target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
results = []
# Loop through the models' values
for model in models.values():
    # Instantiate a KFold object
    kf = KFold(n_splits=6, random_state=12, shuffle=True)
    # Perform cross-validation
    cv_results = cross_val_score(model, X_train, y_train, cv=kf)
    results.append(cv_results)
print(results)
plt.boxplot(results, labels=models.keys())
plt.show()
[112]: #Nota Bene: while applying any supervised ML model, to check whether the model suits the data, assess it on the training data (score(X_train, y_train)) and on the test data (score(X_test, y_test)). Comparing the two will help you choose the best model for your data.
#In supervised learning, overfitting refers to doing better on the training set than on the test set.
train_points.shape #train_points is a 2D array with two columns (features) and 300 rows (records)
[114]: (300, 2)
new_points = np.array([...])  # 2D array of (x, y) points to be assigned to clusters; the full literal was wrapped and truncated in the export
labels = model.predict(new_points)
# Print cluster labels of new_points
print(labels)
[1 1 0 2 2 2 1 0 0 0 2 0 1 1 0 2 2 1 2 2 1 2 0 2 1 1 1 1 0 1 1 0 0 2]
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(
[116]: #Visualizing Clusters: to decide the value of K for clustering we analyse the data by visualization and then decide how many clusters are plausible.
# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.8)
# Assign the cluster centers: centroids
centroids = model.cluster_centers_ #The mean of a cluster is called its centroid.
#A good clustering model has a small number of tight clusters, so we can use an inertia (elbow) plot to choose k:
ks = range(1, 6)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(train_points)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
#The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
  warnings.warn(
[118]: #ii. Crosstab for Cluster analysis:
# Create a KMeans model with 3 clusters: model
import pandas as pd
model = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(train_points) #fits the model and then predicts cluster labels
labels
C:\Users\14274\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1446:
UserWarning: KMeans is known to have a memory leak on Windows with MKL, when
there are less chunks than available threads. You can avoid it by setting the
environment variable OMP_NUM_THREADS=2.
warnings.warn(
[118]: array([1, 2, 0, 0, 2, 2, 0, 1, 2, 2, 0, 1, 2, 0, 2, 1, 0, 0, 1, 0, 2, 1,
2, 1, 1, 2, 1, 1, 1, 2, 0, 0, 0, 2, 1, 2, 1, 1, 2, 1, 1, 0, 2, 2,
2, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 2, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2,
0, 2, 2, 1, 0, 2, 0, 1, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2, 1,
1, 0, 2, 0, 2, 1, 1, 1, 0, 2, 2, 0, 2, 1, 2, 0, 1, 0, 0, 0, 2, 2,
1, 2, 0, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2, 2, 0, 0,
2, 1, 2, 1, 0, 2, 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 1, 0, 0, 2,
0, 2, 1, 1, 0, 2, 0, 0, 0, 2, 1, 1, 2, 0, 0, 1, 1, 0, 1, 1, 2, 1,
0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 2, 0, 1, 1, 1, 1, 2, 0, 1, 2, 2, 2,
1, 2, 1, 1, 2, 0, 0, 1, 0, 1, 1, 2, 2, 1, 0, 2, 0, 1, 0, 2, 1, 2,
2, 2, 2, 0, 0, 0, 1, 1, 2, 1, 0, 2, 1, 1, 2, 1, 0, 0, 0, 0, 0, 2,
1, 1, 0, 0, 1, 2, 0, 2, 2, 1, 1, 2, 2, 2, 1, 0, 1, 2, 1, 0, 0, 0,
0, 0, 1, 1, 2, 1, 1, 2, 0, 0, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2, 2, 1,
0, 2, 2, 0, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2])
[119]: varieties=[]
for i in labels:
    if i==0:
        varieties.append('Cluster_1')
    elif i==1:
        varieties.append('Cluster_2')
    else:
        varieties.append('Cluster_3')
print(varieties)
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_2',
'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_1',
'Cluster_1', 'Cluster_2', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_1', 'Cluster_2',
'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3',
'Cluster_3', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1', 'Cluster_1',
'Cluster_1', 'Cluster_3', 'Cluster_2', 'Cluster_2', 'Cluster_1', 'Cluster_1',
'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_1',
'Cluster_2', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_1', 'Cluster_1',
'Cluster_1', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_2',
'Cluster_1', 'Cluster_1', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_2',
'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_2', 'Cluster_1', 'Cluster_3',
'Cluster_3', 'Cluster_1', 'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3',
'Cluster_2', 'Cluster_2', 'Cluster_2', 'Cluster_3', 'Cluster_3', 'Cluster_3']
#Depending on the data you are working with, the clustering may not always be this good. Is there anything you can do in such situations to improve your clustering? You'll find out next!
[121]: #When data has high variability it can impact clustering negatively, so we should standardize (same 0-1 scale for every feature) or normalize (scale each column based on its own data) our data before applying clustering algorithms.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler,kmeans)
pipeline.fit(train_points)
labels = pipeline.predict(train_points)
df = pd.DataFrame({'labels':labels,'varieties':varieties})
ct = pd.crosstab(df['labels'],df['varieties'])
# Display ct
print(ct)
[122]: #2. Hierarchical Clustering for Visualization of Data: creates a hierarchy of any sort of data for better visualization to non-technical audiences.
print(train_points[0:50].shape)
(50, 2)
#If you draw a horizontal line at any point on the y-axis (distance) of the dendrogram, the number of vertical lines it intersects is the number of clusters at that distance.
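A hedged sketch of how `mergings` and the dendrogram used below are typically produced with scipy (the linkage method is an assumption):

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
mergings = linkage(train_points[0:50], method='complete')  # hierarchy of merges for the first 50 points
dendrogram(mergings, labels=list(range(50)), leaf_rotation=90, leaf_font_size=6)
plt.show()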
[124]: #We can print out the data of these clusters as below:
import pandas as pd
from scipy.cluster.hierarchy import fcluster
# Use fcluster to extract labels: labels
labels = fcluster(mergings, 2, criterion='distance') #2 is the height limit on the y-axis (distance)
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
# Display ct
ct
#Note that there is no prediction step in hierarchical clustering; it is just for visualization
[124]: varieties C0 C1 C10 C11 C12 C13 C14 C15 C16 C17 … C45 C46 C47 \
labels …
1 0 1 0 0 1 0 1 0 0 0 … 0 0 0
2 0 0 1 0 0 1 0 0 1 1 … 0 0 1
3 1 0 0 1 0 0 0 1 0 0 … 1 1 0
labels
1 0 0 1 0 0 1 1
2 0 1 0 1 0 0 0
3 1 0 0 0 1 0 0
[3 rows x 50 columns]
[125]: #3. t-SNE for 2D visualization of higher-dimensional data: HC is good for small data, but for big data we use t-SNE.
import pandas as pd
df=pd.read_csv('ANSUR II FEMALE Public.csv') #https://www.kaggle.com/datasets/seshadrikolluri/ansur-ii
numeric_df = df.select_dtypes(include=['number'])
print(numeric_df.shape)
from sklearn.manifold import TSNE
m = TSNE(learning_rate=50)
tsne_features = m.fit_transform(numeric_df)
print(tsne_features) #Reduced to 2D
#Assigning t-SNE features to our dataset
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
import seaborn as sns
sns.scatterplot(x="x", y="y", data=df)
plt.show()
(1986, 99)
[[-45.502087 23.22181 ]
[-45.30828 24.381142]
[-44.91254 24.29287 ]
…
[ 36.55094 -24.02615 ]
[ 40.648514 -21.196135]
[ 39.447086 -25.889902]]
[126]: #We can further customize our plot based on some categories the data belong to:
cat_df = df.select_dtypes(include=['object'])
cat_df.head()
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small', markerscale=0.7)
plt.show()
[129]: #Segregation based on the WritingPreference feature:
sns.scatterplot(x="x", y="y",hue='WritingPreference', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small', markerscale=0.7)
plt.show()
#You can see the power of TSNE !!!!
[130]: #4. PCA for Dimensionality Reduction: PCA reduces the dimensions of a data set of features to the intrinsic dimension (only the essential features)
samples = np.array([[ 242. , 23.2, 25.4, 30. , 38.4, 13.4], ...])  # 85 rows x 6 numeric measurement columns; only the first row is shown, the remainder of the literal was wrapped across pages in the export
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler,pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
features = range(pca.n_components_) #total no. of components
plt.bar(features,pca.explained_variance_) #explains variance of each feature
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
#So we can see that only 2 PCA features carry most of the variance, hence the intrinsic dimension is 2, so we can use n_components=2 in PCA to drop the other components:
[131]: scaler = StandardScaler()
pca = PCA(n_components=2)
pipeline = make_pipeline(scaler,pca)
pca_features=pipeline.fit_transform(samples)
print(pca_features.shape)
pca_features
(85, 2)
[-1.66919652, -0.48749439],
[-1.53019235, -1.26864097],
[-1.64591521, -0.80754599],
[-1.49861879, -0.69611918],
[-1.46918052, -0.36116684],
[-1.30868872, -0.62387538],
[-1.2912297 , -0.67610179],
[-1.19653317, -0.87126453],
[-1.26874943, -0.15394748],
[-1.20676216, -0.75223637],
[-1.00605794, -0.49490585],
[-1.09293025, -0.21265068],
[-0.97227047, -0.93109612],
[-0.87563108, -0.27891495],
[-0.54366744, -1.07524439],
[-0.45311316, -0.95766887],
[ 0.24132972, -0.70313608],
[-3.2247437 , 1.46444109],
[-3.11578856, 1.31921191],
[-3.13606884, 1.49511086],
[-3.01196518, 0.73797522],
[-3.01859702, 1.24741427],
[-3.01953587, 1.35709736],
[-2.94938148, 1.422223 ],
[-2.9682391 , 1.72456031],
[-2.95851531, 1.80816635],
[-2.91341611, 1.31381391],
[-2.88193847, 1.49337533],
[-2.85443347, 1.7047016 ],
[-2.56545202, 0.05902699],
[-2.52301149, 0.79959474],
[-0.21609392, 1.94129785],
[ 0.18753056, 1.60003896],
[ 0.32007038, 1.50797703],
[ 0.52734202, 1.92249333],
[ 0.83194827, 1.37991283],
[ 0.72243211, 2.09400048],
[ 1.37907983, 2.21751453],
[ 1.44168633, 2.18030073],
[ 1.57043452, 1.58066239],
[ 1.71845528, 2.12978513],
[ 1.93944502, 2.11600295],
[ 2.44514154, 2.04389519],
[ 3.1608639 , 1.79776573],
[ 4.09193928, 1.58736259],
[ 4.9648268 , 2.55461606],
[ 4.90112817, 2.55764882],
[ 5.49512681, 2.09367309]])
#tf-idf turns a collection of documents into a word-frequency array whose rows are documents and whose columns represent the weight of each word in the document. Most of the time we get a sparse array (which contains mostly 0s), so dimension reduction (TruncatedSVD or NMF below) is typically applied before further application.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
import matplotlib.pyplot as plt
import seaborn as sns
# Sample documents
documents = ['I say good', 'I say bad', 'nothing'] # 3 Customer remarks
# Step 1: TF-IDF Transformation
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)
cols_name = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(csr_mat.toarray(), index=['Customer_1', 'Customer_2', 'Customer_3'], columns=cols_name)
print("TF-IDF DataFrame:")
print(df_tfidf)
TF-IDF DataFrame:
bad good nothing say
Customer_1 0.000000 0.795961 0.0 0.605349
Customer_2 0.795961 0.000000 0.0 0.605349
Customer_3 0.000000 0.000000 1.0 0.000000
#TruncatedSVD performs dimension reduction like PCA but works directly on sparse matrices; it accepts any type of array and is useful for arrays with both positive and negative values.
print(df_svd_components) #This shows the weight of each word in the new components
#NMF (Non-negative Matrix Factorization) requires the input array to be non-negative, and is useful for image decomposition, word documents (which never have negative values), recommender systems etc.; it is more interpretable than SVD.
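A hedged sketch of how nmf_features and the components used in the next cells are typically obtained (n_components=3 as mentioned below):

from sklearn.decomposition import NMF
nmf = NMF(n_components=3)
nmf_features = nmf.fit_transform(csr_mat)  # document-by-component weights
df_nmf_components = pd.DataFrame(nmf.components_, columns=cols_name)  # component-by-word weights
print(df_nmf_components.round(3))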
nothing 1.301
Name: NMF_Component_2, dtype: float64
[136]: #We can reconstruct our original sparse matrix by using the NMF features and NMF components (3 as specified above):
import numpy as np
from scipy.sparse import csr_matrix
reconstructed_mat = np.dot(nmf_features, nmf.components_)
# If you prefer to have the reconstructed matrix in sparse format
reconstructed_sparse = csr_matrix(reconstructed_mat)
# Display the reconstructed matrix
print(reconstructed_mat.round(3)) #it is almost same as our original matrix
pd.DataFrame(reconstructed_mat.round(3), index=['Customer_1','Customer_2','Customer_3'], columns=cols_name)
[137]: #Statistical ML
[138]: #1. Decision Trees: they don't need feature scaling because they are based on if-else rules. They can be used for both regression and classification tasks.
data.head()
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
[5 rows x 33 columns]
[139]: data['diagnosis'].unique()
[140]: data['diagnosis']=data['diagnosis'].replace('M',1).replace('B',0)
data.head()
C:\Users\14274\AppData\Local\Temp\ipykernel_9596\1286790736.py:1: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future
version. To retain the old behavior, explicitly call
`result.infer_objects(copy=False)`. To opt-in to the future behavior, set
`pd.set_option('future.no_silent_downcasting', True)`
data['diagnosis']=data['diagnosis'].replace('M',1).replace('B',0)
smoothness_mean compactness_mean concavity_mean concave points_mean \
0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
[5 rows x 33 columns]
[141]: y=data['diagnosis']
X=data[['concave points_mean','radius_mean']]
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=7, random_state=1)
dt.fit(X_train,y_train)
print(dt.score(X_test,y_test))
0.8859649122807017
#Variance: measures how much a model's predictions fluctuate across different training datasets. A high-variance model causes overfitting. We detect high variance when the CV error of f is much greater than the training set error of f; f is then said to overfit the training set. To remedy overfitting, decrease model complexity, i.e. decrease max depth, increase min samples per leaf, gather more data, etc.
#Bias: the difference between the original values and the predicted values. A high-bias model causes underfitting. If f suffers from high bias then the CV error of f ≈ the training set error of f and both are high; f is said to underfit the training set. To remedy underfitting, increase model complexity, i.e. increase max depth, decrease min samples per leaf, etc.
#Model Complexity: the flexibility of the model is called its complexity and is controlled by hyperparameters (maximum tree depth, minimum samples per leaf, etc.). Generally, highly complex models cause high variance and hence overfitting, while very simple models cause high bias, which results in underfitting.
#Generalization error: the overall error of the model; the test set error of the model is called the generalization error.
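A hedged sketch of the diagnostic described above: compare the cross-validation error with the training error (dt and the split are assumed from the cells above):

from sklearn.model_selection import cross_val_score
cv_acc = cross_val_score(dt, X_train, y_train, cv=10).mean()  # 10-fold CV accuracy
train_acc = dt.score(X_train, y_train)                        # training accuracy
print(f"CV accuracy: {cv_acc:.3f}, training accuracy: {train_acc:.3f}")
# training accuracy much higher than CV accuracy -> high variance (overfitting); both low -> high bias (underfitting)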
[144]: #3. Ensemble Modeling: apply multiple models to the data and combine their predictions (e.g. by voting) or select the best one. Ensembles can be used for both regression and classification tasks.
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers) #takes the outputs of the models defined in the list classifiers and assigns labels by majority voting.
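A hedged, self-contained version of the same idea; the individual classifiers below are assumptions, since the notebook's own `classifiers` list was defined in a cell that is not shown:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
classifiers = [('Logistic Regression', LogisticRegression()),
               ('KNN', KNeighborsClassifier()),
               ('Decision Tree', DecisionTreeClassifier(random_state=1))]
vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
print('VotingClassifier test accuracy: {:.3f}'.format(vc.score(X_test, y_test)))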
[145]: #4. Bagging: bagging is like the basic ensemble algorithm, except that, instead of fitting the various models to the same data, a single base model is fitted to many bootstrap samples (random samples drawn with replacement) of the training data; aggregating their predictions controls overfitting.
bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test)) #Test accuracy is better than simple ensemble learning
[146]: #OOB (out of bag) Evaluation: since bagging uses bootstrapping, part of the data is missed while training each base model (on average about 37% is left out due to bootstrapping). We can use this 37% of the data for evaluation instead of a separate validation split.
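A plausible instantiation of `bc` for the OOB evaluation below (the base estimator and n_estimators are assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bc = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=1),
                       n_estimators=50, oob_score=True, random_state=1)  # oob_score=True stores the OOB accuracy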
bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
# Evaluate OOB accuracy
acc_oob = bc.oob_score_
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob)) #The accuracies are close, which means low variance / little overfitting
[147]: #i. Random Forest: a bagging ensemble method with a decision tree as its base model, with one important extension: in addition to sampling the records, the algorithm also samples the variables, i.e. each split only considers a random subset of the features.
import pandas as pd
df=pd.read_csv('data_cancer.csv')
df.head()
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
[5 rows x 33 columns]
[148]: y=df['area_mean']
X=df[['texture_mean','perimeter_mean','radius_mean','smoothness_mean','compactness_mean']]
# Fit the random forest regressor rf to the training data
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Evaluate the test set RMSE
from sklearn.metrics import mean_squared_error as MSE
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
importances = pd.Series(data=rf.feature_importances_, index=X_train.columns) #How important each feature was in the prediction; it varies from fit to fit because each tree is trained on a specific subset of records and features.
importances
[151]: #5. Boosting: a boosting algorithm trains models (it can be the same base model again and again) sequentially, where each successive model tries to make better predictions than the previous one, hence reducing bias. The final decision is a weighted vote (for classification) or weighted average (for regression), with more accurate models having more influence.
#i. AdaBoost: in this type of boosting each successive model pays more attention to the instances wrongly predicted by its predecessor, although each model is trained on the whole data set. Alpha, computed from each model's error, defines the weight of that model in the final decision, in combination with the learning rate.
from sklearn.ensemble import AdaBoostClassifier
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
ada = AdaBoostClassifier(dt, n_estimators=180, random_state=1) #n_estimators defines the number of models to use; the learning rate (eta) combined with alpha determines each model's influence
ada.fit(X_train, y_train)
# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]
from sklearn.metrics import roc_auc_score
# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',ada_roc_auc)
C:\Users\14274\anaconda3\Lib\site-
packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R
algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME
algorithm to circumvent this warning.
warnings.warn(
ROC AUC score: 0.9704034391534391
[152]: #ii. Gradient Boosting (GB): models are again trained sequentially, but instead of re-weighting instances, each successive model is trained on the residual errors of its predecessor, so the errors of one model become the training target of the next. The final prediction combines the contributions of all models along the way.
[153]: #Stochastic Gradient Boosting (SGB): gradient boosting with data and feature subsampling (without replacement) at each stage (each new tree). The added randomness reduces variance at the cost of a small increase in bias.
# Import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
sgbr = GradientBoostingClassifier(max_depth=4, subsample=0.9, max_features=0.75, n_estimators=200, random_state=2) #Sample 90% of the rows and 75% of the features when training each tree
[154]: #Note: use hyperparameter tuning to optimize your ML models for better performance.
[155]: #Extreme Gradient Boosting (XGB): XGBoost is a highly optimized boosting (supervised ML) algorithm. It incorporates L1 and L2 regularization, and it handles missing data by automatically learning the optimal split direction for missing values.
'''The booster parameter defines the base learner (gbtree, gblinear, dart etc.).
The objective parameter defines the learning task and loss function that XGBoost will use to optimize during training. It tells the algorithm what kind of problem it's solving and how to measure the error. "binary:logistic" specifies that the model outputs probabilities for binary classification.'''
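# The instantiation of xg_cl was cut off in the export; a typical setup (hyperparameters are assumptions):
import xgboost as xgb
xg_cl = xgb.XGBClassifier(booster='gbtree', objective='binary:logistic', n_estimators=10, random_state=123)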
xg_cl.fit(X_train,y_train)
y_pred_proba = xg_cl.predict_proba(X_test)[:,1]
xg_cl_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:',xg_cl_roc_auc)
[156]: #XGB CV: XGB has built-in CV method. We don't need to apply CV explicitly
import pandas as pd
churn_data = pd.read_csv("Churn.csv")
churn_data.head()
params = {'booster':'gbtree', 'objective':'binary:logistic', 'max_depth':4, 'reg_lambda':10}
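# The xgb.cv call that produces cv_results was cut off in the export; a typical invocation (data layout and arguments are assumptions):
import xgboost as xgb
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:, :-1], label=churn_data.iloc[:, -1])  # assumes the target is the last column
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=3, num_boost_round=10,
                    metrics="error", as_pandas=True, seed=123)
print("Accuracy: %f" % (1 - cv_results["test-error-mean"].iloc[-1]))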
cv_results
3 0.087654 0.000513 0.087660 0.001533
4 0.087489 0.000489 0.087737 0.001391
5 0.087375 0.000343 0.087706 0.001436
6 0.086614 0.000620 0.086914 0.001782
7 0.085205 0.000720 0.085640 0.001271
8 0.085806 0.000312 0.086184 0.001344
9 0.084113 0.000614 0.084397 0.002022
Accuracy: 0.915603
import xgboost as xgb
gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid, n_iter=25, scoring='neg_mean_squared_error', cv=4)  # scoring and cv are assumed; the original call was cut off
randomized_mse.fit(X, y)
print("Best parameters found: ",randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
↪np.arange(.1,1.05,.05) }
randomized_neg_mse = RandomizedSearchCV(estimator=xgb_pipeline, param_distributions=gbm_param_grid, scoring='neg_mean_squared_error', cv=4)  # remaining arguments are assumed; the original call was cut off
randomized_neg_mse.fit(X, y)
print("Best rmse: ", np.sqrt(np.abs(randomized_neg_mse.best_score_)))
print("Best model: ", randomized_neg_mse.best_estimator_)
[162]: '''XGB is a very powerful ML library and most widely used. What We Have Not Covered (And How You Can Proceed):
-Using XGBoost for ranking/recommendation problems (Netflix/Amazon problem)
-Using more sophisticated hyperparameter tuning strategies for tuning XGBoost models (Bayesian Optimization; there is an entire field for it)
-Using XGBoost as part of an ensemble of other models for regression/classification; XGB itself is an ensemble, but nothing stops us from ensembling it with other models, even with another XGB!'''
#XGB is ideal for almost any kind of tabular data, but it is not optimal for image processing, NLP, or CV tasks; use deep learning for those.
[162]: 'XGB is very powerful ML library and most widely used. What We Have Not Covered
(And How You CanProceed):\n-Using XGBoost for ranking/recommendation
problems(Netflix/Amazon problem)\n-Using more sophisticated hyperparameter
tuning strategiesfor tuning XGBoost models (Bayesian Optimization, there are
entire new field for it)\n-Using XGBoost as part of an ensemble of other models
for regression/classification, XGB itself is ensemble but nothing stops us to
ensemble it with other models even with XGB!'
[164]: #a. General Insight into the Data: we can peek into our data by visualization using a simple pair-plot for low-dimensional data; for high-dimensional data we use t-SNE instead.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df=pd.read_csv('ANSUR II FEMALE Public.csv') #https://www.kaggle.com/datasets/seshadrikolluri/ansur-ii
numeric_df = df.select_dtypes(include=['number'])
print(df.shape)
print(numeric_df.shape)
from sklearn.manifold import TSNE
m = TSNE(learning_rate=50)
tsne_features = m.fit_transform(numeric_df)
print(tsne_features) #Reduced to 2D
#Assigning t-SNE features to our dataset
df['x'] = tsne_features[:,0]
df['y'] = tsne_features[:,1]
sns.scatterplot(x="x", y="y",hue='Branch', data=df)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small', markerscale=0.7)
plt.show()
(1986, 108)
(1986, 99)
[[-46.565998 22.655666]
[-45.529686 23.14485 ]
[-45.72889 23.567385]
…
[ 35.08874 -23.190556]
[ 39.116215 -20.26782 ]
[ 37.955853 -25.027208]]
[165]: #b. Feature Selection: Selecting only Important Features for Modeling:
#i. Dropping Features with Low variance and High Missing Values:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.005) #Drop features with variance less than or equal to 0.005
sel.fit(numeric_df)  # fit the selector before reading the support mask
mask = sel.get_support()
reduced_df = numeric_df.loc[:, mask]
print(reduced_df.shape) #We dropped lot of features
(1986, 31)
[166]: print(df.shape)
mask = df.isna().sum() / len(df) < 0.3 #Keep only the columns with less than 30% missing values
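# The column-selection step appears to be missing from the export; a minimal completion (assumed):
reduced_df = df.loc[:, mask]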
(1986, 110)
(1986, 109)
[167]: reduced_df.head()
x y
0 -46.565998 22.655666
1 -45.529686 23.144850
2 -45.728889 23.567385
3 -46.882313 23.612629
4 -47.587242 23.539295
[168]: #ii. Dealing with Highly Correlated Features:
#Visualization:
corr = reduced_df.select_dtypes(include=['number']).corr()
corr #We can drop one of each pair of highly correlated features
(pairwise correlation matrix of the 99 numeric columns; the wide table output is omitted here. Examples of strongly correlated pairs: Weightlbs and weightkg ≈ 0.97, Heightin and acromialheight ≈ 0.90, acromialheight and acromionradialelength ≈ 0.81)
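A hedged sketch of the usual recipe for dropping one feature from each highly correlated pair (the 0.95 cutoff is an assumption):

import numpy as np
corr_abs = corr.abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))  # keep each pair only once
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
reduced_df = reduced_df.drop(columns=to_drop)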
4 308 214 1210
bicepscircumferenceflexed … waistfrontlengthsitting \
0 315 … 345
1 272 … 329
2 300 … 367
3 364 … 371
4 320 … 380
[5 rows x 99 columns]
[170]: X=df_1[['abdominalextensiondepthsitting','acromialheight','acromionradialelength','anklecircumference','axillaheight','balloffootcircumference','balloffootlength','biacromialbreadth','bicepscircumferenceflexed']]
y=df_1['DODRace'].astype('category')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression(multi_class="multinomial")
lr.fit(X_train_std, y_train)
X_test_std = scaler.transform(X_test)
y_pred = lr.predict(X_test_std)
print(accuracy_score(y_test, y_pred))
print(dict(zip(X.columns, abs(lr.coef_[0])))) #Gives the (absolute) coefficient of each feature used in modeling
0.6359060402684564
{'abdominalextensiondepthsitting': 0.21515381931083946, 'acromialheight':
0.8877099972958575, 'acromionradialelength': 0.10341865403147825,
'anklecircumference': 0.663169629845846, 'axillaheight': 0.17577417771122847,
'balloffootcircumference': 0.35320106477499696, 'balloffootlength':
0.44790839786798325, 'biacromialbreadth': 0.14615567832723877,
'bicepscircumferenceflexed': 0.0853947527952258}
0.6124161073825504
C:\Users\14274\AppData\Local\Temp\ipykernel_9596\2131005849.py:1:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
[172]: #Be cautious when you use lasso regularization in your model, because it already reduces the coefficients of non-important features to zero, so you can simply drop the zero-coefficient features.
#Note: drop only the single least important feature, not two or more, because when you drop one the importance of the others changes when you apply the same model again. If you want to drop more features, drop one, apply the model again, and repeat.
0.5637583892617449
[174]: print(rf.feature_importances_)
mask = rf.feature_importances_ >0.11
print(mask)
X_reduced = X.loc[:, mask]
print(X_reduced.columns)
from sklearn.linear_model import LassoCV
lcv = LassoCV()
lcv.fit(X_train, y_train)
lcv.score(X_test, y_test)
lcv_mask = lcv.coef_ != 0
sum(lcv_mask)
[176]: 5
[177]: from sklearn.feature_selection import RFE #As we discussed above, we first check the importance of the features and drop the least important one, apply the model again, then drop the (new) least important feature, and so on. Recursive Feature Elimination (RFE) automates this process.
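# The instantiation of rfe_rf was cut off in the export; a setup consistent with the step-5 output below (estimator and n_features_to_select are assumptions):
from sklearn.ensemble import RandomForestClassifier
rfe_rf = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=3, step=5, verbose=1)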
rfe_rf.fit(X_train, y_train)
rf_mask = rfe_rf.support_
Fitting estimator with 63 features.
Fitting estimator with 58 features.
Fitting estimator with 53 features.
Fitting estimator with 48 features.
Fitting estimator with 43 features.
Fitting estimator with 38 features.
Fitting estimator with 33 features.
Fitting estimator with 28 features.
Fitting estimator with 23 features.
Fitting estimator with 18 features.
Fitting estimator with 13 features.
Fitting estimator with 8 features.
rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_
[ ]: import numpy as np
votes = np.sum([lcv_mask, rf_mask], axis=0)
print(votes)
mask = votes >= 1 #We filtered out all features that didn't get vote of␣
↪retention from any of above models.
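# reduced_X (used in the next cell) appears to come from applying the vote mask; a minimal completion (assumed):
reduced_X = X.loc[:, mask]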
[ ]: reduced_X.head()
pipe = Pipeline([('scaler', StandardScaler()), ('reducer', PCA(n_components=0.9)), ('classifier', RandomForestClassifier())]) #We keep 90% of the variance of our data for modeling (i.e. only the principal components needed to explain 90% of the variance are selected)
pipe.fit(X_train, y_train)
print(pipe['reducer'].explained_variance_ratio_.sum()) #Total fraction of the variance explained by the kept components
print(pipe.score(X_test,y_test))