Advanced Regression with IPL Data

This document discusses loading and preprocessing an IPL dataset for regression analysis. It loads the dataset, encodes categorical features, standardizes features and the target variable, splits the data into train and test sets, builds linear regression and regularization models, and analyzes their performance. It also discusses dealing with imbalanced classification datasets by upsampling the minority class.

Uploaded by

sushanth


Advanced Machine Learning

6.4 Advanced Regression Models

6.4.1.1 Loading IPL Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )
ipl_auction_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO. 130 non-null int64
PLAYER NAME 130 non-null object
AGE 130 non-null int64
COUNTRY 130 non-null object
TEAM 130 non-null object
PLAYING ROLE 130 non-null object
T-RUNS 130 non-null int64
T-WKTS 130 non-null int64
ODI-RUNS-S 130 non-null int64
ODI-SR-B 130 non-null float64
ODI-WKTS 130 non-null int64
ODI-SR-BL 130 non-null float64
CAPTAINCY EXP 130 non-null int64
RUNS-S 130 non-null int64
HS 130 non-null int64
AVE 130 non-null float64
SR-B 130 non-null float64
SIXERS 130 non-null int64
RUNS-C 130 non-null int64
WKTS 130 non-null int64
AVE-BL 130 non-null float64
ECON 130 non-null float64
SR-BL 130 non-null float64
AUCTION YEAR 130 non-null int64
BASE PRICE 130 non-null int64
SOLD PRICE 130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB

X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE',
              'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B',
              'ODI-WKTS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S',
              'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS',
              'AVE-BL', 'ECON', 'SR-BL']

# categorical_features is initialized with the categorical variable names.
categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

# get_dummies() is invoked to return the dummy features.
ipl_auction_encoded_df = pd.get_dummies( ipl_auction_df[X_features],
                                         columns = categorical_features,
                                         drop_first = True )
ipl_auction_encoded_df.columns
Index(['T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL',
       'ECON', 'SR-BL', 'AGE_2', 'AGE_3', 'COUNTRY_BAN', 'COUNTRY_ENG',
       'COUNTRY_IND', 'COUNTRY_NZ', 'COUNTRY_PAK', 'COUNTRY_SA', 'COUNTRY_SL',
       'COUNTRY_WI', 'COUNTRY_ZIM', 'PLAYING ROLE_Batsman',
       'PLAYING ROLE_Bowler', 'PLAYING ROLE_W. Keeper', 'CAPTAINCY EXP_1'],
      dtype='object')

X = ipl_auction_encoded_df
Y = ipl_auction_df['SOLD PRICE']
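To make the effect of drop_first = True concrete, here is a minimal sketch on a made-up frame. The column names ROLE and RUNS and their values are hypothetical and are not taken from the IPL data. With k categories, get_dummies() emits k - 1 indicator columns, and the dropped first category acts as the baseline level.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels.
toy_df = pd.DataFrame({'ROLE': ['Allrounder', 'Batsman', 'Bowler'],
                       'RUNS': [10, 20, 30]})

# drop_first=True drops the first level ('Allrounder' here),
# leaving k - 1 = 2 dummy columns for the 3 levels.
toy_encoded_df = pd.get_dummies(toy_df, columns=['ROLE'], drop_first=True)
print(list(toy_encoded_df.columns))
```

This mirrors the encoded IPL columns above, where AGE yields only AGE_2 and AGE_3 and CAPTAINCY EXP yields only CAPTAINCY EXP_1.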

6.4.1.2 Standardize X & Y

from sklearn.preprocessing import StandardScaler

## Initializing the StandardScaler
X_scaler = StandardScaler()
## Standardize all the feature columns
X_scaled = X_scaler.fit_transform(X)

## Standardizing Y explicitly by subtracting the mean and
## dividing by the standard deviation
Y = (Y - Y.mean()) / Y.std()
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
/Users/manaranjan/anaconda/lib/python3.5/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype uint8, int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
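One detail in the step above may be worth a quick check: StandardScaler divides by the population standard deviation (ddof=0), while pandas Series.std() defaults to the sample standard deviation (ddof=1). The sketch below uses made-up values (y_toy is hypothetical) to show that the two conventions differ only by a constant factor, which is small for the 130 rows used here.

```python
import numpy as np
import pandas as pd

y_toy = pd.Series([1.0, 2.0, 3.0, 4.0])   # hypothetical values

# pandas Series.std() defaults to ddof=1 (sample standard deviation).
z_sample = (y_toy - y_toy.mean()) / y_toy.std()

# StandardScaler divides by the population standard deviation (ddof=0),
# reproduced here with NumPy.
z_population = (y_toy - y_toy.mean()) / np.std(y_toy.to_numpy(), ddof=0)

# The two results differ by the constant factor sqrt(n / (n - 1)).
ratio = z_population.iloc[0] / z_sample.iloc[0]
print(ratio)
```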

6.4.1.3 Split the dataset into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,
    Y,
    test_size=0.2,
    random_state = 42)

6.4.1.4 Build the model

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)

linreg.coef_

array([-0.43539611, -0.04632556,  0.50840867, -0.03323988,  0.2220377 ,
       -0.05065703,  0.17282657, -0.49173336,  0.58571405, -0.11654753,
        0.24880095,  0.09546057,  0.16428731,  0.26400753, -0.08253341,
       -0.28643889, -0.26842214, -0.21910913, -0.02622351,  0.24817898,
        0.18760332,  0.10776084,  0.04737488,  0.05191335,  0.01235245,
        0.00547115, -0.03124706,  0.08530192,  0.01790803, -0.05077454,
        0.18745577])

## The dataframe has two columns to store the feature names
## and the corresponding coefficient values
columns_coef_df = pd.DataFrame( { 'columns': ipl_auction_encoded_df.columns,
                                  'coef': linreg.coef_ } )
## Sorting the features by coefficient values in descending order
sorted_coef_vals = columns_coef_df.sort_values( 'coef', ascending=False)

6.4.1.5 Plotting the coefficient values

plt.figure( figsize = ( 8, 6 ))
## Creating a bar plot
sn.barplot(x="coef", y="columns",
data=sorted_coef_vals);
plt.xlabel("Coefficients from Linear Regression")
plt.ylabel("Features")

Text(0,0.5,'Features')

6.4.1.6 Calculate RMSE

import numpy as np
from sklearn import metrics

# Takes a model as a parameter
# Prints the RMSE on train and test set
def get_train_test_rmse( model ):
    # Predicting on training dataset
    y_train_pred = model.predict( X_train )
    # Compare the actual y with predicted y in the training dataset
    rmse_train = round(np.sqrt(metrics.mean_squared_error( y_train, y_train_pred )), 3)
    # Predicting on test dataset
    y_test_pred = model.predict( X_test )
    # Compare the actual y with predicted y in the test dataset
    rmse_test = round(np.sqrt(metrics.mean_squared_error( y_test, y_test_pred )), 3)
    print( "train: ", rmse_train, " test:", rmse_test )

get_train_test_rmse( linreg )
train: 0.679 test: 0.749
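For reference, the quantity printed by get_train_test_rmse() can be written out directly from its definition. The sketch below is illustrative only: the helper name rmse and the input values are hypothetical. Because Y was standardized earlier, the reported RMSE values are in units of standard deviations of SOLD PRICE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error computed directly from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values for illustration only.
example_rmse = rmse([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
print(round(example_rmse, 3))
```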

6.4.2 Applying Regularization

6.4.2.1 Ridge Regression

# Importing Ridge Regression
from sklearn.linear_model import Ridge

# Applying alpha = 1 and running the algorithm for a maximum of 500 iterations
ridge = Ridge(alpha = 1, max_iter = 500)
ridge.fit( X_train, y_train )

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=500, normalize=False,
      random_state=None, solver='auto', tol=0.001)

get_train_test_rmse( ridge )
train: 0.68 test: 0.724

ridge = Ridge(alpha = 2.0, max_iter = 1000)

ridge.fit( X_train, y_train )
get_train_test_rmse( ridge )
train: 0.682 test: 0.706

6.4.2.2 Lasso Regression

# Importing Lasso Regression
from sklearn.linear_model import Lasso

# Applying alpha = 0.01 and running the algorithm for a maximum of 500 iterations
lasso = Lasso(alpha = 0.01, max_iter = 500)
lasso.fit( X_train, y_train )
Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=500,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

get_train_test_rmse( lasso )
train: 0.688 test: 0.698

## Storing the feature names and coefficient values in the DataFrame
lasso_coef_df = pd.DataFrame( { 'columns': ipl_auction_encoded_df.columns,
                                'coef': lasso.coef_ } )

## Selecting the features whose coefficients were shrunk to exactly zero
lasso_coef_df[lasso_coef_df.coef == 0]

    coef  columns
1   -0.0  T-WKTS
3   -0.0  ODI-SR-B
13  -0.0  AVE-BL
28   0.0  PLAYING ROLE_Bowler

6.4.2.3 Elastic Net Regression

0.01/1.01
0.009900990099009901

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha = 1.01, l1_ratio = 0.0099, max_iter = 500)

enet.fit( X_train, y_train )
get_train_test_rmse( enet )
train: 0.794 test: 0.674
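The 0.01/1.01 computed at the start of this section is consistent with scikit-learn's documented ElasticNet parameterization: for a penalty of the form a * L1 + 0.5 * b * L2, alpha = a + b and l1_ratio = a / (a + b). The sketch below assumes the intent was a = 0.01 and b = 1.0, which is my reading and is not stated in the source. The helper name elastic_net_params is hypothetical.

```python
def elastic_net_params(l1_weight, l2_weight):
    """Convert separate L1 and L2 weights (a, b) into scikit-learn's
    ElasticNet parameters: alpha = a + b, l1_ratio = a / (a + b)."""
    alpha = l1_weight + l2_weight
    l1_ratio = l1_weight / alpha
    return alpha, l1_ratio

# Assumed weights: a = 0.01 for the L1 part, b = 1.0 for the L2 part.
alpha, l1_ratio = elastic_net_params(l1_weight=0.01, l2_weight=1.0)
print(alpha, round(l1_ratio, 4))
```

These values match the alpha = 1.01 and l1_ratio = 0.0099 passed to ElasticNet above.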

6.5 More Advanced Algorithms

bank_df = pd.read_csv( 'bank.csv')
bank_df.head(5)

   age  job          marital  education  default  balance  housing-loan  personal-loan  current-campaign
0  30   unemployed   married  primary    no       1787     no            no             1
1  33   services     married  secondary  no       4789     yes           yes            1
2  35   management   single   tertiary   no       1350     yes           no             1
3  30   management   married  tertiary   no       1476     yes           yes            4
4  59   blue-collar  married  secondary  no       0        yes           no             1

bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 11 columns):
age 4521 non-null int64
job 4521 non-null object
marital 4521 non-null object
education 4521 non-null object
default 4521 non-null object
balance 4521 non-null int64
housing-loan 4521 non-null object
personal-loan 4521 non-null object
current-campaign 4521 non-null int64
previous-campaign 4521 non-null int64
subscribed 4521 non-null object
dtypes: int64(4), object(7)
memory usage: 388.6+ KB

6.5.1 Dealing with imbalanced datasets

bank_df.subscribed.value_counts()
no 4000
yes 521
Name: subscribed, dtype: int64
## Importing resample from the sklearn.utils package.
from sklearn.utils import resample

# Separate the yes-subscribed and no-subscribed cases
bank_subscribed_no = bank_df[bank_df.subscribed == 'no']
bank_subscribed_yes = bank_df[bank_df.subscribed == 'yes']

## Upsample the yes-subscribed cases.
df_minority_upsampled = resample(bank_subscribed_yes,
                                 replace=True,     # sample with replacement
                                 n_samples=2000)

# Combine majority class with upsampled minority class
new_bank_df = pd.concat([bank_subscribed_no, df_minority_upsampled])

from sklearn.utils import shuffle

new_bank_df = shuffle(new_bank_df)
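The code above upsamples the whole dataset before it is split, so copies of the same minority row can end up in both the training and the test set, which may make test scores look better than they would on unseen data. The sketch below shows an alternative ordering (split first, then upsample only the training part). It is not the approach used in this notebook, and the toy frame and variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy frame: 20 'no' rows and 4 'yes' rows.
toy_df = pd.DataFrame({'balance': range(24),
                       'subscribed': ['no'] * 20 + ['yes'] * 4})

# 1. Split first, stratifying so both parts keep some minority rows.
train_df, test_df = train_test_split(toy_df, test_size=0.25,
                                     stratify=toy_df.subscribed,
                                     random_state=42)

# 2. Upsample the minority class inside the training part only.
train_no = train_df[train_df.subscribed == 'no']
train_yes = train_df[train_df.subscribed == 'yes']
train_yes_upsampled = resample(train_yes, replace=True,
                               n_samples=len(train_no), random_state=42)
balanced_train_df = pd.concat([train_no, train_yes_upsampled])

# No original row index appears in both the training and the test part.
overlap = set(balanced_train_df.index) & set(test_df.index)
print(len(overlap))
```

With this ordering the test set keeps the original class imbalance, so the resulting metrics would not be directly comparable with the ones reported in the rest of this section.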

# Assigning list of all column names in the DataFrame

X_features = list( new_bank_df.columns )
# Remove the response variable from the list
X_features.remove( 'subscribed' )
X_features
['age',
'job',
'marital',
'education',
'default',
'balance',
'housing-loan',
'personal-loan',
'current-campaign',
'previous-campaign']

## get_dummies() will convert all the columns with data type object
encoded_bank_df = pd.get_dummies( new_bank_df[X_features], drop_first = True )
X = encoded_bank_df

# Encoding the subscribed column and assigning to Y
Y = new_bank_df.subscribed.map( lambda x: int( x == 'yes') )

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split( X,
                                                     Y,
                                                     test_size = 0.3,
                                                     random_state = 42 )

6.5.2 Logistic Regression model

6.5.2.1 Building the model

from sklearn.linear_model import LogisticRegression

## Initializing the model

logit = LogisticRegression()
## Fitting the model with X and Y values of the dataset
logit.fit( train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

pred_y = logit.predict(test_X)

6.5.2.2 Confusion Matrix

## Importing the metrics
from sklearn import metrics

## Defining a function to draw the confusion matrix from actual and predicted class labels
def draw_cm( actual, predicted ):
    # Invoking confusion_matrix from the metrics package. The matrix is oriented as [1,0], i.e.
    # the class with label 1 is represented in the first row and 0 in the second row
    cm = metrics.confusion_matrix( actual, predicted, labels = [1,0] )
    ## The confusion matrix is plotted as a heatmap for better visualization
    ## The labels are configured for easier interpretation of the plot
    sn.heatmap(cm, annot=True, fmt='.2f',
               xticklabels = ["Subscribed", "Not Subscribed"],
               yticklabels = ["Subscribed", "Not Subscribed"] )
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

draw_cm( test_y, pred_y )
6.5.2.3 Classification Report

print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.73      0.92      0.81      1225
           1       0.60      0.27      0.37       575

   micro avg       0.71      0.71      0.71      1800
   macro avg       0.66      0.59      0.59      1800
weighted avg       0.69      0.71      0.67      1800
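The per-class figures in a classification report follow from confusion-matrix counts. As a minimal sketch, the helper below (precision_recall_f1 is a hypothetical name) computes them for one class. The counts are made up for illustration and are not the counts behind the report above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 20 false positives, 70 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=20, fn=70)
print(round(p, 2), round(r, 2), round(f1, 2))
```

A low recall for class 1, as in the logistic regression report, means that most actual subscribers are predicted as non-subscribers.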

6.5.2.4 ROC AUC Score

## Predicting the probability values for test cases
predict_proba_df = pd.DataFrame( logit.predict_proba( test_X ) )
predict_proba_df.head()

          0         1
0  0.704479  0.295521
1  0.853664  0.146336
2  0.666963  0.333037
3  0.588329  0.411671
4  0.707982  0.292018

## Initializing the DataFrame with actual class labels
test_results_df = pd.DataFrame( { 'actual': test_y } )
test_results_df = test_results_df.reset_index()
## Assigning the probability values for class label 1
test_results_df['chd_1'] = predict_proba_df.iloc[:,1:2]

test_results_df.head(5)

   index  actual     chd_1
0   1321       0  0.295521
1   3677       0  0.146336
2   1680       1  0.333037
3    821       0  0.411671
4    921       0  0.292018

# Passing actual class labels and the predicted probability values to compute the ROC AUC score.
auc_score = metrics.roc_auc_score( test_results_df.actual, test_results_df.chd_1 )
round( float( auc_score ), 2 )

0.69

## The method takes the following three parameters
## model: the classification model
## test_X: X features of the test set
## test_y: actual labels of the test set
## Returns
## - ROC AUC score
## - FPR and TPRs for different threshold values
def draw_roc_curve( model, test_X, test_y ):
    ## Creating and initializing a results DataFrame with actual labels
    test_results_df = pd.DataFrame( { 'actual': test_y } )
    test_results_df = test_results_df.reset_index()

    ## Predict the probabilities on the test set
    predict_proba_df = pd.DataFrame( model.predict_proba( test_X ) )

    ## Selecting the probabilities that the test example belongs to class 1
    test_results_df['chd_1'] = predict_proba_df.iloc[:,1:2]

    ## Invoke roc_curve() to return the fpr, tpr and threshold values.
    ## thresholds holds the probability cut-offs at which fpr and tpr are computed
    fpr, tpr, thresholds = metrics.roc_curve( test_results_df.actual,
                                              test_results_df.chd_1,
                                              drop_intermediate = False )

    ## Getting the ROC AUC score by invoking the metrics.roc_auc_score method
    auc_score = metrics.roc_auc_score( test_results_df.actual, test_results_df.chd_1 )

    ## Setting the size of the plot
    plt.figure(figsize=(8, 6))
    ## Plotting the actual fpr and tpr values
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    ## Plotting the diagonal line from (0,0) to (1,1)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    ## Setting labels and titles
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return auc_score, fpr, tpr, thresholds

## Invoking draw_roc_curve with the logistic regression model
_, _, _, _ = draw_roc_curve( logit, test_X, test_y )
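One way to read the AUC score returned above: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher predicted probability, with ties counted as half. The sketch below computes that fraction directly with NumPy. The function name pairwise_auc and the toy inputs are hypothetical, and this is an illustration of the interpretation rather than how roc_auc_score is implemented.

```python
import numpy as np

def pairwise_auc(actual, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    actual = np.asarray(actual)
    scores = np.asarray(scores, dtype=float)
    pos = scores[actual == 1]
    neg = scores[actual == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Hypothetical labels and predicted probabilities.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

On this reading, the 0.69 obtained for logistic regression means that a randomly chosen subscriber is scored above a randomly chosen non-subscriber about 69% of the time.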

6.5.3 KNN Algorithm

## Importing the KNN classifier algorithm

from sklearn.neighbors import KNeighborsClassifier

## Initializing the classifier

knn_clf = KNeighborsClassifier()
## Fitting the model with the training set
knn_clf.fit( train_X, train_y )
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

6.5.3.1 KNN Accuracy

## Invoking draw_roc_curve with the KNN model
_, _, _, _ = draw_roc_curve( knn_clf, test_X, test_y )

## Predicting on test set

pred_y = knn_clf.predict(test_X)
## Drawing the confusion matrix for KNN model
draw_cm( test_y, pred_y )
print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.85      0.77      0.81      1225
           1       0.59      0.72      0.65       575

   micro avg       0.75      0.75      0.75      1800
   macro avg       0.72      0.74      0.73      1800
weighted avg       0.77      0.75      0.76      1800

6.5.3.2 Grid Search for optimal parameters

## Importing GridSearchCV
from sklearn.model_selection import GridSearchCV

## Creating a dictionary with hyperparameters and possible values for searching
tuned_parameters = [{'n_neighbors': range(5,10),
                     'metric': ['canberra', 'euclidean', 'minkowski']}]

## Configuring grid search
clf = GridSearchCV(KNeighborsClassifier(),
                   tuned_parameters,
                   cv=10,
                   scoring='roc_auc')
## Fit the search with the training set
clf.fit(train_X, train_y )

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_neighbors': range(5, 10), 'metric': ['canberra', 'euclidean', 'minkowski']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

clf.best_score_
0.8368537419503068

clf.best_params_
{'metric': 'canberra', 'n_neighbors': 5}
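To see how much work the search above involves, the grid can be enumerated by hand. The sketch below uses only itertools to list every combination of one value per hyperparameter. The names param_grid, candidates and n_folds are illustrative, and this is not how GridSearchCV is implemented internally.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {'n_neighbors': range(5, 10),
              'metric': ['canberra', 'euclidean', 'minkowski']}

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

# Each candidate is fitted once per cross-validation fold.
n_folds = 10
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Since KNeighborsClassifier uses p=2 by default, 'minkowski' and 'euclidean' describe the same distance here, so those two settings should score identically.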

6.5.4 Ensemble Methods

6.5.5 Random Forest

6.5.5.1 Building Random Forest Model

## Importing Random Forest Classifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

## Initializing the Random Forest Classifier with max_depth and n_estimators
radm_clf = RandomForestClassifier( max_depth=10, n_estimators=10)
radm_clf.fit( train_X, train_y )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

_, _, _, _ = draw_roc_curve( radm_clf, test_X, test_y );

6.5.5.2 Grid Search for Optimal Parameters

## Configuring the parameters and the values to be searched
tuned_parameters = [{'max_depth': [10, 15],
                     'n_estimators': [10,20],
                     'max_features': ['sqrt', 'auto']}]

## Initializing the RF classifier

radm_clf = RandomForestClassifier()

## Configuring search with the tunable parameters

clf = GridSearchCV(radm_clf,
tuned_parameters,
cv=5,
scoring='roc_auc')

## Fitting the training set

clf.fit(train_X, train_y )

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [10, 20], 'max_depth': [10, 15], 'max_features': ['sqrt', 'auto']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

clf.best_score_
0.9399595384858543

clf.best_params_
{'max_depth': 15, 'max_features': 'auto', 'n_estimators': 20}

6.5.5.3 Building the final model with optimal parameter values

## Initializing the Random Forest Model with the optimal values
radm_clf = RandomForestClassifier( max_depth=15, n_estimators=20, max_features = 'auto')
## Fitting the model with the training set
radm_clf.fit( train_X, train_y )

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

6.5.5.4 ROC AUC Score

_, _, _, _ = draw_roc_curve( radm_clf, test_X, test_y )

6.5.5.5 Drawing the confusion matrix

pred_y = radm_clf.predict( test_X )
draw_cm( test_y, pred_y )

print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.90      0.94      0.92      1225
           1       0.86      0.78      0.82       575

   micro avg       0.89      0.89      0.89      1800
   macro avg       0.88      0.86      0.87      1800
weighted avg       0.89      0.89      0.89      1800

6.5.5.6 Finding important features

import numpy as np

# Create a dataframe to store the features and their corresponding importances
feature_rank = pd.DataFrame( { 'feature': train_X.columns,
                               'importance': radm_clf.feature_importances_ } )

## Sorting the features based on their importances with the most important feature at the top.
feature_rank = feature_rank.sort_values('importance', ascending = False)

plt.figure(figsize=(8, 6))
# plot the values
sn.barplot( y = 'feature', x = 'importance', data = feature_rank );
feature_rank['cumsum'] = feature_rank.importance.cumsum() * 100
feature_rank.head(10)

    feature              importance  cumsum
1   balance              0.269603    26.960282
0   age                  0.203664    47.326707
3   previous-campaign    0.117525    59.079219
2   current-campaign     0.090085    68.087703
21  housing-loan_yes     0.039898    72.077486
15  marital_married      0.034329    75.510337
22  personal-loan_yes    0.027029    78.213244
17  education_secondary  0.023934    80.606690
4   job_blue-collar      0.023081    82.914811
16  marital_single       0.022495    85.164357
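The cumsum column is the running total of the sorted importances expressed as a percentage. As a quick check, the sketch below recomputes it for the top four importances copied from the table above. The variable names are illustrative, and small differences from the table are due to the importances being shown rounded to six decimals.

```python
import numpy as np

# Top four importances copied from the table above
# (balance, age, previous-campaign, current-campaign).
importances = np.array([0.269603, 0.203664, 0.117525, 0.090085])

# Running total of the sorted importances, expressed in percent.
cumulative_pct = np.cumsum(importances) * 100
print(cumulative_pct.round(2))
```

The four numeric features alone account for roughly 68% of the total importance in this forest.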

6.5.6 Boosting

6.5.6.1 Adaboost

## Importing Adaboost classifier

from sklearn.ensemble import AdaBoostClassifier

## Initializing logistic regression to use as base classifier
logreg_clf = LogisticRegression()

## Initializing the AdaBoost classifier with 50 classifiers
ada_clf = AdaBoostClassifier(logreg_clf, n_estimators=50)

## Fitting the AdaBoost model to the training set
ada_clf.fit(train_X, train_y )
AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          learning_rate=1.0, n_estimators=50, random_state=None)

_, _, _, _ = draw_roc_curve( ada_clf, test_X, test_y )

6.5.6.2 Gradient Boosting

## Importing Gradient Boosting classifier

from sklearn.ensemble import GradientBoostingClassifier

## Initializing Gradient Boosting with 500 estimators and max depth as 10.
gboost_clf = GradientBoostingClassifier( n_estimators=500, max_depth=10)

## Fitting gradient boosting model to training set

gboost_clf.fit(train_X, train_y )
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=10,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=500,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

_, _, _, _ = draw_roc_curve( gboost_clf, test_X, test_y )

from sklearn.model_selection import cross_val_score

gboost_clf = GradientBoostingClassifier( n_estimators=500, max_depth=10)

cv_scores = cross_val_score( gboost_clf, train_X, train_y, cv = 10, scoring = 'roc_auc' )

print( cv_scores )
print( "Mean ROC AUC: ", np.mean(cv_scores), " with standard deviation of: ",
       np.std(cv_scores))
[0.98241686 0.98105851 0.98084469 0.9585199  0.95482216 0.96667006
 0.95342452 0.97368689 0.95937357 0.98174607]
Mean ROC AUC:  0.969256322542174  with standard deviation of:  0.011406249012935668

gboost_clf.fit(train_X, train_y )
pred_y = gboost_clf.predict( test_X )
draw_cm( test_y, pred_y )

print( metrics.classification_report( test_y, pred_y ) )

              precision    recall  f1-score   support

           0       0.96      0.95      0.96      1225
           1       0.90      0.92      0.91       575

   micro avg       0.94      0.94      0.94      1800
   macro avg       0.93      0.94      0.94      1800
weighted avg       0.94      0.94      0.94      1800

import numpy as np

# Create a dataframe to store the features and their corresponding importances
feature_rank = pd.DataFrame( { 'feature': train_X.columns,
                               'importance': gboost_clf.feature_importances_ } )

## Sorting the features based on their importances with the most important feature at the top.
feature_rank = feature_rank.sort_values('importance', ascending = False)

plt.figure(figsize=(8, 6))
# plot the values
sn.barplot( y = 'feature', x = 'importance', data = feature_rank );
