100 Days of Machine Learning
Feature Engineering :
1. Transformation : missing values, handling categorical values, outlier
detection, feature scaling
2. Construction
3. Selection
4. Extraction
A) Standardization :
Some algorithms do not benefit from standardization, and outliers are not
affected by standardization (an outlier stays an outlier after scaling)
When to use it : KNN, K-Means, PCA, Artificial Neural Networks,
Gradient Descent
When working with data, ask questions like :
? Is feature scaling required ?
? Standardization is used most often, then min-max scaling (image processing)
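A minimal standardization sketch using scikit-learn's StandardScaler; the DataFrame df and the column names 'age' and 'salary' are placeholders:
Code :
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df[['age', 'salary']], test_size=0.2, random_state=0)
scaler = StandardScaler()
# fit on the training data only, then apply the same mean/std to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)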
Code :
>>> # One hot encoding
>>> pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)
3. Column Transformer : some columns have integer values with missing entries
(simple imputation), e.g. Age; others have categorical data that needs
one hot encoding, e.g. city, gender; and ordinal columns need OrdinalEncoding, e.g. review
so what to do in this case ?
The answer is the column transformer
Step 1 : import all the libraries like numpy, pandas, SimpleImputer,
OneHotEncoder and OrdinalEncoder
Step 2 : split the data into X_train, X_test, y_train, y_test
Step 3 : now apply the column transformer
>>> transformer.fit_transform(X_train)
>>> transformer.transform(X_test)
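A fuller sketch of Step 3, assuming hypothetical columns 'age' (numeric with NaNs), 'city' and 'gender' (nominal) and 'review' (ordinal with assumed categories):
Code :
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

transformer = ColumnTransformer(transformers=[
    ('impute_age', SimpleImputer(), ['age']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['city', 'gender']),
    ('ordinal', OrdinalEncoder(categories=[['poor', 'average', 'good']]), ['review'])
], remainder='passthrough')  # untouched columns pass through unchanged

X_train_trf = transformer.fit_transform(X_train)
X_test_trf = transformer.transform(X_test)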
6. Encoding numerical data : converting numerical data into
categorical data
Why do we use this ? To handle outliers and to improve the spread of values
There are two types ->
a) Discretization (binning) => converts a continuous range into bins
1) Unsupervised => 1.1 Equal Width
1.2 Equal Frequency 1.3 KMeans
2) Supervised => 2.1 Decision Tree Binning
b) Binarization => converts values into 0/1 using a threshold
3) Custom
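A minimal discretization sketch with scikit-learn's KBinsDiscretizer; strategy='uniform' gives equal width, 'quantile' gives equal frequency, 'kmeans' gives KMeans binning (the column name 'age' is a placeholder):
Code :
from sklearn.preprocessing import KBinsDiscretizer

kbin = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
# each value is replaced by the index of the bin it falls into
age_binned = kbin.fit_transform(X_train[['age']])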
7. Handling Mixed Variables : categorical + numerical data mixed in one column;
there are two types of mixed data 1.) B5, C23... (both parts in the same value)
2.) 1, 2, 3, A, B... (some values numeric, some categorical)
Both can be handled :
1.) B5 and C23 can be split into separate categorical and numerical columns
2.) here it is the same, split into cat and num columns, but the difference is
that more null values appear (each row fills only one of the two columns)
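A minimal sketch of both splits; the column names 'cabin' and 'ticket' are placeholders:
Code :
import pandas as pd

# type 1: letter + number in the same value, e.g. 'B5', 'C23'
df['cabin_cat'] = df['cabin'].str.extract(r'([A-Za-z]+)', expand=False)
df['cabin_num'] = df['cabin'].str.extract(r'(\d+)', expand=False).astype(float)

# type 2: some values numeric, some categorical, e.g. 1, 2, 'A', 'B'
df['ticket_num'] = pd.to_numeric(df['ticket'], errors='coerce')   # NaN where categorical
df['ticket_cat'] = df['ticket'].where(df['ticket_num'].isnull())  # NaN where numeric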
Code :
1. extract year :
date['date_year'] = date['date'].dt.year
2. extract month
date['date_month_num'] = date['date'].dt.month => this gives a numerical value
date['date_month_cat'] = date['date'].dt.month_name() => this gives a categorical value
3. extract day
date['date_day'] = date['date'].dt.day
4. extract day of week
date['date_dow'] = date['date'].dt.dayofweek
Code :
# to find the % of missing values per column
df.isnull().mean() * 100
# ! When you have numerical data, always plot the distribution;
# it gives an overview
fig = plt.figure()
ax = fig.add_subplot(111)
# original data
df['traning_hours'].hist(bins=50, ax=ax, density=True, color='red')
@ After plotting, if the imputed distribution overlaps the original one,
the data is missing at random
@ When you impute missing values with the mean or median, the variance
of the column shrinks
@@@@@ Fit on the X_train and transform on X_test
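A minimal mean-imputation sketch illustrating the rule above (fit on X_train, transform on X_test) with scikit-learn's SimpleImputer:
Code :
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or strategy='median'
# learn the mean from the training data only
X_train_trf = imputer.fit_transform(X_train)
# apply the same training mean to the test data, avoiding data leakage
X_test_trf = imputer.transform(X_test)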
##### When the CSV file has more than 50 columns and we only need
4 or 5 of them, use this
df = pd.read_csv("File_name.csv", usecols=['ABC', 'DEF', 'GHI'])
Disadvantage :
Here the covariance changes
With random sample imputation, X_train has to be kept in memory
on the server, because when new data comes in the replacement values are
sampled from this stored X_train
Well suited for linear models
Code :
# for numerical
df.isnull().mean() * 100 # to check the percentage of missing values
# On a production website, the model predicts the missing value from a
column that has a value, e.g. Age (missing) predicted from Fare
# But plain random sampling would give a different Age each time for the
same Fare, which is bad modelling; to make the sample reproducible per
observation, seed it with the Fare value as below
sample_value = X_train['Age'].dropna().sample(1, random_state=int(observation['Fare']))
2. Missing Indicator (the indicator column is True where the value is missing)
Age | Fare | Age_na
27  | 35   | False
41  | 55   | False
NaN | 41   | True
62  | 22   | False
Code :
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator

X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
mi = MissingIndicator()
mi.fit(X_train)
# transform returns a boolean column for each feature that had missing values
X_train['Age_NA'] = mi.transform(X_train)
X_test['Age_NA'] = mi.transform(X_test)
Code :
>>> from sklearn.impute import KNNImputer
>>> knn = KNNImputer(n_neighbors=5, weights='distance')
>>> x_train_trf = knn.fit_transform(x_train)
>>> x_test_trf = knn.transform(x_test)
# weights controls how KNN combines the neighbours:
# 'uniform' takes the plain mean, e.g. k = 2 -> (70 + 50) / 2 = 60
# 'distance' weights each neighbour's value by the inverse of its
distance to the row being imputed, which generally gives a more accurate result
Step 1 : Fill all the NaN values with the mean of their respective columns
Step 2 : Remove the imputed values from col1 (set them back to missing)
Step 3 : Predict the missing values of col1 using the other columns
Step 4 : Remove the imputed values from col2
Step 5 : Predict the missing values of col2 using the other columns
Step 6 : Remove the imputed values from col3, and so on, iterating
over the columns until the imputed values converge
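These steps describe iterative (MICE-style) imputation; a minimal sketch with scikit-learn's experimental IterativeImputer:
Code :
from sklearn.experimental import enable_iterative_imputer  # required to use IterativeImputer
from sklearn.impute import IterativeImputer

it = IterativeImputer(max_iter=10, random_state=0)
# each feature with missing values is modelled from the other features, iteratively
x_train_trf = it.fit_transform(x_train)
x_test_trf = it.transform(x_test)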
>>> plt.subplot(1,2,2)
>>> sns.histplot(data['placement_exam_marks'], kde=True)  # axes-level, works inside a subplot
>>> plt.show()
>>> # Trimming
>>> new_df = df[(df['cgf'] < 8.80) & (df['cgf'] > 5.11)]
>>> new_df
B. IQR Method : use when the data is skewed. The IQR method is
simple and effective, but its accuracy might vary depending on your data.
You can adjust the "1.5 times" multiplier to be more or less
strict in identifying outliers.
Just remember, outlier detection should always consider the
context and purpose of your analysis.
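A minimal IQR outlier-detection sketch in pandas; 'cgf' mirrors the column used above:
Code :
q1 = df['cgf'].quantile(0.25)
q3 = df['cgf'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # the 1.5 multiplier can be tightened or relaxed
upper = q3 + 1.5 * iqr
outliers = df[(df['cgf'] < lower) | (df['cgf'] > upper)]
trimmed = df[(df['cgf'] >= lower) & (df['cgf'] <= upper)]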
C. Using percentile method : trim or cap values that fall outside chosen
percentiles, e.g. below the 1st or above the 99th percentile.
The cutoff percentiles can be made more or less strict
depending on your data.
Just remember, outlier detection should always
consider the context and purpose of your analysis.
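A minimal percentile capping (winsorization) sketch; the 1st/99th percentile cutoffs are one common choice:
Code :
lower = df['cgf'].quantile(0.01)
upper = df['cgf'].quantile(0.99)
# capping: clip values beyond the percentiles instead of dropping the rows
df['cgf_capped'] = df['cgf'].clip(lower, upper)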
b. Curse of Dimensionality : with too few or too many features the
performance of the model will not be optimal
Ex : image and text data are where this problem takes place
Why does it happen ? Because with more dimensions the data becomes
sparse, so
performance decreases and computation cost becomes a disadvantage
Solution : dimensionality reduction
1. Feature selection (backward and forward
selection)
2. Feature extraction (PCA, LDA and t-SNE)
Steps to do PCA
1.) Mean centering
2.) Find the covariance matrix
3.) Find the eigenvalues / eigenvectors
for 3-D data the covariance matrix gives 3 eigenvectors: a1 (PC1),
a2 (PC2), a3 (PC3)
for 3d to 1d -> project onto PC1 only
for 3d to 2d -> project onto PC1 + PC2
The fundamental idea is to project the data onto the directions of maximum variance.
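A minimal NumPy sketch of these three steps, assuming X is an (n_samples, 3) array and we reduce 3-D to 2-D:
Code :
import numpy as np

X_centered = X - X.mean(axis=0)             # 1.) mean centering
cov = np.cov(X_centered, rowvar=False)      # 2.) covariance matrix (3x3)
eig_vals, eig_vecs = np.linalg.eigh(cov)    # 3.) eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]          # sort by decreasing variance
pcs = eig_vecs[:, order[:2]]                # keep PC1 + PC2 for 3d -> 2d
X_reduced = X_centered @ pcs                # project onto the top components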
Code :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

class MeraLR:
    def __init__(self):
        self.m = None
        self.b = None
    def fit(self, X_train, y_train):
        # closed-form simple linear regression: m = cov(x, y) / var(x)
        num = ((X_train - X_train.mean()) * (y_train - y_train.mean())).sum()
        den = ((X_train - X_train.mean()) ** 2).sum()
        self.m = num / den
        self.b = y_train.mean() - self.m * X_train.mean()
    def predict(self, X_test):
        return self.m * X_test + self.b

X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
lr = MeraLR()
lr.fit(X_train, y_train)  # gives m and b of the line
lr.predict(X_test)
Code :
class GDRegression:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.lr = learning_rate
        self.epochs = epochs
        self.coef_ = None
        self.intercept_ = None
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            # update all the coefs and the intercept using the full batch
            y_hat = np.dot(X_train, self.coef_) + self.intercept_
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Stochastic Gradient Descent Regression :
Code :
class SGDRegression:
    def __init__(self, learning_rate=0.01, epochs=40):
        self.lr = learning_rate
        self.epochs = epochs
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(X_train.shape[0]):
                idx = np.random.randint(0, X_train.shape[0])
                # one random row at a time, so y_hat is a scalar, not a matrix
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * (y_train[idx] - y_hat) * X_train[idx]
                self.coef_ = self.coef_ - (self.lr * coef_der)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Mini Batch : split the training data into batches; the parameters are updated
once per batch, and one pass over all the batches completes an epoch
Code :
import random

class MBGDRegression:
    def __init__(self, batch_size=32, learning_rate=0.01, epochs=100):
        self.batch_size = batch_size
        self.lr = learning_rate
        self.epochs = epochs
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(int(X_train.shape[0] / self.batch_size)):
                # pick a random batch of row indices without replacement
                idx = random.sample(range(X_train.shape[0]), self.batch_size)
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_, self.coef_)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Learning Schedule : instead of a fixed learning rate, we vary it with a
learning schedule, typically decreasing it as the epochs increase
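A minimal sketch of one common schedule (inverse decay); t0 and t1 are assumed hyperparameters:
Code :
t0, t1 = 5, 50  # schedule hyperparameters (assumed values)

def learning_schedule(t):
    # learning rate shrinks as the step count t grows
    return t0 / (t + t1)

# inside the training loop: lr = learning_schedule(epoch * n_batches + batch)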
14. Bias Variance Trade Off : when the model cannot perform well even on the
training set, it has high bias; when small changes in the training data cause a
large difference between actual and predicted values (a big gap between training
and test performance), it has high variance
Overfitting : the model works well on the training data but not on the test data
(Low Bias and High Variance)
Underfitting : the model cannot perform well even on the training data (High Bias
and Low Variance)
15. Ridge Regression (closed form for simple linear regression) :
def fit(self, X_train, y_train):
    num = 0
    den = 0
    for i in range(X_train.shape[0]):
        num = num + (y_train[i] - y_train.mean()) * (X_train[i] - X_train.mean())
        den = den + (X_train[i] - X_train.mean()) ** 2
    # the regularization term alpha (self.a) is added once to the denominator
    self.m = num / (den + self.a)
    self.b = y_train.mean() - (self.m * X_train.mean())
    print(self.m, self.b)

def predict(self, X_test):
    return self.m * X_test + self.b
16. Lasso