100 Days of Machine Learning
Feature Engineering :
1. Transformation : missing values, handling categorical values, outlier
detection, feature scaling
2. Construction
3. Selection
4. Extraction
A) Standardization :
Some algorithms do not benefit from standardization, and outliers are not
affected by standardization (an outlier stays an outlier after scaling)
When to use it : KNN, K-Means, PCA, Artificial Neural Networks,
Gradient Descent
When working with data, ask questions like :
? Is feature scaling required ?
? Standardization is used most often, then min-max scaling (image processing)
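A minimal standardization sketch using scikit-learn's StandardScaler; the DataFrame df and the column names 'age' and 'salary' are placeholders:
Code :
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df[['age', 'salary']], test_size=0.2, random_state=0)
scaler = StandardScaler()
# fit on the training data only, then apply the same mean/std to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)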
Code :
>>> # One hot encoding
>>> pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)
3. Column Transformer : some columns have integer values with missing entries
(simple imputation), e.g. Age; others have categorical data that needs
one hot encoding, e.g. city, gender; and ordinal columns need OrdinalEncoding, e.g. review
so what to do in this case ?
The answer is the column transformer
Step 1 : import all the libraries like numpy, pandas, SimpleImputer,
OneHotEncoder and OrdinalEncoder
Step 2 : split the data into X_train, X_test, y_train, y_test
Step 3 : now apply the column transformer
>>> transformer.fit_transform(X_train)
>>> transformer.transform(X_test)
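A fuller sketch of Step 3, assuming hypothetical columns 'age' (numeric with NaNs), 'city' and 'gender' (nominal) and 'review' (ordinal with assumed categories):
Code :
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

transformer = ColumnTransformer(transformers=[
    ('impute_age', SimpleImputer(), ['age']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'), ['city', 'gender']),
    ('ordinal', OrdinalEncoder(categories=[['poor', 'average', 'good']]), ['review'])
], remainder='passthrough')  # untouched columns pass through unchanged

X_train_trf = transformer.fit_transform(X_train)
X_test_trf = transformer.transform(X_test)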
6. Encoding numerical data : converting numerical data into
categorical data
Why do we use this ? To handle outliers and to improve the spread of values
There are two types ->
a) Discretization (binning) => converts a continuous range into bins
1) Unsupervised => 1.1 Equal Width
1.2 Equal Frequency 1.3 KMeans
2) Supervised => 2.1 Decision Tree Binning
b) Binarization => converts values into 0/1 using a threshold
3) Custom
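A minimal discretization sketch with scikit-learn's KBinsDiscretizer; strategy='uniform' gives equal width, 'quantile' gives equal frequency, 'kmeans' gives KMeans binning (the column name 'age' is a placeholder):
Code :
from sklearn.preprocessing import KBinsDiscretizer

kbin = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
# each value is replaced by the index of the bin it falls into
age_binned = kbin.fit_transform(X_train[['age']])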
7. Handling Mixed Variables : categorical + numerical data mixed in one column;
there are two types of mixed data 1.) B5, C23... (both parts in the same value)
2.) 1, 2, 3, A, B... (some values numeric, some categorical)
Both can be handled :
1.) B5 and C23 can be split into separate categorical and numerical columns
2.) here it is the same, split into cat and num columns, but the difference is
that more null values appear (each row fills only one of the two columns)
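A minimal sketch of both splits; the column names 'cabin' and 'ticket' are placeholders:
Code :
import pandas as pd

# type 1: letter + number in the same value, e.g. 'B5', 'C23'
df['cabin_cat'] = df['cabin'].str.extract(r'([A-Za-z]+)', expand=False)
df['cabin_num'] = df['cabin'].str.extract(r'(\d+)', expand=False).astype(float)

# type 2: some values numeric, some categorical, e.g. 1, 2, 'A', 'B'
df['ticket_num'] = pd.to_numeric(df['ticket'], errors='coerce')   # NaN where categorical
df['ticket_cat'] = df['ticket'].where(df['ticket_num'].isnull())  # NaN where numeric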
Code :
1. extract year :
date['date_year'] = date['date'].dt.year
2. extract month
date['date_month_num'] = date['date'].dt.month => this gives a numerical value
date['date_month_cat'] = date['date'].dt.month_name() => this gives a categorical value
3. extract day
date['date_day'] = date['date'].dt.day
4. extract day of week
date['date_dow'] = date['date'].dt.dayofweek
Code :
# to find the % of missing values per column
df.isnull().mean() * 100
# ! When you have numerical data, always plot the distribution;
# it gives an overview
fig = plt.figure()
ax = fig.add_subplot(111)
# original data
df['traning_hours'].hist(bins=50, ax=ax, density=True, color='red')
@ After plotting, if the imputed distribution overlaps the original one,
the data is missing at random
@ When you impute missing values with the mean or median, the variance
of the column shrinks
@@@@@ Fit on the X_train and transform on X_test
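A minimal mean-imputation sketch illustrating the rule above (fit on X_train, transform on X_test) with scikit-learn's SimpleImputer:
Code :
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or strategy='median'
# learn the mean from the training data only
X_train_trf = imputer.fit_transform(X_train)
# apply the same training mean to the test data, avoiding data leakage
X_test_trf = imputer.transform(X_test)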
##### When the CSV file has more than 50 columns and we only need
4 or 5 of them, use this
df = pd.read_csv("File_name.csv", usecols=['ABC', 'DEF', 'GHI'])
Disadvantage :
Here the covariance changes
With random sample imputation, X_train has to be kept in memory
on the server, because when new data comes in the replacement values are
sampled from this stored X_train
Well suited for linear models
Code :
# for numerical
df.isnull().mean() * 100 # to check the percentage of missing values
# On a production website, the model predicts the missing value from a
column that has a value, e.g. Age (missing) predicted from Fare
# But plain random sampling would give a different Age each time for the
same Fare, which is bad modelling; to make the sample reproducible per
observation, seed it with the Fare value as below
sample_value = X_train['Age'].dropna().sample(1, random_state=int(observation['Fare']))
2. Missing Indicator (the indicator column is True where the value is missing)
Age | Fare | Age_na
27  | 35   | False
41  | 55   | False
NaN | 41   | True
62  | 22   | False
Code :
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator

X = df.drop(columns=['Survived'])
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
mi = MissingIndicator()
mi.fit(X_train)
# transform returns a boolean column for each feature that had missing values
X_train['Age_NA'] = mi.transform(X_train)
X_test['Age_NA'] = mi.transform(X_test)
Code :
>>> from sklearn.impute import KNNImputer
>>> knn = KNNImputer(n_neighbors=5, weights='distance')
>>> x_train_trf = knn.fit_transform(x_train)
>>> x_test_trf = knn.transform(x_test)
# weights controls how KNN combines the neighbours:
# 'uniform' takes the plain mean, e.g. k = 2 -> (70 + 50) / 2 = 60
# 'distance' weights each neighbour's value by the inverse of its
distance to the row being imputed, which generally gives a more accurate result
Step 1 : Fill all the NaN values with the mean of their respective columns
Step 2 : Remove the imputed values from col1 (set them back to missing)
Step 3 : Predict the missing values of col1 using the other columns
Step 4 : Remove the imputed values from col2
Step 5 : Predict the missing values of col2 using the other columns
Step 6 : Remove the imputed values from col3, and so on, iterating
over the columns until the imputed values converge
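These steps describe iterative (MICE-style) imputation; a minimal sketch with scikit-learn's experimental IterativeImputer:
Code :
from sklearn.experimental import enable_iterative_imputer  # required to use IterativeImputer
from sklearn.impute import IterativeImputer

it = IterativeImputer(max_iter=10, random_state=0)
# each feature with missing values is modelled from the other features, iteratively
x_train_trf = it.fit_transform(x_train)
x_test_trf = it.transform(x_test)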
>>> plt.subplot(1,2,2)
>>> sns.histplot(data['placement_exam_marks'], kde=True)  # axes-level, works inside a subplot
>>> plt.show()
>>> # Trimming
>>> new_df = df[(df['cgf'] < 8.80) & (df['cgf'] > 5.11)]
>>> new_df
B. IQR Method : use when the data is skewed. The IQR method is
simple and effective, but its accuracy might vary depending on your data.
You can adjust the "1.5 times" multiplier to be more or less
strict in identifying outliers.
Just remember, outlier detection should always consider the
context and purpose of your analysis.
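A minimal IQR outlier-detection sketch in pandas; 'cgf' mirrors the column used above:
Code :
q1 = df['cgf'].quantile(0.25)
q3 = df['cgf'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # the 1.5 multiplier can be tightened or relaxed
upper = q3 + 1.5 * iqr
outliers = df[(df['cgf'] < lower) | (df['cgf'] > upper)]
trimmed = df[(df['cgf'] >= lower) & (df['cgf'] <= upper)]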
C. Using percentile method : trim or cap values that fall outside chosen
percentiles, e.g. below the 1st or above the 99th percentile.
The cutoff percentiles can be made more or less strict
depending on your data.
Just remember, outlier detection should always
consider the context and purpose of your analysis.
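A minimal percentile capping (winsorization) sketch; the 1st/99th percentile cutoffs are one common choice:
Code :
lower = df['cgf'].quantile(0.01)
upper = df['cgf'].quantile(0.99)
# capping: clip values beyond the percentiles instead of dropping the rows
df['cgf_capped'] = df['cgf'].clip(lower, upper)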
b. Curse of Dimensionality : with too few or too many features the
performance of the model will not be optimal
Ex : image and text data are where this problem takes place
Why does it happen ? Because with more dimensions the data becomes
sparse, so
performance decreases and computation cost becomes a disadvantage
Solution : dimensionality reduction
1. Feature selection (backward and forward
selection)
2. Feature extraction (PCA, LDA and t-SNE)
Steps to do PCA
1.) Mean centering
2.) Find the covariance matrix
3.) Find the eigenvalues / eigenvectors
for 3-D data the covariance matrix gives 3 eigenvectors: a1 (PC1),
a2 (PC2), a3 (PC3)
for 3d to 1d -> project onto PC1 only
for 3d to 2d -> project onto PC1 + PC2
The fundamental idea is to project the data onto the directions of maximum variance.
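A minimal NumPy sketch of these three steps, assuming X is an (n_samples, 3) array and we reduce 3-D to 2-D:
Code :
import numpy as np

X_centered = X - X.mean(axis=0)             # 1.) mean centering
cov = np.cov(X_centered, rowvar=False)      # 2.) covariance matrix (3x3)
eig_vals, eig_vecs = np.linalg.eigh(cov)    # 3.) eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]          # sort by decreasing variance
pcs = eig_vecs[:, order[:2]]                # keep PC1 + PC2 for 3d -> 2d
X_reduced = X_centered @ pcs                # project onto the top components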
Code :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

class MeraLR:
    def __init__(self):
        self.m = None
        self.b = None
    def fit(self, X_train, y_train):
        # closed-form simple linear regression: m = cov(x, y) / var(x)
        num = ((X_train - X_train.mean()) * (y_train - y_train.mean())).sum()
        den = ((X_train - X_train.mean()) ** 2).sum()
        self.m = num / den
        self.b = y_train.mean() - self.m * X_train.mean()
    def predict(self, X_test):
        return self.m * X_test + self.b

X = df.iloc[:, 0].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
lr = MeraLR()
lr.fit(X_train, y_train)  # gives m and b of the line
lr.predict(X_test)
Code :
class GDRegression:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.lr = learning_rate
        self.epochs = epochs
        self.coef_ = None
        self.intercept_ = None
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            # update all the coefs and the intercept using the full batch
            y_hat = np.dot(X_train, self.coef_) + self.intercept_
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Stochastic Gradient Descent Regression :
Code :
class SGDRegression:
    def __init__(self, learning_rate=0.01, epochs=40):
        self.lr = learning_rate
        self.epochs = epochs
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(X_train.shape[0]):
                idx = np.random.randint(0, X_train.shape[0])
                # one random row at a time, so y_hat is a scalar, not a matrix
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * (y_train[idx] - y_hat) * X_train[idx]
                self.coef_ = self.coef_ - (self.lr * coef_der)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Mini Batch : split the training data into batches; the parameters are updated
once per batch, and one pass over all the batches completes an epoch
Code :
import random

class MBGDRegression:
    def __init__(self, batch_size=32, learning_rate=0.01, epochs=100):
        self.batch_size = batch_size
        self.lr = learning_rate
        self.epochs = epochs
    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(int(X_train.shape[0] / self.batch_size)):
                # pick a random batch of row indices without replacement
                idx = random.sample(range(X_train.shape[0]), self.batch_size)
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_, self.coef_)
    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
Learning Schedule : instead of a fixed learning rate, we vary it with a
learning schedule, typically decreasing it as the epochs increase
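A minimal sketch of one common schedule (inverse decay); t0 and t1 are assumed hyperparameters:
Code :
t0, t1 = 5, 50  # schedule hyperparameters (assumed values)

def learning_schedule(t):
    # learning rate shrinks as the step count t grows
    return t0 / (t + t1)

# inside the training loop: lr = learning_schedule(epoch * n_batches + batch)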
14. Bias Variance Trade Off : when the model cannot perform well even on the
training set, it has high bias; when small changes in the training data cause a
large difference between actual and predicted values (a big gap between training
and test performance), it has high variance
Overfitting : the model works well on the training data but not on the test data
(Low Bias and High Variance)
Underfitting : the model cannot perform well even on the training data (High Bias
and Low Variance)
15. Ridge Regression (closed form for simple linear regression) :
def fit(self, X_train, y_train):
    num = 0
    den = 0
    for i in range(X_train.shape[0]):
        num = num + (y_train[i] - y_train.mean()) * (X_train[i] - X_train.mean())
        den = den + (X_train[i] - X_train.mean()) ** 2
    # the regularization term alpha (self.a) is added once to the denominator
    self.m = num / (den + self.a)
    self.b = y_train.mean() - (self.m * X_train.mean())
    print(self.m, self.b)

def predict(self, X_test):
    return self.m * X_test + self.b
16. Lasso