3. Data Preprocessing
Data preprocessing transforms raw data into a format
suitable for analysis and modeling. Key preprocessing
techniques include:
3.1 Feature Scaling
Feature scaling ensures that numerical features have
comparable ranges, preventing models from being biased
towards larger values. Common techniques include:
Min-Max Scaling (Normalization)
o Scales values between 0 and 1.
o Best for data without outliers.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
Standardization (Z-score Normalization)
o Centers data around zero with unit variance.
o Suitable for normally distributed data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
Robust Scaling
o Uses median and IQR to scale data.
o Effective for datasets with outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
Max Abs Scaling
o Scales data by dividing by the maximum absolute
value.
o Useful for sparse data.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])
When to Use Each Scaling Technique?
Min-Max Scaling: When you need all values within a fixed range, such as (0, 1). Useful for image processing.
Standardization: When data follows a normal
distribution.
Robust Scaling: When data contains outliers.
Max Abs Scaling: When working with sparse data like
text-based features.
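To see how these choices differ in practice, here is a minimal comparison sketch (using a small, made-up column containing one outlier) that applies all four scalers to the same data:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
# Hypothetical single-column dataset with one large outlier (100)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler(), MaxAbsScaler()):
    scaled = scaler.fit_transform(X)
    # RobustScaler keeps the inliers spread out; MinMaxScaler squashes them near 0
    print(scaler.__class__.__name__, scaled.ravel().round(2))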
3.2 Encoding Categorical Variables
Many machine learning models require numerical input, so
categorical variables need to be converted into numeric
representations. Common encoding techniques include:
One-Hot Encoding
o Converts categorical variables into binary columns.
o Suitable for nominal categorical variables.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
encoder = OneHotEncoder(sparse_output=False, drop='first')  # sparse_output replaces sparse in scikit-learn >= 1.2
encoded_cols = encoder.fit_transform(df[['category_column']])
df_encoded = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out(['category_column']), index=df.index)
df = df.join(df_encoded).drop(columns=['category_column'])
Label Encoding
o Assigns a unique integer to each category (in alphabetical order, not by any inherent ranking).
o Commonly applied to ordinal or target variables, but prefer Ordinal Encoding when the category order matters.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['category_column'] = encoder.fit_transform(df['category_column'])
Ordinal Encoding
o Maps categories to integers based on order.
o Useful for ordinal data like education levels (e.g.,
High School < Bachelor < Master < PhD).
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df[['category_column']] = encoder.fit_transform(df[['category_column']])
Frequency Encoding
o Replaces categories with their frequency in the
dataset.
freq_encoding = df['category_column'].value_counts().to_dict()
df['category_column'] = df['category_column'].map(freq_encoding)
Target Encoding (Mean Encoding)
o Replaces categories with the mean of the target
variable.
o Useful in supervised learning but may cause data
leakage.
target_mean_encoding = df.groupby('category_column')['target'].mean().to_dict()
df['category_column'] = df['category_column'].map(target_mean_encoding)
Choosing the Right Encoding Technique
One-Hot Encoding: Best for nominal data with a small
number of unique values.
Label Encoding: Suitable for ordinal data.
Ordinal Encoding: When the categorical feature has an
inherent order.
Frequency Encoding: When high-cardinality categorical
data is present.
Target Encoding: Useful in supervised learning, but must be used cautiously to avoid target leakage (a leakage-safe variant is sketched below).
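Because target encoding uses the label itself, computing category means on the full dataset leaks target information into the features. A minimal sketch of out-of-fold target encoding is shown below; the column names 'category_column' and 'target' are carried over from the earlier examples, and the helper function name is hypothetical:
import pandas as pd
from sklearn.model_selection import KFold
# Out-of-fold target encoding: each row is encoded with category means
# computed only from the other folds, which reduces target leakage.
def target_encode_oof(df, cat_col, target_col, n_splits=5):
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(fold_means).to_numpy()
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)
# Usage (hypothetical column names):
# df['category_column'] = target_encode_oof(df, 'category_column', 'target')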
3.3 Feature Engineering
Feature engineering involves creating new features or
modifying existing ones to improve model performance.
Some key techniques include:
Feature Extraction: Deriving useful features from
existing data (e.g., extracting text length from textual
data).
Feature Transformation: Applying mathematical
functions to normalize or scale data (e.g., log
transformations).
Feature Selection: Choosing the most important
features to reduce dimensionality and improve
efficiency.
Polynomial Features: Generating higher-order features
to capture complex relationships.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
df_poly = poly.fit_transform(df[['feature1', 'feature2']])
Binning: Grouping continuous variables into discrete
bins.
Encoding Categorical Variables: Converting categorical
variables into numerical format (One-Hot, Label, Target
Encoding).
Time-Series Feature Engineering: Extracting features
like rolling averages, lags, and trends from time-series
data.
Handling Missing Values: Using mean/mode
imputation, KNN imputation, or model-based methods.
Feature engineering enhances model performance by applying meaningful transformations to raw data, supporting better predictions and interpretability. A short sketch of several of these techniques follows.
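Here is a minimal sketch on a small, made-up DataFrame (the 'date', 'price', and 'sales' columns are hypothetical) showing mean imputation, a log transformation, binning, and simple time-series features:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Hypothetical data: daily sales with a skewed price column and missing values
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=8, freq='D'),
    'price': [10, 12, np.nan, 15, 200, 14, np.nan, 13],
    'sales': [100, 120, 90, 130, 300, 110, 95, 105],
})
# Handling missing values: mean imputation
df['price'] = SimpleImputer(strategy='mean').fit_transform(df[['price']]).ravel()
# Feature transformation: log transform to reduce skew
df['log_price'] = np.log1p(df['price'])
# Binning: group the continuous price into discrete buckets
df['price_bin'] = pd.cut(df['price'], bins=3, labels=['low', 'medium', 'high'])
# Time-series features: lag and rolling average of sales
df = df.sort_values('date')
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_rolling_3'] = df['sales'].rolling(window=3).mean()
print(df.head())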
3.4 Handling Imbalanced Data
Handling imbalanced data is crucial in classification
problems where one class has significantly fewer samples
than another. Techniques to address imbalanced data
include:
Resampling Techniques:
o Oversampling (SMOTE, ADASYN): Generating synthetic samples for the minority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
o Undersampling: Randomly removing samples from the majority class to balance the dataset.
from imblearn.under_sampling import RandomUnderSampler
undersample = RandomUnderSampler()
X_resampled, y_resampled = undersample.fit_resample(X, y)
Cost-Sensitive Learning: Assigning higher weights to the minority class during training (see the sketch after this list).
Anomaly Detection Approaches: Treating minority class
samples as anomalies and using specialized detection
techniques.
Data Augmentation: Using transformations, synthetic
data generation, or GANs to create more minority class
samples.
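Cost-sensitive learning is often the easiest of these to try, because many scikit-learn classifiers accept a class_weight parameter. The sketch below, on a synthetically generated imbalanced dataset, uses class_weight='balanced' so that errors on the minority class are penalized more heavily:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Hypothetical imbalanced dataset: roughly 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# class_weight='balanced' reweights classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))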
3.5 Principal Component Analysis (PCA) for
Dimensionality Reduction
PCA is a technique used to reduce the dimensionality of
large datasets while preserving important information. It
helps remove redundancy and speed up computations in
machine learning models.
Steps in PCA
1. Standardize the Data: Ensure that all features have zero
mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
2. Compute the Covariance Matrix: Understand how features vary with each other.
3. Compute Eigenvalues and Eigenvectors: Identify the principal components (a NumPy sketch of steps 2-5 follows this list).
4. Select the Top Principal Components: Choose the
number of components based on explained variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Choose 2 principal components
X_pca = pca.fit_transform(X_scaled)
5. Transform the Data: Project data onto the selected
principal components.
6. Analyze Explained Variance:
print(pca.explained_variance_ratio_)
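scikit-learn performs the covariance and eigendecomposition steps internally, but a short NumPy sketch (reusing X_scaled from step 1) shows what steps 2-5 amount to and how the result relates to explained_variance_ratio_:
import numpy as np
# Steps 2-3 by hand: covariance matrix of the standardized data,
# then its eigenvalues/eigenvectors (the principal components)
cov_matrix = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort components by decreasing eigenvalue (explained variance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Step 4: keep enough components to explain ~95% of the variance
explained_ratio = eigenvalues / eigenvalues.sum()
n_components = np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1
# Step 5: project the data onto the selected components
X_projected = X_scaled @ eigenvectors[:, :n_components]
print(n_components, explained_ratio[:n_components])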
Advantages of PCA
Reduces dimensionality, improving model efficiency.
Removes multicollinearity among features.
Helps visualize high-dimensional data in 2D or 3D.
Reduces overfitting in models with many features.
Limitations of PCA
Can lead to information loss if too many components
are removed.
Difficult to interpret transformed features.
Assumes linear relationships among variables.