Unit 5
Data reduction strategies aim to minimize the volume of data while maintaining its integrity and relevance for analysis. These techniques are crucial for handling large datasets efficiently, reducing storage requirements and improving computational efficiency and model performance. Below are the main data reduction strategies:
1. Dimensionality Reduction
Reduces the number of attributes (features) while preserving as much information as possible.
Techniques:
o Principal Components Analysis (PCA): Projects the data onto a smaller set of uncorrelated components (discussed later in this unit).
o Wavelet Transforms: Decompose the data into coefficients, many of which can be discarded (discussed later in this unit).
2. Data Compression
Techniques:
o Lossy Compression: Reduces data size by discarding less critical information (e.g., JPEG for images, MP3 for audio).
o Lossless Compression: Reduces data size without losing any information (e.g., ZIP, PNG).
3. Data Sampling
Selects a representative subset of the data instead of processing the full dataset (see the sketch after this list).
Techniques:
o Random Sampling: Selects records uniformly at random.
o Stratified Sampling: Ensures that the subset maintains the proportion of different groups (e.g., class labels).
o Cluster Sampling: Groups the data into clusters and randomly samples from them.
4. Data Aggregation
Techniques:
o Binning: Dividing continuous data into intervals and summarizing each bin (e.g.,
frequency counts).
5. Data Transformation
Converts data into a more compact or analysis-friendly representation.
Techniques:
o Normalization: Rescales values to a common range (e.g., 0 to 1).
o Discretization: Converts continuous attributes into categorical intervals.
6. Clustering
Groups similar data points into clusters and uses cluster representatives for analysis.
7. Numerosity Reduction
Reduces the number of data points by replacing them with models or patterns.
Techniques:
o Parametric methods: Fit a model (e.g., a regression model) and store only its parameters.
o Non-parametric methods: Use histograms, clustering, or sampling instead of the raw data.
8. Data Cube Aggregation
Used in OLAP (Online Analytical Processing) to aggregate data across multiple dimensions for easier analysis.
9. Data Pruning
Removes data that does not contribute useful information to the analysis.
Techniques:
o Removing redundant or irrelevant attributes.
o Removing outliers or duplicate records.
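As a quick illustration of binning and stratified sampling, here is a minimal sketch using pandas and scikit-learn; the column names and values are invented for the example:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset: one numeric feature and one class label.
    df = pd.DataFrame({
        "age":   [22, 25, 31, 38, 45, 52, 58, 63, 70, 74],
        "label": ["A", "A", "B", "A", "B", "B", "A", "B", "A", "B"],
    })

    # Binning: divide the continuous column into intervals and summarize each bin.
    df["age_bin"] = pd.cut(df["age"], bins=3)
    print(df.groupby("age_bin", observed=True).size())  # frequency count per bin

    # Stratified sampling: the reduced subset keeps the label proportions.
    subset, _ = train_test_split(df, train_size=0.5, stratify=df["label"], random_state=0)
    print(subset["label"].value_counts(normalize=True))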
Wavelet Transforms:
Wavelet transforms are a powerful mathematical tool widely used in data science and signal
processing to analyze and process signals, images, and data sets. They are particularly effective for
applications requiring multi-resolution analysis, where features at different scales or frequencies
need to be examined.
1. Signal Processing
Noise Reduction: Removing noise from time-series data while preserving essential features.
Feature Extraction: Identifying key patterns or characteristics in signals for tasks like
classification or forecasting.
2. Image Processing
Compression: Used in formats like JPEG2000 to achieve high compression ratios with
minimal loss of quality.
Feature Detection: Enhancing edges or texture patterns for computer vision applications.
3. Time-Series Analysis
Decomposes a time series so that trends, seasonality, and transient events can be examined at different scales.
Useful in financial data analysis, earthquake prediction, and biomedical signal analysis.
4. Machine Learning
Feature Engineering: Wavelet coefficients can serve as input features to machine learning
models for classification or regression tasks.
Dimensionality Reduction: Decomposing data into wavelet domains can help focus on the
most informative components.
5. Biomedical Applications
Analysis of EEG, ECG, and other physiological signals for diagnosis and monitoring.
Common Wavelet Families:
Haar (haar): The simplest wavelet, based on step functions.
Daubechies (db): Compactly supported wavelets with a configurable number of vanishing moments.
Coiflets (coif): Provide vanishing moments for both the wavelet and scaling function.
Python Libraries:
PyWavelets (pywt): The standard Python package for discrete and continuous wavelet transforms.
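As an illustration of wavelet-based noise reduction, here is a minimal sketch using PyWavelets; the wavelet ("db4"), decomposition level, and threshold value are arbitrary choices for the example, not recommended settings:

    import numpy as np
    import pywt

    # Hypothetical noisy signal: a smooth ramp plus Gaussian noise.
    rng = np.random.default_rng(0)
    signal = np.linspace(0, 1, 256) + rng.normal(scale=0.1, size=256)

    # Multi-level discrete wavelet decomposition (Daubechies-4, 3 levels).
    coeffs = pywt.wavedec(signal, "db4", level=3)

    # Soft-threshold the detail coefficients to suppress noise, then reconstruct.
    threshold = 0.2
    coeffs[1:] = [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
    denoised = pywt.waverec(coeffs, "db4")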
The Haar wavelet transform decomposes a signal into approximation and detail coefficients.
It operates on pairs of data points: each pair (a, b) is replaced by an approximation (a + b)/2 and a detail (a − b)/2.
Example:
x = [4, 6, 10, 12, 14, 16, 18, 20]
Level 1: averaging each pair gives [5, 11, 15, 19] and differencing gives [−1, −1, −1, −1], so
x = [5, 11, 15, 19 ∣ −1, −1, −1, −1]
Level 2: take the new approximation coefficients [5, 11, 15, 19] and repeat the process.
Pairs (5, 11) and (15, 19) give approximations [8, 17] and details [−3, −2].
Level 3: the remaining pair (8, 17) gives approximation [12.5] and detail [−4.5].
Summary
At each level, the signal is split into approximation coefficients (pair averages) and detail coefficients (half the pair differences), so the approximation is half as long as before.
This decomposition is useful for compression and denoising, as smaller detail coefficients can often be discarded or thresholded.
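Below is a minimal Python sketch of this averaging/differencing scheme (note that library implementations such as pywt use an orthonormal 1/√2 scaling instead of 1/2, so their coefficients differ by a constant factor per level):

    # Averaging/differencing Haar step, as in the worked example above.
    def haar_step(x):
        approx = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
        detail = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
        return approx, detail

    x = [4, 6, 10, 12, 14, 16, 18, 20]
    level1, d1 = haar_step(x)        # [5, 11, 15, 19] and [-1, -1, -1, -1]
    level2, d2 = haar_step(level1)   # [8, 17]         and [-3, -2]
    level3, d3 = haar_step(level2)   # [12.5]          and [-4.5]
    print(level3 + d3 + d2 + d1)     # full coefficient list after 3 levels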
Principal Components Analysis (PCA) is a widely used statistical technique for data reduction
and dimensionality reduction. PCA transforms a high-dimensional dataset into a smaller set
of principal components that retain most of the variance (information) in the data.
1. Standardize the Data: Center the data by subtracting the mean and scaling it by the standard deviation to ensure all variables are on the same scale.
2. Compute the Covariance Matrix: Calculate the covariance matrix to measure relationships between variables.
3. Compute Eigenvalues and Eigenvectors: Decompose the covariance matrix; the eigenvectors give the directions of the principal components and the eigenvalues give the variance explained by each.
4. Select Principal Components: Choose the top k principal components that explain the majority of the variance (e.g., retain components that explain 95% of the total variance).
5. Transform Data: Project the original data onto the selected principal components to reduce the dimensionality.
Applications of PCA:
o Customer Segmentation: Simplify customer data with multiple features into a few interpretable components.
Problem
Reduce the following two-variable dataset to one dimension using PCA:
X1  X2
 2   3
 3   5
 4   7
 5   9
 6  11
Step-by-Step Solution
Step 1: Standardize the Data
Z = (X − μ) / σ, where μ is the column mean and σ the (population) standard deviation.
X1 (Standardized)  X2 (Standardized)
-1.414             -1.414
-0.707             -0.707
 0                   0
 0.707               0.707
 1.414               1.414
Step 2: Compute the Covariance Matrix
Because the two standardized columns are identical, the covariance matrix is
C = [[1, 1], [1, 1]]
Step 3: Compute Eigenvalues and Eigenvectors
The eigenvalues of C are λ1 = 2 and λ2 = 0, with first eigenvector v1 = (1/√2, 1/√2).
Step 4: Project the Data onto PC1
PC1 scores = Z · v1 = [−2, −1, 0, 1, 2]
Result
The original two-dimensional data has been reduced to a single dimension (PC1). This retains
most of the variability in the data, as λ1=2 accounts for 100% of the variance.
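A minimal NumPy sketch of the same steps on this dataset (a from-scratch illustration following the five steps listed earlier; the signs of the eigenvector and scores may be flipped, which does not affect the result):

    import numpy as np

    # The small dataset from the worked example.
    X = np.array([[2, 3], [3, 5], [4, 7], [5, 9], [6, 11]], dtype=float)

    # Step 1: standardize (population standard deviation, as in the example).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized data (divide by n).
    C = (Z.T @ Z) / Z.shape[0]            # [[1, 1], [1, 1]]

    # Step 3: eigen-decomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues approx. [0, 2]

    # Step 4: keep the component with the largest eigenvalue (100% of variance).
    pc1 = eigvecs[:, np.argmax(eigvals)]  # approx. [0.707, 0.707] (sign may flip)

    # Step 5: project the standardized data onto PC1.
    scores = Z @ pc1                      # approx. [-2, -1, 0, 1, 2] (up to sign)
    print(eigvals, scores)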
Regression Analysis
Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. It predicts continuous/real values such as temperature, age, salary, price, etc.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining cause-and-effect relationships between variables.
Below are some other reasons for using Regression analysis:
o Regression estimates the relationship between the target and the
independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.
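For example, a minimal scikit-learn sketch of fitting a regression model to predict a continuous value (the experience/salary numbers are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: years of experience vs. salary (made-up numbers).
    experience = np.array([[1], [2], [3], [4], [5]])
    salary = np.array([30000, 35000, 41000, 45000, 52000])

    model = LinearRegression().fit(experience, salary)
    print(model.coef_, model.intercept_)   # fitted slope and intercept
    print(model.predict([[6]]))            # predicted salary for 6 years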
Types of Regression
There are various types of regression used in data science and machine learning, including linear, logistic, and polynomial regression.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid (logistic) function to model the data. The function can be represented as:
f(x) = 1 / (1 + e^(−x))
o It uses the concept of threshold levels: values above the threshold are rounded up to 1, and values below the threshold are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
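A minimal sketch of binary logistic regression with scikit-learn, using the sigmoid function and a 0.5 threshold; the hours-studied/pass-fail data are invented for the example:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Sigmoid (logistic) function: maps any real value into the range (0, 1).
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))  # 0.5 -> exactly on the decision boundary

    # Hypothetical binary problem: hours studied vs. pass (1) / fail (0).
    hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
    passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = LogisticRegression().fit(hours, passed)
    prob = clf.predict_proba([[4.5]])[0, 1]  # probability of passing
    label = int(prob >= 0.5)                 # threshold at 0.5 -> 0 or 1
    print(prob, label)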
Polynomial Regression:
o Polynomial Regression is a type of regression which models a non-linear dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the values of x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints follow a non-linear pattern; in such a case, linear regression will not fit those datapoints well. To cover such datapoints, we need polynomial regression.
o In polynomial regression, the original features are transformed into polynomial features of a given degree and then modeled using a linear model. This means the datapoints are best fitted using a polynomial curve.
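A minimal scikit-learn sketch of polynomial regression: the features are expanded to a given degree and then fitted with a linear model (the data below roughly follow a quadratic curve and are made up for illustration):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical non-linear data: y roughly follows a quadratic curve in x.
    x = np.array([[1], [2], [3], [4], [5], [6]])
    y = np.array([1.2, 4.1, 8.8, 16.3, 24.9, 36.2])

    # Transform x into polynomial features of degree 2, then fit a linear model.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(x, y)
    print(model.predict([[7]]))   # prediction from the fitted polynomial curve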