
UNIT 5:

Data Reduction: Overview of Data Reduction Strategies, Wavelet Transforms, Principal Components Analysis, Attribute Subset Selection, Regression and Log-Linear Models: Parametric Data Reduction, Histograms, Clustering, Sampling, Data Cube Aggregation.

Data Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.

Overview of Data Reduction Strategies

Data reduction strategies aim to minimize the volume of data while maintaining its integrity and relevance for analysis. These techniques are crucial for handling large datasets efficiently: they reduce storage requirements and improve computational efficiency and model performance. Below are the main data reduction strategies:

1. Dimensionality Reduction

 Reduces the number of features or variables while preserving essential information.

 Techniques:

o Principal Component Analysis (PCA): Transforms the data into a lower-dimensional space by identifying the directions (principal components) of maximum variance.

o Linear Discriminant Analysis (LDA): Finds a linear combination of features that separates classes while reducing dimensions.

o Feature Selection: Selects a subset of relevant features using methods like correlation analysis, recursive feature elimination, or mutual information.

2. Data Compression

 Reduces the data volume by encoding it in a more compact format.

 Techniques:

o Lossless Compression: Ensures no loss of information (e.g., Huffman coding, run-length encoding).

o Lossy Compression: Reduces data size by discarding less critical information (e.g., JPEG for images, MP3 for audio).

o Vector Quantization: Encodes data points by assigning them to clusters represented by prototypes.

3. Data Sampling

 Selects a representative subset of data from a larger dataset (a code sketch follows the list of techniques below).


 Techniques:

o Random Sampling: Randomly selects a subset of records.

o Stratified Sampling: Ensures that the subset maintains the proportion of different
groups (e.g., class labels).

o Systematic Sampling: Selects data points at regular intervals.

o Cluster Sampling: Groups the data into clusters and randomly samples from them.
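
As a minimal sketch of these sampling schemes in Python with pandas (the DataFrame and its column names are hypothetical, purely for illustration):

import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "label": ["A"] * 70 + ["B"] * 30,   # imbalanced groups
})

# Random sampling: 10% of records chosen uniformly at random.
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: sample 10% within each group so the original
# 70/30 label proportion is preserved in the subset.
stratified_sample = (
    df.groupby("label", group_keys=False)
      .sample(frac=0.1, random_state=42)
)

# Systematic sampling: take every 10th record.
systematic_sample = df.iloc[::10]

print(stratified_sample["label"].value_counts())   # A: 7, B: 3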

4. Data Aggregation

 Summarizes and combines data to reduce its size.

 Techniques:

o Averaging: Replacing multiple data points with their mean or median.

o Summarization: Using statistical measures like sum, max, min, or count.

o Binning: Dividing continuous data into intervals and summarizing each bin (e.g.,
frequency counts).

5. Data Transformation

 Transforms data into a simpler, more manageable format (see the sketch after the list below).

 Techniques:

o Normalization: Scales data to a smaller range (e.g., [0, 1]).

o Discretization: Converts continuous variables into categorical bins.

o Log Transformation: Reduces skewness in the data distribution.
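
As a minimal sketch of these transformations in Python (the data is synthetic and purely illustrative):

import numpy as np
import pandas as pd

x = pd.Series([2.0, 5.0, 9.0, 40.0, 100.0])

# Normalization: rescale values to the [0, 1] range.
x_norm = (x - x.min()) / (x.max() - x.min())

# Discretization: convert the continuous values into 3 categorical bins.
x_bins = pd.cut(x, bins=3, labels=["low", "medium", "high"])

# Log transformation: reduce right skew (log1p handles zeros safely).
x_log = np.log1p(x)

print(x_norm.round(2).tolist())   # [0.0, 0.03, 0.07, 0.39, 1.0]
print(x_bins.tolist())            # ['low', 'low', 'low', 'medium', 'high']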

6. Clustering

 Groups similar data points into clusters and uses cluster representatives for analysis.

 Example: Representing data by centroids of clusters (e.g., using k-means or hierarchical clustering), as in the sketch below.
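
A minimal sketch of this idea with scikit-learn's k-means (the three-cluster data below is synthetic):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 two-dimensional points drawn around three centers.
points = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# The 3 centroids now stand in for all 300 original points.
print(kmeans.cluster_centers_)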

7. Numerosity Reduction

 Reduces the number of data points by replacing them with models or patterns.

 Techniques:

o Parametric Methods: Approximate data with mathematical models (e.g., regression models).
o Non-Parametric Methods: Use histograms, decision trees, or clustering to
approximate data.

8. Data Cube Aggregation

 Used in OLAP (Online Analytical Processing) to aggregate data across multiple dimensions for
easier analysis.

 Example: Summarizing sales data by region, time, or product.
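
A minimal sketch of such an aggregation with pandas (the sales records and column names are made up):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "amount":  [100, 150, 200, 250, 120, 80],
})

# Roll detailed records up to the (region, quarter) level, as an
# OLAP roll-up would.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)   # East: Q1=220, Q2=150; West: Q1=280, Q2=250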

9. Data Pruning

 Removes irrelevant, redundant, or noisy data points.

 Techniques:

o Outlier Detection and Removal: Filters out anomalous data points.

o Redundancy Elimination: Removes duplicate or highly correlated data.

10. Instance Selection

 Selects a subset of instances that contribute most to the data patterns.

 Example: Retaining boundary instances in a classification problem (e.g., support vectors in SVM).

Wavelet Transforms:

Wavelet transforms are a powerful mathematical tool widely used in data science and signal
processing to analyze and process signals, images, and data sets. They are particularly effective for
applications requiring multi-resolution analysis, where features at different scales or frequencies
need to be examined.

What Are Wavelet Transforms?

Wavelet transforms decompose a signal into components corresponding to different frequency bands and time intervals. Unlike Fourier transforms, which analyze signals in terms of sine and cosine functions with infinite support, wavelets are localized both in time and frequency. This makes them particularly useful for non-stationary signals where frequency characteristics change over time.

Types of Wavelet Transforms

1. Continuous Wavelet Transform (CWT):

o Provides a continuous representation of a signal in terms of scaled and translated versions of a wavelet.

o Useful for detailed signal analysis.

o Outputs a two-dimensional representation, often visualized as a scalogram.


2. Discrete Wavelet Transform (DWT):

o Operates on discrete data and provides a multi-resolution representation.

o Commonly used in compression (e.g., JPEG2000) and denoising.

o Outputs coefficients corresponding to approximations (low-frequency components) and details (high-frequency components).

3. Stationary Wavelet Transform (SWT):

o A redundant version of the DWT that avoids downsampling, ensuring better shift-invariance.

o Useful in applications like denoising.

4. Wavelet Packet Transform (WPT):

o Extends DWT by further decomposing both approximation and detail coefficients.

o Provides a more granular frequency analysis.

Applications in Data Science

Wavelet transforms are versatile and find applications in various fields:

1. Signal Processing

 Noise Reduction: Removing noise from time-series data while preserving essential features.

 Feature Extraction: Identifying key patterns or characteristics in signals for tasks like
classification or forecasting.

 Anomaly Detection: Detecting irregularities in time-series or sensor data.

2. Image Processing

 Compression: Used in formats like JPEG2000 to achieve high compression ratios with
minimal loss of quality.

 Denoising: Suppressing noise in images while retaining edges and details.

 Feature Detection: Enhancing edges or texture patterns for computer vision applications.

3. Time-Series Analysis

 Multi-resolution decomposition of time-series data to analyze trends, seasonality, and noise at different scales.

 Useful in financial data analysis, earthquake prediction, and biomedical signal analysis.

4. Machine Learning

 Feature Engineering: Wavelet coefficients can serve as input features to machine learning
models for classification or regression tasks.
 Dimensionality Reduction: Decomposing data into wavelet domains can help focus on the
most informative components.

5. Biomedical Applications

 Analysis of EEG, ECG, and other physiological signals for diagnosis and monitoring.

 Detection of seizures, arrhythmias, or other anomalies.

Advantages of Wavelet Transforms

 Localization: Handles signals with both time and frequency localization.

 Multi-Resolution: Provides a hierarchical view of data.

 Versatility: Can be applied to 1D, 2D, or even 3D data.

 Noise Robustness: Effective in separating noise from meaningful signals.

Popular Wavelet Families

 Haar: Simple and efficient for piecewise constant signals.

 Daubechies (db): General-purpose wavelets with varying smoothness.

 Coiflets (coif): Provide vanishing moments for both the wavelet and scaling function.

 Symlets: Symmetric versions of Daubechies wavelets.

 Meyer and Morlet: Continuous wavelets for smooth signal analysis.

Tools and Libraries

 Python Libraries:

o PyWavelets (pywt): Comprehensive library for wavelet transform in Python.

o scipy.signal: Includes basic wavelet transform functionalities.

 MATLAB: Offers extensive wavelet toolboxes.

 R Packages: Libraries like wavelets and wavethresh support wavelet analysis.
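
As a minimal sketch with PyWavelets (assuming it is installed, e.g. via pip install PyWavelets); note that pywt's Haar filters are orthonormal, so its coefficients differ by factors of √2 from the plain average/difference scheme used in the worked example below:

import pywt

signal = [4, 6, 10, 12, 14, 16, 18, 20]

# Single-level DWT: cA holds the approximation (low-frequency)
# coefficients, cD the detail (high-frequency) coefficients.
cA, cD = pywt.dwt(signal, "haar")

# Multi-level decomposition down to a single approximation coefficient.
coeffs = pywt.wavedec(signal, "haar", level=3)
print(coeffs)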

Haar Wavelet Transform Overview

The Haar wavelet transform decomposes a signal into approximation and detail coefficients. It operates on pairs of data points:

 Approximation coefficients: Represent the average of two data points.

 Detail coefficients: Represent half the difference between two data points.

Example: Haar Wavelet Transform on a 1D Signal

Let's solve an example step-by-step.

Given Signal

We have a signal with 8 data points:

x = [4, 6, 10, 12, 14, 16, 18, 20]

Step 1: Pairwise Operations

For each pair of values in the signal, calculate:

 The average (approximation coefficient).

 Half the difference (detail coefficient).

First Level Transformation

Pairs: (4,6), (10,12), (14,16), (18,20)

Approximation coefficients: (4+6)/2 = 5, (10+12)/2 = 11, (14+16)/2 = 15, (18+20)/2 = 19

So, approximation = [5, 11, 15, 19]

Detail coefficients: (4−6)/2 = −1, (10−12)/2 = −1, (14−16)/2 = −1, (18−20)/2 = −1

So, detail = [−1, −1, −1, −1]

The signal becomes:

x = [5, 11, 15, 19 ∣ −1, −1, −1, −1]

Step 2: Repeat on Approximation Coefficients

Take the new approximation coefficients [5, 11, 15, 19] and repeat the process.

Pairs: (5,11), (15,19)

Approximation coefficients: (5+11)/2 = 8, (15+19)/2 = 17

So, approximation = [8, 17]

Detail coefficients: (5−11)/2 = −3, (15−19)/2 = −2

So, detail = [−3, −2]

The signal becomes:

x = [8, 17 ∣ −3, −2 ∣ −1, −1, −1, −1]

Step 3: Final Level

Take the new approximation coefficients [8, 17] and repeat the process.

Pair: (8,17)

Approximation coefficient: (8+17)/2 = 12.5

So, approximation = [12.5]

Detail coefficient: (8−17)/2 = −4.5

So, detail = [−4.5]

The final signal becomes:

x = [12.5 ∣ −4.5 ∣ −3, −2 ∣ −1, −1, −1, −1]

Final Transformed Signal

The Haar wavelet transform decomposes the original signal into:

 Approximation coefficient: [12.5]

 Detail coefficients: [−4.5, −3, −2, −1, −1, −1, −1]

Summary

At each level:

 Approximation coefficients represent the smoothed version of the signal.

 Detail coefficients capture the differences at different resolutions.

This decomposition is useful for compression and denoising, as smaller detail coefficients can often be discarded or thresholded.
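
A minimal Python sketch of the average/half-difference scheme used above (libraries such as PyWavelets implement an orthonormal variant whose coefficients differ by factors of √2):

def haar_step(values):
    """One level: pairwise averages and half-differences."""
    approx = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
    detail = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])]
    return approx, detail

def haar_transform(values):
    """Full decomposition of a signal whose length is a power of two."""
    approx, details = list(values), []
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details = d + details   # finer-scale details go to the right
    return approx + details

print(haar_transform([4, 6, 10, 12, 14, 16, 18, 20]))
# [12.5, -4.5, -3.0, -2.0, -1.0, -1.0, -1.0, -1.0]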

Principal Components Analysis

Principal Components Analysis (PCA) is a widely used statistical technique for data reduction
and dimensionality reduction. PCA transforms a high-dimensional dataset into a smaller set
of principal components that retain most of the variance (information) in the data.

Here’s an explanation and example to illustrate PCA for data reduction:

Steps of PCA for Data Reduction

1. Standardize the Data: Center the data by subtracting the mean and scaling it by the standard
deviation to ensure all variables are on the same scale.

2. Compute the Covariance Matrix: Calculate the covariance matrix to measure relationships
between variables.

3. Calculate Eigenvalues and Eigenvectors: Compute eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the directions of the principal components, and the eigenvalues indicate their variance (importance).

4. Select Principal Components: Choose the top k principal components that explain the majority of the variance (e.g., retain components that explain 95% of the total variance).

5. Transform Data: Project the original data onto the selected principal components to reduce
the dimensionality.

Applications of PCA in Data Reduction

1. Image Compression: Reduce pixel dimensions while preserving key features.

2. Gene Expression Data: Reduce thousands of gene expression variables to a few components for biological analysis.

3. Customer Segmentation: Simplify customer data with multiple features into a few
interpretable components.

Here's a simple solved example to understand how PCA reduces dimensionality while preserving as much variability as possible:

Problem

You have a dataset with the following two variables:

X1 X2

2 3

3 5

4 7

5 9

6 11

You want to reduce the dimensionality of this dataset using PCA.

Step-by-Step Solution

Step 1: Standardize the Data

PCA requires standardizing the data to have a mean of 0 and a standard deviation of 1. The formula for standardization is:

Z = (X − μ) / σ

Using the population standard deviation (σ = 1.414 for X1 and σ = 2.828 for X2), the standardized values for X1 and X2 are:

X1 (Standardized) X2 (Standardized)

-1.414 -1.414

-0.707 -0.707

0 0

0.707 0.707

1.414 1.414
Steps 2-5: Covariance Matrix, Eigen Decomposition, and Projection

The standardized X1 and X2 columns are identical, so the covariance matrix of the standardized data is [[1, 1], [1, 1]]. Its eigenvalues are λ1 = 2 and λ2 = 0, and the eigenvector for λ1 is (1/√2)(1, 1). Projecting the standardized data onto this eigenvector gives the one-dimensional scores PC1 = (Z1 + Z2)/√2 = [−2, −1, 0, 1, 2].

Result

The original two-dimensional data has been reduced to a single dimension (PC1). This retains all of the variability in the data, as λ1 = 2 accounts for 100% of the variance.

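As a minimal sketch of the same reduction with scikit-learn (the sign of the projected scores is an arbitrary convention and may be flipped):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2, 3], [3, 5], [4, 7], [5, 9], [6, 11]], dtype=float)

# StandardScaler divides by the population standard deviation,
# matching the hand calculation above.
Z = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
pc1 = pca.fit_transform(Z)

print(pc1.ravel())                    # proportional to [-2, -1, 0, 1, 2]
print(pca.explained_variance_ratio_)  # [1.] -> PC1 explains 100%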

Regression Analysis
Regression analysis is a statistical method for modeling the relationship
between a dependent (target) variable and one or more independent
(predictor) variables. It predicts continuous/real values such as
temperature, age, salary, price, etc.
Regression is a supervised learning technique that helps in finding the
correlation between variables and enables us to predict a continuous
output variable based on one or more predictor variables. It is mainly
used for prediction, forecasting, time-series modeling, and determining
cause-effect relationships between variables.
Below are some other reasons for using Regression analysis:
o Regression estimates the relationship between the target and the
independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor,
the least important factor, and how each factor affects the others.
Types of Regression
There are various types of regression used in data science and machine
learning. Here we discuss some important types, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
o Linear Regression: Linear regression is a statistical regression method
which is used for predictive analysis.
o It is one of the simplest regression algorithms; it shows the
relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis),
hence the name linear regression.
o If there is only one input variable (x), it is called simple linear
regression; if there is more than one input variable, it is called
multiple linear regression.
o For example, a linear regression model can predict the salary of an
employee on the basis of years of experience.
o Below is the mathematical equation for linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
(A minimal code sketch follows the list of applications below.)
Some popular applications of linear regression are:
o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.
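
A minimal sketch of fitting Y = aX + b with NumPy's least-squares polyfit (the experience/salary numbers are invented for illustration):

import numpy as np

experience = np.array([1, 2, 3, 4, 5], dtype=float)   # years
salary = np.array([30, 35, 41, 44, 50], dtype=float)  # in thousands

# Degree-1 polyfit returns the slope a and intercept b of Y = aX + b.
a, b = np.polyfit(experience, salary, deg=1)
print(f"Y = {a:.2f}X + {b:.2f}")   # Y = 4.90X + 25.30

# Predict the salary for 6 years of experience.
print(a * 6 + b)                   # approximately 54.7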
o Multiple Linear Regression
o This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn
where:
o Y is the dependent variable
o X1, X2, …, Xn are the independent variables
o β0 is the intercept
o β1, β2, …, βn are the slopes
o The goal of the algorithm is to find the best-fit line equation that can
predict the values based on the independent variables.
o In regression, a set of records with X and Y values is used to learn a
function; this learned function can then be used to predict Y for an
unseen X. Since the output is continuous, regression requires a function
that predicts a continuous Y given X as the independent features. A
minimal sketch is given below.
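
A minimal sketch of multiple linear regression with scikit-learn, on a synthetic two-feature dataset whose true coefficients are known in advance:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1]   # true β0=3, β1=2, β2=0.5

model = LinearRegression().fit(X, y)
print(model.intercept_)   # ~3.0  (β0)
print(model.coef_)        # ~[2.0, 0.5]  (β1, β2)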

Logistic Regression:
o Logistic regression is another supervised learning algorithm which is
used to solve the classification problems. In classification problems, we
have dependent variables in a binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as
0 or 1, Yes or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of
probability.
o Logistic regression is a type of regression, but it is different from the
linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (logistic function) to
model the data. The function can be represented as:

f(x) = 1 / (1 + e^(−x))

o f(x) = output between 0 and 1.
o x = input to the function.
o e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives an
S-shaped curve.

o It uses the concept of threshold levels: values above the threshold are
rounded up to 1, and values below the threshold are rounded down to 0.
(A minimal sketch follows the list below.)
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
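
A minimal sketch of binary logistic regression with scikit-learn (the study-hours/pass data is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)  # study hours
passed = np.array([0, 0, 0, 1, 1, 1])                          # 0/1 labels

clf = LogisticRegression().fit(hours, passed)

# predict_proba applies the sigmoid, giving values between 0 and 1;
# predict applies the 0.5 threshold described above.
print(clf.predict_proba([[3.5]]))   # probabilities for classes 0 and 1
print(clf.predict([[3.5]]))         # thresholded class label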
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-
linear dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve
between the value of x and corresponding conditional values of y.
o Suppose there is a dataset whose data points are distributed in a
non-linear fashion; in such a case, linear regression will not fit those
data points well. To cover such data points, we need polynomial
regression.
o In polynomial regression, the original features are transformed into
polynomial features of a given degree and then modeled using a linear
model, which means the data points are fitted using a polynomial curve.

o The equation for polynomial regression is also derived from the linear
regression equation: the linear regression equation Y = b0 + b1x is
transformed into the polynomial regression equation
Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression
coefficients, and x is the independent/input variable.
o The model is still linear because it is linear in the coefficients, even
though the features include quadratic and higher-degree terms of x.
Note: This differs from multiple linear regression in that, in polynomial
regression, a single variable appears at different degrees instead of
multiple variables each appearing with the same degree. A minimal
sketch is given below.
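
A minimal sketch of polynomial regression as described above: transform x into polynomial features, then fit an ordinary linear model (the quadratic data is synthetic):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(-3, 4, dtype=float).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 3.0 * x.ravel() ** 2   # true b0=1, b1=2, b2=3

# Transform x into [x, x^2], then fit a linear model on those features.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)

model = LinearRegression().fit(X_poly, y)
print(model.intercept_)   # ~1.0  (b0)
print(model.coef_)        # ~[2.0, 3.0]  (b1, b2)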
The Log-linear Regression Model

Log-linear models approximate discrete multidimensional probability
distributions: the probability of each cell in a multidimensional space
of discretized attributes is estimated from a smaller set of
lower-dimensional combinations of the attributes. Like regression,
log-linear models can therefore be used for parametric data reduction,
since only the model parameters need to be stored rather than the
full data.
