Unsupervised Learning: Exploring Data and Modeling Approaches
Unsupervised Learning: Core Idea
Unsupervised learning deals with finding patterns and structure in data that has not been labeled with predefined outcomes or categories. The goal is to explore the data, discover inherent groupings, reduce dimensionality, or find interesting relationships without a specific target variable to predict.
1. Exploring Data with Visualization (JMP Pro and Enterprise Guide)
• Purpose: To gain initial insights into the data's structure, distributions, relationships between variables,
and potential outliers before applying formal modeling techniques.
• JMP Pro & SAS Enterprise Guide: These are statistical software packages that offer extensive
visualization capabilities.
o JMP Pro: Highly interactive, known for dynamic linking of graphs, easy exploration of
distributions (histograms, box plots), scatter plot matrices, parallel plots, and tools for
identifying clusters visually.
o SAS Enterprise Guide: Provides a more project-based, code-driven (SAS code) approach but also
offers graphical tools for creating summaries, charts, and plots to understand data
characteristics.
• Key Visualizations for Unsupervised Learning:
o Histograms & Density Plots: To understand the distribution of individual variables.
o Scatter Plots & Scatter Plot Matrices: To see relationships and potential groupings between
pairs of variables.
o Box Plots: To compare distributions across different segments (if any are pre-identified or
hypothesized) and detect outliers.
o Parallel Coordinate Plots: Useful for visualizing high-dimensional data and identifying patterns
or clusters.
o Heatmaps: To visualize correlations between variables or the intensity of values in a matrix.
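The notes above frame these views in JMP Pro and SAS Enterprise Guide; as an illustrative sketch only, the same exploratory plots can be produced in Python with pandas, matplotlib, and seaborn. The file name and column names below ("customers.csv", "segment", "spend") are hypothetical placeholders, not part of the original notes.

```python
# Illustrative sketch: pandas/seaborn analogues of the exploratory views above.
# "customers.csv" and its columns are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("customers.csv")           # hypothetical dataset
num_cols = df.select_dtypes("number").columns

# Histograms: distribution of each numeric variable
df[num_cols].hist(bins=30, figsize=(10, 6))

# Scatter plot matrix: pairwise relationships and potential groupings
sns.pairplot(df[num_cols])

# Box plots: compare a variable across a hypothesized segment column
sns.boxplot(data=df, x="segment", y="spend")   # placeholder columns

# Heatmap of correlations between numeric variables
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")

# Parallel coordinate plot for a handful of variables
pd.plotting.parallel_coordinates(df, "segment", cols=list(num_cols[:4]))
plt.show()
```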
2. Principal Component Analysis (PCA)
• Purpose: A dimensionality reduction technique used to transform a large set of correlated variables
into a smaller set of uncorrelated variables called principal components (PCs). These PCs capture the
maximum possible variance from the original data.
• How it Works (Conceptually):
1. Finds the direction (PC1) in the data that explains the most variance.
2. Finds the next direction (PC2), orthogonal (uncorrelated) to PC1, that explains the most
remaining variance.
3. Continues this process until all variance is captured or a desired number of components is
reached.
• Use Cases:
o Reducing the number of features for faster computation or to avoid the curse of dimensionality
in subsequent modeling.
o Visualizing high-dimensional data in 2D or 3D using the first few PCs.
o Feature extraction for supervised learning.
o Noise reduction.
• Key Output: Eigenvalues (the variance explained by each PC) and eigenvectors (the loadings, i.e., the coefficients defining each PC); both are illustrated in the sketch below.
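A minimal sketch of these ideas, assuming scikit-learn on synthetic data rather than the JMP/SAS tools discussed in these notes: the eigenvalues, variance ratios, and eigenvectors described under Key Output correspond to the explained_variance_, explained_variance_ratio_, and components_ attributes.

```python
# Minimal PCA sketch with scikit-learn on synthetic data (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # placeholder data: 200 rows, 5 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]           # make two columns correlated so PCA has structure to find

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA()
scores = pca.fit_transform(X_std)           # observations projected onto the PCs

print(pca.explained_variance_)              # eigenvalues: variance captured by each PC
print(pca.explained_variance_ratio_)        # proportion of total variance per PC
print(pca.components_)                      # eigenvectors (loadings) defining each PC

# Keep only enough components to cover, say, 90% of the variance
pca_90 = PCA(n_components=0.90).fit(X_std)
print(pca_90.n_components_)
```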
3. Cluster Analysis
• Purpose: To group a set of objects (data points, observations) in such a way that objects in the same
group (called a cluster) are more similar to each other than to those in other clusters.
• Common Types:
o K-Means Clustering:
▪ Algorithm: Partitions the data into k clusters by iteratively assigning each point to the nearest cluster centroid and then recalculating each centroid as the mean of its assigned points (see the sketch at the end of this section).
▪ Requires: The number of clusters (k) to be specified beforehand. Sensitive to initial centroid placement and to feature scaling.
o Hierarchical Clustering:
▪ Algorithm: Builds a hierarchy of clusters either agglomeratively (bottom-up, starting
with individual points and merging them) or divisively (top-down, starting with one
cluster and splitting it).
▪ Output: A dendrogram (tree diagram) showing the merge/split sequence; the number of clusters can then be chosen by cutting the dendrogram at a given height.
▪ Does not require 'k' upfront.
o Density-Based Clustering (e.g., DBSCAN):
▪ Algorithm: Groups points that are closely packed (points with many nearby neighbors) and marks points that lie alone in low-density regions as outliers/noise.
▪ Can find arbitrarily shaped clusters and handles noise well.
• Use Cases: Customer segmentation, anomaly detection, image segmentation, document grouping.
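The sketch below illustrates the three approaches on synthetic data, assuming scikit-learn and SciPy (not the JMP/SAS platforms these notes describe); parameter values such as eps=0.3 are illustrative choices, not recommendations.

```python
# Illustrative clustering sketch on synthetic data (scikit-learn + SciPy).
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, dendrogram

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)       # distance-based methods are scale-sensitive

# K-means: k must be chosen beforehand; multiple restarts (n_init) reduce
# sensitivity to the initial centroid placement
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])
print(km.cluster_centers_)

# Agglomerative (bottom-up) hierarchical clustering: k is chosen afterwards
# by cutting the dendrogram at some height
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="lastp", p=20)
plt.show()

# DBSCAN: no k required; points in low-density regions get the noise label -1
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(sorted(set(db.labels_)))              # cluster ids, with -1 marking noise/outliers
```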
4. Variables Clustering (or Feature Clustering)
• Purpose: Unlike cluster analysis which groups observations, variables clustering groups variables
(features) that are similar or redundant. The goal is to identify groups of variables that share similar
information content or are highly correlated.
• How it Differs from PCA:
o PCA creates new, synthetic variables (principal components).
o Variables clustering groups the original variables.
• How it Works (Conceptually): Uses measures of similarity or association between variables (e.g., correlation, mutual information) to group them. Often involves hierarchical clustering principles applied to the variables rather than the observations (see the sketch at the end of this section).
• Use Cases:
o Understanding relationships and redundancies among variables.
o Feature selection: selecting a representative variable from each cluster.
o Reducing multicollinearity in regression models.
o Simplifying the interpretation of complex datasets by grouping related concepts.
o In JMP Pro, this is often found under "Multivariate Methods" and can help in selecting a subset
of variables that represent distinct underlying dimensions.
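A simple stand-in for variables clustering, assuming SciPy and synthetic data: hierarchical clustering is applied to the variables using 1 - |correlation| as the distance, so highly correlated (redundant) variables land in the same cluster. This mimics the idea only; it is not a reimplementation of any particular JMP or SAS procedure.

```python
# Sketch: cluster variables (not observations) via hierarchical clustering on a
# correlation-based distance. Synthetic data; illustrative only.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
df = pd.DataFrame({
    "x1": base[:, 0],
    "x2": base[:, 0] + 0.1 * rng.normal(size=500),   # nearly redundant with x1
    "x3": base[:, 1],
    "x4": base[:, 1] + 0.1 * rng.normal(size=500),   # nearly redundant with x3
    "x5": base[:, 2],
})

# Distance between variables: 1 - |correlation| (highly correlated variables are "close")
dvals = (1 - df.corr().abs()).values
condensed = dvals[np.triu_indices_from(dvals, k=1)]   # condensed form expected by linkage()

Z = linkage(condensed, method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")   # cut the tree at distance 0.5
print(dict(zip(df.columns, clusters)))                # {x1, x2} and {x3, x4} group together

# Feature selection idea: keep one representative variable from each cluster.
```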
5. Market Basket Analysis (Association Analysis)
• Purpose: To discover associations or co-occurrence relationships among a set of items in transactional
data. The classic example is finding products frequently bought together in a supermarket.
• Core Idea: Identifies "if-then" rules (e.g., "IF {Bread, Butter} THEN {Milk}").
• Key Metrics:
o Support: The fraction of transactions that contain a particular itemset (e.g., {Bread, Butter,
Milk}).
▪ Support(X) = (Number of transactions containing X) / (Total number of transactions)
o Confidence: The conditional probability that a transaction containing itemset X also contains
itemset Y. Measures how often items in Y appear in transactions that contain X.
▪ Confidence(X -> Y) = Support(X U Y) / Support(X)
o Lift: Measures how much more likely itemset Y is to be purchased when itemset X is purchased, compared to how often Y is purchased overall. It indicates the strength of the association beyond what would be expected by chance.
▪ Lift(X -> Y) = Support(X U Y) / (Support(X) * Support(Y)) = Confidence(X -> Y) / Support(Y)
▪ Lift > 1: Positive association (Y is more likely to be bought when X is bought than in general).
▪ Lift < 1: Negative association (Y is less likely to be bought when X is bought).
▪ Lift = 1: No association (X and Y occur independently).
• Algorithm Example: The Apriori algorithm is a classic method for finding frequent itemsets, which are then used to generate association rules; a small worked example appears at the end of this section.
• Use Cases: Retail layout optimization, product recommendations, targeted marketing, fraud detection.
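A small worked example of the three metrics on a hypothetical five-transaction basket list, written in plain Python directly from the definitions above (not a full Apriori implementation):

```python
# Worked example of support, confidence, and lift on hypothetical transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread", "butter"}, {"milk"}
sup_xy = support(X | Y)
conf = sup_xy / support(X)                  # Confidence(X -> Y) = Support(X U Y) / Support(X)
lift = sup_xy / (support(X) * support(Y))   # Lift(X -> Y) = Support(X U Y) / (Support(X) * Support(Y))

print(f"Support(X U Y) = {sup_xy:.2f}")     # 2/5 = 0.40
print(f"Confidence     = {conf:.2f}")       # 0.40 / 0.60 = 0.67
print(f"Lift           = {lift:.2f}")       # 0.40 / (0.60 * 0.80) = 0.83 (< 1: slight negative
                                            # association in this toy data)
```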
The Problem of Explanatory (Traditional) vs. Predictive Modeling and Why It Matters
This distinction is fundamental in data analysis and statistics.
1. Explanatory Modeling (Traditional / Causal Inference Focus)
o Primary Goal: To understand and quantify the relationship between a set of input variables
(predictors, independent variables) and an outcome variable (dependent variable). The focus is
on interpreting the model coefficients to explain how or why changes in predictors affect the
outcome.
o Emphasis: Causal inference (though often hard to prove definitively), understanding
mechanisms, testing hypotheses derived from theory.
o Model Complexity: Often prefers simpler models (e.g., linear regression, logistic regression)
where coefficients are easily interpretable and statistical significance can be assessed.
o Evaluation: Goodness-of-fit statistics (e.g., R-squared, p-values for coefficients, AIC, BIC),
residual analysis, adherence to model assumptions.
o Example: A sociologist wants to understand the factors (e.g., education, income, location) that
explain differences in life satisfaction scores. The goal is to understand the individual impact of
each factor.
2. Predictive Modeling (Machine Learning Focus)
o Primary Goal: To develop a model that can accurately predict future or unseen outcomes
based on new input data. The focus is on the model's predictive accuracy, not necessarily on
understanding the exact contribution of each individual predictor.
o Emphasis: Generalization to new data, minimizing prediction error.
o Model Complexity: Can use complex, "black-box" models (e.g., neural networks, random
forests, gradient boosting) if they provide better predictive performance, even if their internal
workings are hard to interpret.
o Evaluation: Performance on unseen test data (e.g., accuracy, precision, recall, F1-score for
classification; RMSE, MAE for regression), cross-validation.
o Example: A company wants to build a model to predict which customers are likely to churn next
month based on their past behavior and demographics. The primary goal is accurate prediction
to target retention efforts.
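A hedged sketch of the contrast, assuming statsmodels and scikit-learn on synthetic data: the explanatory view fits an interpretable OLS model and inspects coefficients, p-values, and R-squared, while the predictive view scores a flexible "black-box" model purely on cross-validated prediction error.

```python
# Sketch contrasting the two mindsets on the same synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({"education": rng.normal(size=300),
                   "income": rng.normal(size=300),
                   "urban": rng.integers(0, 2, size=300)})
df["satisfaction"] = 0.5 * df["education"] + 0.3 * df["income"] + rng.normal(scale=0.5, size=300)

# Explanatory view: interpretable linear model; focus on coefficients and inference.
X = sm.add_constant(df[["education", "income", "urban"]])
ols = sm.OLS(df["satisfaction"], X).fit()
print(ols.summary())                        # coefficients, p-values, R-squared

# Predictive view: flexible model judged only on how well it predicts unseen data.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rmse = -cross_val_score(rf, df[["education", "income", "urban"]], df["satisfaction"],
                        scoring="neg_root_mean_squared_error", cv=5)
print("Cross-validated RMSE:", rmse.mean())
```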
Why the Distinction Matters:
• Different Goals Lead to Different Methods:
o If explaining why, you prioritize interpretable models and rigorous statistical inference about
coefficients.
o If predicting what, you prioritize models that generalize well to new data, even if they are
complex and less interpretable.
• Model Selection Criteria:
o Explanatory: Significance of predictors, goodness-of-fit, theoretical relevance.
o Predictive: Predictive accuracy on hold-out data.
• Interpretation of Results:
o Explanatory: Focus on the meaning, magnitude, and significance of coefficients (e.g., "a one-
unit increase in X is associated with a B-unit change in Y, holding other factors constant").
o Predictive: Focus on the overall accuracy and reliability of predictions. Individual feature
importance might be assessed, but it's secondary to predictive power.
• Risk of Misapplication:
o Using a purely predictive model for explanation can be misleading, because the correlations and complex interactions it learns do not imply causation or simple, stable relationships.
o Over-focusing on interpretability in a purely predictive task might sacrifice accuracy.