Unsupervised Learning: Exploring Data and Modeling Approaches
Unsupervised Learning: Core Idea
Unsupervised learning deals with finding patterns and structure in data that has not been labeled with predefined outcomes or categories. The goal is to explore the data, discover inherent groupings, reduce dimensionality, or find interesting relationships without a specific target variable to predict.
1. Exploring Data with Visualization (JMP Pro and Enterprise Guide)
• Purpose: To gain initial insights into the data's structure, distributions, relationships between variables,
and potential outliers before applying formal modeling techniques.
• JMP Pro & SAS Enterprise Guide: These are statistical software packages that offer extensive
visualization capabilities.
o JMP Pro: Highly interactive, known for dynamic linking of graphs, easy exploration of
distributions (histograms, box plots), scatter plot matrices, parallel plots, and tools for
identifying clusters visually.
o SAS Enterprise Guide: Provides a more project-based, code-driven (SAS code) approach but also
offers graphical tools for creating summaries, charts, and plots to understand data
characteristics.
• Key Visualizations for Unsupervised Learning:
o Histograms & Density Plots: To understand the distribution of individual variables.
o Scatter Plots & Scatter Plot Matrices: To see relationships and potential groupings between
pairs of variables.
o Box Plots: To compare distributions across different segments (if any are pre-identified or
hypothesized) and detect outliers.
o Parallel Coordinate Plots: Useful for visualizing high-dimensional data and identifying patterns
or clusters.
o Heatmaps: To visualize correlations between variables or the intensity of values in a matrix.
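The notes above frame these views in JMP Pro and SAS Enterprise Guide; as an illustrative sketch only, the same exploratory plots can be produced in Python with pandas, matplotlib, and seaborn. The file name and column names below ("customers.csv", "segment", "spend") are hypothetical placeholders, not part of the original notes.

```python
# Illustrative sketch: pandas/seaborn analogues of the exploratory views above.
# "customers.csv" and its columns are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("customers.csv")           # hypothetical dataset
num_cols = df.select_dtypes("number").columns

# Histograms: distribution of each numeric variable
df[num_cols].hist(bins=30, figsize=(10, 6))

# Scatter plot matrix: pairwise relationships and potential groupings
sns.pairplot(df[num_cols])

# Box plots: compare a variable across a hypothesized segment column
sns.boxplot(data=df, x="segment", y="spend")   # placeholder columns

# Heatmap of correlations between numeric variables
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")

# Parallel coordinate plot for a handful of variables
pd.plotting.parallel_coordinates(df, "segment", cols=list(num_cols[:4]))
plt.show()
```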
2. Principal Component Analysis (PCA)
• Purpose: A dimensionality reduction technique used to transform a large set of correlated variables
into a smaller set of uncorrelated variables called principal components (PCs). These PCs capture the
maximum possible variance from the original data.
• How it Works (Conceptually):
1. Finds the direction (PC1) in the data that explains the most variance.
2. Finds the next direction (PC2), orthogonal (uncorrelated) to PC1, that explains the most
remaining variance.
3. Continues this process until all variance is captured or a desired number of components is
reached.
• Use Cases:
o Reducing the number of features for faster computation or to avoid the curse of dimensionality
in subsequent modeling.
o Visualizing high-dimensional data in 2D or 3D using the first few PCs.
o Feature extraction for supervised learning.
o Noise reduction.
• Key Output: Eigenvalues (the variance explained by each PC) and eigenvectors (the loadings, i.e., the coefficients defining each PC); both are illustrated in the sketch below.
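A minimal sketch of these ideas, assuming scikit-learn on synthetic data rather than the JMP/SAS tools discussed in these notes: the eigenvalues, variance ratios, and eigenvectors described under Key Output correspond to the explained_variance_, explained_variance_ratio_, and components_ attributes.

```python
# Minimal PCA sketch with scikit-learn on synthetic data (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # placeholder data: 200 rows, 5 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]           # make two columns correlated so PCA has structure to find

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA()
scores = pca.fit_transform(X_std)           # observations projected onto the PCs

print(pca.explained_variance_)              # eigenvalues: variance captured by each PC
print(pca.explained_variance_ratio_)        # proportion of total variance per PC
print(pca.components_)                      # eigenvectors (loadings) defining each PC

# Keep only enough components to cover, say, 90% of the variance
pca_90 = PCA(n_components=0.90).fit(X_std)
print(pca_90.n_components_)
```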
3. Cluster Analysis
• Purpose: To group a set of objects (data points, observations) in such a way that objects in the same
group (called a cluster) are more similar to each other than to those in other clusters.
• Common Types:
o K-Means Clustering:
▪ Algorithm: Partitions the data into k clusters by iteratively assigning each point to the nearest cluster centroid and then recalculating each centroid as the mean of its assigned points (see the sketch at the end of this section).
▪ Requires: The number of clusters (k) to be specified beforehand. Sensitive to initial centroid placement and to feature scaling.
o Hierarchical Clustering:
▪ Algorithm: Builds a hierarchy of clusters either agglomeratively (bottom-up, starting
with individual points and merging them) or divisively (top-down, starting with one
cluster and splitting it).
▪ Output: A dendrogram (tree diagram) showing the merge/split sequence; the number of clusters can then be chosen by cutting the dendrogram at a given height.
▪ Does not require 'k' upfront.
o Density-Based Clustering (e.g., DBSCAN):
▪ Algorithm: Groups points that are closely packed (points with many nearby neighbors) and marks points that lie alone in low-density regions as outliers/noise.
▪ Can find arbitrarily shaped clusters and handles noise well.
• Use Cases: Customer segmentation, anomaly detection, image segmentation, document grouping.
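The sketch below illustrates the three approaches on synthetic data, assuming scikit-learn and SciPy (not the JMP/SAS platforms these notes describe); parameter values such as eps=0.3 are illustrative choices, not recommendations.

```python
# Illustrative clustering sketch on synthetic data (scikit-learn + SciPy).
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import linkage, dendrogram

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)       # distance-based methods are scale-sensitive

# K-means: k must be chosen beforehand; multiple restarts (n_init) reduce
# sensitivity to the initial centroid placement
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])
print(km.cluster_centers_)

# Agglomerative (bottom-up) hierarchical clustering: k is chosen afterwards
# by cutting the dendrogram at some height
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="lastp", p=20)
plt.show()

# DBSCAN: no k required; points in low-density regions get the noise label -1
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(sorted(set(db.labels_)))              # cluster ids, with -1 marking noise/outliers
```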
4. Variables Clustering (or Feature Clustering)
• Purpose: Unlike cluster analysis which groups observations, variables clustering groups variables
(features) that are similar or redundant. The goal is to identify groups of variables that share similar
information content or are highly correlated.
• How it Differs from PCA:
o PCA creates new, synthetic variables (principal components).
o Variables clustering groups the original variables.
• How it Works (Conceptually): Uses measures of similarity or association between variables (e.g., correlation, mutual information) to group them. Often involves hierarchical clustering principles applied to the variables rather than the observations (see the sketch at the end of this section).
• Use Cases:
o Understanding relationships and redundancies among variables.
o Feature selection: selecting a representative variable from each cluster.
o Reducing multicollinearity in regression models.
o Simplifying the interpretation of complex datasets by grouping related concepts.
o In JMP Pro, this is often found under "Multivariate Methods" and can help in selecting a subset
of variables that represent distinct underlying dimensions.
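A simple stand-in for variables clustering, assuming SciPy and synthetic data: hierarchical clustering is applied to the variables using 1 - |correlation| as the distance, so highly correlated (redundant) variables land in the same cluster. This mimics the idea only; it is not a reimplementation of any particular JMP or SAS procedure.

```python
# Sketch: cluster variables (not observations) via hierarchical clustering on a
# correlation-based distance. Synthetic data; illustrative only.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
df = pd.DataFrame({
    "x1": base[:, 0],
    "x2": base[:, 0] + 0.1 * rng.normal(size=500),   # nearly redundant with x1
    "x3": base[:, 1],
    "x4": base[:, 1] + 0.1 * rng.normal(size=500),   # nearly redundant with x3
    "x5": base[:, 2],
})

# Distance between variables: 1 - |correlation| (highly correlated variables are "close")
dvals = (1 - df.corr().abs()).values
condensed = dvals[np.triu_indices_from(dvals, k=1)]   # condensed form expected by linkage()

Z = linkage(condensed, method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")   # cut the tree at distance 0.5
print(dict(zip(df.columns, clusters)))                # {x1, x2} and {x3, x4} group together

# Feature selection idea: keep one representative variable from each cluster.
```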
5. Market Basket Analysis (Association Analysis)
• Purpose: To discover associations or co-occurrence relationships among a set of items in transactional
data. The classic example is finding products frequently bought together in a supermarket.
• Core Idea: Identifies "if-then" rules (e.g., "IF {Bread, Butter} THEN {Milk}").
• Key Metrics:
o Support: The fraction of transactions that contain a particular itemset (e.g., {Bread, Butter,
Milk}).
▪ Support(X) = (Number of transactions containing X) / (Total number of transactions)
o Confidence: The conditional probability that a transaction containing itemset X also contains
itemset Y. Measures how often items in Y appear in transactions that contain X.
▪ Confidence(X -> Y) = Support(X U Y) / Support(X)
o Lift: Measures how much more likely itemset Y is to be purchased when itemset X is purchased, compared to how often Y is purchased overall. It indicates the strength of the association beyond what would be expected by chance.
▪ Lift(X -> Y) = Support(X U Y) / (Support(X) * Support(Y)) = Confidence(X -> Y) / Support(Y)
▪ Lift > 1: Positive association (Y is more likely to be bought when X is bought than in general).
▪ Lift < 1: Negative association (Y is less likely to be bought when X is bought).
▪ Lift = 1: No association (X and Y occur independently).
• Algorithm Example: The Apriori algorithm is a classic method for finding frequent itemsets, which are then used to generate association rules; a small worked example appears at the end of this section.
• Use Cases: Retail layout optimization, product recommendations, targeted marketing, fraud detection.
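A small worked example of the three metrics on a hypothetical five-transaction basket list, written in plain Python directly from the definitions above (not a full Apriori implementation):

```python
# Worked example of support, confidence, and lift on hypothetical transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / N

X, Y = {"bread", "butter"}, {"milk"}
sup_xy = support(X | Y)
conf = sup_xy / support(X)                  # Confidence(X -> Y) = Support(X U Y) / Support(X)
lift = sup_xy / (support(X) * support(Y))   # Lift(X -> Y) = Support(X U Y) / (Support(X) * Support(Y))

print(f"Support(X U Y) = {sup_xy:.2f}")     # 2/5 = 0.40
print(f"Confidence     = {conf:.2f}")       # 0.40 / 0.60 = 0.67
print(f"Lift           = {lift:.2f}")       # 0.40 / (0.60 * 0.80) = 0.83 (< 1: slight negative
                                            # association in this toy data)
```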
The Problem of Explanatory (Traditional) vs. Predictive Modeling and Why It Matters
This distinction is fundamental in data analysis and statistics.
1. Explanatory Modeling (Traditional / Causal Inference Focus)
o Primary Goal: To understand and quantify the relationship between a set of input variables
(predictors, independent variables) and an outcome variable (dependent variable). The focus is
on interpreting the model coefficients to explain how or why changes in predictors affect the
outcome.
o Emphasis: Causal inference (though often hard to prove definitively), understanding
mechanisms, testing hypotheses derived from theory.
o Model Complexity: Often prefers simpler models (e.g., linear regression, logistic regression)
where coefficients are easily interpretable and statistical significance can be assessed.
o Evaluation: Goodness-of-fit statistics (e.g., R-squared, p-values for coefficients, AIC, BIC),
residual analysis, adherence to model assumptions.
o Example: A sociologist wants to understand the factors (e.g., education, income, location) that
explain differences in life satisfaction scores. The goal is to understand the individual impact of
each factor.
2. Predictive Modeling (Machine Learning Focus)
o Primary Goal: To develop a model that can accurately predict future or unseen outcomes
based on new input data. The focus is on the model's predictive accuracy, not necessarily on
understanding the exact contribution of each individual predictor.
o Emphasis: Generalization to new data, minimizing prediction error.
o Model Complexity: Can use complex, "black-box" models (e.g., neural networks, random
forests, gradient boosting) if they provide better predictive performance, even if their internal
workings are hard to interpret.
o Evaluation: Performance on unseen test data (e.g., accuracy, precision, recall, F1-score for
classification; RMSE, MAE for regression), cross-validation.
o Example: A company wants to build a model to predict which customers are likely to churn next
month based on their past behavior and demographics. The primary goal is accurate prediction
to target retention efforts.
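A hedged sketch of the contrast, assuming statsmodels and scikit-learn on synthetic data: the explanatory view fits an interpretable OLS model and inspects coefficients, p-values, and R-squared, while the predictive view scores a flexible "black-box" model purely on cross-validated prediction error.

```python
# Sketch contrasting the two mindsets on the same synthetic data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({"education": rng.normal(size=300),
                   "income": rng.normal(size=300),
                   "urban": rng.integers(0, 2, size=300)})
df["satisfaction"] = 0.5 * df["education"] + 0.3 * df["income"] + rng.normal(scale=0.5, size=300)

# Explanatory view: interpretable linear model; focus on coefficients and inference.
X = sm.add_constant(df[["education", "income", "urban"]])
ols = sm.OLS(df["satisfaction"], X).fit()
print(ols.summary())                        # coefficients, p-values, R-squared

# Predictive view: flexible model judged only on how well it predicts unseen data.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rmse = -cross_val_score(rf, df[["education", "income", "urban"]], df["satisfaction"],
                        scoring="neg_root_mean_squared_error", cv=5)
print("Cross-validated RMSE:", rmse.mean())
```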
Why the Distinction Matters:
• Different Goals Lead to Different Methods:
o If explaining why, you prioritize interpretable models and rigorous statistical inference about
coefficients.
o If predicting what, you prioritize models that generalize well to new data, even if they are
complex and less interpretable.
• Model Selection Criteria:
o Explanatory: Significance of predictors, goodness-of-fit, theoretical relevance.
o Predictive: Predictive accuracy on hold-out data.
• Interpretation of Results:
o Explanatory: Focus on the meaning, magnitude, and significance of coefficients (e.g., "a one-
unit increase in X is associated with a B-unit change in Y, holding other factors constant").
o Predictive: Focus on the overall accuracy and reliability of predictions. Individual feature
importance might be assessed, but it's secondary to predictive power.
• Risk of Misapplication:
o Using a purely predictive model for explanation can be misleading, because the correlations and complex interactions it learns do not imply causation or simple, stable relationships.
o Over-focusing on interpretability in a purely predictive task might sacrifice accuracy.