Unsupervised Learning & Clustering — Exam Crib Sheet
Why Unsupervised?
• Labeling large datasets is costly.
• Useful for data mining and feature extraction.
• Can track pattern characteristics that drift slowly over time.
• Reveals hidden data structure.
Mixture Models & Identifiability
• Model: p(x | θ) = Σⱼ P(ωⱼ) p(x | ωⱼ, θⱼ), where the mixing weights P(ωⱼ) sum to 1 (sampling sketch below).
• Identifiability: θ can be recovered from p(x) only if distinct parameter values give distinct densities; if multiple θ yield the same p(x), the true parameters cannot be recovered.
• Gaussian mixtures are generally identifiable up to relabeling of the components (label switching).
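A minimal NumPy sketch of what the mixture density means (all parameter values below are invented for illustration): each sample first picks a component ωⱼ with probability P(ωⱼ), then draws from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 1-D, 2-component mixture: P(w1)=0.3, P(w2)=0.7
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

# Draw n samples: pick a component per sample, then sample that Gaussian
n = 1000
comps = rng.choice(len(weights), size=n, p=weights)
x = rng.normal(means[comps], stds[comps])

def mixture_pdf(t):
    """p(t) = sum_j P(w_j) * N(t | mu_j, sigma_j^2), vectorized over t."""
    dens = np.exp(-0.5 * ((t[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return dens @ weights
```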
Parameter Estimation
• Maximum Likelihood (ML): maximize log L(θ).
• EM Algorithm: E-step: compute responsibilities P(ωⱼ | xᵢ); M-step: re-estimate parameters from responsibility-weighted data (sketch after this list).
• Bayesian: use prior p(θ) + data; better with small samples, reduces overfitting.
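A minimal sketch of EM for a 1-D Gaussian mixture, assuming NumPy; the function name, initialization, and iteration count are ad hoc, and there are no safeguards against the variance collapse noted under Key Pitfalls.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Illustrative EM for a 1-D Gaussian mixture; no convergence check."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)  # init means from data points
    var = np.full(k, x.var())                  # shared initial variance
    pi = np.full(k, 1.0 / k)                   # uniform mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(w_j | x_i)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens * pi
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted ML updates
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```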
Clustering Algorithms
• K-Means: minimizes SSE; assumes spherical, equal-size clusters (minimal implementation after this list).
• Hierarchical: agglomerative (merge) or divisive (split); dendrogram output.
• Graph-Theoretic: clusters via connectivity in similarity graph; handles non-convex shapes.
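A bare-bones Lloyd's-algorithm sketch for k-means in NumPy (random initialization, empty-cluster edge case ignored); in practice use k-means++ and multiple restarts, per Practical Tips.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd's algorithm; assumes no cluster goes empty."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point to its nearest center (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```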
Similarity Measures
• Metric: Euclidean, Manhattan, Minkowski.
• Non-metric: inner product, Tanimoto coefficient (similarities rather than distances; helpers below).
• Normalize features so results are invariant to units and scaling.
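Small helper sketches for the measures above (NumPy, dense real-valued vectors assumed):

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

def tanimoto(a, b):
    """Tanimoto similarity: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = a @ b
    return float(ab / (a @ a + b @ b - ab))
```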
Choosing Number of Clusters
• Elbow method: look for the knee in the SSE-vs-k curve (scan sketch after this list).
• Silhouette score (compactness & separation).
• BIC/AIC for mixture models.
• Stability across bootstrap samples.
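One way to scan candidate values of k with the first three criteria, assuming scikit-learn is available (silhouette requires k ≥ 2; stability checks are omitted here):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

def scan_k(X, k_values):
    """Report SSE (elbow), silhouette, and GMM BIC for each candidate k >= 2."""
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sil = silhouette_score(X, km.labels_)
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        print(f"k={k}: SSE={km.inertia_:.1f} silhouette={sil:.3f} BIC={bic:.1f}")
```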
Key Pitfalls
• Singularities in ML: a mixture component can collapse onto a single point, driving its variance to zero and the likelihood to infinity.
• Local optima in EM/K-means.
• Outliers distort distance-based methods.
• In high dimensions, distances concentrate and lose discriminative meaning (curse of dimensionality).
Practical Tips
• Scale features; smart init (k-means++).
• Multiple restarts for robustness.
• Validate before interpretation.
• Combine with dimensionality reduction if needed (pipeline sketch below).
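A sketch of how these tips compose, assuming scikit-learn; the data, component count, and cluster count below are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(500, 20))  # placeholder data

# Scale -> reduce dimension -> k-means++ with multiple restarts
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),  # placeholder dimensionality
    KMeans(n_clusters=5, init="k-means++", n_init=25, random_state=0),
)
labels = pipe.fit_predict(X)
```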
Method Comparison
Method         | Strengths                             | Weaknesses
---------------|---------------------------------------|---------------------------------------
K-Means        | Fast, simple, scalable                | Assumes spherical, equal-size clusters
GMM (EM)       | Elliptical clusters, soft assignments | Sensitive to init; components can collapse
Hierarchical   | Dendrogram output, no preset k        | O(n²) memory, up to O(n³) time
Graph/Spectral | Handles non-convex shapes             | Needs full pairwise similarity matrix