Unsupervised Learning & Clustering — Exam Crib

Sheet

Why Unsupervised?
• Labeling large datasets is costly.
• Useful for data mining and feature extraction.
• Handles evolving patterns over time.
• Reveals hidden data structure.

Mixture Models & Identifiability


• Model: p(x | θ) = Σ_j P(ω_j) p(x | ω_j, θ_j); the mixing weights P(ω_j) are non-negative and sum to 1.
• Identifiability: θ can be recovered uniquely from p(x); if distinct θ values produce the same p(x), the true parameters cannot be recovered.
• Gaussian mixtures are usually identifiable, up to label switching (permuting component indices).
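
A minimal sketch of this mixture density in numpy/scipy, assuming a 1-D two-component Gaussian mixture; the weights, means, and sigmas below are illustrative values, not taken from the text:

    # Evaluate p(x) = sum_j P(w_j) * N(x | mu_j, sigma_j^2) for a toy mixture.
    import numpy as np
    from scipy.stats import norm

    weights = np.array([0.3, 0.7])   # P(w_j), must sum to 1 (illustrative)
    means   = np.array([-2.0, 1.5])  # mu_j (illustrative)
    sigmas  = np.array([0.5, 1.0])   # sigma_j (illustrative)

    def mixture_pdf(x):
        # Weighted sum of the component densities at each point x.
        return sum(w * norm.pdf(x, loc=m, scale=s)
                   for w, m, s in zip(weights, means, sigmas))

    print(mixture_pdf(np.linspace(-5, 5, 11)))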

Parameter Estimation
• Maximum Likelihood (ML): maximize log L(θ).
• EM Algorithm: E-step: compute responsibilities (posterior component probabilities); M-step: re-estimate parameters using the responsibilities as weights.
• Bayesian: use prior p(θ) + data; better with small samples, reduces overfitting.
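
A rough numpy sketch of one EM fit for a 1-D two-component Gaussian mixture; the synthetic data and starting values are illustrative assumptions:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.5, 1.0, 300)])

    # Initial guesses for mixing weights, means, and standard deviations.
    w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    for _ in range(50):
        # E-step: responsibilities r[i, j] = P(component j | x_i).
        dens = np.stack([w[j] * norm.pdf(x, mu[j], sd[j]) for j in range(2)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the responsibility-weighted data.
        n_j = r.sum(axis=0)
        w = n_j / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_j
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_j)

    print(w, mu, sd)   # should land near the generating parameters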

Clustering Algorithms
• K-Means: minimizes SSE; assumes spherical, equal-size clusters.
• Hierarchical: agglomerative (merge) or divisive (split); dendrogram output.
• Graph-Theoretic: clusters via connectivity in similarity graph; handles non-convex shapes.
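
A short scikit-learn sketch contrasting the three families on toy two-moons data; the parameter choices are illustrative:

    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    hc = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
    sp = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            random_state=0).fit_predict(X)
    # K-means cuts the moons with a straight boundary; spectral clustering
    # can recover the non-convex shapes through the similarity graph.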

Similarity Measures
• Metric: Euclidean, Manhattan, Minkowski.
• Non-metric: Inner product, Tanimoto.
• Normalize features so the measure is invariant to feature scale.
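
A small sketch of these measures for two feature vectors (illustrative values); the Tanimoto line uses the real-valued form a.b / (a.a + b.b - a.b):

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock, minkowski

    a = np.array([1.0, 0.0, 2.0])
    b = np.array([0.0, 1.0, 2.0])

    d_euc  = euclidean(a, b)        # L2 distance
    d_man  = cityblock(a, b)        # L1 (Manhattan) distance
    d_mink = minkowski(a, b, p=3)   # general Lp (Minkowski) distance
    s_dot  = a @ b                  # inner-product similarity
    s_tani = (a @ b) / (a @ a + b @ b - a @ b)   # Tanimoto coefficient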

Choosing Number of Clusters
• Elbow method (SSE drop-off).
• Silhouette score (compactness & separation).
• BIC/AIC for mixture models.
• Stability across bootstrap samples.
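
A sketch scanning k on toy blobs with SSE (elbow), silhouette score, and GMM BIC in scikit-learn; the data and the range of k are illustrative:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse = km.inertia_                      # elbow: look for the bend
        sil = silhouette_score(X, km.labels_)  # higher is better
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)  # lower is better
        print(k, round(sse, 1), round(sil, 3), round(bic, 1))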

Key Pitfalls
• Singularities in ML (component collapse).
• Local optima in EM/K-means.
• Outliers distort distance-based methods.
• In high dimensions, distances concentrate and carry less information (curse of dimensionality).

Practical Tips
• Scale features; smart init (k-means++).
• Multiple restarts for robustness.
• Validate before interpretation.
• Combine with dimensionality reduction if needed.
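
A sketch combining these tips in a scikit-learn pipeline; the number of PCA components and clusters are placeholder values:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    pipe = make_pipeline(
        StandardScaler(),                        # scale features
        PCA(n_components=10),                    # optional dimensionality reduction
        KMeans(n_clusters=5, init="k-means++",   # smart initialization
               n_init=20, random_state=0),       # multiple restarts
    )
    # labels = pipe.fit_predict(X)   # X: your (n_samples, n_features) array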

Method Comparison
Method         | Strengths                              | Weaknesses
K-Means        | Fast, simple, scalable                 | Assumes spherical, equal-size clusters
GMM (EM)       | Elliptical clusters, soft assignments  | Sensitive to init, can collapse
Hierarchical   | Dendrogram, no preset k                | O(n^2) complexity
Graph/Spectral | Non-convex shapes                      | Needs similarity matrix
