DOC Rewrite plot_document_clustering.py as notebook #22497
Conversation
I'm proposing a first round of review to reorganize a bit the section about the performance comparison.
@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
Suggested change:
- This is an example showing how the scikit-learn can be used to cluster
+ This is an example showing how scikit-learn can be used to cluster
@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
documents by topics using a bag-of-words approach. This example uses
a scipy.sparse matrix to store the features instead of standard numpy arrays.
Suggested change:
- a scipy.sparse matrix to store the features instead of standard numpy arrays.
+ a `scipy.sparse` matrix to store the features instead of standard numpy arrays.
We attempt reweighting and dimensionality reduction on the
extracted features and compare performance metrics for each case.
Suggested change:
Here, we compare different feature extraction approaches using several clustering metrics.
categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
# Uncomment the following to do the analysis on all the categories
# categories = None
print("Loading 20 newsgroups dataset for categories:") | ||
print(categories) |
You can change l.43-45 to:
print(f"{len(dataset.data)} documents")
print(f"{len(dataset.target_names)} categories")
model (the fit method does nothing). When IDF weighting is needed it can
be added by pipelining its output to a TfidfTransformer instance.
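(For context, the pipelining pattern this removed docstring describes looks roughly like the sketch below; the stop_words and n_features values are illustrative assumptions, not the example's actual settings.)

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# HashingVectorizer is stateless (its fit method does nothing), so IDF
# weighting is added by piping its hashed counts into a TfidfTransformer.
hashing_tfidf = make_pipeline(
    HashingVectorizer(stop_words="english", alternate_sign=False, n_features=10000),
    TfidfTransformer(),
)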
Two algorithms are demoed: ordinary k-means and its more scalable cousin
A note here: I checked locally and, for this example, there is no point in using MiniBatchKMeans. At the end of the example, we can add a remark that, with a large number of data samples, it would be best to use MiniBatchKMeans to improve the computational performance.
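(Such a closing remark could come with a short sketch along these lines; the n_clusters and batch_size values here are illustrative assumptions.)

from sklearn.cluster import MiniBatchKMeans

# MiniBatchKMeans updates centroids on small random batches rather than the
# full dataset, trading a little clustering quality for much faster fits.
mbkm = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=0)
# mbkm.fit(X)  # X being the vectorized documents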
from sklearn.cluster import MiniBatchKMeans


def kmeans_pipeline(
Instead of this function, I would heavily rely on the notebook structure and add narrative.
In the end, we could create a cell for each of the following models:
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# true_k is the number of ground-truth categories, defined earlier in the example.

# Term frequencies only (no IDF reweighting).
tf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=False,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

# TF-IDF reweighted features.
tf_idf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

# TF-IDF features followed by LSA dimensionality reduction.
tf_idf_lsa_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    TruncatedSVD(n_components=10, random_state=0),
    Normalizer(copy=False),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)
(The above models need to be double-checked; this is a draft of what I could observe from the code.)
We could add a narrative explaining what each pipeline is doing before creating it, and finally make the comparison by evaluating each pipeline.
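(A minimal sketch of that final comparison step, assuming the dataset object from the loading cell and the three pipelines above; the printed quantity is just an example, the actual evaluation would use the clustering metrics.)

for name, model in [
    ("tf", tf_model),
    ("tf-idf", tf_idf_model),
    ("tf-idf + LSA", tf_idf_lsa_model),
]:
    # Fit the whole pipeline on the raw documents.
    model.fit(dataset.data)
    # model[-1] is the KMeans step of the fitted pipeline.
    print(name, "inertia:", model[-1].inertia_)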
pipeline.append(km)

homo = metrics.homogeneity_score(labels, km.labels_)
I think that we can isolate the evaluation in its own function:
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    silhouette_score,
    v_measure_score,
)
from sklearn.utils import check_random_state


def fit_evaluate_clusterer_bootstrap(clusterer, X, labels, n_bootstrap=10, random_state=None):
    """Fit the clusterer on bootstrap resamples and collect clustering metrics."""
    bootstrap_results = defaultdict(list)
    n_samples = X.shape[0]
    # check_random_state accepts None, an int seed, or a RandomState instance.
    rng = check_random_state(random_state)
    for _ in range(n_bootstrap):
        # Resample with replacement to estimate the variability of the fit.
        bootstrap_indices = rng.choice(np.arange(n_samples), size=n_samples, replace=True)
        X_bootstrap, labels_bootstrap = X[bootstrap_indices], labels[bootstrap_indices]
        predicted_labels = clusterer.fit_predict(X_bootstrap)
        bootstrap_results["homogeneity"].append(homogeneity_score(labels_bootstrap, predicted_labels))
        bootstrap_results["completeness"].append(completeness_score(labels_bootstrap, predicted_labels))
        bootstrap_results["v_measure"].append(v_measure_score(labels_bootstrap, predicted_labels))
        bootstrap_results["ARI"].append(adjusted_rand_score(labels_bootstrap, predicted_labels))
        bootstrap_results["silhouette"].append(silhouette_score(X_bootstrap, predicted_labels, sample_size=20))
    return pd.DataFrame(bootstrap_results, index=[f"Bootstrap #{i}" for i in range(n_bootstrap)])
The idea of using the above function is to get an estimate of the variation of the fit using bootstrapping. We had a discussion with @ogrisel IRL about what the best strategy would be here, and this seems to be the best one.
We can then call this function and show the mean and std results:
bootstrap_results = fit_evaluate_clusterer_bootstrap(kmeans, X, y, 10, np.random.RandomState(0))
bootstrap_results.aggregate(["mean", "std"])
Since we will have 3 different models, we can build the dataframe a bit differently: showing the method as the index and the metric/mean/std as a multi-index in the columns could be a nice way to summarize the information.
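(A possible sketch of that summary table, assuming one bootstrap results dataframe per model; tf_results, tf_idf_results and tf_idf_lsa_results are hypothetical placeholders.)

import pandas as pd

# Hypothetical per-model outputs of fit_evaluate_clusterer_bootstrap.
results = {
    "tf": tf_results,
    "tf-idf": tf_idf_results,
    "tf-idf + LSA": tf_idf_lsa_results,
}
# One row per model; (metric, mean/std) as a MultiIndex in the columns.
summary = pd.concat(
    {name: df.aggregate(["mean", "std"]).T.stack() for name, df in results.items()},
    axis=1,
).T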
Thanks for the detailed review.
But I encountered an issue: the silhouette coefficient takes as argument the data in feature space, while X_bootstrap in the function is a list of the documents. This means we need the output of the vectorizer, but clusterer is a Pipeline and doesn't expose the output of intermediate steps (unless there is a way I'm not aware of). The same problem occurs in the last code cell where the plotting happens and all the steps are needed.
So we either remove the silhouette coefficient, or create pipelines only for the feature extraction and instantiate KMeans separately. Any suggestions?
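(One way around this: scikit-learn pipelines support slicing, so the fitted sub-pipeline up to the clusterer can transform the raw documents into feature space; a sketch, assuming the tf_idf_model above has already been fitted.)

from sklearn.metrics import silhouette_score

# tf_idf_model[:-1] is the fitted sub-pipeline of every step except KMeans;
# it maps the raw documents into the feature space the clusterer saw.
X_features = tf_idf_model[:-1].transform(dataset.data)
kmeans_step = tf_idf_model[-1]
score = silhouette_score(X_features, kmeans_step.labels_, sample_size=20)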
Figured out a way. If it looks good I can add the narrative. Note: we are still using the …
@gpapadok what is the reason for closing the PR?
Reference Issues/PRs
For #22406.
What does this implement/fix? Explain your changes.
Rewrite of script to notebook-style example as suggested by @glemaitre in #22443.
Structured it as a comparison between k-means pipelines with weighted term frequency and dimensionality reduction and without. Also added a text plot at the end with the most frequent terms for each cluster.
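(The per-cluster term listing is typically computed along these lines; a sketch assuming fitted vectorizer and km objects, not necessarily the PR's exact code.)

# Sort each centroid's feature weights to find its most influential terms.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    top_terms = " ".join(terms[ind] for ind in order_centroids[i, :10])
    print(f"Cluster {i}: {top_terms}")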
Any other comments?
Alternatively, I could break the second cell with the kmeans_pipeline function into smaller ones and make it into a straightforward example. But I found the performance comparison interesting.