DOC Rewrite plot_document_clustering.py as notebook #22497
Conversation
I'm proposing a first round of review to reorganize a bit the section about the performance comparison.
@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
Suggested change:
- This is an example showing how the scikit-learn can be used to cluster
+ This is an example showing how scikit-learn can be used to cluster
@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
documents by topics using a bag-of-words approach. This example uses
a scipy.sparse matrix to store the features instead of standard numpy arrays.
Suggested change:
- a scipy.sparse matrix to store the features instead of standard numpy arrays.
+ a `scipy.sparse` matrix to store the features instead of standard numpy arrays.
We attempt reweighting and dimensionality reduction on the
extracted features and compare performance metrics for each case.
Suggested change:
Here, we compare different feature extraction approaches using several clustering metrics.
categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]
# Uncomment the following to do the analysis on all the categories
# categories = None
print("Loading 20 newsgroups dataset for categories:") | ||
print(categories) |
You can change l.43-45 to:
print(f"{len(dataset.data)} documents")
print(f"{len(dataset.target_names)} categories")
model (the fit method does nothing). When IDF weighting is needed it can
be added by pipelining its output to a TfidfTransformer instance.
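(For context, the pipelining pattern this removed docstring describes looks roughly like the sketch below; the stop_words and n_features values are illustrative assumptions, not the example's actual settings.)

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# HashingVectorizer is stateless (its fit method does nothing), so IDF
# weighting is added by piping its hashed counts into a TfidfTransformer.
hashing_tfidf = make_pipeline(
    HashingVectorizer(stop_words="english", alternate_sign=False, n_features=10000),
    TfidfTransformer(),
)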
Two algorithms are demoed: ordinary k-means and its more scalable cousin
A note here: I checked locally and, for this example, there is no point in using MiniBatchKMeans. At the end of the example, we can add a remark that, with a large number of data samples, it would be best to use MiniBatchKMeans to improve the computational performance.
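(Such a closing remark could come with a short sketch along these lines; the n_clusters and batch_size values here are illustrative assumptions.)

from sklearn.cluster import MiniBatchKMeans

# MiniBatchKMeans updates centroids on small random batches rather than the
# full dataset, trading a little clustering quality for much faster fits.
mbkm = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=0)
# mbkm.fit(X)  # X being the vectorized documents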
from sklearn.cluster import MiniBatchKMeans


def kmeans_pipeline(
Instead of this function, I would heavily rely on the notebook structure and add narrative.
In the end, we could create a cell for each of the following models:
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# true_k is the number of ground-truth categories, defined earlier in the example.

# Term frequencies only (no IDF reweighting).
tf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=False,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

# TF-IDF reweighted features.
tf_idf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

# TF-IDF features followed by LSA dimensionality reduction.
tf_idf_lsa_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    TruncatedSVD(n_components=10, random_state=0),
    Normalizer(copy=False),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)
(The above models need to be double-checked; this is a draft of what I could observe from the code.)
We could add a narrative explaining what each pipeline is doing before creating it, and finally make the comparison by evaluating each pipeline.
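(A minimal sketch of that final comparison step, assuming the dataset object from the loading cell and the three pipelines above; the printed quantity is just an example, the actual evaluation would use the clustering metrics.)

for name, model in [
    ("tf", tf_model),
    ("tf-idf", tf_idf_model),
    ("tf-idf + LSA", tf_idf_lsa_model),
]:
    # Fit the whole pipeline on the raw documents.
    model.fit(dataset.data)
    # model[-1] is the KMeans step of the fitted pipeline.
    print(name, "inertia:", model[-1].inertia_)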
pipeline.append(km)

homo = metrics.homogeneity_score(labels, km.labels_)
I think that we can isolate the evaluation in its own function:
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    silhouette_score,
    v_measure_score,
)
from sklearn.utils import check_random_state


def fit_evaluate_clusterer_bootstrap(clusterer, X, labels, n_bootstrap=10, random_state=None):
    """Fit the clusterer on bootstrap resamples and collect clustering metrics."""
    bootstrap_results = defaultdict(list)
    n_samples = X.shape[0]
    # check_random_state accepts None, an int seed, or a RandomState instance.
    rng = check_random_state(random_state)
    for _ in range(n_bootstrap):
        # Resample with replacement to estimate the variability of the fit.
        bootstrap_indices = rng.choice(np.arange(n_samples), size=n_samples, replace=True)
        X_bootstrap, labels_bootstrap = X[bootstrap_indices], labels[bootstrap_indices]
        predicted_labels = clusterer.fit_predict(X_bootstrap)
        bootstrap_results["homogeneity"].append(homogeneity_score(labels_bootstrap, predicted_labels))
        bootstrap_results["completeness"].append(completeness_score(labels_bootstrap, predicted_labels))
        bootstrap_results["v_measure"].append(v_measure_score(labels_bootstrap, predicted_labels))
        bootstrap_results["ARI"].append(adjusted_rand_score(labels_bootstrap, predicted_labels))
        bootstrap_results["silhouette"].append(silhouette_score(X_bootstrap, predicted_labels, sample_size=20))
    return pd.DataFrame(bootstrap_results, index=[f"Bootstrap #{i}" for i in range(n_bootstrap)])
The idea of using the above function is to get an estimate of the variation of the fit using bootstrapping. We had a discussion with @ogrisel IRL about what the best strategy would be here, and this seems to be the best one.
We can then call this function and show the mean and std results:
bootstrap_results = fit_evaluate_clusterer_bootstrap(kmeans, X, y, 10, np.random.RandomState(0))
bootstrap_results.aggregate(["mean", "std"])
Since we will have 3 different models, we can build the dataframe a bit differently: showing the method as the index and the metric/mean/std as a multi-index in the columns could be a nice way to summarize the information.
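(A possible sketch of that summary table, assuming one bootstrap results dataframe per model; tf_results, tf_idf_results and tf_idf_lsa_results are hypothetical placeholders.)

import pandas as pd

# Hypothetical per-model outputs of fit_evaluate_clusterer_bootstrap.
results = {
    "tf": tf_results,
    "tf-idf": tf_idf_results,
    "tf-idf + LSA": tf_idf_lsa_results,
}
# One row per model; (metric, mean/std) as a MultiIndex in the columns.
summary = pd.concat(
    {name: df.aggregate(["mean", "std"]).T.stack() for name, df in results.items()},
    axis=1,
).T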
Thanks for the detailed review.
But I encountered an issue: the silhouette coefficient takes as argument the data in feature space, while X_bootstrap in the function is a list of the documents. This means we need the output of the vectorizer, but clusterer is a Pipeline and doesn't expose the output of intermediate steps (unless there is a way I'm not aware of). The same problem occurs in the last code cell where the plotting happens and all the steps are needed.
So we either remove the silhouette coefficient, or create pipelines only for the feature extraction and instantiate KMeans separately. Any suggestions?
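(One way around this: scikit-learn pipelines support slicing, so the fitted sub-pipeline up to the clusterer can transform the raw documents into feature space; a sketch, assuming the tf_idf_model above has already been fitted.)

from sklearn.metrics import silhouette_score

# tf_idf_model[:-1] is the fitted sub-pipeline of every step except KMeans;
# it maps the raw documents into the feature space the clusterer saw.
X_features = tf_idf_model[:-1].transform(dataset.data)
kmeans_step = tf_idf_model[-1]
score = silhouette_score(X_features, kmeans_step.labels_, sample_size=20)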
Figured out a way. If it looks good I can add the narrative. Note: we are still using the …
@gpapadok what is the reason for closing the PR?
Reference Issues/PRs
For #22406.
What does this implement/fix? Explain your changes.
Rewrite of script to notebook-style example as suggested by @glemaitre in #22443.
Structured it as a comparison between k-means pipelines with weighted term frequency and dimensionality reduction and without. Also added a text plot at the end with the most frequent terms for each cluster.
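(The per-cluster term listing is typically computed along these lines; a sketch assuming fitted vectorizer and km objects, not necessarily the PR's exact code.)

# Sort each centroid's feature weights to find its most influential terms.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    top_terms = " ".join(terms[ind] for ind in order_centroids[i, :10])
    print(f"Cluster {i}: {top_terms}")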
Any other comments?
Alternatively, I could break the second cell with the kmeans_pipeline function into smaller ones and make it into a straightforward example. But I found the performance comparison interesting.