DOC Rewrite plot_document_clustering.py as notebook by gpapadok · Pull Request #22497 · scikit-learn/scikit-learn
Closed
wants to merge 5 commits into from
Conversation

@gpapadok (Contributor) commented on Feb 15, 2022:

Reference Issues/PRs

For #22406.

What does this implement/fix? Explain your changes.

Rewrite of script to notebook-style example as suggested by @glemaitre in #22443.

Structured it as a comparison between k-means pipelines with weighted term frequency and dimensionality reduction and without. Also added a text plot at the end with the most frequent terms for each cluster.

Any other comments?

Alternatively, I could break the second cell with the kmeans_pipeline function into smaller ones and make it into a straightforward example. But I found the performance comparison interesting.

@glemaitre self-requested a review February 16, 2022 12:30
@glemaitre (Member) left a comment:

I'm proposing a first round of review to reorganize the section about the performance comparison a bit.

@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
Suggested change
This is an example showing how the scikit-learn can be used to cluster
This is an example showing how scikit-learn can be used to cluster

@@ -6,148 +6,32 @@
This is an example showing how the scikit-learn can be used to cluster
documents by topics using a bag-of-words approach. This example uses
a scipy.sparse matrix to store the features instead of standard numpy arrays.
Suggested change
a scipy.sparse matrix to store the features instead of standard numpy arrays.
a `scipy.sparse` matrix to store the features instead of standard numpy arrays.

Comment on lines 9 to 10
We attempt reweighting and dimensionality reduction on the
extracted features and compare performance metrics for each case.
Here, we compare different feature extraction approaches using several clustering metrics.


categories = [
"alt.atheism",
"talk.religion.misc",
"comp.graphics",
"sci.space",
]
# Uncomment the following to do the analysis on all the categories
# categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)
you can change l.43-45

print(f"{len(dataset.data)} documents")
print(f"{len(dataset.target_names)} categories")

model (the fit method does nothing). When IDF weighting is needed it can
be added by pipelining its output to a TfidfTransformer instance.

Two algorithms are demoed: ordinary k-means and its more scalable cousin
A note here: I checked locally, and for this example there is no point in using MiniBatchKMeans.
At the end of the example, we can add a remark that, with a large number of data samples, it would be best to use MiniBatchKMeans to improve computational performance.
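The remark above can be illustrated with a minimal sketch of the drop-in swap. The data here is synthetic and purely illustrative; both estimators expose the same fitted attributes.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.RandomState(0)
X = rng.rand(5000, 20)

# MiniBatchKMeans fits on mini-batches, trading a little cluster quality
# for much lower computation time on large sample counts.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
mbkm = MiniBatchKMeans(
    n_clusters=4, batch_size=1024, n_init=10, random_state=0
).fit(X)

print(km.cluster_centers_.shape, mbkm.cluster_centers_.shape)
```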

from sklearn.cluster import MiniBatchKMeans


def kmeans_pipeline(

Instead of this function, I would heavily rely on the notebook structure and add narrative.
In the end, we could create a cell for each of the following models:

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

tf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=False,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

tf_idf_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

tf_idf_lsa_model = make_pipeline(
    TfidfVectorizer(
        max_df=0.5,
        max_features=10000,
        min_df=2,
        stop_words="english",
        use_idf=True,
    ),
    TruncatedSVD(n_components=10, random_state=0),
    Normalizer(copy=False),
    KMeans(n_clusters=true_k, init="k-means++", random_state=0),
)

(The above models need to be double-checked; this is a draft based on what I could observe from the code.)
We could add narrative explaining what each pipeline does before creating it, and finally make the comparison by evaluating each pipeline.
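Such an evaluation loop over the pipelines could look roughly like this. This is a hedged sketch on a toy corpus: only two of the three suggested pipelines are shown, with default vectorizer settings for brevity, and the labels are invented for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import v_measure_score
from sklearn.pipeline import make_pipeline

docs = [
    "graphics gpu render",
    "gpu graphics image",
    "space rocket orbit",
    "rocket nasa orbit",
]
labels = [0, 0, 1, 1]
true_k = 2

models = {
    "tf": make_pipeline(
        TfidfVectorizer(use_idf=False),
        KMeans(n_clusters=true_k, init="k-means++", n_init=10, random_state=0),
    ),
    "tf-idf": make_pipeline(
        TfidfVectorizer(use_idf=True),
        KMeans(n_clusters=true_k, init="k-means++", n_init=10, random_state=0),
    ),
}

# Fit each pipeline end-to-end and compare against the true labels.
scores = {}
for name, model in models.items():
    predicted = model.fit_predict(docs)
    scores[name] = v_measure_score(labels, predicted)
    print(f"{name}: V-measure = {scores[name]:.2f}")
```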


pipeline.append(km)

homo = metrics.homogeneity_score(labels, km.labels_)

I think that we can isolate the evaluation in its own function:

from collections import defaultdict

import numpy as np
import pandas as pd

from sklearn.metrics import (
    homogeneity_score,
    completeness_score,
    v_measure_score,
    adjusted_rand_score,
    silhouette_score,
)
from sklearn.utils import check_random_state


def fit_evaluate_clusterer_bootstrap(clusterer, X, labels, n_bootstrap=10, random_state=None):
    bootstrap_results = defaultdict(list)
    n_samples = X.shape[0]
    rng = check_random_state(random_state)

    for _ in range(n_bootstrap):
        bootstrap_indices = rng.choice(np.arange(n_samples), size=n_samples, replace=True)
        X_bootstrap, labels_bootstrap = X[bootstrap_indices], labels[bootstrap_indices]
        predicted_labels = clusterer.fit_predict(X_bootstrap)

        bootstrap_results["homogeneity"].append(homogeneity_score(labels_bootstrap, predicted_labels))
        bootstrap_results["completeness"].append(completeness_score(labels_bootstrap, predicted_labels))
        bootstrap_results["v_measure"].append(v_measure_score(labels_bootstrap, predicted_labels))
        bootstrap_results["ARI"].append(adjusted_rand_score(labels_bootstrap, predicted_labels))
        bootstrap_results["silhouette"].append(silhouette_score(X_bootstrap, predicted_labels, sample_size=20))

    return pd.DataFrame(bootstrap_results, index=[f"Bootstrap #{i}" for i in range(n_bootstrap)])

The idea behind this function is to estimate the variation of the fit using the bootstrap. We had a discussion with @ogrisel IRL about what the best strategy would be here, and this seems to be it.

We can then call this function and show the mean and std results:

bootstrap_results = fit_evaluate_clusterer_bootstrap(kmeans, X, y, 10, np.random.RandomState(0))
bootstrap_results.aggregate(["mean", "std"])

Since we will have 3 different models, we can build the dataframe a bit differently: showing the method as the index, with metric/mean/std as multi-index columns, could be a nice way to summarize the information.
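A minimal pandas sketch of that layout, with random stand-in numbers and hypothetical model names (the real per-model frames would come from the bootstrap function above):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
metric_names = ["homogeneity", "completeness", "v_measure"]

# One bootstrap-results frame per model (random stand-ins for real runs).
per_model = {
    name: pd.DataFrame(rng.rand(10, 3), columns=metric_names)
    for name in ["tf", "tf-idf", "tf-idf + LSA"]
}

# Rows: models. Columns: (metric, statistic) multi-index.
summary = pd.concat(
    {name: df.aggregate(["mean", "std"]).T.stack() for name, df in per_model.items()},
    axis=1,
).T
print(summary)
```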

@gpapadok (Contributor, Author) replied on Feb 17, 2022:

Thanks for the detailed review.

But I encountered an issue. The silhouette coefficient takes the data in feature space as an argument, while X_bootstrap in the function is a list of documents. That means we need the output of the vectorizer, but clusterer is a Pipeline and doesn't expose its intermediate outputs (unless there is a way I'm not aware of). The same problem applies to the last code cell, where the plotting happens and all the steps are needed.

So we either remove silhouette coef, or create pipelines only for the feature extraction and instantiate KMeans separately. Any suggestions?
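For reference, one possible answer to this question (not necessarily the approach the PR ended up taking): a fitted Pipeline supports slicing since scikit-learn 0.21, so the sub-pipeline without the final clusterer exposes the feature space:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real documents.
docs = [
    "graphics gpu render",
    "gpu graphics image",
    "space rocket orbit",
    "rocket nasa orbit",
]

pipeline = make_pipeline(
    TfidfVectorizer(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
predicted = pipeline.fit_predict(docs)

# pipeline[:-1] is the vectorizer-only sub-pipeline; its output is the
# feature space that silhouette_score expects.
X_features = pipeline[:-1].transform(docs)
score = silhouette_score(X_features, predicted)
print(score)
```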

@lesteve lesteve mentioned this pull request Feb 17, 2022
@gpapadok gpapadok changed the title DOC Rewrite plot_document_clustering.py as notebook [WIP] DOC Rewrite plot_document_clustering.py as notebook Feb 17, 2022
@gpapadok (Contributor, Author) commented:

Figured out a way. If it looks good I can add the narrative.
For the sampling it made more sense to use sklearn.utils.resample. Faster than choice too.

Note: RandomState is legacy according to the numpy docs. I tried to use the recommended Generator but resample won't accept it.
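For reference, a minimal sketch of the sklearn.utils.resample call mentioned above, on toy arrays. A RandomState is passed because, as noted, resample does not accept numpy's newer Generator:

```python
import numpy as np
from sklearn.utils import resample

docs = np.array(["doc a", "doc b", "doc c", "doc d"])
labels = np.array([0, 0, 1, 1])

# One bootstrap draw: both arrays are resampled with the same indices,
# so documents stay aligned with their labels.
docs_boot, labels_boot = resample(
    docs, labels, replace=True, random_state=np.random.RandomState(0)
)
print(docs_boot, labels_boot)
```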

@gpapadok gpapadok requested a review from glemaitre February 18, 2022 20:24
@gpapadok gpapadok changed the title [WIP] DOC Rewrite plot_document_clustering.py as notebook DOC Rewrite plot_document_clustering.py as notebook Feb 18, 2022
@glemaitre (Member) replied:

Note: RandomState is legacy according to the numpy docs. I tried to use the recommended Generator but resample won't accept it.

We are still using the RandomState for backward compatibility. At some point, we might migrate the random generator to the new API.

@glemaitre (Member) commented:

@gpapadok what is the reason for closing the PR?

This pull request was closed.