DOC Rework plot_hashing_vs_dict_vectorizer.py example #23266

ArturoAmorQ · 2022-05-03T09:38:09Z

Reference Issues/PRs

Related to #22928

What does this implement/fix? Explain your changes.

In #22928 we remove the use of HashingVectorizer from the plot_document_classification_20newsgroups.py example for the sake of simplicity.
A comparison of the performance of hashers and vectorizers can be moved to this existing example.
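
As a rough illustration of what such a comparison involves, the sketch below contrasts the two approaches in plain Python. A dictionary-based vectorizer learns an explicit token-to-column vocabulary, while a hasher maps each token to a column through a hash function and stores no vocabulary. The helper names (`tokenize`, `dict_vectorize`, `hash_vectorize`), the use of `zlib.crc32`, and `n_features=16` are illustrative assumptions. This is not scikit-learn's implementation.

```python
import re
import zlib
from collections import Counter


def tokenize(doc):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", doc.lower())


def dict_vectorize(docs):
    """Learn an explicit vocabulary, then count token occurrences per column."""
    vocabulary = {}
    for doc in docs:
        for token in tokenize(doc):
            vocabulary.setdefault(token, len(vocabulary))
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokenize(doc)).items():
            row[vocabulary[token]] = count
        rows.append(row)
    return rows, vocabulary


def hash_vectorize(docs, n_features=16):
    """Count token occurrences in a fixed number of columns chosen by hashing."""
    rows = []
    for doc in docs:
        row = [0] * n_features
        for token in tokenize(doc):
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            row[zlib.crc32(token.encode("utf-8")) % n_features] += 1
        rows.append(row)
    return rows


docs = ["the cat sat on the mat", "the dog sat"]
dict_rows, vocabulary = dict_vectorize(docs)
hash_rows = hash_vectorize(docs)
print(len(vocabulary), len(dict_rows[0]), len(hash_rows[0]))
```

The hashed rows keep a fixed width no matter how many distinct tokens appear, at the cost of possible collisions and of losing the mapping from a column back to its token.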

Any other comments?

Side effect: Implements notebook style as intended in #22406

ogrisel

Thanks for the PR, here is a batch of feedback.

examples/text/plot_hashing_vs_dict_vectorizer.py

ogrisel

Thanks very much @ArturoAmorQ, this notebook is much nicer than the original benchmark script.

Here is a final batch of suggestions for improvement:

examples/text/plot_hashing_vs_dict_vectorizer.py

jjerphan

Thank you, @ArturoAmorQ.

I think one should use other terms to make this example more accurate.

This is for instance the case of:

"frequency" which can be replaced by "occurrence (counts)" (to respect the definition)
"speed" which can be replaced by "data processing rate" (to respect the unit (bytes/sec))
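
To make the two terms concrete, here is a minimal sketch, assuming a hypothetical helper `count_occurrences` and a toy corpus. It returns raw occurrence counts (frequencies would be those counts divided by the total) and reports a data processing rate in MB/s, computed as bytes processed divided by elapsed time.

```python
import re
from collections import Counter
from time import perf_counter


def count_occurrences(docs):
    """Return the number of occurrences of each token across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"\w+", doc.lower()))
    return counts


# Toy corpus, repeated so the timing is measurable.
docs = ["the cat sat on the mat", "the dog sat"] * 1000
data_size_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6

t0 = perf_counter()
counts = count_occurrences(docs)
duration = perf_counter() - t0

# Occurrences are raw counts; frequencies are counts normalized by the total.
total = sum(counts.values())
frequencies = {token: n / total for token, n in counts.items()}

print(
    f"processed {data_size_mb:.3f} MB in {duration:.3f} s "
    f"at {data_size_mb / duration:.1f} MB/s"
)
```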

Here are some comments and formatting fixes.

Edit: not related to this PR, but #23004 might come with new changes for this example then.

examples/text/plot_hashing_vs_dict_vectorizer.py

ArturoAmorQ · 2022-05-30T09:33:31Z

Thanks @ogrisel and @jjerphan. This notebook is much clearer thanks to your comments.

jjerphan

Thank you, @ArturoAmorQ.

Edit: I'll let @ogrisel merge if everything LGTH.

examples/text/plot_hashing_vs_dict_vectorizer.py

ogrisel

LGTM again, just a final batch of nitpicks + a formatting fix.

examples/text/plot_hashing_vs_dict_vectorizer.py

ogrisel · 2022-05-30T16:46:23Z

Merged, thank you very much for the nice contribution @ArturoAmorQ!

ArturoAmorQ added 3 commits April 27, 2022 14:53

Change format to notebook style

44cfe22

Add vectorizers to benchmark

d4056c8

Divide by sections

ad3b964

github-actions bot added the Documentation label May 3, 2022

lesteve added the "Quick Review" label (for PRs that are quick to review) May 11, 2022

ArturoAmorQ added 2 commits May 13, 2022 15:06

Link to documentation

64d9e8c

Improve notebook style

b6e8c3c

lesteve removed the "Quick Review" label (for PRs that are quick to review) May 13, 2022

ArturoAmorQ added 3 commits May 13, 2022 17:47

Add text and plots

5346990

Use f format for prints

c62405c

Improve text and general organization

d948423

ogrisel reviewed May 19, 2022

ArturoAmorQ and others added 3 commits May 19, 2022 16:53

Apply suggestions from code review

d24180b

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Fix format

b3078c5

Apply suggestions from code review

68f1ddb

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ArturoAmorQ changed the title from "[WIP] DOC Rework plot_hashing_vs_dict_vectorizer.py example" to "DOC Rework plot_hashing_vs_dict_vectorizer.py example" May 20, 2022

ArturoAmorQ added 4 commits May 20, 2022 12:16

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

dbffb99

…nto compare_vectorizers

Edit abstract

5575aa8

Iter

10e89ab

Add hyperlinks to functions

93355de

ogrisel approved these changes May 23, 2022

ArturoAmorQ and others added 3 commits May 23, 2022 17:58

Apply suggestions from code review

360140a

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

iter

0d318dd

iter

883e234

ogrisel reviewed May 25, 2022

examples/text/plot_hashing_vs_dict_vectorizer.py (outdated, resolved)

ArturoAmorQ and others added 3 commits May 25, 2022 10:48

Update examples/text/plot_hashing_vs_dict_vectorizer.py

aff5eba

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Merge branch 'main' of https://github.com/scikit-learn/scikit-learn i…

5633f37

…nto compare_vectorizers

Merge branch 'compare_vectorizers' of github.com:ArturoAmorQ/scikit-l…

cf1399e

…earn into compare_vectorizers

jjerphan reviewed May 30, 2022

Apply suggestions from code review

71fa2de

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>

ArturoAmorQ added 3 commits May 30, 2022 10:03

Format

d65880e

Preffer term occurrences over frequencies

c958be9

Add authors

4af4e43

jjerphan approved these changes May 30, 2022

ArturoAmorQ commented May 30, 2022

examples/text/plot_hashing_vs_dict_vectorizer.py (outdated, resolved)

tweak

9268ca4

ogrisel approved these changes May 30, 2022

Apply suggestions from code review

e54a6cc

ogrisel reviewed May 30, 2022

examples/text/plot_hashing_vs_dict_vectorizer.py (outdated, resolved)

Update examples/text/plot_hashing_vs_dict_vectorizer.py

e228f2f

ogrisel reviewed May 30, 2022

examples/text/plot_hashing_vs_dict_vectorizer.py (resolved)

Update examples/text/plot_hashing_vs_dict_vectorizer.py

39ce87f

ogrisel merged commit 6ff214c into scikit-learn:main May 30, 2022

ArturoAmorQ mentioned this pull request Jun 2, 2022

DOC Rework plot_document_clustering.py example #23528

Merged

ArturoAmorQ deleted the compare_vectorizers branch June 9, 2022 13:29

glemaitre pushed a commit that referenced this pull request Aug 5, 2022

DOC Rework plot_hashing_vs_dict_vectorizer.py example (#23266)

3845e76

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
