-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DOC Rework plot_hashing_vs_dict_vectorizer.py example #23266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, here is a batch of feedback.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much @ArturoAmorQ, this notebook is much nicer than the original benchmark script.
Here is a final batch of suggestions for improvement:
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…nto compare_vectorizers
…earn into compare_vectorizers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ArturoAmorQ.
I think one should use other terms to make this example more accurate.
This is for instance the case of:
- "frequency" which can be replace by "occurence (counts)" (to respect the the definition)
- "speed" which can be replaced by "data processing rate" (to respect the unit (bytes/sec))
Here are some comments and formatting fixes.
Edit: not related to this PR, but #23004 might come with new changes for this example then.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @ArturoAmorQ.
Edit: I let @ogrisel merge if everything LGTH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again, just a final batch of nitpicks + a formatting fix.
Merged, thank you very much for the nice contribution @ArturoAmorQ! |
…3266) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…3266) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Related to #22928
What does this implement/fix? Explain your changes.
In #22928 we remove the use of
HashingVectorizer
from the plot_document_classification_20newsgroups.py example for the sake of simplicity.A comparison of the performance of hashers and vectorizers can be moved to this existing example.
Any other comments?
Side effect: Implements notebook style as intended in #22406