TfidfVectorizer documentation doesn't track TfidfTransformer · Issue #6766 · scikit-learn/scikit-learn · GitHub

TfidfVectorizer documentation doesn't track TfidfTransformer #6766


Closed
amueller opened this issue May 5, 2016 · 16 comments · Fixed by #12822
Labels
Documentation Easy Well-defined and straightforward way to resolve good first issue Easy with clear instructions to resolve Sprint

Comments

@amueller
Member
amueller commented May 5, 2016

There are important details of the tf-idf transformation in the TfidfTransformer that are not reflected in the TfidfVectorizer's docstring. I think they should be added there.
Also, I think the formula that is used should be added to the narrative.
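
For context, `TfidfVectorizer` is documented as equivalent to `CountVectorizer` followed by `TfidfTransformer`, which is why the transformer's details matter for the vectorizer's docstring. A minimal sketch that checks this on a toy corpus (the corpus is made up for illustration, and default parameters are assumed on both sides):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# One step: raw text -> tf-idf matrix.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> term counts -> tf-idf matrix.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# If both routes agree, the transformer's formula, smoothing and
# normalization apply to the vectorizer as well.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

If the two matrices match, anything said about the formula in the `TfidfTransformer` docstring applies unchanged to `TfidfVectorizer`.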

@amueller amueller added Easy Well-defined and straightforward way to resolve Documentation labels May 5, 2016
@amueller
Member Author
amueller commented May 5, 2016

Actually, I'm not entirely happy with the documentation of TfidfTransformer.
Finding out which formula it actually uses by default is non-trivial.
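
For reference, the `TfidfTransformer` documentation in recent releases describes the defaults (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) as: idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count, with each row then scaled to unit L2 norm. A minimal pure-Python sketch under that assumption; the function name `tfidf_default` and the toy counts are hypothetical and not part of scikit-learn:

```python
import math


def tfidf_default(counts):
    """Apply the assumed default tf-idf to a dense list of count rows.

    counts: list of documents, each a list of raw term counts.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf (smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        # tf is the raw count (sublinear_tf=False).
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2-normalize each document row (norm='l2').
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Toy count matrix: 6 documents, 3 terms.
counts = [[3, 0, 1], [2, 0, 0], [3, 0, 0], [4, 0, 0], [3, 2, 0], [3, 0, 2]]
rows = tfidf_default(counts)
print([round(v, 4) for v in rows[0]])
```

Note that the "+ 1" outside the logarithm means a term occurring in every document still gets weight 1 rather than 0, which differs from the textbook idf.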

@amueller
Member Author
amueller commented May 5, 2016

also see #4391, #2998

@amueller
Member Author
amueller commented May 5, 2016

and neither documents the default value of "norm".
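
Until the docstrings state it, the default can be read off the estimators themselves. A small sketch, assuming a scikit-learn release where `norm` defaults to `'l2'` (which is the case in the releases I am aware of); the two documents are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

# Read the default from the estimator parameters rather than the docstring.
transformer_norm = TfidfTransformer().get_params()["norm"]
vectorizer_norm = TfidfVectorizer().get_params()["norm"]
print(transformer_norm, vectorizer_norm)

# With L2 normalization, every non-empty document row has unit length.
docs = ["the cat sat", "the dog sat on the log"]
X = TfidfVectorizer().fit_transform(docs)
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
print(row_norms)
```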

@amueller
Member Author
amueller commented May 5, 2016

slightly unrelated: the _document_frequency function is badly misnamed. It computes document counts, not frequencies.

@krishnakalyan3
Contributor
krishnakalyan3 commented May 28, 2016

@amueller may I take up this task? As far as I understand, the scope of this task is to update the TfidfVectorizer's docstring.
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)
a) Default Value of norm
b) Update TfidfVectorizer documentation

Please let me know in case I have missed any details.

@jnothman
Member

@krishnakalyan3 go ahead, but @amueller suggested TfidfTransformer needs some more detail, and TfidfVectorizer should have much of that too (or at least refer clearly to where the user can read about the parameters). Familiarise yourself with the options and write up as clearly as you can...

@krishnakalyan3
Contributor
krishnakalyan3 commented May 29, 2016

@amueller @jnothman please let me know if I have missed anything out.

@jnothman
Member
jnothman commented Jun 3, 2016

@amueller

> slightly unrelated: the _document_frequency function is badly misnamed. It computes document counts, not frequencies.

Apart from not being sure what distinction you make between 'count' and 'frequency', "document frequency" is jargon in Information Retrieval, meaning specifically, "the number of documents a particular term appears in".
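
To make the Information Retrieval sense concrete, here is a minimal pure-Python sketch of document frequency as "the number of documents a term appears in". The helper name `document_frequency` and the whitespace tokenization are assumptions for illustration, not scikit-learn's `_document_frequency` implementation:

```python
def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in docs:
        # set() ensures a term repeated within one document counts once.
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return df


# Toy corpus, made up for illustration.
docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
df = document_frequency(docs)
print(df["the"], df["cat"])

# A relative frequency would divide by the number of documents instead.
relative_the = df["the"] / len(docs)
```

Here "the" occurs three times in the corpus but in only two documents, so its document frequency is 2; this is an absolute count, which is the reading both comments agree on even though they disagree on the name.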

@HimaVarsha94

Shouldn't the default value of analyzer be given in the comments as well?

@amueller
Member Author

@HimaVarsha94 yes but that seems to be a different issue.

@amueller
Member Author
amueller commented Mar 3, 2017

Someone can take this up from #6839 from what it looks like.

@amueller amueller added the Sprint label Mar 3, 2017
@musiciancodes

I'll take a look at this today. The plan is to start by making the TfidfVectorizer docstring provide more detail, and potentially to wordsmith the docstrings for both TfidfVectorizer and TfidfTransformer.

@amueller
Member Author

thanks @y3l2n :)

@amueller
Member Author

btw @y3l2n there was an earlier attempt at #6839 but I'm not sure what happened to it.

@krishnakalyan3
Contributor

@amueller @y3l2n I was not exactly sure how to go about it, that's why I closed this PR. I am not working on this issue anymore. Please feel free to work on it.

@qinhanmin2014 qinhanmin2014 added the good first issue Easy with clear instructions to resolve label Jul 28, 2018
@blooraspberry

I am working on this!
