Decouple CountVectorizer => TextTokenizer + ItemCountVectorizer · Issue #23004 · scikit-learn/scikit-learn · GitHub

Decouple CountVectorizer => TextTokenizer + ItemCountVectorizer #23004

Open
eangius opened this issue Mar 31, 2022 · 2 comments
eangius commented Mar 31, 2022

Describe the workflow you want to enable

The CountVectorizer component is responsible not only for vectorizing term frequencies but also for internally tokenizing & normalizing the input text into terms. While this is convenient for NLP pipelines, it cannot be leveraged for similar non-NLP problems where one needs to compute tf-idf (or similar) on "document" bags of general items other than text.

Describe your proposed solution

Decouple the CountVectorizer into two special-purpose, reusable components, TextTokenizer & ItemCountVectorizer, such that running both sequentially in a pipeline reproduces the current behavior.
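To make the proposed split concrete, here is a minimal sketch in plain Python. The class names come from this issue, but the method bodies, the dense list-of-lists output, and the `repr`-based column ordering are assumptions made for illustration only; they are not the attached draft and not scikit-learn code. A real version would presumably subclass scikit-learn's estimator base classes and return a sparse matrix. The default `token_pattern` and `lowercase` values mirror CountVectorizer's defaults.

```python
import re
from collections import Counter


class TextTokenizer:
    """Hypothetical stateless step: raw strings -> lists of tokens."""

    def __init__(self, token_pattern=r"(?u)\b\w\w+\b", lowercase=True):
        self.token_pattern = token_pattern
        self.lowercase = lowercase

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        pattern = re.compile(self.token_pattern)
        return [
            pattern.findall(doc.lower() if self.lowercase else doc) for doc in X
        ]


class ItemCountVectorizer:
    """Hypothetical step: bags of hashable items -> dense rows of counts."""

    def fit(self, X, y=None):
        # Assumption: order columns by repr() so mixed item types still sort.
        items = sorted({item for bag in X for item in bag}, key=repr)
        self.vocabulary_ = {item: col for col, item in enumerate(items)}
        return self

    def transform(self, X):
        rows = []
        for bag in X:
            row = [0] * len(self.vocabulary_)
            for item, n in Counter(bag).items():
                col = self.vocabulary_.get(item)
                if col is not None:  # items unseen during fit are ignored
                    row[col] = n
            rows.append(row)
        return rows


# Text use case: both steps run sequentially, as a pipeline would.
docs = ["the cat sat", "the cat saw the dog"]
tokens = TextTokenizer().fit(docs).transform(docs)
text_vec = ItemCountVectorizer().fit(tokens)
text_counts = text_vec.transform(tokens)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]

# Non-text use case: the second step alone, on bags of product ids.
baskets = [[101, 205, 101], [205, 307]]
item_vec = ItemCountVectorizer().fit(baskets)
item_counts = item_vec.transform(baskets)  # [[2, 1, 0], [0, 1, 1]]
```

The second half is the point of the proposal: the counting step never sees a string, so no serialization is needed for non-text items.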

Describe alternatives you've considered, if relevant

In some but not all cases one could force this behavior (vectorizing item frequencies) by carefully serializing the bag of items into a fake string document & configuring the CountVectorizer tokenizer parameters to split on specific delimiters. But this round trip through string serialization & de-serialization is brittle, inefficient & a bit hacky.
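The brittleness of that workaround can be shown without scikit-learn at all. In the sketch below, `serialize` and `deserialize` are hypothetical helpers, and the plain `str.split` stands in for whatever delimiter-splitting tokenizer one would configure on CountVectorizer; the `|` delimiter is likewise an arbitrary choice for illustration.

```python
from collections import Counter

DELIMITER = "|"  # hypothetical separator chosen for the fake document


def serialize(bag):
    """Join a bag of items into one fake string document."""
    return DELIMITER.join(str(item) for item in bag)


def deserialize(doc):
    """Stand-in for a tokenizer that splits on the delimiter."""
    return doc.split(DELIMITER)


# An item that happens to contain the delimiter is split into two items.
bag = ["red|green", "blue", "blue"]
round_tripped = deserialize(serialize(bag))
original_counts = Counter(bag)
hacked_counts = Counter(round_tripped)

# Non-string items lose their type on the way through.
ids = [1, 2, 2]
ids_round_tripped = deserialize(serialize(ids))
```

After the round trip, `hacked_counts` holds separate entries for `"red"` and `"green"` where `original_counts` has the single item `"red|green"`, and the integer ids come back as strings. Counting the items directly avoids both problems.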

Additional context

No response

eangius added the Needs Triage and New Feature labels Mar 31, 2022
eangius commented Mar 31, 2022

If this functionality is deemed desirable, attached below is a working draft of the proposed ItemCountVectorizer, with limited functionality derived from CountVectorizer, that served our purposes: ItemCountVectorizer.py.zip

The TextTokenizer component would need to be refactored out of CountVectorizer. Also, for backwards compatibility with user pipelines depending on CountVectorizer, that class could extend Pipeline & implement the combination of these two new components.

Of course, all of this would need to be integrated, tested & documented to scikit-learn standards.

thomasjpfan (Member) commented

There has been discussion to have text preprocessing be more decoupled: #14951

From a pure Unix point of view, I see the value of decoupling CountVectorizer into two objects. If I were to design this from the start, I would want the tokenizer to be its own object. On the other hand, I am very concerned about the maintenance cost of adding two new public estimators. We are already limited in supporting our existing estimators.

Overall I am +0 with introducing TextTokenizer and ItemCountVectorizer.

thomasjpfan added the module:preprocessing and Needs Decision - Include Feature labels and removed the Needs Triage label Apr 14, 2022