Decouple CountVectorizer => TextTokenizer + ItemCountVectorizer #23004

eangius · 2022-03-31T13:15:16Z

Describe the workflow you want to enable

The CountVectorizer component is responsible not just for vectorizing term frequencies but also for internally tokenizing & normalizing the input text into terms. While this functionality is convenient for NLP pipelines, it cannot be leveraged for similar non-NLP problems where one needs to perform tf-idf (or similar) on "document" bags of general items other than text.

Describe your proposed solution

Decouple the CountVectorizer into two special-purpose, reusable components, TextTokenizer & ItemCountVectorizer, such that running both sequentially in a pipeline reproduces the current behavior.
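To make the proposed split concrete, here is a minimal pure-Python sketch of the two stages. The class names come from this issue, but everything else (method bodies, dense list-of-lists output instead of a sparse matrix, the absence of options such as n-grams or stop words) is an assumption for illustration only. It is not the attached draft and not an existing scikit-learn API.

```python
import re
from collections import Counter


class TextTokenizer:
    """Hypothetical stage 1: raw strings -> lists of lowercased word tokens."""

    def __init__(self, token_pattern=r"(?u)\b\w\w+\b"):
        # Same default pattern as CountVectorizer: words of 2+ characters.
        self._pattern = re.compile(token_pattern)

    def fit(self, X, y=None):
        return self  # stateless

    def transform(self, X):
        return [self._pattern.findall(doc.lower()) for doc in X]


class ItemCountVectorizer:
    """Hypothetical stage 2: bags of hashable items -> per-bag count rows."""

    def fit(self, X, y=None):
        # Assumes items are mutually sortable so column order is deterministic.
        items = sorted({item for bag in X for item in bag})
        self.vocabulary_ = {item: idx for idx, item in enumerate(items)}
        return self

    def transform(self, X):
        rows = []
        for bag in X:
            row = [0] * len(self.vocabulary_)
            for item, n in Counter(bag).items():
                idx = self.vocabulary_.get(item)
                if idx is not None:  # items unseen during fit are ignored
                    row[idx] = n
            rows.append(row)
        return rows


# Text use case: both stages run back to back.
docs = ["the cat sat", "the cat saw the dog"]
tokens = TextTokenizer().transform(docs)
text_counts = ItemCountVectorizer().fit(tokens).transform(tokens)

# Non-text use case: the second stage alone, on bags of arbitrary items.
baskets = [["sku-1", "sku-2", "sku-1"], ["sku-3"]]
item_counts = ItemCountVectorizer().fit(baskets).transform(baskets)
```

The point of the sketch is that the second stage never inspects strings, so the same counting logic serves both text tokens and arbitrary items.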

Describe alternatives you've considered, if relevant

In some but not all cases one could force this behavior (vectorizing item frequencies) by carefully serializing the bag of items into a fake string document & configuring the CountVectorizer tokenizer parameters to split on specific delimiters. But this round trip of serializing & de-serializing strings is brittle, inefficient & a bit hacky.

Additional context

No response

eangius · 2022-03-31T13:30:24Z

If this functionality is deemed desirable, attached below is a working draft of the proposed ItemCountVectorizer, with limited functionality derived from the CountVectorizer, that served our purposes: ItemCountVectorizer.py.zip

The TextTokenizer component would need to be refactored out of CountVectorizer. Also, for backwards compatibility with user pipelines that depend on CountVectorizer, that component could extend Pipeline and implement the combination of these two new components.

Of course, all of this would need to be integrated, tested & documented to scikit-learn standards.

thomasjpfan · 2022-04-14T15:24:59Z

There has been discussion to have text preprocessing be more decoupled: #14951

From a pure Unix point of view, I see the value of decoupling CountVectorizer into two objects. If I were to design this from the start, I would want the tokenizer to be its own object. On the other hand, I am very concerned with the maintenance cost of adding two new public estimators. We are already limited in supporting our existing estimators.

Overall I am +0 with introducing TextTokenizer and ItemCountVectorizer.

eangius added the Needs Triage and New Feature labels on Mar 31, 2022

thomasjpfan added the module:preprocessing and Needs Decision - Include Feature labels and removed the Needs Triage label on Apr 14, 2022
