[MRG+1] micro-optimize HashingVectorizer and FeatureHasher #7470

kmike · 2016-09-22T17:18:38Z

There is a common gotcha in Cython code: even after a type check, Cython compiles f.encode("utf-8") to generic code that looks up the 'encode' attribute and then calls it as a regular Python method, and that method in turn has to look up the utf-8 codec, etc. This is the slow version:

if isinstance(f, unicode):
    f = f.encode("utf-8")

On the other hand,

if isinstance(f, unicode):
    f = (<unicode>f).encode("utf-8")

compiles directly to a PyUnicode_AsUTF8String call.
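As a quick sanity check that the fast path produces the same result, the C API function can be called from plain Python through ctypes. This is only a sketch and assumes CPython 3; the name as_utf8 is made up for illustration, and the snippet checks equivalence of the output, not speed.

```python
import ctypes

# Bind the CPython C API function that the cast version compiles to.
# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8String
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.py_object

samples = ["ascii", "h\u00e9llo", "\u65e5\u672c\u8a9e"]
for text in samples:
    # Both paths must produce identical bytes.
    assert as_utf8(text) == text.encode("utf-8")
```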

Another gotcha: if a function declares an argument type (e.g. bytes) and you pass it a variable typed as object, Cython does more work, not less, because the type has to be checked at the call. When we already know the variable is bytes, casting it to bytes removes these type checks and conversions.

The third thing I fixed is the string_types lookup. In Cython, "basestring" is equivalent to six.string_types, and an isinstance check against it is compiled directly to C API calls.
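For reference, here is a pure-Python sketch of the kind of check involved. The tuple mirrors what six.string_types resolves to on each major Python version. The helper expand_string_values is hypothetical; the "name=value" rewriting follows the documented FeatureHasher treatment of string values and is not copied from _hashing.pyx.

```python
import sys

# Roughly what six.string_types resolves to on each major Python version.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821, only defined on Python 2


def expand_string_values(pairs):
    """Hypothetical helper: turn (name, "value") into ("name=value", 1)."""
    out = []
    for f, v in pairs:
        # In pure Python this is a module-level lookup plus a generic
        # isinstance call; in Cython a builtin type name lets the check
        # compile down to C API calls.
        if isinstance(v, string_types):
            f, v = "%s=%s" % (f, v), 1
        out.append((f, v))
    return out
```

For example, expand_string_values([("color", "red"), ("size", 3)]) keeps the numeric pair unchanged and rewrites the string-valued one.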

These changes make HashingVectorizer about 10% faster. Benchmark script:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

categories = ['alt.atheism', 'comp.graphics']
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
vec = HashingVectorizer()
%timeit vec.fit_transform(data_train.data[:100])

For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.
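Outside IPython, where %timeit is not available, a similar measurement can be taken with the standard timeit module. The sketch below uses a stand-in workload so that it runs without scikit-learn; best_time_ms and time_reduction are illustrative names, not part of this PR.

```python
import timeit


def best_time_ms(func, number=10, repeat=3):
    """Best per-call wall time in milliseconds over `repeat` runs."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1000.0


def time_reduction(before_ms, after_ms):
    """Fraction of wall time saved, e.g. 0.25 means 25% less time."""
    return (before_ms - after_ms) / before_ms


# Stand-in workload; with scikit-learn installed this would instead be
# lambda: vec.fit_transform(data_train.data[:100])
elapsed = best_time_ms(lambda: sorted(range(1000), reverse=True))
print("best of 3: %.3f ms per call" % elapsed)
```

With the timings quoted above, time_reduction(13.8, 11.5) evaluates to roughly 0.17 for this particular 100-document run.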


kmike · 2016-09-22T17:29:02Z

sklearn/feature_extraction/_hashing.pyx

    for x in raw_X:
        for f, v in x:
-            if isinstance(v, string_types):
+            if isinstance(v, basestring):


This requires Cython 0.20+; is that OK? If not, it can be replaced with (str, unicode), which does the same thing.

Hmm, I'd prefer using (str, unicode). We don't have published "minimum Cython requirements", but I suppose there's no harm in maintaining backwards compatibility in this case?

jnothman · 2016-09-26T23:46:12Z

LGTM

amueller · 2016-09-30T00:51:02Z

thanks

…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20

micro-optimize HashingVectorizer and FeatureHasher

4cb6fd4

kmike commented Sep 22, 2016


fix backwards compatibility for Cython < 0.20

9de996d

jnothman changed the title ~~micro-optimize HashingVectorizer and FeatureHasher~~ [MRG+1] micro-optimize HashingVectorizer and FeatureHasher Sep 26, 2016

amueller merged commit c336a43 into scikit-learn:master Sep 30, 2016

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+1] micro-optimize HashingVectorizer and FeatureHasher (scikit-le…

861d9ec

…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20

rth mentioned this pull request Jun 9, 2017

[MRG+1] Add text vectorizers benchmarks #9086

Merged

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] micro-optimize HashingVectorizer and FeatureHasher (scikit-le…

fa37ed4

…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] micro-optimize HashingVectorizer and FeatureHasher (scikit-le…

1794164

…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
