8000 [MRG+1] micro-optimize HashingVectorizer and FeatureHasher by kmike · Pull Request #7470 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+1] micro-optimize HashingVectorizer and FeatureHasher #7470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 30, 2016

Conversation

kmike
Copy link
Contributor
@kmike kmike commented Sep 22, 2016

There is a common gotcha in Cython code: even after a type check Cython compiles txt.encode('utf-8') to a code which looks up 'encode' method, then calls it as a Python method, this method has to find utf-8 codec, etc.

if isinstance(f, unicode):
    f = f.encode("utf-8")

On the other hand,

if isinstance(f, unicode):
    f = (<unicode>f).encode("utf-8")

compiles directly to a PyUnicode_AsUTF8String call.

Another gotcha is that if you declare argument type (e.g. bytes) and then pass a variable of type object, it does more work, not less work because type is checked. Because we know a variable is bytes casting it to bytes allow to remove all these type checks and conversions.

The third thing I fixed is string_types lookup. "basestring" in Cython is the same as six.string_types; isinstance check gets compiled directly to C API calls this way.

These changes allow to make HashingVectorizer about 10% faster. Benchmark script:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

categories = ['alt.atheism', 'comp.graphics']
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
vec = HashingVectorizer()
%timeit vec.fit_transform(data_train.data[:100])

For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.


This change is Reviewable

@@ -45,7 +43,7 @@ def transform(raw_X, Py_ssize_t n_features, dtype):

for x in raw_X:
for f, v in x:
if isinstance(v, string_types):
if isinstance(v, basestring):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires Cython 0.20+, is it OK? If not, it is possible to replace it with (str, unicode) which does the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'd prefer using (str, unicode). We don't have published "minimum cython requirements" but I suppose there's no harm in this case in maintaining backwards compatibility?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jnothman
Copy link
Member

LGTM

@jnothman jnothman changed the title micro-optimize HashingVectorizer and FeatureHasher [MRG+1] micro-optimize HashingVectorizer and FeatureHasher Sep 26, 2016
@amueller
Copy link
Member

thanks

@amueller amueller merged commit c336a43 into scikit-learn:master Sep 30, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…arn#7470)

* micro-optimize HashingVectorizer and FeatureHasher

* fix backwards compatibility for Cython < 0.20
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…arn#7470)

* micro-optimize HashingVectorizer and FeatureHasher

* fix backwards compatibility for Cython < 0.20
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…arn#7470)

* micro-optimize HashingVectorizer and FeatureHasher

* fix backwards compatibility for Cython < 0.20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0