-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+1] micro-optimize HashingVectorizer and FeatureHasher #7470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -45,7 +43,7 @@ def transform(raw_X, Py_ssize_t n_features, dtype): | |||
|
|||
for x in raw_X: | |||
for f, v in x: | |||
if isinstance(v, string_types): | |||
if isinstance(v, basestring): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires Cython 0.20+, is it OK? If not, it is possible to replace it with (str, unicode) which does the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'd prefer using (str, unicode). We don't have published "minimum cython requirements" but I suppose there's no harm in this case in maintaining backwards compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
LGTM |
thanks |
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
…arn#7470) * micro-optimize HashingVectorizer and FeatureHasher * fix backwards compatibility for Cython < 0.20
There is a common gotcha in Cython code: even after a type check Cython compiles
txt.encode('utf-8')
to a code which looks up 'encode' method, then calls it as a Python method, this method has to find utf-8 codec, etc.On the other hand,
compiles directly to a PyUnicode_AsUTF8String call.
Another gotcha is that if you declare argument type (e.g. bytes) and then pass a variable of type object, it does more work, not less work because type is checked. Because we know a variable is bytes casting it to bytes allow to remove all these type checks and conversions.
The third thing I fixed is
string_types
lookup. "basestring" in Cython is the same as six.string_types; isinstance check gets compiled directly to C API calls this way.These changes allow to make HashingVectorizer about 10% faster. Benchmark script:
For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.
This change is