[MRG+1] micro-optimize HashingVectorizer and FeatureHasher #7470
There is a common gotcha in Cython code: even after a type check, Cython compiles
txt.encode('utf-8')
to code that looks up the 'encode' method and then calls it as a Python method; this method has to find the utf-8 codec, etc. On the other hand, the same call on a value that Cython knows statically to be unicode
compiles directly to a PyUnicode_AsUTF8String call.
Another gotcha is that if you declare an argument type (e.g. bytes) and then pass a variable of type object, Cython does more work, not less, because the type has to be checked. Since we know the variable is bytes, casting it to bytes removes all these type checks and conversions.
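A minimal Cython sketch of the two gotchas described above. The function names (`encode_generic`, `encode_typed`, `consume`, `call_checked`, `call_unchecked`) are hypothetical and are not taken from this PR's diff; the snippet only illustrates the general pattern and needs to be compiled with Cython to run.

```cython
# cython: language_level=3
# Hypothetical illustration only; not the code changed in this PR.

def encode_generic(txt):
    # txt is an untyped Python object, so even after the isinstance check
    # Cython emits a generic attribute lookup for "encode" followed by a
    # regular Python method call.
    if isinstance(txt, unicode):
        return txt.encode('utf-8')
    return txt

def encode_typed(txt):
    if isinstance(txt, unicode):
        # The explicit cast gives Cython a static type for the expression,
        # which lets it call the C-API UTF-8 encoder directly.
        return (<unicode>txt).encode('utf-8')
    return txt

cdef Py_ssize_t consume(bytes data):
    return len(data)

def call_checked(obj):
    # obj is typed as object, so Cython inserts a runtime type check
    # before passing it to an argument declared as bytes.
    return consume(obj)

def call_unchecked(obj):
    # <bytes>obj is an unchecked cast: only safe when the caller already
    # knows obj is bytes (or None).
    return consume(<bytes>obj)
```

The unchecked cast trades safety for speed, so it fits only where a preceding check or the program's structure already guarantees the type.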
The third thing I fixed is the string_types lookup. "basestring" in Cython is the same as six.string_types; this way the isinstance check gets compiled directly to C API calls.
These changes make HashingVectorizer about 10% faster. Benchmark script:
For me it runs in 13.8 ms before the changes and 11.5 ms after the changes.
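The benchmark script itself is not reproduced here. As a rough sketch of how a before/after timing like the one above could be measured, the following uses only `timeit`; the helper name `best_time_ms`, the placeholder corpus, and the repeat counts are assumptions, not the author's script, and the scikit-learn part runs only if the library is installed.

```python
import timeit


def best_time_ms(func, repeat=5, number=10):
    """Return the best per-call wall time of func() in milliseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition;
    # the minimum is the least noisy estimate.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number * 1000.0


if __name__ == "__main__":
    try:
        from sklearn.feature_extraction.text import HashingVectorizer
    except ImportError:
        HashingVectorizer = None

    if HashingVectorizer is not None:
        # Placeholder corpus (assumption): any list of strings works.
        docs = ["the quick brown fox jumps over the lazy dog"] * 1000
        vec = HashingVectorizer()
        print("transform: %.1f ms" % best_time_ms(lambda: vec.transform(docs)))
```

Running the same script against builds from before and after the change gives comparable numbers, provided the corpus and machine stay the same.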