8000 ENH Adds FeatureHasher support to pypy by thomasjpfan · Pull Request #23023 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Adds FeatureHasher support to pypy #23023

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 8, 2022

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #11540

What does this implement/fix? Explain your changes.

This PR enables FeatureHasher to work on pypy by using c++ vectors in the transform call. Running this benchmark:

Benchmark
import numpy as np
import re
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
from time import perf_counter
from statistics import mean, stdev
from sklearn.feature_extraction._hashing_fast import transform as _hashing_transform


def tokens(doc):
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq


n_features = 2**20
n_repeats = 100

raw_data, _ = fetch_20newsgroups(subset="all", return_X_y=True)
raw_X = [token_freqs(d).items() for d in raw_data]

durations = []
for i in range(n_repeats):
    t0 = perf_counter()
    _hashing_transform(
        raw_X, n_features, np.float64, alternate_sign=True, seed=n_repeats
    )
    duration = perf_counter() - t0
    durations.append(duration)

mean_dur = mean(durations)
std_dur = stdev(durations)
print(f"{mean_dur:.4f} +/- {std_dur:.4f}")

Here are my results:

0.2824 +/- 0.0025 : PR
0.2892 +/- 0.0020 : main

Any other comments?

Int32 and Int64 were added to the vector sentinel to support the hasher.

Copy link
Member
@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. C++ rocks.

@@ -64,6 +68,42 @@ cdef class StdVectorSentinelIntP(StdVectorSentinel):
return ITYPECODE


cdef class StdVectorSentinelInt32(StdVectorSentinel):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look like tempita material 😄

@jeremiedbb
Copy link
Member

THe pypy run did not get it through :(

@thomasjpfan
Copy link
Member Author

Currently, PyPy already timeouts on main. I increased the timeout duration for PyPy in this PR so the test has a chance to complete.

Copy link
Member
@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lorentzenchr
Copy link
Member

A green pypy CI would be nice. After that we should revert c008f80 (don't we?) and then merge.

@jeremiedbb
Copy link
Member

After that we should revert c008f80 (don't we?) and then merge.

We already merged this change in main (#23049). We need to keep it to not have this job time out when we trigger it

@lorentzenchr lorentzenchr merged commit b2632de into scikit-learn:main Apr 8, 2022
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FeatureHasher support in PyPy
3 participants
0