Merge pull request #909 from larsmans/hashing-trick · ludgate/scikit-learn@c89d208

Commit c89d208

Merge pull request scikit-learn#909 from larsmans/hashing-trick
MRG feature hashing transformer
2 parents c370f53 + 3de0f43 commit c89d208

File tree

13 files changed: +7012 -12 lines

.gitattributes

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@
 /sklearn/cluster/_k_means.c -diff
 /sklearn/datasets/_svmlight_format.c -diff
 /sklearn/ensemble/_gradient_boosting.c -diff
+/sklearn/feature_extraction/_hashing.c -diff
 /sklearn/linear_model/cd_fast.c -diff
 /sklearn/linear_model/sgd_fast.c -diff
 /sklearn/neighbors/ball_tree.c -diff

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions

@@ -281,6 +281,7 @@ Samples generator
    :template: class.rst

    feature_extraction.DictVectorizer
+   feature_extraction.FeatureHasher

 From images
 -----------

doc/modules/feature_extraction.rst

Lines changed: 86 additions & 0 deletions

@@ -102,6 +102,92 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
 default instead of a ``numpy.ndarray``.


+.. _feature_hashing:
+
+Feature hashing
+===============
+
+.. currentmodule:: sklearn.feature_extraction
+
+The class :class:`FeatureHasher` is a high-speed, low-memory vectorizer that
+uses a technique known as
+`feature hashing <https://en.wikipedia.org/wiki/Feature_hashing>`_,
+or the "hashing trick".
+Instead of building a hash table of the features encountered in training,
+as the vectorizers do, instances of :class:`FeatureHasher`
+apply a hash function to the features
+to determine their column index in sample matrices directly.
+The result is increased speed and reduced memory usage,
+at the expense of inspectability;
+the hasher does not remember what the input features looked like
+and has no ``inverse_transform`` method.
+
+Since the hash function might cause collisions between (unrelated) features,
+a signed hash function is used and the sign of the hash value
+determines the sign of the value stored in the output matrix for a feature;
+this way, collisions are likely to cancel out rather than accumulate error,
+and the expected mean of any output feature's value is zero.
+
+If ``non_negative=True`` is passed to the constructor,
+the absolute value is taken.
+This undoes some of the collision handling,
+but allows the output to be passed to estimators like :class:`MultinomialNB`
+or ``chi2`` feature selectors that expect non-negative inputs.
+
+:class:`FeatureHasher` accepts either mappings
+(like Python's ``dict`` and its variants in the ``collections`` module),
+``(feature, value)`` pairs, or strings,
+depending on the constructor parameter ``input_type``.
+Mappings are treated as lists of ``(feature, value)`` pairs,
+while single strings have an implicit value of 1.
+If a feature occurs multiple times in a sample, the values will be summed.
+Feature hashing can be employed in document classification,
+but unlike :class:`text.CountVectorizer`,
+:class:`FeatureHasher` does not do word
+splitting or any other preprocessing except Unicode-to-UTF-8 encoding.
+The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
+in the CSR format.
+
+As an example, consider a word-level natural language processing task
+that needs features extracted from ``(token, part_of_speech)`` pairs.
+One could use a Python generator function to extract features::
+
+    def token_features(token, part_of_speech):
+        if token.isdigit():
+            yield "numeric"
+        else:
+            yield "token={}".format(token.lower())
+            yield "token,pos={},{}".format(token, part_of_speech)
+        if token[0].isupper():
+            yield "uppercase_initial"
+        if token.isupper():
+            yield "all_uppercase"
+        yield "pos={}".format(part_of_speech)
+
+Then, the ``raw_X`` to be fed to ``FeatureHasher.transform``
+can be constructed using::
+
+    raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
+
+and fed to a hasher with::
+
+    hasher = FeatureHasher(input_type="string")
+    X = hasher.transform(raw_X)
+
+to get a ``scipy.sparse`` matrix ``X``.
+
+Note the use of a generator comprehension,
+which introduces laziness into the feature extraction:
+tokens are only processed on demand from the hasher.
+
+
+.. topic:: References:
+
+ * Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
+   Josh Attenberg (2009). `Feature hashing for large scale multitask learning
+   <http://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.
+
 .. _text_feature_extraction:

 Text feature extraction
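As an editorial aside, the signed-hash scheme that the new documentation describes can be sketched in a few lines of plain Python. This is an illustration only, not code from this commit: the actual FeatureHasher is implemented in Cython on top of MurmurHash3, and ``hash_vectorize`` is a hypothetical name; ``md5`` merely stands in as a convenient deterministic hash::

    import hashlib

    import scipy.sparse as sp

    def hash_vectorize(raw_X, n_features=2 ** 10):
        """Hash iterables of (feature, value) pairs into a CSR matrix."""
        indices, data, indptr = [], [], [0]
        for x in raw_X:
            for feature, value in x:
                h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
                indices.append(h % n_features)  # the hash picks the column
                # One hash bit picks the sign, so colliding features tend
                # to cancel out instead of accumulating error.
                data.append(value if (h >> 127) & 1 else -value)
            indptr.append(len(indices))
        X = sp.csr_matrix((data, indices, indptr),
                          shape=(len(indptr) - 1, n_features))
        X.sum_duplicates()  # merge entries that landed in the same column
        return X

    X = hash_vectorize([[("dog", 1), ("cat", 2)], [("dog", 1)]])
    print(X.shape)  # (2, 1024)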

doc/whats_new.rst

Lines changed: 6 additions & 1 deletion

@@ -58,7 +58,12 @@ Changelog
 - New estimator :class:`preprocessing.OneHotEncoder` to compute
   binary encodings of categorical features by `Andreas Müller`_.

-- Faster implementation of :func:`metrics.precision_recall_curve` by Conrad Lee.
+- Faster implementation of :func:`metrics.precision_recall_curve` by
+  Conrad Lee.
+
+- New :class:`feature_extraction.FeatureHasher`, implementing the
+  "hashing trick" for fast, low-memory feature extraction from string data
+  by `Lars Buitinck`_.


 API changes summary
examples/hashing_vs_dict_vectorizer.py

Lines changed: 101 additions & 0 deletions
This file is entirely new; the 101 added lines are shown without diff markers:

    """Compares FeatureHasher and DictVectorizer by using both to vectorize
    text documents.

    The example demonstrates syntax and speed only; it doesn't actually do
    anything useful with the extracted vectors. See the example scripts
    {document_classification_20newsgroups,clustering}.py for actual learning
    on text documents.

    A discrepancy between the number of terms reported for DictVectorizer and
    for FeatureHasher is to be expected due to hash collisions.
    """

    # Author: Lars Buitinck <L.J.Buitinck@uva.nl>
    # License: 3-clause BSD

    from __future__ import print_function
    from collections import defaultdict
    import re
    import sys
    from time import time

    import numpy as np

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction import DictVectorizer, FeatureHasher


    def n_nonzero_columns(X):
        """Returns the number of non-zero columns in a CSR matrix X."""
        return len(np.unique(X.nonzero()[1]))


    def tokens(doc):
        """Extract tokens from doc.

        This uses a simple regex to break strings into tokens. For a more
        principled approach, see CountVectorizer or TfidfVectorizer.
        """
        return (tok.lower() for tok in re.findall(r"\w+", doc))


    def token_freqs(doc):
        """Extract a dict mapping tokens from doc to their frequencies."""
        freq = defaultdict(int)
        for tok in tokens(doc):
            freq[tok] += 1
        return freq


    categories = [
        'alt.atheism',
        'comp.graphics',
        'comp.sys.ibm.pc.hardware',
        'misc.forsale',
        'rec.autos',
        'sci.space',
        'talk.religion.misc',
    ]
    # Uncomment the following line to use a larger set (11k+ documents)
    #categories=None

    print(__doc__)
    print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
    print("    The default number of features is 2**18.")
    print()

    try:
        n_features = int(sys.argv[1])
    except IndexError:
        n_features = 2 ** 18
    except ValueError:
        print("not a valid number of features: %r" % sys.argv[1])
        sys.exit(1)

    print("Loading 20 newsgroups training data")
    raw_data = fetch_20newsgroups(subset='train', categories=categories).data
    print("%d documents" % len(raw_data))
    print()

    print("DictVectorizer")
    t0 = time()
    vectorizer = DictVectorizer()
    vectorizer.fit_transform(token_freqs(d) for d in raw_data)
    print("done in %fs" % (time() - t0))
    print("Found %d unique terms" % len(vectorizer.get_feature_names()))
    print()

    print("FeatureHasher on frequency dicts")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features)
    X = hasher.transform(token_freqs(d) for d in raw_data)
    print("done in %fs" % (time() - t0))
    print("Found %d unique terms" % n_nonzero_columns(X))
    print()

    print("FeatureHasher on raw tokens")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    X = hasher.transform(tokens(d) for d in raw_data)
    print("done in %fs" % (time() - t0))
    print("Found %d unique terms" % n_nonzero_columns(X))

sklearn/feature_extraction/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@
 """

 from .dict_vectorizer import DictVectorizer
+from .hashing import FeatureHasher
 from .image import img_to_graph, grid_to_graph
 from . import text

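With this re-export in place, ``FeatureHasher`` becomes importable straight from ``sklearn.feature_extraction``. A quick sketch of the three ``input_type`` modes described in the new documentation (an illustration assuming a build that includes this commit)::

    from sklearn.feature_extraction import FeatureHasher

    # dict input (the default): mappings of feature -> value
    h = FeatureHasher(n_features=8)
    X = h.transform([{"dog": 1, "cat": 2}])

    # pair input: iterables of (feature, value) pairs
    h = FeatureHasher(n_features=8, input_type="pair")
    X = h.transform([[("dog", 1), ("cat", 2)]])

    # string input: each string is a feature with an implicit value of 1;
    # repeats are summed, so "a" below contributes +/-2 to its column
    h = FeatureHasher(n_features=8, input_type="string")
    X = h.transform([["a", "b", "a"]])
    print(X.toarray())  # dense view of the (1, 8) CSR output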
0 commit comments