@@ -206,8 +206,9 @@ Note the use of a generator comprehension,
 which introduces laziness into the feature extraction:
 tokens are only processed on demand from the hasher.
 
-Implementation details
-----------------------
+|details-start|
+**Implementation details**
+|details-split|
 
 :class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3.
 As a result (and because of limitations in ``scipy.sparse``),
@@ -223,16 +224,18 @@ Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the ``n_features`` parameter;
 otherwise the features will not be mapped evenly to the columns.
 
+.. topic:: References:
+
+  * `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
+
+|details-end|
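The power-of-two advice above can be sketched with a minimal ``FeatureHasher`` call (the token lists are made-up illustrations, not from the documentation):

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is a power of two, so the modulo of the signed 32-bit
# MurmurHash3 value spreads tokens evenly over the columns.
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform([["cat", "dog", "cat"], ["bird"]])
print(X.shape)  # (2, 1024)
```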
 
 .. topic:: References:
 
  * Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
    Josh Attenberg (2009). `Feature hashing for large scale multitask learning
    <https://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.
 
- * `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
-
 
 .. _text_feature_extraction:
 
 Text feature extraction
@@ -395,8 +398,9 @@ last document::
 
 .. _stop_words:
 
-Using stop words
-................
+|details-start|
+**Using stop words**
+|details-split|
 
 Stop words are words like "and", "the", "him", which are presumed to be
 uninformative in representing the content of a text, and which may be
@@ -426,6 +430,9 @@ identify and warn about some kinds of inconsistencies.
     <https://aclweb.org/anthology/W18-2502>`__.
     In *Proc. Workshop for NLP Open Source Software*.
 
+
+|details-end|
+
 
 .. _tfidf:
 
 Tf–idf term weighting
@@ -490,6 +497,10 @@ class::
 Again, please see the :ref:`reference documentation
 <text_feature_extraction_ref>` for the details on all the parameters.
 
+|details-start|
+**Numeric example of a tf-idf matrix**
+|details-split|
+
 Let's take an example with the following counts. The first term is present
 100% of the time, hence not very interesting. The two other features occur
 in less than 50% of the documents, hence probably more representative of the
@@ -609,6 +620,7 @@ feature extractor with a classifier:
 
 * :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`
 
+|details-end|
 
 Decoding text files
 -------------------
@@ -637,6 +649,10 @@ or ``"replace"``. See the documentation for the Python function
 ``bytes.decode`` for more details
 (type ``help(bytes.decode)`` at the Python prompt).
 
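The error handlers mentioned above can be illustrated with plain ``bytes.decode``, no scikit-learn involved (the byte string is a made-up example):

```python
# 'café' encoded as Latin-1; the trailing \xe9 is not valid UTF-8.
raw = b"caf\xe9"
print(raw.decode("latin-1"))                  # café
# errors="replace" substitutes U+FFFD for undecodable bytes
print(raw.decode("utf-8", errors="replace"))  # caf<U+FFFD>
```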
+|details-start|
+**Troubleshooting decoding text**
+|details-split|
+
 If you are having trouble decoding text, here are some things to try:
 
 - Find out what the actual encoding of the text is. The file might come
@@ -690,6 +706,7 @@ About Unicode <https://www.joelonsoftware.com/articles/Unicode.html>`_.
 
 .. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy
 
+|details-end|
 
 Applications and examples
 -------------------------
@@ -870,8 +887,9 @@ The :class:`HashingVectorizer` also comes with the following limitations:
   model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
   required.
 
-Performing out-of-core scaling with HashingVectorizer
------------------------------------------------------
+|details-start|
+**Performing out-of-core scaling with HashingVectorizer**
+|details-split|
 
 An interesting development of using a :class:`HashingVectorizer` is the ability
 to perform `out-of-core`_ scaling. This means that we can learn from data that
@@ -890,6 +908,8 @@ time is often limited by the CPU time one wants to spend on the task.
 
 For a full-fledged example of out-of-core scaling in a text classification
 task see :ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`.
 
+|details-end|
+
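A minimal out-of-core sketch: because ``HashingVectorizer`` is stateless, each mini-batch can be vectorized independently and fed to an incremental learner via ``partial_fit`` (the toy batches and labels are made up):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer needs no fit, so batches never have to coexist in memory.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()
for texts, y in [(["good movie", "great plot"], [1, 1]),
                 (["bad movie", "awful plot"], [0, 0])]:
    X = vectorizer.transform(texts)       # vectorize one mini-batch at a time
    clf.partial_fit(X, y, classes=[0, 1])  # incremental update of the model
```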
 
 Customizing the vectorizer classes
 ----------------------------------
@@ -928,6 +948,10 @@ parameters it is possible to derive from the class and override the
 ``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
 factory methods instead of passing custom functions.
 
+|details-start|
+**Tips and tricks**
+|details-split|
+
 Some tips and tricks:
 
 * If documents are pre-tokenized by an external package, then store them in
@@ -982,6 +1006,8 @@ Some tips and tricks:
 
 Customizing the vectorizer can also be useful when handling Asian languages
 that do not use an explicit word separator such as whitespace.
 
+|details-end|
+
 
 .. _image_feature_extraction:
 
 Image feature extraction