DOC Added dropdowns to 6.2 feature-extraction (#26807) · REDVM/scikit-learn@5ba4c82 · GitHub

Commit 5ba4c82

Kishan-Ved authored and REDVM committed
DOC Added dropdowns to 6.2 feature-extraction (scikit-learn#26807)
1 parent c96d06b commit 5ba4c82

1 file changed: +35 -9 lines changed

doc/modules/feature_extraction.rst

Lines changed: 35 additions & 9 deletions
@@ -206,8 +206,9 @@ Note the use of a generator comprehension,
 which introduces laziness into the feature extraction:
 tokens are only processed on demand from the hasher.
 
-Implementation details
-----------------------
+|details-start|
+**Implementation details**
+|details-split|
 
 :class:`FeatureHasher` uses the signed 32-bit variant of MurmurHash3.
 As a result (and because of limitations in ``scipy.sparse``),
@@ -223,16 +224,18 @@ Since a simple modulo is used to transform the hash function to a column index,
 it is advisable to use a power of two as the ``n_features`` parameter;
 otherwise the features will not be mapped evenly to the columns.
 
+.. topic:: References:
+
+ * `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
+
+|details-end|
 
 .. topic:: References:
 
  * Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and
   Josh Attenberg (2009). `Feature hashing for large scale multitask learning
   <https://alex.smola.org/papers/2009/Weinbergeretal09.pdf>`_. Proc. ICML.
 
- * `MurmurHash3 <https://github.com/aappleby/smhasher>`_.
-
-
 .. _text_feature_extraction:
 
 Text feature extraction
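
To illustrate the advice in the two hunks above, here is a minimal sketch (not part of this diff) of hashing a lazy token stream with a power-of-two ``n_features``; the documents and the ``2**18`` size are illustrative:

    from sklearn.feature_extraction import FeatureHasher

    # input_type="string" hashes raw tokens, each with an implicit count of 1
    hasher = FeatureHasher(n_features=2**18, input_type="string")

    # a generator comprehension keeps tokenization lazy, as the section notes
    raw_docs = ["the quick brown fox", "jumped over the lazy dog"]
    token_stream = (doc.split() for doc in raw_docs)

    X = hasher.transform(token_stream)  # scipy.sparse matrix, shape (2, 262144)
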
@@ -395,8 +398,9 @@ last document::
 
 .. _stop_words:
 
-Using stop words
-................
+|details-start|
+**Using stop words**
+|details-split|
 
 Stop words are words like "and", "the", "him", which are presumed to be
 uninformative in representing the content of a text, and which may be
@@ -426,6 +430,9 @@ identify and warn about some kinds of inconsistencies.
       <https://aclweb.org/anthology/W18-2502>`__.
       In *Proc. Workshop for NLP Open Source Software*.
 
+
+|details-end|
+
 .. _tfidf:
 
 Tf–idf term weighting
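
As a sketch of the stop-word behaviour documented in the hunk above (the documents are illustrative), CountVectorizer's built-in English list drops words such as "and", "the" and "him":

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat and the hat", "the dog chased him"]
    vect = CountVectorizer(stop_words="english")
    X = vect.fit_transform(docs)

    # only content words survive: ['cat', 'chased', 'dog', 'hat']
    print(vect.get_feature_names_out())
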
@@ -490,6 +497,10 @@ class::
 Again please see the :ref:`reference documentation
 <text_feature_extraction_ref>` for the details on all the parameters.
 
+|details-start|
+**Numeric example of a tf-idf matrix**
+|details-split|
+
 Let's take an example with the following counts. The first term is present
 100% of the time hence not very interesting. The two other features only
 in less than 50% of the time hence probably more representative of the
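
A runnable sketch in the spirit of the numeric example this hunk wraps; the count matrix below is illustrative, with the first term present in every document and the other two rarer:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    # rows are documents, columns are terms; term 1 occurs in every document
    counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]])

    # smooth_idf=True and norm="l2" are the defaults, spelled out here
    tfidf = TfidfTransformer(smooth_idf=True, norm="l2").fit_transform(counts)
    print(tfidf.toarray())  # the ubiquitous first term gets the lowest idf
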
@@ -609,6 +620,7 @@ feature extractor with a classifier:
 
 * :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`
 
+|details-end|
 
 Decoding text files
 -------------------
@@ -637,6 +649,10 @@ or ``"replace"``. See the documentation for the Python function
 ``bytes.decode`` for more details
 (type ``help(bytes.decode)`` at the Python prompt).
 
+|details-start|
+**Troubleshooting decoding text**
+|details-split|
+
 If you are having trouble decoding text, here are some things to try:
 
 - Find out what the actual encoding of the text is. The file might come
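
As a sketch of the first troubleshooting step, one can probe bytes of unknown encoding with ``bytes.decode``; the payload below is illustrative:

    payload = b"Sebasti\xc3\xa1n"  # UTF-8 bytes of unknown provenance

    try:
        text = payload.decode("utf-8")  # strict: raises on a wrong guess
    except UnicodeDecodeError:
        text = payload.decode("utf-8", errors="replace")  # lossy, never fails

    print(text)  # 'Sebastián'
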
@@ -690,6 +706,7 @@ About Unicode <https://www.joelonsoftware.com/articles/Unicode.html>`_.
 
 .. _`ftfy`: https://github.com/LuminosoInsight/python-ftfy
 
+|details-end|
 
 Applications and examples
 -------------------------
@@ -870,8 +887,9 @@ The :class:`HashingVectorizer` also comes with the following limitations:
   model. A :class:`TfidfTransformer` can be appended to it in a pipeline if
   required.
 
-Performing out-of-core scaling with HashingVectorizer
-------------------------------------------------------
+|details-start|
+**Performing out-of-core scaling with HashingVectorizer**
+|details-split|
 
 An interesting development of using a :class:`HashingVectorizer` is the ability
 to perform `out-of-core`_ scaling. This means that we can learn from data that
@@ -890,6 +908,8 @@ time is often limited by the CPU time one wants to spend on the task.
 For a full-fledged example of out-of-core scaling in a text classification
 task see :ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`.
 
+|details-end|
+
 Customizing the vectorizer classes
 ----------------------------------
 
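A minimal sketch of the out-of-core pattern wrapped by the hunk above: the stateless :class:`HashingVectorizer` only transforms, so mini-batches can stream into ``partial_fit``. The batches and the SGDClassifier choice are illustrative:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit step
    clf = SGDClassifier()

    classes = [0, 1]  # partial_fit needs the full label set on the first call
    batches = [(["good film", "great plot"], [1, 1]),
               (["dull film", "bad plot"], [0, 0])]  # stand-ins for disk reads

    for docs, labels in batches:
        X = vectorizer.transform(docs)  # transform only; memory stays bounded
        clf.partial_fit(X, labels, classes=classes)
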
@@ -928,6 +948,10 @@ parameters it is possible to derive from the class and override the
 ``build_preprocessor``, ``build_tokenizer`` and ``build_analyzer``
 factory methods instead of passing custom functions.
 
+|details-start|
+**Tips and tricks**
+|details-split|
+
 Some tips and tricks:
 
 * If documents are pre-tokenized by an external package, then store them in
@@ -982,6 +1006,8 @@ Some tips and tricks:
 Customizing the vectorizer can also be useful when handling Asian languages
 that do not use an explicit word separator such as whitespace.
 
+|details-end|
+
 .. _image_feature_extraction:
 
 Image feature extraction
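
Finally, a sketch of the ``build_analyzer`` override route mentioned before the tips list in the previous hunk; the subclass name and the bigram twist are hypothetical:

    from sklearn.feature_extraction.text import CountVectorizer

    class BigramAugmentedVectorizer(CountVectorizer):  # hypothetical subclass
        def build_analyzer(self):
            analyze = super().build_analyzer()  # stock preprocessing/tokenizing
            def analyzer(doc):
                tokens = analyze(doc)
                # append word bigrams to the stock unigram features
                return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
            return analyzer

    vect = BigramAugmentedVectorizer()
    X = vect.fit_transform(["the quick brown fox"])
    print(vect.get_feature_names_out())  # 4 unigrams plus 3 bigrams
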

0 commit comments