[MRG] skip dataset downloading doctest by aozgaa · Pull Request #11284 · scikit-learn/scikit-learn · GitHub

[MRG] skip dataset downloading doctest #11284

Merged
8 changes: 7 additions & 1 deletion doc/conftest.py
@@ -6,6 +6,8 @@
 from sklearn.utils.testing import SkipTest
 from sklearn.utils.testing import check_skip_network
 from sklearn.datasets import get_data_home
+from sklearn.datasets.base import _pkl_filepath
+from sklearn.datasets.twenty_newsgroups import CACHE_NAME
 from sklearn.utils.testing import install_mldata_mock
 from sklearn.utils.testing import uninstall_mldata_mock

@@ -47,12 +49,16 @@ def setup_rcv1():

 def setup_twenty_newsgroups():
     data_home = get_data_home()
-    if not exists(join(data_home, '20news_home')):
+    cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
+    if not exists(cache_path):
         raise SkipTest("Skipping dataset loading doctests")


 def setup_working_with_text_data():
     check_skip_network()
+    cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
+    if not exists(cache_path):
+        raise SkipTest("Skipping dataset loading doctests")


 def setup_compose():
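
The conftest change above skips the doctests unless the pickled dataset cache is already on disk. A minimal, standard-library-only sketch of that check follows. `pkl_filepath` is a hypothetical stand-in for the private `_pkl_filepath` helper, assumed here to join path parts and tag the file name for Python 3, and the cache name `"20news-bydate.pkz"` is likewise an assumption about the value of `CACHE_NAME`.

```python
import os
import sys
import tempfile
from unittest import SkipTest


def pkl_filepath(*parts, py3_suffix="_py3"):
    """Hypothetical stand-in for sklearn's private `_pkl_filepath`.

    Assumption: it joins the path parts and, on Python 3, inserts a
    suffix before the file extension.
    """
    base, ext = os.path.splitext(parts[-1])
    if sys.version_info[0] >= 3:
        base += py3_suffix
    return os.path.join(*parts[:-1], base + ext)


def skip_unless_cached(data_home, cache_name):
    """Raise SkipTest unless the pickled dataset cache already exists."""
    cache_path = pkl_filepath(data_home, cache_name)
    if not os.path.exists(cache_path):
        raise SkipTest("Skipping dataset loading doctests")
    return cache_path


with tempfile.TemporaryDirectory() as data_home:
    # Empty data home: the doctests would be skipped.
    try:
        skip_unless_cached(data_home, "20news-bydate.pkz")
        skipped_when_empty = False
    except SkipTest:
        skipped_when_empty = True

    # Simulate an earlier download by creating the cache file.
    open(pkl_filepath(data_home, "20news-bydate.pkz"), "wb").close()
    found_name = os.path.basename(
        skip_unless_cached(data_home, "20news-bydate.pkz")
    )
```

Checking for the cache file itself, rather than for a directory that may exist after an incomplete download, keeps the doctests from triggering a network fetch.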
24 changes: 14 additions & 10 deletions doc/datasets/twenty_newsgroups.rst
@@ -62,7 +62,7 @@ attribute is the integer index of the category::
   >>> newsgroups_train.target.shape
   (11314,)
   >>> newsgroups_train.target[:10]
-  array([12, 6, 9, 8, 6, 7, 9, 2, 13, 19])
+  array([ 7, 4, 4, 1, 14, 16, 13, 3, 2, 4])

 It is possible to load only a sub-selection of the categories by passing the
 list of the categories to load to the
@@ -78,7 +78,7 @@ list of the categories to load to the
   >>> newsgroups_train.target.shape
   (1073,)
   >>> newsgroups_train.target[:10]
-  array([1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
+  array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])

 Converting text to vectors
 --------------------------
@@ -105,7 +105,7 @@ components by sample in a more than 30000-dimensional space
 (less than .5% non-zero features)::

   >>> vectors.nnz / float(vectors.shape[0])
-  159.01327433628319
+  159.01327...

 :func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which returns
 ready-to-use tfidf features instead of file names.
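
Truncating the expected float to `159.01327...` only passes if the doctest runner treats `...` as a wildcard. The sketch below uses the standard library's `doctest.OutputChecker` to show the difference; that the documentation test run enables the `ELLIPSIS` option is an assumption, not something shown in this diff.

```python
import doctest

checker = doctest.OutputChecker()

want = "159.01327...\n"        # expected output as written in the docs
got = "159.01327433628319\n"   # full-precision value printed at run time

# Without ELLIPSIS the strings must match exactly, so the check fails.
strict_match = checker.check_output(want, got, 0)

# With ELLIPSIS, "..." matches any substring, so trailing digits are ignored.
relaxed_match = checker.check_output(want, got, doctest.ELLIPSIS)
```

This makes the doctest robust to small floating-point differences across platforms and library versions while still pinning the leading digits.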
@@ -131,9 +131,11 @@ which is fast to train and achieves a decent F-score::
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
+  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
+
   >>> pred = clf.predict(vectors_test)
   >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
-  0.88213592402729568
+  0.88213...

 (The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
 the training and test data, instead of segmenting by time, and in that case
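
For readers unfamiliar with `average='macro'`, the score reported above is the unweighted mean of the per-class F1 scores. The following is a pure-Python sketch of that definition, not the library's implementation; the helper name `macro_f1` and the toy labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: per-class F1 is 0.5, 0.8 and 2/3 for classes 0, 1 and 2.
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Because every class counts equally, a classifier that does poorly on a small newsgroup is penalized as much as one that does poorly on a large one.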
@@ -150,10 +152,10 @@ Let's take a look at what the most informative features are:
   ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
   ...
   >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
-  alt.atheism: sgi livesey atheists writes people caltech com god keith edu
-  comp.graphics: organization thanks files subject com image lines university edu graphics
-  sci.space: toronto moon gov com alaska access henry nasa edu space
-  talk.religion.misc: article writes kent people christian jesus sandvik edu com god
+  alt.atheism: edu it and in you that is of to the
+  comp.graphics: edu in graphics it is for and of to the
+  sci.space: edu it that is in and space to of the
+  talk.religion.misc: not it you in is that and to of the

 You can now see many things that these features have overfit to:

@@ -183,7 +185,7 @@ blocks, and quotation blocks respectively.
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
   >>> metrics.f1_score(pred, newsgroups_test.target, average='macro')
-  0.77310350681274775
+  0.77310...

 This classifier lost a lot of its F-score, just because we removed
 metadata that has little to do with topic classification.
@@ -195,10 +197,12 @@ It loses even more if we also strip this metadata from the training data:
   >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
   >>> clf = MultinomialNB(alpha=.01)
   >>> clf.fit(vectors, newsgroups_train.target)
+  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
+
   >>> vectors_test = vectorizer.transform(newsgroups_test.data)
   >>> pred = clf.predict(vectors_test)
   >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
-  0.76995175184521725
+  0.76995...

 Some other classifiers cope better with this harder version of the task. Try
 running :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py` with and without