[MRG] Openml data loader (#11419) · scikit-learn/scikit-learn@ab82f57 · GitHub

Commit ab82f57

janvanrijn authored and jnothman committed
[MRG] Openml data loader (#11419)
1 parent 98af001 commit ab82f57

50 files changed: +2,291 / -2 lines (large commit; only a subset of the changed files is shown below)

MANIFEST.in

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ include *.rst
 recursive-include doc *
 recursive-include examples *
 recursive-include sklearn *.c *.h *.pyx *.pxd *.pxi
-recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt
+recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt *.arff.gz *.json.gz
 include COPYING
 include AUTHORS.rst
 include README.rst

doc/datasets/openml.rst

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+..
+    For doctests:
+
+    >>> import numpy as np
+    >>> import os
+
+
+.. _openml:
+
+Downloading datasets from the openml.org repository
+===================================================
+
+`openml.org <https://openml.org>`_ is a public repository for machine learning
+data and experiments that allows everybody to upload open datasets.
+
+The ``sklearn.datasets`` package is able to download datasets
+from the repository using the function
+:func:`sklearn.datasets.fetch_openml`.
+
+For example, to download a dataset of gene expressions in mice brains::
+
+  >>> from sklearn.datasets import fetch_openml
+  >>> mice = fetch_openml(name='miceprotein', version=4)
+
+To fully specify a dataset, you need to provide a name and a version, though
+the version is optional; see :ref:`openml_versions` below.
+The dataset contains a total of 1080 examples belonging to 8 different
+classes::
+
+  >>> mice.data.shape
+  (1080, 77)
+  >>> mice.target.shape
+  (1080,)
+  >>> np.unique(mice.target) # doctest: +NORMALIZE_WHITESPACE
+  array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)
+
+You can get more information on the dataset by looking at the ``DESCR``
+and ``details`` attributes::
+
+  >>> print(mice.DESCR) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
+  **Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
+  **Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
+  **Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
+  Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
+  Syndrome. PLoS ONE 10(6): e0129126...
+
+  >>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
+  {'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
+  'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
+  'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
+  'file_id': '17928620', 'default_target_attribute': 'class',
+  'row_id_attribute': 'MouseID',
+  'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
+  'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
+  'visibility': 'public', 'status': 'active',
+  'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}
+
+
+The ``DESCR`` contains a free-text description of the data, while ``details``
+contains a dictionary of meta-data stored by OpenML, like the dataset id.
+For more details, see the `OpenML documentation
+<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset
+is 40966, and you can use this (or the name) to get more information on the
+dataset on the OpenML website::
+
+  >>> mice.url
+  'https://www.openml.org/d/40966'
+
+The ``data_id`` also uniquely identifies a dataset from OpenML::
+
+  >>> mice = fetch_openml(data_id=40966)
+  >>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP
+  {'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
+  'creator': ...,
+  'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
+  'https://www.openml.org/data/v1/download/1804243/MiceProtein.ARFF', 'file_id':
+  '1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
+  Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
+  Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6):
+  e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
+  'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
+  '3c479a6885bfa0438971388283a1ce32'}
+
+.. _openml_versions:
+
+Dataset Versions
+----------------
+
+A dataset is uniquely specified by its ``data_id``, but not necessarily by its
+name. Several different "versions" of a dataset with the same name can exist,
+and these can contain entirely different datasets.
+If a particular version of a dataset has been found to contain significant
+issues, it might be deactivated. Using a name to specify a dataset will yield
+the earliest version of a dataset that is still active. That means that
+``fetch_openml(name="miceprotein")`` can yield different results at different
+times if earlier versions become inactive.
+You can see that the dataset with ``data_id`` 40966 that we fetched above is
+version 1 of the "miceprotein" dataset::
+
+  >>> mice.details['version']  #doctest: +SKIP
+  '1'
+
+In fact, this dataset only has one version. The iris dataset, on the other hand,
+has multiple versions::
+
+  >>> iris = fetch_openml(name="iris")
+  >>> iris.details['version']  #doctest: +SKIP
+  '1'
+  >>> iris.details['id']  #doctest: +SKIP
+  '61'
+
+  >>> iris_61 = fetch_openml(data_id=61)
+  >>> iris_61.details['version']
+  '1'
+  >>> iris_61.details['id']
+  '61'
+
+  >>> iris_969 = fetch_openml(data_id=969)
+  >>> iris_969.details['version']
+  '3'
+  >>> iris_969.details['id']
+  '969'
+
+Specifying the dataset by the name "iris" yields the lowest version, version 1,
+with the ``data_id`` 61. To make sure you always get this exact dataset, it is
+safest to specify it by the dataset ``data_id``. The other dataset, with
+``data_id`` 969, is version 3 (version 2 has become inactive), and contains a
+binarized version of the data::
+
+  >>> np.unique(iris_969.target)
+  array(['N', 'P'], dtype=object)
+
+You can also specify both the name and the version, which also uniquely
+identifies the dataset::
+
+  >>> iris_version_3 = fetch_openml(name="iris", version=3)
+  >>> iris_version_3.details['version']
+  '3'
+  >>> iris_version_3.details['id']
+  '969'
+
+
+.. topic:: References:
+
+    * Vanschoren, van Rijn, Bischl and Torgo
+      `"OpenML: networked science in machine learning"
+      <https://arxiv.org/pdf/1407.7722.pdf>`_,
+      ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.

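The workflow documented in the new file above can be exercised end to end with a short script. This is a sketch rather than part of the commit: it assumes scikit-learn >= 0.20 (the first release shipping ``fetch_openml``), network access to openml.org, and that the ``return_X_y`` keyword is supported (an assumption, since it is not shown in the text above)::

    import numpy as np
    from sklearn.datasets import fetch_openml

    # Fetch by name + version, as in the doctests above.
    mice = fetch_openml(name='miceprotein', version=4)
    print(mice.data.shape)               # (1080, 77) according to the doc above
    print(np.unique(mice.target))        # the 8 class labels
    print(mice.details['id'], mice.url)  # OpenML metadata and landing page

    # Fetch by data_id, which pins one exact dataset version.
    iris_X, iris_y = fetch_openml(data_id=61, return_X_y=True)
    print(iris_X.shape, iris_y.shape)

Pinning ``data_id`` (or ``name`` plus ``version``) is what keeps results reproducible if an earlier version of a dataset is later deactivated.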
doc/developers/contributing.rst

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@ link to it from your website, or simply star to say "I use it":
 * `joblib <https://github.com/joblib/joblib/issues>`__
 * `sphinx-gallery <https://github.com/sphinx-gallery/sphinx-gallery/issues>`__
 * `numpydoc <https://github.com/numpy/numpydoc/issues>`__
+* `liac-arff <https://github.com/renatopp/liac-arff>`__

 and larger projects:

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -259,6 +259,7 @@ Loaders
    datasets.fetch_lfw_people
    datasets.fetch_mldata
    datasets.fetch_olivetti_faces
+   datasets.fetch_openml
    datasets.fetch_rcv1
    datasets.fetch_species_distributions
    datasets.get_data_home

doc/whats_new/v0.20.rst

Lines changed: 5 additions & 1 deletion
@@ -178,6 +178,11 @@ Support for Python 3.3 has been officially dropped.
 :mod:`sklearn.datasets`
 .......................

+- |MajorFeature| Added :func:`datasets.fetch_openml` to fetch datasets from
+  `OpenML <http://openml.org>`_. OpenML is a free, open data-sharing platform
+  and will be used instead of mldata, as it provides better service availability.
+  :issue:`9908` by `Andreas Müller`_ and :user:`Jan N. van Rijn <janvanrijn>`.
+
 - |Feature| In :func:`datasets.make_blobs`, one can now pass a list to the
   `n_samples` parameter to indicate the number of samples to generate per
   cluster. :issue:`8617` by :user:`Maskani Filali Mohamed <maskani-moh>` and
@@ -204,7 +209,6 @@ Support for Python 3.3 has been officially dropped.
   data points could be generated. :issue:`10045` by :user:`Christian Braune
   <christianbraune79>`.

-
 :mod:`sklearn.decomposition`
 ............................

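Since the changelog entry above positions OpenML as the replacement for mldata, a hypothetical migration would look like the following; the OpenML dataset name ``'mnist_784'`` and ``version=1`` are illustrative assumptions, not something specified in this commit::

    from sklearn.datasets import fetch_openml

    # Before (mldata.org, frequently unavailable):
    #   from sklearn.datasets import fetch_mldata
    #   mnist = fetch_mldata('MNIST original')

    # After (openml.org), assuming the dataset name and version below exist:
    mnist = fetch_openml('mnist_784', version=1)
    X, y = mnist.data, mnist.target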
sklearn/datasets/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@
 from .twenty_newsgroups import fetch_20newsgroups
 from .twenty_newsgroups import fetch_20newsgroups_vectorized
 from .mldata import fetch_mldata, mldata_filename
+from .openml import fetch_openml
 from .samples_generator import make_classification
 from .samples_generator import make_multilabel_classification
 from .samples_generator import make_hastie_10_2
@@ -65,6 +66,7 @@
            'fetch_covtype',
            'fetch_rcv1',
            'fetch_kddcup99',
+           'fetch_openml',
            'get_data_home',
            'load_boston',
            'load_diabetes',

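A minimal sanity check of the public API change above, assuming an installed scikit-learn build that contains this commit::

    import sklearn.datasets as datasets

    # fetch_openml is re-exported by the package and listed in __all__,
    # per the two additions in the diff above.
    assert 'fetch_openml' in datasets.__all__
    assert callable(datasets.fetch_openml)
    print(datasets.fetch_openml.__module__)  # expected: 'sklearn.datasets.openml'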