| 1 | +.. |
| 2 | + For doctests: |
| 3 | + |
| 4 | + >>> import numpy as np |
| 5 | + >>> import os |
| 6 | + |
| 7 | + |
| 8 | +.. _openml: |
| 9 | + |
| 10 | +Downloading datasets from the openml.org repository |
| 11 | +=================================================== |
| 12 | + |
| 13 | +`openml.org <https://openml.org>`_ is a public repository for machine learning |
| 14 | +data and experiments that allows anyone to upload open datasets. |
| 15 | + |
| 16 | +The ``sklearn.datasets`` package is able to download datasets |
| 17 | +from the repository using the function |
| 18 | +:func:`sklearn.datasets.fetch_openml`. |
| 19 | + |
| 20 | +For example, to download a dataset of gene expression in mouse brains:: |
| 21 | + |
| 22 | + >>> from sklearn.datasets import fetch_openml |
| 23 | + >>> mice = fetch_openml(name='miceprotein', version=4) |
| 24 | + |
| 25 | +To fully specify a dataset, you need to provide a name and a version, though |
| 26 | +the version is optional; see :ref:`openml_versions` below. |
| 27 | +The dataset contains a total of 1080 examples belonging to 8 different |
| 28 | +classes:: |
| 29 | + |
| 30 | + >>> mice.data.shape |
| 31 | + (1080, 77) |
| 32 | + >>> mice.target.shape |
| 33 | + (1080,) |
| 34 | + >>> np.unique(mice.target) # doctest: +NORMALIZE_WHITESPACE |
| 35 | + array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object) |
| 36 | + |
| 37 | +You can get more information on the dataset by looking at the ``DESCR`` |
| 38 | +and ``details`` attributes:: |
| 39 | + |
| 40 | + >>> print(mice.DESCR) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP |
| 41 | + **Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios |
| 42 | + **Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015 |
| 43 | + **Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing |
| 44 | + Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down |
| 45 | + Syndrome. PLoS ONE 10(6): e0129126... |
| 46 | + |
| 47 | + >>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP |
| 48 | + {'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF', |
| 49 | + 'upload_date': '2017-11-08T16:00:15', 'licence': 'Public', |
| 50 | + 'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff', |
| 51 | + 'file_id': '17928620', 'default_target_attribute': 'class', |
| 52 | + 'row_id_attribute': 'MouseID', |
| 53 | + 'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'], |
| 54 | + 'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'], |
| 55 | + 'visibility': 'public', 'status': 'active', |
| 56 | + 'md5_checksum': '3c479a6885bfa0438971388283a1ce32'} |
| 57 | + |
| 58 | + |
| 59 | +The ``DESCR`` contains a free-text description of the data, while ``details`` |
| 60 | +contains a dictionary of metadata stored by OpenML, like the dataset id. |
| 61 | +For more details, see the `OpenML documentation |
| 62 | +<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset |
| 63 | +is 40966, and you can use this (or the name) to get more information on the |
| 64 | +dataset on the OpenML website:: |
| 65 | + |
| 66 | + >>> mice.url |
| 67 | + 'https://www.openml.org/d/40966' |
| 68 | + |
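Note that the values in ``details`` are strings, as the output above shows (``'id': '40966'``). A minimal sketch, assuming only the fields printed above, of converting them and rebuilding the dataset page URL; ``dataset_page_url`` is a hypothetical helper for illustration, not part of scikit-learn:

```python
# Subset of the metadata printed above; the values are strings, not integers.
details = {"id": "40966", "name": "MiceProtein", "version": "4"}


def dataset_page_url(details):
    """Build the openml.org page URL from a ``details`` dictionary.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    data_id = int(details["id"])  # convert the string id to an integer
    return "https://www.openml.org/d/%d" % data_id


data_id = int(details["id"])
version = int(details["version"])
url = dataset_page_url(details)
print(data_id, version, url)
```

Converting the id to an integer first makes it safe to pass back as the ``data_id`` argument or to compare with other ids numerically.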
| 69 | +The ``data_id`` also uniquely identifies a dataset from OpenML:: |
| 70 | + |
| 71 | + >>> mice = fetch_openml(data_id=40966) |
| 72 | + >>> mice.details # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS +SKIP |
| 73 | + {'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF', |
| 74 | + 'upload_date': '2017-11-08T16:00:15', 'licence': 'Public', |
| 75 | + 'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff', |
| 76 | + 'file_id': '17928620', 'default_target_attribute': 'class', |
| 77 | + 'row_id_attribute': 'MouseID', |
| 78 | + 'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'], |
| 79 | + 'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'], |
| 80 | + 'visibility': 'public', 'status': 'active', |
| 81 | + 'md5_checksum': |
| 82 | + '3c479a6885bfa0438971388283a1ce32'} |
| 83 | + |
| 84 | +.. _openml_versions: |
| 85 | + |
| 86 | +Dataset Versions |
| 87 | +---------------- |
| 88 | + |
| 89 | +A dataset is uniquely specified by its ``data_id``, but not necessarily by its |
| 90 | +name. Several different "versions" of a dataset can share the same name, and |
| 91 | +these versions can contain entirely different data. |
| 92 | +If a particular version of a dataset has been found to contain significant |
| 93 | +issues, it might be deactivated. Using a name to specify a dataset will yield |
| 94 | +the earliest version of a dataset that is still active. That means that |
| 95 | +``fetch_openml(name="miceprotein")`` can yield different results at different |
| 96 | +times if earlier versions become inactive. |
| 97 | +You can see that the dataset with ``data_id`` 40966 that we fetched above is |
| 98 | +version 4 of the "miceprotein" dataset:: |
| 99 | + |
| 100 | + >>> mice.details['version'] #doctest: +SKIP |
| 101 | + '4' |
| 102 | + |
| 103 | +The iris dataset also has multiple versions, and fetching it by name alone |
| 104 | +returns the earliest active one:: |
| 105 | + |
| 106 | + >>> iris = fetch_openml(name="iris") |
| 107 | + >>> iris.details['version'] #doctest: +SKIP |
| 108 | + '1' |
| 109 | + >>> iris.details['id'] #doctest: +SKIP |
| 110 | + '61' |
| 111 | + |
| 112 | + >>> iris_61 = fetch_openml(data_id=61) |
| 113 | + >>> iris_61.details['version'] |
| 114 | + '1' |
| 115 | + >>> iris_61.details['id'] |
| 116 | + '61' |
| 117 | + |
| 118 | + >>> iris_969 = fetch_openml(data_id=969) |
| 119 | + >>> iris_969.details['version'] |
| 120 | + '3' |
| 121 | + >>> iris_969.details['id'] |
| 122 | + '969' |
| 123 | + |
| 124 | +Specifying the dataset by the name "iris" yields the lowest version, version 1, |
| 125 | +with the ``data_id`` 61. To make sure you always get this exact dataset, it is |
| 126 | +safest to specify it by the dataset ``data_id``. The other dataset, with |
| 127 | +``data_id`` 969, is version 3 (version 2 has become inactive), and contains a |
| 128 | +binarized version of the data:: |
| 129 | + |
| 130 | + >>> np.unique(iris_969.target) |
| 131 | + array(['N', 'P'], dtype=object) |
| 132 | + |
| 133 | +You can also specify both the name and the version, which also uniquely |
| 134 | +identifies the dataset:: |
| 135 | + |
| 136 | + >>> iris_version_3 = fetch_openml(name="iris", version=3) |
| 137 | + >>> iris_version_3.details['version'] |
| 138 | + '3' |
| 139 | + >>> iris_version_3.details['id'] |
| 140 | + '969' |
| 141 | + |
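The name and version resolution rules described in this section can be sketched in plain Python. This illustrates the rule only and is not how ``fetch_openml`` is implemented: ``CATALOG`` and ``resolve_data_id`` are hypothetical names, the ``"deactivated"`` status string is a placeholder, and the ``data_id`` of the inactive iris version 2 is not given above, so it is left as ``None``.

```python
# Toy catalog mirroring the iris versions discussed above (hypothetical).
# The data_id of the inactive version 2 is not known here, hence None.
CATALOG = [
    {"name": "iris", "version": 1, "data_id": 61, "status": "active"},
    {"name": "iris", "version": 2, "data_id": None, "status": "deactivated"},
    {"name": "iris", "version": 3, "data_id": 969, "status": "active"},
]


def resolve_data_id(name, version=None, catalog=CATALOG):
    """Return the data_id that a name (and optional version) resolves to."""
    matches = [d for d in catalog if d["name"] == name]
    if version is not None:
        # Name plus version identifies a single dataset.
        matches = [d for d in matches if d["version"] == version]
    else:
        # Name only: consider active versions, then take the earliest one.
        matches = [d for d in matches if d["status"] == "active"]
    if not matches:
        raise ValueError("no matching dataset for %r" % name)
    return min(matches, key=lambda d: d["version"])["data_id"]


by_name = resolve_data_id("iris")                # earliest active version
by_version = resolve_data_id("iris", version=3)  # explicit version

# If version 1 were deactivated later, the same name-only lookup would change.
later = [dict(d, status="deactivated") if d["version"] == 1 else d
         for d in CATALOG]
by_name_later = resolve_data_id("iris", catalog=later)
print(by_name, by_version, by_name_later)
```

The last lookup shows why pinning a ``data_id`` (or a name together with a version) is the safer choice for reproducible results.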
| 142 | + |
| 143 | +.. topic:: References: |
| 144 | + |
| 145 | + * Vanschoren, van Rijn, Bischl and Torgo |
| 146 | + `"OpenML: networked science in machine learning" |
| 147 | + <https://arxiv.org/pdf/1407.7722.pdf>`_, |
| 148 | + ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014. |