Add OpenML dataset fetcher · Issue #9543 · scikit-learn/scikit-learn · GitHub

Closed
amueller opened this issue Aug 13, 2017 · 43 comments · Fixed by #11419
Labels
Enhancement help wanted Moderate Anything that requires some knowledge of conventions and best practices
Comments

@amueller
Member

The OpenML API is now in a state where I think we can relatively easily implement a fetcher for OpenML datasets.

You can see some of the discussion here:
openml/OpenML#218 (comment)

The interface should probably accept either a name or an ID. Names are not unique in OpenML; integer IDs are, but they are less user-friendly.

My suggestion would be to do a search call like

https://openml.org/api/v1/json/data/list/data_name/anneal/limit/1

which searches for the anneal dataset. The result will contain the ID of the first dataset called anneal. Then we can fetch that with a second API call as a CSV.
Finally we probably need to also do a call for the JSON meta-data, which tells us which column is the target, and probably also which columns are categorical and which are continuous, and possibly more.
For our interface, we definitely need the target column, though.

This should be fairly straightforward.
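The two-call flow described above can be sketched roughly as follows. The endpoint paths are the ones quoted in this thread; the helper names (`search_url`, `fetch_first_id`) are made up for illustration:

```python
import json
from urllib.request import urlopen

# Endpoint templates quoted earlier in this thread; they may evolve.
SEARCH_URL = "https://openml.org/api/v1/json/data/list/data_name/{}/limit/1"
DATA_URL = "https://www.openml.org/data/download/{}"


def search_url(name):
    """URL of the search call returning the first dataset named `name`."""
    return SEARCH_URL.format(name)


def fetch_first_id(name):
    """Resolve a dataset name to the ID of the first matching dataset."""
    with urlopen(search_url(name)) as resp:
        listing = json.load(resp)
    # the listing nests matches under data -> dataset
    return int(listing["data"]["dataset"][0]["did"])
```

`fetch_first_id("anneal")` would then feed the second call that downloads the data itself, plus a third call for the JSON metadata (target column, categorical vs. continuous columns).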

@amueller amueller added Enhancement Moderate Anything that requires some knowledge of conventions and best practices Need Contributor labels Aug 13, 2017
@amueller
Member Author

We might want to provide a read-only API key so that OpenML can figure out which calls come from us (but we don't want users uploading datasets in our name).

@YSanchezAraujo
YSanchezAraujo commented Aug 13, 2017

@amueller are users expected to know the dataset name then?

Not sure exactly what the API should look like, but in terms of functionality are you looking for something like this?

import json

import arff  # the liac-arff package
import pandas as pd

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

json_dl = "https://openml.org/api/v1/json/data/list/data_name/{}/limit/1"
data_dl = "https://www.openml.org/data/download/{}"

def url_dl(url):
    # urlretrieve returns (local_path, headers); we only need the path
    return urlretrieve(url)[0]

def get_info(name, dl=json_dl):
    # search by name and return the metadata of the first match
    url_path = url_dl(dl.format(name))
    with open(url_path, 'r') as tmp:
        json_data = json.load(tmp)
    return json_data['data']['dataset'][0]

def get_data(json_data, dl=data_dl):
    # download and parse the ARFF file referenced by the metadata
    url_path = url_dl(dl.format(json_data['file_id']))
    with open(url_path, 'r') as tmp:
        data = arff.load(tmp)
    att = data['attributes']
    col_heads = [val[0] for val in att]
    return pd.DataFrame(data=data['data'], columns=col_heads), att

test = get_data(get_info("iris"))

@jnothman
Member
jnothman commented Aug 13, 2017 via email

@YSanchezAraujo
YSanchezAraujo commented Aug 13, 2017

I thought that was the intention of the suggestion:

My suggestion would be to do a search call like

https://openml.org/api/v1/json/data/list/data_name/anneal/limit/1

which searches for the anneal dataset. The result will contain the ID of the first dataset called anneal.

@jnothman
Member
jnothman commented Aug 13, 2017 via email

@amueller
Member Author

I thought names would be more user-friendly (even if ambiguous). fetch_mldata uses names, though their names are unique.
Like if I want to load mnist, I don't want to remember the id of MNIST.

@YSanchezAraujo you don't need arff, and unfortunately you can't use pandas because it's not a requirement.

The interface should be pretty similar to that of fetch_mldata I think.

@YSanchezAraujo

@amueller Alright I'll change up the code a bit

@jnothman
Member
jnothman commented Aug 15, 2017 via email

@amueller
Member Author

I think the code I gave will return the smallest ID, but we should confirm. And I'm happy to warn if the name is not unique. I'm not sure how many of the dataset names are unique; I raised a point about having unique names, and we could probably have them add unique names if that would help.
We can also do just IDs for now, but I don't think it's a very friendly interface.

@jnothman
Member
jnothman commented Aug 15, 2017 via email

@amueller
Member Author

IDs are allocated in chronological order, and datasets can be deleted (and/or flagged as inactive). So you're concerned the result might change because a dataset gets deleted.
Maybe @janvanrijn or @joaquinvanschoren can comment. If datasets only get flagged as inactive and can still be searched, we could get future-proof behavior by searching including inactive datasets and erroring if it's inactive. That means that if the first dataset with a given name becomes inactive, it can't be accessed through our mechanism any more by name. Which I guess is fine?

@joaquinvanschoren
Contributor
joaquinvanschoren commented Aug 16, 2017

Thanks for doing this, and we're happy to help in any way possible.

When you search a dataset by name (using the API call above), you get the oldest version that was not deactivated for serious reasons. If you want to have the exact same version (e.g. for unit tests or documentation), then it is best to store the ID number after you download it. This supports both active research and archival use.

Also, name+version_nr is unique and immutable (for caching). We can add a version filter to that API call if that's easier than remembering an ID, something like https://openml.org/api/v1/json/data/list/data_name/anneal/version/2

If a dataset is deactivated, that earlier API call (without the version nr) will return the active version that replaces it (with issues fixed). If you don't want that, you'll have to use the ID:
https://openml.org/api/v1/json/data/_id_

If a dataset is deleted (this only happens if the owner expressly wants this) then it will indeed not be accessible anymore. I don't think this has happened before, but of course there are legal ownership and other rights, so this is not impossible. If people think that the data cannot remain public, they won't upload it publicly to OpenML. Same for mldata, UCI,...

@joaquinvanschoren
Contributor

Oh, sorry, I missed part of Andreas's answer. We can also add a status filter to the dataset search:
https://openml.org/api/v1/json/data/list/data_name/anneal/status/all

To return both active and non-active datasets. That would indeed help if a dataset is deactivated and not replaced by a new active version.
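A fetcher could use that /status/all variant and then choose a version client-side from the combined listing. A sketch of that selection step, assuming the status values discussed above (`pick_version` and its error handling are illustrative, not part of OpenML's API):

```python
def pick_version(datasets, version=None):
    """Pick one entry from a parsed /status/all listing.

    `datasets` is the list under listing["data"]["dataset"].
    """
    active = [d for d in datasets if d.get("status") == "active"]
    if version is None:
        # default to the oldest active version, like the plain search call
        return min(active, key=lambda d: int(d["version"]))
    for d in datasets:
        if int(d["version"]) == version:
            if d.get("status") != "active":
                raise ValueError(
                    "version %d exists but is %s" % (version, d["status"]))
            return d
    raise ValueError("no version %d in the listing" % version)
```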

@jnothman
Member
jnothman commented Aug 16, 2017 via email

@joaquinvanschoren
Contributor

They would either specify a name and version, e.g. iris version 1
Or, specify a dataset id, e.g. 61 -> https://www.openml.org/d/61

In the first case, you would look up the ID with
https://openml.org/api/v1/json/data/list/data_name/iris

and then find version 1, or you ask us to implement
https://openml.org/api/v1/json/data/list/data_name/iris/version/1

If the user specifies an ID, you simply call
https://openml.org/api/v1/json/data/61

@GKjohns
Contributor
GKjohns commented Aug 26, 2017

Could we create a wrapper around the function that does this in the openml module (something like oml.datasets.get_dataset(did))? It would add a dependency but would reduce a ton of redundant work.

@amueller
Member Author

No, I don't want to depend on openml, in particular because of the arff dependency it has (which is a github development version). And actually, there is not a lot of redundant work. This functionality can be implemented in ~10 lines.

@amueller
Member Author

(I wrote a bunch of the openml python module)

@amueller
Member Author

@YSanchezAraujo are you still working on this? if so, can you submit a PR?

@janvanrijn
Contributor

FYI the current version of openml-python does not depend on development versions; liac-arff 2.1.1 is on PyPI.

@amueller
Member Author

@janvanrijn sweet! - still I think this is very easy to implement without pulling in all the complexity of openml-python.

@janvanrijn
Contributor
janvanrijn commented Aug 28, 2017 via email

@rrkarim
rrkarim commented Aug 30, 2017

@amueller so what is the final decision? Should a version number be passed to fetch_mldata? But then how will users be able to choose the appropriate version number on every request?

@jnothman
Member

So names uniquely identify a dataset but not a version.
@joaquinvanschoren: does version get updated only if the data changes, or even if the meta-data changes? Can you provide us a test dataset where there are multiple versions, some active/inactive/deprecated/deleted? (is inactive the same as deleted?)

How about:

Parameters
----------
name_or_id : str or int
    The name of the dataset if a string, otherwise the integer ID (in which
    case "version" below is ignored).
version : int or "newest", default=1
    When retrieving a dataset by name, this chooses which version to retrieve.
    "newest" will retrieve the active version with the highest version number.
warn_if_newer : bool, default=True
    If the dataset is retrieved by version, a UserWarning will be
    emitted if a newer version of the dataset is available.

Raises
------
ValueError if the requested version is deleted.

??
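As a rough skeleton, the proposed parameters could dispatch like this. The listing is injected as an argument here instead of being downloaded, and all names are illustrative:

```python
import numbers
import warnings


def resolve_id(name_or_id, version=1, warn_if_newer=True, listing=()):
    """Resolve the proposed (name_or_id, version) pair to a dataset ID.

    `listing` stands in for the parsed search result; a real fetcher
    would download it from OpenML first.
    """
    if isinstance(name_or_id, numbers.Integral):
        return int(name_or_id)  # IDs are used as-is; `version` is ignored
    active = [d for d in listing if d["status"] == "active"]
    if version == "newest":
        return max(active, key=lambda d: d["version"])["did"]
    match = next(d for d in listing if d["version"] == version)
    if warn_if_newer and any(d["version"] > version for d in active):
        warnings.warn("a newer version of %r is available" % name_or_id)
    return match["did"]
```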

@jnothman
Member

No, I think this isn't quite right because the existence of iris version 3 doesn't mean that iris version 1 should not be used.

@joaquinvanschoren, something like /status/all might help. Otherwise, some kind of "substitute" or "replace-with" or "newer" field would be valuable so that we can inform the user that a dataset exists with improved quality or metadata:

{"data": {"dataset": [
    {"did": 61,
     "name": "iris",
     "version": 1,
     "status": "deprecated",
     "status_timestamp": "2017-01-01T12:04:00Z",
     "newer": 969,
     ...
    }
]}}

I'm not sure then whether we need to iteratively follow "newer" links or whether you will assure us that (eventually) newer will point to the newest.
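If "newer" links had to be followed iteratively, the client-side walk is short. Note the "newer" field is the speculative one proposed above, not a real OpenML attribute, and `index` here is a hypothetical ID-to-metadata map:

```python
def follow_newer(did, index):
    """Follow hypothetical 'newer' links until reaching the newest dataset.

    `index` maps dataset ID -> metadata dict; 'newer', when present,
    holds the ID of the replacement dataset.
    """
    seen = set()
    while "newer" in index[did]:
        if did in seen:  # guard against cycles in the chain
            raise ValueError("cycle in 'newer' links at id %d" % did)
        seen.add(did)
        did = index[did]["newer"]
    return did
```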

@YSanchezAraujo

I don't have time to follow this at the moment, so for whoever sends the PR, this is where I stopped some time ago, in case it's useful:

import json
import warnings

import numpy as np
import scipy.io.arff as sia

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

jsons = "https://openml.org/api/v1/json/data/list/data_name/{}"
data_dl = "https://www.openml.org/data/download/{}"

def get_dataset(name, name_vers=None, json_loc=jsons, data_loc=data_dl):
    # get the json listing of all datasets with this name
    json_dl = urlretrieve(json_loc.format(name))[0]
    with open(json_dl, 'r') as tmp:
        json_data = json.load(tmp)['data']['dataset']
    vers = [item['version'] for item in json_data]
    # tell the user there are more versions if they don't specify one
    if len(vers) > 1 and name_vers is None:
        msg = ("dataset: {} has versions {}, "
               "default is {}").format(name, vers, min(vers))
        warnings.warn(msg)
    # find the file_id of the requested (or default, i.e. oldest) version
    use = min(vers) if name_vers is None else name_vers
    try:
        to_get = next(item['file_id'] for item in json_data
                      if item['version'] == use)
    except StopIteration:
        raise ValueError("version {} of {} not found".format(use, name))
    # download the data
    data_tmp = urlretrieve(data_loc.format(to_get))[0]
    # load the data
    data = sia.loadarff(data_tmp)
    # scipy returns a structured array; unpack it into a 2-D object array
    data_fmt = np.zeros((data[0].shape[0], len(data[0][0])), dtype=object)
    for idx, row in enumerate(data[0]):
        data_fmt[idx, :] = list(row)
    return data_fmt

@vrishank97
Contributor
vrishank97 commented Sep 19, 2017

Can I take up this enhancement?

@amueller
Member Author

@vrishank97 in principle yes, but looks like you already have 3 PRs with CI failing or review comments. Maybe focus on finishing these up first (says the guy with 17 open PRs).

@vrishank97
Contributor

Thanks, I took care of those PRs. I have only been working on 'easy' issues; is this something I can take up as my first larger contribution?

@jnothman
Member
jnothman commented Oct 1, 2017 via email

@janvanrijn
Contributor

@vrishank97 if you have questions about the OpenML side of this, you probably want to consult @joaquinvanschoren, @mfeurer, or of course myself. Good luck :)

@vrishank97
Contributor

Thanks @janvanrijn. Is a CSV file call with headers available now?

@amueller
Member Author
amueller commented Oct 9, 2017

@vrishank97 yes, there is.

@amueller
Member Author

OK, so if someone has used a dataset before, they probably want to be able to use it again without internet access. But we don't know whether they address it by ID or by name, and we don't know the correspondence (unless we store it locally).
I don't see a better way than keeping a (name, version) -> id dictionary... hm.

@jnothman
Member
jnothman commented Oct 10, 2017 via email

@vrishank97
Contributor

I'm currently implementing a fetcher similar to the one we have for MLdata, where we store all datasets used locally. Thanks @jnothman, I'll try using joblib.Memory.

@vrishank97
Contributor

Is there any alternative to joblib? I don't think we can use joblib as it's not a requirement.

@amueller amueller mentioned this issue Oct 11, 2017
@amueller
Member Author

@vrishank97 joblib is a requirement; we are actually shipping it in external/ right now. I have been working on this a bit in #9908. If you have time to work on this, feel free to pick up there. I'm with the OpenML people this week, so I'd really like to finish it up this week, because that will make it much easier to fix any issues.

@amueller
Member Author

@jnothman yeah that's probably a better idea.

@amueller
Member Author
amueller commented Oct 11, 2017

@jnothman is there a particular reason why you want to do it optionally?

@jnothman
Member
jnothman commented Oct 15, 2017 via email

@joaquinvanschoren
Contributor

What still needs to be done to finish openml_fetch?

@jnothman
Member

Well, @amueller is generally quite unavailable atm, and his PR at #9908 most recently tried to use the CSV interface, which I personally think should be avoided, in part because we need to support sparse datasets.
