Add OpenML dataset fetcher #9543
Comments
We might want to provide a read-only API key so that OpenML can figure out which calls come from us (but we don't want users uploading datasets in our name).
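(For illustration, a minimal sketch of how such a key could be attached to each call. OpenML's REST API takes an api_key query parameter; the key value and helper name here are placeholders, not a real scikit-learn key.)

def with_api_key(url, key="READ-ONLY-PLACEHOLDER"):
    # append a (hypothetical) read-only API key to an OpenML call
    sep = "&" if "?" in url else "?"
    return url + sep + "api_key=" + key
|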
@amueller are users expected to know the dataset name then? Not sure exactly what the API should look like, but in terms of functionality are you looking for something like this?

import sys
import json
import arff
import pandas as pd

# urlretrieve moved in Python 3; a bare ``import urllib`` does not
# expose urllib.request
if sys.version_info[0] == 3:
    from urllib.request import urlretrieve
else:
    from urllib import urlretrieve

json_dl = "https://openml.org/api/v1/json/data/list/data_name/{}/limit/1"
data_dl = "https://www.openml.org/data/download/{}"

def url_dl(url):
    # download ``url`` to a temporary file and return its path
    return urlretrieve(url)[0]

def get_info(name, dl=json_dl):
    # look the dataset up by name and return the first match
    url_path = url_dl(dl.format(name))
    with open(url_path, 'r') as tmp:
        json_data = json.load(tmp)
    return json_data['data']['dataset'][0]

def get_data(json_data, dl=data_dl):
    # download the ARFF file and load it into a DataFrame
    url_path = url_dl(dl.format(json_data['file_id']))
    with open(url_path, 'r') as tmp:
        data = arff.load(tmp)
    att = data['attributes']
    col_heads = [val[0] for val in att]
    return pd.DataFrame(data=data['data'], columns=col_heads), att

test = get_data(get_info("iris"))
|
Why would you do a search by name and not require the user to specify by ID? I'd assume names are ambiguous.
|
I thought that was the intention of the suggestion: "My suggestion would be to do a search call like https://openml.org/api/v1/json/data/list/data_name/anneal/limit/1 which searches for the anneal dataset. The result will contain the ID of the first dataset called anneal."
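(As a sketch, resolving the ID from that search call with only the standard library might look like this; the 'did' field name for the dataset ID is an assumption about the JSON layout.)

import json
from urllib.request import urlopen

url = "https://openml.org/api/v1/json/data/list/data_name/anneal/limit/1"
with urlopen(url) as resp:
    listing = json.load(resp)
first = listing['data']['dataset'][0]
print(first['did'], first['name'], first['version'])  # ID of the first match
|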
ah okay.
|
I thought names would be more user-friendly (even if ambiguous). @YSanchezAraujo you don't need arff, and you can't use pandas unfortunately, because it's not a requirement. The interface should be pretty similar to that of fetch_mldata. |
@amueller Alright, I'll change up the code a bit. |
If the names are not unique, don't do it by name, or search by name but error if multiple are returned. Or warn and take the search result with the lowest numeric ID, or something similarly well-specified.
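(A sketch of that policy, assuming each search result is a dict with a numeric 'did' ID field, which is an assumption about the response layout.)

import warnings

def resolve_by_name(results, strict=False):
    # ``results``: list of dataset dicts returned by the name search
    if len(results) > 1:
        if strict:
            raise ValueError("%d datasets match this name; pass an ID "
                             "to disambiguate" % len(results))
        warnings.warn("multiple datasets match this name; "
                      "using the lowest numeric ID")
    return min(int(r['did']) for r in results)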
|
I think the code I gave will give the smallest ID, but we should confirm. And happy to warn if it's not unique. I'm not sure how many of the dataset names are unique. I raised a point about having unique names, and we could probably have them add unique names if that would help. We can also do just IDs for now, but I don't think it's a very friendly interface. |
Well, minimum ID with a given name works as long as IDs are allocated in chronological order and there's no deletion.
|
IDs are allocated in chronological order; datasets can be deleted (and/or flagged as inactive). So you're concerned the result might change because a dataset gets deleted. |
Thanks for doing this, and we're happy to help in any way possible.

When you search a dataset by name (using the API call above), you get the oldest version that was not deactivated for serious reasons. If you want to have the exact same version (e.g. for unit tests or documentation), then it is best to store the ID number after you download it. This supports both active research and archival use. Also, name+version_nr is unique and immutable (for caching). We can add a version filter to that API call if that's easier than remembering an ID, something like https://openml.org/api/v1/json/data/list/data_name/anneal/version/2

If a dataset is deactivated, that earlier API call (without the version nr) will return the active version that replaces it (with issues fixed). If you don't want that, you'll have to use the ID.

If a dataset is deleted (this only happens if the owner expressly wants this), then it will indeed not be accessible anymore. I don't think this has happened before, but of course there are legal ownership and other rights, so this is not impossible. If people think that the data cannot remain public, they won't upload it publicly to OpenML. Same for mldata, UCI, ...
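(A sketch of pinning an exact name+version pair through that proposed filter; the URL pattern is the one above, and the 'did' field is an assumed name for the dataset ID in the response.)

import json
from urllib.request import urlopen

def resolve_name_version(name, version):
    url = ("https://openml.org/api/v1/json/data/list/"
           "data_name/{}/version/{}").format(name, version)
    with urlopen(url) as resp:
        listing = json.load(resp)
    return listing['data']['dataset'][0]['did']
|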
Oh, sorry, I missed part of Andreas's answer. We can also add a status filter to the dataset search: https://openml.org/api/v1/json/data/list/data_name/anneal/status/all to return both active and non-active datasets. That would indeed help if a dataset is deactivated and not replaced by a new active version. |
Could you help explicitly describe what we might do to allow users to specify a dataset by name while ensuring it is uniquely identified?
|
They would either specify a name and version (e.g. iris, version 1) or an ID. In the first case, you would look up the ID with the name search call above and then find version 1, or you ask us to implement the version filter (e.g. .../data_name/iris/version/1, following the pattern above). If the user specifies an ID, you simply call the download URL for that dataset.
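(In code, that dispatch might look like the following sketch; resolve_name_version is the hypothetical helper sketched earlier, and download_data stands in for the actual download step.)

def fetch_openml(name_or_id, version=1):
    # integers are treated as unambiguous IDs; strings go through
    # the name (+ version) search
    if isinstance(name_or_id, int):
        data_id = name_or_id
    else:
        data_id = resolve_name_version(name_or_id, version)
    return download_data(data_id)  # hypothetical download helper
|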
Could we create a wrapper around the function that does this in the openml module? |
No, I don't want to depend on openml, in particular because of the arff dependency it has (which is a GitHub development version). And actually, there is not a lot of redundant work: this functionality can be implemented in ~10 lines. |
(I wrote a bunch of the openml python module) |
@YSanchezAraujo are you still working on this? If so, can you submit a PR? |
FYI the current version of openml-python does not depend on development versions; liac-arff 2.1.1 is on PyPI. |
@janvanrijn sweet! Still, I think this is very easy to implement without pulling in all the complexity of openml-python. |
Cool, let me know if you need any help :)
|
@amueller so what is the final decision? There should be |
So names uniquely identify a dataset but not a version. How about:
?? |
No, I think this isn't quite right, because the existence of iris version 3 doesn't mean that iris version 1 should not be used. @joaquinvanschoren, something like /status/all might help. Otherwise, some kind of "substitute" or "replace-with" or "newer" field would be valuable so that we can inform the user that a dataset exists with improved quality or metadata.
I'm not sure then whether we need to iteratively follow "newer" links or whether you will assure us that (eventually) "newer" will point to the newest version.
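(For concreteness, such metadata might look like the following; the "newer" field is entirely hypothetical and not an existing OpenML field.)

meta = {
    "did": 1,
    "name": "anneal",
    "version": 1,
    "status": "deactivated",
    "newer": 42,  # hypothetical: ID of the dataset that supersedes this one
}
|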
I don't have time to follow this at the moment, so for whoever will send the PR, this is where I stopped some time ago, in case it's useful:

import json
import warnings
import numpy as np

try:
    # Python 2
    from urllib import urlretrieve
except ImportError:
    # Python 3
    from urllib.request import urlretrieve

import scipy.io.arff as sia

jsons = "https://openml.org/api/v1/json/data/list/data_name/{}"
data_dl = "https://www.openml.org/data/download/{}"

def get_dataset(name, name_vers=None, json_loc=jsons, data_loc=data_dl):
    # download and parse the JSON listing for this dataset name
    json_dl = urlretrieve(json_loc.format(name))[0]
    with open(json_dl, 'r') as tmp:
        json_data = json.load(tmp)['data']['dataset']
    vers = [(idx, val) for idx, item in enumerate(json_data)
            for key, val in item.items() if key == "version"]
    # tell the user there are more versions if they don't specify one
    if len(vers) > 1 and name_vers is None:
        msg = ("dataset: {} has versions {}, "
               "default is {}").format(name,
                                       [i[1] for i in vers],
                                       min(i[1] for i in vers))
        warnings.warn(msg)
    # default to the smallest version, as the warning promises
    use = min(i[1] for i in vers) if name_vers is None else name_vers
    to_get = None
    for v in vers:
        if v[1] == use:
            to_get = json_data[v[0]]['file_id']
    if to_get is None:
        raise ValueError("version {} of {} not found".format(use, name))
    # download the ARFF data
    data_tmp = urlretrieve(data_loc.format(to_get))[0]
    # load it; scipy returns a structured array, so copy it row by row
    # into a plain object array
    data = sia.loadarff(data_tmp)
    data_fmt = np.zeros((data[0].shape[0], len(data[0][0])), dtype=object)
    for idx, row in enumerate(data[0]):
        data_fmt[idx, :] = [val for val in row]
    return data_fmt
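(For reference, under the sketch above this would be called as follows; with no version given it warns if several versions exist and picks the lowest.)

data = get_dataset("iris")
print(data.shape)
|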
Can I take up this enhancement? |
@vrishank97 in principle yes, but looks like you already have 3 PRs with CI failing or review comments. Maybe focus on finishing these up first (says the guy with 17 open PRs). |
Thanks, I took care of those PRs. I have only been working on 'easy' issues; is this something I can take up as my first larger contribution? |
That sounds like a good idea.
|
@vrishank97 if you have questions about the OpenML side of this, you probably want to consult with @joaquinvanschoren, @mfeurer, or of course myself. Good luck :) |
Thanks @janvanrijn. Is a CSV file call with headers available now? |
@vrishank97 yes, there is. |
OK, so if someone has used a dataset before, they probably want to be able to use it again without an internet connection. But we don't know whether they address it by ID or by name, and we don't know the correspondence (unless we store that locally). |
Maybe just have an option memory=True, which defaults to using a joblib.Memory directory within the data dir. It caches all the HTTP requests and hence includes name mappings and data.
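(A sketch of that idea, routing every HTTP fetch through joblib.Memory; the cache path and helper name are placeholders.)

import os
from joblib import Memory
from urllib.request import urlopen

memory = Memory(os.path.expanduser('~/scikit_learn_data/openml'), verbose=0)

@memory.cache
def _fetch_url(url):
    # any GET done through this helper (search, metadata, data) is
    # cached on disk, so repeated calls also work offline
    with urlopen(url) as resp:
        return resp.read()
|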
I'm currently implementing a fetcher similar to the one we have for MLdata, where we store all datasets used locally. Thanks @jnothman, I'll try using joblib.Memory. |
Is there any alternative to joblib? I don't think we can use joblib as it's not a requirement. |
@vrishank97 joblib is a requirement. We are actually shipping it in sklearn.externals. |
@jnothman yeah that's probably a better idea. |
@jnothman is there a particular reason why you want to do it optionally? |
If it's all stored, then there's no point to our discussion on what to do if a newer version exists. But the interface can be finessed in the PR.
|
What still needs to be done to finish this? |
The OpenML API has now reached a point where I think we can relatively easily implement a fetcher for OpenML datasets.
You can see some of the discussion here:
openml/OpenML#218 (comment)
The interface should probably accept either a name or an ID. Names are not unique in OpenML; integer IDs are, but they are less user-friendly.
My suggestion would be to do a search call like
https://openml.org/api/v1/json/data/list/data_name/anneal/limit/1
which searches for the anneal dataset. The result will contain the ID of the first dataset called anneal. Then we can fetch that with a second API call as a CSV.
Finally, we probably also need to do a call for the JSON meta-data, which tells us which column is the target, and probably also which columns are categorical and which are continuous, and possibly more.
For our interface, we definitely need the target column, though.
This should be fairly straightforward.
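(Putting the pieces together, the flow might look like this sketch. Everything past the search call, i.e. the per-dataset JSON description and its default_target_attribute and file_id fields, is an assumption to be checked against the OpenML docs.)

import json
from urllib.request import urlopen

def fetch_openml_sketch(name):
    # 1) resolve the name to the first matching dataset
    search = ("https://openml.org/api/v1/json/data/list/"
              "data_name/{}/limit/1").format(name)
    with urlopen(search) as resp:
        first = json.load(resp)['data']['dataset'][0]
    # 2) fetch the JSON meta-data, which names the target column
    desc_url = "https://openml.org/api/v1/json/data/{}".format(first['did'])
    with urlopen(desc_url) as resp:
        desc = json.load(resp)['data_set_description']
    target = desc.get('default_target_attribute')
    # 3) the data itself is a second call, via the file_id
    data_url = "https://www.openml.org/data/download/{}".format(
        desc['file_id'])
    return data_url, target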