[MRG] pulling data from openml.org rather than original data source by maxcopeland · Pull Request #12004 · scikit-learn/scikit-learn

[MRG] pulling data from openml.org rather than original data source #12004

Merged 3 commits on Sep 8, 2018

Conversation

Contributor
@maxcopeland maxcopeland commented Sep 4, 2018

Fixes #11858

What does this implement/fix? Explain your changes.

This pulls data from openml.org rather than the original data source.

Any other comments?

The dataset has been on openml.org for several days but its status is still "in_preparation". The data can be pulled via fetch_openml, but doing so gives a warning about the "in_preparation" status of the dataset's current version. plot_gpr_co2 is technically functional, but the dataset still needs to reach the "active" state.

Left to-do:

  • More efficient aggregation of monthly sums into averages
  • Upload additional versions of arff file to get dataset activated by openml admins
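
The warning described above can be observed explicitly rather than just printed. The following is a minimal sketch, assuming scikit-learn's fetch_openml is called with data_id=41187 (the ID from the dataset URL mentioned later in this thread). The helper name fetch_with_status_warnings and its fetcher parameter are hypothetical and exist only so the sketch can be tried with a stub instead of a network call.

```python
import warnings


def fetch_with_status_warnings(data_id, fetcher=None):
    """Fetch an OpenML dataset and return (bunch, warning_messages).

    `fetcher` is a hypothetical hook for illustration; by default the
    real sklearn.datasets.fetch_openml is used (requires network access).
    """
    if fetcher is None:
        # Imported lazily so the sketch can be exercised with a stub fetcher.
        from sklearn.datasets import fetch_openml
        fetcher = fetch_openml

    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of letting filters hide repeats.
        warnings.simplefilter("always")
        bunch = fetcher(data_id=data_id)

    return bunch, [str(w.message) for w in caught]
```

Calling fetch_with_status_warnings(41187) while the dataset version is not yet "active" would be expected to return a non-empty message list; the exact wording of the warning is not shown in this thread, so it is not reproduced here.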

Member
@jnothman jnothman left a comment

Thanks for working on this!

          counts.append(1)
      else:
          # aggregate monthly sum to produce average
-         ppmv_sums[-1] += float(ppmv)
+         ppmv_sums[-1] += float(ppmvs[i])
Member

do we still need this float?

Contributor Author

you're right, this is redundant

month_float = y + (m - 1) / 12
ppmvs = ml_data.target

for i in range(len(ppmvs)):
Member

You might as well just iterate over zip(month_float, ppmvs)

Contributor Author

good idea.
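
The two suggestions in this review (dropping the redundant float() and iterating over zip(month_float, ppmvs)) can be sketched together as follows. Only fragments of the real function appear in the diff, so the function name monthly_averages, its signature, and the sample readings are assumptions made for illustration, not the code that was merged.

```python
def monthly_averages(month_floats, ppmvs):
    """Average consecutive readings that share the same month value."""
    months, ppmv_sums, counts = [], [], []
    for month, ppmv in zip(month_floats, ppmvs):
        if not months or month != months[-1]:
            # First reading of a new month: start a new running sum.
            months.append(month)
            ppmv_sums.append(ppmv)
            counts.append(1)
        else:
            # aggregate monthly sum to produce average
            ppmv_sums[-1] += ppmv
            counts[-1] += 1
    averages = [total / n for total, n in zip(ppmv_sums, counts)]
    return months, averages


# Made-up (year, month, ppmv) readings, not taken from the real dataset.
readings = [(1958, 3, 316.0), (1958, 3, 318.0), (1958, 4, 317.5)]
# Same conversion as in the diff: year plus the fraction of the year elapsed.
month_floats = [y + (m - 1) / 12 for y, m, _ in readings]
ppmvs = [p for _, _, p in readings]
months, averages = monthly_averages(month_floats, ppmvs)
```

Iterating over the paired values removes the index variable entirely, and because the readings are already numeric no float() conversion is needed inside the loop.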

Member
jnothman commented Sep 5, 2018

Why is this marked WIP?

Member
jnothman commented Sep 5, 2018

That is, what work do you intend to do before this is safe to consider merging?

Contributor Author
maxcopeland commented Sep 5, 2018

On the openml.org side, the dataset version still needs to be approved and set to "active". While its status is "in_preparation", their admins could in theory reject the dataset and set it to "inactive" (due to compliance issues with tasks or workflows, etc.), which would break fetch_openml. Once it's active, the merge will be safe.

Member
jnothman commented Sep 5, 2018

@janvanrijn what's the chance of https://www.openml.org/d/41187 not being approved? ;)

@maxcopeland is there a reason not to say "Fixes #..." in the PR description? Using that wording, rather than "Works on", means that GitHub will automatically close the original issue when this is merged.

@maxcopeland
Contributor Author

@jnothman Sorry about that! Edited my PR comment. I'll use "Fixes #..." in the future. Newbie error :/

@janvanrijn
Contributor

cool, new datasets :)
it's active now

@qinhanmin2014
Member

@maxcopeland Thanks for uploading the dataset. It seems there are still some formatting issues in the wiki part. Also, maybe we can provide more information in the wiki (e.g., the URL from which you obtained the data).

@maxcopeland maxcopeland changed the title from "[WIP] pulling data from openml.org rather than original data source" to "[MRG] pulling data from openml.org rather than original data source" Sep 5, 2018
@maxcopeland
Contributor Author

Thanks @qinhanmin2014, I've updated the wiki here. Let me know if it's acceptable.

Member
@rth rth left a comment

Very nice, the example looks much cleaner using the openml fetcher!

@rth rth merged commit 2242f4c into scikit-learn:master Sep 8, 2018
@maxcopeland maxcopeland deleted the mauna-loa-openml branch September 8, 2018 15:03
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 9, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 17, 2018
Linked issue: Upload Mauna Loa CO2 data to OpenML.org