MNT Add type annotations for OpenML fetcher #17053
Conversation
@@ -81,15 +88,15 @@ def _open_openml_url(openml_path, data_home):
     result : stream
         A stream to the OpenML resource
     """
-    def is_gzip(_fsrc):
+    def is_gzip_encoded(_fsrc):
Renamed because I found the name confusing. It's not actually a gzipped file; rather, Content-Encoding == 'gzip' is set in the response header. So we need to use gzip for local files when this function returns False.
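A minimal sketch of the check as described (the surrounding urlopen plumbing and the example URL are illustrative assumptions, not taken from the PR):

```python
import gzip
from urllib.request import urlopen

def is_gzip_encoded(_fsrc):
    # True when the response carries the header Content-Encoding: gzip,
    # i.e. the transport is compressed, not the payload itself.
    return _fsrc.info().get('Content-Encoding', '') == 'gzip'

# Usage sketch: only wrap the stream in GzipFile when it is gzip-encoded.
fsrc = urlopen('https://www.openml.org/api/v1/json/data/61')
if is_gzip_encoded(fsrc):
    fsrc = gzip.GzipFile(fileobj=fsrc, mode='rb')
```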
-        Whether to raise an error if OpenML returns an acceptable error (e.g.,
-        date not found). If this argument is set to False, a None is returned
-        in case of acceptable errors. Note that all other errors (e.g., 404)
-        will still be raised as normal.
Removed it, and catch the exception directly in the one case where it's necessary. This makes the function return a single type, instead of Dict or None.
-    if raise_if_error:
-        raise ValueError(error_message)
-    return None
+    raise OpenMLError(error_message)
OpenMLError inherits from ValueError, so it is backward compatible.
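A self-contained sketch of the resulting pattern (the helper's name, stub body, and variables are illustrative assumptions, not the PR's exact code):

```python
class OpenMLError(ValueError):
    """Raised when OpenML returns an acceptable error (e.g. date not found)."""

def _get_json_content_from_openml_api(url):
    # Stub: the real helper performs the request and parses the JSON body.
    raise OpenMLError("No results")

# A caller that previously passed raise_if_error=False now catches the
# exception itself, so the helper always returns a single type:
try:
    json_data = _get_json_content_from_openml_api("https://www.openml.org/api/...")
except OpenMLError:
    json_data = None  # acceptable error; all other errors still propagate
```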
azure-pipelines.yml
 conda create --name flake8_env --yes python=3.8
-conda activate flake8_env
+source activate flake8_env
conda activate was actually not working in master, with:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run: $ conda init <SHELL_NAME>

and conda init was not working either. We were just not checking the exit status (now done with set -ex), so I reverted to source activate as I don't want to debug this.
One issue with conda is that it expects the shell function to be the entry point, not the executable, so having only the executable on the PATH is not enough. This has already given me a headache in other places.
@@ -171,6 +172,18 @@
 _RE_SPARSE_LINE = re.compile(r'^\s*\{.*\}\s*$', re.UNICODE)
 _RE_NONTRIVIAL_DATA = re.compile('["\'{}\\s]', re.UNICODE)

+ArffDataType = Tuple[List, ...]
+
+if typing.TYPE_CHECKING:
According to the TYPE_CHECKING docs this is used for performance reasons. If this is the case, let's have a comment here.
No, the reason is that https://pypi.org/project/typing-extensions/ (imported below) is a mypy dependency, so we can only import it when checking types, as that is when it is guaranteed to be installed. I'll add a comment.
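A minimal sketch of that pattern (the imported name is illustrative):

```python
import typing

# typing_extensions is a dependency of mypy itself, so it is only guaranteed
# to be installed while a type checker is running; at runtime this import is
# skipped entirely.
if typing.TYPE_CHECKING:
    from typing_extensions import Literal  # illustrative name
```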
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@@ -311,6 +333,9 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
     attributes = OrderedDict(arff['attributes'])
     arff_columns = list(attributes)

+    if isinstance(arff['data'], tuple):
+        raise ValueError("Unreachable code. arff['data'] must be a generator.")
+
     # calculate chunksize
     first_row = next(arff['data'])
The above check is necessary: technically arff['data'] could be a tuple for sparse data, and then next would fail, as it is not implemented for tuples.
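To illustrate why (a small self-contained sketch; the sparse layout is made up): next() consumes an iterator, and a generator is an iterator, but a tuple is not — it is only iterable.

```python
dense = (row for row in [[0, 1], [2, 3]])  # dense ARFF data: a generator
first_row = next(dense)                    # works: [0, 1]

sparse = ([0, 2], [0, 1], [1.0, 2.0])      # sparse ARFF data: a tuple
try:
    next(sparse)
except TypeError as exc:
    print(exc)  # 'tuple' object is not an iterator
```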
Can we directly check that arff['data'] is not iterable?

if not isinstance(arff['data'], Iterable)

(Not for this PR: this reminds me that we can likely return dataframes with sparse extension arrays.)
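As a side note, a tuple is itself Iterable, which is presumably why the merged diffs further down check against Generator instead; a quick sketch:

```python
from collections.abc import Generator, Iterable

data_tuple = ([0], [0], [1.0])
data_gen = (row for row in [[0, 1]])

print(isinstance(data_tuple, Iterable))   # True -- a tuple is iterable
print(isinstance(data_tuple, Generator))  # False
print(isinstance(data_gen, Generator))    # True
```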
Yes, done.
@thomasjpfan I addressed your comments, please let me know if you have others.
I am happy to try out typehints here. Thank you @rth!
Sure, let's try it out.
nitpicks
@@ -245,6 +261,10 @@ def _convert_arff_data(arff, col_slice_x, col_slice_y, shape=None):
     """
     arff_data = arff['data']
     if isinstance(arff_data, Generator):
+        if shape is None:
+            raise ValueError(
Shouldn't this have test coverage?
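For context, a test along these lines might look as follows (a hypothetical sketch: the module path, argument values, and lack of a message match are assumptions, and the test actually added later in this thread may differ):

```python
import pytest

from sklearn.datasets._openml import _convert_arff_data  # assumed module path

def test_convert_arff_data_generator_requires_shape():
    # A generator for arff['data'] with shape=None should hit the new check.
    arff = {'data': (row for row in [[0.0, 1.0]])}
    with pytest.raises(ValueError):
        _convert_arff_data(arff, col_slice_x=[0], col_slice_y=[1], shape=None)
```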
@@ -311,6 +333,11 @@ def _convert_arff_data_dataframe(arff, columns, features_dict):
     attributes = OrderedDict(arff['attributes'])
     arff_columns = list(attributes)

+    if not isinstance(arff['data'], Generator):
+        raise ValueError(
Test coverage?
Thanks for the review @jnothman, I added a test to cover those two lines and coverage is now green. Merging.
I find the OpenML fetcher code currently very difficult to read: there are a bunch of private functions with undocumented signatures that manipulate objects with non-trivial types.
This PR adds some type annotations to function signatures, to at least make the expected inputs and outputs clearer.
I'll contribute some of it to liac-arff upstream as soon as it drops Python 2 support: renatopp/liac-arff#107
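For a sense of the flavor of annotation involved, a hedged sketch (the helper name and signature are illustrative, not lifted from the PR; ArffDataType is the alias visible in the diff above):

```python
from typing import List, Tuple

# Alias from the diff above: sparse ARFF data is a tuple of lists.
ArffDataType = Tuple[List, ...]

def _filter_sparse_columns(
    arff_data: ArffDataType, include_columns: List[int]
) -> ArffDataType:
    """Illustrative private helper whose signature is now self-documenting."""
    ...
```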