8000 Fix dtype=S1 encoding in to_netcdf() by shoyer · Pull Request #2158 · pydata/xarray · GitHub
[go: up one dir, main page]

Skip to content

Fix dtype=S1 encoding in to_netcdf() #2158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 1, 2018
Merged

Conversation

shoyer
Copy link
Member
@shoyer shoyer commented May 18, 2018
  • Closes [REGRESSION] to_netcdf doesn't accept dtype=S1 encoding anymore #2149 (remove if there is no corresponding issue, which should only be the case for minor changes)
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

@crusaderky please take a look. Testing here is not as thorough as in #2150 yet, but it does include a regression test.

@crusaderky
Copy link
Contributor

@shoyer just pull the test from #2150...

@@ -184,8 +184,9 @@ def prepare_variable(self, name, variable, check_encoding=False,
raise ValueError("'zlib' and 'compression' encodings mismatch")
encoding.setdefault('compression', 'gzip')

if (check_encoding and encoding.get('complevel') not in
(None, encoding.get('compression_opts'))):
if (check_encoding and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W291 trailing whitespace

(None, encoding.get('compression_opts'))):
if (check_encoding and
'complevel' in encoding and 'compression_opts' in encoding and
encoding['complevel'] != encoding['compression_opts']):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E127 continuation line over-indented for visual indent

with self.roundtrip(ds, save_kwargs=kwargs) as actual:
assert_equal(actual, ds)
assert actual.x.encoding['dtype'] == 'f4'
assert actual.x.encoding['zlib'] == True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E712 comparison to True should be 'if cond is True:' or 'if cond:'

assert actual.x.encoding['dtype'] == 'f4'
assert actual.x.encoding['zlib'] == True
assert actual.x.encoding['complevel'] == 9
assert actual.x.encoding['fletcher32'] == True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will b 10000 e displayed to describe this comment to others. Learn more.

E712 comparison to True should be 'if cond is True:' or 'if cond:'

assert actual.x.encoding['complevel'] == 9
assert actual.x.encoding['fletcher32'] == True
assert actual.x.encoding['chunksizes'] == (5,)
assert actual.x.encoding['shuffle'] == True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E712 comparison to True should be 'if cond is True:' or 'if cond:'

# there should be no chunks
self.assertEqual(v.chunks, None)
assert v.chunks == None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E711 comparison to None should be 'if cond is None:'

@@ -1793,19 +1818,19 @@ def test_dump_encodings_h5py(self):
kwargs = {'encoding': {'x': {
'compression': 'gzip', 'compression_opts': 9}}}
with self.roundtrip(ds, save_kwargs=kwargs) as actual:
self.assertEqual(actual.x.encoding['zlib'], True)
self.assertEqual(actual.x.encoding['complevel'], 9)
assert actual.x.encoding['zlib'] == True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E712 comparison to True should be 'if cond is True:' or 'if cond:'

self.assertEqual(actual.x.encoding['compression'], 'lzf')
self.assertEqual(actual.x.encoding['compression_opts'], None)
assert actual.x.encoding['compression'] == 'lzf'
assert actual.x.encoding['compression_opts'] == None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E711 comparison to None should be 'if cond is None:'

original = Dataset({'x': [u'foo', u'bar', u'baz']})
kwargs = dict(encoding={'x': {'dtype': str}})
with raises_regex(ValueError, 'encoding dtype=str for vlen'):
with self.roundtrip(original, save_kwargs=kwargs) as actual:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F841 local variable 'actual' is assigned to but never used

@shoyer shoyer force-pushed the encoding-S1-fix branch from 8bbf70c to 7418924 Compare May 24, 2018 04:00
@shoyer shoyer mentioned this pull request May 25, 2018
8 tasks
@shoyer
Copy link
Member Author
shoyer commented May 25, 2018

I think this is ready for another review.

@crusaderky I ended up splitting up your test into a few smaller pieces, so we could more easily handle logic for different backends with subclassing.

@@ -58,6 +62,9 @@ Bug fixes
bug where non-scalar data-variables that did not include the aggregation
dimension were improperly skipped.
By `Stephan Hoyer <https://github.com/shoyer>`_
- Fixed a regression in 0.10.4, where specifying ``{'dtype': 'S1'}`` in
``encoding`` with ``to_netcdf()`` raised an error.
`Stephan Hoyer <https://github.com/shoyer>`_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate entry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

vlen_dtype = h5py.check_dtype(vlen=var.dtype)
if vlen_dtype is not None:
if vlen_dtype is not unicode_type: # pragma: no cover
raise NotImplementedError('unexpected vlen dtype: {!r}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference from var.dtype.kind == 'U'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h5py uses special NumPy dtypes (np.object_ with metadata) for defining variable length types. It doesn't use var.dtype.kind == 'U', because HDF5 doesn't have a unicode type with a memory model matching np.unicode_. (HDF5 implements UTF-8 with a fixed number of bytes for storage, not a fixed number of unicode characters like NumPy.)

if 'dtype' in var.encoding and var.encoding['dtype'] != 'S1':
if ('dtype' in var.encoding and
var.encoding['dtype'] != 'S1' and
var.encoding['dtype'] is not str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if var.encoding.get('dtype') not in (None, 'S1', str):
or a bit more verbose but more readable:
if 'dtype' in var.encoding and var.encoding['dtype'] not in ('S1', str):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my first version. Sadly it doesn't work due to numpy/numpy#7242. I added a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype in {'S1', str} hashesh dtype and compares the hashes.
dtype in ('S1', str) invokes dtype.__eq__ and is the same as any(dtype == x for x in ('S1', str)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I misread that. Updated.

warnings.warn("CF decoding is overwriting dtype on variable {!r}"
.format(name))
else:
if 'dtype' not in encoding:
encoding['dtype'] = original_dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encoding.set_default('dtype', original_dtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

warnings.warn("CF decoding is overwriting dtype on variable {!r}"
.format(name))
else:
if 'dtype' not in encoding:
encoding['dtype'] = original_dtype

if 'dtype' in attributes and attributes['dtype'] == 'bool':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if attributes.get('dtype') == 'bool':

@@ -873,6 +886,7 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

@requires_netCDF4
class BaseNetCDF4Test(CFEncodedDataTest):
"""Tests for both netCDF4-python and h5netcdf."""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment contradicts the @requires_netCDF4 decorator immediately above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -274,3 +274,6 @@ def test_invalid_dataarray_names_raise(self):

def test_encoding_kwarg(self):
pass

def test_encoding_kwarg_fixed_width_string(self):
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the purpose of this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TestCFEncodedDataStore test is intended for checking the backend interface without depending on any particular file format. That means it doesn't support string encodings. (Added a comment.)

Copy link
Member Author
@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusaderky thanks for the review!

@@ -58,6 +62,9 @@ Bug fixes
bug where non-scalar data-variables that did not include the aggregation
dimension were improperly skipped.
By `Stephan Hoyer <https://github.com/shoyer>`_
- Fixed a regression in 0.10.4, where specifying ``{'dtype': 'S1'}`` in
``encoding`` with ``to_netcdf()`` raised an error.
`Stephan Hoyer <https://github.com/shoyer>`_

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if 'dtype' in var.encoding and var.encoding['dtype'] != 'S1':
if ('dtype' in var.encoding and
var.encoding['dtype'] != 'S1' and
var.encoding['dtype'] is not str):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my first version. Sadly it doesn't work due to numpy/numpy#7242. I added a comment.

warnings.warn("CF decoding is overwriting dtype on variable {!r}"
.format(name))
else:
if 'dtype' not in encoding:
encoding['dtype'] = original_dtype
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -873,6 +886,7 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

@requires_netCDF4
class BaseNetCDF4Test(CFEncodedDataTest):
"""Tests for both netCDF4-python and h5netcdf."""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -274,3 +274,6 @@ def test_invalid_dataarray_names_raise(self):

def test_encoding_kwarg(self):
pass

def test_encoding_kwarg_fixed_width_string(self):
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TestCFEncodedDataStore test is intended for checking the backend interface without depending on any particular file format. That means it doesn't support string encodings. (Added a comment.)

@shoyer
Copy link
Member Author
shoyer commented May 31, 2018

I plan to merge this (and issue the 0.10.5 release) end of day today unless anyone has further comments.

@shoyer shoyer merged commit 4106b94 into pydata:master Jun 1, 2018
@shoyer shoyer deleted the encoding-S1-fix branch June 1, 2018 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[REGRESSION] to_netcdf doesn't accept dtype=S1 encoding anymore
3 participants
0