Fix dtype=S1 encoding in to_netcdf() #2158

shoyer · 2018-05-18T06:30:55Z

Closes [REGRESSION] to_netcdf doesn't accept dtype=S1 encoding anymore #2149
Tests added
Tests passed
Fully documented, including whats-new.rst for all changes and api.rst for new API

@crusaderky please take a look. Testing here is not as thorough as in #2150 yet, but it does include a regression test.

crusaderky · 2018-05-18T06:33:32Z

@shoyer just pull the test from #2150...

stickler-ci · 2018-05-18T07:01:04Z

xarray/backends/h5netcdf_.py

@@ -184,8 +184,9 @@ def prepare_variable(self, name, variable, check_encoding=False,
                raise ValueError("'zlib' and 'compression' encodings mismatch")
            encoding.setdefault('compression', 'gzip')

-        if (check_encoding and encoding.get('complevel') not in
-                (None, encoding.get('compression_opts'))):
+        if (check_encoding and 


W291 trailing whitespace

stickler-ci · 2018-05-18T07:01:04Z

xarray/backends/h5netcdf_.py

-                (None, encoding.get('compression_opts'))):
+        if (check_encoding and 
+                'complevel' in encoding and 'compression_opts' in encoding and
+                 encoding['complevel'] != encoding['compression_opts']):


E127 continuation line over-indented for visual indent
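
For readers following the logic change rather than the lint warning: taken in isolation, the removed condition fires whenever complevel is set and differs from compression_opts, including the case where compression_opts is simply absent, while the added condition only fires when both keys are present and disagree. A minimal sketch of the two predicates, assuming nothing else in prepare_variable touches these keys first (the function names old_mismatch and new_mismatch are hypothetical, not part of xarray):

```python
def old_mismatch(encoding, check_encoding=True):
    # Condition removed in the diff above.
    return bool(check_encoding and encoding.get('complevel') not in
                (None, encoding.get('compression_opts')))


def new_mismatch(encoding, check_encoding=True):
    # Condition added in the diff above: both keys must be present.
    return bool(check_encoding and
                'complevel' in encoding and 'compression_opts' in encoding and
                encoding['complevel'] != encoding['compression_opts'])


only_complevel = {'complevel': 9}
agreeing = {'complevel': 9, 'compression_opts': 9}
conflicting = {'complevel': 9, 'compression_opts': 4}

for enc in (only_complevel, agreeing, conflicting):
    print(enc, old_mismatch(enc), new_mismatch(enc))
```

The two predicates only disagree on the first dictionary, where complevel is given without compression_opts.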

stickler-ci · 2018-05-21T00:23:12Z

xarray/tests/test_backends.py

+        with self.roundtrip(ds, save_kwargs=kwargs) as actual:
+            assert_equal(actual, ds)
+            assert actual.x.encoding['dtype'] == 'f4'
+            assert actual.x.encoding['zlib'] == True


E712 comparison to True should be 'if cond is True:' or 'if cond:'
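
As a side note on why the linter flags this: == True accepts any value that compares equal to True, whereas is True accepts only the True singleton. A small illustration in plain Python (the variable names are made up for this sketch, and it says nothing about which type the roundtripped encoding values actually have):

```python
flag_bool = True
flag_int = 1

# `==` accepts anything that compares equal to True, including the int 1.
loose_bool = (flag_bool == True)  # noqa: E712
loose_int = (flag_int == True)  # noqa: E712

# `is` only accepts the True singleton itself.
strict_bool = (flag_bool is True)
strict_int = (flag_int is True)

print(loose_bool, loose_int, strict_bool, strict_int)
```

Which form a test should use depends on whether it is meant to pin down the exact type or only the truth value.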

stickler-ci · 2018-05-21T00:23:12Z

xarray/tests/test_backends.py

+            assert actual.x.encoding['dtype'] == 'f4'
+            assert actual.x.encoding['zlib'] == True
+            assert actual.x.encoding['complevel'] == 9
+            assert actual.x.encoding['fletcher32'] == True


E712 comparison to True should be 'if cond is True:' or 'if cond:'

stickler-ci · 2018-05-21T00:23:13Z

xarray/tests/test_backends.py

+            assert actual.x.encoding['complevel'] == 9
+            assert actual.x.encoding['fletcher32'] == True
+            assert actual.x.encoding['chunksizes'] == (5,)
+            assert actual.x.encoding['shuffle'] == True


E712 comparison to True should be 'if cond is True:' or 'if cond:'

stickler-ci · 2018-05-21T00:23:13Z

xarray/tests/test_backends.py

                # there should be no chunks
-                self.assertEqual(v.chunks, None)
+                assert v.chunks == None


E711 comparison to None should be 'if cond is None:'
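
The reasoning behind E711 can be sketched with an object that overrides __eq__: == None delegates to that method, while is None is an identity check that cannot be overridden. The AlwaysEqual class below is hypothetical and exists only to make the difference visible:

```python
class AlwaysEqual:
    """Hypothetical object whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

equals_none = (obj == None)  # noqa: E711  (calls AlwaysEqual.__eq__)
is_none = (obj is None)      # identity check, ignores __eq__

print(equals_none, is_none)
```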

stickler-ci · 2018-05-21T00:23:13Z

xarray/tests/test_backends.py

@@ -1793,19 +1818,19 @@ def test_dump_encodings_h5py(self):
        kwargs = {'encoding': {'x': {
            'compression': 'gzip', 'compression_opts': 9}}}
        with self.roundtrip(ds, save_kwargs=kwargs) as actual:
-            self.assertEqual(actual.x.encoding['zlib'], True)
-            self.assertEqual(actual.x.encoding['complevel'], 9)
+            assert actual.x.encoding['zlib'] == True


E712 comparison to True should be 'if cond is True:' or 'if cond:'

stickler-ci · 2018-05-21T00:23:13Z

xarray/tests/test_backends.py

-            self.assertEqual(actual.x.encoding['compression'], 'lzf')
-            self.assertEqual(actual.x.encoding['compression_opts'], None)
+            assert actual.x.encoding['compression'] == 'lzf'
+            assert actual.x.encoding['compression_opts'] == None


E711 comparison to None should be 'if cond is None:'

stickler-ci · 2018-05-21T07:59:56Z

xarray/tests/test_backends.py

+        original = Dataset({'x': [u'foo', u'bar', u'baz']})
+        kwargs = dict(encoding={'x': {'dtype': str}})
+        with raises_regex(ValueError, 'encoding dtype=str for vlen'):
+            with self.roundtrip(original, save_kwargs=kwargs) as actual:


F841 local variable 'actual' is assigned to but never used

Fixes GH2149

shoyer · 2018-05-25T00:53:30Z

I think this is ready for another review.

@crusaderky I ended up splitting up your test into a few smaller pieces, so we could more easily handle logic for different backends with subclassing.

crusaderky · 2018-05-25T21:54:57Z

doc/whats-new.rst

@@ -58,6 +62,9 @@ Bug fixes
  bug where non-scalar data-variables that did not include the aggregation
  dimension were improperly skipped.
  By `Stephan Hoyer <https://github.com/shoyer>`_
+- Fixed a regression in 0.10.4, where specifying ``{'dtype': 'S1'}`` in
+  ``encoding`` with ``to_netcdf()`` raised an error.
+  `Stephan Hoyer <https://github.com/shoyer>`_



duplicate entry

crusaderky · 2018-05-25T21:56:34Z

xarray/backends/h5netcdf_.py

+            vlen_dtype = h5py.check_dtype(vlen=var.dtype)
+            if vlen_dtype is not None:
+                if vlen_dtype is not unicode_type:  # pragma: no cover
+                    raise NotImplementedError('unexpected vlen dtype: {!r}'


What's the difference from var.dtype.kind == 'U'?

h5py uses special NumPy dtypes (np.object_ with metadata) for defining variable length types. It doesn't use var.dtype.kind == 'U', because HDF5 doesn't have a unicode type with a memory model matching np.unicode_. (HDF5 implements UTF-8 with a fixed number of bytes for storage, not a fixed number of unicode characters like NumPy.)
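
The memory-model mismatch mentioned here can be illustrated without h5py or NumPy. NumPy's fixed-width unicode dtype reserves four bytes per character, which matches the UTF-32 size below, while the UTF-8 size depends on which characters the string contains. This is only a sketch of the byte counts using the standard library, not of how either library stores data internally:

```python
# 'foo', 'naive' with a diaeresis on the i, and a three-character CJK word.
words = ['foo', 'na\u00efve', '\u65e5\u672c\u8a9e']

sizes = []
for word in words:
    n_chars = len(word)
    utf8_bytes = len(word.encode('utf-8'))       # varies per character
    utf32_bytes = len(word.encode('utf-32-le'))  # always 4 bytes per character
    sizes.append((word, n_chars, utf8_bytes, utf32_bytes))

for row in sizes:
    print(row)
```

Two strings with the same number of characters ('foo' and the CJK word) need different numbers of UTF-8 bytes, so a fixed byte width and a fixed character width are not interchangeable.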

crusaderky · 2018-05-25T22:04:24Z

xarray/conventions.py

-    if 'dtype' in var.encoding and var.encoding['dtype'] != 'S1':
+    if ('dtype' in var.encoding and
+            var.encoding['dtype'] != 'S1' and
+            var.encoding['dtype'] is not str):


if var.encoding.get('dtype') not in (None, 'S1', str):
or a bit more verbose but more readable:
if 'dtype' in var.encoding and var.encoding['dtype'] not in ('S1', str):

Yes, that was my first version. Sadly it doesn't work due to numpy/numpy#7242. I added a comment.

dtype in {'S1', str} hashes dtype and compares the hashes.
dtype in ('S1', str) invokes dtype.__eq__ and is the same as any(dtype == x for x in ('S1', str)).

Indeed, I misread that. Updated.
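
The tuple-versus-set distinction discussed above can be reproduced in plain Python. The DtypeLike class is a hypothetical stand-in for any object that compares equal to a string spelling of itself but hashes differently; it is not NumPy code and makes no claim about how np.dtype is implemented:

```python
class DtypeLike:
    """Hypothetical stand-in: equal to its string name, but hashed differently."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, DtypeLike):
            return self.name == other.name
        return self.name == other

    def __hash__(self):
        # Deliberately not hash(self.name), so set lookups miss the string.
        return hash(('DtypeLike', self.name))


dtype = DtypeLike('S1')

in_tuple = dtype in ('S1', str)  # walks the tuple, comparing with ==
in_set = dtype in {'S1', str}    # looks up by hash first

print(in_tuple, in_set)
```

With a tuple the membership test falls back to equality, which is why the tuple form behaves like the chained comparisons in the diff.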

crusaderky · 2018-05-26T00:34:18Z

xarray/conventions.py

-            warnings.warn("CF decoding is overwriting dtype on variable {!r}"
-                          .format(name))
-    else:
+    if 'dtype' not in encoding:
        encoding['dtype'] = original_dtype


encoding.setdefault('dtype', original_dtype)
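
The suggestion maps onto the built-in dict.setdefault method, which assigns a value only when the key is missing. A minimal sketch showing that it matches the explicit if from the diff (the helper names and the 'float64' placeholder are assumptions for illustration; both helpers copy the input so the comparison is side-effect free):

```python
original_dtype = 'float64'  # placeholder value for this sketch


def with_if(encoding):
    encoding = dict(encoding)
    if 'dtype' not in encoding:
        encoding['dtype'] = original_dtype
    return encoding


def with_setdefault(encoding):
    encoding = dict(encoding)
    encoding.setdefault('dtype', original_dtype)
    return encoding


for enc in ({}, {'dtype': 'S1'}):
    print(with_if(enc), with_setdefault(enc))
```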

crusaderky · 2018-05-26T00:34:50Z

xarray/conventions.py

-            warnings.warn("CF decoding is overwriting dtype on variable {!r}"
-                          .format(name))
-    else:
+    if 'dtype' not in encoding:
        encoding['dtype'] = original_dtype

    if 'dtype' in attributes and attributes['dtype'] == 'bool':


if attributes.get('dtype') == 'bool':

crusaderky · 2018-05-26T00:36:11Z

xarray/tests/test_backends.py

@@ -873,6 +886,7 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

 @requires_netCDF4
 class BaseNetCDF4Test(CFEncodedDataTest):
+    """Tests for both netCDF4-python and h5netcdf."""



this comment contradicts the @requires_netCDF4 decorator immediately above

crusaderky · 2018-05-26T00:38:10Z

xarray/tests/test_conventions.py

@@ -274,3 +274,6 @@ def test_invalid_dataarray_names_raise(self):

    def test_encoding_kwarg(self):
        pass
+
+    def test_encoding_kwarg_fixed_width_string(self):
+        pass


I'm not sure I understand the purpose of this...

This TestCFEncodedDataStore test is intended for checking the backend interface without depending on any particular file format. That means it doesn't support string encodings. (Added a comment.)
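
The pattern in the diff, overriding an inherited test with pass so that it does not run for one backend, can be sketched with the standard unittest module. All class and method names below are hypothetical and only mirror the shape of the xarray test hierarchy:

```python
import unittest


class BaseRoundtripTests(unittest.TestCase):
    """Hypothetical base class shared by several backend test classes."""

    def roundtrip(self, value):
        return value

    def test_fixed_width_string(self):
        self.assertEqual(self.roundtrip('abc'), 'abc')


class FormatAgnosticStoreTests(BaseRoundtripTests):
    """Hypothetical backend that has no string encoding support."""

    def roundtrip(self, value):
        raise NotImplementedError('no string encoding in this store')

    def test_fixed_width_string(self):
        # Overridden on purpose: the inherited test does not apply here.
        pass


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FormatAgnosticStoreTests)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, result.wasSuccessful())
```

One trade-off of this approach is that the overridden test is reported as passed rather than skipped; unittest.skip would make the omission visible in the test report.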

shoyer

@crusaderky thanks for the review!

shoyer · 2018-05-29T03:16:01Z

doc/whats-new.rst

@@ -58,6 +62,9 @@ Bug fixes
  bug where non-scalar data-variables that did not include the aggregation
  dimension were improperly skipped.
  By `Stephan Hoyer <https://github.com/shoyer>`_
+- Fixed a regression in 0.10.4, where specifying ``{'dtype': 'S1'}`` in
+  ``encoding`` with ``to_netcdf()`` raised an error.
+  `Stephan Hoyer <https://github.com/shoyer>`_



shoyer · 2018-05-29T03:21:19Z

xarray/conventions.py

-    if 'dtype' in var.encoding and var.encoding['dtype'] != 'S1':
+    if ('dtype' in var.encoding and
+            var.encoding['dtype'] != 'S1' and
+            var.encoding['dtype'] is not str):


Yes, that was my first version. Sadly it doesn't work due to numpy/numpy#7242. I added a comment.

shoyer · 2018-05-29T03:21:33Z

xarray/conventions.py

-            warnings.warn("CF decoding is overwriting dtype on variable {!r}"
-                          .format(name))
-    else:
+    if 'dtype' not in encoding:
        encoding['dtype'] = original_dtype


shoyer · 2018-05-29T03:45:24Z

xarray/tests/test_backends.py

@@ -873,6 +886,7 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

 @requires_netCDF4
 class BaseNetCDF4Test(CFEncodedDataTest):
+    """Tests for both netCDF4-python and h5netcdf."""



shoyer · 2018-05-29T03:48:30Z

xarray/tests/test_conventions.py

@@ -274,3 +274,6 @@ def test_invalid_dataarray_names_raise(self):

    def test_encoding_kwarg(self):
        pass
+
+    def test_encoding_kwarg_fixed_width_string(self):
+        pass


This TestCFEncodedDataStore test is intended for checking the backend interface without depending on any particular file format. That means it doesn't support string encodings. (Added a comment.)

shoyer · 2018-05-31T15:43:46Z

I plan to merge this (and issue the 0.10.5 release) end of day today unless anyone has further comments.

stickler-ci reviewed May 18, 2018

stickler-ci reviewed May 21, 2018

shoyer added 9 commits May 23, 2018 19:54

Fix dtype=S1 encoding in to_netcdf()

f1da165

Fixes GH2149

Add test_encoding_kwarg_compression from crusaderky

c0b5bd1

Fix dtype=S1 in kwargs for bytes, too

48d99fc

Fix lint

de001a5

Move compression encoding kwarg test

46dd6aa

Remvoe no longer relevant chanegs

b85dc35

Fix encoding dtype=str

905de86

More lint

e723941

Fix failed tests

7418924

shoyer force-pushed the encoding-S1-fix branch from 8bbf70c to 7418924 Compare May 24, 2018 04:00

shoyer mentioned this pull request May 25, 2018

v0.10.5 release #2182

Closed

8 tasks

shoyer added the needs review label May 25, 2018

crusaderky requested changes May 26, 2018

Review comments

58c3ac6

shoyer commented May 29, 2018

shoyer and others added 4 commits May 28, 2018 21:40

oops, we still need to skip that test

a6ad090

Merge branch 'master' into encoding-S1-fix

c45847c

check for presence in a tuple rather than making two comparisons

c6ddbc5

Merge branch 'master' into encoding-S1-fix

6f5b631

shoyer merged commit 4106b94 into pydata:master Jun 1, 2018

shoyer deleted the encoding-S1-fix branch June 1, 2018 01:09
