Description
A user reported that they can no longer load datasets from the legacy Pangeo intake catalog (pangeo-data/pangeo-datastore#132). I played around with it and boiled it down to the following minimal reproducer. The bucket in question is requester-pays, so you can either run this from Pangeo Cloud or set up your own GCS credentials.
import gcsfs
fs = gcsfs.GCSFileSystem(requester_pays=True)
fs.exists("gs://pangeo-cmems-duacs")
This raises
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Input In [1], in <cell line: 3>()
1 import gcsfs
2 fs = gcsfs.GCSFileSystem(requester_pays=True)
----> 3 fs.exists("gs://pangeo-cmems-duacs")
File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:85, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
82 @functools.wraps(func)
83 def wrapper(*args, **kwargs):
84 self = obj or args[0]
---> 85 return sync(self.loop, func, *args, **kwargs)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:65, in sync(loop, func, timeout, *args, **kwargs)
63 raise FSTimeoutError from return_result
64 elif isinstance(return_result, BaseException):
---> 65 raise return_result
66 else:
67 return return_result
File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:25, in _runner(event, coro, result, timeout)
23 coro = asyncio.wait_for(coro, timeout=timeout)
24 try:
---> 25 result[0] = await coro
26 except Exception as ex:
27 result[0] = ex
File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:547, in AsyncFileSystem._exists(self, path)
545 async def _exists(self, path):
546 try:
--> 547 await self._info(path)
548 return True
549 except FileNotFoundError:
File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:654, in GCSFileSystem._info(self, path, **kwargs)
652 path = self._strip_protocol(path).rstrip("/")
653 if "/" not in path:
--> 654 out = await self._call("GET", f"b/{path}", json_out=True)
655 out.update(size=0, type="directory")
656 return out
File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:386, in GCSFileSystem._call(self, method, path, json_out, info_out, *args, **kwargs)
381 async def _call(
382 self, method, path, *args, json_out=False, info_out=False, **kwargs
383 ):
384 logger.debug(f"{method.upper()}: {path}, {args}, {kwargs.get('headers')}")
--> 386 status, headers, info, contents = await self._request(
387 method, path, *args, **kwargs
388 )
389 if json_out:
390 return json.loads(contents)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/decorator.py:221, in decorate.<locals>.fun(*args, **kw)
219 if not kwsyntax:
220 args, kw = fix(args, kw, sig)
--> 221 return await caller(func, *(extras + args), **kw)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/retry.py:115, in retry_request(func, retries, *args, **kwargs)
113 if retry > 0:
114 await asyncio.sleep(min(random.random() + 2 ** (retry - 1), 32))
--> 115 return await func(*args, **kwargs)
116 except (
117 HttpError,
118 requests.exceptions.RequestException,
(...)
121 aiohttp.client_exceptions.ClientError,
122 ) as e:
123 if (
124 isinstance(e, HttpError)
125 and e.code == 400
126 and "requester pays" in e.message
127 ):
File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:378, in GCSFileSystem._request(self, method, path, headers, json, data, *args, **kwargs)
375 info = r.request_info # for debug only
376 contents = await r.read()
--> 378 validate_response(status, contents, path, args)
379 return status, headers, info, contents
File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/retry.py:96, in validate_response(status, content, path, args)
93 msg = content
95 if status == 403:
---> 96 raise IOError("Forbidden: %s\n%s" % (path, msg))
97 elif status == 502:
98 raise requests.exceptions.ProxyError()
OSError: Forbidden: b/pangeo-cmems-duacs
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
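To make the failure mode in the traceback easier to see, here is a minimal, offline sketch of the control flow. The functions `info` and `exists` are hypothetical stand-ins for `GCSFileSystem._info` and `AsyncFileSystem._exists`, not the real implementations, and the `return False` branch is my assumption about what follows the `except FileNotFoundError:` line shown above. The point is only that a 403 surfaces as a plain OSError, which that handler does not catch.

```python
def info(path):
    # Hypothetical stand-in for GCSFileSystem._info: a bare bucket name
    # triggers a bucket-level GET, which here is answered with 403.
    if "/" not in path:
        raise IOError(f"Forbidden: b/{path}")  # IOError is an alias of OSError
    return {"name": path, "type": "file"}


def exists(path):
    # Same try/except shape as the _exists frame in the traceback.
    try:
        info(path)
        return True
    except FileNotFoundError:
        # Assumed branch: only "not found" is translated into False.
        return False


def demo():
    results = {"object": exists("pangeo-cmems-duacs/.zmetadata")}
    try:
        results["bucket"] = exists("pangeo-cmems-duacs")
    except OSError as err:
        # FileNotFoundError is a subclass of OSError, not the other way
        # around, so the plain OSError escapes exists() and lands here.
        results["bucket"] = f"raised {type(err).__name__}"
    return results


print(demo())
```

Under these assumptions, the object-level check returns True while the bucket-level check propagates the error instead of returning a boolean.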
What is particular about this bucket is that there is a Zarr dataset at its root path. We can check individual objects just fine, e.g.
fs.exists("gs://pangeo-cmems-duacs/.zmetadata")
However, we cannot call exists on the root of the bucket. Therefore, we cannot call
import xarray as xr
xr.open_dataset("gs://pangeo-cmems-duacs", engine="zarr")
because that function calls exists.
However, this does work
ds = xr.open_zarr(fs.get_mapper("gs://pangeo-cmems-duacs"))
This is related to zarr-developers/zarr-python#911, which addresses the multiple different ways we can create zarr stores from fsspec filesystems.
I know I say this a lot, but I am certain that this used to work at some point in the past! Perhaps what changed is the use of FSStore in Zarr?
In any case, I feel that I should be able to call fs.exists('gs://pangeo-cmems-duacs') on a bucket I have read access to without the storage.buckets.get permission. I think that is a gcsfs bug that should be fixed.
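Until that is addressed, a user-side stopgap might look like the sketch below. It is not a proposed gcsfs patch: `exists_tolerating_forbidden` and its `probe_key` parameter are hypothetical names, and the fallback assumes the bucket root holds a consolidated Zarr store so that `.zmetadata` is a meaningful object to probe. `FakeFS` is only a stand-in so the snippet runs without credentials; with a real filesystem you would pass the `gcsfs.GCSFileSystem` instance instead.

```python
def exists_tolerating_forbidden(fs, path, probe_key=".zmetadata"):
    """Return fs.exists(path), falling back to an object-level probe.

    Hypothetical helper: if the bucket-level check raises OSError
    (e.g. 403 Forbidden), check a known object under the path instead.
    """
    try:
        return fs.exists(path)
    except OSError:
        # Assumption: read access to objects is enough to answer the question.
        return fs.exists(f"{path.rstrip('/')}/{probe_key}")


class FakeFS:
    """Offline stand-in that mimics the behavior reported in this issue."""

    def exists(self, path):
        bare = path.replace("gs://", "")
        if "/" not in bare:
            # Bucket-level lookup is refused, as in the traceback.
            raise OSError(f"Forbidden: b/{bare}")
        return bare.endswith(".zmetadata")


print(exists_tolerating_forbidden(FakeFS(), "gs://pangeo-cmems-duacs"))
```

With real credentials, the call would presumably be `exists_tolerating_forbidden(gcsfs.GCSFileSystem(requester_pays=True), "gs://pangeo-cmems-duacs")`, though I have only exercised the logic against the stand-in.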