Permission error with exists on a bucket (OSError storage.buckets.get scope needed) · Issue #462 · fsspec/gcsfs · GitHub

@rabernat


A user reported that they can no longer load datasets from the legacy Pangeo intake catalog (pangeo-data/pangeo-datastore#132). I played around with it and boiled it down to the following minimal reproducer. The bucket in question is requester-pays, so you can either run this from Pangeo Cloud or set up your own GCS credentials.

import gcsfs
fs = gcsfs.GCSFileSystem(requester_pays=True)
fs.exists("gs://pangeo-cmems-duacs")

This raises

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [1], in <cell line: 3>()
      1 import gcsfs
      2 fs = gcsfs.GCSFileSystem(requester_pays=True)
----> 3 fs.exists("gs://pangeo-cmems-duacs")

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:85, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
     82 @functools.wraps(func)
     83 def wrapper(*args, **kwargs):
     84     self = obj or args[0]
---> 85     return sync(self.loop, func, *args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:65, in sync(loop, func, timeout, *args, **kwargs)
     63     raise FSTimeoutError from return_result
     64 elif isinstance(return_result, BaseException):
---> 65     raise return_result
     66 else:
     67     return return_result

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:25, in _runner(event, coro, result, timeout)
     23     coro = asyncio.wait_for(coro, timeout=timeout)
     24 try:
---> 25     result[0] = await coro
     26 except Exception as ex:
     27     result[0] = ex

File /srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/asyn.py:547, in AsyncFileSystem._exists(self, path)
    545 async def _exists(self, path):
    546     try:
--> 547         await self._info(path)
    548         return True
    549     except FileNotFoundError:

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:654, in GCSFileSystem._info(self, path, **kwargs)
    652 path = self._strip_protocol(path).rstrip("/")
    653 if "/" not in path:
--> 654     out = await self._call("GET", f"b/{path}", json_out=True)
    655     out.update(size=0, type="directory")
    656     return out

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:386, in GCSFileSystem._call(self, method, path, json_out, info_out, *args, **kwargs)
    381 async def _call(
    382     self, method, path, *args, json_out=False, info_out=False, **kwargs
    383 ):
    384     logger.debug(f"{method.upper()}: {path}, {args}, {kwargs.get('headers')}")
--> 386     status, headers, info, contents = await self._request(
    387         method, path, *args, **kwargs
    388     )
    389     if json_out:
    390         return json.loads(contents)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/decorator.py:221, in decorate.<locals>.fun(*args, **kw)
    219 if not kwsyntax:
    220     args, kw = fix(args, kw, sig)
--> 221 return await caller(func, *(extras + args), **kw)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/retry.py:115, in retry_request(func, retries, *args, **kwargs)
    113     if retry > 0:
    114         await asyncio.sleep(min(random.random() + 2 ** (retry - 1), 32))
--> 115     return await func(*args, **kwargs)
    116 except (
    117     HttpError,
    118     requests.exceptions.RequestException,
   (...)
    121     aiohttp.client_exceptions.ClientError,
    122 ) as e:
    123     if (
    124         isinstance(e, HttpError)
    125         and e.code == 400
    126         and "requester pays" in e.message
    127     ):

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/core.py:378, in GCSFileSystem._request(self, method, path, headers, json, data, *args, **kwargs)
    375 info = r.request_info  # for debug only
    376 contents = await r.read()
--> 378 validate_response(status, contents, path, args)
    379 return status, headers, info, contents

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gcsfs/retry.py:96, in validate_response(status, content, path, args)
     93     msg = content
     95 if status == 403:
---> 96     raise IOError("Forbidden: %s\n%s" % (path, msg))
     97 elif status == 502:
     98     raise requests.exceptions.ProxyError()

OSError: Forbidden: b/pangeo-cmems-duacs
prod-user-sa@pangeo-integration-te-3eea.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
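
Reading the traceback, `_info` strips the protocol and any trailing slash, and if no `/` remains it issues `GET b/{path}`, i.e. a bucket metadata request. The sketch below mirrors only that branch, to show why a bare bucket path hits `storage.buckets.get` while an object path does not. The helper name `needs_bucket_metadata` and the simplified prefix stripping are hypothetical stand-ins, not gcsfs code.

```python
def needs_bucket_metadata(path):
    """Return True if `path` would take the bucket-level branch seen in the traceback.

    Hypothetical helper: the prefix handling below is a simplified stand-in
    for gcsfs's own `_strip_protocol`, not the real implementation.
    """
    for prefix in ("gs://", "gcs://"):
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    # Same test as in the traceback: drop trailing slashes, then look for "/".
    path = path.rstrip("/")
    return "/" not in path
```

With this, `needs_bucket_metadata("gs://pangeo-cmems-duacs")` is true, whereas a path such as `gs://pangeo-cmems-duacs/.zmetadata` contains a `/` after the bucket name and falls through to the object-level lookup.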

What is peculiar here is that a zarr dataset sits in the root path of the bucket. We can check individual objects just fine, e.g.

fs.exists("gs://pangeo-cmems-duacs/.zmetadata")

However, we cannot call exists on the root of the bucket. Therefore, we cannot call

import xarray as xr
xr.open_dataset("gs://pangeo-cmems-duacs", engine="zarr")

because that function calls exists.

However, this does work

ds = xr.open_zarr(fs.get_mapper("gs://pangeo-cmems-duacs"))

This is related to zarr-developers/zarr-python#911, which addresses the multiple different ways we can create zarr stores from fsspec filesystems.


I know I say this a lot, but I am certain that this used to work at some point in the past! Perhaps what changed is the use of FSStore in Zarr?

In any case, I feel that I should be able to call fs.exists('gs://pangeo-cmems-duacs') on a bucket I have read access to without the storage.buckets.get auth scope. I think that is a gcsfs bug that should be fixed.
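
Until this is resolved, one possible user-side workaround is to treat a 403 on the bucket lookup as "unknown" and fall back to listing the path. This is only a sketch, not the gcsfs fix: the function name `exists_tolerant` is hypothetical, it assumes the error is the plain `OSError` with a `Forbidden:` message seen in the traceback, and it assumes the account is allowed to list objects in the bucket.

```python
def exists_tolerant(fs, path):
    """Check existence, falling back to a listing if the direct lookup is forbidden.

    `fs` is any fsspec-style filesystem exposing `exists` and `ls`
    (for example a `gcsfs.GCSFileSystem`). Hypothetical helper.
    """
    try:
        return fs.exists(path)
    except FileNotFoundError:
        return False
    except OSError as err:
        # Assumption: a 403 surfaces as OSError("Forbidden: ...") as in the
        # traceback above. Anything else is re-raised untouched.
        if "Forbidden" not in str(err):
            raise
    try:
        # Assumption: listing needs only object-level permissions.
        fs.ls(path)
        return True
    except FileNotFoundError:
        return False
```

It would be called as `exists_tolerant(fs, "gs://pangeo-cmems-duacs")` with the `fs` from the reproducer. Listing a large bucket root may be slower than a metadata request, so this trades some speed for needing fewer permissions.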
