diff --git a/intermediate/xarray_and_dask.ipynb b/intermediate/xarray_and_dask.ipynb index 9020532d..4de7a1ce 100644 --- a/intermediate/xarray_and_dask.ipynb +++ b/intermediate/xarray_and_dask.ipynb @@ -19,7 +19,106 @@ "2. Learn that all xarray built-in operations can transparently use dask\n", "\n", "\n", - "**Important:** *Using Dask does not always make your computations run faster!* Performance will depend on the computational infrastructure you're using (for example, how many CPU cores), how the data you're working with is structured and stored, and the algorithms and code you're running. Be sure to review the [Dask best-practices](https://docs.dask.org/en/stable/best-practices.html) if you're new to Dask!" + "**Important:** *Using Dask does not always make your computations run faster!* \n", + "\n", + "Performance will depend on the computational infrastructure you're using (for example, how many CPU cores), how the data you're working with is structured and stored, and the algorithms and code you're running. Be sure to review the [Dask best-practices](https://docs.dask.org/en/stable/best-practices.html) if you're new to Dask!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is Dask\n", + "\n", + "When we talk about Xarray + Dask, we are *usually* talking about two things:\n", + "1. `dask.array` as a drop-in replacement for numpy arrays\n", + "2. A \"scheduler\" that actually runs computations on dask arrays (commonly [distributed](https://docs.dask.org/en/stable/deploying.html))\n", + "\n", + "## Introduction to dask.array\n", + "\n", + "> Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays (*blocks* or *chunks*). This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs.\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import dask\n", + "import dask.array\n", + "\n", + "dasky = dask.array.ones((10, 5), chunks=(2, 2))\n", + "dasky" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why dask.array\n", + "\n", + "1. Use parallel resources to speed up computation\n", + "2. Work with datasets bigger than RAM (\"out-of-core\")\n", + " > \"dask lets you scale from memory-sized datasets to disk-sized datasets\"\n", + "\n", + "### dask is lazy\n", + "\n", + "Operations are not computed until you explicitly request them. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dasky.mean(axis=-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**first an apology!**\n", + "\n", + "So what did dask do when you called `.mean`? It added that operation to the \"graph\" or a blueprint of operations to execute later." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dask.visualize(dasky.mean(axis=-1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dasky.mean(axis=-1).compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More\n", + "\n", + "See the [dask.array tutorial](https://tutorial.dask.org/02_array.html)\n", + "\n", + "\n", + "### Dask + Xarray\n", + "\n", + "Remember that Xarray can wrap many different array types. So Xarray can wrap dask arrays too. \n", + "\n", + "We use Xarray to enable using our metadata to express our analysis." ] }, { @@ -28,7 +127,7 @@ "source": [ "\n", "\n", - "## Reading data\n", + "## Creating dask-backed Xarray objects\n", "\n", "The `chunks` argument to both `open_dataset` and `open_mfdataset` allow you to\n", "read datasets as dask arrays. \n" @@ -43,6 +142,16 @@ "import xarray as xr" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.tutorial.open_dataset(\"air_temperature\")\n", + "ds.air" + ] + }, { "cell_type": "code", "execution_count": null, @@ -52,12 +161,12 @@ "ds = xr.tutorial.open_dataset(\n", " \"air_temperature\",\n", " chunks={ # this tells xarray to open the dataset as a dask array\n", - " \"lat\": 25,\n", + " \"lat\": \"auto\",\n", " \"lon\": 25,\n", " \"time\": -1,\n", " },\n", ")\n", - "ds.air" + "ds" ] }, { @@ -76,6 +185,15 @@ "ds.air.chunks" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -87,10 +205,73 @@ { "cell_type": "markdown", "metadata": {}, + "source": [ + "## Extracting underlying data\n", + "\n", + "There are two ways to pull out the underlying array object in an xarray object.\n", + "\n", + "1. `.to_numpy` or `.values` will always return a NumPy array. For dask-backed xarray objects,\n", + " this means that compute will always be called\n", + "2. `.data` will return a Dask array\n", + "\n", + "**tip**: Use `to_numpy` or `as_numpy` instead of `.values` so that your code generalizes to other array types (like CuPy arrays, sparse arrays)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air.data # dask array, not numpy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.as_numpy().data ## numpy array" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Try calling `mean.values` and `mean.data`. Do you understand the difference?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "ds.air.to_numpy()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, "source": [ "\n", "\n", - "## lazy computation \n", + "## Lazy computation \n", "\n", "Xarray seamlessly wraps dask so all computation is deferred until explicitly\n", "requested." @@ -102,7 +283,8 @@ "metadata": {}, "outputs": [], "source": [ - "mean = ds.air.mean(\"time\")" + "mean = ds.air.mean(\"time\")\n", + "mean" ] }, { @@ -118,24 +300,135 @@ "metadata": {}, "outputs": [], "source": [ - "mean.data.visualize(optimize_graph=True)" + "mean.data # dask array" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "### Getting concrete values from dask arrays\n", + "# visualize the graph for the underlying dask array\n", + "# we ask it to visualize the graph from left to right because it looks nicer\n", + "dask.visualize(mean.data, rankdir=\"LR\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [], + "toc-hr-collapsed": true + }, + "source": [ + "## Getting concrete values\n", "\n", "At some point, you will want to actually get concrete values (_usually_ a numpy array) from dask.\n", "\n", "There are two ways to compute values on dask arrays.\n", "\n", - "1. `.compute()` returns an xarray object\n", + "1. `.compute()` returns an xarray object *just like a dask array*\n", "2. `.load()` replaces the dask array in the xarray object with a numpy array.\n", - " This is equivalent to `ds = ds.compute()`" + " This is equivalent to `ds = ds.compute()`\n", + " \n", + "**Tip:** There is a third option : \"persisting\". `.persist()` loads the values into distributed RAM. The values are computed but remain distributed across workers. So `ds.air.persist()` still returns a dask array. This is useful if you will be repeatedly using a dataset for computation but it is too large to load into local memory. You will see a persistent task on the dashboard. See the [dask user guide](https://docs.dask.org/en/latest/api.html#dask.persist) for more on persisting\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Try running `mean.compute` and then examine `mean` after that. Is it still a dask array?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "outputs_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "mean" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-output", + "hide-input" + ] + }, + "outputs": [], + "source": [ + "mean.compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-input", + "hide-output" + ] + }, + "outputs": [], + "source": [ + "mean" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise\n", + "\n", + "Now repeat that exercise with `mean.load`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-input", + "hide-output" + ] + }, + "outputs": [], + "source": [ + "mean.load()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "hide-input", + "hide-output" + ] + }, + "outputs": [], + "source": [ + "mean" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, @@ -168,8 +461,8 @@ "import dask\n", "import os\n", "\n", - "if os.environ.get('JUPYTERHUB_USER'):\n", - " dask.config.set(**{\"distributed.dashboard.link\": \"/user/{JUPYTERHUB_USER}/proxy/{port}/status\"})\n", + "# if os.environ.get('JUPYTERHUB_USER'):\n", + "# dask.config.set(**{\"distributed.dashboard.link\": \"/user/{JUPYTERHUB_USER}/proxy/{port}/status\"})\n", "\n", "client = Client(local_directory='/tmp')\n", "client" @@ -189,7 +482,12 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "jupyter": { + "outputs_hidden": true + }, + "tags": [] + }, "outputs": [], "source": [ "import dask.array\n", @@ -201,7 +499,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Examining a DataArray with dask" + "## Computation" ] }, { @@ -233,17 +531,22 @@ "metadata": {}, "outputs": [], "source": [ - "timeseries = ds.air.rolling(time=5).mean().isel(lon=1, lat=20) # no activity on dashboard\n", - "timeseries # contains dask array" + "rolling_mean = ds.air.rolling(time=5).mean() # no activity on dashboard\n", + "rolling_mean # contains dask array" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "jupyter": { + "outputs_hidden": true + }, + "tags": [] + }, "outputs": [], "source": [ - "timeseries = ds.air.rolling(time=5).mean() # no activity on dashboard\n", + "timeseries = rolling_mean.isel(lon=1, lat=20) # no activity on dashboard\n", "timeseries # contains dask array" ] }, @@ -253,7 +556,7 @@ "metadata": {}, "outputs": [], "source": [ - "computed = mean.compute() # activity on dashboard\n", + "computed = rolling_mean.compute() # activity on dashboard\n", "computed # has real numpy values" ] }, @@ -270,85 +573,41 @@ "metadata": {}, "outputs": [], "source": [ - "mean" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "But if we call `.load()`, `mean` will now contain a numpy array" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean.load()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's check that again...\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean" + "rolling_mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Tip:** `.persist()` loads the values into distributed RAM. This is useful if\n", - "you will be repeatedly using a dataset for computation but it is too large to\n", - "load into local memory. You will see a persistent task on the dashboard.\n", - "\n", - "See https://docs.dask.org/en/latest/api.html#dask.persist for more\n" + "**tip** While these operations all work, not all of them are necessarily the optimal implementation for parallelism. Usually analysis pipelines need some tinkering and tweaking to get things to work. In particular read the user guidie recommendations for [chunking](https://docs.xarray.dev/en/stable/user-guide/dask.html#chunking-and-performance) and [performance](https://docs.xarray.dev/en/stable/user-guide/dask.html#optimization-tips)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Extracting underlying data: `.values` vs `.data`\n", - "\n", - "There are two ways to pull out the underlying data in an xarray object.\n", + "## Xarray data structures are first-class dask collections.\n", "\n", - "1. `.values` will always return a NumPy array. For dask-backed xarray objects,\n", - " this means that compute will always be called\n", - "2. `.data` will return a Dask array" + "This means you can do things like `dask.compute(xarray_object)`,\n", + "`dask.visualize(xarray_object)`, `dask.persist(xarray_object)`. This works for\n", + "both DataArrays and Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Xarray data structures are first-class dask collections.\n", - "\n", - "This means you can do things like `dask.compute(xarray_object)`,\n", - "`dask.visualize(xarray_object)`, `dask.persist(xarray_object)`. This works for\n", - "both DataArrays and Datasets\n", - "\n", "### Exercise\n", "\n", - "Visualize the task graph for a few different computations on ds.air!" + "Visualize the task graph for a few different computations on `ds.air`! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Finish up\n", "Gracefully shutdown our connection to the Dask cluster. This becomes more important when you are running on large HPC or Cloud servers rather than a laptop!" ] }, @@ -360,6 +619,16 @@ "source": [ "client.close()" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next\n", + "\n", + "\n", + "See the [Xarray user guide on dask](https://docs.xarray.dev/en/stable/user-guide/dask.html). " + ] } ], "metadata": { diff --git a/overview/xarray-in-45-min.ipynb b/overview/xarray-in-45-min.ipynb index 8054916c..1c0eec1c 100644 --- a/overview/xarray-in-45-min.ipynb +++ b/overview/xarray-in-45-min.ipynb @@ -83,11 +83,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, "outputs": [], "source": [ "# pull out \"air\" dataarray with dictionary syntax\n", @@ -106,11 +102,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "remove-output" - ] - }, + "metadata": {}, "outputs": [], "source": [ "# pull out dataarray using dot notation\n", @@ -229,6 +221,15 @@ "ds.air.attrs" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.air" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -245,11 +246,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "output_scroll" - ] - }, + "metadata": {}, "outputs": [], "source": [ "ds.air.data" @@ -300,7 +297,15 @@ "# plot the first timestep\n", "lat = ds.air.lat.data # numpy array\n", "lon = ds.air.lon.data # numpy array\n", - "temp = ds.air.data # numpy array\n", + "temp = ds.air.data # numpy array" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "plt.figure()\n", "plt.pcolormesh(lon, lat, temp[0, :, :]);" ] @@ -329,7 +334,7 @@ "metadata": {}, "outputs": [], "source": [ - "ds.air.isel(time=1).plot(x=\"lon\");" + "ds.air.isel(time=0).plot(x=\"lon\");" ] }, { @@ -345,7 +350,7 @@ "metadata": {}, "outputs": [], "source": [ - "ds.air.mean(\"time\")" + "ds.air.mean(dim=\"time\").plot(x=\"lon\")" ] }, { @@ -377,11 +382,17 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, + "outputs": [], + "source": [ + "# here's what ds looks like\n", + "ds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "# pull out data for all of 2013-May\n", @@ -391,11 +402,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, "outputs": [], "source": [ "# demonstrate slicing\n", @@ -405,11 +412,16 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, + "outputs": [], + "source": [ + "ds.sel(time=\"2013\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "# demonstrate \"nearest\" indexing\n", @@ -419,11 +431,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, "outputs": [], "source": [ "# \"nearest indexing at multiple points\"\n", @@ -443,26 +451,26 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "hide-output" - ] - }, + "metadata": {}, "outputs": [], "source": [ - "# pull out time index 0 and lat index 0\n", - "ds.air.isel(time=0, lat=0) # much better than ds.air[0, 0, :]" + "ds.air.data[0, 2, 3]" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "tags": [ - "output_scroll", - "hide-output" - ] - }, + "metadata": {}, + "outputs": [], + "source": [ + "# pull out time index 0, lat index 2, and lon index 3\n", + "ds.air.isel(time=0, lat=2, lon=3) # much better than ds.air[0, 2, 3]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, "outputs": [], "source": [ "# demonstrate slicing\n", @@ -513,6 +521,16 @@ "2. We used [ones_like](https://docs.xarray.dev/en/stable/generated/xarray.ones_like.html) to create a DataArray that looks like `ds.air.lon` in all respects, except that the data are all ones" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# returns an xarray DataArray!\n", + "np.cos(np.deg2rad(ds.lat))" + ] + }, { "cell_type": "code", "execution_count": null, @@ -627,7 +645,29 @@ "means that your xarray coordinates were not aligned _exactly_.\n", "\n", "For more, see\n", - "[the Xarray documentation](https://docs.xarray.dev/en/stable/user-guide/computation.html#automatic-alignment). [This tutorial notebook](https://tutorial.xarray.dev/fundamentals/02.3_aligning_data_objects.html) also covers alignment and broadcasting.\n" + "[the Xarray documentation](https://docs.xarray.dev/en/stable/user-guide/computation.html#automatic-alignment). [This tutorial notebook](https://tutorial.xarray.dev/fundamentals/02.3_aligning_data_objects.html) also covers alignment and broadcasting (*highly recommended*)\n", + "\n", + "To make sure variables are aligned as you think they are, do the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "raises-exception" + ] + }, + "outputs": [], + "source": [ + "xr.align(cell_area_bad, ds.air, join=\"exact\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above statement raises an error since the two are not aligned." ] }, { @@ -665,6 +705,16 @@ "### groupby\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# here's ds\n", + "ds" + ] + }, { "cell_type": "code", "execution_count": null, @@ -704,6 +754,15 @@ "seasonal_mean" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "seasonal_mean.air.plot(col=\"season\")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -735,7 +794,7 @@ "outputs": [], "source": [ "# weight by cell_area and take mean over (time, lon)\n", - "ds.weighted(cell_area).mean([\"lon\", \"time\"]).air.plot();" + "ds.weighted(cell_area).mean([\"lon\", \"time\"]).air.plot(y=\"lat\");" ] }, { @@ -761,7 +820,7 @@ "outputs": [], "source": [ "# facet the seasonal_mean\n", - "seasonal_mean.air.plot(col=\"season\");" + "seasonal_mean.air.plot(col=\"season\", col_wrap=2);" ] }, { @@ -784,15 +843,6 @@ "seasonal_mean.air.mean(\"lon\").plot.line(hue=\"season\", y=\"lat\");" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1132,15 +1182,7 @@ ], "metadata": { "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3" + "name": "" }, "toc": { "base_numbering": 1,