[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13 by edsavage · Pull Request #2588 · elastic/ml-cpp · GitHub

[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13 #2588


Merged
edsavage merged 17 commits into elastic:main on Jan 10, 2024

Conversation

edsavage (Contributor) commented Oct 25, 2023

Pytorch 2.1.2 has recently been released. Opening this PR to upgrade to it and potentially flush out any BWC issues via the unit and ES integration tests.

Closes #2587

@@ -1,2 +1,2 @@
name,version,revision,url,license,copyright,sourceURL
-PyTorch,1.11.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
+PyTorch,2.1.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
Contributor

The Git commit (3rd value) should also be changed here.
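For illustration only, the corrected row would carry the commit hash of the PyTorch v2.1.0 tag in that third column; the value below is a placeholder, not the real hash:

```
PyTorch,2.1.0,<commit-hash-of-the-v2.1.0-tag>,https://pytorch.org,BSD-3-Clause,,
```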


This assumes that you have cloned PyTorch in the directory `${PYTORCH_SRC_DIR}` in the step above. Make sure that this path is correct.

Building IPEX requires a lot of memory. To reduce this requirement, we can patch the IPEX build system to lower the number of parallel processes. We use `sed` to replace the call to `multiprocessing.cpu_count()` with a constant `1` in the file `setup.py`:
Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
Contributor

Suggested change
-Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
+Building IPEX requires a lot of memory. To reduce this requirement, we can set the `MAX_JOBS` environment variable to lower the number of parallel processes:
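As an aside, a minimal sketch of how `MAX_JOBS` might be used when building IPEX from source; the exact invocation is an assumption and is not taken from this PR's docs changes:

```bash
# Hypothetical IPEX source build limited to four parallel compile jobs;
# assumes the IPEX repository has already been cloned and its Python
# build prerequisites are installed.
cd intel-extension-for-pytorch
MAX_JOBS=4 python setup.py install
```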


IPEX expects that the `blas-devel` library package be installed:
```bash
yum install blas-devel.x86_64
```
Contributor

This needs some due diligence. The library that gets installed here is libblas.so.3. Does IPEX end up requiring that at runtime? If so then that's a problem as we cannot guarantee that it will be present on end user systems.

Or, if IPEX doesn't need to link libblas at runtime, why is it needed at build time? Are there environment variables that can be set to tell it not to use this?


Install it using mostly the default options, _except_ that on the "Install Options" dialog you should check "Add CMake to the system PATH for all users".

### zlib

Whilst it is possible to download a pre-built version of `zlib1.dll`, for consistency we want one that links against the Visual Studio 2019 C runtime library. Therefore it is necessary to build zlib from source.

-Download the source code from <http://zlib.net/> - the file is called `zlib1212.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.12`.
+Download the source code from <http://zlib.net/> - the file is called `zlib1213.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.13`.
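For reference, one way to then build it on Windows is with the MSVC makefile that ships in the zlib source tree; this is a sketch only, and the exact commands in the ml-cpp build docs may differ:

```
REM Hypothetical build from a "x64 Native Tools Command Prompt for VS 2019"
cd C:\tools\zlib-1.2.13
nmake -f win32\Makefile.msc
```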
Contributor

This means `3rd_party/licenses/zlib-INFO.csv` needs updating too.

@@ -19,7 +19,7 @@ MAINTAINER David Roberts <dave.roberts@elastic.co>
# libffi is required for building Python
RUN \
rm /var/lib/rpm/__db.* && \
-yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel
+yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel blas-devel.x86_64
Contributor

This list is no longer in alphabetical order.

But also, please investigate my other comment about whether this is really needed, if so why, and whether it causes a runtime complication.

@droberts195 changed the title from "Upgrade to Pytorch 2.1.0" to "Upgrade to Pytorch 2.1.0 and zlib 1.2.13" on Oct 25, 2023

@edsavage changed the title from "Upgrade to Pytorch 2.1.0 and zlib 1.2.13" to "[ML] Upgrade to Pytorch 2.1.0 and zlib 1.2.13" on Oct 26, 2023
```bash
yum install blas-devel.x86_64
```
The link stage of the IPEX third-party library dependency `LIBXSMM` depends on `libblas`. This dependency can be removed by setting `BLAS=0` in the environment.
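A minimal sketch of how the build could be invoked with this setting; the exact invocation is an assumption and not a confirmed part of the ml-cpp build docs:

```bash
# Hypothetical: build IPEX with the LIBXSMM BLAS link dependency disabled
cd intel-extension-for-pytorch
BLAS=0 python setup.py install
```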
Contributor

Please can you confirm with `ldd` that the `BLAS=0` workaround works. Also check if there are any other niche OS library dependencies of either IPEX or PyTorch 2.1.

I am a bit worried about the side effects of setting `BLAS=0` for LIBXSMM. It might mean that certain operations of LIBXSMM simply don't work. We'd only detect this at runtime, when a model that uses IPEX needs to do one of these operations. Our PR build tests are nowhere near up to the job of detecting subtle problems like this.

For 8.12 this doesn't actually matter, as we're not using IPEX. However, it is an extra piece of work to do before we ever use it in the future.

This line in the LIBXSMM Makefile makes me wonder if it's possible to force the BLAS to always be MKL, and then we could set the build up like that: https://github.com/libxsmm/libxsmm/blob/54b8bb32042d204ff6fe550fa70d961f22c0b99e/Makefile#L246

Please can you add a TODO in this file to revisit BLAS=0 before actually using IPEX in production, and also open an issue to investigate the best way to build IPEX that's both portable to all supported Linux distributions and means that IPEX won't have any missing areas of functionality that cause runtime errors.

But don't try and solve this as part of this PR.

Contributor Author

The dynamic dependencies of `libintel-ext-pt-cpu.so` when built with `BLAS=0` are:

[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd build/Release/packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so 
	linux-vdso.so.1 =>  (0x00007ffd41ffa000)
	libtorch_cpu.so => /usr/local/gcc103/lib/libtorch_cpu.so (0x00007f47886aa000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f4796ed8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f478848e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f478828a000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f4787e06000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f4787b04000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f47878c1000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f47876a8000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f47872da000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4796d75000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f47870d2000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f47863b2000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f4784652000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f4780011000)

Similarly for `libtorch_cpu.so`:

[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd /usr/local/gcc103/lib/libtorch_cpu.so 
	linux-vdso.so.1 =>  (0x00007ffe2d7c4000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f752d277000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f7522507000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f75222ee000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f75220ea000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f75213ca000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f751f66a000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f751b029000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f751ae0d000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f751ab0b000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f751a8c8000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f751a444000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f751a076000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f752d113000)

From my understanding of the libxsmm documentation:

> Similarly, an application is free to choose any BLAS or LAPACK library (if the link model available on the OS supports this), and therefore linking GEMM routines when linking LIBXSMM itself (by supplying BLAS=1|2) may prevent a user from making this decision at the time of linking the actual application. To use LIBXSMM without GEMM-related functionality, any BLAS-dependency can be removed in two ways: (1) building a special library with make BLAS=0, or (2) linking the application against the libxsmmnoblas library. If an application however uses BLAS already, the [Call Wrapper](https://libxsmm.readthedocs.io/en/latest/libxsmm_mm/#call-wrapper) can be used to intercept existing BLAS calls (and to rely on LIBXSMM instead).

building with `BLAS=0` is recommended (although if that's the case, it's odd that it isn't the default).

As you suggest, I think further investigation is needed at a later stage.

Contributor

Those dynamic dependencies you've listed look good.

The BLAS documentation for LIBXSMM isn't 100% clear to me, and on top of that there's the extra complexity of when/how IPEX uses LIBXSMM.

Thanks for opening #2589. Nothing more is needed for this PR though.

edsavage (Contributor Author)

buildkite build this please and thank you

edsavage (Contributor Author)

buildkite build this

droberts195 (Contributor)

Looks good, but we need to run the QA suite against these changes before merging to main, as main could get promoted into serverless production immediately and we don't want to find out about problems only after that has happened.

I think some manual compatibility testing with the current version of Eland should also be done before merging this. So build an Elasticsearch locally that incorporates a C++ build from this PR branch, then use the current version of Eland (which uses PyTorch 1.13.1) to download a model from HuggingFace, install it, and start it. Then perform at least one inference against that model. (The latest main branch snapshot of Kibana should be able to connect to the locally built Elasticsearch for this testing.)

Finally, with that same local install of Kibana and Elasticsearch, check that you can download and install ELSER v2 and do an inference using that.

edsavage (Contributor Author) commented Nov 6, 2023

> manual compatibility testing

Sanity checks were performed using elastic/distilbert-base-uncased-finetuned-conll03-english for named entity recognition and typeform/distilbert-base-uncased-mnli for zero shot classification. Both of these models were imported from huggingface via the eland_import_hub_model script and used to run inference jobs. No issues were encountered in either of these tasks.
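For reference, a hedged sketch of how such an import is typically run with eland; the URL and credentials below are placeholders, and the exact flags used for this testing are an assumption:

```bash
# Hypothetical import of the NER model into a local cluster
eland_import_hub_model \
  --url https://elastic:changeme@localhost:9200 \
  --hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
  --task-type ner \
  --start
```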

Likewise, no issues were encountered downloading ELSER v2 and testing against it.

edsavage marked this pull request as ready for review on November 15, 2023, 09:45
droberts195 (Contributor) commented Nov 15, 2023

TODO: Set `QAF_TESTS_TO_RUN` to the appropriate value in `.buildkite/pipelines/run_qa_tests.yml.sh` and then trigger the QA framework CI job.

droberts195 (Contributor) commented Dec 14, 2023

The crashes we've observed with the Intel-optimised variant of ELSER are caused by IPEX. I have opened intel/intel-extension-for-pytorch#484 for this. Depending on the outcome of that we will either have to upgrade to a patched version of IPEX in this PR, or remove it entirely.

Since this has dragged on so long we should also consider rebuilding with PyTorch 2.1.2 when we pick this up again. Apparently that release is imminent: pytorch/pytorch#113962

@droberts195 changed the title from "[ML] Upgrade to Pytorch 2.1.0 and zlib 1.2.13" to "[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13" on Jan 4, 2024
droberts195 (Contributor) left a comment

LGTM if CI passes and if you adjust the release notes before merging. It's probably worth waiting for the current CI run to finish before adjusting the release notes, to expedite smoking out any other problems, then merging first thing tomorrow if all is good.

Comment on lines 46 to 47
* Improve forecasting for time series with step changes. (See {ml-pull}2591[#2591],
issue: {ml-issue}2466[#2466]).
droberts195 (Contributor) commented Jan 9, 2024

Please delete these two lines. They're duplicated about 10 lines below in the 8.11.2 section, and 8.11.2 seems to be correct. (I think this might be caused by one of my attempts to resolve a merge clash - sorry about that.)

edsavage merged commit 089b4d3 into elastic:main on Jan 10, 2024
droberts195 added a commit that referenced this pull request Jan 10, 2024
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Jan 10, 2024
droberts195 pushed a commit that referenced this pull request Jan 13, 2024
edsavage deleted the pytorch_2_1_0 branch on September 19, 2024, 01:29
Successfully merging this pull request may close these issues.

[ML] Upgrade to PyTorch 2.1