[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13 by edsavage · Pull Request #2588 · elastic/ml-cpp · GitHub

[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13 #2588


Merged
edsavage merged 17 commits into elastic:main on Jan 10, 2024

Conversation

edsavage (Contributor) commented Oct 25, 2023

Pytorch 2.1.2 has recently been released. Opening this PR to upgrade to it and potentially flush out any BWC issues via the unit and ES integration tests.

Closes #2587

@@ -1,2 +1,2 @@
name,version,revision,url,license,copyright,sourceURL
-PyTorch,1.11.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
+PyTorch,2.1.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
Contributor

The Git commit (3rd value) should also be changed here.
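For illustration only, the corrected row would carry the commit hash of the PyTorch v2.1.0 tag in that third column; the value below is a placeholder, not the real hash:

```
PyTorch,2.1.0,<commit-hash-of-the-v2.1.0-tag>,https://pytorch.org,BSD-3-Clause,,
```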


This assumes that you have cloned PyTorch in the directory `${PYTORCH_SRC_DIR}` in the step above. Make sure that this path is correct.

Building IPEX requires a lot of memory. To reduce this requirement, we can patch the IPEX build system to lower the number of parallel processes. We use `sed` to replace the call to `multiprocessing.cpu_count()` with a constant `1` in the file `setup.py`:
Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
Contributor

Suggested change
-Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
+Building IPEX requires a lot of memory. To reduce this requirement, we can set the `MAX_JOBS` environment variable to lower the number of parallel processes:
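As an aside, a minimal sketch of how `MAX_JOBS` might be used when building IPEX from source; the exact invocation is an assumption and is not taken from this PR's docs changes:

```bash
# Hypothetical IPEX source build limited to four parallel compile jobs;
# assumes the IPEX repository has already been cloned and its Python
# build prerequisites are installed.
cd intel-extension-for-pytorch
MAX_JOBS=4 python setup.py install
```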


IPEX expects that the `blas-devel` library package be installed:
```bash
yum install blas-devel.x86_64
```
Contributor

This needs some due diligence. The library that gets installed here is libblas.so.3. Does IPEX end up requiring that at runtime? If so then that's a problem as we cannot guarantee that it will be present on end user systems.

Or, if IPEX doesn't need to link libblas at runtime, why is it needed at build time? Are there environment variables that can be set to tell it not to use this?


Install it using mostly the default options, _except_ that on the "Install Options" dialog you should check "Add CMake to the system PATH for all users".

### zlib

Whilst it is possible to download a pre-built version of `zlib1.dll`, for consistency we want one that links against the Visual Studio 2019 C runtime library. Therefore it is necessary to build zlib from source.

-Download the source code from <http://zlib.net/> - the file is called `zlib1212.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.12`.
+Download the source code from <http://zlib.net/> - the file is called `zlib1213.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.13`.
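For reference, one way to then build it on Windows is with the MSVC makefile that ships in the zlib source tree; this is a sketch only, and the exact commands in the ml-cpp build docs may differ:

```
REM Hypothetical build from a "x64 Native Tools Command Prompt for VS 2019"
cd C:\tools\zlib-1.2.13
nmake -f win32\Makefile.msc
```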
Contributor

This means `3rd_party/licenses/zlib-INFO.csv` needs updating too.

@@ -19,7 +19,7 @@ MAINTAINER David Roberts <dave.roberts@elastic.co>
# libffi is required for building Python
RUN \
rm /var/lib/rpm/__db.* && \
-yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel
+yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel blas-devel.x86_64
Contributor

This list is no longer in alphabetical order.

But also, please investigate my other comment about whether this is really needed, if so why, and whether it causes a runtime complication.

@droberts195 changed the title from "Upgrade to Pytorch 2.1.0" to "Upgrade to Pytorch 2.1.0 and zlib 1.2.13" on Oct 25, 2023

@edsavage changed the title from "Upgrade to Pytorch 2.1.0 and zlib 1.2.13" to "[ML] Upgrade to Pytorch 2.1.0 and zlib 1.2.13" on Oct 26, 2023
```bash
yum install blas-devel.x86_64
```
The link stage of the IPEX third-party library dependency `LIBXSMM` depends on `libblas`. This dependency can be removed by setting `BLAS=0` in the environment.
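A minimal sketch of how the build could be invoked with this setting; the exact invocation is an assumption and not a confirmed part of the ml-cpp build docs:

```bash
# Hypothetical: build IPEX with the LIBXSMM BLAS link dependency disabled
cd intel-extension-for-pytorch
BLAS=0 python setup.py install
```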
Contributor

Please can you confirm with `ldd` that the `BLAS=0` workaround works. Also check if there are any other niche OS library dependencies of either IPEX or PyTorch 2.1.

I am a bit worried about the side effects of setting `BLAS=0` for LIBXSMM. It might mean that certain operations of LIBXSMM simply don't work. We'd only detect this at runtime, when a model that uses IPEX needs to do one of these operations. Our PR build tests are nowhere near up to the job of detecting subtle problems like this.

For 8.12 this doesn't actually matter, as we're not using IPEX. However, it is an extra piece of work to do before we ever use it in the future.

This line in the LIBXSMM Makefile makes me wonder if it's possible to force the BLAS to always be MKL, and then we could set the build up like that: https://github.com/libxsmm/libxsmm/blob/54b8bb32042d204ff6fe550fa70d961f22c0b99e/Makefile#L246

Please can you add a TODO in this file to revisit BLAS=0 before actually using IPEX in production, and also open an issue to investigate the best way to build IPEX that's both portable to all supported Linux distributions and means that IPEX won't have any missing areas of functionality that cause runtime errors.

But don't try and solve this as part of this PR.

Contributor Author

The dynamic dependencies of `libintel-ext-pt-cpu.so` when built with `BLAS=0` are:

[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd build/Release/packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so 
	linux-vdso.so.1 =>  (0x00007ffd41ffa000)
	libtorch_cpu.so => /usr/local/gcc103/lib/libtorch_cpu.so (0x00007f47886aa000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f4796ed8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f478848e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f478828a000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f4787e06000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f4787b04000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f47878c1000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f47876a8000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f47872da000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4796d75000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f47870d2000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f47863b2000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f4784652000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f4780011000)

Similarly for `libtorch_cpu.so`:

[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd /usr/local/gcc103/lib/libtorch_cpu.so 
	linux-vdso.so.1 =>  (0x00007ffe2d7c4000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f752d277000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f7522507000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f75222ee000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f75220ea000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f75213ca000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f751f66a000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f751b029000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f751ae0d000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f751ab0b000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f751a8c8000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f751a444000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f751a076000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f752d113000)

From my understanding of the libxsmm documentation:

> Similarly, an application is free to choose any BLAS or LAPACK library (if the link model available on the OS supports this), and therefore linking GEMM routines when linking LIBXSMM itself (by supplying BLAS=1|2) may prevent a user from making this decision at the time of linking the actual application. To use LIBXSMM without GEMM-related functionality, any BLAS-dependency can be removed in two ways: (1) building a special library with make BLAS=0, or (2) linking the application against the libxsmmnoblas library. If an application however uses BLAS already, the [Call Wrapper](https://libxsmm.readthedocs.io/en/latest/libxsmm_mm/#call-wrapper) can be used to intercept existing BLAS calls (and to rely on LIBXSMM instead).

building with `BLAS=0` is recommended (although if that's the case, it's odd that it isn't the default).

As you suggest, I think further investigation is needed at a later stage.

Contributor

Those dynamic dependencies you've listed look good.

The BLAS documentation for LIBXSMM isn't 100% clear to me, and on top of that there's the extra complexity of when/how IPEX uses LIBXSMM.

Thanks for opening #2589. Nothing more is needed for this PR though.

edsavage (Contributor Author)

buildkite build this please and thank you

edsavage (Contributor Author)

buildkite build this

droberts195 (Contributor)

Looks good, but we need to run the QA suite against these changes before merging to main, as main could get promoted into serverless production immediately and we don't want to find out about problems only after that has happened.

I think some manual compatibility testing with the current version of Eland should also be done before merging this. So build an Elasticsearch locally that incorporates a C++ build from this PR branch, then use the current version of Eland (which uses PyTorch 1.13.1) to download a model from HuggingFace, install it, and start it. Then perform at least one inference against that model. (The latest main branch snapshot of Kibana should be able to connect to the locally built Elasticsearch for this testing.)

Finally, with that same local install of Kibana and Elasticsearch, check that you can download and install ELSER v2 and do an inference using that.

edsavage (Contributor Author) commented Nov 6, 2023

> manual compatibility testing

Sanity checks were performed using elastic/distilbert-base-uncased-finetuned-conll03-english for named entity recognition and typeform/distilbert-base-uncased-mnli for zero shot classification. Both of these models were imported from huggingface via the eland_import_hub_model script and used to run inference jobs. No issues were encountered in either of these tasks.
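For reference, a hedged sketch of how such an import is typically run with eland; the URL and credentials below are placeholders, and the exact flags used for this testing are an assumption:

```bash
# Hypothetical import of the NER model into a local cluster
eland_import_hub_model \
  --url https://elastic:changeme@localhost:9200 \
  --hub-model-id elastic/distilbert-base-uncased-finetuned-conll03-english \
  --task-type ner \
  --start
```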

Likewise, no issues were encountered downloading ELSER v2 and testing against it.

edsavage marked this pull request as ready for review on November 15, 2023, 09:45
droberts195 (Contributor) commented Nov 15, 2023

TODO: Set `QAF_TESTS_TO_RUN` to the appropriate value in `.buildkite/pipelines/run_qa_tests.yml.sh` and then trigger the QA framework CI job.

droberts195 (Contributor) commented Dec 14, 2023

The crashes we've observed with the Intel-optimised variant of ELSER are caused by IPEX. I have opened intel/intel-extension-for-pytorch#484 for this. Depending on the outcome of that we will either have to upgrade to a patched version of IPEX in this PR, or remove it entirely.

Since this has dragged on so long we should also consider rebuilding with PyTorch 2.1.2 when we pick this up again. Apparently that release is imminent: pytorch/pytorch#113962

@droberts195 changed the title from "[ML] Upgrade to Pytorch 2.1.0 and zlib 1.2.13" to "[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13" on Jan 4, 2024
droberts195 (Contributor) left a comment

LGTM if CI passes and if you adjust the release notes before merging. It's probably worth waiting for the current CI run to finish before adjusting the release notes, to expedite smoking out any other problems, then merging first thing tomorrow if all is good.

Comment on lines 46 to 47
* Improve forecasting for time series with step changes. (See {ml-pull}2591[#2591],
issue: {ml-issue}2466[#2466]).
droberts195 (Contributor) commented Jan 9, 2024

Please delete these two lines. They're duplicated about 10 lines below in the 8.11.2 section, and 8.11.2 seems to be correct. (I think this might be caused by one of my attempts to resolve a merge clash - sorry about that.)

edsavage merged commit 089b4d3 into elastic:main on Jan 10, 2024
droberts195 added a commit that referenced this pull request Jan 10, 2024
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Jan 10, 2024
droberts195 pushed a commit that referenced this pull request Jan 13, 2024
edsavage deleted the pytorch_2_1_0 branch on September 19, 2024, 01:29
Successfully merging this pull request may close these issues.

[ML] Upgrade to PyTorch 2.1