[ML] Upgrade to Pytorch 2.1.2 and zlib 1.2.13 #2588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Upgrade to the latest version of PyTorch, which is 2.1.0
3rd_party/licenses/pytorch-INFO.csv
Outdated
```diff
@@ -1,2 +1,2 @@
 name,version,revision,url,license,copyright,sourceURL
-PyTorch,1.11.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
+PyTorch,2.1.0,bc2c6edaf163b1a1330e37a6e34caf8c553e4755,https://pytorch.org,BSD-3-Clause,,
```
The Git commit (3rd value) should also be changed here.
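As a sketch of how that revision value can be refreshed: the CSV's third column is the commit hash of the release tag, which `git rev-parse` can resolve. The throwaway repository below is purely illustrative so the snippet is self-contained; against a real PyTorch clone the equivalent would be `git -C "${PYTORCH_SRC_DIR}" rev-parse 'v2.1.0^{commit}'`.

```shell
# Illustrative only: resolve a release tag to its commit hash with git rev-parse.
tmp=$(mktemp -d)
git -C "$tmp" init -q
git -C "$tmp" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "pretend release"
git -C "$tmp" tag v2.1.0
# The ^{commit} suffix peels the tag to the commit it points at.
hash=$(git -C "$tmp" rev-parse 'v2.1.0^{commit}')
echo "$hash"    # 40-character hash suitable for the CSV's revision column
rm -rf "$tmp"
```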
build-setup/linux.md
Outdated
````diff
 ```
 
 This assumes that you have cloned PyTorch in the directory `${PYTORCH_SRC_DIR}` in the step above. Make sure that this path is correct.
 
-Building IPEX requires a lot of memory. To reduce this requirement, we can patch the IPEX build system to lower the number of parallel processes. We use `sed` to replace the call to `multiprocessing.cpu_count()` with a constant `1` in the file `setup.py`:
+Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
````
Suggested change:
```diff
-Building IPEX requires a lot of memory. To reduce this requirement, we can set the MAX_JOBS environment variable to lower the number of parallel processes:
+Building IPEX requires a lot of memory. To reduce this requirement, we can set the `MAX_JOBS` environment variable to lower the number of parallel processes:
```
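A minimal sketch of the environment-variable approach (the job count is an example value, not a project requirement, and the actual `setup.py` invocation is shown commented out since it needs an IPEX checkout):

```shell
# Cap the number of parallel compile jobs honoured by the IPEX build.
export MAX_JOBS=2
echo "IPEX build limited to ${MAX_JOBS} parallel jobs"
# python setup.py bdist_wheel   # the real build step, not run in this sketch
```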
build-setup/linux.md
Outdated
````
 
 IPEX expects that the `blas-devel` library package be installed:
 ```bash
 yum install blas-devel.x86_64
````
This needs some due diligence. The library that gets installed here is `libblas.so.3`. Does IPEX end up requiring that at runtime? If so then that's a problem, as we cannot guarantee that it will be present on end user systems.
Or, if IPEX doesn't need to link `libblas` at runtime, why is it needed at build time? Are there environment variables that can be set to say not to use this?
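One way to answer the runtime question is to inspect the built artefact with `ldd`. A sketch follows; the target here is a stand-in binary so the snippet runs anywhere, whereas the real check would point at `libintel-ext-pt-cpu.so` or `libtorch_cpu.so` from the build tree:

```shell
# Check whether a binary/shared object declares a runtime dependency on libblas.
TARGET=$(command -v ls)   # stand-in; substitute the IPEX/PyTorch library path
if ldd "$TARGET" | grep -q 'libblas'; then
    echo "runtime dependency on libblas"
else
    echo "no runtime dependency on libblas"
fi
```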
```diff
 
 Install it mainly using the default options _except_ on the "Install Options" dialog check "Add CMake to the system PATH for all users".
 
 ### zlib
 
 Whilst it is possible to download a pre-built version of `zlib1.dll`, for consistency we want one that links against the Visual Studio 2019 C runtime library. Therefore it is necessary to build zlib from source.
 
-Download the source code from <http://zlib.net/> - the file is called `zlib1212.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.12`.
+Download the source code from <http://zlib.net/> - the file is called `zlib1213.zip`. Unzip this file under `C:\tools`, so that you end up with a directory called `C:\tools\zlib-1.2.13`.
```
This means `3rd_party/licenses/zlib-INFO.csv` needs updating too.
```diff
@@ -19,7 +19,7 @@ MAINTAINER David Roberts <dave.roberts@elastic.co>
 # libffi is required for building Python
 RUN \
   rm /var/lib/rpm/__db.* && \
-  yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel
+  yum install -y bzip2 gcc gcc-c++ git libffi-devel make texinfo unzip wget which xz zip zlib-devel blas-devel.x86_64
```
This list is no longer in alphabetical order.
But also, please investigate my other comment about whether this is really needed, if so why, and is it causing a runtime complication.
We usually release-note major library upgrades like this, so please add a release note like https://github.com/elastic/ml-cpp/pull/2430/files#diff-6eabb97e98640429cb68b61823f99438d287d604d72ecf0e93ed2e9ecbec7479R35 and https://github.com/elastic/ml-cpp/pull/2253/files#diff-6eabb97e98640429cb68b61823f99438d287d604d72ecf0e93ed2e9ecbec7479R37.
build-setup/linux.md
Outdated
````
 ```bash
 yum install blas-devel.x86_64
 ```
 The IPEX third party library dependency `LIBXSMM` link stage has a dependency on `libblas`. This dependency can be removed by setting `BLAS=0` in the environment.
````
Please can you confirm with `ldd` that the `BLAS=0` workaround works. Also check if there are any other niche OS library dependencies of either IPEX or PyTorch 2.1.

I am a bit worried about the side effects of setting `BLAS=0` for `LIBXSMM`. It might mean that certain operations of `LIBXSMM` simply don't work. We'd only detect this at runtime when a model that uses IPEX needs to do one of these operations. Our PR build tests are nowhere near up to the job of detecting subtle problems like this.

For 8.12 this doesn't actually matter, as we're not using IPEX. However, this is an extra piece of work to do before we ever use it in the future.

This line in the `LIBXSMM` `Makefile` makes me wonder if it's possible to force the BLAS to always be MKL, and then we could set the build up like that: https://github.com/libxsmm/libxsmm/blob/54b8bb32042d204ff6fe550fa70d961f22c0b99e/Makefile#L246

Please can you add a TODO in this file to revisit `BLAS=0` before actually using IPEX in production, and also open an issue to investigate the best way to build IPEX that's both portable to all supported Linux distributions and means that IPEX won't have any missing areas of functionality that cause runtime errors.

But don't try and solve this as part of this PR.
The dynamic dependencies of `libintel-ext-pt-cpu.so` when built with `BLAS=0` are:

```
[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd build/Release/packages/intel_extension_for_pytorch/lib/libintel-ext-pt-cpu.so
	linux-vdso.so.1 => (0x00007ffd41ffa000)
	libtorch_cpu.so => /usr/local/gcc103/lib/libtorch_cpu.so (0x00007f47886aa000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f4796ed8000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f478848e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f478828a000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f4787e06000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f4787b04000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f47878c1000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f47876a8000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f47872da000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4796d75000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f47870d2000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f47863b2000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f4784652000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f4780011000)
```

Similarly for `libtorch_cpu.so`:

```
[root@6c8e9918c9aa intel-extension-for-pytorch]# ldd /usr/local/gcc103/lib/libtorch_cpu.so
	linux-vdso.so.1 => (0x00007ffe2d7c4000)
	libc10.so => /usr/local/gcc103/lib/libc10.so (0x00007f752d277000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f7522507000)
	libgcc_s.so.1 => /usr/local/gcc103/lib64/libgcc_s.so.1 (0x00007f75222ee000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f75220ea000)
	libmkl_intel_lp64.so => /usr/local/gcc103/lib/libmkl_intel_lp64.so (0x00007f75213ca000)
	libmkl_gnu_thread.so => /usr/local/gcc103/lib/libmkl_gnu_thread.so (0x00007f751f66a000)
	libmkl_core.so => /usr/local/gcc103/lib/libmkl_core.so (0x00007f751b029000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f751ae0d000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f751ab0b000)
	libgomp.so.1 => /usr/local/gcc103/lib64/libgomp.so.1 (0x00007f751a8c8000)
	libstdc++.so.6 => /usr/local/gcc103/lib64/libstdc++.so.6 (0x00007f751a444000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f751a076000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f752d113000)
```
From my understanding of the `libxsmm` documentation:

> Similarly, an application is free to choose any BLAS or LAPACK library (if the link model available on the OS supports this), and therefore linking GEMM routines when linking LIBXSMM itself (by supplying BLAS=1|2) may prevent a user from making this decision at the time of linking the actual application. To use LIBXSMM without GEMM-related functionality, any BLAS-dependency can be removed in two ways: (1) building a special library with make BLAS=0, or (2) linking the application against the libxsmmnoblas library. If an application however uses BLAS already, the [Call Wrapper](https://libxsmm.readthedocs.io/en/latest/libxsmm_mm/#call-wrapper) can be used to intercept existing BLAS calls (and to rely on LIBXSMM instead).

building with `BLAS=0` is recommended (although if that's the case it's odd that it isn't the default).
As you suggest I think further investigation at a later stage is needed.
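As a sketch, the two de-BLAS routes the quoted documentation describes (the link command is illustrative; this PR takes option 1 via the environment):

```shell
# Option 1 (taken here): build with BLAS=0 so LIBXSMM drops its GEMM fallback
# and hence the libblas link dependency.
export BLAS=0
# Option 2 (alternative, not taken): link the final application against the
# no-BLAS stub library instead, e.g.:  gcc app.o -lxsmmnoblas -lxsmm
echo "LIBXSMM/IPEX build configured with BLAS=${BLAS}"
```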
Those dynamic dependencies you've listed look good.
The BLAS documentation for `LIBXSMM` isn't 100% clear to me, and then there's the extra complexity on top of when/how IPEX uses `LIBXSMM`.
Thanks for opening #2589. Nothing more is needed for this PR though.
buildkite build this please and thank you
buildkite build this
Looks good, but we need to run the QA suite against these changes before merging. I think some manual compatibility testing with the current version of Eland should also be done before merging this. So build an Elasticsearch locally that incorporates a C++ build from this PR branch, then use the current version of Eland (which uses PyTorch 1.13.1) to download a model from HuggingFace, install it, and start it. Then perform at least one inference against that model. (The latest main branch snapshot of Kibana should be able to connect to the locally built Elasticsearch for this testing.) Finally, with that same local install of Kibana and Elasticsearch, check that you can download and install ELSER v2 and do an inference using that.
Sanity checks were performed using the procedure described above. Likewise, no issues were encountered downloading ELSER v2 and testing against it.
TODO: Set
The crashes we've observed with the Intel-optimised variant of ELSER are caused by IPEX. I have opened intel/intel-extension-for-pytorch#484 for this. Depending on the outcome of that we will either have to upgrade to a patched version of IPEX in this PR, or remove it entirely. Since this has dragged on so long we should also consider rebuilding with PyTorch 2.1.2 when we pick this up again. Apparently that release is imminent: pytorch/pytorch#113962
LGTM if CI passes and if you could just adjust the release notes before merging - it's probably worth waiting for the current CI to finish before adjusting the release notes to expedite smoking out any other problems, then merge first thing tomorrow if all is good.
docs/CHANGELOG.asciidoc
Outdated
```
* Improve forecasting for time series with step changes. (See {ml-pull}#2591[2591],
  issue: {ml-issue}2466[#2466]).
```
Please delete these two lines. They're duplicated about 10 lines below in the 8.11.2 section, and 8.11.2 seems to be correct. (I think this might be caused by one of my attempts to resolve a merge clash - sorry about that.)
This reverts commit 089b4d3.
Pytorch 2.1.2 has recently been released. Opening this PR to upgrade to it and potentially flush out any BWC issues via the unit and ES integration tests.
Closes #2587