gh-90536: Add support for the BOLT post-link binary optimizer by kmod · Pull Request #95908 · python/cpython · GitHub

gh-90536: Add support for the BOLT post-link binary optimizer #95908

Merged 11 commits on Aug 18, 2022
Add support for the BOLT post-link binary optimizer
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a measurable speedup without any code or functionality
changes: roughly 1% on pyperformance and 4% on the Pyston web
macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6
sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach and adds non-PIE
flags, which enable much better optimizations from bolt.

The effects of this change depend more on CPU microarchitecture
than those of most changes, since it optimizes i-cache behavior, which seems
to vary more between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU. On an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior, and the benchmarks in the pyperformance
suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.
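
Since `--enable-bolt` is gated on the toolchain, it can help to check for the two required tools before configuring. The following is a minimal sketch, not part of the PR: `missing_tools` is a hypothetical helper, and it only searches `PATH`, whereas the configure check in this PR also looks in `${llvm_path}`.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of the PR).
# Prints the names of the given tools that cannot be found on PATH,
# separated by spaces; prints an empty line if all are found.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing:+$missing }$tool"
        fi
    done
    printf '%s\n' "$missing"
}

# The PR's configure check requires both llvm-bolt and merge-fdata.
result=$(missing_tools llvm-bolt merge-fdata)
if [ -n "$result" ]; then
    echo "not found on PATH: $result"
else
    echo "llvm-bolt and merge-fdata found on PATH"
fi
```

If both tools are reported as found, `./configure --enable-bolt` should at least get past its tool lookup; whether the resulting build works still depends on the toolchain caveats in the commit message.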
kmod committed Aug 11, 2022
commit ef8e98d08ad70fb3b982ea5d38aa0df65a2013c4
9 changes: 9 additions & 0 deletions Makefile.pre.in
@@ -640,6 +640,15 @@ profile-opt: profile-run-stamp
	-rm -f profile-clean-stamp
	$(MAKE) @DEF_MAKE_RULE@ CFLAGS_NODIST="$(CFLAGS_NODIST) $(PGO_PROF_USE_FLAG)" LDFLAGS_NODIST="$(LDFLAGS_NODIST)"

bolt-opt: @PREBOLT_RULE@
	rm -f *.fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
Suggested change
- @LLVM_BOLT@ $(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst
+ @LLVM_BOLT@ ./$(BUILDPYTHON) -instrument -instrumentation-file-append-pid -instrumentation-file=$(abspath $(BUILDPYTHON).bolt) -o $(BUILDPYTHON).bolt_inst

	./$(BUILDPYTHON).bolt_inst $(PROFILE_TASK) || true
	@MERGE_FDATA@ $(BUILDPYTHON).*.fdata > $(BUILDPYTHON).fdata
	@LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
Suggested change
- @LLVM_BOLT@ $(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot
+ @LLVM_BOLT@ ./$(BUILDPYTHON) -o $(BUILDPYTHON).bolt -data=$(BUILDPYTHON).fdata -update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot

	rm -f *.fdata
	mv $(BUILDPYTHON).bolt $(BUILDPYTHON)

# Compile and run with gcov
.PHONY=coverage coverage-lcov coverage-report
coverage:
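
The `bolt-opt` recipe in this hunk runs five steps in order: instrument, train, merge profiles, optimize, and replace the binary. The dry-run sketch below prints those steps rather than executing them, so the sequence can be read in isolation. `bolt_opt_steps` and its arguments are hypothetical stand-ins for `@LLVM_BOLT@`, `@MERGE_FDATA@`, `$(BUILDPYTHON)`, and `$(PROFILE_TASK)`; `$PWD` approximates Make's `$(abspath ...)`; the optimization flags are copied from the recipe.

```shell
#!/bin/sh
# Optimization flags copied verbatim from the bolt-opt recipe in this PR.
OPT_FLAGS="-update-debug-sections -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -icf=1 -inline-all -split-eh -reorder-functions-use-hot-size -peepholes=all -jump-tables=aggressive -inline-ap -indirect-call-promotion=all -dyno-stats -use-gnu-stack -frame-opt=hot"

# Hypothetical dry-run helper: prints one line per recipe step.
# Arguments: <llvm-bolt> <merge-fdata> <interpreter binary> <training task>
bolt_opt_steps() {
    bolt=$1; merge=$2; py=$3; task=$4
    # 1. Write an instrumented copy of the interpreter.
    echo "$bolt ./$py -instrument -instrumentation-file-append-pid -instrumentation-file=$PWD/$py.bolt -o $py.bolt_inst"
    # 2. Run the training task; a failing task does not abort the build.
    echo "./$py.bolt_inst $task || true"
    # 3. Merge the per-process profiles into a single file.
    echo "$merge $py.*.fdata > $py.fdata"
    # 4. Re-optimize the original binary using the merged profile.
    echo "$bolt ./$py -o $py.bolt -data=$py.fdata $OPT_FLAGS"
    # 5. Replace the original binary with the optimized one.
    echo "mv $py.bolt $py"
}

# "-m test --pgo" is the training task named in the commit message.
bolt_opt_steps llvm-bolt merge-fdata python "-m test --pgo"
```

The `-instrumentation-file-append-pid` flag matters because the training task spawns subprocesses; the recipe's `$(BUILDPYTHON).*.fdata` glob in step 3 then collects one profile per process.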
2 changes: 2 additions & 0 deletions Misc/no-pie-compile.specs
@@ -0,0 +1,2 @@
*self_spec:
+ %{!r:%{!fpie:%{!fPIE:%{!fpic:%{!fPIC:%{!fno-pic:-fno-PIE}}}}}}
2 changes: 2 additions & 0 deletions Misc/no-pie-link.specs
@@ -0,0 +1,2 @@
*self_spec:
+ %{!shared:%{!r:%{!fPIE:%{!pie:-fno-PIE -no-pie}}}}
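
The two specs files above are nested `%{!flag:...}` conditionals: the innermost text is appended only when none of the listed switches is on the command line. The sketch below models that logic as plain shell so the conditions are easier to read. It is a simplified model, not how the GCC driver evaluates specs; `has_any_flag`, `compile_spec_extra`, and `link_spec_extra` are hypothetical names.

```shell
#!/bin/sh
# Return 0 if any flag listed in $1 (space-separated) appears among
# the remaining arguments, 1 otherwise.
has_any_flag() {
    wanted=$1; shift
    for arg in "$@"; do
        for flag in $wanted; do
            [ "$arg" = "$flag" ] && return 0
        done
    done
    return 1
}

# Models Misc/no-pie-compile.specs: append -fno-PIE unless the command
# line already carries -r or an explicit PIC/PIE choice.
compile_spec_extra() {
    has_any_flag "-r -fpie -fPIE -fpic -fPIC -fno-pic" "$@" || echo "-fno-PIE"
}

# Models Misc/no-pie-link.specs: append -fno-PIE -no-pie unless the
# link is -shared, -r, or explicitly PIE.
link_spec_extra() {
    has_any_flag "-shared -r -fPIE -pie" "$@" || echo "-fno-PIE -no-pie"
}

echo "compile -O2       adds: [$(compile_spec_extra -O2)]"
echo "compile -fPIC     adds: [$(compile_spec_extra -fPIC)]"
echo "link (executable) adds: [$(link_spec_extra -O2)]"
echo "link -shared      adds: [$(link_spec_extra -shared)]"
```

Read this way, the specs leave shared-library and explicitly position-independent builds alone, and only switch the default for everything else to non-PIE.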
259 changes: 259 additions & 0 deletions configure

Some generated files are not rendered by default.

51 changes: 51 additions & 0 deletions configure.ac
@@ -1881,6 +1881,57 @@ if test "$Py_LTO" = 'true' ; then
LDFLAGS_NODIST="$LDFLAGS_NODIST $LTOFLAGS"
fi

# Enable bolt flags
Py_BOLT='false'
AC_MSG_CHECKING(for --enable-bolt)
AC_ARG_ENABLE(bolt, AS_HELP_STRING(
                [--enable-bolt],
                [enable usage of the llvm-bolt post-link optimizer (default is no)]),
[
if test "$enableval" != no
then
  Py_BOLT='true'
  AC_MSG_RESULT(yes);
else
  Py_BOLT='false'
  AC_MSG_RESULT(no);
fi],
[AC_MSG_RESULT(no)])

AC_SUBST(PREBOLT_RULE)
if test "$Py_BOLT" = 'true' ; then
  PREBOLT_RULE="${DEF_MAKE_ALL_RULE}"
  DEF_MAKE_ALL_RULE="bolt-opt"
  DEF_MAKE_RULE="build_all"

  # These flags are required for bolt to work:
  CFLAGS_NODIST="$CFLAGS_NODIST -fno-reorder-blocks-and-partition"
  LDFLAGS_NODIST="$LDFLAGS_NODIST -Wl,--emit-relocs"

  # These flags are required to get good performance from bolt:
  CFLAGS_NODIST="$CFLAGS_NODIST -specs=Misc/no-pie-compile.specs"
  LDFLAGS_NODIST="$LDFLAGS_NODIST -specs=Misc/no-pie-link.specs"
  LDFLAGS_NOLTO="$LDFLAGS_NOLTO -specs=Misc/no-pie-link.specs"

  AC_SUBST(LLVM_BOLT)
  AC_PATH_TOOL(LLVM_BOLT, llvm-bolt, '', ${llvm_path})
  if test -n "${LLVM_BOLT}" -a -x "${LLVM_BOLT}"
  then
    AC_MSG_RESULT("Found llvm-bolt")
  else
    AC_MSG_ERROR([llvm-bolt is required for a --enable-bolt build but could not be found.])
  fi

  AC_SUBST(MERGE_FDATA)
  AC_PATH_TOOL(MERGE_FDATA, merge-fdata, '', ${llvm_path})
  if test -n "${MERGE_FDATA}" -a -x "${MERGE_FDATA}"
  then
    AC_MSG_RESULT("Found merge-fdata")
  else
    AC_MSG_ERROR([merge-fdata is required for a --enable-bolt build but could not be found.])
  fi
fi

# Enable PGO flags.
AC_SUBST(PGO_PROF_GEN_FLAG)
AC_SUBST(PGO_PROF_USE_FLAG)
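
The rule swap at the top of the `Py_BOLT` block is what chains BOLT after whatever the build would otherwise have done: the previous top-level rule is saved in `PREBOLT_RULE`, which the Makefile uses as the prerequisite of `bolt-opt`. A minimal sketch of that substitution follows. `apply_bolt_rules` is a hypothetical helper, and the starting values `profile-opt` and `build_all` are assumptions about what earlier parts of configure.ac set; they are not shown in this hunk.

```shell
#!/bin/sh
# Hypothetical helper mirroring the three assignments in the Py_BOLT block.
# $1 is the value DEF_MAKE_ALL_RULE held before the block ran (assumed).
# Prints the resulting Makefile dependency line.
apply_bolt_rules() {
    DEF_MAKE_ALL_RULE=$1
    # Remember the old top-level rule so bolt-opt can depend on it.
    PREBOLT_RULE="${DEF_MAKE_ALL_RULE}"
    DEF_MAKE_ALL_RULE="bolt-opt"
    DEF_MAKE_RULE="build_all"
    echo "$DEF_MAKE_ALL_RULE: $PREBOLT_RULE"
}

# Assumed starting value for a PGO build:
apply_bolt_rules profile-opt
# Assumed starting value for a plain build:
apply_bolt_rules build_all
```

Under those assumptions, BOLT always runs as the last step, on a binary that has already been through the earlier rule.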