gh-126703: Add freelists for small size lists by eendebakpt · Pull Request #129921 · python/cpython · GitHub

Status: Open. Wants to merge 17 commits into base branch main.
Conversation

@eendebakpt (Contributor) commented Feb 9, 2025:

We add freelists for all lists of small length. The advantage over the existing freelist for lists is that we do not need to allocate and deallocate the buffer pointed to by the ob_item field.
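
The idea can be modeled in pure Python as a size-binned freelist (this is only an illustrative sketch, not CPython's actual C implementation; all names such as `SmallListPool` and `MAX_SMALL_SIZE` are hypothetical): one bin per small capacity, so a recycled list keeps its item buffer instead of freeing it.

```python
# Hypothetical pure-Python model of a size-binned freelist. A buffer
# returned to the pool is reused for the next request of the same size,
# avoiding a fresh allocation of the item storage.

MAX_SMALL_SIZE = 5  # assumed cut-off for "small" lists


class SmallListPool:
    """Size-binned freelist: bin k holds buffers with exactly k slots."""

    def __init__(self):
        self._bins = {k: [] for k in range(1, MAX_SMALL_SIZE + 1)}

    def acquire(self, size):
        """Return a buffer of `size` slots, reusing a pooled one if possible."""
        bin_ = self._bins.get(size)
        if bin_:
            return bin_.pop()      # reuse: no fresh buffer allocation
        return [None] * size       # miss: allocate a new buffer

    def release(self, buf):
        """Put a buffer back in its bin; sizes outside the range are dropped."""
        bin_ = self._bins.get(len(buf))
        if bin_ is not None:
            bin_.append(buf)


pool = SmallListPool()
a = pool.acquire(2)
pool.release(a)
assert pool.acquire(2) is a  # the 2-slot buffer was reused, ob_item and all
```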

Microbenchmarks show good results:

bench_list_create: Mean +- std dev: [main_list_v3] 51.8 ns +- 0.6 ns -> [pr_list_v3] 39.0 ns +- 2.1 ns: 1.33x faster
bench_list_create_const: Mean +- std dev: [main_list_v3] 51.6 ns +- 1.5 ns -> [pr_list_v3] 39.2 ns +- 1.7 ns: 1.31x faster
bench_list_from_tuple: Mean +- std dev: [main_list_v3] 77.5 ns +- 1.3 ns -> [pr_list_v3] 78.4 ns +- 2.2 ns: 1.01x slower
bench_list_copy: Mean +- std dev: [main_list_v3] 51.0 ns +- 1.2 ns -> [pr_list_v3] 37.7 ns +- 1.0 ns: 1.35x faster
bench_list_repeat: Mean +- std dev: [main_list_v3] 74.4 ns +- 2.8 ns -> [pr_list_v3] 57.4 ns +- 1.0 ns: 1.30x faster

Geometric mean: 1.25x faster

(Linux, PGO, tail calling interpreter). Pyperformance results are a 1% improvement, but this might be due to my unstable machine.

Pyperformance result
All benchmarks:
===============

2to3: Mean +- std dev: [main] 318 ms +- 3 ms -> [pr] 321 ms +- 4 ms: 1.01x slower
async_generators: Mean +- std dev: [main] 505 ms +- 17 ms -> [pr] 481 ms +- 7 ms: 1.05x faster
asyncio_websockets: Mean +- std dev: [main] 638 ms +- 12 ms -> [pr] 658 ms +- 15 ms: 1.03x slower
chaos: Mean +- std dev: [main] 75.3 ms +- 2.1 ms -> [pr] 74.2 ms +- 2.2 ms: 1.01x faster
comprehensions: Mean +- std dev: [main] 21.2 us +- 0.7 us -> [pr] 20.8 us +- 0.2 us: 1.02x faster
coverage: Mean +- std dev: [main] 268 ms +- 7 ms -> [pr] 275 ms +- 10 ms: 1.03x slower
crypto_pyaes: Mean +- std dev: [main] 83.7 ms +- 2.4 ms -> [pr] 82.5 ms +- 1.8 ms: 1.02x faster
deepcopy_reduce: Mean +- std dev: [main] 3.40 us +- 0.05 us -> [pr] 3.36 us +- 0.04 us: 1.01x faster
deltablue: Mean +- std dev: [main] 3.91 ms +- 0.12 ms -> [pr] 3.86 ms +- 0.11 ms: 1.01x faster
docutils: Mean +- std dev: [main] 3.21 sec +- 0.07 sec -> [pr] 3.16 sec +- 0.07 sec: 1.01x faster
dulwich_log: Mean +- std dev: [main] 79.3 ms +- 2.2 ms -> [pr] 78.0 ms +- 1.7 ms: 1.02x faster
fannkuch: Mean +- std dev: [main] 544 ms +- 13 ms -> [pr] 509 ms +- 14 ms: 1.07x faster
float: Mean +- std dev: [main] 93.9 ms +- 3.9 ms -> [pr] 91.0 ms +- 3.5 ms: 1.03x faster
gc_traversal: Mean +- std dev: [main] 5.05 ms +- 0.06 ms -> [pr] 5.01 ms +- 0.12 ms: 1.01x faster
generators: Mean +- std dev: [main] 37.3 ms +- 1.2 ms -> [pr] 36.4 ms +- 1.2 ms: 1.02x faster
genshi_text: Mean +- std dev: [main] 26.6 ms +- 1.0 ms -> [pr] 27.0 ms +- 0.7 ms: 1.01x slower
hexiom: Mean +- std dev: [main] 7.86 ms +- 0.21 ms -> [pr] 7.47 ms +- 0.17 ms: 1.05x faster
html5lib: Mean +- std dev: [main] 72.3 ms +- 2.0 ms -> [pr] 71.4 ms +- 1.8 ms: 1.01x faster
json_dumps: Mean +- std dev: [main] 14.7 ms +- 0.5 ms -> [pr] 14.2 ms +- 0.6 ms: 1.03x faster
json_loads: Mean +- std dev: [main] 32.1 us +- 0.8 us -> [pr] 32.7 us +- 1.1 us: 1.02x slower
logging_silent: Mean +- std dev: [main] 137 ns +- 3 ns -> [pr] 139 ns +- 3 ns: 1.02x slower
logging_simple: Mean +- std dev: [main] 7.30 us +- 0.26 us -> [pr] 7.08 us +- 0.11 us: 1.03x faster
mdp: Mean +- std dev: [main] 1.79 sec +- 0.07 sec -> [pr] 1.82 sec +- 0.08 sec: 1.02x slower
nbody: Mean +- std dev: [main] 123 ms +- 4 ms -> [pr] 122 ms +- 3 ms: 1.01x faster
nqueens: Mean +- std dev: [main] 106 ms +- 3 ms -> [pr] 103 ms +- 4 ms: 1.03x faster
pickle: Mean +- std dev: [main] 14.9 us +- 1.0 us -> [pr] 14.3 us +- 0.6 us: 1.05x faster
pickle_list: Mean +- std dev: [main] 5.62 us +- 0.15 us -> [pr] 5.36 us +- 0.13 us: 1.05x faster
pidigits: Mean +- std dev: [main] 223 ms +- 3 ms -> [pr] 226 ms +- 7 ms: 1.02x slower
python_startup: Mean +- std dev: [main] 14.3 ms +- 0.2 ms -> [pr] 14.2 ms +- 0.2 ms: 1.01x faster
python_startup_no_site: Mean +- std dev: [main] 11.0 ms +- 0.2 ms -> [pr] 10.9 ms +- 0.2 ms: 1.00x faster
regex_compile: Mean +- std dev: [main] 155 ms +- 3 ms -> [pr] 162 ms +- 5 ms: 1.04x slower
regex_dna: Mean +- std dev: [main] 188 ms +- 5 ms -> [pr] 190 ms +- 5 ms: 1.01x slower
regex_v8: Mean +- std dev: [main] 27.4 ms +- 0.6 ms -> [pr] 27.8 ms +- 0.6 ms: 1.01x slower
richards_super: Mean +- std dev: [main] 61.5 ms +- 1.2 ms -> [pr] 63.3 ms +- 1.7 ms: 1.03x slower
scimark_fft: Mean +- std dev: [main] 423 ms +- 13 ms -> [pr] 439 ms +- 18 ms: 1.04x slower
scimark_lu: Mean +- std dev: [main] 152 ms +- 2 ms -> [pr] 155 ms +- 4 ms: 1.02x slower
scimark_monte_carlo: Mean +- std dev: [main] 85.6 ms +- 3.0 ms -> [pr] 83.4 ms +- 2.4 ms: 1.03x faster
scimark_sor: Mean +- std dev: [main] 153 ms +- 4 ms -> [pr] 156 ms +- 6 ms: 1.02x slower
scimark_sparse_mat_mult: Mean +- std dev: [main] 5.62 ms +- 0.33 ms -> [pr] 5.48 ms +- 0.17 ms: 1.02x faster
spectral_norm: Mean +- std dev: [main] 122 ms +- 4 ms -> [pr] 129 ms +- 4 ms: 1.05x slower
sqlglot_parse: Mean +- std dev: [main] 1.59 ms +- 0.05 ms -> [pr] 1.56 ms +- 0.02 ms: 1.02x faster
sqlglot_transpile: Mean +- std dev: [main] 2.01 ms +- 0.08 ms -> [pr] 1.94 ms +- 0.03 ms: 1.04x faster
sqlglot_optimize: Mean +- std dev: [main] 72.7 ms +- 2.0 ms -> [pr] 69.2 ms +- 1.5 ms: 1.05x faster
sqlglot_normalize: Mean +- std dev: [main] 142 ms +- 5 ms -> [pr] 139 ms +- 4 ms: 1.03x faster
sqlite_synth: Mean +- std dev: [main] 3.02 us +- 0.15 us -> [pr] 2.94 us +- 0.18 us: 1.03x faster
telco: Mean +- std dev: [main] 9.42 ms +- 0.36 ms -> [pr] 9.28 ms +- 0.36 ms: 1.01x faster
tomli_loads: Mean +- std dev: [main] 2.52 sec +- 0.09 sec -> [pr] 2.48 sec +- 0.10 sec: 1.01x faster
typing_runtime_protocols: Mean +- std dev: [main] 219 us +- 11 us -> [pr] 208 us +- 7 us: 1.05x faster
unpack_sequence: Mean +- std dev: [main] 52.9 ns +- 1.7 ns -> [pr] 52.0 ns +- 1.7 ns: 1.02x faster
unpickle: Mean +- std dev: [main] 17.0 us +- 0.6 us -> [pr] 16.8 us +- 0.6 us: 1.01x faster
unpickle_list: Mean +- std dev: [main] 5.99 us +- 0.15 us -> [pr] 5.74 us +- 0.18 us: 1.04x faster
xml_etree_parse: Mean +- std dev: [main] 163 ms +- 4 ms -> [pr] 165 ms +- 5 ms: 1.01x slower
xml_etree_iterparse: Mean +- std dev: [main] 127 ms +- 7 ms -> [pr] 122 ms +- 3 ms: 1.04x faster
xml_etree_generate: Mean +- std dev: [main] 118 ms +- 6 ms -> [pr] 115 ms +- 4 ms: 1.03x faster
xml_etree_process: Mean +- std dev: [main] 78.0 ms +- 3.0 ms -> [pr] 79.2 ms +- 1.7 ms: 1.02x slower

Benchmark hidden because not significant (24): asyncio_tcp, asyncio_tcp_ssl, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, create_gc_cycles, genshi_xml, go, logging_format, mako, meteor_contest, pathlib, pickle_dict, pickle_pure_python, pprint_safe_repr, pprint_pformat, pyflate, raytrace, regex_effbot, richards, unpickle_pure_python

Geometric mean: 1.01x faster

Benchmark script
# Quick benchmark for cpython freelists for small lists

import pyperf

def bench_list_from_tuple(loops):
    range_it = range(loops)
   
    x = (1, 2)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = list(x)
    return pyperf.perf_counter() - t0

def bench_list_create(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = [1, ii]
    return pyperf.perf_counter() - t0

def bench_list_create_const(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = [1, 3]
    return pyperf.perf_counter() - t0

def bench_list_copy(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = x.copy()
    return pyperf.perf_counter() - t0

def bench_list_repeat(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = x * 2
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("bench_list_create", bench_list_create)
    runner.bench_time_func("bench_list_create_const", bench_list_create_const)
    runner.bench_time_func("bench_list_from_tuple", bench_list_from_tuple)
    runner.bench_time_func("bench_list_copy", bench_list_copy)
    runner.bench_time_func("bench_list_repeat", bench_list_repeat)

Open questions:

  • Do we need to keep the existing freelist?
  • The freelist works well for lists that are directly allocated. But lists are also resized often. Can we pick the freelist bins to be multiples of 4 to align with the resize strategy? (this goes at the expense of a bit more overallocation)
  • Could we instead create a freelist for the ob_item elements?
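
Regarding the question about bins that are multiples of 4: for reference, CPython's list_resize grows the capacity with approximately the following rule. This is a Python transcription of the C expression in Objects/listobject.c, so treat the exact constants as an assumption about the current source.

```python
def overallocate(newsize: int) -> int:
    # Assumed transcription of list_resize's growth rule:
    #   new_allocated = (newsize + (newsize >> 3) + 6) & ~3
    # The "& ~3" rounds down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3


print([overallocate(n) for n in range(1, 9)])
# → [4, 8, 8, 8, 8, 12, 12, 12]
```

If this is indeed the growth rule, every resized capacity is a multiple of 4, which would argue for binning the small-list freelists the same way.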

@eendebakpt eendebakpt marked this pull request as draft February 9, 2025 21:53
@eendebakpt eendebakpt force-pushed the list_freelist_plus branch from c8f9ca3 to b81c62f Compare March 4, 2025 07:45
@eendebakpt (Contributor, Author) commented:
To determine how well the freelist works, here are some statistics. For each allocation type (e.g. allocation via the small-list freelist, the normal freelist, or a normal allocation) we keep track of the size of the allocated list. We run the pyperformance benchmark suite and plot the resulting allocations as a histogram.

[image: histogram of list allocation sizes, split by allocation type]

We see that most allocations are for small lists. The funny parabola-like shape is probably due to one of the benchmarks. We zoom into the region of small lists.

[image: zoom into the small-list region of the allocation histogram]

Most allocations for small sizes now go through the new freelist.

For the deallocations we have similar results (here the split is only between pushes to the small-list freelist and all other deallocations).

[image: histogram of list deallocation sizes]

Finally, we look at the distribution of list resizes. Most resizes are to small lists. The resize implementation results in certain sizes being used more often (most visible for sizes > 30), but this also holds for smaller sizes (keep in mind we are using a log scale).

[image: distribution of list resize target sizes, log scale]

The branch used to generate the statistics is https://github.com/eendebakpt/cpython/tree/small_list_freelist_statistics

@eendebakpt eendebakpt marked this pull request as ready for review April 3, 2025 21:54
@Fidget-Spinner (Member) commented:
This adds slightly more complexity than the other implementations. Could you ping Mike Droetboom on Monday to benchmark this on the bench runner, please?

@eendebakpt (Contributor, Author) commented:
@mdboom Would you be able to run the performance benchmarks on this PR? Thanks

@Fidget-Spinner (Member) commented:
No speedup or slowdown (it's within noise) on the benchmark runner: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250408-3.14.0a7+-77f3e29

@eendebakpt (Contributor, Author) commented:
Thanks for running the benchmarks. The most improved benchmark (fannkuch) uses lists; the other improved benchmarks do not always involve lists, so the overall performance change indeed seems to be neutral.

I have some more statistics gathered at freelist_stats.md. They show a clear distinction between the freelists for even and odd list sizes (probably related to the allocation strategy).

@mdboom (Contributor) commented Apr 15, 2025:

@mdboom Would you be able to run the performance benchmarks on this PR? Thanks

Sorry, this flew under my radar. Looks like @brandtbucher kicked these off for you, though.
