gh-126703: Add freelists for small size lists by eendebakpt · Pull Request #129921 · python/cpython · GitHub

Status: Open. Wants to merge 17 commits into base branch main.
Conversation

@eendebakpt (Contributor) commented Feb 9, 2025:

We add freelists for all lists of small length. The advantage over the existing freelist for lists is that we do not need to allocate and deallocate the buffer pointed to by the ob_item field.
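
The idea can be modeled in pure Python as a size-binned freelist (this is only an illustrative sketch, not CPython's actual C implementation; all names such as `SmallListPool` and `MAX_SMALL_SIZE` are hypothetical): one bin per small capacity, so a recycled list keeps its item buffer instead of freeing it.

```python
# Hypothetical pure-Python model of a size-binned freelist. A buffer
# returned to the pool is reused for the next request of the same size,
# avoiding a fresh allocation of the item storage.

MAX_SMALL_SIZE = 5  # assumed cut-off for "small" lists


class SmallListPool:
    """Size-binned freelist: bin k holds buffers with exactly k slots."""

    def __init__(self):
        self._bins = {k: [] for k in range(1, MAX_SMALL_SIZE + 1)}

    def acquire(self, size):
        """Return a buffer of `size` slots, reusing a pooled one if possible."""
        bin_ = self._bins.get(size)
        if bin_:
            return bin_.pop()      # reuse: no fresh buffer allocation
        return [None] * size       # miss: allocate a new buffer

    def release(self, buf):
        """Put a buffer back in its bin; sizes outside the range are dropped."""
        bin_ = self._bins.get(len(buf))
        if bin_ is not None:
            bin_.append(buf)


pool = SmallListPool()
a = pool.acquire(2)
pool.release(a)
assert pool.acquire(2) is a  # the 2-slot buffer was reused, ob_item and all
```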

Microbenchmarks show good results:

bench_list_create: Mean +- std dev: [main_list_v3] 51.8 ns +- 0.6 ns -> [pr_list_v3] 39.0 ns +- 2.1 ns: 1.33x faster
bench_list_create_const: Mean +- std dev: [main_list_v3] 51.6 ns +- 1.5 ns -> [pr_list_v3] 39.2 ns +- 1.7 ns: 1.31x faster
bench_list_from_tuple: Mean +- std dev: [main_list_v3] 77.5 ns +- 1.3 ns -> [pr_list_v3] 78.4 ns +- 2.2 ns: 1.01x slower
bench_list_copy: Mean +- std dev: [main_list_v3] 51.0 ns +- 1.2 ns -> [pr_list_v3] 37.7 ns +- 1.0 ns: 1.35x faster
bench_list_repeat: Mean +- std dev: [main_list_v3] 74.4 ns +- 2.8 ns -> [pr_list_v3] 57.4 ns +- 1.0 ns: 1.30x faster

Geometric mean: 1.25x faster

(Linux, PGO, tail calling interpreter). Pyperformance results are a 1% improvement, but this might be due to my unstable machine.

Pyperformance result
All benchmarks:
===============

2to3: Mean +- std dev: [main] 318 ms +- 3 ms -> [pr] 321 ms +- 4 ms: 1.01x slower
async_generators: Mean +- std dev: [main] 505 ms +- 17 ms -> [pr] 481 ms +- 7 ms: 1.05x faster
asyncio_websockets: Mean +- std dev: [main] 638 ms +- 12 ms -> [pr] 658 ms +- 15 ms: 1.03x slower
chaos: Mean +- std dev: [main] 75.3 ms +- 2.1 ms -> [pr] 74.2 ms +- 2.2 ms: 1.01x faster
comprehensions: Mean +- std dev: [main] 21.2 us +- 0.7 us -> [pr] 20.8 us +- 0.2 us: 1.02x faster
coverage: Mean +- std dev: [main] 268 ms +- 7 ms -> [pr] 275 ms +- 10 ms: 1.03x slower
crypto_pyaes: Mean +- std dev: [main] 83.7 ms +- 2.4 ms -> [pr] 82.5 ms +- 1.8 ms: 1.02x faster
deepcopy_reduce: Mean +- std dev: [main] 3.40 us +- 0.05 us -> [pr] 3.36 us +- 0.04 us: 1.01x faster
deltablue: Mean +- std dev: [main] 3.91 ms +- 0.12 ms -> [pr] 3.86 ms +- 0.11 ms: 1.01x faster
docutils: Mean +- std dev: [main] 3.21 sec +- 0.07 sec -> [pr] 3.16 sec +- 0.07 sec: 1.01x faster
dulwich_log: Mean +- std dev: [main] 79.3 ms +- 2.2 ms -> [pr] 78.0 ms +- 1.7 ms: 1.02x faster
fannkuch: Mean +- std dev: [main] 544 ms +- 13 ms -> [pr] 509 ms +- 14 ms: 1.07x faster
float: Mean +- std dev: [main] 93.9 ms +- 3.9 ms -> [pr] 91.0 ms +- 3.5 ms: 1.03x faster
gc_traversal: Mean +- std dev: [main] 5.05 ms +- 0.06 ms -> [pr] 5.01 ms +- 0.12 ms: 1.01x faster
generators: Mean +- std dev: [main] 37.3 ms +- 1.2 ms -> [pr] 36.4 ms +- 1.2 ms: 1.02x faster
genshi_text: Mean +- std dev: [main] 26.6 ms +- 1.0 ms -> [pr] 27.0 ms +- 0.7 ms: 1.01x slower
hexiom: Mean +- std dev: [main] 7.86 ms +- 0.21 ms -> [pr] 7.47 ms +- 0.17 ms: 1.05x faster
html5lib: Mean +- std dev: [main] 72.3 ms +- 2.0 ms -> [pr] 71.4 ms +- 1.8 ms: 1.01x faster
json_dumps: Mean +- std dev: [main] 14.7 ms +- 0.5 ms -> [pr] 14.2 ms +- 0.6 ms: 1.03x faster
json_loads: Mean +- std dev: [main] 32.1 us +- 0.8 us -> [pr] 32.7 us +- 1.1 us: 1.02x slower
logging_silent: Mean +- std dev: [main] 137 ns +- 3 ns -> [pr] 139 ns +- 3 ns: 1.02x slower
logging_simple: Mean +- std dev: [main] 7.30 us +- 0.26 us -> [pr] 7.08 us +- 0.11 us: 1.03x faster
mdp: Mean +- std dev: [main] 1.79 sec +- 0.07 sec -> [pr] 1.82 sec +- 0.08 sec: 1.02x slower
nbody: Mean +- std dev: [main] 123 ms +- 4 ms -> [pr] 122 ms +- 3 ms: 1.01x faster
nqueens: Mean +- std dev: [main] 106 ms +- 3 ms -> [pr] 103 ms +- 4 ms: 1.03x faster
pickle: Mean +- std dev: [main] 14.9 us +- 1.0 us -> [pr] 14.3 us +- 0.6 us: 1.05x faster
pickle_list: Mean +- std dev: [main] 5.62 us +- 0.15 us -> [pr] 5.36 us +- 0.13 us: 1.05x faster
pidigits: Mean +- std dev: [main] 223 ms +- 3 ms -> [pr] 226 ms +- 7 ms: 1.02x slower
python_startup: Mean +- std dev: [main] 14.3 ms +- 0.2 ms -> [pr] 14.2 ms +- 0.2 ms: 1.01x faster
python_startup_no_site: Mean +- std dev: [main] 11.0 ms +- 0.2 ms -> [pr] 10.9 ms +- 0.2 ms: 1.00x faster
regex_compile: Mean +- std dev: [main] 155 ms +- 3 ms -> [pr] 162 ms +- 5 ms: 1.04x slower
regex_dna: Mean +- std dev: [main] 188 ms +- 5 ms -> [pr] 190 ms +- 5 ms: 1.01x slower
regex_v8: Mean +- std dev: [main] 27.4 ms +- 0.6 ms -> [pr] 27.8 ms +- 0.6 ms: 1.01x slower
richards_super: Mean +- std dev: [main] 61.5 ms +- 1.2 ms -> [pr] 63.3 ms +- 1.7 ms: 1.03x slower
scimark_fft: Mean +- std dev: [main] 423 ms +- 13 ms -> [pr] 439 ms +- 18 ms: 1.04x slower
scimark_lu: Mean +- std dev: [main] 152 ms +- 2 ms -> [pr] 155 ms +- 4 ms: 1.02x slower
scimark_monte_carlo: Mean +- std dev: [main] 85.6 ms +- 3.0 ms -> [pr] 83.4 ms +- 2.4 ms: 1.03x faster
scimark_sor: Mean +- std dev: [main] 153 ms +- 4 ms -> [pr] 156 ms +- 6 ms: 1.02x slower
scimark_sparse_mat_mult: Mean +- std dev: [main] 5.62 ms +- 0.33 ms -> [pr] 5.48 ms +- 0.17 ms: 1.02x faster
spectral_norm: Mean +- std dev: [main] 122 ms +- 4 ms -> [pr] 129 ms +- 4 ms: 1.05x slower
sqlglot_parse: Mean +- std dev: [main] 1.59 ms +- 0.05 ms -> [pr] 1.56 ms +- 0.02 ms: 1.02x faster
sqlglot_transpile: Mean +- std dev: [main] 2.01 ms +- 0.08 ms -> [pr] 1.94 ms +- 0.03 ms: 1.04x faster
sqlglot_optimize: Mean +- std dev: [main] 72.7 ms +- 2.0 ms -> [pr] 69.2 ms +- 1.5 ms: 1.05x faster
sqlglot_normalize: Mean +- std dev: [main] 142 ms +- 5 ms -> [pr] 139 ms +- 4 ms: 1.03x faster
sqlite_synth: Mean +- std dev: [main] 3.02 us +- 0.15 us -> [pr] 2.94 us +- 0.18 us: 1.03x faster
telco: Mean +- std dev: [main] 9.42 ms +- 0.36 ms -> [pr] 9.28 ms +- 0.36 ms: 1.01x faster
tomli_loads: Mean +- std dev: [main] 2.52 sec +- 0.09 sec -> [pr] 2.48 sec +- 0.10 sec: 1.01x faster
typing_runtime_protocols: Mean +- std dev: [main] 219 us +- 11 us -> [pr] 208 us +- 7 us: 1.05x faster
unpack_sequence: Mean +- std dev: [main] 52.9 ns +- 1.7 ns -> [pr] 52.0 ns +- 1.7 ns: 1.02x faster
unpickle: Mean +- std dev: [main] 17.0 us +- 0.6 us -> [pr] 16.8 us +- 0.6 us: 1.01x faster
unpickle_list: Mean +- std dev: [main] 5.99 us +- 0.15 us -> [pr] 5.74 us +- 0.18 us: 1.04x faster
xml_etree_parse: Mean +- std dev: [main] 163 ms +- 4 ms -> [pr] 165 ms +- 5 ms: 1.01x slower
xml_etree_iterparse: Mean +- std dev: [main] 127 ms +- 7 ms -> [pr] 122 ms +- 3 ms: 1.04x faster
xml_etree_generate: Mean +- std dev: [main] 118 ms +- 6 ms -> [pr] 115 ms +- 4 ms: 1.03x faster
xml_etree_process: Mean +- std dev: [main] 78.0 ms +- 3.0 ms -> [pr] 79.2 ms +- 1.7 ms: 1.02x slower

Benchmark hidden because not significant (24): asyncio_tcp, asyncio_tcp_ssl, bench_mp_pool, bench_thread_pool, coroutines, dask, deepcopy, deepcopy_memo, create_gc_cycles, genshi_xml, go, logging_format, mako, meteor_contest, pathlib, pickle_dict, pickle_pure_python, pprint_safe_repr, pprint_pformat, pyflate, raytrace, regex_effbot, richards, unpickle_pure_python

Geometric mean: 1.01x faster

Benchmark script
# Quick benchmark for cpython freelists for small lists

import pyperf

def bench_list_from_tuple(loops):
    range_it = range(loops)
   
    x = (1, 2)
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = list(x)
    return pyperf.perf_counter() - t0

def bench_list_create(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = [1, ii]
    return pyperf.perf_counter() - t0

def bench_list_create_const(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        x = [1, 3]
    return pyperf.perf_counter() - t0

def bench_list_copy(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = x.copy()
    return pyperf.perf_counter() - t0

def bench_list_repeat(loops):
    range_it = range(loops)
   
    x = [1, 2]
    t0 = pyperf.perf_counter()
    for ii in range_it:
        _ = x * 2
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("bench_list_create", bench_list_create)
    runner.bench_time_func("bench_list_create_const", bench_list_create_const)
    runner.bench_time_func("bench_list_from_tuple", bench_list_from_tuple)
    runner.bench_time_func("bench_list_copy", bench_list_copy)
    runner.bench_time_func("bench_list_repeat", bench_list_repeat)

Open questions:

  • Do we need to keep the existing freelist?
  • The freelist works well for lists that are directly allocated. But lists are also resized often. Can we pick the freelist bins to be multiples of 4 to align with the resize strategy? (this goes at the expense of a bit more overallocation)
  • Could we instead create a freelist for the ob_item elements?
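
Regarding the question about bins that are multiples of 4: for reference, CPython's list_resize grows the capacity with approximately the following rule. This is a Python transcription of the C expression in Objects/listobject.c, so treat the exact constants as an assumption about the current source.

```python
def overallocate(newsize: int) -> int:
    # Assumed transcription of list_resize's growth rule:
    #   new_allocated = (newsize + (newsize >> 3) + 6) & ~3
    # The "& ~3" rounds down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3


print([overallocate(n) for n in range(1, 9)])
# → [4, 8, 8, 8, 8, 12, 12, 12]
```

If this is indeed the growth rule, every resized capacity is a multiple of 4, which would argue for binning the small-list freelists the same way.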

@eendebakpt eendebakpt marked this pull request as draft February 9, 2025 21:53
@eendebakpt eendebakpt force-pushed the list_freelist_plus branch from c8f9ca3 to b81c62f Compare March 4, 2025 07:45
@eendebakpt (Contributor, Author) commented:
To determine how well the freelist works, here are some statistics. For each allocation type (e.g. allocation via the small-list freelist, the normal freelist, or a normal allocation) we keep track of the size of the allocated list. We run the pyperformance benchmark suite and plot the resulting allocations as a histogram.

[image: histogram of list allocation sizes, split by allocation type]

We see that most allocations are for small lists. The funny parabola-like shape is probably due to one of the benchmarks. We zoom into the region of small lists.

[image: zoom into the small-list region of the allocation histogram]

Most allocations for small sizes now go through the new freelist.

For the deallocations we have similar results (here the split is only between pushes to the small-list freelist and all other deallocations).

[image: histogram of list deallocation sizes]

Finally, we look at the distribution of list resizes. Most resizes are to small lists. The resize implementation results in certain sizes being used more often (most visible for sizes > 30), but this also holds for smaller sizes (keep in mind we are using a log scale).

[image: distribution of list resize target sizes, log scale]

The branch used to generate the statistics is https://github.com/eendebakpt/cpython/tree/small_list_freelist_statistics

@eendebakpt eendebakpt marked this pull request as ready for review April 3, 2025 21:54
@Fidget-Spinner (Member) commented:
This adds slightly more complexity than the other implementations. Could you ping Mike Droetboom on Monday to benchmark this on the bench runner, please?

@eendebakpt (Contributor, Author) commented:
@mdboom Would you be able to run the performance benchmarks on this PR? Thanks

@Fidget-Spinner (Member) commented:
No speedup or slowdown (it's within noise) on the benchmark runner: https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250408-3.14.0a7+-77f3e29

@eendebakpt (Contributor, Author) commented:
Thanks for running the benchmarks. The most improved benchmark (fannkuch) uses lists; the other improved benchmarks do not always involve lists, so the overall performance change indeed seems to be neutral.

I have some more statistics gathered at freelist_stats.md. They show a clear distinction between the freelists for even and odd list sizes (probably related to the allocation strategy).

@mdboom (Contributor) commented Apr 15, 2025:

@mdboom Would you be able to run the performance benchmarks on this PR? Thanks

Sorry, this flew under my radar. Looks like @brandtbucher kicked these off for you, though.
