Increase _PY_NSMALLPOSINTS size #133059

Open

dg-pb opened this issue Apr 27, 2025 · 4 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

dg-pb (Contributor) commented Apr 27, 2025

Feature or enhancement

Proposal:

See the faster-cpython link below for the initial analysis and backstory if interested, but I will try to summarise it all here:


So the intent is to make int creation faster:

from collections import deque
consume = deque(maxlen=0).extend   # exhausts any iterable without storing items
INT_RANGE = list(range(100_000))
%timeit consume(range(100_000))    # 2.3 ms -- a new int object is allocated for most values
%timeit consume(iter(INT_RANGE))   # 0.6 ms -- pre-existing int objects are re-used

And pre-storing int objects is a straightforward path to achieve this:

from itertools import chain
chain = chain.from_iterable
%timeit consume(range(100_000))                         # 2.1 ms
%timeit consume(chain([range(256)] * (100_000 // 256))) # 0.5 ms -- stays within the cached [0, 256) range

Increasing the value of _PY_NSMALLPOSINTS does exactly that.
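
For context, the current cache covers integers in [-5, 256]. A quick way to probe the boundary from Python (relying on a CPython implementation detail, not a language guarantee) is to compare object identity of freshly parsed values:

def is_cached(n):
    # Cached small ints come back as the very same object every time they are
    # requested; anything outside the cache is a freshly allocated PyLongObject.
    return int(str(n)) is int(str(n))

print(is_cached(256))   # True on current CPython
print(is_cached(257))   # False: outside _PY_NSMALLPOSINTS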


Benefits of this can be observed in the pyperformance benchmarks.
With each incremental increase of this number, new statistically significant benefits surface:

  1. 1024 results in 10% faster regex_v8
  2. 2048 adds 4% faster regex_dna and 8% faster regex_effbot
  3. 8192 adds the following in the range of 4-11%: genshi_*, regex_compile, scimark_*, spectral_norm, xml_* and few more
  4. 100K adds 9% faster scimark_monte_carlo, 6% faster scimark_sparse_mat_mult and 12% faster spectral_norm

As can be seen, each range of integers benefits different applications:

  1. <2048 covers most standard everyday use cases. Additionally, it would benefit some iterative algorithms such as string parsing.
  2. 8192 benefits a portion of scientific applications of linear algebra with dimensions up to this number, and integer operations whose results largely fall into this range.
  3. 100K+ has an observable impact for scientific (and other) applications of high dimensionality, such as sparse matrices, large graphs and similar.

This number can be increased to 1M, 10M, etc., and it is possible to find further observable benefits in specific applications.


That said, more involved scientific applications should aim to benefit from optimized libraries such as numpy, which most likely leaves (3) out of scope for this change.

To find the most suitable number, the following two calculations can give some insight:

  1. Determining what sort of integers are used during startup and while importing all standard library packages.

Python is launched with:

import sys
from importlib import import_module

names = set(sys.stdlib_module_names)
# Exclude platform-specific modules that may not be importable here.
names -= {'_msi', 'msilib', '_overlapped', '_winapi', '_wmi'}
names -= {'spwd', 'ossaudiodev', 'msvcrt', 'nt', 'winreg', 'winsound'}
# Exclude easter-egg modules with import-time side effects.
names -= {'antigravity', 'this'}
mods = [import_module(name) for name in names]

Cumulative density graph of PyLongObject requests:

[Image: cumulative density of PyLongObject requests during startup and stdlib imports]

256 captures ~83% of all integers used in startup and imports.
Beyond 256, there are 2 visible increments, at 512 and 4096.
4096 would increase coverage to 93%.
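
As a minimal sketch of how such a cumulative coverage curve can be computed (stand-in data below, not the actual startup trace):

from collections import Counter

# Stand-in values; the real analysis counted every PyLongObject request made
# during startup and the stdlib imports above.
requested = [0, 0, 1, 2, 3, 100, 255, 300, 512, 600, 4096, 5000]

counts = Counter(requested)
total = sum(counts.values())
covered = 0
for value in sorted(counts):
    covered += counts[value]
    print(f"cache covering 0..{value}: {covered / total:.0%} of requests hit")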


  2. GitHub use case analysis

The above only captures startup and imports, not user code.
For that, a GitHub code search for the pattern \brange(\d+) can provide some more insight.

[Image: distribution of range() arguments found by the GitHub search]
Total of 4.3M use cases.

It can be seen that 91% of ranges initialised with one argument have it set below 260, so the current cache size covers them well.
However, the raw number of use cases is not by itself indicative of the performance benefit.
What is more appropriate here is to see how many integer objects would be re-used at each cache size.
Below is a graph of the extra integers re-used with each incremental increase of the cache size.

[Image: extra integer re-uses gained per cache-size increment]

So the current value of 256 allows re-use of 99M (11+54+34) integers generated by these range calls.
Increasing it to 1000 would add an extra 33M.
Increasing it to 10000 would add an extra 240M (33 + 2 + 5 + 200).
Etc...
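
To make the arithmetic concrete, here is a minimal sketch (with a hypothetical histogram, not the actual search data) of how these re-use counts are derived from the range() arguments found by the search:

def reused_ints(hist, cache_size):
    # Each range(N) call re-uses min(N, cache_size) pre-allocated integers;
    # everything above the cache boundary is allocated fresh.
    return sum(count * min(n, cache_size) for n, count in hist.items())

# Hypothetical histogram: upper bound N -> number of range(N) calls found.
hist = {10: 2_000_000, 100: 1_500_000, 1_000: 600_000, 100_000: 5_000}

for size in (256, 1_000, 10_000):
    print(size, f"{reused_ints(hist, size):,}")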

So this is in line with the findings from the benchmark results above:

  1. There is benefit in increasing the cache size to a medium-sized number <= 10K.
  2. Increasing this number further has performance benefits up to very high values. Although these can be significant, they are rare cases and the average user should not be paying for them.

So my best guess is that the upper bound for this number is ~10K, which:

  1. Increases startup time by 1% (with every extra 10K).
  2. Takes 280KB (2-3% of the initial Python process) of memory.

There should be a very good reason to sacrifice anything more solely for integer storage.
(And I suspect this may already be higher than what others would agree on.)

From all the information above, together with some analysis from @eendebakpt in faster-cpython/ideas#725, it seems that both 1024 and 2048 could be good candidates.


So this is my best inference.
What do others think?

  1. Is such a change possible?
  2. If yes, is the upper bound above reasonable?
  3. What is the optimal number to set?

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

faster-cpython/ideas#725

Linked PRs

@dg-pb dg-pb added the type-feature A feature request or enhancement label Apr 27, 2025
@picnixz picnixz added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Apr 27, 2025
dg-pb (Contributor, Author) commented Apr 28, 2025

The graph below has helped to make up my mind.

[Image: use frequency of integers by range bucket, from the GitHub range() search]

Given all the range calls from the GitHub search, it shows the use frequency of numbers in a specified range.

So, for example, given all those range calls, every stored integer in the range [0, 10) will be used at least 2.06M times (range(9)) and at most 4.26M times (range(0)), which is also the total number of search results.
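
The quantity plotted can be sketched as follows (again with stand-in data): the use frequency of a candidate integer k is simply the number of range(N) calls with N > k:

from bisect import bisect_right

# Stand-in list of range() upper bounds; the real data comes from the search.
upper_bounds = sorted([10, 10, 100, 256, 1_000, 1_024, 4_096, 100_000])

def use_frequency(k):
    # Every call with an upper bound greater than k produces the integer k once.
    return len(upper_bounds) - bisect_right(upper_bounds, k)

for k in (0, 9, 255, 1_024):
    print(k, use_frequency(k))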

It is visible that usage of numbers in the range [600, 1030) is significant, and the incremental decrease in usage compared to the previous range [260, 600) is very low.

The results are plotted based on the lower bound, which is 0.35M, but there is a significant segregation of those search results in the range [1000, 1030). To be precise, 0.16M out of 0.20M numbers are in this range.
Thus, the majority of the numbers in this range are used ~0.3M times.

Numbers in this range are used only 2-3x less than the ones in [100, 260), which are captured by the current cache.

The next range of numbers, [1030, 2000), is significantly less useful in comparison.


So my proposal is to set this number to 1025.

It includes the common value 1024 and fully captures range(1000), which is very common.

It has a significant positive impact on 2 benchmarks:

+--------------------------+-------------+-----------------+--------------+------------------------+
| Benchmark                | pyMAIN.json | pyWT5_1024.json | Change       | Significance           |
+==========================+=============+=================+==============+========================+
| regex_dna                | 210 ms      | 201 ms          | 1.05x faster | Significant (t=14.37)  |
+--------------------------+-------------+-----------------+--------------+------------------------+
| regex_v8*                | 33.4 ms     | 30.3 ms         | 1.10x faster | Significant (t=17.99)  |
+--------------------------+-------------+-----------------+--------------+------------------------+

Extra costs for increasing this number to 1025 are:

  1. ~22KB memory (~0.2% increase)
  2. ~23µs startup time (~0.1% increase)
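
As a rough sanity check of the memory figure (a back-of-envelope sketch, assuming a 64-bit CPython build where a one-digit int object takes ~28 bytes; the exact cost also depends on the runtime struct layout):

import sys

extra_ints = 1025 - 257           # new cache entries beyond the current 0..256
per_object = sys.getsizeof(1)     # ~28 bytes for a one-digit PyLongObject
print(extra_ints * per_object)    # ~21.5 KB, in line with the ~22KB above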

markshannon (Member) commented

@dg-pb do you have a public branch that we can benchmark on our infrastructure?

dg-pb (Contributor, Author) commented Apr 29, 2025

#133160

If you want to check different numbers, there is nothing else to do apart from:

  1. Change the number in pycore_runtime_structs.h
  2. Run generate_global_objects.py

dg-pb (Contributor, Author) commented Apr 30, 2025

To address negatives (as per comment/request #133160 (comment)), I have done a quick check.

  1. See the very first graph in the original post, which counts the integers used when launching Python with the complete set of standard library imports. It does include negatives too. It cannot be seen in the graph, but there are:
    a) 2055 in [-5, -1]
    b) 193 in [-inf, -6]
    while the total is 117556. (The 20% line is with 0 included, which accounts for 18K.)

  2. A similar GitHub search of range use cases shows that there are 1.7K (maybe a bit more) range calls that produce negative ints.

[Images: GitHub search results for range() calls producing negative integers]


So, from (1), the current negative range of [-5, -1] is sufficient. It covers 2% of all integers used at startup including imports (see the first graph in the original post), while the rest of the negative integers constitute only 0.2%.

But the range analysis in (2) shows that there is some scope to consider increasing this number. Here it largely depends on what perspective is taken:
a) If the aim is to capture the same percentage of usages as we would for positives if they were cached up to 1024 inclusive (which is ~96%), then caching negatives down to -19 inclusive would do it (a sketch of this calculation is below).
b) If the aim is to achieve the same hit/cache benefit, then none of the negatives qualify at all.

(b) is, in my opinion, the much more tangible point of view.
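
For completeness, a minimal sketch (with hypothetical counts, not the actual search data) of the kind of calculation behind (a), i.e. finding the negative bound whose cumulative coverage matches the ~96% achieved for positives at 1024:

# Hypothetical counts of range() calls keyed by the most negative integer they
# produce; the real numbers come from the GitHub search above.
negative_hist = {-1: 800, -2: 300, -5: 250, -10: 150, -19: 135, -100: 40, -1000: 25}

target = 0.96
total = sum(negative_hist.values())
covered = 0
for bound in sorted(negative_hist, reverse=True):   # -1, -2, -5, ...
    covered += negative_hist[bound]
    if covered / total >= target:
        print(f"caching down to {bound} reaches {covered / total:.0%} coverage")
        break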


Also, there is little to no chance that such a small increase in the negative cache size has a noticeable impact on benchmark runtimes. Given the above, I am comfortable not running them for such a negligible chance. Even if a significant improvement were detected, it would most likely be explained away as a very specific case.


The frequency of negative integer usage is simply nowhere close to that of positive ones, and those that are used at startup are already captured.

Thus, I don't think there is enough evidence to justify increasing the cache size for negative numbers. And I have not encountered any frequent use cases in practice that could challenge this conclusion.
