py: Faster qstr search by amirgon · Pull Request #6896 · micropython/micropython · GitHub

py: Faster qstr search #6896


Closed

amirgon wants to merge 7 commits into master from qstr_bsearch

Conversation

amirgon
Contributor
@amirgon amirgon commented Feb 13, 2021

Motivation: qstr search scales poorly - see https://forum.micropython.org/viewtopic.php?f=2&t=9748

In short, when there are many qstrs, performance is poor because of the way they are searched.
This affects, for example, load time of mpy modules since strings are replaced by qstrs during import.
On ESP32 it can take many seconds to load a few .mpy files in the presence of large native libraries (LVGL, for example).


Today the qstr implementation scans strings sequentially.
When there are many strings this can become very inefficient.
This change improves qstr search performance by using binary search in
sorted qstr pools, when possible.

This change introduces an option to create a sorted string pool, which
is then searched by a binary search instead of a sequential search.

A qstr pool can be either "sorted" or "unsorted", where an unsorted pool
is searched sequentially, as today.
Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools.
Currently runtime strings are unsorted.

The constant string pool is split in two and a new pool is introduced,
"special_const_pool". This is required because the first sequence of
strings already requires special ordering and is therefore created
unsorted, while the rest of the constants are generated sorted.

qstr_find_strn searches strings in each pool. If the pool is sorted and
larger than a threshold, it is searched using binary search instead
of sequential search, significantly improving performance.
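
The scheme described above can be sketched in plain Python (illustrative only; `Pool`, `find_qstr`, and `SEARCH_THRESHOLD` are invented names here, and the real implementation is C code in py/qstr.c operating on qstr_pool_t structs):

```python
import bisect

SEARCH_THRESHOLD = 10  # below this size a linear scan tends to win

class Pool:
    def __init__(self, strings, first_qstr, is_sorted):
        self.strings = strings        # interned string data
        self.first_qstr = first_qstr  # index of this pool's first qstr
        self.is_sorted = is_sorted    # True for compile-time pools

def find_qstr(pools, s):
    """Return the qstr (a stable integer index) of s, or None."""
    for pool in pools:
        if pool.is_sorted and len(pool.strings) > SEARCH_THRESHOLD:
            # sorted pool above the threshold: binary search
            i = bisect.bisect_left(pool.strings, s)
            if i < len(pool.strings) and pool.strings[i] == s:
                return pool.first_qstr + i
        else:
            # unsorted (e.g. runtime) pool: sequential search, as today
            for i, q in enumerate(pool.strings):
                if q == s:
                    return pool.first_qstr + i
    return None
```

A qstr here is simply `pool.first_qstr + i`, which is why a pool's internal order must never change once qstrs from it are in use.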

@stinos
Contributor
stinos commented Feb 14, 2021

Do you have some measurements, preferably from different ports, showing how significant the improvement is in different situations? And if the search is faster what is the effect on creating and adding new qstr values at runtime?

@amirgon
Contributor Author
amirgon commented Feb 14, 2021

Do you have some measurements, preferably from different ports, showing how significant the improvement is in different situations?

Unfortunately I can only provide measurements for ESP32.
On ESP32 @ 240MHz I'm importing a 10KB view.mpy file shortly after reset:

>>> ticks_ms();import view;ticks_ms()
  • Without this change the import completes in 3590ms
  • With this change the import completes in 169ms

That's a 21× performance gain.
Measuring this multiple times I'm getting the same results with up to 1 ms difference.

The firmware is Micropython 1.14 with LVGL native module.
LVGL is a large library. Looking at qstrdefs.generated.h there are 4450 QDEFs. It's clear that scanning all these strings sequentially for each and every str-->qstr conversion is inefficient, particularly since the const qstr pool is stored on flash.

And if the search is faster what is the effect on creating and adding new qstr values at runtime?

There is no effect on creating and adding new qstr values at runtime, because runtime qstr pools remain unsorted and are scanned in the same manner as today.
Only qstr pools generated at compile time are sorted, so the benefit becomes significant when scaling native and frozen modules.

@stinos
Contributor
stinos commented Feb 15, 2021

That's a 21× performance gain.

Not bad :)

runtime qstr pools remain unsorted

Oops sorry missed that. Any particular reason for this?

@amirgon
Contributor Author
amirgon commented Feb 15, 2021

runtime qstr pools remain unsorted

Oops sorry missed that. Any particular reason for this?

Yes.

  • qstrs are indices into qstr pools. So once a pool is in use (the code refers to specific qstrs in the pool), we can no longer change its sort order otherwise existing qstrs used in the code would no longer be correct.
    To overcome this we would either need to add an indirection level when accessing qstrs (that would cost performance and RAM) or select a different solution such as hash-table instead of sorted array. Hash-table is a valid alternative, but has its disadvantages (more costly in RAM, harder to tune optimally, more complex to implement).
  • From a practical point of view, when scaling up I think we can expect the mass of qstrs to come from native or frozen modules, and the minority to be runtime qstrs. So most of the benefit would come from improving access to native/frozen qstrs.

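The first point, that a qstr is a bare index into the pool, can be seen with a toy model (illustrative Python only; the names are invented):

```python
# A qstr is a bare integer index into the pool; compiled code stores
# these integers directly, so the pool's order must never change.
pool = ["print", "len", "range"]  # runtime interning order (unsorted)
q_len = pool.index("len")         # q_len == 1, now baked into "bytecode"

# If the pool were re-sorted after q_len was handed out, the stored
# index would resolve to a different string:
pool_sorted = sorted(pool)        # ["len", "print", "range"]
assert pool_sorted[q_len] == "print"  # no longer "len"
```

Avoiding this would need either an indirection table mapping old indices to sorted positions, or a different structure such as a hash table, as the comment explains.
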
@stinos
Contributor
stinos commented Feb 15, 2021

Thanks for the explanation, makes sense. You might want to fix code formatting and some grammar issues in comments.

@amirgon
Contributor Author
amirgon commented Feb 15, 2021

You might want to fix code formatting and some grammar issues in comments.

Done.

amirgon added a commit to lvgl/lv_micropython that referenced this pull request Feb 27, 2021
Pending upstream PR: micropython#6896

This optimization is important because LVGL creates thousands
of qstrs and 'import' can take seconds instead of milliseconds without
better qstr search.
Currently this PR is pending. Once it is merged or some other
optimization that solves this problem is merged, this change
could be reverted and replaced by the upstream solution.


(cherry picked from commit 8532c21)
@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Nov 29, 2021
@Gadgetoid
Contributor

Just to weigh in here, I've been shipping a MicroPython build for PicoSystem that includes this patch since October 2021 and - so far - have had no problems with it. The speedup (alongside doing some heinous things with builtins) is essential for our wide/flat API surface.

@thyrrestrup

Regarding indirection when accessing qstrs, see #8191

@dpgeorge
Member
dpgeorge commented Feb 9, 2022

Running some benchmarks on this PR, it looks very good. On PYBv1.1, which has about 1600 static qstrs:

diff of scores (higher is better)
N=100 M=100                  pyb-p0 ->     pyb-p1         diff      diff% (error%)
bm_chaos.py                  369.56 ->     374.23 :      +4.67 =  +1.264% (+/-0.02%)
bm_fannkuch.py                77.51 ->      78.45 :      +0.94 =  +1.213% (+/-0.00%)
bm_fft.py                   2558.76 ->    2577.15 :     +18.39 =  +0.719% (+/-0.00%)
bm_float.py                 5922.05 ->    6143.30 :    +221.25 =  +3.736% (+/-0.02%)
bm_hexiom.py                  47.96 ->      49.77 :      +1.81 =  +3.774% (+/-0.00%)
bm_nqueens.py               4446.35 ->    4457.18 :     +10.83 =  +0.244% (+/-0.00%)
bm_pidigits.py               647.41 ->     650.75 :      +3.34 =  +0.516% (+/-0.15%)
core_qstr.py                5020.31 ->   16570.01 :  +11549.70 = +230.059% (+/-0.01%)
misc_aes.py                  421.88 ->     427.24 :      +5.36 =  +1.271% (+/-0.01%)
misc_mandel.py              3566.58 ->    3584.88 :     +18.30 =  +0.513% (+/-0.00%)
misc_pystone.py             2487.97 ->    2583.21 :     +95.24 =  +3.828% (+/-0.02%)
misc_raytrace.py             376.64 ->     383.84 :      +7.20 =  +1.912% (+/-0.02%)
viper_call0.py               576.54 ->     576.73 :      +0.19 =  +0.033% (+/-0.04%)
viper_call1a.py              550.05 ->     550.38 :      +0.33 =  +0.060% (+/-0.04%)
viper_call1b.py              435.39 ->     435.95 :      +0.56 =  +0.129% (+/-0.19%)
viper_call1c.py              441.49 ->     440.80 :      -0.69 =  -0.156% (+/-0.21%)
viper_call2a.py              535.74 ->     536.31 :      +0.57 =  +0.106% (+/-0.12%)
viper_call2b.py              381.40 ->     381.91 :      +0.51 =  +0.134% (+/-0.15%)

The new test there, core_qstr.py, effectively calls qstr_find_strn() many times in a loop. And it's significantly faster with this PR. Other benchmarks are not impacted, which is good.

On unix (which is always noisy) it's possible to also see an improvement, although unix only has about 600 static qstrs:

diff of scores (higher is better)
N=5000 M=100                unix-p0 ->    unix-p1         diff      diff% (error%)
bm_fannkuch.py                10.52 ->      11.00 :      +0.48 =  +4.563% (+/-4.87%)
bm_fft.py                 221486.06 ->  222913.26 :   +1427.20 =  +0.644% (+/-5.24%)
bm_float.py               446655.34 ->  452957.42 :   +6302.08 =  +1.411% (+/-4.16%)
bm_hexiom.py                3531.20 ->    3549.03 :     +17.83 =  +0.505% (+/-2.11%)
bm_nqueens.py             442323.78 ->  414218.36 :  -28105.42 =  -6.354% (+/-8.21%)
bm_pidigits.py             48953.40 ->   50719.93 :   +1766.53 =  +3.609% (+/-7.57%)
core_qstr.py              940349.56 -> 1432698.78 : +492349.22 = +52.358% (+/-7.42%)
misc_aes.py                33653.40 ->   32860.59 :    -792.81 =  -2.356% (+/-7.40%)
misc_mandel.py            242690.30 ->  238764.20 :   -3926.10 =  -1.618% (+/-6.17%)
misc_pystone.py           139726.47 ->  135769.40 :   -3957.07 =  -2.832% (+/-7.61%)
misc_raytrace.py           37689.06 ->   36494.55 :   -1194.51 =  -3.169% (+/-7.47%)
viper_call0.py             34051.60 ->   34323.85 :    +272.25 =  +0.800% (+/-6.95%)
viper_call1a.py            31946.89 ->   33872.77 :   +1925.88 =  +6.028% (+/-7.69%)
viper_call1b.py            27876.34 ->   26771.29 :   -1105.05 =  -3.964% (+/-6.14%)
viper_call1c.py            26162.97 ->   26561.40 :    +398.43 =  +1.523% (+/-10.81%)
viper_call2a.py            27179.72 ->   33407.50 :   +6227.78 = +22.913% (+/-12.60%)
viper_call2b.py            19137.07 ->   24199.32 :   +5062.25 = +26.453% (+/-8.79%)

py/qstr.c Outdated

// sequential search for the remaining strings
for (const byte **q = pool->qstrs + low; q != pool->qstrs + high + 1; q++) {
    if (*q &&
Member

@amirgon why does it check for *q being non-zero? The original code did not need to do this in the linear search part.

Contributor Author

I removed this check after rebase.

py/qstr.c Outdated
@@ -171,14 +194,48 @@ STATIC qstr qstr_add(const byte *q_ptr) {
return MP_STATE_VM(last_pool)->total_prev_len + MP_STATE_VM(last_pool)->len - 1;
}

#define MP_QSTR_SEARCH_THRESHOLD 10
Member

How was this number chosen?

Contributor Author

The rationale was that sequential search over small sequences is more efficient than binary search, because of caching, compiler optimizations, etc. It's probably a different number for different architectures, compiler flags, etc.

So this number was chosen simply because it was good enough in my tests.
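
That trade-off can be made concrete with a toy hybrid search (illustrative Python only; the constant 10 is the threshold from this PR, and both branches return the same result, so the threshold only affects speed):

```python
import bisect

MP_QSTR_SEARCH_THRESHOLD = 10  # empirically chosen in this PR

def hybrid_find(sorted_strings, s):
    """Return the index of s in sorted_strings, or -1."""
    if len(sorted_strings) <= MP_QSTR_SEARCH_THRESHOLD:
        # Small pool: a linear scan is cache-friendly and avoids the
        # branchy halving, so it is often faster in practice.
        for i, q in enumerate(sorted_strings):
            if q == s:
                return i
        return -1
    # Large pool: binary search, O(log n) comparisons.
    i = bisect.bisect_left(sorted_strings, s)
    return i if i < len(sorted_strings) and sorted_strings[i] == s else -1
```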

py/qstr.c Outdated
if (len > str_len) {
    len = str_len;
}
int cmp = memcmp(Q_GET_DATA(*q), str, str_len);
Member

It might work to sort on qstr hash (and length) instead of qstr data, and then this comparison here would just be a comparison (subtraction) of the 2 hashes, which would be a lot faster than a memcmp.

Once it reached a case of 2 identical hashes it would then need to use memcmp, or just fall through to a linear search.
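
The suggestion amounts to sorting on a (hash, len) key so the binary search compares small integers, only touching string data on a collision (illustrative Python only; `toy_hash` stands in for MicroPython's internal qstr hash):

```python
import bisect

def toy_hash(s):
    # stand-in for the real qstr hash; any deterministic hash works here
    h = 5381
    for b in s.encode():
        h = (h * 33 + b) & 0xFFFF
    return h

def make_sorted_pool(strings):
    # sort by (hash, len), with the string itself only as a tie-breaker
    return sorted(strings, key=lambda s: (toy_hash(s), len(s), s))

def find(pool, s):
    key = (toy_hash(s), len(s))
    keys = [(toy_hash(q), len(q)) for q in pool]  # precomputed in C
    i = bisect.bisect_left(keys, key)
    # on a (hash, len) collision, compare actual string data over the
    # run of equal keys (i.e. fall through to a short linear search)
    while i < len(pool) and keys[i] == key:
        if pool[i] == s:
            return i
        i += 1
    return -1
```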

Contributor Author
@amirgon amirgon Feb 24, 2022

This is a good idea so I gave it a shot.

It turned out to be more challenging than I expected:
As you can see, the extra_coverage test fails. That's because it tests auto-completion, which prints qstrs in order (find_completions, print_completions). So once qstrs are sorted by hash, the auto-completion order is no longer alphabetical...

We have a few options:

  • Sort the auto-completion results (allocate RAM and sort the qstr range by name?)
  • Keep autocompletions unsorted
  • Give up sorting by hash and keep alphabetical qstr sort

What do you think?
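
The first option might look roughly like this (illustrative Python sketch only; the real find_completions/print_completions are C code in MicroPython's REPL helpers):

```python
def complete(prefix, qstr_pool):
    # With hash-sorted pools, matching qstrs are no longer adjacent or
    # alphabetical, so collect all matches and sort them for display.
    # The sort costs a temporary allocation, which is the downside
    # discussed above.
    return sorted(q for q in qstr_pool if q.startswith(prefix))
```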

Member

Would be good to run some benchmarks on sorting by hash vs string content. Do you have the original version which used string sorting?

Contributor Author

Do you have the original version which used string sorting?

Yes. Only the last commit on this PR introduced hash sorting.
Simply check out the previous commit 5808635.

@codecov-commenter
codecov-commenter commented Feb 19, 2022

Codecov Report

Merging #6896 (b314530) into master (ff9c708) will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #6896      +/-   ##
==========================================
+ Coverage   98.45%   98.47%   +0.02%     
==========================================
  Files         153      153              
  Lines       20153    20168      +15     
==========================================
+ Hits        19842    19861      +19     
+ Misses        311      307       -4     
Impacted Files Coverage Δ
py/qstr.c 95.65% <100.00%> (+0.44%) ⬆️
py/bc.c 92.23% <0.00%> (+3.88%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@amirgon amirgon force-pushed the qstr_bsearch branch 5 times, most recently from 60f27df to cc808fe Compare February 24, 2022 21:45
@dpgeorge
Member

I did performance tests for sorting on strings and sorting on hashes (on PYBv1.0).

Going from baseline (current master) to sorting them on string value gives:

diff of scores (higher is better)
N=100 M=100                pyb-baseline -> pyb-sort-str         diff      diff% (error%)
bm_chaos.py                    357.40 ->     352.80 :      -4.60 =  -1.287% (+/-0.02%)
bm_fannkuch.py                  77.49 ->      77.40 :      -0.09 =  -0.116% (+/-0.01%)
bm_fft.py                     2539.26 ->    2525.13 :     -14.13 =  -0.556% (+/-0.00%)
bm_float.py                   5908.33 ->    5872.71 :     -35.62 =  -0.603% (+/-0.01%)
bm_hexiom.py                    47.93 ->      48.77 :      +0.84 =  +1.753% (+/-0.00%)
bm_nqueens.py                 4459.94 ->    4433.80 :     -26.14 =  -0.586% (+/-0.00%)
bm_pidigits.py                 644.88 ->     640.95 :      -3.93 =  -0.609% (+/-0.27%)
core_import_mpy_multi.py       581.49 ->     759.37 :    +177.88 = +30.590% (+/-0.00%)
core_import_mpy_single.py       67.16 ->     105.41 :     +38.25 = +56.954% (+/-0.01%)
core_qstr.py                    64.12 ->     206.44 :    +142.32 = +221.959% (+/-0.01%)
core_yield_from.py             354.50 ->     353.96 :      -0.54 =  -0.152% (+/-0.00%)
misc_aes.py                    405.59 ->     410.03 :      +4.44 =  +1.095% (+/-0.00%)
misc_mandel.py                3416.48 ->    3387.86 :     -28.62 =  -0.838% (+/-0.00%)
misc_pystone.py               2405.57 ->    2419.06 :     +13.49 =  +0.561% (+/-0.00%)
misc_raytrace.py               374.01 ->     368.50 :      -5.51 =  -1.473% (+/-0.00%)

Going from baseline to sorting on hash:

diff of scores (higher is better)
N=100 M=100                pyb-baseline -> pyb-sort-hash         diff      diff% (error%)
bm_chaos.py                    357.40 ->     359.03 :      +1.63 =  +0.456% (+/-0.03%)
bm_fannkuch.py                  77.49 ->      76.28 :      -1.21 =  -1.561% (+/-0.01%)
bm_fft.py                     2539.26 ->    2523.73 :     -15.53 =  -0.612% (+/-0.00%)
bm_float.py                   5908.33 ->    5910.11 :      +1.78 =  +0.030% (+/-0.01%)
bm_hexiom.py                    47.93 ->      48.85 :      +0.92 =  +1.919% (+/-0.00%)
bm_nqueens.py                 4459.94 ->    4421.59 :     -38.35 =  -0.860% (+/-0.00%)
bm_pidigits.py                 644.88 ->     644.49 :      -0.39 =  -0.060% (+/-0.21%)
core_import_mpy_multi.py       581.49 ->     763.36 :    +181.87 = +31.277% (+/-0.00%)
core_import_mpy_single.py       67.16 ->     106.25 :     +39.09 = +58.204% (+/-0.01%)
core_qstr.py                    64.12 ->     212.31 :    +148.19 = +231.114% (+/-0.00%)
core_yield_from.py             354.50 ->     354.05 :      -0.45 =  -0.127% (+/-0.00%)
misc_aes.py                    405.59 ->     404.15 :      -1.44 =  -0.355% (+/-0.00%)
misc_mandel.py                3416.48 ->    3397.47 :     -19.01 =  -0.556% (+/-0.00%)
misc_pystone.py               2405.57 ->    2446.53 :     +40.96 =  +1.703% (+/-0.00%)
misc_raytrace.py               374.01 ->     369.21 :      -4.80 =  -1.283% (+/-0.00%)

Difference going from sorting on string to sorting on hash is small:

diff of scores (higher is better)
N=100 M=100                pyb-sort-str -> pyb-sort-hash         diff      diff% (error%)
bm_chaos.py                    352.80 ->     359.03 :      +6.23 =  +1.766% (+/-0.01%)
bm_fannkuch.py                  77.40 ->      76.28 :      -1.12 =  -1.447% (+/-0.00%)
bm_fft.py                     2525.13 ->    2523.73 :      -1.40 =  -0.055% (+/-0.00%)
bm_float.py                   5872.71 ->    5910.11 :     +37.40 =  +0.637% (+/-0.01%)
bm_hexiom.py                    48.77 ->      48.85 :      +0.08 =  +0.164% (+/-0.00%)
bm_nqueens.py                 4433.80 ->    4421.59 :     -12.21 =  -0.275% (+/-0.00%)
bm_pidigits.py                 640.95 ->     644.49 :      +3.54 =  +0.552% (+/-0.25%)
core_import_mpy_multi.py       759.37 ->     763.36 :      +3.99 =  +0.525% (+/-0.00%)
core_import_mpy_single.py      105.41 ->     106.25 :      +0.84 =  +0.797% (+/-0.01%)
core_qstr.py                   206.44 ->     212.31 :      +5.87 =  +2.843% (+/-0.00%)
core_yield_from.py             353.96 ->     354.05 :      +0.09 =  +0.025% (+/-0.00%)
misc_aes.py                    410.03 ->     404.15 :      -5.88 =  -1.434% (+/-0.00%)
misc_mandel.py                3387.86 ->    3397.47 :      +9.61 =  +0.284% (+/-0.00%)
misc_pystone.py               2419.06 ->    2446.53 :     +27.47 =  +1.136% (+/-0.00%)
misc_raytrace.py               368.50 ->     369.21 :      +0.71 =  +0.193% (+/-0.00%)

The core_qstr.py test is intended to really stress test the qstr_find_strn() function, so I'd say these results give an accurate picture.

Those tests are run with lots of frozen code on the PYBv1.0, and frozen qstrs are not sorted. So it might be better to try again without frozen code.

py/qstr.c Outdated
} else {
    // avoid rare (hash,len) collisions
    while (MP_UNLIKELY(
        pool->lengths[mid] != str_len ||
Member

I don't think you need this check on length. If the code gets here then cmp == 0, which can only happen if the lengths are equal (and also the hashes are equal).

Member

And wouldn't it be simpler (smaller code, probably no effect on performance) to just break here and drop down into the sequential search?

Contributor Author

You are right - this can be simplified.
Done in 3bc257c.

@amirgon
Contributor Author
amirgon commented Feb 25, 2022

I did performance tests for sorting on strings and sorting on hashes (on PYBv1.0).

@dpgeorge did you try performance tests on PYB with different values of MP_QSTR_SEARCH_THRESHOLD?

@amirgon amirgon force-pushed the qstr_bsearch branch 4 times, most recently from ebde816 to 04ba17b Compare August 9, 2022 20:50
@amirgon
Contributor Author
amirgon commented Aug 9, 2022

@dpgeorge

It's definitely worth sorting all the qstrs, including the first set (the pre-defined ones, and the ones that need to fit in 1 byte).

I agree.
Done on 311b33c.

Sorting the first set of qstrs via hash will move MP_QSTR_ away from the second position. I'm not 100% sure this is OK, will need to be checked if nothing relies on that qstr having value 1.

I'm not sure how to check this really.

Sorting the first set of qstrs via hash will need builds that use 1-byte qstr hashes to use the most-significant-byte of the hash, so that the order of the pre-defined qstrs is the same regardless of whether 1-byte or 2-byte qstr hashes are used.

Once the first set is sorted by hash, why do we care if the order is different between 1-byte and 2-byte qstr hashes?

The code size increase is quite large... but I think there might be a few opportunities to reduce that, eg shrink the size of the qstr_pool_t struct.

I don't understand why we see a large code increase.
The code change is rather small, the increase per pool is small and there are usually small number of pools. We could even keep qstr_pool_t the same size if we encoded sorted in a bit of one of the other fields.

But overall, where does the significant code increase come from?
Maybe the compiler is unrolling the loop? But when code size is important we would compile with -Os and prevent loop unrolling.
Did you measure the code increase when compiling with -Os?


One thing we can consider in order to decrease memory usage is to remove the qstr lengths array entirely.
Checking a qstr's length would be a little more expensive (count until the null termination), but the memory saving could be significant with a large number of strings.

Today the qstr implementation scans strings sequentially.
When there are many strings this can become very inefficient.
This change improves qstr search performance by using binary search in
sorted qstr pools, when possible.

This change introduces an option to create a sorted string pool, which
is then searched by a binary search instead of a sequential search.

A qstr pool can be either "sorted" or "unsorted", where an unsorted pool
is searched sequentially, as today.
Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools.
Currently runtime strings are unsorted.

The constant string pool is split in two and a new pool is introduced,
"special_const_pool". This is required because the first sequence of
strings already requires special ordering and is therefore created
unsorted, while the rest of the constants are generated sorted.

qstr_find_strn searches strings in each pool. If the pool is sorted and
larger than a threshold, it is searched using binary search instead
of sequential search, significantly improving performance.

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
Use Qstr class instead of tuple, where properties are calculated only when
accessed.
This is needed as preparation for using the hash as the sort key instead
of the qstr string. It also makes the code more readable when referring
to a qstr in py/makeqstrdata.py and tools/mpy-tool.py (for example,
q.order instead of q[0], or q.qstr instead of q[2]).

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
Instead of sorting by string, sort qstrs by (hash, len).
This allows faster binary search in qstr_find_strn, since it's faster
to compare hashes than strings.
A few strings needed to be moved to the special string pool (QDEF0)
because their qstr is assumed to be small (8 bit) in py/scope.c.

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
In case cmp == 0 there could be a hash collision. To save program size,
instead of checking for hash collisions, simply revert to linear search,
which also compares strings.

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
This fixes a corner case where both the hash and the lengths of two
qstrs are equal. In such a case, if the binary search lower bound is the
second instance of the same hash, the search will miss the first
instance. To overcome this, when the lower bound hash equals the
requested hash, we move the lower bound backwards to cover all previous
hashes which are also equal to the requested hash.
Also removed the length comparison since it's no longer needed.

Related: lvgl/lv_binding_micropython#224

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
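
The corner case and the backward adjustment described in that commit can be sketched as follows (illustrative Python only; pool entries here are (hash, string) pairs sorted by hash, unlike the real C layout):

```python
def find_in_hash_sorted(pool, s, s_hash):
    """pool: list of (hash, string) pairs sorted by hash."""
    lo, hi = 0, len(pool) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if pool[mid][0] < s_hash:
            lo = mid + 1
        elif pool[mid][0] > s_hash:
            hi = mid - 1
        else:
            lo = mid  # hit an entry with the right hash, maybe not the first
            break
    # lo may point at the second (or later) instance of this hash, so
    # move the lower bound backwards over all preceding equal hashes
    while lo > 0 and pool[lo - 1][0] == s_hash:
        lo -= 1
    # sequential scan over the (usually tiny) run of equal hashes,
    # comparing the actual strings
    while lo < len(pool) and pool[lo][0] == s_hash:
        if pool[lo][1] == s:
            return lo
        lo += 1
    return -1
```

Without the backward walk, searching for "ab" in [(5, "ab"), (5, "cd"), (6, "x")] could land the lower bound on index 1 and miss index 0.
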
MP_QSTR_SEARCH_THRESHOLD doesn't provide an improvement with sorted hashes.
Remove MP_QSTR_SEARCH_THRESHOLD and simplify the binary search.

Sequential search is still needed in case of hash collisions, or
unsorted pools.

Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
@amirgon amirgon force-pushed the qstr_bsearch branch 3 times, most recently from 3570f0f to 72ee3ae Compare August 13, 2022 22:07
Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
dpgeorge pushed a commit to jimmo/micropython that referenced this pull request Oct 29, 2023
This provides a significant performance boost for qstr_find_strn, which is
called a lot during parsing and loading of .mpy files, as well as interning
of string objects (which happens in most string methods that return new
strings).

Also adds comments to explain the "static" qstrs.  These are part of the
.mpy ABI and avoid needing to duplicate string data for QSTRs known to
already be in the firmware.  The static pool isn't currently sorted, but in
the future we could either split the static pool into the sorted regions,
or in the next .mpy version just sort them.

Based on initial work done by @amirgon in micropython#6896.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
dpgeorge pushed a commit to jimmo/micropython that referenced this pull request Oct 30, 2023
@jimmo
Member
jimmo commented Oct 30, 2023

Thanks @amirgon for your work on this and for bringing it to our attention in the first place -- this has been merged via #12678 which contains essentially the same change to qstr.c (but via a fairly circuitous route, see #10758, where we investigated a bunch of other things). I think it's still worth investigating splitting the static pool, and I'd love to eventually find a way to remove hashes without the performance hit.

@jimmo jimmo closed this Oct 30, 2023
@dpgeorge
Member

One thing we can consider in order to decrease memory usage is to remove qstr lengths array entirely.

I had a brief look at doing this and I think it would slow a lot of operations down, because there are many places in the code that query the qstr/str lengths, e.g. all uses of GET_STR_DATA_LEN(), which would need to use strlen() to compute the qstr length each time they were called.

@gitcnd
gitcnd commented Nov 11, 2023

What are all the qstrs for? How many are used in a normal application? Might eradicating as many as possible be a more sensible option? Wouldn't making the caller turn whatever they want into an enum make more sense?

graeme-winter pushed a commit to winter-special-projects/micropython that referenced this pull request Sep 21, 2024
romanz pushed a commit to trezor/micropython that referenced this pull request Apr 9, 2025
romanz pushed a commit to trezor/micropython that referenced this pull request Apr 9, 2025
romanz pushed a commit to trezor/micropython that referenced this pull request Apr 14, 2025
(cherry picked from commit 64c79a5)