-
-
Notifications
You must be signed in to change notification settings - Fork 8.2k
py: Faster qstr search #6896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
py: Faster qstr search #6896
Conversation
Do you have some measurements, preferably from different ports, showing how significant the improvement is in different situations? And if the search is faster what is the effect on creating and adding new qstr values at runtime? |
Unfortunately I can only provide measurements for ESP32.
That's x 21 performance gain. The firmware is Micropython 1.14 with LVGL native module.
There is no effect on creating and adding new qstr values at runtime because runtime qstr pools remain unsorted, and scanned in the same manner as today. |
443ef67
to
c1c2855
Compare
Not bad :)
Oops sorry missed that. Any particular reason for this? |
Yes.
|
Thanks for the explanation, makes sense. You might want to fix code formatting, some grammar issues in comments. |
c1c2855
to
ddb974e
Compare
Done. |
ddb974e
to
8532c21
Compare
Pending upstream PR: micropython#6896 This optimization is important because LVGL creates thousands of qstrs and 'import' can take seconds instead of milliseconds without better qstr search. Currently this PR is pending. Once it is merged or some other optimization that solves this problem is merged, this change could be reverted and replaced by the upstream solution. Today qstr implementation scans strings sequntially. In cases there are many strings this can become very inefficient. This change improves qstr search performance by using binary search in sorted qstr pools, when possible. This change introduces an option to create a sorted string pool, which is then searched by a binary search instead of sequential search. qstr pool can be either "sorted" or "unsorted", whereas the unsorted is searched sequentally as today. Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools. Currently runtime strings are unsorted. The constant string pools is split into two and a new pool is introduced, "special_const_pool". This is required because the first sequence of strings already requires special ordering therefore created unsorted, while the rest of the constants are generated sorted. qstr_find_strn searches strings in each pool. If the pool is sorted and larger than a threshold, it will be search using binary search instead of sequential search, significantly improving performance. (cherry picked from commit 8532c21)
Just to weigh in here, I've been shipping a MicroPython build for PicoSystem that includes this patch since October 2021 and - so far - have had no problems with it. The speedup (alongside doing some heinous things with builtins) is essential for our wide/flat API surface. |
Regarding indirection when accessing qstr’s see #8191 |
Running some benchmarks on this PR, it looks very good. On PYBv1.1 which has about 1600 static qstrs:
The new test there, On unix (which is always noisy) it's possible to also see an improvement, although unix only has about 600 static qstrs:
|
py/qstr.c
Outdated
|
||
// sequential search for the remaining strings | ||
for (const byte **q = pool->qstrs + low; q != pool->qstrs + high + 1; q++) { | ||
if (*q && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amirgon why does it check for *q
being non-zero? The original code did not need to do this in the linear search part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this check after rebase.
py/qstr.c
Outdated
@@ -171,14 +194,48 @@ STATIC qstr qstr_add(const byte *q_ptr) { | |||
return MP_STATE_VM(last_pool)->total_prev_len + MP_STATE_VM(last_pool)->len - 1; | |||
} | |||
|
|||
#define MP_QSTR_SEARCH_THRESHOLD 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How was this number chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rational was that sequential search over small sequences is more efficient than binary search because of caching, compiler optimizations etc. It's probably a different number for different architectures, compiler flags etc.
So this number was chosen simply because it was good enough on my tests.
py/qstr.c
Outdated
if (len > str_len) { | ||
len = str_len; | ||
} | ||
int cmp = memcmp(Q_GET_DATA(*q), str, str_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might work to sort on qstr hash (and length) instead of qstr data, and then this comparison here would just be a comparison (subtraction) of the 2 hashes, which would be a lot faster than a memcmp.
Once it reached a case of 2 identical hashes it would then need to use memcmp, or just fall through to a linear search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea so I gave it a shot.
It turned out to be more challenging than I expected:
As you can see, extra_coverage test fails. That's because it tests auto completion which prints qstrs in order(find_completions
, print_completions
). So once qstrs are sorted by hash, the autocompletion order is no longer alphabetical...
We have a few options:
- Sort the auto completions results (allocate ram and sort qstr range by name?)
- Keep autocompletions unsorted
- Give up sorting by hash and keep alphabetical qstr sort
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to run some benchmarks on sorting by hash vs string content. Do you have the original version which used string sorting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have the original version which used string sorting?
Yes. Only the last commit on this PR introduced hash sorting.
Simply checkout previous commmit 5808635
8532c21
to
c661055
Compare
Codecov Report
@@ Coverage Diff @@
## master #6896 +/- ##
==========================================
+ Coverage 98.45% 98.47% +0.02%
==========================================
Files 153 153
Lines 20153 20168 +15
==========================================
+ Hits 19842 19861 +19
+ Misses 311 307 -4
Continue to review full report at Codecov.
|
60f27df
to
cc808fe
Compare
I did performance tests for sorting on strings and sorting on hashes (on PYBv1.0). Going from baseline (current master) to sorting them on string value gives:
Going from baseline to sorting on hash:
Difference going from sorting on string to sorting on hash is small:
The Those tests are run with lots of frozen code on the PYBv1.0, and frozen qstrs are not sorted. So might be better to try again without frozen code. |
py/qstr.c
Outdated
} else { | ||
// avoid a rare (hash,len) collisions | ||
while (MP_UNLIKELY( | ||
pool->lengths[mid] != str_len || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this check on length. If the code gets here then cmp == 0
, which can only happen if the lengths are equal (and also the hashes are equal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And wouldn't it be simpler (smaller code, probably no effect on performance) to just break here and drop down into the sequential search?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right - this can be simplified.
Done in 3bc257c.
@dpgeorge did you try performance tests on PYB with different values of |
ebde816
to
04ba17b
Compare
I agree.
I'm not sure how to check this really.
Once the first set is sorted by hash, why do we care if the order is different between 1-byte and 2-byte qstr hashes?
I don't understand why we see a large code increase. But overall, where does the significant code increase come from? One thing we can consider in order to decrease memory usage is to remove qstr lengths array entirely. |
Today qstr implementation scans strings sequntially. In cases there are many strings this can become very inefficient. This change improves qstr search performance by using binary search in sorted qstr pools, when possible. This change introduces an option to create a sorted string pool, which is then searched by a binary search instead of sequential search. qstr pool can be either "sorted" or "unsorted", whereas the unsorted is searched sequentally as today. Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools. Currently runtime strings are unsorted. The constant string pools is split into two and a new pool is introduced, "special_const_pool". This is required because the first sequence of strings already requires special ordering therefore created unsorted, while the rest of the constants are generated sorted. qstr_find_ 10000 strn searches strings in each pool. If the pool is sorted and larger than a threshold, it will be search using binary search instead of sequential search, significantly improving performance. Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
Use Qstr class instead of tuple, where properties are calculated only when accessed. This is needed as preparation to using hash as the sort key instead the qstr string. It also makes the code more readable when referring to a qstr in py/makeqstrdata.py and tools/mpy-tool.py (for example, refer to q.order instead of q[0], or q.qstr instead of q[2]) Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
Instead of sorting by string, sort qstrs by (hash, len). This allows faster binary serach on qstr_find_strn, since it's faster to compare hashes than strings. A few strings needed to be moved to special string pool (QDEF0) because their qstr is assumed to be small (8 bit) on py/scope.c Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
In case cmp == 0 there could be a hash collision. To save program size Instead of checking for hash collisions, simply revert to linear search which also compares strings. Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
This fixes a corner case where both hash and lengths of two qstrs are equal. In such case, if the binary search lower bound is the second instance of the same hash, the search will miss the first instance. To overcome this, when lower bound hash equals the requested hash, we move the lower bound backwards to cover all previous hashes which are also equal to the requested hash. Also removed length comparison since it's no longer needed. Related: lvgl/lv_binding_micropython#224 Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
MP_QSTR_SEARCH_THRESHOLD doesn't provide improvement with sorted hashes. Remove MP_QSTR_SEARCH_THRESHOLD and simlify the binary search. Sequential search is still needed in case of hash collisions, or unsorted pools. Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
3570f0f
to
72ee3ae
Compare
Signed-off-by: Amir Gonnen <amirgonnen@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Thanks @amirgon for your work on this and for bringing it to our attention in the first place -- this has been merged via #12678 which contains essentially the same change to qstr.c (but via a fairly circuitous route, see #10758, where we investigated a bunch of other things). I think it's still worth investigating splitting the static pool, and I'd love to eventually find a way to remove hashes without the performance hit. |
I had a brief look at doing this and I think it would slow a lot of operations down, because there are many places in the code that query the qstr/str lengths. Eg all uses of |
What are all the qstr's for? How many are used in a normal application? Eradicating as many as possible might be a more sensible option? Wouldn't making the caller turn whatever they want into an ENUM would make more sense? |
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This provides a significant performance boost for qstr_find_strn, which is called a lot during parsing and loading of .mpy files, as well as interning of string objects (which happens in most string methods that return new strings). Also adds comments to explain the "static" qstrs. These are part of the .mpy ABI and avoid needing to duplicate string data for QSTRs known to already be in the firmware. The static pool isn't currently sorted, but in the future we could either split the static pool into the sorted regions, or in the next .mpy version just sort them. Based on initial work done by @amirgon in micropython#6896. This work was funded through GitHub Sponsors. Signed-off-by: Jim Mussared <jim.mussared@gmail.com> (cherry picked from commit 64c79a5)
Motivation: qstr search scales poorly - see https://forum.micropython.org/viewtopic.php?f=2&t=9748
In short, when there are many qstrs, performance is poor because of the way they are searched.
This affects, for example, load time of mpy modules since strings are replaced by qstrs during import.
On ESP32 it takes many seconds loading few mpy files in the presence of large native libraries (LVGL for example).
Today qstr implementation scans strings sequntially.
In cases there are many strings this can become very inefficient.
This change improves qstr search performance by using binary search in
sorted qstr pools, when possible.
This change introduces an option to create a sorted string p 8000 ool, which
is then searched by a binary search instead of sequential search.
qstr pool can be either "sorted" or "unsorted", whereas the unsorted is
searched sequentally as today.
Native modules (MP_ROM_QSTR) and frozen modules generate sorted pools.
Currently runtime strings are unsorted.
The constant string pools is split into two and a new pool is introduced,
"special_const_pool". This is required because the first sequence of
strings already requires special ordering therefore created unsorted,
while the rest of the constants are generated sorted.
qstr_find_strn searches strings in each pool. If the pool is sorted and
larger than a threshold, it will be search using binary search instead
of sequential search, significantly improving performance.