Improve performances of `fnmatch.translate` · Issue #122288 · python/cpython · GitHub

Improve performances of fnmatch.translate #122288


Closed
picnixz opened this issue Jul 25, 2024 · 7 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@picnixz
Member
picnixz commented Jul 25, 2024

Feature or enhancement

Proposal:

I implemented fnmatch.translate and fnmatch.filter in C in #121446, but the performance gains for fnmatch.filter are less pronounced than those for fnmatch.translate. While I believe that the 2x improvement brought by the C implementation is worthwhile for fnmatch.translate, it comes at a high maintenance cost. Instead, I translated (no pun intended) the implementation into Python and obtained the following timings (PGO build):

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 3.99 us: 1.53x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.39 us               | 4.07 us: 1.57x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.24 us               | 1.51 us: 1.49x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.97 us               | 1.12 us: 1.76x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 1.21 us: 2.48x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 3.33 us: 1.62x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.71x faster          |
+------------------------------------------+-----------------------+-----------------------+

It's not as good as the benchmarks reported in #121445, but I think this could be considered. Note that the improvements come from a new (but equivalent) algorithm in the translation phase. Currently, the translation algorithm:

  • Does not compress consecutive '*' into one within the same loop iteration; it uses a sentinel object to indicate a wildcard instead.
  • Appends re.escape(c) to a list for every non-special character (i.e., anything that does not start a ?, * or [ construct).
  • In a second phase, merges the groups delimited by the sentinels in order to produce the final regex.

Our implementation (Barney's and mine) instead does the following:

  • Caches re.escape so that the lookup is faster.
  • Remembers the positions of '*' in the split instead of using a sentinel object.
    That way, parts can easily be merged during the second phase (technically, it'd be possible to do it in one phase, but two passes make it clearer).
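The two-pass idea can be sketched as follows. This is a simplified illustration, not the actual patch: it ignores `[...]` character classes and joins the chunks with a plain `.*` in the second pass, where the real implementation emits more careful groups.

```python
import re

def translate_sketch(pat):
    """Simplified two-pass sketch; ignores '[...]' classes on purpose."""
    # Pass 1: split the pattern into chunks, remembering the *positions*
    # of '*' chunks instead of storing a sentinel object.
    parts, star_positions = [], []
    literal = []                      # pending literal characters
    i, n = 0, len(pat)
    while i < n:
        c = pat[i]
        i += 1
        if c == '*':
            if literal:
                parts.append(re.escape(''.join(literal)))
                literal.clear()
            star_positions.append(len(parts))
            parts.append('*')
            while i < n and pat[i] == '*':   # compress runs of '*'
                i += 1
        elif c == '?':
            if literal:
                parts.append(re.escape(''.join(literal)))
                literal.clear()
            parts.append('.')
        else:
            literal.append(c)
    if literal:
        parts.append(re.escape(''.join(literal)))
    # Pass 2: merge the chunks between the recorded star positions.
    if not star_positions:
        return rf"(?s:{''.join(parts)})\Z"
    res = ''.join(parts[:star_positions[0]])
    boundaries = star_positions[1:] + [len(parts)]
    for prev, nxt in zip(star_positions, boundaries):
        res += '.*' + ''.join(parts[prev + 1:nxt])
    return rf"(?s:{res})\Z"
```

For example, `translate_sketch('a*b?c')` yields `(?s:a.*b.c)\Z`; the real fnmatch.translate additionally handles character classes and guards inner stars against pathological backtracking.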

Has this already been discussed elsewhere?

No response

Links to previous discussion of this feature:

No response

Linked PRs

@picnixz picnixz added type-feature A feature request or enhancement performance Performance or resource usage labels Jul 25, 2024
@picnixz picnixz added the stdlib Python modules in the Lib dir label Jul 25, 2024
@barneygale
Contributor
  • Instead of having many chunks with 1 or 2 characters, I try to have as many chunks with strings that are already formatted, e.g., ["abc", "*"] instead of ["a", "b", "c", SENTINEL].

The current implementation does this to minimise string concatenation, i.e. it defers join()ing until it's unavoidable. This should be faster, so I'm surprised you're getting different results! It's quite possible I've misunderstood your patch, though.

@picnixz
Member Author
picnixz commented Aug 15, 2024

Thanks for noticing this. I think there are two things to consider:

Calling re.escape(c) at every loop iteration (most of the time we don't have *?[ but rather "normal" characters to escape) is probably more costly than escaping in batches. Each call performs an isinstance check and a call to str.translate, and looking at the implementation of str.translate, every call allocates a writer on the heap, which slows down the entire algorithm. In particular, if we need to call res.append(re.escape(c)) many times in a row (which is what happens in practice), then it's probably faster to do res.append(re.escape(''.join(x))) or res.extend(re.escape(c) for c in x).

My benchmarks show that res.append(re.escape(''.join(x))) is faster than res.extend(re.escape(c) for c in x). I suspect this is because of the per-call overhead of re.escape.
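The difference between the two calling patterns can be reproduced with a small micro-benchmark (timings vary by machine; the pattern string is just an example taken from the tables above):

```python
import re
import timeit

chars = list('man/man1/bash.1')   # a typical run of pending literal chars

def one_join():
    # one re.escape call over the joined pending characters
    return [re.escape(''.join(chars))]

def per_char():
    # one re.escape call per character
    res = []
    res.extend(re.escape(c) for c in chars)
    return res

# Both variants produce the same escaped text...
assert ''.join(one_join()) == ''.join(per_char()) == re.escape('man/man1/bash.1')

# ...but the per-character variant pays re.escape's isinstance check and
# str.translate writer allocation once per character:
t_join = timeit.timeit(one_join, number=10_000)
t_char = timeit.timeit(per_char, number=10_000)
```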

Now, there is one case where the new algorithm might be slower: when there is only a single character in the pending list of characters to escape, namely x = [c]. You can see that the improvements are less pronounced for the case a**?**cd**?**??k***, but it is still faster. I've added another benchmark to illustrate an extreme case:

+---------------------------------+-----------------------+-----------------------+
| Benchmark                       | fnmatch-translate-ref | fnmatch-translate-py  |
+=================================+=======================+=======================+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** | 5.09 us               | 4.95 us: 1.03x faster |
+---------------------------------+-----------------------+-----------------------+

Here the timings are almost identical, which is what I would have expected. I've also added a comment:

# Handling '?' one at a time seems to be more efficient
# even if there are consecutive '?' that could have
# been written directly.

The reason why it's more efficient is that the fast path is actually rarer than the slow path, so looping over consecutive '?' is heuristically less efficient. The same analysis could be applied to consecutive '*', so I compared two implementations: the one in the PR, and one that keeps the treatment of '*' by not advancing the index in the if c == '*' branch but guarding the add(STAR) with another if. Benchmarks suggest that using a while to skip over successive '*' is slightly faster in the other cases (the case a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** is obviously slower with a while loop, but not by much):

With an inner while-loop to skip over consecutive '*' (PR implementation)

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.06 us               | 4.92 us: 1.23x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.37 us               | 5.06 us: 1.26x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.21 us               | 2.00 us: 1.10x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.99 us               | 1.56 us: 1.28x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.04 us               | 1.50 us: 2.03x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 5.25 us: 1.03x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.29x faster          |
+------------------------------------------+-----------------------+-----------------------+
With a single if to skip over consecutive '*' (same as the ref implementation)

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 5.13 us: 1.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.36 us               | 5.23 us: 1.22x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 2.01 us               | 1.64 us: 1.22x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.10 us               | 1.48 us: 2.09x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.44 us               | 5.14 us: 1.06x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.26x faster          |
+------------------------------------------+-----------------------+-----------------------+

Benchmark hidden because not significant (1): a**?**cd**?**??k***

One reason is that the inner while saves a check on whether the pending characters need to be flushed, and lets us jump straight to the next character that is not '*'.
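The two variants under discussion can be illustrated with a minimal scanning loop (function names are illustrative, not the PR's code):

```python
def scan_with_while(pat):
    """Skip runs of '*' with an inner while (the PR's approach)."""
    out, i, n = [], 0, len(pat)
    while i < n:
        c = pat[i]
        i += 1
        if c == '*':
            out.append('*')                 # emit one star for the whole run
            while i < n and pat[i] == '*':  # jump straight past the run
                i += 1
        else:
            out.append(c)
    return out

def scan_with_if(pat):
    """Guard each '*' with an if against the previous part (ref approach)."""
    out, i, n = [], 0, len(pat)
    while i < n:
        c = pat[i]
        i += 1
        if c == '*':
            if not out or out[-1] != '*':   # re-checked on every single '*'
                out.append('*')
        else:
            out.append(c)
    return out

# Both produce the same compressed chunk sequence.
assert scan_with_while('a***b**') == scan_with_if('a***b**') == ['a', '*', 'b', '*']
```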

@barneygale
Contributor

Thank you very much for the explanation!

Q: could the cost of re.escape() be ameliorated with an lru_cache() instead? e.g.:

diff --git a/Lib/fnmatch.py b/Lib/fnmatch.py
index 73acb1fe8d4..dc9bf122e39 100644
--- a/Lib/fnmatch.py
+++ b/Lib/fnmatch.py
@@ -149,11 +149,14 @@ def _translate(pat, STAR, QUESTION_MARK):
                         stuff = '\\' + stuff
                     add(f'[{stuff}]')
         else:
-            add(re.escape(c))
+            add(_escape_char(c))
     assert i == n
     return res
 
 
+_escape_char = functools.lru_cache(maxsize=32768)(re.escape)
+
+
 def _join_translated_parts(inp, STAR):
     # Deal with STARs.
     res = []

@picnixz
Member Author
picnixz commented Aug 17, 2024

It could be; I didn't check this alternative. We could also make it even faster with a simple dictionary lookup, since the implementation is private and only strings are supposed to be passed. Alternatively, we could inline the escaping directly. I'll try your suggestion today.

@picnixz
Member Author
picnixz commented Aug 17, 2024

Huh, so your approach is actually the best. (The cache is cleared before each benchmark, but since there are many iterations for each row, the first call, which caches the result, is probably negligible.)

+------------------------------------------+-----------------------+-----------------------+
| Benchmark                                | fnmatch-translate-ref | fnmatch-translate-py  |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!]  | 6.09 us               | 3.99 us: 1.53x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.39 us               | 4.07 us: 1.57x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k***                      | 2.24 us               | 1.51 us: 1.49x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c                              | 1.97 us               | 1.12 us: 1.76x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1                          | 3.00 us               | 1.21 us: 2.48x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o**          | 5.40 us               | 3.33 us: 1.62x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean                           | (ref)                 | 1.71x faster          |
+------------------------------------------+-----------------------+-----------------------+

There are significant performance gains for each run. I totally forgot about the possibility of caching re.escape. I'll try to reduce the memory cost further by simply inlining re.escape. (Actually, there's no need for that much micro-optimization; it would be hard to maintain, so let's keep your solution.)

@picnixz
Member Author
picnixz commented Aug 17, 2024

I'll also update the C implementation. I discovered two other optimizations that I could apply on the C side, but I think the C implementation is not worth it unless we want a 3x improvement (here we are at roughly 1.75x, but I think we could go up to 2.8x-3.5x in C).

EDIT: We were eventually able to gain a 6x improvement, but since the maintenance cost is high (I would probably be the one responsible for it, and I confess I'd prefer maintaining the Python implementation), I decided to give up on the C implementation.

barneygale added a commit that referenced this issue Nov 27, 2024
Improve performance of this function by a factor of 1.7x.

Co-authored-by: Barney Gale <barney.gale@gmail.com>
@picnixz
Member Author
picnixz commented Nov 27, 2024

Closing since completed.

@picnixz picnixz closed this as completed Nov 27, 2024
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…122289)

Improve performance of this function by a factor of 1.7x.

Co-authored-by: Barney Gale <barney.gale@gmail.com>