32 |
| -the API, but it has to be global because both regcomp and regexec call it. | 33 |
| -It'd be better to get rid of that, as well as the static variables it |
34 |
| -sets, in favor of keeping the needed locale state in the regex structs. |
35 |
| -We have not done this yet for lack of a design for how to add |
36 |
| -application-specific state to the structs.) |
| 30 | +(Actually the above is a lie in one respect: there are two more global |
| 31 | +symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are |
| 32 | +not meant to be part of the API, but they have to be global because both |
| 33 | +regcomp and regexec call them. It'd be better to get rid of |
| 34 | +pg_set_regex_collation, as well as the static variables it sets, in favor of |
| 35 | +keeping the needed locale state in the regex structs. We have not done this |
| 36 | +yet for lack of a design for how to add application-specific state to the |
| 37 | +structs.) |
37 | 38 |
|
38 | 39 | What's where in src/backend/regex/:
|
39 | 40 |
|
@@ -274,28 +275,65 @@ colors:
|
274 | 275 | an existing color has to be subdivided.
|
275 | 276 |
|
276 | 277 | The last two of these are handled with the "struct colordesc" array and
|
277 |
| -the "colorchain" links in NFA arc structs. The color map proper (that |
278 |
| -is, the per-character lookup array) is handled as a multi-level tree, |
279 |
| -with each tree level indexed by one byte of a character's value. The |
280 |
| -code arranges to not have more than one copy of bottom-level tree pages |
281 |
| -that are all-the-same-color. |
282 |
| - |
283 |
| -Unfortunately, this design does not seem terribly efficient for common |
284 |
| -cases such as a tree in which all Unicode letters are colored the same, |
285 |
| -because there aren't that many places where we get a whole page all the |
286 |
| -same color, except at the end of the map. (It also strikes me that given |
287 |
| -PG's current restrictions on the range of Unicode values, we could use a |
288 |
| -3-level rather than 4-level tree; but there's not provision for that in |
289 |
| -regguts.h at the moment.) |
290 |
| - |
291 |
| -A bigger problem is that it just doesn't seem very reasonable to have to |
292 |
| -consider each Unicode letter separately at regex parse time for a regex |
293 |
| -such as "\w"; more than likely, a huge percentage of those codes will |
294 |
| -never be seen at runtime. We need to fix things so that locale-based |
295 |
| -character classes are somehow processed "symbolically" without making a |
296 |
| -full expansion of their contents at parse time. This would mean that we'd |
297 |
| -have to be ready to call iswalpha() at runtime, but if that only happens |
298 |
| -for high-code-value characters, it shouldn't be a big performance hit. |
| 278 | +the "colorchain" links in NFA arc structs. |
| 279 | + |
| 280 | +Ideally, we'd do the first two operations using a simple linear array |
| 281 | +storing the current color assignment for each character code. |
| 282 | +Unfortunately, that's not terribly workable for large charsets such as |
| 283 | +Unicode. Our solution is to divide the color map into two parts. A simple |
| 284 | +linear array is used for character codes up to MAX_SIMPLE_CHR, which can be |
| 285 | +chosen large enough to include all popular characters (so that the |
| 286 | +significantly-slower code paths about to be described are seldom invoked). |
| 287 | +Characters above that need be considered at compile time only if they |
| 288 | +appear explicitly in the regex pattern. We store each such mentioned |
| 289 | +character or character range as an entry in the "colormaprange" array in |
| 290 | +the colormap. (Overlapping ranges are split into unique subranges, so that |
| 291 | +each range in the finished list needs only a single color that describes |
| 292 | +all its characters.) When mapping a character above MAX_SIMPLE_CHR to a |
| 293 | +color at runtime, we search this list of ranges explicitly. |
| 294 | + |
| 295 | +That's still not quite enough, though, because of locale-dependent |
| 296 | +character classes such as [[:alpha:]]. In Unicode locales these classes |
| 297 | +may have thousands of entries that are above MAX_SIMPLE_CHR, and we |
| 298 | +certainly don't want to be searching large colormaprange arrays at runtime. |
| 299 | +Nor do we even want to spend the time to initialize cvec structures that |
| 300 | +exhaustively describe all of those characters. Our solution is to compute |
| 301 | +exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR. |
| 302 | +For characters above that, we apply the <ctype.h> or <wctype.h> lookup |
| 303 | +functions at runtime for each locale-dependent character class used in the |
| 304 | +regex pattern, constructing a bitmap that describes which classes the |
| 305 | +runtime character belongs to. The per-character-range data structure |
| 306 | +mentioned above actually holds, for each range, a separate color entry |
| 307 | +for each possible combination of character class properties. That is, |
| 308 | +the color map for characters above MAX_SIMPLE_CHR is really a 2-D array, |
| 309 | +whose rows correspond to high characters or character ranges that are |
| 310 | +explicitly mentioned in the regex pattern, and whose columns correspond |
| 311 | +to sets of the locale-dependent character classes that are used in the |
| 312 | +regex. |
| 313 | + |
| 314 | +As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]' |
| 315 | +(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need |
| 316 | +a high color map with three rows. One row is for the single character |
| 317 | +U+1234 (represented as a single-element range), one is for the range |
| 318 | +U+1D100..U+1D1FF, and the other row represents all remaining high |
| 319 | +characters. The color map has two columns, one for characters that |
| 320 | +satisfy iswalnum() and one for those that don't. |
| 321 | + |
| 322 | +We build this color map in parallel with scanning the regex. Each time |
| 323 | +we detect a new explicit high character (or range) or a locale-dependent |
| 324 | +character class, we split existing entry(s) in the high color map so that |
| 325 | +characters we need to be able to distinguish will have distinct entries |
| 326 | +that can be given separate colors. Often, though, single entries in the |
| 327 | +high color map will represent very large sets of characters. |
| 328 | + |
| 329 | +If there are both explicit high characters/ranges and locale-dependent |
| 330 | +character classes, we may have entries in the high color map array that |
| 331 | +have non-WHITE colors but don't actually represent any real characters. |
| 332 | +(For example, in a row representing a singleton range, only one of the |
| 333 | +columns could possibly be a live entry; it's the one matching the actual |
| 334 | +locale properties for that single character.) We don't currently make |
| 335 | +any effort to reclaim such colors. In principle it could be done, but |
| 336 | +it's not clear that it's worth the trouble. |
299 | 337 |
|
300 | 338 |
|
301 | 339 | Detailed semantics of an NFA
|
|
0 commit comments