8000 wip Fully implement the UAX #15 quick-check algorithm. · gnprice/cpython@322e702 · GitHub
[go: up one dir, main page]

Skip to content

Commit 322e702

Browse files
committed
wip Fully implement the UAX python#15 quick-check algorithm.
TODO: - news etc.? - test somehow? at least make sure semantic tests are adequate - that "older version" path... shouldn't it be MAYBE? - mention explicitly in commit message that *this* is the actual algorithm from UAX python#15 - think if there are counter-cases where this is slower. If caller treats MAYBE same as NO... e.g. if caller actually just wants to normalize? May need to parametrize and offer both behaviors. This lets us return a NO answer instead of MAYBE when that's what a Quick_Check property tells us; or also when that's what the canonical combining classes tell us, after a Quick_Check property has said "maybe". At a quick test on my laptop, the existing code takes about 6.7 ms/MB (so 6.7 ns per byte) when the quick check returns MAYBE and it has to do the slow comparison: $ ./python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \ 'unicodedata.is_normalized("NFD", s)' 50 loops, best of 5: 6.67 msec per loop With this patch, it gets the answer instantly (78 ns) on the same 1 MB string: $ ./python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \ 'unicodedata.is_normalized("NFD", s)' 5000000 loops, best of 5: 78 nsec per loop
1 parent bb1dea2 commit 322e702

File tree

1 file changed

+22
-15
lines changed

1 file changed

+22
-15
lines changed

Modules/unicodedata.c

Lines changed: 22 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -775,29 +775,31 @@ nfc_nfkc(PyObject *self, PyObject *input, int k)
775775
return result;
776776
}
777777

778-
typedef enum {YES, NO, MAYBE} NormalMode;
778+
typedef enum {YES, MAYBE, NO} NormalMode;
779779

780780
/* Return YES if the input is certainly normalized, NO or MAYBE if it might not be. */
781781
static NormalMode
782782
is_normalized(PyObject *self, PyObject *input, int nfc, int k)
783783
{
784-
Py_ssize_t i, len;
785-
int kind;
786-
void *data;
787-
unsigned char prev_combining = 0, quickcheck_mask;
784+
/* This is an implementation of the following algorithm:
785+
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
786+
See there for background.
787+
*/
788788

789789
/* An older version of the database is requested, quickchecks must be
790790
disabled. */
791791
if (self && UCD_Check(self))
792792
return NO;
793793

794-
/* This is an implementation of the following algorithm:
795-
https://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms
796-
See there for background.
797-
*/
794+
Py_ssize_t i, len;
795+
int kind;
796+
void *data;
797+
unsigned char prev_combining = 0;
798798

799799
/* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No. */
800-
quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
800+
int quickcheck_mask_shift = ((nfc ? 4 : 0) + (k ? 2 : 0));
801+
802+
NormalMode result = YES; /* certainly normalized, unless we find something */
801803

802804
i = 0;
803805
kind = PyUnicode_KIND(input);
@@ -806,16 +808,21 @@ is_normalized(PyObject *self, PyObject *input, int nfc, int k)
806808
while (i < len) {
807809
Py_UCS4 ch = PyUnicode_READ(kind, data, i++);
808810
const _PyUnicode_DatabaseRecord *record = _getrecord_ex(ch);
809-
unsigned char combining = record->combining;
810-
unsigned char quickcheck = record->normalization_quick_check;
811811

812-
if (quickcheck & quickcheck_mask)
813-
return MAYBE; /* this string might need normalization */
812+
unsigned char combining = record->combining;
814813
if (combining && prev_combining > combining)
815814
return NO; /* non-canonical sort order, not normalized */
816815
prev_combining = combining;
816+
817+
unsigned char quickcheck = record->normalization_quick_check;
818+
switch ((quickcheck >> quickcheck_mask_shift) & 3) {
819+
case NO:
820+
return NO;
821+
case MAYBE:
822+
result = MAYBE; /* this string might need normalization */
823+
}
817824
}
818-
return YES; /* certainly normalized */
825+
return result;
819826
}
820827

821828
/*[clinic input]

0 commit comments

Comments
 (0)
0