-
Notifications
You must be signed in to change notification settings - Fork 2.5k
ctype: cast characters to unsigned when classifying characters #6679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy sta 8000 tement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Seems to be partly a duplicate of #6569 |
I don't really remember the rationale for having some of the Should we |
So, I went through the blame list in util.h to see if I could figure anything out here. 75a4636 git__tolower seems to have been added to solve a performance issue, but can only be used if we know that the text doesn't require localization. 0f49200 @vicent is a bit unclear about what the actual issue was. Could be performance, could be that not casting properly led to weird behavior with non-ascii characters getting random classification. d34f682 Custom isxdigit added for parse_header_oid() in patch_parse.c So I'm a bit reluctant to just switch over the custom git__isalnum() directly. It seems that at least some of them where added to solve specific problems. |
Yeah, I'm starting to remember - at least some version of MSVC takes a global lock on something heavy to figure out the locale. 😓 |
ctype classification takes an integer in as returned from getc() if we just sign extend characters to integers 128-255 will be misclassified. (255 will become EOF) Newlib in particular doesn't like this since they uses the value as an index in a lookup table.
I think that we should get rid of the ctype emulation functions on Windows — for example, let's update Creating That was... a few years ago. OK, a lot of years ago. Some naive changes to just use Should we just get rid of the commit cb10868cad0b5805e7c90da4f58e123694ffd59a
Author: Edward Thomson <ethomson@edwardthomson.com>
Date: Tue Dec 19 22:25:47 2023 +0000
ctype: don't emulate ctype functions on Windows
We previously added some emulation functions for some ctype functions
(`isalpha`, `tolower`, etc) since an old version of the Microsoft C
runtime took a lock on the locale to operate which was deeply
inefficient.
In 2023, this seems to have been resolved. Revert it, and just use the
standard functions.
diff --git a/src/util/util.h b/src/util/util.h
index 7f178b169..b28e14241 100644
--- a/src/util/util.h
+++ b/src/util/util.h
@@ -52,6 +52,12 @@
#define CASESELECT(IGNORE_CASE, ICASE, CASE) \
((IGNORE_CASE) ? (ICASE) : (CASE))
+# define git__tolower(a) tolower(a)
+# define git__isalpha(c) isalpha(c)
+# define git__isdigit(c) isdigit(c)
+# define git__isspace(c) isspace(c)
+# define git__isxdigit(c) isxdigit(c)
+
extern int git__prefixcmp(const char *str, const char *prefix);
extern int git__prefixcmp_icase(const char *str, const char *prefix);
extern int git__prefixncmp(const char *str, size_t str_n, const char *prefix);
@@ -83,15 +89,6 @@ extern char *git__strsep(char **end, const char *sep);
extern void git__strntolower(char *str, size_t len);
extern void git__strtolower(char *str);
-#ifdef GIT_WIN32
-GIT_INLINE(int) git__tolower(int c)
-{
- return (c >= 'A' && c <= 'Z') ? (c + 32) : c;
-}
-#else
-# define git__tolower(a) tolower(a)
-#endif
-
extern size_t git__linenlen(const char *buffer, size_t buffer_len);
GIT_INLINE(const char *) git__next_line(const char *s)
@@ -254,21 +251,6 @@ GIT_INLINE(bool) git__isupper(int c)
return (c >= 'A' && c <= 'Z');
}
-GIT_INLINE(bool) git__isalpha(int c)
-{
- return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
-}
-
-GIT_INLINE(bool) git__isdigit(int c)
-{
- return (c >= '0' && c <= '9');
-}
-
-GIT_INLINE(bool) git__isspace(int c)
-{
- return (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == '\v');
-}
-
GIT_INLINE(bool) git__isspace_nonlf(int c)
{
return (c == ' ' || c == '\t' || c == '\f' || c == '\r' || c == '\v');
@@ -279,11 +261,6 @@ GIT_INLINE(bool) git__iswildcard(int c)
return (c == '*' || c == '?' || c == '[');
}
-GIT_INLINE(bool) git__isxdigit(int c)
-{
- return ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'));
-}
-
/*
* Parse a string value as a boolean, just like Core Git does.
* Curious what you think. |
I'm very torn on this. I suspect that we have a bunch of cases where these functions are used on UTF-8-encoded texts. src/util/wildmatch.c have Over all I am a big fan of using the ISO-C functions directly whenever possible, but this might not be a case where it is the right thing to do. The "proper" way would probably be to parse UTF-8 strings with mbrtowc() or something but that is just not fun and it's not like we know that whatever we get is multibyte strings, on some systems it may very well be that everything is single byte strings. |
OK, so, if we take a commit made with a system with i18n.commitEncoding set to n_US.utf8 and process it on a system with i18n.commitEncoding set to en_US.iso885915 we will treat every character in the 0xC0-0xFF range as two character of which the first always returns true for isalpha(). When parsing digits, hash values, and other control sequences we know that they are in the ASCII set so we don't really have to worry about them. All ctype functions should behave properly there. It is only a problem if we mix data with different encodings. git supports creating branches with names containing non-ascii characters
but this doesn't really impact our choice here because isalpha() will not work in either case so any validation has to be done some other way. So it seems to me that it shouldn't really matter if we use ctype or roll our own that just checks for characters in the asciii set. |
🙁 OK. I think that we probably need to dig a little bit deeper on some of this before making any changes. I'm worried that a) we might already be different from git in some circumstances, and b) git itself is not well defined and does things like refname comparisons differently depending on locale, whether it's a loose or packed ref, etc, and c) that some of these comparisons are implicitly LANG=C and some are locale dependent. 😢 Let's take this to fix our immediate problems, and come up with a deeper plan for ctype like functions post-libgit2 v1.8.0. |
Thanks for the work here, and for working with me while I muse about our architecture. 🙏 |
Unless you need this before Christmas I would prefer if you waited a bit. Git actually have a policy on this: https://git-scm.com/docs/git-commit/2.10.5#_discussion So this PR in its current state is just wrong, we should replace the calls with our custom ones. |
No rush, let's do it right. I appreciate your effort on this. We'll aim for a v1.8 after Christmas. |
ctype classification takes an integer in as returned from getc() if we just sign extend characters to integers 128-255 will be misclassified. (255 will become EOF)
Newlib in particular doesn't like this since they uses the value as an index in a lookup table.