ctype: cast characters to unsigned when classifying characters #6679

boretrk · 2023-12-14T18:43:19Z

ctype classification takes an integer in as returned from getc() if we just sign extend characters to integers 128-255 will be misclassified. (255 will become EOF)
Newlib in particular doesn't like this since they uses the value as an index in a lookup table.

boretrk · 2023-12-14T19:31:42Z

Seems to be partly a duplicate of #6569

ethomson · 2023-12-14T22:02:38Z

I don't really remember the rationale for having some of the git__ functions, but ... we do, and maybe we should continue to?

Should we #define git__isalnum(x) isalnum((unsigned char)(x)) instead of casting all over the place? 🤔

boretrk · 2023-12-15T12:24:27Z

I don't really remember the rationale for having some of the git__ functions, but ... we do, and maybe we should continue to?

Should we #define git__isalnum(x) isalnum((unsigned char)(x)) instead of casting all over the place? 🤔

So, I went through the blame list in util.h to see if I could figure anything out here.

75a4636 git__tolower seems to have been added to solve a performance issue, but can only be used if we know that the text doesn't require localization.

0f49200 @vicent is a bit unclear about what the actual issue was. Could be performance, could be that not casting properly led to weird behavior with non-ascii characters getting random classification.

d34f682 Custom isxdigit added for parse_header_oid() in patch_parse.c
I'm not sure if a separate oid parser is needed here.
Maybe we could use git_parse_advance_oid() or git_object__parse_oid_header() instead. (Unless we have a common function to parse oid pairs in 75a4636f50..0f49200c9a format somewhere.

So I'm a bit reluctant to just switch over the custom git__isalnum() directly. It seems that at least some of them where added to solve specific problems.
If all of them are performance issues on windows then I think using #define git__isalnum(x) isalnum((unsigned char)(x)) and use the ascii-only versions for windows is a good idea.
The only use of getc/getchar/fgetc I can find is en examples so we shouldn't have to worry about handling EOF.

ethomson · 2023-12-15T13:39:53Z

Yeah, I'm starting to remember - at least some version of MSVC takes a global lock on something heavy to figure out the locale. 😓

ctype classification takes an integer in as returned from getc() if we just sign extend characters to integers 128-255 will be misclassified. (255 will become EOF) Newlib in particular doesn't like this since they uses the value as an index in a lookup table.

ethomson · 2023-12-19T22:55:31Z

I think that we should get rid of the ctype emulation functions on Windows — for example, let's update git__tolower and just use tolower.

Creating git__tolower was a reaction to MSVCRT introducing a slowdown in a beta of VS 20...12? In a beta. And libgit2's not wanting to slow down VS using that beta. (more accurately, libgit2 was included in that version of VS as part of that beta).

That was... a few years ago. OK, a lot of years ago. Some naive changes to just use tolower suggests that this has been fixed. I ran this patch in CI and our CI runs appear to be approximately the same speed that they used to be. I can do a smarter benchmark. But, I guess what I'm wondering is:

Should we just get rid of the git__ things entirely?

commit cb10868cad0b5805e7c90da4f58e123694ffd59a
Author: Edward Thomson <ethomson@edwardthomson.com>
Date:   Tue Dec 19 22:25:47 2023 +0000

    ctype: don't emulate ctype functions on Windows
    
    We previously added some emulation functions for some ctype functions
    (`isalpha`, `tolower`, etc) since an old version of the Microsoft C
    runtime took a lock on the locale to operate which was deeply
    inefficient.
    
    In 2023, this seems to have been resolved. Revert it, and just use the
    standard functions.

diff --git a/src/util/util.h b/src/util/util.h
index 7f178b169..b28e14241 100644
--- a/src/util/util.h
+++ b/src/util/util.h
@@ -52,6 +52,12 @@
 #define CASESELECT(IGNORE_CASE, ICASE, CASE) \
 	((IGNORE_CASE) ? (ICASE) : (CASE))
 
+# define git__tolower(a) tolower(a)
+# define git__isalpha(c) isalpha(c)
+# define git__isdigit(c) isdigit(c)
+# define git__isspace(c) isspace(c)
+# define git__isxdigit(c) isxdigit(c)
+
 extern int git__prefixcmp(const char *str, const char *prefix);
 extern int git__prefixcmp_icase(const char *str, const char *prefix);
 extern int git__prefixncmp(const char *str, size_t str_n, const char *prefix);
@@ -83,15 +89,6 @@ extern char *git__strsep(char **end, const char *sep);
 extern void git__strntolower(char *str, size_t len);
 extern void git__strtolower(char *str);
 
-#ifdef GIT_WIN32
-GIT_INLINE(int) git__tolower(int c)
-{
-	return (c >= 'A' && c <= 'Z') ? (c + 32) : c;
-}
-#else
-# define git__tolower(a) tolower(a)
-#endif
-
 extern size_t git__linenlen(const char *buffer, size_t buffer_len);
 
 GIT_INLINE(const char *) git__next_line(const char *s)
@@ -254,21 +251,6 @@ GIT_INLINE(bool) git__isupper(int c)
 	return (c >= 'A' && c <= 'Z');
 }
 
-GIT_INLINE(bool) git__isalpha(int c)
-{
-	return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
-}
-
-GIT_INLINE(bool) git__isdigit(int c)
-{
-	return (c >= '0' && c <= '9');
-}
-
-GIT_INLINE(bool) git__isspace(int c)
-{
-	return (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == '\v');
-}
-
 GIT_INLINE(bool) git__isspace_nonlf(int c)
 {
 	return (c == ' ' || c == '\t' || c == '\f' || c == '\r' || c == '\v');
@@ -279,11 +261,6 @@ GIT_INLINE(bool) git__iswildcard(int c)
 	return (c == '*' || c == '?' || c == '[');
 }
 
-GIT_INLINE(bool) git__isxdigit(int c)
-{
-	return ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'));
-}
-
 /*
  * Parse a string value as a boolean, just like Core Git does.
  *

Curious what you think.

boretrk · 2023-12-20T20:17:48Z

Should we just get rid of the git__ things entirely?

I'm very torn on this.

I suspect that we have a bunch of cases where these functions are used on UTF-8-encoded texts.
I have no idea how the standard isalpha() behave if you set a locale with UTF-8 encoding and start feeding isalpha() with non-ascii bytes.
It can clearly not classify a partial character, but as long as it returns false on all non-ascii characters it should be fine I think.

src/util/wildmatch.c have #define ISALPHA(c) (ISASCII(c) && isalpha(c)) that I assume is to solve this, but that just gets exactly what git__isalpha() does.
deps/xdiff/xemit.c just checks for '(isalpha((unsigned char)*rec)', but that is when looking for identifiers. It's not exactly crititcal and may be 'good enough' and people who uses emojis in function names may not expect diff to handle them properly anyway.

Over all I am a big fan of using the ISO-C functions directly whenever possible, but this might not be a case where it is the right thing to do.

The "proper" way would probably be to parse UTF-8 strings with mbrtowc() or something but that is just not fun and it's not like we know that whatever we get is multibyte strings, on some systems it may very well be that everything is single byte strings.

boretrk · 2023-12-21T01:26:57Z

OK, so, if we take a commit made with a system with i18n.commitEncoding set to n_US.utf8 and process it on a system with i18n.commitEncoding set to en_US.iso885915 we will treat every character in the 0xC0-0xFF range as two character of which the first always returns true for isalpha().
But we shouldn't really have to worry about that. For commit messages we should only have to check for whitespace I think?

When parsing digits, hash values, and other control sequences we know that they are in the ASCII set so we don't really have to worry about them. All ctype functions should behave properly there.

It is only a problem if we mix data with different encodings.
So when parsing config files we can probably expect the locale to be set to the encoding that was used to write the config file.

git supports creating branches with names containing non-ascii characters
git branch åiåaäeö
git config branch.åiåaäeö.remote origin

[branch "åiåaäeö"]
        remote = origin

but this doesn't really impact our choice here because isalpha() will not work in either case so any validation has to be done some other way.

So it seems to me that it shouldn't really matter if we use ctype or roll our own that just checks for characters in the asciii set.
The problems that occur should occur either way and has to be solved some other way. (By not relying on isalpha() or tolower() for non-ascii characters.)

ethomson · 2023-12-21T21:32:33Z

🙁 OK. I think that we probably need to dig a little bit deeper on some of this before making any changes. I'm worried that a) we might already be different from git in some circumstances, and b) git itself is not well defined and does things like refname comparisons differently depending on locale, whether it's a loose or packed ref, etc, and c) that some of these comparisons are implicitly LANG=C and some are locale dependent. 😢

Let's take this to fix our immediate problems, and come up with a deeper plan for ctype like functions post-libgit2 v1.8.0.

ethomson · 2023-12-21T21:33:15Z

Thanks for the work here, and for working with me while I muse about our architecture. 🙏

boretrk · 2023-12-22T08:00:54Z

Unless you need this before Christmas I would prefer if you waited a bit.
I have given it some more thought and think that we should probably keep rolling our own "C"-like locale but I just need to comment up the code with something that explains why.

Git actually have a policy on this: https://git-scm.com/docs/git-commit/2.10.5#_discussion

So this PR in its current state is just wrong, we should replace the calls with our custom ones.
All type classification will return false for anything non-ascii and the rest of the code have to make sure that we don't assume that text isn't control characters just because isalnum() returns false.
If we don't systems with a locale set to ISO-8859-n will behave differently from those with it set to UTF-8.

ethomson · 2023-12-22T09:01:25Z

No rush, let's do it right. I appreciate your effort on this. We'll aim for a v1.8 after Christmas.

boretrk mentioned this pull request Dec 14, 2023

1.7.1: self test failures on Cygwin #6655

Open

boretrk force-pushed the ctype branch from c400e55 to dd5e276 Compare December 14, 2023 23:42

boretrk force-pushed the ctype branch from dd5e276 to e619b88 Compare December 16, 2023 22:53

boretrk marked this pull request as draft December 19, 2023 01:33

ethomson mentioned this pull request Dec 30, 2023

Win32: CMake missing secur32 link dependency for auth_sspi #6710

Closed

ethomson mentioned this pull request Mar 18, 2024

cast characters to unsigned when classifying characters #6770

Merged

ethomson merged commit e619b88 into libgit2:main Mar 18, 2024

ethomson mentioned this pull request Mar 18, 2024

Ensure well-defined value range for args to <ctype.h> functions. #6569

Closed

ethomson added bug build and removed bug labels Mar 20, 2024

branchvincent mentioned this pull request Mar 21, 2024

libgit2 1.8.1 libgit2@1.7 1.7.2 (new formula) Homebrew/homebrew-core#166767

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ctype: cast characters to unsigned when classifying characters #6679

ctype: cast characters to unsigned when classifying characters #6679

ctype: cast characters to unsigned when classifying characters #6679

ctype: cast characters to unsigned when classifying characters #6679

Conversation