8000 ctype: cast characters to unsigned when classifying characters by boretrk · Pull Request #6679 · libgit2/libgit2 · GitHub
[go: up one dir, main page]

Skip to content

ctype: cast characters to unsigned when classifying characters #6679

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy sta 8000 tement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 18, 2024

Conversation

boretrk
Copy link
Contributor
@boretrk boretrk commented Dec 14, 2023

ctype classification takes an integer in as returned from getc() if we just sign extend characters to integers 128-255 will be misclassified. (255 will become EOF)
Newlib in particular doesn't like this since they uses the value as an index in a lookup table.

@boretrk
Copy link
Contributor Author
boretrk commented Dec 14, 2023

Seems to be partly a duplicate of #6569

@ethomson
Copy link
Member

I don't really remember the rationale for having some of the git__ functions, but ... we do, and maybe we should continue to?

Should we #define git__isalnum(x) isalnum((unsigned char)(x)) instead of casting all over the place? 🤔

@boretrk
Copy link
Contributor Author
boretrk commented Dec 15, 2023

I don't really remember the rationale for having some of the git__ functions, but ... we do, and maybe we should continue to?

Should we #define git__isalnum(x) isalnum((unsigned char)(x)) instead of casting all over the place? 🤔

So, I went through the blame list in util.h to see if I could figure anything out here.

75a4636 git__tolower seems to have been added to solve a performance issue, but can only be used if we know that the text doesn't require localization.

0f49200 @vicent is a bit unclear about what the actual issue was. Could be performance, could be that not casting properly led to weird behavior with non-ascii characters getting random classification.

d34f682 Custom isxdigit added for parse_header_oid() in patch_parse.c
I'm not sure if a separate oid parser is needed here.
Maybe we could use git_parse_advance_oid() or git_object__parse_oid_header() instead. (Unless we have a common function to parse oid pairs in 75a4636f50..0f49200c9a format somewhere.

So I'm a bit reluctant to just switch over the custom git__isalnum() directly. It seems that at least some of them where added to solve specific problems.
If all of them are performance issues on windows then I think using #define git__isalnum(x) isalnum((unsigned char)(x)) and use the ascii-only versions for windows is a good idea.
The only use of getc/getchar/fgetc I can find is en examples so we shouldn't have to worry about handling EOF.

@ethomson
Copy link
Member

Yeah, I'm starting to remember - at least some version of MSVC takes a global lock on something heavy to figure out the locale. 😓

ctype classification takes an integer in as returned from getc()
if we just sign extend characters to integers 128-255 will be
misclassified. (255 will become EOF)
Newlib in particular doesn't like this since they uses the value
as an index in a lookup table.
@boretrk boretrk marked this pull request as draft December 19, 2023 01:33
@ethomson
Copy link
Member
ethomson commented Dec 19, 2023

I think that we should get rid of the ctype emulation functions on Windows — for example, let's update git__tolower and just use tolower.

Creating git__tolower was a reaction to MSVCRT introducing a slowdown in a beta of VS 20...12? In a beta. And libgit2's not wanting to slow down VS using that beta. (more accurately, libgit2 was included in that version of VS as part of that beta).

That was... a few years ago. OK, a lot of years ago. Some naive changes to just use tolower suggests that this has been fixed. I ran this patch in CI and our CI runs appear to be approximately the same speed that they used to be. I can do a smarter benchmark. But, I guess what I'm wondering is:

Should we just get rid of the git__ things entirely?

commit cb10868cad0b5805e7c90da4f58e123694ffd59a
Author: Edward Thomson <ethomson@edwardthomson.com>
Date:   Tue Dec 19 22:25:47 2023 +0000

    ctype: don't emulate ctype functions on Windows
    
    We previously added some emulation functions for some ctype functions
    (`isalpha`, `tolower`, etc) since an old version of the Microsoft C
    runtime took a lock on the locale to operate which was deeply
    inefficient.
    
    In 2023, this seems to have been resolved. Revert it, and just use the
    standard functions.

diff --git a/src/util/util.h b/src/util/util.h
index 7f178b169..b28e14241 100644
--- a/src/util/util.h
+++ b/src/util/util.h
@@ -52,6 +52,12 @@
 #define CASESELECT(IGNORE_CASE, ICASE, CASE) \
 	((IGNORE_CASE) ? (ICASE) : (CASE))
 
+# define git__tolower(a) tolower(a)
+# define git__isalpha(c) isalpha(c)
+# define git__isdigit(c) isdigit(c)
+# define git__isspace(c) isspace(c)
+# define git__isxdigit(c) isxdigit(c)
+
 extern int git__prefixcmp(const char *str, const char *prefix);
 extern int git__prefixcmp_icase(const char *str, const char *prefix);
 extern int git__prefixncmp(const char *str, size_t str_n, const char *prefix);
@@ -83,15 +89,6 @@ extern char *git__strsep(char **end, const char *sep);
 extern void git__strntolower(char *str, size_t len);
 extern void git__strtolower(char *str);
 
-#ifdef GIT_WIN32
-GIT_INLINE(int) git__tolower(int c)
-{
-	return (c >= 'A' && c <= 'Z') ? (c + 32) : c;
-}
-#else
-# define git__tolower(a) tolower(a)
-#endif
-
 extern size_t git__linenlen(const char *buffer, size_t buffer_len);
 
 GIT_INLINE(const char *) git__next_line(const char *s)
@@ -254,21 +251,6 @@ GIT_INLINE(bool) git__isupper(int c)
 	return (c >= 'A' && c <= 'Z');
 }
 
-GIT_INLINE(bool) git__isalpha(int c)
-{
-	return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
-}
-
-GIT_INLINE(bool) git__isdigit(int c)
-{
-	return (c >= '0' && c <= '9');
-}
-
-GIT_INLINE(bool) git__isspace(int c)
-{
-	return (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == '\v');
-}
-
 GIT_INLINE(bool) git__isspace_nonlf(int c)
 {
 	return (c == ' ' || c == '\t' || c == '\f' || c == '\r' || c == '\v');
@@ -279,11 +261,6 @@ GIT_INLINE(bool) git__iswildcard(int c)
 	return (c == '*' || c == '?' || c == '[');
 }
 
-GIT_INLINE(bool) git__isxdigit(int c)
-{
-	return ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'));
-}
-
 /*
  * Parse a string value as a boolean, just like Core Git does.
  *

Curious what you think.

@boretrk
Copy link
Contributor Author
boretrk commented Dec 20, 2023

Should we just get rid of the git__ things entirely?

I'm very torn on this.

I suspect that we have a bunch of cases where these functions are used on UTF-8-encoded texts.
I have no idea how the standard isalpha() behave if you set a locale with UTF-8 encoding and start feeding isalpha() with non-ascii bytes.
It can clearly not classify a partial character, but as long as it returns false on all non-ascii characters it should be fine I think.

src/util/wildmatch.c have #define ISALPHA(c) (ISASCII(c) && isalpha(c)) that I assume is to solve this, but that just gets exactly what git__isalpha() does.
deps/xdiff/xemit.c just checks for '(isalpha((unsigned char)*rec)', but that is when looking for identifiers. It's not exactly crititcal and may be 'good enough' and people who uses emojis in function names may not expect diff to handle them properly anyway.

Over all I am a big fan of using the ISO-C functions directly whenever possible, but this might not be a case where it is the right thing to do.

The "proper" way would probably be to parse UTF-8 strings with mbrtowc() or something but that is just not fun and it's not like we know that whatever we get is multibyte strings, on some systems it may very well be that everything is single byte strings.

@boretrk
Copy link
Contributor Author
boretrk commented Dec 21, 2023

OK, so, if we take a commit made with a system with i18n.commitEncoding set to n_US.utf8 and process it on a system with i18n.commitEncoding set to en_US.iso885915 we will treat every character in the 0xC0-0xFF range as two character of which the first always returns true for isalpha().
But we shouldn't really have to worry about that. For commit messages we should only have to check for whitespace I think?

When parsing digits, hash values, and other control sequences we know that they are in the ASCII set so we don't really have to worry about them. All ctype functions should behave properly there.

It is only a problem if we mix data with different encodings.
So when parsing config files we can probably expect the locale to be set to the encoding that was used to write the config file.

git supports creating branches with names containing non-ascii characters
git branch åiåaäeö
git config branch.åiåaäeö.remote origin

[branch "åiåaäeö"]
        remote = origin

but this doesn't really impact our choice here because isalpha() will not work in either case so any validation has to be done some other way.

So it seems to me that it shouldn't really matter if we use ctype or roll our own that just checks for characters in the asciii set.
The problems that occur should occur either way and has to be solved some other way. (By not relying on isalpha() or tolower() for non-ascii characters.)

@ethomson
Copy link
Member

🙁 OK. I think that we probably need to dig a little bit deeper on some of this before making any changes. I'm worried that a) we might already be different from git in some circumstances, and b) git itself is not well defined and does things like refname comparisons differently depending on locale, whether it's a loose or packed ref, etc, and c) that some of these comparisons are implicitly LANG=C and some are locale dependent. 😢

Let's take this to fix our immediate problems, and come up with a deeper plan for ctype like functions post-libgit2 v1.8.0.

@ethomson
Copy link
Member

Thanks for the work here, and for working with me while I muse about our architecture. 🙏

@boretrk
Copy link
Contributor Author
boretrk commented Dec 22, 2023

Unless you need this before Christmas I would prefer if you waited a bit.
I have given it some more thought and think that we should probably keep rolling our own "C"-like locale but I just need to comment up the code with something that explains why.

Git actually have a policy on this: https://git-scm.com/docs/git-commit/2.10.5#_discussion

So this PR in its current state is just wrong, we should replace the calls with our custom ones.
All type classification will return false for anything non-ascii and the rest of the code have to make sure that we don't assume that text isn't control characters just because isalnum() returns false.
If we don't systems with a locale set to ISO-8859-n will behave differently from those with it set to UTF-8.

@ethomson
Copy link
Member

No rush, let's do it right. I appreciate your effort on this. We'll aim for a v1.8 after Christmas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0