fix: handle both UTF-8 and ANSI model paths on Windows #3650
Open
corvo007 wants to merge 1 commit into ggml-org:master from
Replace `std::codecvt_utf8` with `MultiByteToWideChar` for converting model file paths to wide strings on Windows.

The previous code assumed `path_model` was always UTF-8 encoded, but when whisper-cli is invoked via `main(argc, argv)`, the MSVC C runtime converts the UTF-16 command line to the system ANSI code page (e.g. CP936 for Chinese Windows), not UTF-8. Passing these ANSI bytes to `codecvt_utf8::from_bytes()` causes `std::range_error`, which triggers STATUS_STACK_BUFFER_OVERRUN (0xC0000409) and crashes the process.

The fix tries `MultiByteToWideChar(CP_UTF8)` first, and if the string is not valid UTF-8, falls back to `MultiByteToWideChar(CP_ACP)`. This correctly handles both:

- UTF-8 paths (from manifest-enabled or Unicode-aware callers)
- ANSI paths (from the default MSVC main() using the system code page)

Also changes the guard from `_MSC_VER` to `_WIN32` to cover MinGW/Clang on Windows, and removes the deprecated `<codecvt>` header dependency.

Fixes model loading crashes for users with non-ASCII characters in their model file paths (e.g. Chinese, Japanese, Korean, Hebrew, Arabic).
Motivation

The current code in `whisper_init_from_file_with_params_no_state` and `whisper_vad_init_from_file_with_params` uses `std::codecvt_utf8<wchar_t>` to convert model file paths from narrow strings to wide strings on Windows. This assumes `path_model` is always UTF-8 encoded.

However, when `whisper-cli` is invoked normally (without a UTF-8 manifest), the MSVC C runtime converts the UTF-16 command line to the system ANSI code page (e.g. CP936/GBK for Chinese Windows, CP1252 for Western Europe), not UTF-8. Passing these ANSI-encoded bytes to `codecvt_utf8::from_bytes()` causes `std::range_error`, which triggers STATUS_STACK_BUFFER_OVERRUN (0xC0000409) and crashes the process.

This affects any user whose model file path contains non-ASCII characters that are not valid UTF-8 when encoded in their system code page (Chinese, Japanese, and Korean paths are most commonly affected).
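To illustrate why ANSI-encoded bytes are rejected by a strict UTF-8 decoder, here is a minimal, portable sketch of a UTF-8 well-formedness check. The helper name `is_valid_utf8` is hypothetical and is not part of whisper.cpp or of this PR; the example bytes assume that CP936 encodes 新 as `0xD0 0xC2`, while UTF-8 encodes it as `0xE6 0x96 0xB0`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only, not code from this PR):
// returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string & s) {
    // Smallest code point that legitimately needs a sequence of length 1..4.
    static const unsigned int min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t  len = 0;
        unsigned int cp  = 0;

        // Classify the lead byte and extract its payload bits.
        if (c < 0x80)                { len = 1; cp = c;        }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false; // stray continuation byte or invalid lead byte

        if (i + len > n) return false; // truncated sequence

        // Every following byte must look like 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min_cp[len])             return false; // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF)                return false; // beyond Unicode range

        i += len;
    }
    return true;
}
```

With these assumptions, `is_valid_utf8("\xE6\x96\xB0")` holds while `is_valid_utf8("\xD0\xC2")` does not, because `0xC2` is not a continuation byte. Note that some ANSI byte sequences can happen to be well-formed UTF-8, so checking UTF-8 first is a heuristic rather than a guarantee.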
Changes
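Replace `std::codecvt_utf8` with `MultiByteToWideChar` using a two-step approach:

1. Try `MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...)` first
2. If the string is not valid UTF-8, fall back to `MultiByteToWideChar(CP_ACP, ...)`

This correctly handles both:

- UTF-8 paths (from manifest-enabled or Unicode-aware callers)
- ANSI paths (from the default MSVC `main(argc, argv)` which uses the system code page)

Additional changes:

- Change the guard from `_MSC_VER` to `_WIN32` to also cover MinGW/Clang on Windows
- Remove the `<codecvt>` header (deprecated since C++17)

Relationship to #3610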
This PR is complementary to #3610 (UTF-8 manifest for CLI):
Both can coexist: the manifest ensures the CLI sends UTF-8, and this fix ensures the library works correctly regardless.
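For reference, the two-step conversion described under Changes could look roughly like the sketch below. This is not the PR's actual diff: the helper names `to_wide` and `path_to_wide` are hypothetical, and returning an empty string on failure is an assumption made for this illustration. Only the Win32 call `MultiByteToWideChar` and the constants `CP_UTF8`, `CP_ACP`, and `MB_ERR_INVALID_CHARS` come from the description.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: convert `path` from code page `cp` to UTF-16.
// Returns false if the conversion fails (e.g. invalid UTF-8 when
// MB_ERR_INVALID_CHARS is passed in `flags`).
static bool to_wide(const std::string & path, UINT cp, DWORD flags, std::wstring & out) {
    out.clear();
    if (path.empty()) {
        return true; // MultiByteToWideChar rejects zero-length input
    }
    const int len = static_cast<int>(path.size());

    // First call: query the required number of wide characters.
    const int n = MultiByteToWideChar(cp, flags, path.data(), len, nullptr, 0);
    if (n <= 0) {
        return false;
    }

    // Second call: perform the conversion into the sized buffer.
    out.resize(static_cast<std::size_t>(n));
    return MultiByteToWideChar(cp, flags, path.data(), len, &out[0], n) == n;
}

// Hypothetical helper: try strict UTF-8 first, then the system ANSI code page.
// Returns an empty string if both conversions fail (assumption for this sketch).
std::wstring path_to_wide(const std::string & path) {
    std::wstring w;
    if (to_wide(path, CP_UTF8, MB_ERR_INVALID_CHARS, w)) {
        return w;
    }
    if (to_wide(path, CP_ACP, 0, w)) {
        return w;
    }
    return std::wstring();
}
#endif // _WIN32
```

The resulting wide string can then be handed to a wide-character file API. Because the input length is passed explicitly, the output does not include a terminating null, which is why it is written directly into a `std::wstring` of the reported size.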
Testing
Tested with model paths containing Chinese characters on Windows 10 (zh-CN, CP936):
- `E:\新文件夹 (2)\ggml-small-q8_0.bin`: previously crashed with 0xC0000409, now loads correctly

Fixes #2052 (for cases where the previous fix was insufficient)