fix: handle both UTF-8 and ANSI model paths on Windows by corvo007 · Pull Request #3650 · ggml-org/whisper.cpp · GitHub
fix: handle both UTF-8 and ANSI model paths on Windows #3650

Open
corvo007 wants to merge 1 commit into ggml-org:master from corvo007:fix/windows-path-encoding
corvo007 commented Feb 6, 2026

Motivation

The current code in whisper_init_from_file_with_params_no_state and whisper_vad_init_from_file_with_params uses std::codecvt_utf8<wchar_t> to convert model file paths from narrow string to wide string on Windows. This assumes path_model is always UTF-8 encoded.

However, when whisper-cli is invoked normally (without a UTF-8 manifest), the MSVC C runtime converts the UTF-16 command line to the system ANSI code page (e.g. CP936/GBK on Chinese Windows, CP1252 in Western Europe), not UTF-8. Passing these ANSI-encoded bytes to std::wstring_convert<std::codecvt_utf8<wchar_t>>::from_bytes() throws std::range_error; the exception is not caught, so the process terminates with STATUS_STACK_BUFFER_OVERRUN (0xC0000409).

This affects any user whose model file path contains non-ASCII characters that, once encoded in the system code page, do not form valid UTF-8 (Chinese, Japanese, and Korean paths are most commonly affected).
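To make the failure mode concrete, here is a minimal, portable sketch of a UTF-8 well-formedness check applied to the same two characters encoded both ways. The helper name `is_valid_utf8` and the two constants are hypothetical and not part of whisper.cpp; the check only approximates what a strict UTF-8 decoder rejects.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from whisper.cpp): returns true if `s` is well-formed UTF-8.
// Rejects stray/missing continuation bytes, overlong forms, surrogates and values > U+10FFFF.
static bool is_valid_utf8(const std::string & s) {
    const unsigned char * p = reinterpret_cast<const unsigned char *>(s.data());
    const size_t n = s.size();
    // Smallest code point that legitimately needs a sequence of the given length.
    static const unsigned int min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        if (c < 0x80) { i++; continue; }               // plain ASCII byte
        size_t len;
        unsigned int cp;
        if      ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                             // invalid lead byte
        if (i + len > n) return false;                 // truncated sequence
        for (size_t k = 1; k < len; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min_cp[len]) return false;            // overlong encoding
        if (cp > 0x10FFFF) return false;               // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        i += len;
    }
    return true;
}

// The text "中文" as raw bytes, written as escapes so the source encoding does not matter.
static const std::string kUtf8Bytes  = "\xE4\xB8\xAD\xE6\x96\x87"; // UTF-8 encoding
static const std::string kCp936Bytes = "\xD6\xD0\xCE\xC4";         // CP936/GBK encoding
```

Here `is_valid_utf8(kUtf8Bytes)` holds, while `is_valid_utf8(kCp936Bytes)` does not: 0xD6 looks like a two-byte lead, but 0xD0 is not a continuation byte. A validity check like this is a heuristic, since some ANSI byte sequences also happen to be valid UTF-8 (for example the CP1252 bytes C3 A9) and would be read as UTF-8.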

Changes

Replace std::codecvt_utf8 with MultiByteToWideChar using a two-step approach:

  1. Try MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, ...) first
  2. If the string is not valid UTF-8, fall back to MultiByteToWideChar(CP_ACP, ...)

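This correctly handles both:

  • UTF-8 paths (from manifest-enabled or Unicode-aware callers)
  • ANSI paths (from the default MSVC main() using the system code page)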

Additional changes:

  • Guard changed from _MSC_VER to _WIN32 to also cover MinGW/Clang on Windows
  • Removes the dependency on the <codecvt> header (deprecated since C++17)
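The two-step conversion described in this section could look roughly like the sketch below. It is an illustration under stated assumptions rather than the exact code in the diff: the helper names `mb_to_wide` and `path_to_wide` are hypothetical, and returning an empty string on failure is one possible choice.

```cpp
#include <string>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>

// Hypothetical helper: convert `s` from code page `cp` to UTF-16.
// Returns false if the conversion fails (for CP_UTF8 with MB_ERR_INVALID_CHARS,
// that means the input is not valid UTF-8).
static bool mb_to_wide(UINT cp, DWORD flags, const std::string & s, std::wstring & out) {
    out.clear();
    if (s.empty()) {
        return true;
    }
    // First call: ask for the required length in wchar_t units.
    const int n = MultiByteToWideChar(cp, flags, s.data(), (int) s.size(), nullptr, 0);
    if (n <= 0) {
        return false;
    }
    out.resize(n);
    // Second call: convert into the sized buffer.
    return MultiByteToWideChar(cp, flags, s.data(), (int) s.size(), &out[0], n) == n;
}

// Hypothetical helper: strict UTF-8 first, then the system ANSI code page.
// Returns an empty string if both conversions fail.
static std::wstring path_to_wide(const std::string & path) {
    std::wstring w;
    if (mb_to_wide(CP_UTF8, MB_ERR_INVALID_CHARS, path, w)) {
        return w; // input was valid UTF-8
    }
    if (mb_to_wide(CP_ACP, 0, path, w)) {
        return w; // interpreted in the system ANSI code page
    }
    return std::wstring();
}
#endif // _WIN32
```

The wide string can then be handed to a wide-character file API such as `_wfopen`. Guarding on `_WIN32` rather than `_MSC_VER` keeps the same path available to MinGW and Clang builds.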

Relationship to #3610

This PR is complementary to #3610 (UTF-8 manifest for CLI):

| | #3610 (Manifest) | This PR (Library) |
|---|---|---|
| Layer | CLI binary (argv encoding) | Library (src/whisper.cpp) |
| Mechanism | Makes CRT provide UTF-8 argv | Library handles any encoding |
| Scope | Only whisper-cli | All callers of libwhisper |
| Windows version | Requires Win10 1903+ | All Windows versions |
| Third-party apps | Not covered | Covered |

Both can coexist: the manifest ensures the CLI sends UTF-8, and this fix ensures the library works correctly regardless.

Testing

Tested with model paths containing Chinese characters on Windows 10 (zh-CN, CP936):

  • E:\新文件夹 (2)\ggml-small-q8_0.bin — previously crashed with 0xC0000409, now loads correctly
  • ASCII-only paths — no behavior change
  • UTF-8 encoded paths (with manifest) — still work correctly

Fixes #2052 (for cases where the previous fix was insufficient)

Replace `std::codecvt_utf8` with `MultiByteToWideChar` for converting
model file paths to wide strings on Windows.

The previous code assumed `path_model` was always UTF-8 encoded, but
when whisper-cli is invoked via `main(argc, argv)`, the MSVC C runtime
converts the UTF-16 command line to the system ANSI code page (e.g.
CP936 for Chinese Windows), not UTF-8. Passing these ANSI bytes to
`codecvt_utf8::from_bytes()` causes `std::range_error`, which triggers
STATUS_STACK_BUFFER_OVERRUN (0xC0000409) and crashes the process.

The fix tries `MultiByteToWideChar(CP_UTF8)` first, and if the string
is not valid UTF-8, falls back to `MultiByteToWideChar(CP_ACP)`. This
correctly handles both:
- UTF-8 paths (from manifest-enabled or Unicode-aware callers)
- ANSI paths (from the default MSVC main() using the system code page)

Also changes the guard from `_MSC_VER` to `_WIN32` to cover MinGW/Clang
on Windows, and removes the deprecated `<codecvt>` header dependency.

Fixes model loading crashes for users with non-ASCII characters in their
model file paths (e.g. Chinese, Japanese, Korean, Hebrew, Arabic).
Successfully merging this pull request may close these issues.

Model fail to load if path contains Hebrew characters