fix: relaxed right-flanking check for CJK characters #1145
base: master
Pull request overview
This PR fixes emphasis delimiter recognition for CJK languages by relaxing the right-flanking delimiter rules when followed by CJK characters. In CJK languages, grammatical particles often attach directly to preceding tokens without spaces, which previously prevented emphasis from closing correctly (e.g., **JWT)**는 would not render the emphasis).
Key Changes
- Added `isCJK()` helper function to detect CJK Unified Ideographs and Hangul Syllables
- Modified `right_flanking` logic in `scanDelims` to treat delimiters as valid closers when followed by CJK characters
- Maintains strict rules for underscores to avoid unintended formatting in code-like strings
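The relaxed check described above can be sketched in isolation. This is a minimal illustration, not the markdown-it source: `rightFlanking`, `relaxForCJK`, `isWhiteSpace`, and `isPunct` are hypothetical names, the two regex helpers are simplified stand-ins for the library's own character classification, and characters are passed as strings rather than char codes. Only the `isCJK` ranges and the boolean expression are taken from the PR.

```javascript
// Hypothetical standalone sketch; ranges copied from the PR's isCJK helper.
function isCJK (code) {
  return (code >= 0x4E00 && code <= 0x9FFF) || // CJK Unified Ideographs
         (code >= 0xAC00 && code <= 0xD7A3)    // Hangul Syllables
}

// Simplified stand-ins for the library's whitespace/punctuation checks (assumption).
const isWhiteSpace = (ch) => /\s/.test(ch)
const isPunct = (ch) => /[\p{P}\p{S}]/u.test(ch)

// lastChar / nextChar: the characters immediately before and after the delimiter run.
// relaxForCJK toggles the PR's extra condition so both behaviors can be compared.
function rightFlanking (lastChar, nextChar, relaxForCJK) {
  const isLastWhiteSpace = isWhiteSpace(lastChar)
  const isLastPunctChar = isPunct(lastChar)
  const isNextWhiteSpace = isWhiteSpace(nextChar)
  const isNextPunctChar = isPunct(nextChar)
  const isNextCJK = relaxForCJK && isCJK(nextChar.codePointAt(0))

  return !isLastWhiteSpace &&
    (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar || isNextCJK)
}

// Closing run of '**JWT)**는': preceded by ')' and followed by '는'.
const strictResult = rightFlanking(')', '는', false)
const relaxedResult = rightFlanking(')', '는', true)

// Closing run of 'in**tra**word': preceded by 'a' and followed by 'w'.
const intraWord = rightFlanking('a', 'w', true)

console.log({ strictResult, relaxedResult, intraWord })
```

Under these assumptions the strict rule rejects the closer after `)`, the relaxed rule accepts it, and the intra-word case is unchanged because the preceding character is not punctuation.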
// - 0x4E00 - 0x9FFF : CJK Unified Ideographs
// - 0xAC00 - 0xD7A3 : Hangul Syllables
//
function isCJK (code) {
  return (code >= 0x4E00 && code <= 0x9FFF) ||
         (code >= 0xAC00 && code <= 0xD7A3)
The isCJK function currently covers CJK Unified Ideographs (0x4E00-0x9FFF) and Hangul Syllables (0xAC00-0xD7A3). Consider whether additional CJK-related ranges should be included for more comprehensive coverage:
- Hiragana (0x3040-0x309F): Japanese script, commonly used in similar contexts
- Katakana (0x30A0-0x30FF): Japanese script, commonly used in similar contexts
- CJK Unified Ideographs Extension A (0x3400-0x4DBF): Extended Chinese characters
These scripts may exhibit similar behavior to the reported issue with grammatical particles attaching directly to preceding tokens. If these are intentionally excluded to keep the scope minimal, consider documenting this limitation.
- // - 0x4E00 - 0x9FFF : CJK Unified Ideographs
- // - 0xAC00 - 0xD7A3 : Hangul Syllables
- //
- function isCJK (code) {
-   return (code >= 0x4E00 && code <= 0x9FFF) ||
-          (code >= 0xAC00 && code <= 0xD7A3)
+ // - 0x3040 - 0x309F : Hiragana
+ // - 0x30A0 - 0x30FF : Katakana
+ // - 0x3400 - 0x4DBF : CJK Unified Ideographs Extension A
+ // - 0x4E00 - 0x9FFF : CJK Unified Ideographs
+ // - 0xAC00 - 0xD7A3 : Hangul Syllables
+ //
+ function isCJK (code) {
+   return (code >= 0x3040 && code <= 0x309F) || // Hiragana
+          (code >= 0x30A0 && code <= 0x30FF) || // Katakana
+          (code >= 0x3400 && code <= 0x4DBF) || // CJK Unified Ideographs Extension A
+          (code >= 0x4E00 && code <= 0x9FFF) || // CJK Unified Ideographs
+          (code >= 0xAC00 && code <= 0xD7A3)    // Hangul Syllables
+ const isNextCJK = isCJK(nextChar)

  const left_flanking =
    !isNextWhiteSpace && (!isNextPunctChar || isLastWhiteSpace || isLastPunctChar)
  const right_flanking =
-   !isLastWhiteSpace && (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar)
+   !isLastWhiteSpace && (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar || isNextCJK)
The new CJK character handling in the delimiter scanning logic lacks test coverage. Given that the repository has comprehensive test fixtures for other emphasis scenarios (e.g., in test/fixtures/markdown-it/commonmark_extras.txt), tests should be added for the CJK use cases.
Suggested test cases to add:
- `**JWT(JSON Web Token)**는` → `<strong>JWT(JSON Web Token)</strong>는`
- CJK characters with other delimiters (e.g., `*text*는`, `__text__는`)
- Edge cases with multiple CJK characters
- Ensuring existing intra-word emphasis behavior remains unchanged (e.g., `in**tra**word`)
Tests ensure this fix works correctly and prevent regressions in future changes.
@utact Have you tried Demo with bench: https://tats-u.github.io/markdown-cjk-friendly/?sc8=KipKV1QoSlNPTiBXZWIgVG9rZW4pKirripQ&gfm=1&engine=markdown-it&bench=1 (you can omit the trailing |
Motivation
In CJK languages, grammatical particles (e.g., `는`, `가`) often attach directly to the preceding token without a space.
Currently, `markdown-it` fails to recognize a closing emphasis delimiter if it is preceded by punctuation and immediately followed by a CJK character (e.g., `**JWT)**는`).
This occurs because the current `scanDelims` logic strictly follows the CommonMark rules for "right-flanking delimiter runs". It interprets the sequence `Punctuation` + `Delimiter` + `Letter (CJK)` as a non-closing condition (similar to intra-word delimiters like `a**b**c`), preventing the emphasis from closing.
Changes
- `StateInline.prototype.scanDelims` now treats the delimiter as right-flanking (a valid closer) if the following character is a CJK character, even when the preceding character is punctuation. The preceding character must still not be whitespace.
Verification
Correctness
- `**JWT(JSON Web Token)**는` now correctly renders as `<strong>JWT(JSON Web Token)</strong>는`.
- Existing intra-word emphasis (e.g., `in**tra**word`) remains unaffected.
- Strict rules for underscores (e.g., `__test)__는`) are preserved (underscores are not allowed to close in this context due to stricter Left-Flanking rules), preventing unintended formatting in code-like strings.
Performance (Benchmarks)
Ran ./benchmark/benchmark.mjs to ensure no performance degradation.
Result: No measurable regression.
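The underscore behavior noted under Correctness can be sketched as follows. This is a hedged illustration rather than the library's code: `canClose`, `isWhiteSpace`, and `isPunct` are hypothetical names with simplified character classification, and the extra condition for `_` follows the CommonMark rule that an underscore run can close only if it is right-flanking and either not left-flanking or followed by punctuation.

```javascript
// Hypothetical sketch; simplified helpers, not the markdown-it source.
const isCJK = (code) =>
  (code >= 0x4E00 && code <= 0x9FFF) || (code >= 0xAC00 && code <= 0xD7A3)
const isWhiteSpace = (ch) => /\s/.test(ch)
const isPunct = (ch) => /[\p{P}\p{S}]/u.test(ch)

// marker: '*' or '_'; lastChar / nextChar: characters around the delimiter run.
function canClose (marker, lastChar, nextChar) {
  const isLastWhiteSpace = isWhiteSpace(lastChar)
  const isLastPunctChar = isPunct(lastChar)
  const isNextWhiteSpace = isWhiteSpace(nextChar)
  const isNextPunctChar = isPunct(nextChar)
  const isNextCJK = isCJK(nextChar.codePointAt(0))

  const leftFlanking =
    !isNextWhiteSpace && (!isNextPunctChar || isLastWhiteSpace || isLastPunctChar)
  const rightFlanking =
    !isLastWhiteSpace && (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar || isNextCJK)

  // '*' may close whenever the run is right-flanking.
  if (marker === '*') return rightFlanking
  // '_' additionally requires: not left-flanking, or followed by punctuation.
  return rightFlanking && (!leftFlanking || isNextPunctChar)
}

const starCloses = canClose('*', ')', '는')       // closing run of '**JWT)**는'
const underscoreCloses = canClose('_', ')', '는') // closing run of '__test)__는'

console.log({ starCloses, underscoreCloses })
```

In this sketch the relaxed right-flanking check lets `**` close before `는`, while `__` still cannot close because the run is also left-flanking and the next character is not punctuation.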