fix: relaxed right-flanking check for CJK characters by utact · Pull Request #1145 · markdown-it/markdown-it

utact commented Jan 4, 2026

Motivation

In CJK languages, grammatical particles (e.g., Korean 는) often attach directly to the preceding token without a space.
Currently, markdown-it fails to recognize a closing emphasis delimiter if it is preceded by punctuation and immediately followed by a CJK character (e.g., **JWT)**는).

This occurs because the current scanDelims logic strictly follows the CommonMark definition of a "right-flanking delimiter run". A run that is preceded by punctuation and followed by a letter (here, a CJK character) is left-flanking but not right-flanking, so it can never act as a closer and the emphasis stays open.

Changes

  • Introduced an isCJK(code) helper in lib/rules_inline/state_inline.mjs to detect CJK Unified Ideographs and Hangul Syllables.
  • Modified StateInline.prototype.scanDelims to treat the delimiter as right-flanking (a valid closer) when the following character is a CJK character, even if the preceding character is punctuation. A delimiter preceded by whitespace is still never right-flanking.
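
The effect of the change on the flanking test can be sketched in isolation. The snippet below is a simplified model, not markdown-it's code: `isWhite` and `isPunct` are hypothetical ASCII-only stand-ins for the library's whitespace and punctuation checks, and `flanking` with its `relaxCJK` flag is an illustrative helper. Only `isCJK` and the two boolean expressions follow the PR.

```javascript
// Assumption: ASCII-only stand-ins for the library's character classification.
const isWhite = (c) => c === 0x20 || c === 0x09 || c === 0x0A
const isPunct = (c) => (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
                       (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E)

// Ranges as listed in the PR description.
function isCJK (code) {
  return (code >= 0x4E00 && code <= 0x9FFF) ||
         (code >= 0xAC00 && code <= 0xD7A3)
}

// Hypothetical helper: classify the delimiter run occupying src[start, end).
// Text boundaries are treated as whitespace, as in CommonMark.
function flanking (src, start, end, relaxCJK) {
  const last = start > 0 ? src.charCodeAt(start - 1) : 0x20
  const next = end < src.length ? src.charCodeAt(end) : 0x20
  const lastWS = isWhite(last)
  const nextWS = isWhite(next)
  const lastP = isPunct(last)
  const nextP = isPunct(next)
  const nextCJK = relaxCJK && isCJK(next)
  return {
    left: !nextWS && (!nextP || lastWS || lastP),
    right: !lastWS && (!lastP || nextWS || nextP || nextCJK)
  }
}

const src = '**JWT)**는'
const closer = src.lastIndexOf('**')
console.log(flanking(src, closer, closer + 2, false).right) // false (original rule)
console.log(flanking(src, closer, closer + 2, true).right)  // true  (relaxed rule)
```

In this model the whitespace guard (`!lastWS`) still applies, so a run preceded by a space does not become a closer just because a CJK character follows it.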

Verification

Correctness

  • Issue Case Fixed: **JWT(JSON Web Token)**는 now correctly renders as <strong>JWT(JSON Web Token)</strong>는.
  • Safety Check:
    • Standard intra-word emphasis (e.g., in**tra**word) remains unaffected.
    • Underscore strictness rules are preserved: in __test)__는 the closing __ is also left-flanking and is not followed by punctuation, so CommonMark's stricter rules for _ still prevent it from closing. This avoids unintended formatting in code-like strings.
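
To see why the asterisk case closes while the underscore case does not, it helps to apply the CommonMark open/close conditions on top of the relaxed flanking test. The following is a self-contained sketch under assumptions: `classify`, `isWhite`, and `isPunct` are hypothetical helpers with ASCII-only classification, and the function models only a single delimiter run rather than the full inline parser.

```javascript
// Assumption: ASCII-only stand-ins for the library's character classification.
const isWhite = (c) => c === 0x20 || c === 0x09 || c === 0x0A
const isPunct = (c) => (c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
                       (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E)
const isCJK = (c) => (c >= 0x4E00 && c <= 0x9FFF) || (c >= 0xAC00 && c <= 0xD7A3)

// Hypothetical helper: decide whether the run of identical markers starting
// at `start` may open and/or close emphasis, using the relaxed right-flanking test.
function classify (src, start) {
  const marker = src[start]
  let end = start
  while (end < src.length && src[end] === marker) end++

  const last = start > 0 ? src.charCodeAt(start - 1) : 0x20
  const next = end < src.length ? src.charCodeAt(end) : 0x20
  const lastWS = isWhite(last)
  const nextWS = isWhite(next)
  const lastP = isPunct(last)
  const nextP = isPunct(next)

  const left = !nextWS && (!nextP || lastWS || lastP)
  const right = !lastWS && (!lastP || nextWS || nextP || isCJK(next))

  if (marker === '_') {
    // Underscores may not split words: a run that is flanking on both
    // sides needs adjacent punctuation to open or close.
    return { canOpen: left && (!right || lastP), canClose: right && (!left || nextP) }
  }
  return { canOpen: left, canClose: right }
}

const closerOf = (s, delim) => classify(s, s.lastIndexOf(delim))

console.log(closerOf('**test)**는', '**').canClose)   // true
console.log(closerOf('__test)__는', '__').canClose)   // false
console.log(closerOf('in**tra**word', '**').canClose) // true
```

The last case gives the same answer with or without the CJK term, since the character after the run is not CJK and the run is not preceded by punctuation.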
[Image: Screenshot 2026-01-04 190315]

Performance (Benchmarks)

Ran ./benchmark/benchmark.mjs to ensure no performance degradation.

  • Baseline: ~24,400 ops/sec
  • Patched: ~24,400 ops/sec
  • Result: no measurable regression.

Copilot AI review requested due to automatic review settings January 4, 2026 10:06
Copilot AI left a comment

Pull request overview

This PR fixes emphasis delimiter recognition for CJK languages by relaxing the right-flanking delimiter rules when followed by CJK characters. In CJK languages, grammatical particles often attach directly to preceding tokens without spaces, which previously prevented emphasis from closing correctly (e.g., **JWT)**는 would not render the emphasis).

Key Changes

  • Added isCJK() helper function to detect CJK Unified Ideographs and Hangul Syllables
  • Modified right_flanking logic in scanDelims to treat delimiters as valid closers when followed by CJK characters
  • Maintains strict rules for underscores to avoid unintended formatting in code-like strings


Comment on lines +9 to +14
// - 0x4E00 - 0x9FFF : CJK Unified Ideographs
// - 0xAC00 - 0xD7A3 : Hangul Syllables
//
function isCJK (code) {
  return (code >= 0x4E00 && code <= 0x9FFF) ||
         (code >= 0xAC00 && code <= 0xD7A3)
Copilot AI Jan 4, 2026

The isCJK function currently covers CJK Unified Ideographs (0x4E00-0x9FFF) and Hangul Syllables (0xAC00-0xD7A3). Consider whether additional CJK-related ranges should be included for more comprehensive coverage:

  • Hiragana (0x3040-0x309F): Japanese script, commonly used in similar contexts
  • Katakana (0x30A0-0x30FF): Japanese script, commonly used in similar contexts
  • CJK Unified Ideographs Extension A (0x3400-0x4DBF): Extended Chinese characters

These scripts may exhibit similar behavior to the reported issue with grammatical particles attaching directly to preceding tokens. If these are intentionally excluded to keep the scope minimal, consider documenting this limitation.

Suggested change
- // - 0x4E00 - 0x9FFF : CJK Unified Ideographs
- // - 0xAC00 - 0xD7A3 : Hangul Syllables
- //
- function isCJK (code) {
-   return (code >= 0x4E00 && code <= 0x9FFF) ||
-          (code >= 0xAC00 && code <= 0xD7A3)
+ // - 0x3040 - 0x309F : Hiragana
+ // - 0x30A0 - 0x30FF : Katakana
+ // - 0x3400 - 0x4DBF : CJK Unified Ideographs Extension A
+ // - 0x4E00 - 0x9FFF : CJK Unified Ideographs
+ // - 0xAC00 - 0xD7A3 : Hangul Syllables
+ //
+ function isCJK (code) {
+   return (code >= 0x3040 && code <= 0x309F) || // Hiragana
+          (code >= 0x30A0 && code <= 0x30FF) || // Katakana
+          (code >= 0x3400 && code <= 0x4DBF) || // CJK Unified Ideographs Extension A
+          (code >= 0x4E00 && code <= 0x9FFF) || // CJK Unified Ideographs
+          (code >= 0xAC00 && code <= 0xD7A3)    // Hangul Syllables
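
If the list of ranges keeps growing, a table-driven check may be easier to audit than a chain of comparisons. This is a minimal sketch, assuming the five ranges from the suggested change; `CJK_RANGES` and `isCJKCode` are hypothetical names and are not part of markdown-it.

```javascript
// Inclusive [low, high] UTF-16 code unit ranges, taken from the suggested change.
const CJK_RANGES = [
  [0x3040, 0x309F], // Hiragana
  [0x30A0, 0x30FF], // Katakana
  [0x3400, 0x4DBF], // CJK Unified Ideographs Extension A
  [0x4E00, 0x9FFF], // CJK Unified Ideographs
  [0xAC00, 0xD7A3]  // Hangul Syllables
]

// Hypothetical helper: true if `code` falls inside any listed range.
function isCJKCode (code) {
  return CJK_RANGES.some(([lo, hi]) => code >= lo && code <= hi)
}

console.log(isCJKCode('는'.charCodeAt(0))) // true  (Hangul Syllables)
console.log(isCJKCode('の'.charCodeAt(0))) // true  (Hiragana)
console.log(isCJKCode('a'.charCodeAt(0)))  // false
```

All five ranges lie in the Basic Multilingual Plane, so comparing single code units is sufficient here. Ranges outside it (for example, CJK Unified Ideographs Extension B) would require working with full code points instead. Whether a table or an explicit comparison chain is preferable in a hot path such as scanDelims is a trade-off the benchmark would need to settle.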

Comment on lines +120 to +125
+ const isNextCJK = isCJK(nextChar)

  const left_flanking =
    !isNextWhiteSpace && (!isNextPunctChar || isLastWhiteSpace || isLastPunctChar)
  const right_flanking =
-   !isLastWhiteSpace && (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar)
+   !isLastWhiteSpace && (!isLastPunctChar || isNextWhiteSpace || isNextPunctChar || isNextCJK)
Copilot AI Jan 4, 2026

The new CJK character handling in the delimiter scanning logic lacks test coverage. Given that the repository has comprehensive test fixtures for other emphasis scenarios (e.g., in test/fixtures/markdown-it/commonmark_extras.txt), tests should be added for the CJK use cases.

Suggested test cases to add:

  • **JWT(JSON Web Token)**는 → <strong>JWT(JSON Web Token)</strong>는
  • CJK characters with other delimiters (e.g., *text*는, __text__는)
  • Edge cases with multiple CJK characters
  • Ensuring existing intra-word emphasis behavior remains unchanged (e.g., in**tra**word)

Tests ensure this fix works correctly and prevent regressions in future changes.

tats-u commented Jan 15, 2026
