Support basic CJK text sentence chunking by edwinyyyu · Pull Request #936 · MemMachine/MemMachine · GitHub

edwinyyyu (Contributor) commented Jan 9, 2026

Purpose of the change

Support splitting text into sentences for a more diverse set of languages.

Description

NLTK's sent_tokenize works best for English and other European languages.
Notably, it does not split sentences at ideographic full stops (。) or at fullwidth punctuation such as ！ and ？.
This change adds a regex pattern that further splits the text into sentences at those marks.
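
As a rough illustration of the idea described above, here is a minimal sketch of a second-pass splitter. It is not the code from this pull request: the names `split_cjk_sentences` and `chunk_sentences`, the set of terminator characters, and the `base_tokenizer` parameter are all assumptions made for the example. NLTK's `sent_tokenize` could be passed as `base_tokenizer`, but the sketch does not import it so that it runs without the Punkt data.

```python
import re
from typing import Callable, List, Optional

# Assumed terminator set (not taken from the PR): ideographic full stop,
# fullwidth exclamation mark, fullwidth question mark, and the halfwidth
# ideographic full stop.
_TERMINATORS = "。！？｡"

# Split right after a run of terminators, swallowing any whitespace that
# follows. The negative lookahead keeps runs such as "？！" together.
_CJK_BOUNDARY = re.compile(rf"(?<=[{_TERMINATORS}])(?![{_TERMINATORS}])\s*")


def split_cjk_sentences(text: str) -> List[str]:
    """Split text after CJK sentence-ending punctuation (hypothetical helper)."""
    parts = _CJK_BOUNDARY.split(text)
    # Drop empty pieces, e.g. the one produced by a terminator at end of text.
    return [p.strip() for p in parts if p.strip()]


def chunk_sentences(
    text: str,
    base_tokenizer: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Run an optional first-pass tokenizer, then split each piece further."""
    coarse = base_tokenizer(text) if base_tokenizer else [text]
    return [s for piece in coarse for s in split_cjk_sentences(piece)]


if __name__ == "__main__":
    for sentence in chunk_sentences("今天天气很好。我们去公园吧！你觉得呢？"):
        print(sentence)
```

A two-pass design like this leaves the first-pass tokenizer's behavior on European-language text untouched and only adds boundaries it would otherwise miss. The sketch does not handle closing quotes or brackets that follow a terminator, which a more general pattern would need to consider.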

Feedback on the generality of this approach is welcome.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

jealous (Contributor) commented Jan 21, 2026

Is it possible to add a test for each case?

edwinyyyu force-pushed the cjk_sentences branch 2 times, most recently from 40f028f to 223dc74, on January 22, 2026 01:26
edwinyyyu (Contributor, Author) commented Jan 22, 2026

> Is it possible to add a test for each case?

Added tests for the sentence chunking method.

Signed-off-by: Edwin Yu <edwinyyyu@gmail.com>
edwinyyyu marked this pull request as ready for review on January 22, 2026 01:40