WIP: Add LongMemEval evaluation scripts by edwinyyyu · Pull Request #680 · MemMachine/MemMachine · GitHub

@edwinyyyu edwinyyyu commented Dec 4, 2025

Purpose of the change

Add LongMemEval evaluation scripts.

Description

WIP. The scripts call internal APIs directly so that long-term memory is tested in isolation.

With the committed configuration, and when combined with #660 and #679, this achieves a 91.4% LLM judge score on LongMemEval.

GPT-5 generates the answers and GPT-4o acts as the judge, so the result is comparable with SuperMemory, which claims 84.6% with the same models and a maximum of 85.2% with gemini-3-pro-preview as the answering model.
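
For readers unfamiliar with the metric, here is a minimal sketch of how an LLM judge score can be aggregated: one model answers each question, a judge marks each answer correct or incorrect against the reference, and the score is the fraction marked correct. This is not the code in this PR. `Sample`, `judge_score`, and `substring_judge` are hypothetical names, and the judge here is a trivial offline stand-in for a real GPT-4o call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Sample:
    """One evaluated question (hypothetical structure, not the PR's)."""
    question: str
    reference: str  # gold answer from the benchmark
    response: str   # answer produced by the answering model


def judge_score(samples: Iterable[Sample], judge: Callable[[Sample], bool]) -> float:
    """Return the fraction of samples that the judge marks as correct."""
    verdicts = [judge(s) for s in samples]
    if not verdicts:
        raise ValueError("no samples to score")
    return sum(verdicts) / len(verdicts)


def substring_judge(sample: Sample) -> bool:
    """Offline stand-in for an LLM judge (assumption: a real judge would
    prompt a model with the question, reference, and response)."""
    return sample.reference.strip().lower() in sample.response.strip().lower()


samples = [
    Sample("Where did I travel last spring?", "Paris", "It is Paris."),
    Sample("How many books did I buy?", "42", "I think 41"),
    Sample("What color was the car?", "blue", "The color was blue"),
    Sample("Which day was the meeting?", "Tuesday", "Monday"),
]

score = judge_score(samples, substring_judge)
print(f"{score:.1%}")  # 50.0%
```

Passing the judge in as a callable keeps the aggregation independent of the judging model, so the same function can be reused when the answering or judging model changes.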

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (does not change functionality, e.g., code style improvements, linting)
  • Documentation update
  • Project Maintenance (updates to build scripts, CI, etc., that do not affect the main project)
  • Security (improves security without changing functionality)

How Has This Been Tested?

WIP

  • Unit Test
  • Integration Test
  • End-to-end Test
  • Test Script (please provide)
  • Manual verification (list step-by-step instructions)

Test Results: [Attach logs, screenshots, or relevant output]

Checklist

[Please delete options that are not relevant.]

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

@edwinyyyu edwinyyyu force-pushed the longmemeval-scripts branch 6 times, most recently from d3ad93b to addb80e Compare December 4, 2025 22:58
@edwinyyyu edwinyyyu requested review from tomw-mv and vinares December 5, 2025 00:16
@edwinyyyu edwinyyyu force-pushed the longmemeval-scripts branch 2 times, most recently from cf85229 to a249f82 Compare December 5, 2025 23:51
Signed-off-by: Edwin Yu <edwinyyyu@gmail.com>
@edwinyyyu edwinyyyu force-pushed the longmemeval-scripts branch 5 times, most recently from bf89c4c to 376a028 Compare December 10, 2025 01:58
Signed-off-by: Edwin Yu <edwinyyyu@gmail.com>