Add Semantic Similarity chunker #6994
@dotnet-policy-service agree
Pull Request Overview
This PR introduces new chunking implementations for the DataIngestion library, adding support for splitting documents into chunks based on different strategies: semantic similarity, sections, markdown structure, and token-based chunking. The key changes include:
- New chunker implementations: `SemanticSimilarityChunker`, `SectionChunker`, `MarkdownChunker`, and `DocumentTokenChunker`
- Comprehensive test coverage for all new chunkers
- Addition of `System.Numerics.Tensors` package dependency for cosine similarity calculations
- Addition of `Microsoft.ML.Tokenizers.Data.O200kBase` package for tokenization support
- Test infrastructure with shared `TestEmbeddingGenerator` for testing AI-based chunking
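
For readers unfamiliar with the idea behind semantic similarity chunking, here is a minimal, language-neutral sketch in Python. It is not the PR's C# implementation: the function names, the fixed `threshold` parameter, and the keyword-counting `toy_embed` function are all assumptions made for illustration. The sketch embeds each sentence, compares neighbouring embeddings with cosine similarity, and starts a new chunk whenever the similarity drops below the threshold.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the PR
) -> List[List[str]]:
    """Group consecutive sentences; split where neighbours are dissimilar."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([sentences[i]])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentences[i])  # same topic: extend current chunk
    return chunks


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model: two keyword counts."""
    lower = text.lower()
    return [float(lower.count("cat")), float(lower.count("stock"))]


sentences = [
    "The cat sleeps.",
    "A cat purrs.",
    "Stock prices fell.",
    "The stock market closed.",
]
result = semantic_chunks(sentences, toy_embed)
```

With the toy embedding, the two cat sentences end up in one chunk and the two stock sentences in another. A real embedding generator would replace `toy_embed`, and the split criterion may well differ from a fixed threshold.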
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj | Adds System.Numerics.Tensors package dependency and removes trailing whitespace |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SemanticSimilarityChunker.cs | Implements semantic similarity-based chunking using embeddings and cosine similarity |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SectionChunker.cs | Implements section-based chunking that treats document sections as separate entities |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/MarkdownChunker.cs | Implements markdown header-based chunking with configurable header levels |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Implements token-based chunking with configurable overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Microsoft.Extensions.DataIngestion.Tests.csproj | Adds test dependencies and links to shared test helpers |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SemanticSimilarityChunkerTests.cs | Comprehensive tests for semantic similarity chunker |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SectionChunkerTests.cs | Tests for section-based chunking including nested sections |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/MarkdownChunkerTests.cs | Tests for markdown chunker with various header configurations |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentTokenChunkerTests.cs | Base tests for token-based chunking |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/OverlapDocumentTokenChunkerTests.cs | Tests for token chunking with overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/NoOverlapDocumentTokenChunkerTests.cs | Tests for token chunking without overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentChunkerTests.cs | Base test class with common test scenarios |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkAssertions.cs | Helper class for chunk assertions |
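
The "token-based chunking with configurable overlap" entry in the file summary can be illustrated with a short sliding-window sketch. This is an assumption-laden illustration in Python, not the code in `DocumentTokenChunker.cs`: the function name and the `max_tokens` and `overlap` parameters are hypothetical, and the tokens are any pre-tokenized sequence rather than the output of a real tokenizer.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def token_chunks(tokens: Sequence[T], max_tokens: int, overlap: int = 0) -> List[List[T]]:
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens.

    Hypothetical helper for illustration only.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[List[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks


tokens = list(range(10))
with_overlap = token_chunks(tokens, max_tokens=4, overlap=1)
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
without_overlap = token_chunks(tokens, max_tokens=4)
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Requiring `overlap < max_tokens` keeps the step positive, so the loop always makes progress.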
Big thanks for all your hard work and contribution to our product, @KrystofS!
I've left some comments, PTAL.
Overall looks good to me. PTAL at my comments (mostly nits). Thank you again, @KrystofS!
LGTM, thank you for your contribution, @KrystofS!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…nticSimilarityChunker.cs Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
Co-authored-by: Adam Sitnik <adam.sitnik@gmail.com>
20c8bd7 to 78afabb