[Feature][Transform] Introduce tika transform #9862
base: dev
Pull Request Overview
This PR introduces a new TikaDocument transform that enables extraction of text content and metadata from various document formats (PDF, Word, Excel, PowerPoint, etc.) using Apache Tika. The transform processes binary document data or base64-encoded strings and outputs structured fields containing extracted text, metadata, and document properties.
- Implements Apache Tika integration for document parsing and content extraction
- Provides configurable content processing options (whitespace normalization, empty line removal)
- Handles extraction errors with one of three configurable strategies (skip, fail, null)
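To make the content processing options above concrete, here is a minimal sketch of whitespace normalization and empty-line removal in plain Java. The class and method names are hypothetical and are not taken from the PR's `DefaultContentProcessor`; the exact rules the transform applies may differ.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the PR's DefaultContentProcessor.
final class ContentProcessingSketch {
    private ContentProcessingSketch() {}

    /** Collapses runs of spaces and tabs inside each line and trims both ends of the line. */
    static String normalizeWhitespace(String text) {
        return Arrays.stream(text.split("\\R", -1))
                .map(line -> line.replaceAll("[ \\t]+", " ").trim())
                .collect(Collectors.joining("\n"));
    }

    /** Drops lines that are empty or contain only whitespace. */
    static String removeEmptyLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.joining("\n"));
    }
}
```

Applying `normalizeWhitespace` before `removeEmptyLines` means lines that contained only spaces or tabs are already empty by the time they are filtered out.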
Reviewed Changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| seatunnel-transforms-v2/pom.xml | Adds Apache Tika dependencies (core and parsers) |
| TikaDocumentTransform.java | Main transform implementation handling document processing and field mapping |
| TikaDocumentTransformConfig.java | Configuration class defining all transform options and parsing |
| TikaDocumentTransformFactory.java | Factory class for creating transform instances |
| TikaDocumentExtractor.java | Apache Tika integration for document content extraction |
| DocumentMetadata.java | Data class representing extracted document metadata |
| DefaultContentProcessor.java | Implementation for post-processing extracted text content |
| E2E test files | Integration tests for single-table and multi-table scenarios |
private static final Set<String> SUPPORTED_MIME_TYPES =
        new HashSet<String>() {
            {
                add("application/pdf");
                add("application/msword");
                add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
                add("application/vnd.ms-excel");
                add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
                add("application/vnd.ms-powerpoint");
                add(
                        "application/vnd.openxmlformats-officedocument.presentationml.presentation");
                add("text/plain");
                add("text/html");
                add("application/rtf");
            }
        };
[nitpick] Use modern collection initialization syntax instead of an anonymous inner class (double-brace initialization). Replace it with Set.of() (Java 9+) or with Arrays.asList() wrapped in new HashSet<>() for better readability and performance.
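A sketch of the `Arrays.asList()` variant from this comment, assuming the module still has to compile on Java 8, where `Set.of()` is not available. The wrapper class `SupportedMimeTypes` and the `isSupported` helper are hypothetical names used only for illustration; the MIME types are the ones from the snippet under review.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder class; in the PR the constant lives inside the extractor.
final class SupportedMimeTypes {
    private SupportedMimeTypes() {}

    // Built once from a fixed list, then wrapped so callers cannot modify it.
    static final Set<String> SUPPORTED_MIME_TYPES =
            Collections.unmodifiableSet(
                    new HashSet<>(
                            Arrays.asList(
                                    "application/pdf",
                                    "application/msword",
                                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                                    "application/vnd.ms-excel",
                                    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                                    "application/vnd.ms-powerpoint",
                                    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
                                    "text/plain",
                                    "text/html",
                                    "application/rtf")));

    static boolean isSupported(String mimeType) {
        return SUPPORTED_MIME_TYPES.contains(mimeType);
    }
}
```

Unlike double-brace initialization, this creates no extra anonymous subclass of `HashSet`, and the unmodifiable wrapper makes accidental mutation fail fast with `UnsupportedOperationException`.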
.defaultValue(
        new HashMap<String, String>() {
            {
                put("content", "extracted_text");
                put("content_type", "mime_type");
            }
        })
[nitpick] Use modern Map initialization syntax instead of an anonymous inner class. Replace it with Map.of() (Java 9+) for better readability and performance.
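`Map.of()` only exists from Java 9 onward, so the following is a sketch of a Java 8 compatible alternative that still avoids the anonymous inner class. The class name `DefaultFieldMapping` and the constant name are hypothetical, and whether the option's `defaultValue(...)` accepts an unmodifiable map is an assumption that would need checking against the project.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the default mapping shown in the snippet under review.
final class DefaultFieldMapping {
    private DefaultFieldMapping() {}

    static final Map<String, String> DEFAULT = buildDefault();

    private static Map<String, String> buildDefault() {
        // LinkedHashMap keeps insertion order, which makes the mapping predictable in logs and docs.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("content", "extracted_text");
        mapping.put("content_type", "mime_type");
        return Collections.unmodifiableMap(mapping);
    }
}
```

The option definition could then pass `DefaultFieldMapping.DEFAULT` instead of building the map inline.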
boolean found = false;
for (String expectedName : expectedNames) {
    if (expectedName.equals(outputColumns[i].getName())) {
        found = true;
        break;
    }
}
[nitpick] Use Java 8+ streams, or convert expectedNames to a Set, for a more efficient lookup than the nested loops. The inner loop could be replaced with Set.of(expectedNames).contains(outputColumns[i].getName()).
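A sketch of the Set-based lookup, written against plain string arrays so it stays self-contained. The class `ColumnNameLookup` and the method `allExpected` are hypothetical; in the test under review the actual names would come from `outputColumns[i].getName()`. The set is built once with `new HashSet<>(Arrays.asList(...))`, which also works on Java 8, rather than calling `Set.of(...)` on every iteration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the PR.
final class ColumnNameLookup {
    private ColumnNameLookup() {}

    /** Returns true when every actual column name appears among the expected names. */
    static boolean allExpected(String[] expectedNames, String[] actualNames) {
        // Build the lookup set once instead of scanning expectedNames for each column.
        Set<String> expected = new HashSet<>(Arrays.asList(expectedNames));
        return Arrays.stream(actualNames).allMatch(expected::contains);
    }
}
```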
// Add all other metadata as custom metadata
DocumentMetadata tempMetadata = builder.build();
for (String name : metadata.names()) {
    if (!isStandardMetadata(name)) {
        tempMetadata.setCustomMetadata(name, metadata.get(name));
    }
}
The DocumentMetadata object is built prematurely and then modified, but the builder pattern suggests immutability. The tempMetadata object should not be used to set custom metadata after building. Consider collecting custom metadata before building the final object.
Suggested change:
- // Add all other metadata as custom metadata
- DocumentMetadata tempMetadata = builder.build();
- for (String name : metadata.names()) {
-     if (!isStandardMetadata(name)) {
-         tempMetadata.setCustomMetadata(name, metadata.get(name));
-     }
- }
+ // Collect all other metadata as custom metadata before building
+ java.util.Map<String, String> customMetadata = new java.util.HashMap<>();
+ for (String name : metadata.names()) {
+     if (!isStandardMetadata(name)) {
+         customMetadata.put(name, metadata.get(name));
+     }
+ }
+ builder.customMetadata(customMetadata);
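For the suggestion to work, the builder needs a `customMetadata(...)` method and the built object should no longer expose a setter. The following is a minimal sketch of what such a class could look like; `DocumentMetadataSketch` and its single `title` field are hypothetical stand-ins, and the real `DocumentMetadata` in the PR may be shaped differently.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for DocumentMetadata; field set reduced to keep the sketch short.
final class DocumentMetadataSketch {
    private final String title;
    private final Map<String, String> customMetadata;

    private DocumentMetadataSketch(Builder builder) {
        this.title = builder.title;
        // Defensive copy plus unmodifiable view: the built object cannot change afterwards.
        this.customMetadata =
                Collections.unmodifiableMap(new LinkedHashMap<>(builder.customMetadata));
    }

    String getTitle() {
        return title;
    }

    Map<String, String> getCustomMetadata() {
        return customMetadata;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String title;
        private final Map<String, String> customMetadata = new LinkedHashMap<>();

        Builder title(String title) {
            this.title = title;
            return this;
        }

        // All custom entries are supplied before build(), as the review comment recommends.
        Builder customMetadata(Map<String, String> entries) {
            this.customMetadata.putAll(entries);
            return this;
        }

        DocumentMetadataSketch build() {
            return new DocumentMetadataSketch(this);
        }
    }
}
```

Because the constructor copies the builder's map, later changes to the builder or to the caller's map do not leak into an already built instance.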
Purpose of this pull request
#9861
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide