[Feature][Transform] Introduce tika transform by liugddx · Pull Request #9862 · apache/seatunnel · GitHub
liugddx (Member) commented Sep 15, 2025

Purpose of this pull request

#9861

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@liugddx liugddx marked this pull request as draft September 15, 2025 00:36
@nielifeng nielifeng requested a review from Copilot September 15, 2025 05:12
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces a new TikaDocument transform that enables extraction of text content and metadata from various document formats (PDF, Word, Excel, PowerPoint, etc.) using Apache Tika. The transform processes binary document data or base64-encoded strings and outputs structured fields containing extracted text, metadata, and document properties.

  • Implements Apache Tika integration for document parsing and content extraction
  • Provides configurable content processing options (whitespace normalization, empty line removal)
  • Includes comprehensive error handling with configurable strategies (skip, fail, null)
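
The options summarized above can be illustrated with a minimal, Tika-free sketch. Everything here is an assumption made for illustration: the class name `ContentProcessingSketch`, the `ErrorStrategy` enum, and the exact normalization rules are hypothetical and are not taken from the PR's `DefaultContentProcessor` or its configuration. Base64 decoding stands in for the real document parsing step so that the example runs with the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical sketch; names and rules are illustrative, not the PR's classes.
final class ContentProcessingSketch {

    // Assumed strategies mirroring the "skip, fail, null" options in the overview.
    enum ErrorStrategy { SKIP, FAIL, NULL }

    // Collapse runs of spaces/tabs into one space and drop empty lines.
    static String normalize(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String collapsed = line.replaceAll("[ \\t]+", " ").trim();
            if (!collapsed.isEmpty()) {
                kept.add(collapsed);
            }
        }
        return String.join("\n", kept);
    }

    // Decode each base64 input and normalize it; on a decoding error,
    // apply the chosen strategy (decoding stands in for document parsing).
    static List<String> processAll(List<String> base64Docs, ErrorStrategy strategy) {
        List<String> out = new ArrayList<>();
        for (String doc : base64Docs) {
            try {
                byte[] bytes = Base64.getDecoder().decode(doc);
                out.add(normalize(new String(bytes, StandardCharsets.UTF_8)));
            } catch (IllegalArgumentException e) {
                switch (strategy) {
                    case SKIP:
                        break; // drop the row entirely
                    case NULL:
                        out.add(null); // keep the row, null out the content
                        break;
                    case FAIL:
                    default:
                        throw new IllegalStateException("Document could not be decoded", e);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String ok = Base64.getEncoder()
                .encodeToString("a  b\n\n c".getBytes(StandardCharsets.UTF_8));
        System.out.println(processAll(Arrays.asList(ok, "!!!"), ErrorStrategy.NULL));
    }
}
```

In this sketch, SKIP drops the failing row, NULL keeps the row with a null content value, and FAIL aborts processing. How the actual transform maps these strategies onto SeaTunnel rows is not shown in this conversation.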

Reviewed Changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| seatunnel-transforms-v2/pom.xml | Adds Apache Tika dependencies (core and parsers) |
| TikaDocumentTransform.java | Main transform implementation handling document processing and field mapping |
| TikaDocumentTransformConfig.java | Configuration class defining all transform options and parsing |
| TikaDocumentTransformFactory.java | Factory class for creating transform instances |
| TikaDocumentExtractor.java | Apache Tika integration for document content extraction |
| DocumentMetadata.java | Data class representing extracted document metadata |
| DefaultContentProcessor.java | Implementation for post-processing extracted text content |
| E2E test files | Integration tests for single-table and multi-table scenarios |

Comment on lines 53 to 68
private static final Set<String> SUPPORTED_MIME_TYPES =
        new HashSet<String>() {
            {
                add("application/pdf");
                add("application/msword");
                add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
                add("application/vnd.ms-excel");
                add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
                add("application/vnd.ms-powerpoint");
                add(
                        "application/vnd.openxmlformats-officedocument.presentationml.presentation");
                add("text/plain");
                add("text/html");
                add("application/rtf");
            }
        };
Copilot AI Sep 15, 2025

[nitpick] Use modern collection initialization syntax instead of an anonymous inner class. Replace it with Set.of(), or with Arrays.asList() wrapped in new HashSet<>(), for better readability and performance.

Copilot uses AI. Check for mistakes.
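
A minimal sketch of the second option in this comment, assuming the module has to compile on Java 8. `Set.of()` was added in Java 9, so it is not used here. The class name `SupportedMimeTypesSketch` and the `isSupported` helper are hypothetical. Wrapping the set in `Collections.unmodifiableSet` is an extra choice made for this sketch, not something the comment asks for.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder class; only the MIME type strings come from the PR snippet.
final class SupportedMimeTypesSketch {

    // Java 8 compatible initialization without an anonymous HashSet subclass.
    static final Set<String> SUPPORTED_MIME_TYPES =
            Collections.unmodifiableSet(
                    new HashSet<>(
                            Arrays.asList(
                                    "application/pdf",
                                    "application/msword",
                                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                                    "application/vnd.ms-excel",
                                    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                                    "application/vnd.ms-powerpoint",
                                    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
                                    "text/plain",
                                    "text/html",
                                    "application/rtf")));

    // Illustrative helper, not part of the PR.
    static boolean isSupported(String mimeType) {
        return SUPPORTED_MIME_TYPES.contains(mimeType);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("application/pdf"));
    }
}
```

Besides readability, this avoids creating an extra anonymous class, which double-brace initialization does.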

Comment on lines 48 to 54
.defaultValue(
        new HashMap<String, String>() {
            {
                put("content", "extracted_text");
                put("content_type", "mime_type");
            }
        })
Copilot AI Sep 15, 2025

[nitpick] Use modern Map initialization syntax instead of an anonymous inner class. Replace it with Map.of() for better readability and performance.
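
`Map.of()` is only available from Java 9 onward. If the module must stay on Java 8, a small static helper gives the same effect. The sketch below is an assumption-laden illustration: the class name, the constant name, and the helper are hypothetical, and whether the option builder's `defaultValue(...)` accepts an unmodifiable map is not confirmed by this conversation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder class; only the two key/value pairs come from the PR snippet.
final class DefaultOutputFieldsSketch {

    // Built once by a helper instead of an anonymous HashMap subclass.
    static final Map<String, String> DEFAULT_OUTPUT_FIELDS = defaultOutputFields();

    private static Map<String, String> defaultOutputFields() {
        Map<String, String> fields = new HashMap<>();
        fields.put("content", "extracted_text");
        fields.put("content_type", "mime_type");
        // Unmodifiable wrapper is a choice made for this sketch.
        return Collections.unmodifiableMap(fields);
    }

    public static void main(String[] args) {
        System.out.println(DEFAULT_OUTPUT_FIELDS.get("content"));
    }
}
```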


Comment on lines 116 to 122
boolean found = false;
for (String expectedName : expectedNames) {
    if (expectedName.equals(outputColumns[i].getName())) {
        found = true;
        break;
    }
}
Copilot AI Sep 15, 2025

[nitpick] Use modern Java 8+ streams or convert expectedNames to a Set for more efficient lookup instead of nested loops. This could be replaced with Set.of(expectedNames).contains(outputColumns[i].getName()).
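
A Java 8 compatible sketch of the Set-based lookup. It uses `new HashSet<>(Arrays.asList(...))` rather than `Set.of(...)`, because `Set.of` requires Java 9 and throws on duplicate or null elements. The class and method names are hypothetical, and plain `String[]` arrays stand in for the column objects in the test under review.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; String arrays stand in for outputColumns[i].getName().
final class ColumnNameCheckSketch {

    // Returns true when every actual column name is one of the expected names.
    static boolean allExpected(String[] expectedNames, String[] actualNames) {
        // Build the lookup set once instead of scanning the array per column.
        Set<String> expected = new HashSet<>(Arrays.asList(expectedNames));
        for (String actual : actualNames) {
            if (!expected.contains(actual)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] expected = {"id", "extracted_text", "mime_type"};
        System.out.println(allExpected(expected, new String[] {"id", "mime_type"}));
    }
}
```

For a handful of column names in a test the performance difference is negligible, so the main gain is readability.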


Comment on lines 201 to 207
// Add all other metadata as custom metadata
DocumentMetadata tempMetadata = builder.build();
for (String name : metadata.names()) {
    if (!isStandardMetadata(name)) {
        tempMetadata.setCustomMetadata(name, metadata.get(name));
    }
}
Copilot AI Sep 15, 2025

The DocumentMetadata object is built prematurely and then modified, but the builder pattern suggests immutability. The tempMetadata object should not be used to set custom metadata after building. Consider collecting custom metadata before building the final object.

Suggested change
- // Add all other metadata as custom metadata
- DocumentMetadata tempMetadata = builder.build();
- for (String name : metadata.names()) {
-     if (!isStandardMetadata(name)) {
-         tempMetadata.setCustomMetadata(name, metadata.get(name));
-     }
- }
+ // Collect all other metadata as custom metadata before building
+ java.util.Map<String, String> customMetadata = new java.util.HashMap<>();
+ for (String name : metadata.names()) {
+     if (!isStandardMetadata(name)) {
+         customMetadata.put(name, metadata.get(name));
+     }
+ }
+ builder.customMetadata(customMetadata);
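
To make the intent of this suggestion concrete, here is a self-contained sketch of a builder that receives the custom metadata before `build()` and produces an object that cannot be changed afterwards. It is not the PR's `DocumentMetadata`: the class `DocumentMetadataSketch`, its single `title` field, the `isStandardMetadata` rule, and the `fromRaw` helper are all assumptions, and a plain `Map<String, String>` stands in for Tika's `Metadata` object.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the PR's DocumentMetadata; fields are illustrative.
final class DocumentMetadataSketch {
    private final String title;
    private final Map<String, String> customMetadata;

    private DocumentMetadataSketch(Builder b) {
        this.title = b.title;
        // Defensive copy so later changes to the builder cannot leak in.
        this.customMetadata =
                Collections.unmodifiableMap(new LinkedHashMap<>(b.customMetadata));
    }

    String getTitle() { return title; }

    Map<String, String> getCustomMetadata() { return customMetadata; }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private String title;
        private final Map<String, String> customMetadata = new LinkedHashMap<>();

        Builder title(String title) { this.title = title; return this; }

        Builder customMetadata(Map<String, String> custom) {
            this.customMetadata.putAll(custom);
            return this;
        }

        DocumentMetadataSketch build() { return new DocumentMetadataSketch(this); }
    }

    // Assumed rule: only "title" counts as standard metadata in this sketch.
    static boolean isStandardMetadata(String name) { return "title".equals(name); }

    // Collect custom entries first, then build once.
    static DocumentMetadataSketch fromRaw(Map<String, String> raw) {
        Map<String, String> custom = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            if (!isStandardMetadata(e.getKey())) {
                custom.put(e.getKey(), e.getValue());
            }
        }
        return builder().title(raw.get("title")).customMetadata(custom).build();
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", "Report");
        raw.put("X-Custom", "42");
        System.out.println(fromRaw(raw).getCustomMetadata());
    }
}
```

With this shape there is no need for a `setCustomMetadata` mutator on the built object, which is the inconsistency the comment points out.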


@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Sep 17, 2025
@github-actions github-actions bot added the core SeaTunnel core module label Sep 17, 2025
@liugddx liugddx marked this pull request as ready for review September 21, 2025 11:49
Labels: core, dependencies, document, e2e, Transform-v2
2 participants