[Feature][Transform] Introduce tika transform #9862

liugddx · 2025-09-15T00:36:39Z

Purpose of this pull request

#9861

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide.
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

Copilot

Pull Request Overview

This PR introduces a new TikaDocument transform that enables extraction of text content and metadata from various document formats (PDF, Word, Excel, PowerPoint, etc.) using Apache Tika. The transform processes binary document data or base64-encoded strings and outputs structured fields containing extracted text, metadata, and document properties.
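As a rough illustration of the two input shapes mentioned above (raw bytes or a base64-encoded string), the following sketch normalizes a field value into document bytes. The class and method names are hypothetical and are not taken from the PR; only `java.util.Base64` from the JDK is used.

```java
import java.util.Base64;

/** Hypothetical helper, not part of the PR: turns a source field value into document bytes. */
final class DocumentInputSketch {

    private DocumentInputSketch() {}

    /**
     * Accepts a byte[] (returned as-is) or a base64-encoded String.
     * The basic decoder throws IllegalArgumentException on malformed base64.
     */
    static byte[] toDocumentBytes(Object fieldValue) {
        if (fieldValue instanceof byte[]) {
            return (byte[]) fieldValue;
        }
        if (fieldValue instanceof String) {
            return Base64.getDecoder().decode(((String) fieldValue).trim());
        }
        throw new IllegalArgumentException(
                "Unsupported source field type: "
                        + (fieldValue == null ? "null" : fieldValue.getClass().getName()));
    }
}
```

The resulting bytes could then be wrapped in a `ByteArrayInputStream` and handed to a parser; how the PR actually does this is not shown in the conversation.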

Implements Apache Tika integration for document parsing and content extraction
Provides configurable content processing options (whitespace normalization, empty line removal)
Includes comprehensive error handling with configurable strategies (skip, fail, null)
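The content processing options and error strategies listed above could look roughly like the sketch below. Everything here is an assumption for illustration: the names `ErrorStrategy`, `postProcess`, and `extract` are hypothetical, and the semantics chosen for skip (drop the row), fail (rethrow), and null (keep the row with a null value) are one plausible reading of the overview, not confirmed behavior of the PR.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of post-processing and error strategies; not the PR's code. */
final class TikaOptionsSketch {

    /** Assumed names for the three strategies mentioned in the overview. */
    enum ErrorStrategy { SKIP, FAIL, NULL }

    /** Outcome for one row: dropped, or kept with a possibly null text value. */
    static final class Result {
        final boolean dropped;
        final String text;

        Result(boolean dropped, String text) {
            this.dropped = dropped;
            this.text = text;
        }
    }

    private TikaOptionsSketch() {}

    static String postProcess(String text, boolean normalizeWhitespace, boolean removeEmptyLines) {
        String out = text;
        if (normalizeWhitespace) {
            // Collapse runs of spaces and tabs into one space; line breaks are kept.
            out = out.replaceAll("[ \\t]+", " ");
        }
        if (removeEmptyLines) {
            // Remove lines that are empty or contain only spaces and tabs.
            out = out.replaceAll("(?m)^[ \\t]*\\r?\\n", "");
        }
        return out.trim();
    }

    static Result extract(Supplier<String> extractor, ErrorStrategy strategy) {
        try {
            return new Result(false, extractor.get());
        } catch (RuntimeException e) {
            switch (strategy) {
                case SKIP:
                    return new Result(true, null); // assumed: the row is dropped
                case NULL:
                    return new Result(false, null); // assumed: the row is kept with a null value
                case FAIL:
                default:
                    throw e; // assumed: the failure propagates to the caller
            }
        }
    }
}
```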

Reviewed Changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.


File	Description
seatunnel-transforms-v2/pom.xml	Adds Apache Tika dependencies (core and parsers)
TikaDocumentTransform.java	Main transform implementation handling document processing and field mapping
TikaDocumentTransformConfig.java	Configuration class defining all transform options and parsing
TikaDocumentTransformFactory.java	Factory class for creating transform instances
TikaDocumentExtractor.java	Apache Tika integration for document content extraction
DocumentMetadata.java	Data class representing extracted document metadata
DefaultContentProcessor.java	Implementation for post-processing extracted text content
E2E test files	Integration tests for single-table and multi-table scenarios


Copilot · 2025-09-15T05:14:03Z

...c/main/java/org/apache/seatunnel/transform/tikadocument/extractor/TikaDocumentExtractor.java

+    private static final Set<String> SUPPORTED_MIME_TYPES =
+            new HashSet<String>() {
+                {
+                    add("application/pdf");
+                    add("application/msword");
+                    add("application/vnd.openxmlformats-officedocument.wordprocessingml.document");
+                    add("application/vnd.ms-excel");
+                    add("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
+                    add("application/vnd.ms-powerpoint");
+                    add(
+                            "application/vnd.openxmlformats-officedocument.presentationml.presentation");
+                    add("text/plain");
+                    add("text/html");
+                    add("application/rtf");
+                }
+            };


[nitpick] Use modern collection initialization syntax instead of anonymous inner class. Replace with Set.of() or Arrays.asList() wrapped in new HashSet<>() for better readability and performance.
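One caveat on this suggestion: `Set.of()` requires Java 9 or later, so it only applies if the module does not need to compile against Java 8 (the conversation does not state the target). A minimal sketch of the `Arrays.asList()` variant, which also works on Java 8, might look like this. The class name and the `isSupported` helper are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder class; only the constant mirrors the diff above. */
final class SupportedMimeTypesSketch {

    // Built once from a fixed list and wrapped so callers cannot modify it.
    static final Set<String> SUPPORTED_MIME_TYPES =
            Collections.unmodifiableSet(
                    new HashSet<>(
                            Arrays.asList(
                                    "application/pdf",
                                    "application/msword",
                                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                                    "application/vnd.ms-excel",
                                    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                                    "application/vnd.ms-powerpoint",
                                    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
                                    "text/plain",
                                    "text/html",
                                    "application/rtf")));

    private SupportedMimeTypesSketch() {}

    /** Hypothetical convenience check, null-safe. */
    static boolean isSupported(String mimeType) {
        return mimeType != null && SUPPORTED_MIME_TYPES.contains(mimeType);
    }
}
```

Compared with the double-brace form, this avoids creating an extra anonymous subclass of `HashSet` and makes the constant read-only.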

Copilot · 2025-09-15T05:14:03Z

...2/src/main/java/org/apache/seatunnel/transform/tikadocument/TikaDocumentTransformConfig.java

+                    .defaultValue(
+                            new HashMap<String, String>() {
+                                {
+                                    put("content", "extracted_text");
+                                    put("content_type", "mime_type");
+                                }
+                            })


[nitpick] Use modern Map initialization syntax instead of anonymous inner class. Replace with Map.of() for better readability and performance.
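As with `Set.of()`, `Map.of()` is only available from Java 9. If Java 8 compatibility matters, a small factory method is one alternative to the double-brace initializer. The sketch below is an assumption about how that could be written; the class and method names are hypothetical, and whether the option's `defaultValue(...)` accepts an unmodifiable map is not confirmed by the conversation.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the default mapping shown in the diff above. */
final class DefaultOutputFieldsSketch {

    private DefaultOutputFieldsSketch() {}

    /** Builds the default mapping without an anonymous HashMap subclass. */
    static Map<String, String> defaultOutputFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", "extracted_text");
        fields.put("content_type", "mime_type");
        return Collections.unmodifiableMap(fields);
    }
}
```

The option definition could then call `.defaultValue(DefaultOutputFieldsSketch.defaultOutputFields())` in place of the inline anonymous class.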

Copilot · 2025-09-15T05:14:03Z

...-v2/src/test/java/org/apache/seatunnel/transform/tikadocument/TikaDocumentTransformTest.java

+            boolean found = false;
+            for (String expectedName : expectedNames) {
+                if (expectedName.equals(outputColumns[i].getName())) {
+                    found = true;
+                    break;
+                }
+            }


[nitpick] Use modern Java 8+ streams or convert expectedNames to a Set for more efficient lookup instead of nested loops. This could be replaced with Set.of(expectedNames).contains(outputColumns[i].getName()).
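A Java 8 compatible version of the set-based lookup could be sketched as follows. The helper name and signature are hypothetical; in the actual test the actual names would come from `outputColumns[i].getName()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical test helper illustrating the set-based lookup. */
final class ColumnNameCheckSketch {

    private ColumnNameCheckSketch() {}

    /** Returns the actual names that are not in the expected set, in encounter order. */
    static List<String> unexpectedNames(String[] expectedNames, String[] actualNames) {
        Set<String> expected = new HashSet<>(Arrays.asList(expectedNames));
        List<String> unexpected = new ArrayList<>();
        for (String name : actualNames) {
            if (!expected.contains(name)) {
                unexpected.add(name);
            }
        }
        return unexpected;
    }
}
```

A test could then assert that the returned list is empty, which also reports every unexpected column at once rather than stopping at the first one.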

Copilot · 2025-09-15T05:14:03Z

...c/main/java/org/apache/seatunnel/transform/tikadocument/extractor/TikaDocumentExtractor.java

+        // Add all other metadata as custom metadata
+        DocumentMetadata tempMetadata = builder.build();
+        for (String name : metadata.names()) {
+            if (!isStandardMetadata(name)) {
+                tempMetadata.setCustomMetadata(name, metadata.get(name));
+            }
+        }


The DocumentMetadata object is built prematurely and then modified, but the builder pattern suggests immutability. The tempMetadata object should not be used to set custom metadata after building. Consider collecting custom metadata before building the final object.

Suggested change

-        // Add all other metadata as custom metadata
-        DocumentMetadata tempMetadata = builder.build();
-        for (String name : metadata.names()) {
-            if (!isStandardMetadata(name)) {
-                tempMetadata.setCustomMetadata(name, metadata.get(name));
-            }
-        }
+        // Collect all other metadata as custom metadata before building
+        java.util.Map<String, String> customMetadata = new java.util.HashMap<>();
+        for (String name : metadata.names()) {
+            if (!isStandardMetadata(name)) {
+                customMetadata.put(name, metadata.get(name));
+            }
+        }
+        builder.customMetadata(customMetadata);
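To make the point about immutability concrete, here is a self-contained sketch of a builder that accepts custom metadata before `build()` and exposes it read-only afterwards. It is a simplified stand-in, not the PR's `DocumentMetadata` class: the class name, the single `title` field, and the method names are assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical stand-in for an immutable metadata object. */
final class DocumentMetadataSketch {

    private final String title;
    private final Map<String, String> customMetadata;

    private DocumentMetadataSketch(Builder builder) {
        this.title = builder.title;
        // Defensive copy, so later changes to the builder do not leak into the built object.
        this.customMetadata =
                Collections.unmodifiableMap(new LinkedHashMap<>(builder.customMetadata));
    }

    String getTitle() {
        return title;
    }

    Map<String, String> getCustomMetadata() {
        return customMetadata;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String title;
        private final Map<String, String> customMetadata = new LinkedHashMap<>();

        Builder title(String title) {
            this.title = title;
            return this;
        }

        /** All custom entries are supplied before build(), so no setter is needed afterwards. */
        Builder customMetadata(Map<String, String> entries) {
            this.customMetadata.putAll(entries);
            return this;
        }

        DocumentMetadataSketch build() {
            return new DocumentMetadataSketch(this);
        }
    }
}
```

With this shape the extractor collects the non-standard entries into a map first, passes them to the builder, and calls `build()` once at the end.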


1

0f8b714

github-actions bot added Transform-v2 e2e labels Sep 15, 2025

liugddx marked this pull request as draft September 15, 2025 00:36

nielifeng requested a review from Copilot September 15, 2025 05:12

Copilot AI reviewed Sep 15, 2025


liugddx added 2 commits September 15, 2025 22:21

1

d4ec6e7

1

5c95db0

github-actions bot added the document label Sep 15, 2025

liugddx added 4 commits September 15, 2025 22:56

1

978a922

1

974755c

1

f05b5f9

1

7945427

github-actions bot added the dependencies Pull requests that update a dependency file label Sep 17, 2025

1

0a7b676

github-actions bot added the core SeaTunnel core module label Sep 17, 2025

liugddx added 5 commits September 17, 2025 23:10

1

a60b130

1

0a3a509

1

5fbcb75

1

efa581f

1

a1c3958

liugddx marked this pull request as ready for review September 21, 2025 11:49

liugddx added 2 commits September 21, 2025 22:36

1

6cc038a

1

7a66ed3

Hisoka-X reviewed Sep 22, 2025

docs/zh/transform-v2/tikadocument.md (resolved)

Hisoka-X reviewed Sep 22, 2025

docs/en/transform-v2/tikadocument.md (outdated, resolved)

1

f5cf207
