8000 docs(clp-package): Add instructions for ingesting logs from S3-compatible object storage. by hoophalab · Pull Request #1796 · y-scope/clp · GitHub
[go: up one dir, main page]

Skip to content

Conversation

@hoophalab
Copy link
Contributor
@hoophalab hoophalab commented Dec 18, 2025

Description

Add instructions for ingesting logs from S3 compatible object storage.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

task docs:serve serves the doc correctly

Summary by CodeRabbit

  • Documentation
    • Added guidance accepting HTTP-style endpoints for S3-compatible object storage (object and key-prefix modes), with examples for host/port, bucket and key formats.
    • Clarified CLP will auto-detect appropriate endpoint URLs during compression and reiterated required bucket policy considerations.
    • Noted Web UI currently cannot display extracted streams from custom S3 endpoints; this limitation will be addressed in a future release.

✏️ Tip: You can customize this high-level summary in your review settings.

@hoophalab hoophalab requested a review from a team as a code owner December 18, 2025 05:13
@coderabbitai
Copy link
Contributor
coderabbitai bot commented Dec 18, 2025

Walkthrough

Adds documentation clarifications for S3-compatible endpoints: accepts HTTP-style object/key-prefix URLs for s3-object and s3-key-prefix modes, notes CLP auto-determines endpoint URL during compression, and documents a Web UI limitation for extracted streams on custom S3 endpoints.

Changes

Cohort / File(s) Change Summary
Object storage usage guide
docs/src/user-docs/guides-using-object-storage/clp-usage.md
Adds acceptance of HTTP S3-compatible object URLs (e.g., http://<host>:<port>/<bucket-name>/<object-key>) and similar HTTP key-prefix URL examples for s3-key-prefix mode; clarifies formats only, no control-flow or validation changes.
Object storage configuration guide
docs/src/user-docs/guides-using-object-storage/object-storage-config.md
Adds guidance to use equivalent bucket policies for S3-compatible storages; states CLP will automatically determine the appropriate endpoint URL during compression; adds note that the Web UI currently cannot view extracted streams from custom S3 endpoints.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5–10 minutes

  • Documentation-only edits; no functional code changes.
  • Review focus: wording accuracy, example correctness, and the Web UI limitation note.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding documentation for S3-compatible object storage ingestion, which aligns with the file changes across both documentation files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 266353e and 82a1b97.

📒 Files selected for processing (1)
  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md
📚 Learning: 2025-06-18T20:39:05.899Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 968
File: docs/src/user-guide/quick-start/overview.md:73-109
Timestamp: 2025-06-18T20:39:05.899Z
Learning: The CLP project team prefers to use video content to demonstrate detailed procedural steps (like tarball extraction) rather than including every step in the written documentation, keeping the docs focused on conceptual guidance.

Applied to files:

  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (2)
docs/src/user-docs/guides-using-object-storage/object-storage-config.md (2)

98-99: Clear and appropriate limitation note.

The documentation appropriately documents the Web UI limitation for custom S3 endpoints and sets proper user expectations by noting it will be addressed in a future release. The placement within the stream storage configuration section is logical.


4-6: The documentation accurately describes S3-compatible storage configuration. S3-compatible systems like MinIO use policy-based access control compatible with AWS IAM policy syntax, structure, and behavior, so "equivalent bucket policies" refers to policies using standard AWS IAM JSON format, which these systems support natively. No clarification is needed on this point.

The statement about CLP automatically determining the appropriate endpoint URL during compression aligns with the custom S3 endpoint support added in PR #1767. The Web UI limitation note appropriately sets expectations for users.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@hoophalab hoophalab requested a review from junhaoliao December 18, 2025 05:13
Copy link
Contributor
@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9650b8c and c6cf9a3.

📒 Files selected for processing (2)
  • docs/src/user-docs/guides-using-object-storage/clp-usage.md (2 hunks)
  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • docs/src/user-docs/guides-using-object-storage/clp-usage.md
  • docs/src/user-docs/guides-using-object-storage/object-storage-config.md
🪛 LanguageTool
docs/src/user-docs/guides-using-object-storage/clp-usage.md

[style] ~34-~34: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...et-name>/*` can also be an object stored in an S3-compa...

(MISSING_IT_THERE)


[style] ~89-~89: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...et-name>/*` can also be an object stored in an S3-compa...

(MISSING_IT_THERE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (1)
docs/src/user-docs/guides-using-object-storage/object-storage-config.md (1)

3-6: Documentation additions properly positioned and consistent with S3-compatible endpoint support.

The clarification about S3-compatible storages using equivalent bucket policies and automatic endpoint URL determination during compression is well-placed and aligns with the custom S3 endpoint support added in PR #1767 (per learnings, where region_code is now optional for non-AWS endpoints).

Comment on lines +34 to +35
* `<object-url>` can also be an object stored in an S3-compatible storage:
* `http://<host>:<port>/<bucket-name>/<object-key>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix incomplete sentence structure on line 34.

The bullet point should form a complete sentence. The current phrasing "* <object-url> can also be an object stored..." lacks proper sentence structure.

🔎 Suggested fix:
- * `<object-url>` can also be an object stored in an S3-compatible storage:
+ * `<object-url>` can also point to an object stored in an S3-compatible storage:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* `<object-url>` can also be an object stored in an S3-compatible storage:
* `http://<host>:<port>/<bucket-name>/<object-key>`
* `<object-url>` can also point to an object stored in an S3-compatible storage:
* `http://<host>:<port>/<bucket-name>/<object-key>`
🧰 Tools
🪛 LanguageTool

[style] ~34-~34: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...et-name>/*` can also be an object stored in an S3-compa...

(MISSING_IT_THERE)

🤖 Prompt for AI Agents
In docs/src/user-docs/guides-using-object-storage/clp-usage.md around lines
34–35, the bullet is a fragment; rewrite it as a complete sentence that states
that the object URL can reference an object stored in S3-compatible storage and
include the example URL format (host:port/bucket-name/object-key) inline so the
bullet reads as a full, self-contained sentence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to the current PR, but i found we haven't update the webui code to support custom s3 endpoints?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though technically we haven't been mentioning the custom endpoint feature can be used for streams... lol. so long we're explicit that the custom endpoint works only for logs inputs, i think it's fine to skip the webui support for now

anyways, i created an issue to track: #1797

Copy link
Contributor Author
@hoophalab hoophalab Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we don't officially support storing archives and streams on S3 compatible storages. The PR title and feature request were Add support for ingesting logs from S3 compatible endpoints.

With that being said, the users can actually add an "endpoint_url: ..." field in clp-config.yaml to enable this feature. I did a very quick test, and the only issue I discovered was extracting stream didn't work in webui.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. let's mention the temporary webui limitation in docs/src/user-docs/guides-using-object-storage/object-storage-config.md if we don't plan to add the webui config in the coming release

@hoophalab hoophalab changed the title docs(clp-package): Add instructions for using S3-compatible endpoints. docs(clp-package): Add instructions for ingesting logs from S3-compatible object storage. Dec 18, 2025
@hoophalab hoophalab requested a review from junhaoliao December 18, 2025 18:53
@hoophalab
Copy link
Contributor Author

@junhaoliao updated based on our offline discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

0