fix(openai): make streaming telemetry monotonic + privacy-safe #3405
Conversation
kevin-ops does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Walkthrough

Adds monotonic timing for chat/completion wrappers, prompt preview and SHA-256 hashing, header scrubbing for sensitive headers, new span attributes (receipt ID, reproducible run, content hash), response ID propagation, and related unit tests. Updates CHANGELOG and semantic conventions with new constants. Streaming accumulation logic is tightened and cleanup improved.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant App
participant Wrapper as OpenAI Wrapper
participant Clock as MonotonicClock
participant OTEL as Span
App->>Wrapper: create chat/completion request (content, headers, seed?)
Wrapper->>Clock: now()
Clock-->>Wrapper: t0 (monotonic)
Note over Wrapper: Compute prompt preview + SHA-256 hash
Wrapper->>OTEL: set LLM_PROMPTS.* + content_hash
Wrapper->>Wrapper: scrub headers (auth/proxy-auth redacted)
alt headers present
Wrapper->>OTEL: set LLM_HEADERS (scrubbed)
end
alt seed provided
Wrapper->>OTEL: set llm.reproducible_run=true
end
App->>Wrapper: stream chunks
loop each chunk
Wrapper->>Wrapper: accumulate delta (preserve model/id if first)
Wrapper->>Clock: now()
Clock-->>Wrapper: tN
Wrapper->>OTEL: update streaming timings (tN - t0)
end
Wrapper->>Clock: now()
Clock-->>Wrapper: tEnd
Wrapper->>OTEL: set duration (tEnd - t0), response.id
Wrapper->>OTEL: set gen_ai.response_id + llm.receipt_id
Wrapper-->>App: final response
sequenceDiagram
participant Stream as ChatStream
participant Guard as CleanupGuard
participant OTEL as Span
participant Clock as MonotonicClock
Stream->>Clock: now() for first token
Stream->>Stream: accumulate(model/id/content) safely
note over Stream: Handle list/dict deltas → text
Stream->>Guard: cleanup()
alt success
Guard->>OTEL: end span (idempotent)
else failure
Guard->>OTEL: end span (best-effort)
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Warning: Review ran into problems
🔥 Problems
Errors were encountered while retrieving linked issues. Errors (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Important
Looks good to me! 👍
Reviewed everything up to df3643c in 1 minute and 33 seconds.
- Reviewed 554 lines of code in 6 files
- Skipped 3 files when reviewing
- Skipped posting 7 draft comments; view those below
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py:29
- Draft comment:
Good use of a set for sensitive header keys. Consider adding a brief comment about why only these keys are scrubbed (e.g. privacy concerns) and whether any additional headers might be needed in future.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. While the suggestion to document security-related code is generally good practice, this comment is more of a nice-to-have than a clear issue that needs fixing. The headers being scrubbed (authorization and proxy-authorization) are obviously sensitive by their names alone, and the comment speculates about future headers, which violates our rule about speculative comments. Documentation about security decisions can be important, and there could be a real risk in not documenting why only these specific headers are scrubbed; but these are standard authorization headers with obvious sensitivity, so additional documentation would be redundant. Delete this comment: it's a nice-to-have suggestion rather than a clear issue, and it includes speculative elements about future headers.
2. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py:160
- Draft comment:
Nice merging of headers from 'headers' and 'extra_headers' with scrubbing and sorted JSON. Adding a brief comment to clarify that this ensures low cardinality and privacy-safe telemetry would help future maintainers.
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 0% <= threshold 50%.
The comment is purely informative and suggests adding a comment for future maintainers. It doesn't provide a specific code suggestion or point out a potential issue with the code.
3. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:68
- Draft comment:
Using perf_counter via _monotonic_now() for timing is a solid improvement over time.time(), ensuring measurements are monotonic even if the system clock skews.
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 0% <= threshold 50%.
This comment is purely informative, praising the use of perf_counter over time.time(). It doesn't provide any actionable feedback or suggestions for improvement.
4. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py:72
- Draft comment:
The _prompt_preview_and_hash function effectively produces a 128-character preview and a SHA-256 hash. Ensure that slicing the Unicode string does not inadvertently split a multi-byte character (though Python slicing is generally safe, a comment could clarify this behavior).
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 30% <= threshold 50%.
The comment is informative and suggests ensuring that slicing does not split a multi-byte character, which is a valid concern in some contexts. However, it also acknowledges that Python slicing is generally safe, making the comment more informative than actionable. It does not directly suggest a code change or ask for a test, which would make it more useful.
5. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py:48
- Draft comment:
The completion wrapper now reuses the prompt preview and hash logic for input sanitization. This is consistent with the chat wrapper. A comment noting that this prevents leaking full prompt details and avoids high-cardinality attributes would be beneficial.
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 0% <= threshold 50%.
The comment is purely informative, explaining what the code does and suggesting adding a comment for clarity. It doesn't ask for a specific code change or test, nor does it point out a potential issue or ask for confirmation of intent. It violates the rule against making purely informative comments.
6. packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py:72
- Draft comment:
Tests now cover prompt truncation and hash generation for both chat and completion wrappers. This is a robust validation for privacy-safe prompt handling.
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 0% <= threshold 50%.
This comment is purely informative, describing what the tests now cover. It doesn't provide any actionable feedback or suggestions for improvement.
7. packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py:94
- Draft comment:
New constants (LLM_RECEIPT_ID, LLM_REPRODUCIBLE_RUN, LLM_CONTENT_HASH_ATTRIBUTE) are added. Their naming and placement follow the conventions. Ensure all instrumentation elsewhere is updated to reference these consistently.
- Reason this comment was not posted:
Comment did not seem useful. Confidence: useful = 0% <= threshold 50%.
The comment is asking the PR author to ensure that all instrumentation is updated to reference the new constants consistently. This falls under the rule of not asking the author to ensure or double-check things. The comment does not provide a specific suggestion or point out a specific issue with the code.
Workflow ID: wflow_oN9z8Bb1X089nluM
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (1)
45-51: LGTM! Prompt preview and hashing helper is correct. The function properly:
- Encodes content as UTF-8 before hashing (correct for byte-level hashing)
- Computes a SHA-256 digest (appropriate for content fingerprinting)
- Returns a truncated preview (128 chars) and the hex digest

Note: This function and PROMPT_PREVIEW_LIMIT are duplicated in chat_wrappers.py. Consider extracting to a shared utility module as an optional refactor to reduce duplication.
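As a point of reference, a minimal sketch of what such a helper can look like. The name and the 128-char limit come from the review above; the exact body here is an assumption, not the PR's code:

```python
import hashlib

PROMPT_PREVIEW_LIMIT = 128  # limit taken from the review notes above

def _prompt_preview_and_hash(content: str) -> tuple[str, str]:
    # Hash the UTF-8 bytes so the digest is byte-exact and platform-stable,
    # then keep only a short preview to bound span-attribute cardinality.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return content[:PROMPT_PREVIEW_LIMIT], digest
```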
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (3)
- 1.png is excluded by !**/*.png
- 2.png is excluded by !**/*.png
- 3.png is excluded by !**/*.png
📒 Files selected for processing (6)
- CHANGELOG.md (1 hunks)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (3 hunks)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (17 hunks)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (4 hunks)
- packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py (1 hunks)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py:
- Store API keys only in environment variables/secure vaults; never hardcode secrets in code
- Use Flake8 for code linting and adhere to its rules
Files:
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py
packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py
🧬 Code graph analysis (4)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (3)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (1): _prompt_preview_and_hash (72-75)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-264)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1): _set_span_attribute (54-61)

packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-264)

packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (3)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (1): _prompt_preview_and_hash (48-51)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (1): _set_span_attribute (54-61)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-264)

packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py (4)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-264)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (2): _accumulate_stream_items (1148-1225), _set_prompts (453-512)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (2): _set_prompts (166-188), _accumulate_streaming_response (276-302)
- packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (2): _set_request_attributes (128-221), _set_response_attributes (225-275)
🪛 Ruff (0.13.3)
packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py
- 32-32: Unused lambda argument: args (ARG005)
- 32-32: Unused lambda argument: kwargs (ARG005)
- 33-33: Unused lambda argument: instance (ARG005)
- 34-34: Unused lambda argument: base_url (ARG005)
🔇 Additional comments (16)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
95-97: LGTM! New semantic convention attributes align with PR objectives.
The three new constants support receipt tracking, reproducibility signaling, and content hashing as described in the PR. The naming follows existing conventions and the values are appropriate for span attributes.
CHANGELOG.md (1)
1-5: LGTM! Changelog entry accurately describes the changes.
The entry correctly summarizes the three main improvements: monotonic timing, header redaction, and prompt hash/receipt attributes. The categorization as a Fix is appropriate.
packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py (2)
30-40: LGTM! Test fixture correctly mocks shared helpers.
The autouse fixture ensures tests are isolated from external dependencies. The unused lambda arguments flagged by static analysis are intentional no-ops for mocking, so those warnings are false positives.
43-148: Excellent test coverage for new functionality.
The six test functions comprehensively validate:
- Streaming accumulation preserving model/id and concatenating content (chat and completion)
- Prompt truncation to 128 chars with SHA-256 hashing
- Per-prompt hashing for batch prompts
- Header scrubbing (Authorization redacted, others preserved)
- Reproducible run flag when seed is present
- Receipt ID propagation from response
Tests are well-structured and verify the core behaviors described in the PR objectives.
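For illustration, a sketch of what one of these tests could look like, inlining the preview/hash helper from the nitpick above; the test name and the exact assertions are assumptions, not the PR's test code:

```python
import hashlib

PROMPT_PREVIEW_LIMIT = 128  # assumed to match the limit in the wrappers

def _prompt_preview_and_hash(content: str) -> tuple[str, str]:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return content[:PROMPT_PREVIEW_LIMIT], digest

def test_prompt_preview_truncates_and_hashes():
    # The preview should be capped at the limit, while the digest must be
    # computed over the full prompt, not the truncated preview.
    prompt = "x" * 500
    preview, digest = _prompt_preview_and_hash(prompt)
    assert preview == "x" * PROMPT_PREVIEW_LIMIT
    assert digest == hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```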
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/__init__.py (3)
29-51: LGTM! Header scrubbing implementation is sound.
The function correctly:
- Returns None for empty/invalid input
- Uses case-insensitive comparison for header names (appropriate for HTTP)
- Redacts sensitive headers (authorization, proxy-authorization)
- Preserves other headers unchanged
This maintains privacy while enabling debugging with request metadata.
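A minimal sketch of the scrubbing shape described here, assuming a set named SENSITIVE_HEADERS and a redaction placeholder; both names are illustrative rather than the PR's exact code:

```python
SENSITIVE_HEADERS = {"authorization", "proxy-authorization"}

def _scrub_headers(headers):
    # Return None for empty/invalid input so callers can skip the attribute.
    if not headers or not isinstance(headers, dict):
        return None
    # Compare header names case-insensitively (as HTTP does), and redact
    # rather than drop so the header's presence is still visible.
    return {
        name: "<redacted>" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }
```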
159-175: LGTM! Request attribute handling improved with header scrubbing and reproducibility tracking.
The changes:
- Merge and scrub headers from both headers and extra_headers kwargs
- Only emit LLM_HEADERS when there's actual header data (good defensive programming)
- Set LLM_REPRODUCIBLE_RUN=True when a seed is provided (enables rerun correlation)
- Use sorted JSON for deterministic header serialization
These align with the PR's privacy and reproducibility objectives.
241-243: LGTM! Dual receipt attribution enables correlation.
Setting both GEN_AI_RESPONSE_ID (OpenTelemetry standard) and LLM_RECEIPT_ID (custom) from the same response ID enables correlation between OpenTelemetry traces and provider-specific receipts for debugging and postmortems.
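Sketched with plain attribute keys; the real code uses the SpanAttributes constants, so the helper name and the literal key strings here are assumptions:

```python
def _set_receipt_attributes(span, kwargs, response):
    # Seed present => the call is replayable, so flag it for correlation.
    if kwargs.get("seed") is not None:
        span.set_attribute("llm.reproducible_run", True)
    response_id = getattr(response, "id", None)
    if response_id:
        # Mirror the provider response ID into both the OTel-standard
        # attribute and the custom receipt attribute.
        span.set_attribute("gen_ai.response_id", response_id)
        span.set_attribute("llm.receipt_id", response_id)
```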
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/completion_wrappers.py (2)

166-188: LGTM! Per-prompt hashing enables batch prompt correlation.
The updated logic correctly:
- Handles both single prompts and batch prompts (lists)
- Computes an independent preview+hash for each prompt item
- Uses dynamic attribute names (LLM_PROMPTS.{idx}.user, LLM_PROMPTS.{idx}.content_hash)

This enables correlating prompts across requests without storing full text, maintaining privacy while enabling debugging.
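A sketch of the batch shape, assuming LLM_PROMPTS resolves to a "gen_ai.prompt" prefix (an assumption; the constant's actual value may differ):

```python
import hashlib

def _set_prompts(span, prompt, preview_limit=128):
    # Treat a bare string as a one-element batch so both shapes share a path.
    prompts = prompt if isinstance(prompt, list) else [prompt]
    for idx, item in enumerate(prompts):
        text = str(item)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        span.set_attribute(f"gen_ai.prompt.{idx}.user", text[:preview_limit])
        span.set_attribute(f"gen_ai.prompt.{idx}.content_hash", digest)
```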
276-285: Good defensive programming: conditional model/id assignment.
Checking if model: and if response_id: before assignment prevents overwriting existing values with empty/None values from subsequent chunks. This preserves the first non-empty model and ID during streaming accumulation.
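The guarded-assignment pattern, sketched with a dict accumulator; the PR's actual accumulator structure may differ:

```python
def _merge_chunk_metadata(accumulated: dict, chunk) -> dict:
    # Only take model/id from chunks that actually carry them, so a later
    # chunk with empty metadata cannot erase what the first chunk set.
    model = getattr(chunk, "model", None)
    if model:
        accumulated["model"] = model
    response_id = getattr(chunk, "id", None)
    if response_id:
        accumulated["id"] = response_id
    return accumulated
```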
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/shared/chat_wrappers.py (7)

65-75: LGTM! Monotonic timing and hashing utilities are well-structured.
The additions:
- _monotonic_now() uses perf_counter() (correct choice for measuring durations immune to system clock adjustments)
- _prompt_preview_and_hash() provides privacy-safe content fingerprinting with a 128-char preview + SHA-256 digest

These utilities support the PR's monotonic timing and privacy objectives. Note: Minor duplication with completion_wrappers.py (see earlier comment).
108-112: Excellent: Monotonic timing prevents negative span durations.
Replacing time.time() with perf_counter() for start/end times ensures durations are immune to system clock adjustments (NTP corrections, daylight saving time, manual changes). This eliminates negative spans and histogram spikes in telemetry, enabling accurate latency SLOs.
Applied consistently across sync and async wrappers.
Also applies to: 207-211
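The timing shape, as a sketch; the helper name matches the review, while the usage lines are illustrative:

```python
import time

def _monotonic_now() -> float:
    # perf_counter() never jumps backwards, unlike time.time(), which can
    # move under NTP corrections and produce negative span durations.
    return time.perf_counter()

# Illustrative usage inside a wrapper:
start = _monotonic_now()
# ... call the API / consume the stream ...
duration = _monotonic_now() - start  # always >= 0
```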
452-487: LGTM! Content serialization and hashing maintain privacy while enabling correlation.
The enhanced logic:
- Serializes complex content (lists → JSON, dicts → JSON, others → string) before hashing
- Processes base64 images (if configured) before serialization
- Computes a 128-char preview + SHA-256 digest
- Stores the preview in the content attribute and the digest in the content_hash attribute

This approach:
- Reduces cardinality (fixed-length preview + deterministic hash)
- Maintains privacy (no full prompt text in spans)
- Enables correlation (hash matching across requests)
Aligns perfectly with the PR's privacy and low-cardinality objectives.
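The serialization step, sketched; sort_keys keeps the hash deterministic, and the helper name is an assumption:

```python
import json

def _serialize_content(content) -> str:
    # JSON-encode structured content (multimodal parts, tool payloads) with
    # sorted keys so the preview and hash are deterministic across runs.
    if isinstance(content, (list, dict)):
        return json.dumps(content, sort_keys=True)
    return str(content)
```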
537-537: Good fix: continue ensures full choice iteration.
Changing return to continue ensures the loop processes all choices even when:
- Line 537: A choice has finish_reason="content_filter" (filtered content)
- Line 541: A choice has no message (edge case)
This prevents early exit and ensures completions for all choices are recorded when possible.
Also applies to: 541-541
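The control-flow fix, sketched over dict-shaped choices; the attribute names are illustrative, and the real code reads SDK response objects rather than dicts:

```python
def _record_choices(span, choices):
    for index, choice in enumerate(choices):
        # `return` here used to abandon all remaining choices; `continue`
        # now skips only the one that cannot be recorded.
        if choice.get("finish_reason") == "content_filter":
            continue
        message = choice.get("message")
        if message is None:
            continue
        span.set_attribute(
            f"gen_ai.completion.{index}.content", message.get("content", "")
        )
```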
741-741: Systematic monotonic timing applied across all timing measurements.
All duration calculations now use _monotonic_now():
- Span lifetime (duration histogram)
- Time-to-first-token (TTFT)
- Streaming time-to-generate
- Partial metrics in cleanup
This comprehensive change ensures all telemetry timing is immune to clock adjustments, providing stable and accurate latency measurements for SLOs and monitoring.
Also applies to: 780-780, 789-789, 849-849, 904-904, 931-931, 937-937, 975-975, 1002-1002, 1008-1008
1152-1196: Robust streaming accumulation with metadata retention and multimodal support.
The enhanced logic:
- Preserves the first non-empty model and id (Lines 1152-1157): prevents overwriting with empty values from later chunks
- Handles multimodal content chunks (Lines 1188-1196): when delta.content is a list (e.g., mixed text/image), joins parts into text by extracting "text" keys or stringifying

This addresses the PR's "streaming metadata retention" objective and handles edge cases like multimodal deltas gracefully.
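The multimodal join, sketched; the helper name is an assumption, while the "text" key matches the description above:

```python
def _delta_content_to_text(content) -> str:
    # Multimodal deltas arrive as a list of parts; extract the text pieces
    # and stringify anything else so accumulation always stays a string.
    if isinstance(content, list):
        return "".join(
            part.get("text", "") if isinstance(part, dict) else str(part)
            for part in content
        )
    return content or ""
```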
806-843: Thread-safe cleanup with comprehensive error handling.
The _ensure_cleanup() method:
- Uses _cleanup_lock for thread safety
- Prevents duplicate cleanup with a _cleanup_completed flag
- Records partial metrics before closing the span (ensures data is captured even on early cleanup)
- Has fallback error handling to always mark cleanup as complete (prevents infinite loops)
This robust design ensures spans are always closed and metrics are recorded even in error scenarios, improving telemetry reliability.
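The idempotent-cleanup pattern, sketched as a small standalone guard class; the PR implements this as a method with instance fields, so this shape is an assumption:

```python
import threading

class CleanupGuard:
    def __init__(self, span):
        self._span = span
        self._lock = threading.Lock()
        self._completed = False

    def ensure_cleanup(self):
        # Lock + flag make cleanup idempotent and thread-safe: the span is
        # ended at most once, even if two code paths race to clean up.
        with self._lock:
            if self._completed:
                return
            try:
                self._span.end()
            finally:
                # Mark complete even if ending the span raised, so callers
                # cannot spin retrying a cleanup that keeps failing.
                self._completed = True
```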
This PR addresses three ongoing telemetry issues in the OpenAI wrapper: inaccurate latency measurements, loss of streaming model metadata, and raw prompt/header payloads in span attributes.
What ships
- Timing: span start/end and streaming timestamps now read from perf_counter(). Clock slews no longer create negative spans or spike histograms.
- Streaming: accumulators keep the first non-empty model and response ID across chunks, handle multimodal deltas gracefully, and continue collecting completions when a filtered choice appears.
- Prompts: span attributes carry a 128-character preview plus SHA-256 hash so teams can correlate traffic without leaking full text or exploding cardinality.
- Attribution: llm.reproducible_run=true is emitted when a seed is present, and response IDs are mirrored into llm.receipt_id. Added the new constants to SpanAttributes.
- Tests: new unit tests cover prompt hashing, header scrubbing, receipt propagation, and streaming accumulation logic.
Why it matters
- Durations are monotonic, so operators can trust the telemetry.
- Streaming model/ID metadata survives, so cost attribution and debugging remain accurate.
- Credentials never land in span attributes, reducing downstream risk.
- Receipt IDs and content hashes enable "find the span for this call" tooling.
Validation
python3 -m pytest packages/opentelemetry-instrumentation-openai/tests/traces/test_chat_wrappers_unit.py
No flags required; behaviour improves on upgrade.
Important
Fix telemetry issues in OpenAI wrapper by implementing monotonic timing, metadata retention, prompt privacy, and header scrubbing, with added unit tests.
- Use perf_counter() for monotonic timing in chat_wrappers.py and completion_wrappers.py to prevent negative spans.
- Retain streaming model/ID metadata in chat_wrappers.py and completion_wrappers.py.
- Add prompt preview and SHA-256 hashing in chat_wrappers.py and completion_wrappers.py.
- Add llm.reproducible_run and llm.receipt_id attributes in shared/__init__.py.
- Extend test_chat_wrappers_unit.py to test model persistence, prompt hashing, header scrubbing, and receipt propagation.
- Add LLM_RECEIPT_ID, LLM_REPRODUCIBLE_RUN, and LLM_CONTENT_HASH_ATTRIBUTE to SpanAttributes in semconv_ai/__init__.py.

This description was created by Ellipsis for df3643c. You can customize this summary. It will automatically update as commits are pushed.