8000 Ensure context cancellation during metric pipeline produce does not corrupt data by alecholmes · Pull Request #6914 · open-telemetry/opentelemetry-go · GitHub
[go: up one dir, main page]

Skip to content

Conversation

alecholmes
Copy link
Contributor

This fixes issue #6344, in which an attempted clean shutdown of the metrics SDK using PeriodicReader's Shutdown method can produce and emit wildly incorrect data point values that are orders of magnitude too large.

The root of the issue is that the pipeline produce method changed in this PR cannot safely return early, which was happening in the callback loops that were checking for context cancellation. Early return is not safe since callbacks and aggregations are tightly coupled in practice: invoking callbacks without also invoking aggregations corrupts internal data point value accounting.

The linked issue more concretely walks through the sequence of steps that were causing this issue.

Copy link
linux-foundation-easycla bot commented Jun 18, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link
Contributor
@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. Ideally, callbacks would themselves check to see if the context is cancelled and return early. But we do need to ensure that we call compAgg on instruments that are measured.

But we should make sure we get reviews from others as well, as this could have unintended side-effects

@dashpole
Copy link
Contributor

(This needs a changelog entry)

Copy link
codecov bot commented Jun 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.9%. Comparing base (1dc9644) to head (76222fc).
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff          @@
##            main   #6914   +/-   ##
=====================================
  Coverage   82.9%   82.9%           
=====================================
  Files        261     261           
  Lines      24262   24256    -6     
=====================================
+ Hits       20128   20130    +2     
+ Misses      3757    3751    -6     
+ Partials     377     375    -2     
Files with missing lines Coverage Δ
sdk/metric/pipeline.go 89.2% <100.0%> (+2.1%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@alecholmes
Copy link
Contributor Author

This makes sense to me. Ideally, callbacks would themselves check to see if the context is cancelled and return early. But we do need to ensure that we call compAgg on instruments that are measured.

But we should make sure we get reviews from others as well, as this could have unintended side-effects

@dashpole Thanks for taking the time to review. I pushed a commit addressing your comments. Hopefully there is no subtle history I'm missing around the original implementation.

Co-authored-by: Damien Mathieu <42@dmathieu.com>
@dmathieu dmathieu changed the title Context cancellation during metric pipeline produce does not corrupt data Ensure context cancellation during metric pipeline produce does not corrupt data Jun 19, 2025
@dmathieu dmathieu merged commit 2da77b1 into open-telemetry:main Jun 20, 2025
32 checks passed
@pellared pellared added this to the v1.37.0 milestone Jun 24, 2025
pellared added a commit that referenced this pull request Jun 25, 2025
### Added

- The `go.opentelemetry.io/otel/semconv/v1.33.0` package.
The package contains semantic conventions from the `v1.33.0` version of
the OpenTelemetry Semantic Conventions.
See the [migration documentation](./semconv/v1.33.0/MIGRATION.md) for
information on how to upgrade from
`go.opentelemetry.io/otel/semconv/v1.32.0.`(#6799)
- The `go.opentelemetry.io/otel/semconv/v1.34.0` package.
The package contains semantic conventions from the `v1.34.0` version of
the OpenTelemetry Semantic Conventions. (#6812)
- Add metric's schema URL as `otel_scope_schema_url` label in
`go.opentelemetry.io/otel/exporters/prometheus`. (#5947)
- Add metric's scope attributes as `otel_scope_[attribute]` labels in
`go.opentelemetry.io/otel/exporters/prometheus`. (#5947)
- Add `EventName` to `EnabledParameters` in
`go.opentelemetry.io/otel/log`. (#6825)
- Add `EventName` to `EnabledParameters` in
`go.opentelemetry.io/otel/sdk/log`. (#6825)
- Changed handling of `go.opentelemetry.io/otel/exporters/prometheus`
metric renaming to add unit suffixes when it doesn't match one of the
pre-defined values in the unit suffix map. (#6839)

### Changed

- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/bridge/opentracing`. (#6827)
- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/exporters/zipkin`. (#6829)
- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/metric`. (#6832)
- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/sdk/resource`. (#6834)
- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/sdk/trace`. (#6835)
- The semantic conventions have been upgraded from `v1.26.0` to
`v1.34.0` in `go.opentelemetry.io/otel/trace`. (#6836)
- `Record.Resource` now returns `*resource.Resource` instead of
`resource.Resource` in `go.opentelemetry.io/otel/sdk/log`. (#6864)
- Retry now shows error cause for context timeout in
`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`,
`go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc`,
`go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc`,
`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp`,
`go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp`,
`go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp`. (#6898)

### Fixed

- Stop stripping trailing slashes from configured endpoint URL in
`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`.
(#6710)
- Stop stripping trailing slashes from configured endpoint URL in
`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp`.
(#6710)
- Stop stripping trailing slashes from configured endpoint URL in
`go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc`.
(#6710)
- Stop stripping trailing slashes from configured endpoint URL in
`go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp`.
(#6710)
- Validate exponential histogram scale range for Prometheus
compatibility in `go.opentelemetry.io/otel/exporters/prometheus`.
(#6822)
- Context cancellation during metric pipeline produce does not corrupt
data in `go.opentelemetry.io/otel/sdk/metric`. (#6914)

### Removed

- `go.opentelemetry.io/otel/exporters/prometheus` no longer exports
`otel_scope_info` metric. (#6770)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0