-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Ensure context cancellation during metric pipeline produce does not corrupt data #6914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure context cancellation during metric pipeline produce does not corrupt data #6914
Conversation
…aggregation state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me. Ideally, callbacks would themselves check to see if the context is cancelled and return early. But we do need to ensure that we call compAgg on instruments that are measured.
But we should make sure we get reviews from others as well, as this could have unintended side-effects
(This needs a changelog entry) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #6914 +/- ##
=====================================
Coverage 82.9% 82.9%
=====================================
Files 261 261
Lines 24262 24256 -6
=====================================
+ Hits 20128 20130 +2
+ Misses 3757 3751 -6
+ Partials 377 375 -2
🚀 New features to boost your workflow:
|
@dashpole Thanks for taking the time to review. I pushed a commit addressing your comments. Hopefully there is no subtle history I'm missing around the original implementation. |
Co-authored-by: Damien Mathieu <42@dmathieu.com>
### Added - The `go.opentelemetry.io/otel/semconv/v1.33.0` package. The package contains semantic conventions from the `v1.33.0` version of the OpenTelemetry Semantic Conventions. See the [migration documentation](./semconv/v1.33.0/MIGRATION.md) for information on how to upgrade from `go.opentelemetry.io/otel/semconv/v1.32.0.`(#6799) - The `go.opentelemetry.io/otel/semconv/v1.34.0` package. The package contains semantic conventions from the `v1.34.0` version of the OpenTelemetry Semantic Conventions. (#6812) - Add metric's schema URL as `otel_scope_schema_url` label in `go.opentelemetry.io/otel/exporters/prometheus`. (#5947) - Add metric's scope attributes as `otel_scope_[attribute]` labels in `go.opentelemetry.io/otel/exporters/prometheus`. (#5947) - Add `EventName` to `EnabledParameters` in `go.opentelemetry.io/otel/log`. (#6825) - Add `EventName` to `EnabledParameters` in `go.opentelemetry.io/otel/sdk/log`. (#6825) - Changed handling of `go.opentelemetry.io/otel/exporters/prometheus` metric renaming to add unit suffixes when it doesn't match one of the pre-defined values in the unit suffix map. (#6839) ### Changed - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/bridge/opentracing`. (#6827) - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/exporters/zipkin`. (#6829) - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/metric`. (#6832) - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/sdk/resource`. (#6834) - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/sdk/trace`. (#6835) - The semantic conventions have been upgraded from `v1.26.0` to `v1.34.0` in `go.opentelemetry.io/otel/trace`. (#6836) - `Record.Resource` now returns `*resource.Resource` instead of `resource.Resource` in `go.opentelemetry.io/otel/sdk/log`. (#6864) - Retry now shows error cause for context timeout in `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`, `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc`, `go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc`, `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp`, `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp`, `go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp`. (#6898) ### Fixed - Stop stripping trailing slashes from configured endpoint URL in `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc`. (#6710) - Stop stripping trailing slashes from configured endpoint URL in `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp`. (#6710) - Stop stripping trailing slashes from configured endpoint URL in `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc`. (#6710) - Stop stripping trailing slashes from configured endpoint URL in `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp`. (#6710) - Validate exponential histogram scale range for Prometheus compatibility in `go.opentelemetry.io/otel/exporters/prometheus`. (#6822) - Context cancellation during metric pipeline produce does not corrupt data in `go.opentelemetry.io/otel/sdk/metric`. (#6914) ### Removed - `go.opentelemetry.io/otel/exporters/prometheus` no longer exports `otel_scope_info` metric. (#6770)
This fixes issue #6344, in which an attempted clean shutdown of the metrics SDK using PeriodicReader's Shutdown method can produce and emit wildly incorrect data point values that are orders of magnitude too large.
The root of the issue is that the pipeline produce method changed in this PR cannot safely return early, which was happening in the callback loops that were checking for context cancellation. Early return is not safe since callbacks and aggregations are tightly coupled in practice: invoking callbacks without also invoking aggregations corrupts internal data point value accounting.
The linked issue more concretely walks through the sequence of steps that were causing this issue.