Timeout and retry of exporters is broken #4043
@LarsMichelsen

Description


Describe your environment

OS: Ubuntu 24.04
Python version: Python 3.12
SDK version: 1.25.1
API version: 1.25.1

What happened?

I am piloting the OpenTelemetry Python implementation in order to integrate it into several Python applications. During my experiments I observed major issues when an unreachable OTLP target is configured: during shutdown, the instrumented application seems to hang forever. Even if one might see this as an edge case, to me it is a blocker for adopting the library, as it affects the instrumented application quite negatively.

I tried to fix this locally, then researched here and came across many other reports (e.g. #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read into the code, bugs, and PRs, and tried to understand everything, including the history.

I am now writing to open a discussion on how to approach this. I'll start with a rough conceptual proposal:

  1. The exporter uses self._timeout (based on OTEL_EXPORTER_OTLP_TIMEOUT, default 10 seconds) as the maximum time to spend on an export.
  2. The exponential backoff retry is kept, but stays within the boundaries of self._timeout.

On a more technical level:

  1. At the beginning of OTLPExporterMixin._export, a deadline is calculated which then bounds the remaining execution of the function.
  2. The time needed for each self._client.Export call counts against the deadline.
  3. As long as the deadline is not reached, we do retries with growing waits in between.
  4. In case shutdown is called, the signal is stored in self._shutdown.
  5. Now, once the shutdown signal is set, we can either continue the export until its deadline, including additional retries, and then shut down, or interrupt the export the next time _export gets control over it (when it returns from self._client.Export or when we interrupt the retry sleep). I tend towards the second option; a rough sketch of this flow follows the list. If there is no agreement on how to proceed, we could also make it configurable to give users more control depending on their use case.
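To make the idea concrete, here is a minimal sketch of the deadline-bounded retry loop, assuming a threading.Event for the shutdown signal. `_export_once` and the backoff constants are placeholders for the real self._client.Export call and retry policy, not the actual OTLPExporterMixin code:

```python
import threading
import time


class _ExportSketch:
    def __init__(self, timeout_s: float = 10.0):
        # Would come from OTEL_EXPORTER_OTLP_TIMEOUT (default 10 s).
        self._timeout = timeout_s
        self._shutdown = threading.Event()

    def _export(self, data) -> bool:
        deadline = time.monotonic() + self._timeout
        backoff = 1.0
        while not self._shutdown.is_set():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # deadline reached, give up
            # Placeholder for the single self._client.Export call; passing the
            # remaining budget keeps the per-call timeout within the deadline.
            if self._export_once(data, timeout=remaining):
                return True
            # Exponential backoff capped by the deadline; waiting on the event
            # lets shutdown() interrupt the sleep immediately.
            wait = min(backoff, deadline - time.monotonic())
            if wait <= 0 or self._shutdown.wait(wait):
                return False
            backoff *= 2
        return False  # shutdown was requested

    def shutdown(self) -> None:
        self._shutdown.set()
```

With this structure the export never runs longer than self._timeout, and a shutdown request is picked up either between attempts or while sleeping.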

This way, we would have a clear maximum time to shutdown. What do you think?

I also propose to change the semantics of shutdown. Instead of waiting to be allowed to set the shutdown signal and only then setting it, I'd rather simplify it to just setting the shutdown signal. shutdown would then return immediately and not interrupt the export; it is entirely up to _export to pick up the signal as needed. The caller of shutdown would have to join the export thread anyway to be sure it shut down correctly. Do you think this is acceptable?
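A sketch of what that simplified shutdown could look like (the export worker thread and the join timeout below are assumptions for illustration, not existing API):

```python
def shutdown(self) -> None:
    # Only record the request and return immediately; _export decides when to
    # stop, and the caller joins the export thread to wait for completion.
    self._shutdown.set()
```

The caller would then do something like exporter.shutdown() followed by export_thread.join(timeout=...) to make sure the export loop actually observed the signal and exited.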

Steps to Reproduce

  1. Configure an unreachable OTLP target
  2. Export a span
  3. Try to stop the application (see the snippet after this list)
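For reference, a minimal reproduction along these lines (the non-routable endpoint address and the span name are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Endpoint that never answers (non-routable address), so every export attempt times out.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="10.255.255.1:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-span"):
    pass

# Expected: returns within OTEL_EXPORTER_OTLP_TIMEOUT. Observed: hangs.
provider.shutdown()
```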

Expected Result

The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.

Actual Result

The application hangs and does not shut down within a few seconds as it did before.

Additional context

No response

Would you like to implement a fix?

Yes
