Description
Describe your environment
OS: Ubuntu 24.04
Python version: Python 3.12
SDK version: 1.25.1
API version: 1.25.1
What happened?
I am piloting the OpenTelemetry Python implementation in order to integrate it into several Python applications. During my experiments I observed major issues when an unreachable OTLP target is configured: during shutdown, the instrumented application seems to hang forever. Even if one might see this as an edge case, to me this is a blocker to adopting the library, as it affects the instrumented application quite negatively.
I tried to fix this locally, then researched here and came across many other reports (e.g. #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read through the code, bugs and PRs and tried to understand everything, including the history.
I am now writing to open the discussion on how to approach this. I'll start with a rough conceptual proposal:
- The exporter uses `self._timeout` (based on `OTEL_EXPORTER_OTLP_TIMEOUT`, defaults to 10 sec) as the maximum time to spend on an export.
- The exponential backoff retry is kept, but stays within the boundaries of `self._timeout`.
On a more technical level:
- At the beginning of `OTLPExporterMixin._export` a deadline is calculated, which is then the basis for the further execution of this function.
- The time needed for each `self._client.Export` call is subtracted from the deadline.
- As long as the deadline is not reached, we do retries with growing waits in between.
- In case `shutdown` is called, the signal is stored in `self._shutdown`.
- Now, once the shutdown signal is set, we can either continue the export until its deadline, incl. additional retries, and then shut down, or interrupt the export the next time `_export` gets control over it (when it returns from `self._client.Export` or we interrupt the retry sleep). I tend to the 2nd option. If there is no agreement on how to proceed with this, we could also make it configurable to give users more control depending on their use case.
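To make the steps above concrete, here is a minimal sketch of a deadline-bounded retry loop that follows the 2nd option (interrupt on shutdown). It is not the actual `OTLPExporterMixin._export` code: `export_with_deadline`, `send`, and `initial_backoff` are hypothetical names, and `send` merely stands in for `self._client.Export`.

```python
import threading
import time
from typing import Callable


def export_with_deadline(
    send: Callable[[float], bool],
    timeout: float,
    shutdown: threading.Event,
    initial_backoff: float = 1.0,
) -> bool:
    """Retry `send` with exponential backoff until the deadline or shutdown.

    `send` is a hypothetical stand-in for `self._client.Export`: it receives
    the remaining time budget in seconds and returns True on success.
    """
    # The deadline is computed once, at the beginning of the export.
    deadline = time.monotonic() + timeout
    backoff = initial_backoff
    while not shutdown.is_set():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # The call itself is only given the time that is left.
        if send(remaining):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep for the backoff, but never past the deadline.
        # Event.wait returns True if shutdown was signaled during the sleep.
        if shutdown.wait(min(backoff, remaining)):
            return False
        backoff *= 2
    return False
```

With this shape, the worst-case time spent in one export is roughly `timeout`, and a shutdown signal cuts the retry sleep short instead of waiting it out.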
This way, we would have a clear maximum time to shutdown. What do you think?
I also propose to change the semantics of `shutdown`. Instead of waiting to be allowed to set the shutdown signal and then setting it, I'd rather simplify it to just "set the shutdown signal". This would return immediately and not interrupt the export. It's totally up to `_export` to pick up the signal as needed. The caller of the `shutdown` function would have to join the thread anyway to be sure the export thread has shut down correctly. Do you think this is acceptable?
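A rough sketch of what this simplified `shutdown` could look like from the caller's side. `SketchExporter` and `export_loop` are hypothetical names for illustration only, not the library's classes; the point is just that `shutdown` sets the signal and the caller bounds the wait via `join`.

```python
import threading


class SketchExporter:
    """Hypothetical exporter; not the real OTLPExporterMixin."""

    def __init__(self) -> None:
        self._shutdown = threading.Event()

    def shutdown(self) -> None:
        # Only set the signal and return immediately; do not wait for the export.
        self._shutdown.set()

    def export_loop(self, interval: float = 60.0) -> None:
        # Stand-in for the export thread: wakes up early once shutdown is signaled.
        while not self._shutdown.wait(interval):
            pass  # an export attempt would go here


exporter = SketchExporter()
worker = threading.Thread(target=exporter.export_loop, daemon=True)
worker.start()

exporter.shutdown()       # returns immediately
worker.join(timeout=5.0)  # the caller decides how long to wait for the thread
```

Here the maximum shutdown time is controlled by the caller's `join` timeout rather than by anything inside `shutdown` itself.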
Steps to Reproduce
- Configure an unreachable OTLP target
- Export a span
- Try to stop the application
Expected Result
The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.
Actual Result
A hanging application that no longer shuts down within a few seconds, as it did before.
Additional context
No response
Would you like to implement a fix?
Yes