[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546
Changes
On the main branch:
An update thread does the following:
- Look up the index in the `MetricPoint` array
- Increment `ReferenceCount` for the `MetricPoint` at that index
- Check if the `MetricPoint` is valid for use
- Set `MetricPointStatus` to `CollectPending`
- Decrement `ReferenceCount` for the `MetricPoint`
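The update-side steps can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: `MetricPointSketch`, `TryUpdate`, `RunningSum`, and the integer status constants are hypothetical names, the array lookup is omitted, and treating a negative `ReferenceCount` as "invalid for use" is an assumption based on `int.MinValue` being the invalid marker.

```csharp
using System.Threading;

// Hypothetical stand-in for the SDK's MetricPoint; all names are illustrative.
public sealed class MetricPointSketch
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount;            // a negative value means "invalid for use" (assumption)
    public int Status = NoCollectPending;
    public long RunningSum;

    // Returns false when the point was already marked invalid by the collect thread.
    public bool TryUpdate(long value)
    {
        // Increment ReferenceCount so the collect thread can see the point is in use.
        if (Interlocked.Increment(ref this.ReferenceCount) < 0)
        {
            // The point is invalid; the caller would need a fresh MetricPoint (not shown).
            return false;
        }

        Interlocked.Add(ref this.RunningSum, value);     // record the measurement
        Volatile.Write(ref this.Status, CollectPending); // signal that there is new data
        Interlocked.Decrement(ref this.ReferenceCount);  // release the point
        return true;
    }
}
```

A successful call leaves `ReferenceCount` back at `0` and the status at `CollectPending`, which is the state the collect thread later inspects.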
The collect thread does the following:
- If `MetricPointStatus` for a given `MetricPoint` is `NoCollectPending`, then check if it can be marked invalid
- Mark the `MetricPoint` invalid (this happens by setting the `ReferenceCount` to `int.MinValue` when no one is using it)
- Reclaim the `MetricPoint`
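A rough sketch of the collect-side check as it behaves on the main branch, before the fix. The type and member names are hypothetical, and using `Interlocked.CompareExchange` to swap `0` for `int.MinValue` is an assumption about how "when no one is using it" could be enforced atomically; the description above does not say which primitive the SDK uses.

```csharp
using System.Threading;

// Hypothetical stand-in for the SDK's MetricPoint; all names are illustrative.
public sealed class ReclaimablePointSketch
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount;
    public int Status = NoCollectPending;

    // Returns true when the point was marked invalid and may be reclaimed.
    public bool TryMarkInvalid()
    {
        if (Volatile.Read(ref this.Status) != NoCollectPending)
        {
            return false; // there is data to export, so this is not a reclaim candidate
        }

        // Swap 0 -> int.MinValue atomically; this succeeds only when no update
        // thread currently holds a reference (assumed mechanism).
        return Interlocked.CompareExchange(ref this.ReferenceCount, int.MinValue, 0) == 0;
    }
}
```

Note the gap between the status read and the compare-exchange: nothing in this sketch stops an update thread from running to completion in between.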
Where's the race condition?
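Consider this sequence of steps where both the Update thread and the Collect thread are working on the same `MetricPoint`:
- T1: Update thread finds the `MetricPoint` index
- T2: Collect thread reaches the same entry in `MetricPoints`
- T3: Collect thread checks `MetricPointStatus` for the `MetricPoint`
- T4: Collect thread finds that `MetricPointStatus` is `NoCollectPending`
- T5: Update thread increments `ReferenceCount` to `1`
- T6: Update thread checks that the `MetricPoint` is valid for use
- T7: Update thread sets `MetricPointStatus` to `CollectPending` (This is the update that we would miss)
- T8: Update thread decrements `ReferenceCount` to `0`
- T9: Collect thread sets `ReferenceCount` to `int.MinValue` as it finds the `ReferenceCount` to be `0`
- T10: Collect thread reclaims the `MetricPoint` and misses the update that happened at T7
Fix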
I'm introducing a double-checked locking type construct to recheck if the `MetricPointStatus` was changed to `CollectPending` before the Collect thread could mark the `MetricPoint` invalid for use. When that happens, we would now call `Snapshot` for that `MetricPoint` and mark it to be reclaimed in the next Collect cycle. Note that the `MetricPoint` would remain invalid for use until the next Collect cycle reclaims it.
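The recheck could look something like the sketch below. It is an illustration under stated assumptions rather than the code in this PR: `FixedPointSketch`, `BeginUpdate`, `FinishUpdate`, `MarkInvalidAndRecheck`, `SnapshotSum`, and `ReclaimNextCycle` are hypothetical names, and the compare-exchange used to mark the point invalid is assumed. The update path is split in two only so that the problematic interleaving can be replayed deterministically.

```csharp
using System.Threading;

// Hypothetical stand-in for the SDK's MetricPoint; all names are illustrative.
public sealed class FixedPointSketch
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount;
    public int Status = NoCollectPending;
    public long RunningSum;
    public long SnapshotSum;       // value captured by the last Snapshot
    public bool ReclaimNextCycle;  // hypothetical flag: reclaim on the next Collect cycle

    // Update side, first half: take a reference and confirm the point is valid.
    public bool BeginUpdate() => Interlocked.Increment(ref this.ReferenceCount) >= 0;

    // Update side, second half: record the value, flag it, release the reference.
    public void FinishUpdate(long value)
    {
        Interlocked.Add(ref this.RunningSum, value);
        Volatile.Write(ref this.Status, CollectPending);
        Interlocked.Decrement(ref this.ReferenceCount);
    }

    // Collect side, run after Status was already observed as NoCollectPending.
    // Returns true only if the point can be reclaimed in this cycle.
    public bool MarkInvalidAndRecheck()
    {
        if (Interlocked.CompareExchange(ref this.ReferenceCount, int.MinValue, 0) != 0)
        {
            return false; // still in use by an update thread
        }

        // Second check: an update may have completed after the first status read.
        if (Volatile.Read(ref this.Status) == CollectPending)
        {
            this.Snapshot();              // export the update instead of dropping it
            this.ReclaimNextCycle = true; // the point stays invalid until then
            return false;
        }

        return true;
    }

    private void Snapshot()
    {
        this.SnapshotSum = Interlocked.Read(ref this.RunningSum);
        Volatile.Write(ref this.Status, NoCollectPending);
    }
}
```

Replaying the race against this sketch (observe `NoCollectPending`, then `BeginUpdate()`, `FinishUpdate(42)`, then `MarkInvalidAndRecheck()`) captures the value in `SnapshotSum` and defers the reclaim, rather than discarding the update.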