8000 Make _bt_killitems drop pins it acquired itself. · theory/postgres@7c319f5 · GitHub
[go: up one dir, main page]

Skip to content

Commit 7c319f5

Browse files
Make _bt_killitems drop pins it acquired itself.
Teach nbtree's _bt_killitems to leave the so->currPos page that it sets LP_DEAD items on in whatever state it was in when _bt_killitems was called. In particular, make sure that so->dropPin scans don't acquire a pin whose reference is saved in so->currPos.buf. Allowing _bt_killitems to change so->currPos.buf like this is wrong. The immediate consequence of allowing it is that code in _bt_steppage (that copies so->currPos into so->markPos) will behave as if the scan is a !so->dropPin scan. so->markPos will therefore retain the buffer pin indefinitely, even though _bt_killitems only needs to acquire a pin (along with a lock) for long enough to mark known-dead items LP_DEAD. This issue came to light following a report of a failure of an assertion from recent commit e6eed40. The test case in question involves the use of mark and restore. An initial call to _bt_killitems takes place that leaves so->currPos.buf in a state that is inconsistent with the scan being so->dropPin. A subsequent call to _bt_killitems for the same position (following so->currPos being saved in so->markPos, and then restored as so->currPos) resulted in the failure of an assertion that tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin (non-assert builds got a "resource was not closed" WARNING instead). The same problem exists on earlier releases, though the issue is far more subtle there. Recent commit e6eed40 introduced the so->dropPin field as a partial replacement for testing so->currPos.buf directly. Earlier releases won't get an assertion failure (or buffer pin leak), but they will allow the second _bt_killitems call from the test case to behave as if a buffer pin was consistently held since the original call to _bt_readpage. This is wrong; there will have been an initial window during which no pin was held on the so->currPos page, and yet the second _bt_killitems call will neglect to check if so->currPos.lsn continues to match the page's now-current LSN. As a result of all this, it's just about possible that _bt_killitems will set the wrong items LP_DEAD (on release branches). This could only happen with merge joins (the sole user of nbtree mark/restore support), when a concurrently inserted index tuple used a recently-recycled TID (and only when the new tuple was inserted onto the same page as a distinct concurrently-removed tuple with the same TID). This is exactly the scenario that _bt_killitems' check of the page's now-current LSN against the LSN stashed in currPos was supposed to prevent. A follow-up commit will make nbtree completely stop conditioning whether or not a position's pin needs to be dropped on whether the 'buf' field is set. All call sites that might need to drop a still-held pin will be taught to rely on the scan-level so->dropPin field recently introduced by commit e6eed40. That will make bugs of the same general nature as this one impossible (or make them much easier to detect, at least). Author: Peter Geoghegan <pg@bowt.ie> Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com Backpatch-through: 13
1 parent 3614995 commit 7c319f5

File tree

2 files changed

+28
-21
lines changed

2 files changed

+28
-21
lines changed

src/backend/access/nbtree/nbtree.c

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -417,6 +417,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
417417
* way, so we might as well avoid wasting cycles on acquiring page LSNs.
418418
*
419419
* See nbtree/README section on making concurrent TID recycling safe.
420+
*
421+
* Note: so->dropPin should never change across rescans.
420422
*/
421423
so->dropPin = (!scan->xs_want_itup &&
422424
IsMVCCSnapshot(scan->xs_snapshot) &&

src/backend/access/nbtree/nbtutils.c

Lines changed: 26 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -3323,24 +3323,26 @@ _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
33233323
* current page and killed tuples thereon (generally, this should only be
33243324
* called if so->numKilled > 0).
33253325
*
3326-
* The caller does not have a lock on the page and may or may not have the
3327-
* page pinned in a buffer. Note that read-lock is sufficient for setting
3328-
* LP_DEAD status (which is only a hint).
3326+
* Caller should not have a lock on the so->currPos page, but must hold a
3327+
* buffer pin when !so->dropPin. When we return, it still won't be locked.
3328+
* It'll continue to hold whatever pins were held before calling here.
33293329
*
3330-
* We match items by heap TID before assuming they are the right ones to
3331-
* delete. We cope with cases where items have moved right due to insertions.
3332-
* If an item has moved off the current page due to a split, we'll fail to
3333-
* find it and do nothing (this is not an error case --- we assume the item
3334-
* will eventually get marked in a future indexscan).
3335-
*
3336-
* Note that if we hold a pin on the target page continuously from initially
3337-
* reading the items until applying this function, VACUUM cannot have deleted
3338-
* any items on the page, so the page's TIDs can't have been recycled by now.
3339-
* There's no risk that we'll confuse a new index tuple that happens to use a
3340-
* recycled TID with a now-removed tuple with the same TID (that used to be on
3341-
* this same page). We can't rely on that during scans that drop pins eagerly
3330+
* We match items by heap TID before assuming they are the right ones to set
3331+
* LP_DEAD. If the scan is one that holds a buffer pin on the target page
3332+
* continuously from initially reading the items until applying this function
3333+
* (if it is a !so->dropPin scan), VACUUM cannot have deleted any items on the
3334+
* page, so the page's TIDs can't have been recycled by now. There's no risk
3335+
* that we'll confuse a new index tuple that happens to use a recycled TID
3336+
* with a now-removed tuple with the same TID (that used to be on this same
3337+
* page). We can't rely on that during scans that drop buffer pins eagerly
33423338
* (so->dropPin scans), though, so we must condition setting LP_DEAD bits on
33433339
* the page LSN having not changed since back when _bt_readpage saw the page.
3340+
* We totally give up on setting LP_DEAD bits when the page LSN changed.
3341+
*
3342+
* We give up much less often during !so->dropPin scans, but it still happens.
3343+
* We cope with cases where items have moved right due to insertions. If an
3344+
* item has moved off the current page due to a split, we'll fail to find it
3345+
* and just give up on it.
33443346
*/
33453347
void
33463348
_bt_killitems(IndexScanDesc scan)
@@ -3353,6 +3355,7 @@ _bt_killitems(IndexScanDesc scan)
33533355
OffsetNumber maxoff;
33543356
int numKilled = so->numKilled;
33553357
bool killedsomething = false;
3358+
Buffer buf;
33563359

33573360
Assert(numKilled > 0);
33583361
Assert(BTScanPosIsValid(so->currPos));
@@ -3369,11 +3372,11 @@ _bt_killitems(IndexScanDesc scan)
33693372
* concurrent VACUUMs from recycling any of the TIDs on the page.
33703373
*/
33713374
Assert(BTScanPosIsPinned(so->currPos));
3372-
_bt_lockbuf(rel, so->currPos.buf, BT_READ);
3375+
buf = so->currPos.buf;
3376+
_bt_lockbuf(rel, buf, BT_READ);
33733377
}
33743378
else
33753379
{
3376-
Buffer buf;
33773380
XLogRecPtr latestlsn;
33783381

33793382
Assert(!BTScanPosIsPinned(so->currPos));
@@ -3391,10 +3394,9 @@ _bt_killitems(IndexScanDesc scan)
33913394
}
33923395

33933396
/* Unmodified, hinting is safe */
3394-
so->currPos.buf = buf;
33953397
}
33963398

3397-
page = BufferGetPage(so->currPos.buf);
3399+
page = BufferGetPage(buf);
33983400
opaque = BTPageGetOpaque(page);
33993401
minoff = P_FIRSTDATAKEY(opaque);
34003402
maxoff = PageGetMaxOffsetNumber(page);
@@ -3511,10 +3513,13 @@ _bt_killitems(IndexScanDesc scan)
35113513
if (killedsomething)
35123514
{
3513< 7D50 /td>3515
opaque->btpo_flags |= BTP_HAS_GARBAGE;
3514-
MarkBufferDirtyHint(so->currPos.buf, true);
3516+
MarkBufferDirtyHint(buf, true);
35153517
}
35163518

3517-
_bt_unlockbuf(rel, so->currPos.buf);
3519+
if (!so->dropPin)
3520+
_bt_unlockbuf(rel, buf);
3521+
else
3522+
_bt_relbuf(rel, buf);
35183523
}
35193524

35203525

0 commit comments

Comments
 (0)
0