@@ -200,3 +200,131 @@ is no explicit prohibition on SRFs in UPDATE, but the net effect will be
that only the first result row of an SRF counts, because all subsequent
rows will result in attempts to re-update an already updated target row.
This is historical behavior and seems not worth changing.)
+
+ Speculative insertion
+ ---------------------
+
+ Speculative insertion is a process that the executor manages for the benefit
+ of INSERT...ON CONFLICT UPDATE/IGNORE. The supported arbiter indexes are
+ nbtree unique indexes (nbtree is currently the only amcanunique index access
+ method) and exclusion constraint indexes (exclusion constraints are
+ considered a generalization of unique constraints). Only ON CONFLICT IGNORE
+ is supported with exclusion constraints.
+
+ The primary user-visible goal for INSERT...ON CONFLICT UPDATE is to guarantee
+ either an insert or an update under normal operating conditions in READ
+ COMMITTED mode (where serialization failures are just as unacceptable as they
+ are with regular UPDATEs). A would-be conflict (and the associated arbiter
+ index) determines whether or not the alternative (UPDATE/IGNORE) path is
+ taken. The implementation more or less tries to update or insert until one or
+ the other of those two outcomes occurs successfully. There are some
+ non-obvious hazards involved that are carefully avoided. These hazards
+ relate to concurrent activity causing conflicts for the implementation,
+ which must be handled.
+
+ The index is the authoritative source of truth for whether there is or is not
+ a conflict, for unique index enforcement in general, and for speculative
+ insertion in particular. The heap must still be considered, though, not least
+ since it alone has authoritative visibility information. Through looping, we
+ hope to overcome the disconnect between the heap and the arbiter index. We
+ must lock the row, and then verify that there is no conflict; only then do we
+ UPDATE. Theoretically, an individual session could loop forever, although
+ under high concurrency at least one session always proceeds.
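+
+ The overall shape of that loop can be summarized with a toy sketch (not the
+ actual executor code; the function names and the hard-wired outcomes are
+ hypothetical stand-ins for the steps described above and below):
+
+     /*
+      * Toy model of the speculative insertion retry loop.  All names are
+      * hypothetical; the stubs stand in for real executor/index-AM work.
+      */
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef enum { PRECHECK_NO_CONFLICT, PRECHECK_CONFLICT } PrecheckResult;
+
+     static PrecheckResult
+     precheck_arbiter_index(int *conflict_tid)
+     {
+         *conflict_tid = 42;         /* pretend a committed conflict exists */
+         return PRECHECK_CONFLICT;
+     }
+
+     /* Returns false if a concurrent conflict forces a restart. */
+     static bool lock_and_update_row(int tid) { (void) tid; return true; }
+
+     /* Returns false if a conflict is detected during insertion. */
+     static bool speculatively_insert(void) { return true; }
+
+     static void super_delete_promise_tuple(void) { }
+
+     int
+     main(void)
+     {
+         for (;;)
+         {
+             int         tid;
+
+             if (precheck_arbiter_index(&tid) == PRECHECK_CONFLICT)
+             {
+                 /* ON CONFLICT IGNORE would simply stop here. */
+                 if (lock_and_update_row(tid))
+                 {
+                     puts("updated existing row");
+                     return 0;
+                 }
+                 continue;   /* conflict while locking: restart from scratch */
+             }
+             if (speculatively_insert())
+             {
+                 puts("inserted new row");
+                 return 0;
+             }
+             super_delete_promise_tuple();   /* clean up, then restart */
+         }
+     }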
+
+ There are 2 sources of conflicts for ON CONFLICT UPDATE (the race window for
+ the second source is sketched after this list):
+
+ 1. Conflicts found while going to update (having found a conflict during the
+ pre-check): the tuple has changed in the meantime (later tuple versions may
+ or may not contain new, distinct constrained values -- for simplicity, we
+ don't consider that). This is not a conflict that the IGNORE variant
+ considers.
+
+ 2. Conflicts from inserting a tuple (having not found a conflict during the
+ pre-check), and only then finding a conflict at insertion time (when
+ inserting index tuples, and finding a conflicting one while a buffer lock is
+ held on an index page in the ordinary course of insertion). This can happen
+ if a concurrent insertion occurs after the pre-check, but before physical
+ index tuple insertion.
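+
+ The window for the second source can be made concrete with a small toy
+ simulation (hypothetical names; no real locking or indexes): the pre-check
+ sees no conflict, another session then inserts the same value, and only our
+ own index tuple insertion detects it. The first source is analogous, except
+ that the race happens between the pre-check and locking the to-be-updated
+ row.
+
+     /* Toy timeline for the race between pre-check and index insertion. */
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     static bool index_has[100];     /* toy "arbiter index", keyed by value */
+
+     static bool precheck(int key)   { return index_has[key]; }
+     static void concurrent_session_inserts(int key) { index_has[key] = true; }
+
+     /* Detects conflicts, but never waits (see the second step, below). */
+     static bool
+     insert_index_tuple(int key)
+     {
+         if (index_has[key])
+             return false;
+         index_has[key] = true;
+         return true;
+     }
+
+     int
+     main(void)
+     {
+         int         key = 42;
+
+         if (!precheck(key))                  /* no conflict seen... */
+         {
+             concurrent_session_inserts(key); /* ...but another session races us */
+             if (!insert_index_tuple(key))
+                 puts("conflict detected only at insertion time: restart");
+         }
+         return 0;
+     }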
+
+ The first step in the loop is to perform a pre-check. The arbiter indexes are
+ scanned for existing conflicting values. At this point, we may have to wait
+ until the end of another xact (or for another xact's promise token -- more on
+ that later), iff it isn't immediately conclusive that there is or is not a
+ conflict (when we finish the pre-check, there is a preliminary conclusion
+ about there either being or not being a conflict -- but the conclusion only
+ holds if there are no subsequent concurrent conflicts). If a conclusively
+ committed conflict tuple is detected during the first step, the executor goes
+ to lock and update the row (for ON CONFLICT UPDATE -- otherwise, for ON
+ CONFLICT IGNORE, we're done). The TID to lock (and potentially UPDATE) can
+ only be determined during the first step. If locking the row finds a
+ concurrent conflict (which may be from a concurrent UPDATE that hasn't even
+ physically inspected the arbiter index yet) then we restart the loop from the
+ very beginning. We restart from scratch because all bets are off; it's
+ possible that the process will find no conflict the second time around, and
+ will successfully insert, or will UPDATE another tuple that is not even part
+ of the same UPDATE chain as the first time around.
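+
+ As a toy sketch of the first step (hypothetical names; not the real index-AM
+ interface), the pre-check keeps re-scanning until the outcome is conclusive,
+ waiting out other xacts or promise tokens as needed:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef enum { CHECK_NO_CONFLICT, CHECK_CONFLICT, CHECK_MUST_WAIT } CheckOutcome;
+
+     typedef struct
+     {
+         int         conflict_tid;  /* TID to lock/update, if CHECK_CONFLICT */
+         int         wait_on;       /* xact or promise token, if CHECK_MUST_WAIT */
+     } CheckResult;
+
+     /* Stub: a real pre-check scans each arbiter index for conflicting values. */
+     static CheckOutcome
+     scan_arbiter_indexes(CheckResult *res)
+     {
+         res->conflict_tid = 7;
+         return CHECK_CONFLICT;
+     }
+
+     static void wait_for_end_of(int xact_or_token) { (void) xact_or_token; }
+
+     int
+     main(void)
+     {
+         CheckResult res;
+         CheckOutcome outcome;
+
+         while ((outcome = scan_arbiter_indexes(&res)) == CHECK_MUST_WAIT)
+             wait_for_end_of(res.wait_on);
+
+         if (outcome == CHECK_CONFLICT)
+             printf("conflict at tid %d: lock the row, then UPDATE (IGNORE just stops)\n",
+                    res.conflict_tid);
+         else
+             puts("no conflict seen: proceed to speculative insertion");
+         return 0;
+     }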
+
+ The second step (skipped when the pre-check finds a conflict) is to insert a
+ heap tuple and related index tuples opportunistically. This uses the same
+ mechanism as deferred unique indexes, and so we never wait for a possibly
+ conflicting xact to commit or abort (unlike with conventional unique index
+ insertion) -- we simply detect a possible conflict.
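+
+ A toy contrast of the two behaviors (hypothetical names; the "index" is just
+ an array, and waiting on the conflicting xact is elided entirely):
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     static bool key_exists(int key) { return key == 10; }  /* committed key */
+
+     /* Conventional insertion waits for the conflicting xact, then errors out
+      * with a unique violation; the wait is omitted from this toy model. */
+     static bool
+     conventional_insert(int key)
+     {
+         if (key_exists(key))
+         {
+             fprintf(stderr, "ERROR: duplicate key %d\n", key);
+             return false;
+         }
+         return true;
+     }
+
+     /* Speculative insertion never waits: it only reports that a conflict is
+      * possible, so that the caller can super delete and restart. */
+     static bool
+     speculative_insert(int key, bool *possible_conflict)
+     {
+         *possible_conflict = key_exists(key);
+         return !*possible_conflict;
+     }
+
+     int
+     main(void)
+     {
+         bool        conflict;
+
+         conventional_insert(10);
+         if (!speculative_insert(10, &conflict) && conflict)
+             puts("possible conflict: caller super deletes and restarts");
+         return 0;
+     }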
+
+ When opportunistically inserting during the second step, we are not logically
+ inserting a tuple as such. Rather, the process is somewhat similar to the
+ conventional unique index insertion steps taken within the nbtree AM, where we
+ must briefly lock the *value* being inserted: in that codepath, the value
+ proposed for insertion is for an instant locked *in the abstract*, by way of a
+ buffer lock on "the first leaf page the value could be on". Then, having
+ established the right to physically insert, insertion proceeds (or an error is
+ raised). For speculative insertion, if no conflict occurs during the
+ insertion (which is usually the case, since it was just determined in the
+ first step that there was no conflict), then we're done. Otherwise, we must
+ restart (and likely find the same conflict tuple during the first step of the
+ new iteration). But a counter-intuitive step must be taken first (which is
+ what makes this whole dance similar to conventional nbtree "value locking").
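+
+ A toy model of that "value locking" idea (hypothetical names; a leaf page is
+ just a small array, and the buffer lock is a plain flag rather than a real
+ lock): the value is only ever "locked" for as long as the page lock is held.
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct { bool locked; int nkeys; int keys[8]; } LeafPage;
+
+     static void lock_buffer(LeafPage *p)   { p->locked = true; }
+     static void unlock_buffer(LeafPage *p) { p->locked = false; }
+
+     static bool
+     page_contains(const LeafPage *p, int key)
+     {
+         for (int i = 0; i < p->nkeys; i++)
+             if (p->keys[i] == key)
+                 return true;
+         return false;
+     }
+
+     /* Lock "the first leaf page the value could be on", check, then either
+      * insert or report a conflict.  Returns true on successful insertion. */
+     static bool
+     insert_value(LeafPage *page, int key)
+     {
+         bool        ok;
+
+         lock_buffer(page);
+         ok = !page_contains(page, key);
+         if (ok)
+             page->keys[page->nkeys++] = key;
+         unlock_buffer(page);
+         return ok;
+     }
+
+     int
+     main(void)
+     {
+         LeafPage    page = {false, 1, {10}};
+
+         printf("insert 10: %s\n", insert_value(&page, 10) ? "ok" : "conflict");
+         printf("insert 11: %s\n", insert_value(&page, 11) ? "ok" : "conflict");
+         return 0;
+     }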
+
+ We must "super delete" the tuple when the opportunistic insertion finds a
+ conflict. This means that it immediately becomes invisible to all snapshot
+ types, and immediately becomes reclaimable by VACUUM. Other backends
+ (speculative inserters or ordinary inserters) know not to wait on our
+ transaction end when they encounter an opportunistically inserted "promise
+ tuple". Rather, they wait on a corresponding promise token lock, which we
+ hold only for as long as we are opportunistically inserting. We release the
+ lock when done opportunistically inserting (and after "super deleting", if
+ that proved necessary), releasing our waiters (who will ordinarily re-find our
+ promise tuple as a bona fide tuple, or occasionally will find that they can
+ insert after all). It's important that other xacts not wait on the end of our
+ xact until we've established that we've successfully and conclusively inserted
+ logically (or established that there was an insertion conflict, and cleaned up
+ after it by "super deleting"). Otherwise, concurrent speculative inserters
+ could be involved in "unprincipled deadlocks": deadlocks where there is no
+ user-visible mutual dependency, and yet an implementation-related mutual
+ dependency is unexpectedly introduced. The user might be left with no
+ reasonable way of avoiding these deadlocks, which would not be okay.
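+
+ The following toy sketch (hypothetical names; no real heavyweight locks) shows
+ what a concurrent inserter does when it trips over such a tuple, depending on
+ whether it is still a promise tuple, was super deleted, or became a bona fide
+ tuple:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct
+     {
+         bool        promise;        /* still a promise tuple? */
+         bool        super_deleted;  /* super deleted after a conflict? */
+         int         token;          /* promise token the inserter has locked */
+     } Tuple;
+
+     static void
+     concurrent_inserter_sees(const Tuple *t)
+     {
+         if (t->promise)
+             printf("wait on promise token %d, not on the inserter's xact\n",
+                    t->token);
+         else if (t->super_deleted)
+             puts("tuple is gone; we may be able to insert after all");
+         else
+             puts("bona fide tuple; wait on the inserter's xact as usual");
+     }
+
+     int
+     main(void)
+     {
+         Tuple       t = {true, false, 1};
+
+         concurrent_inserter_sees(&t);   /* during the speculative insertion */
+
+         /* The insertion found a conflict: super delete, release the token. */
+         t.promise = false;
+         t.super_deleted = true;
+         concurrent_inserter_sees(&t);
+         return 0;
+     }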
+
+ Speculative insertion and EvalPlanQual()
+ ----------------------------------------
+
+ Updating the tuple involves locking it first (to establish a definitive tuple
+ against which to evaluate the additional UPDATE qual). The EvalPlanQual()
+ mechanism (or, rather, some associated infrastructure) is reused for the
+ benefit of auxiliary UPDATE expression evaluation.
+
+ Locking first deviates from how conventional UPDATEs work, but allows the
+ implementation to consider the possibility of conflicts first, and then,
+ having reached a definitive conclusion, separately evaluate the UPDATE qual.
+
+ ExecLockUpdateTuple() is somewhat similar to EvalPlanQual(), except it locks
+ the TID reported as conflicting and, upon successfully locking it, installs
+ that tuple into the UPDATE's EPQ slot. There is no UPDATE chain to walk --
+ rather, new tuples to check the qual against come from repeated attempts at
+ locking a tuple conclusively (avoiding conflicts). The qual (if any) is then
+ evaluated. Note that at READ COMMITTED, it's possible that *no* version of
+ the tuple is visible, and yet it may still be updated. Similarly, since we do
+ not walk the UPDATE chain, concurrent READ COMMITTED INSERT ... ON CONFLICT
+ UPDATE sessions always attempt to lock the conclusively visible tuple, without
+ regard to any other tuple version (REPEATABLE READ isolation level and up must
+ consider MVCC visibility, though). A further implication of this is that the
+ MVCC-snapshot-visible row version is denied the opportunity to prevent the
+ UPDATE from taking place, should it not pass our qual (while a later version
+ does pass it). This is fundamentally similar to updating a tuple when no
+ version is visible, though.
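+
+ As a toy sketch of that lock-then-check flow (hypothetical names; no real EPQ
+ machinery, and the "restart" path simply reports itself instead of looping
+ back to the pre-check):
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct { int tid; int value; } Tuple;
+
+     /* Stub: lock the conclusively visible tuple at this TID.  Returns false
+      * if a concurrent conflict forces a restart of the whole loop. */
+     static bool
+     lock_conflicting_tuple(int tid, Tuple *locked)
+     {
+         locked->tid = tid;
+         locked->value = 5;
+         return true;
+     }
+
+     /* The auxiliary UPDATE qual, evaluated against the locked tuple that was
+      * installed in the EPQ slot. */
+     static bool update_qual(const Tuple *t) { return t->value < 10; }
+
+     int
+     main(void)
+     {
+         Tuple       locked;
+
+         if (!lock_conflicting_tuple(7, &locked))
+         {
+             puts("concurrent conflict: restart from the pre-check");
+             return 0;
+         }
+         if (update_qual(&locked))
+             printf("qual passed: UPDATE tid %d\n", locked.tid);
+         else
+             puts("qual failed: leave the row alone");
+         return 0;
+     }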