@@ -200,3 +200,131 @@ is no explicit prohibition on SRFs in UPDATE, but the net effect will be
that only the first result row of an SRF counts, because all subsequent
rows will result in attempts to re-update an already updated target row.
This is historical behavior and seems not worth changing.)
+
+ Speculative insertion
+ ---------------------
+
+ Speculative insertion is a process that the executor manages for the benefit
+ of INSERT...ON CONFLICT UPDATE/IGNORE. The supported arbiter indexes are
+ nbtree unique indexes (nbtree is currently the only amcanunique index access
+ method) and exclusion constraint indexes (exclusion constraints are
+ considered a generalization of unique constraints). Only ON CONFLICT IGNORE
+ is supported with exclusion constraints.
+
+ The primary user-visible goal for INSERT...ON CONFLICT UPDATE is to guarantee
+ either an insert or an update under normal operating conditions in READ
+ COMMITTED mode (where serialization failures are just as unacceptable as they
+ are with regular UPDATEs). A would-be conflict (and the associated arbiter
+ index) determines whether or not the alternative (UPDATE/IGNORE) path is
+ taken. The implementation more or less tries to update or insert until one or
+ the other of those two outcomes occurs successfully. There are some
+ non-obvious hazards involved that are carefully avoided. These hazards
+ relate to concurrent activity causing conflicts for the implementation,
+ which must be handled.
+
+ The index is the authoritative source of truth for whether there is or is not
+ a conflict, for unique index enforcement in general, and for speculative
+ insertion in particular. The heap must still be considered, though, not least
+ since it alone has authoritative visibility information. Through looping, we
+ hope to overcome the disconnect between the heap and the arbiter index. We
+ must lock the row, and then verify that there is no conflict; only then do we
+ UPDATE. Theoretically, an individual session could loop forever, although
+ under high concurrency at least one session always proceeds.
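+
+ The overall shape of that loop can be summarized with a toy sketch (not the
+ actual executor code; the function names and the hard-wired outcomes are
+ hypothetical stand-ins for the steps described above and below):
+
+     /*
+      * Toy model of the speculative insertion retry loop.  All names are
+      * hypothetical; the stubs stand in for real executor/index-AM work.
+      */
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef enum { PRECHECK_NO_CONFLICT, PRECHECK_CONFLICT } PrecheckResult;
+
+     static PrecheckResult
+     precheck_arbiter_index(int *conflict_tid)
+     {
+         *conflict_tid = 42;         /* pretend a committed conflict exists */
+         return PRECHECK_CONFLICT;
+     }
+
+     /* Returns false if a concurrent conflict forces a restart. */
+     static bool lock_and_update_row(int tid) { (void) tid; return true; }
+
+     /* Returns false if a conflict is detected during insertion. */
+     static bool speculatively_insert(void) { return true; }
+
+     static void super_delete_promise_tuple(void) { }
+
+     int
+     main(void)
+     {
+         for (;;)
+         {
+             int         tid;
+
+             if (precheck_arbiter_index(&tid) == PRECHECK_CONFLICT)
+             {
+                 /* ON CONFLICT IGNORE would simply stop here. */
+                 if (lock_and_update_row(tid))
+                 {
+                     puts("updated existing row");
+                     return 0;
+                 }
+                 continue;   /* conflict while locking: restart from scratch */
+             }
+             if (speculatively_insert())
+             {
+                 puts("inserted new row");
+                 return 0;
+             }
+             super_delete_promise_tuple();   /* clean up, then restart */
+         }
+     }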
+
+ There are 2 sources of conflicts for ON CONFLICT UPDATE (the race window for
+ the second source is sketched after this list):
+
+ 1. Conflicts found while going to update (having found a conflict during the
+ pre-check): the tuple has changed in the meantime (later tuple versions may
+ or may not contain new, distinct constrained values -- for simplicity, we
+ don't consider that). This is not a conflict that the IGNORE variant
+ considers.
+
+ 2. Conflicts from inserting a tuple (having not found a conflict during the
+ pre-check), and only then finding a conflict at insertion time (when
+ inserting index tuples, and finding a conflicting one while a buffer lock is
+ held on an index page in the ordinary course of insertion). This can happen
+ if a concurrent insertion occurs after the pre-check, but before physical
+ index tuple insertion.
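+
+ The window for the second source can be made concrete with a small toy
+ simulation (hypothetical names; no real locking or indexes): the pre-check
+ sees no conflict, another session then inserts the same value, and only our
+ own index tuple insertion detects it. The first source is analogous, except
+ that the race happens between the pre-check and locking the to-be-updated
+ row.
+
+     /* Toy timeline for the race between pre-check and index insertion. */
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     static bool index_has[100];     /* toy "arbiter index", keyed by value */
+
+     static bool precheck(int key)   { return index_has[key]; }
+     static void concurrent_session_inserts(int key) { index_has[key] = true; }
+
+     /* Detects conflicts, but never waits (see the second step, below). */
+     static bool
+     insert_index_tuple(int key)
+     {
+         if (index_has[key])
+             return false;
+         index_has[key] = true;
+         return true;
+     }
+
+     int
+     main(void)
+     {
+         int         key = 42;
+
+         if (!precheck(key))                  /* no conflict seen... */
+         {
+             concurrent_session_inserts(key); /* ...but another session races us */
+             if (!insert_index_tuple(key))
+                 puts("conflict detected only at insertion time: restart");
+         }
+         return 0;
+     }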
+
+ The first step in the loop is to perform a pre-check. The arbiter indexes are
+ scanned for existing conflicting values. At this point, we may have to wait
+ until the end of another xact (or for another xact's promise token -- more on
+ that later), iff it isn't immediately conclusive that there is or is not a
+ conflict (when we finish the pre-check, there is a preliminary conclusion
+ about there either being or not being a conflict -- but the conclusion only
+ holds if there are no subsequent concurrent conflicts). If a conclusively
+ committed conflict tuple is detected during the first step, the executor goes
+ to lock and update the row (for ON CONFLICT UPDATE -- otherwise, for ON
+ CONFLICT IGNORE, we're done). The TID to lock (and potentially UPDATE) can
+ only be determined during the first step. If locking the row finds a
+ concurrent conflict (which may be from a concurrent UPDATE that hasn't even
+ physically inspected the arbiter index yet) then we restart the loop from the
+ very beginning. We restart from scratch because all bets are off; it's
+ possible that the process will find no conflict the second time around, and
+ will successfully insert, or will UPDATE another tuple that is not even part
+ of the same UPDATE chain as the first time around.
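+
+ As a toy sketch of the first step (hypothetical names; not the real index-AM
+ interface), the pre-check keeps re-scanning until the outcome is conclusive,
+ waiting out other xacts or promise tokens as needed:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef enum { CHECK_NO_CONFLICT, CHECK_CONFLICT, CHECK_MUST_WAIT } CheckOutcome;
+
+     typedef struct
+     {
+         int         conflict_tid;  /* TID to lock/update, if CHECK_CONFLICT */
+         int         wait_on;       /* xact or promise token, if CHECK_MUST_WAIT */
+     } CheckResult;
+
+     /* Stub: a real pre-check scans each arbiter index for conflicting values. */
+     static CheckOutcome
+     scan_arbiter_indexes(CheckResult *res)
+     {
+         res->conflict_tid = 7;
+         return CHECK_CONFLICT;
+     }
+
+     static void wait_for_end_of(int xact_or_token) { (void) xact_or_token; }
+
+     int
+     main(void)
+     {
+         CheckResult res;
+         CheckOutcome outcome;
+
+         while ((outcome = scan_arbiter_indexes(&res)) == CHECK_MUST_WAIT)
+             wait_for_end_of(res.wait_on);
+
+         if (outcome == CHECK_CONFLICT)
+             printf("conflict at tid %d: lock the row, then UPDATE (IGNORE just stops)\n",
+                    res.conflict_tid);
+         else
+             puts("no conflict seen: proceed to speculative insertion");
+         return 0;
+     }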
+
+ The second step (skipped when the pre-check finds a conflict) is to insert a
+ heap tuple and related index tuples opportunistically. This uses the same
+ mechanism as deferred unique indexes, and so we never wait for a possibly
+ conflicting xact to commit or abort (unlike with conventional unique index
+ insertion) -- we simply detect a possible conflict.
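+
+ A toy contrast of the two behaviors (hypothetical names; the "index" is just
+ an array, and waiting on the conflicting xact is elided entirely):
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     static bool key_exists(int key) { return key == 10; }  /* committed key */
+
+     /* Conventional insertion waits for the conflicting xact, then errors out
+      * with a unique violation; the wait is omitted from this toy model. */
+     static bool
+     conventional_insert(int key)
+     {
+         if (key_exists(key))
+         {
+             fprintf(stderr, "ERROR: duplicate key %d\n", key);
+             return false;
+         }
+         return true;
+     }
+
+     /* Speculative insertion never waits: it only reports that a conflict is
+      * possible, so that the caller can super delete and restart. */
+     static bool
+     speculative_insert(int key, bool *possible_conflict)
+     {
+         *possible_conflict = key_exists(key);
+         return !*possible_conflict;
+     }
+
+     int
+     main(void)
+     {
+         bool        conflict;
+
+         conventional_insert(10);
+         if (!speculative_insert(10, &conflict) && conflict)
+             puts("possible conflict: caller super deletes and restarts");
+         return 0;
+     }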
+
+ When opportunistically inserting during the second step, we are not logically
+ inserting a tuple as such. Rather, the process is somewhat similar to the
+ conventional unique index insertion steps taken within the nbtree AM, where we
+ must briefly lock the *value* being inserted: in that codepath, the value
+ proposed for insertion is for an instant locked *in the abstract*, by way of a
+ buffer lock on "the first leaf page the value could be on". Then, having
+ established the right to physically insert, insertion proceeds (or an error is
+ raised). For speculative insertion, if no conflict occurs during the
+ insertion (which is usually the case, since it was just determined in the
+ first step that there was no conflict), then we're done. Otherwise, we must
+ restart (and likely find the same conflict tuple during the first step of the
+ new iteration). But a counter-intuitive step must be taken first (which is
+ what makes this whole dance similar to conventional nbtree "value locking").
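+
+ A toy model of that "value locking" idea (hypothetical names; a leaf page is
+ just a small array, and the buffer lock is a plain flag rather than a real
+ lock): the value is only ever "locked" for as long as the page lock is held.
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct { bool locked; int nkeys; int keys[8]; } LeafPage;
+
+     static void lock_buffer(LeafPage *p)   { p->locked = true; }
+     static void unlock_buffer(LeafPage *p) { p->locked = false; }
+
+     static bool
+     page_contains(const LeafPage *p, int key)
+     {
+         for (int i = 0; i < p->nkeys; i++)
+             if (p->keys[i] == key)
+                 return true;
+         return false;
+     }
+
+     /* Lock "the first leaf page the value could be on", check, then either
+      * insert or report a conflict.  Returns true on successful insertion. */
+     static bool
+     insert_value(LeafPage *page, int key)
+     {
+         bool        ok;
+
+         lock_buffer(page);
+         ok = !page_contains(page, key);
+         if (ok)
+             page->keys[page->nkeys++] = key;
+         unlock_buffer(page);
+         return ok;
+     }
+
+     int
+     main(void)
+     {
+         LeafPage    page = {false, 1, {10}};
+
+         printf("insert 10: %s\n", insert_value(&page, 10) ? "ok" : "conflict");
+         printf("insert 11: %s\n", insert_value(&page, 11) ? "ok" : "conflict");
+         return 0;
+     }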
+
+ We must "super delete" the tuple when the opportunistic insertion finds a
+ conflict. This means that it immediately becomes invisible to all snapshot
+ types, and immediately becomes reclaimable by VACUUM. Other backends
+ (speculative inserters or ordinary inserters) know not to wait on our
+ transaction end when they encounter an opportunistically inserted "promise
+ tuple". Rather, they wait on a corresponding promise token lock, which we
+ hold only for as long as we are opportunistically inserting. We release the
+ lock when done opportunistically inserting (and after "super deleting", if
+ that proved necessary), releasing our waiters (who will ordinarily re-find our
+ promise tuple as a bona fide tuple, or occasionally will find that they can
+ insert after all). It's important that other xacts not wait on the end of our
+ xact until we've established that we've successfully and conclusively inserted
+ logically (or established that there was an insertion conflict, and cleaned up
+ after it by "super deleting"). Otherwise, concurrent speculative inserters
+ could be involved in "unprincipled deadlocks": deadlocks where there is no
+ user-visible mutual dependency, and yet an implementation-related mutual
+ dependency is unexpectedly introduced. The user might be left with no
+ reasonable way of avoiding these deadlocks, which would not be okay.
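+
+ The following toy sketch (hypothetical names; no real heavyweight locks) shows
+ what a concurrent inserter does when it trips over such a tuple, depending on
+ whether it is still a promise tuple, was super deleted, or became a bona fide
+ tuple:
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct
+     {
+         bool        promise;        /* still a promise tuple? */
+         bool        super_deleted;  /* super deleted after a conflict? */
+         int         token;          /* promise token the inserter has locked */
+     } Tuple;
+
+     static void
+     concurrent_inserter_sees(const Tuple *t)
+     {
+         if (t->promise)
+             printf("wait on promise token %d, not on the inserter's xact\n",
+                    t->token);
+         else if (t->super_deleted)
+             puts("tuple is gone; we may be able to insert after all");
+         else
+             puts("bona fide tuple; wait on the inserter's xact as usual");
+     }
+
+     int
+     main(void)
+     {
+         Tuple       t = {true, false, 1};
+
+         concurrent_inserter_sees(&t);   /* during the speculative insertion */
+
+         /* The insertion found a conflict: super delete, release the token. */
+         t.promise = false;
+         t.super_deleted = true;
+         concurrent_inserter_sees(&t);
+         return 0;
+     }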
+
+ Speculative insertion and EvalPlanQual()
+ ----------------------------------------
+
+ Updating the tuple involves locking it first (to establish a definitive tuple
+ against which to evaluate the additional UPDATE qual). The EvalPlanQual()
+ mechanism (or, rather, some associated infrastructure) is reused for the
+ benefit of auxiliary UPDATE expression evaluation.
+
+ Locking first deviates from how conventional UPDATEs work, but allows the
+ implementation to consider the possibility of conflicts first, and then,
+ having reached a definitive conclusion, separately evaluate the UPDATE qual.
+
+ ExecLockUpdateTuple() is somewhat similar to EvalPlanQual(), except it locks
+ the TID reported as conflicting and, upon successfully locking it, installs
+ that tuple into the UPDATE's EPQ slot. There is no UPDATE chain to walk --
+ rather, new tuples to check the qual against come from repeated attempts at
+ locking a tuple conclusively (avoiding conflicts). The qual (if any) is then
+ evaluated. Note that at READ COMMITTED, it's possible that *no* version of
+ the tuple is visible, and yet it may still be updated. Similarly, since we do
+ not walk the UPDATE chain, concurrent READ COMMITTED INSERT ... ON CONFLICT
+ UPDATE sessions always attempt to lock the conclusively visible tuple, without
+ regard to any other tuple version (REPEATABLE READ isolation level and up must
+ consider MVCC visibility, though). A further implication of this is that the
+ MVCC-snapshot-visible row version is denied the opportunity to prevent the
+ UPDATE from taking place, should it not pass our qual (while a later version
+ does pass it). This is fundamentally similar to updating a tuple when no
+ version is visible, though.
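+
+ As a toy sketch of that lock-then-check flow (hypothetical names; no real EPQ
+ machinery, and the "restart" path simply reports itself instead of looping
+ back to the pre-check):
+
+     #include <stdbool.h>
+     #include <stdio.h>
+
+     typedef struct { int tid; int value; } Tuple;
+
+     /* Stub: lock the conclusively visible tuple at this TID.  Returns false
+      * if a concurrent conflict forces a restart of the whole loop. */
+     static bool
+     lock_conflicting_tuple(int tid, Tuple *locked)
+     {
+         locked->tid = tid;
+         locked->value = 5;
+         return true;
+     }
+
+     /* The auxiliary UPDATE qual, evaluated against the locked tuple that was
+      * installed in the EPQ slot. */
+     static bool update_qual(const Tuple *t) { return t->value < 10; }
+
+     int
+     main(void)
+     {
+         Tuple       locked;
+
+         if (!lock_conflicting_tuple(7, &locked))
+         {
+             puts("concurrent conflict: restart from the pre-check");
+             return 0;
+         }
+         if (update_qual(&locked))
+             printf("qual passed: UPDATE tid %d\n", locked.tid);
+         else
+             puts("qual failed: leave the row alone");
+         return 0;
+     }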