Logical replication, large transaction streaming by ololobus · Pull Request #8 · ololobus/postgres

Open · wants to merge 11 commits into master

Conversation

ololobus (Owner)

No description provided.

@ololobus force-pushed the master branch 2 times, most recently from d31e023 to e718d23 on November 13, 2018 13:39
Tomas Vondra and others added 10 commits November 20, 2018 11:56
Instead of deciding to serialize a transaction merely based on the
number of changes in that xact (toplevel or subxact), this makes
the decision based on the amount of memory consumed by the changes.

The amount of memory is defined by a new logical_work_mem GUC, so
for example we can do this

    SET logical_work_mem = '128kB'

to trigger very aggressive streaming. The minimum value is 64kB
(i.e. lower than the minimum value for maintenance_work_mem,
which is 1MB).

When adding a change to a transaction, we account for the size in
two places. Firstly, in the ReorderBuffer, which is then used to
decide if we reached the total memory limit. And secondly in the
transaction the change belongs to, so that we can pick the largest
transaction to evict (and serialize to disk).
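
For illustration, the two-level accounting might look like the
following C sketch; function and field names are modeled on the
description above, not taken from the patch itself.

    /*
     * Sketch of the two-level accounting described above; names are
     * illustrative, not the exact patch code.
     */
    static void
    ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
                                    ReorderBufferChange *change,
                                    bool addition)
    {
        Size              sz = ReorderBufferChangeSize(change);
        ReorderBufferTXN *txn = change->txn;

        if (addition)
        {
            txn->size += sz;    /* per-xact, used to pick the eviction victim */
            rb->size += sz;     /* buffer-wide, checked against the limit */
        }
        else
        {
            Assert(txn->size >= sz && rb->size >= sz);
            txn->size -= sz;    /* e.g. when a change is serialized to disk */
            rb->size -= sz;
        }
    }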

We still use max_changes_in_memory when loading changes serialized
to disk. The trouble is we can't use the memory limit directly, as
there might be multiple serialized subxacts; we need to read all of
them, but we don't know how many there are (or which subxact to
read first).

We do not serialize the ReorderBufferTXN entries, so if there is a
transaction with many subxacts, most memory may be used by this type
of object. Those records are not included in the memory accounting.

We also do not account for INTERNAL_TUPLECID changes, which are
kept in a separate list and not evicted from memory. Transactions
with many CTID changes may consume significant amounts of memory,
but we can't really do much about that.

The current eviction algorithm is very simple - the transaction is
picked merely by size, while it might be useful to also consider age
(LSN) of the changes for example. With the new Generational memory
allocator, evicting the oldest changes would make it more likely
the memory gets actually pfreed.
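
In pseudo-C, the check-and-evict step described here might be
sketched as follows (assuming a helper that scans for the largest
transaction; names are illustrative):

    /* Sketch: while over the limit, evict the largest transaction. */
    static void
    ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
    {
        while (rb->size >= logical_work_mem * 1024L)
        {
            /* pick the transaction consuming the most memory ... */
            ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

            /* ... and serialize its changes to disk, freeing the memory */
            ReorderBufferSerializeTXN(rb, txn);
        }
    }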

The logical_work_mem may be changed in two ways. In postgresql.conf,
which serves as the default for all publishers on that instance,
and when creating the subscription, using a work_mem parameter
in the WITH clause (which specifies a number of kilobytes).

The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (PGPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.

So instead we write the assignment info into WAL immediately, as
part of the next WAL record (to minimize overhead).
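
On the decoding side, the association can then be established as
soon as any record for the subxact is read. A minimal sketch,
assuming an accessor (XLogRecGetTopXid) for the toplevel XID carried
by the record, and a hypothetical wrapper function:

    /*
     * Decode-side sketch: as soon as any record arrives for a subxact,
     * associate it with its toplevel xact. DecodeAssignment is a
     * hypothetical wrapper, not the actual patch code.
     */
    static void
    DecodeAssignment(LogicalDecodingContext *ctx, XLogReaderState *record)
    {
        TransactionId xid = XLogRecGetXid(record);
        TransactionId topxid = XLogRecGetTopXid(record);

        if (TransactionIdIsValid(topxid) && topxid != xid)
            ReorderBufferAssignChild(ctx->reorder, topxid, xid,
                                     record->ReadRecPtr);
    }
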
When wal_level=logical, write individual invalidations into WAL so
that decoding can use this information.

We still add the invalidations to the cache, and write them to WAL
at commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).

So existing code relying on those invalidations (e.g. redo) does
not need to be changed.

The individual invalidations are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

These new xlog records are ignored by existing redo procedures,
which still rely on the invalidations written to commit records.
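
A heavily simplified write-side sketch; the helper collecting the
current command's messages is hypothetical here, see
LogLogicalInvalidations in the patch for the real logic:

    /* Sketch: WAL-log the invalidations accumulated for the current command. */
    static void
    LogLogicalInvalidations(void)
    {
        SharedInvalidationMessage *msgs;
        int         nmsgs;

        if (!XLogLogicalInfoActive())
            return;

        /* hypothetical helper returning the current command's messages */
        nmsgs = GetCurrentCommandInvalidations(&msgs);

        if (nmsgs > 0)
        {
            xl_xact_invals xlrec;

            xlrec.nmsgs = nmsgs;

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec, MinSizeOfXactInvals);
            XLogRegisterData((char *) msgs,
                             nmsgs * sizeof(SharedInvalidationMessage));
            XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
        }
    }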

The invalidations are decoded and added as a new ReorderBufferChange
type (REORDER_BUFFER_CHANGE_INVALIDATION), and then executed during
replay, unlike the existing invalidations (which are either decoded
as part of commit record, or executed immediately during decoding
and not added to reorderbuffer at all).
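
During replay, the new change type might then be handled roughly
like this, inside the apply loop's switch over change->action (the
field names under change->data are illustrative):

    case REORDER_BUFFER_CHANGE_INVALIDATION:
        /* execute the logged invalidation messages at this point in replay */
        ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
                                          change->data.inval.invalidations);
        break;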

LogStandbyInvalidations was accumulating all the invalidations in
memory, and then only wrote them once at commit time, which may
reduce the performance impact by amortizing the overhead and
deduplicating the invalidations.

The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
e.g. at the command level, or something like that?

This adds six methods to the output plugin API, adding support
for streaming changes of large transactions:

* stream_message
* stream_change
* stream_abort
* stream_commit
* stream_start
* stream_stop

Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these
new stream methods.

The stream_start/stream_stop methods are used to demarcate a chunk
of changes streamed for a particular toplevel transaction.
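
For illustration, wiring these callbacks up in a plugin's init
function could look like the sketch below; the callback names are
modeled on the test_decoding changes and should be treated as
assumptions:

    /* Sketch: registering the new stream callbacks in an output plugin. */
    void
    _PG_output_plugin_init(OutputPluginCallbacks *cb)
    {
        cb->startup_cb = pg_decode_startup;
        cb->change_cb = pg_decode_change;
        cb->commit_cb = pg_decode_commit_txn;

        /* new streaming callbacks */
        cb->stream_start_cb = pg_decode_stream_start;
        cb->stream_stop_cb = pg_decode_stream_stop;
        cb->stream_abort_cb = pg_decode_stream_abort;
        cb->stream_commit_cb = pg_decode_stream_commit;
        cb->stream_change_cb = pg_decode_stream_change;
        cb->stream_message_cb = pg_decode_stream_message;
    }
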
Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN(), using roughly the same logic
as in ReorderBufferCommit().

We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages.

This adds a second iterator for the streaming case, without the
spill-to-disk functionality and only processing changes currently
in memory.
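
A condensed sketch of that streaming path; the iterator names are
illustrative, and the real function also deals with snapshots,
command IDs and cleanup:

    /* Sketch: stream the in-memory changes of a toplevel transaction. */
    static void
    ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        ReorderBufferStreamIterTXNState *state;
        ReorderBufferChange *change;

        rb->stream_start(rb, txn);

        /* in-memory-only iterator, no spill-to-disk handling */
        state = ReorderBufferStreamIterTXNInit(rb, txn);
        while ((change = ReorderBufferStreamIterTXNNext(rb, state)) != NULL)
        {
            switch (change->action)
            {
                case REORDER_BUFFER_CHANGE_INSERT:
                case REORDER_BUFFER_CHANGE_UPDATE:
                case REORDER_BUFFER_CHANGE_DELETE:
                    rb->stream_change(rb, txn, change);
                    break;
                default:
                    break;
            }
        }
        ReorderBufferStreamIterTXNFinish(rb, state);

        rb->stream_stop(rb, txn);
    }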

Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

It also adds a ReorderBufferTXN pointer in two places:

* ReorderBufferChange, so that we know which xact it belongs to
* ReorderBufferTXN, pointing to toplevel xact (from subxact)

The output plugin can use this to decide which changes to discard
in case of stream_abort_cb (e.g. when a subxact gets aborted).
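
In terms of the structs, the two new back-pointers amount to roughly
the following (other fields omitted; the toptxn field name is an
assumption):

    typedef struct ReorderBufferTXN ReorderBufferTXN;

    typedef struct ReorderBufferChange
    {
        /* ... existing fields omitted ... */
        ReorderBufferTXN *txn;      /* xact this change belongs to */
    } ReorderBufferChange;

    struct ReorderBufferTXN
    {
        /* ... existing fields omitted ... */
        ReorderBufferTXN *toptxn;   /* toplevel xact (NULL if toplevel itself) */
    };
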
To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:

* Extend the logical replication protocol, to identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).

* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.

* Modify the replication apply worker, to properly handle streamed
in-progress transactions by spilling the data to disk and then
replaying them on commit (see the sketch after this list).
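
A minimal sketch of that third point, assuming hypothetical helpers
for the per-transaction spill file (the actual worker code is
considerably more involved):

    /*
     * Sketch: apply-worker handling of streamed chunks. All helper and
     * variable names here are hypothetical, not the actual patch code.
     */
    static bool in_streamed_transaction = false;
    static int  stream_fd = -1;

    static void
    apply_handle_stream_start(StringInfo s)
    {
        TransactionId xid = pq_getmsgint(s, 4);

        /* open (or create) the file spilling this xact's changes to disk */
        stream_fd = stream_open_file(MyLogicalRepWorker->subid, xid);
        in_streamed_transaction = true;
    }

    static void
    apply_handle_stream_commit(StringInfo s)
    {
        TransactionId xid = pq_getmsgint(s, 4);

        /* replay everything spilled for this xact, then commit as usual */
        apply_spooled_messages(xid);
        CommitTransactionCommand();
    }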

However, we must explicitly disable streaming during replication
slot creation, even if the plugin supports it. We don't need to
replicate the changes accumulated during this phase, and moreover
we don't have a replication connection open, so we would have
nowhere to send the data anyway.
@ololobus force-pushed the master branch 3 times, most recently from 8542a98 to 2603ea1 on March 26, 2020 16:21
@ololobus force-pushed the master branch 4 times, most recently from 7c657da to 0e5f4f7 on April 6, 2020 11:24
@ololobus force-pushed the master branch 2 times, most recently from 28db880 to 7f8e356 on April 8, 2020 11:33
@ololobus force-pushed the master branch 2 times, most recently from b32538b to fb6f525 on June 25, 2020 16:39
@ololobus force-pushed the master branch 2 times, most recently from 5d7849e to afd25cc on July 23, 2020 14:51
@ololobus force-pushed the master branch 2 times, most recently from 2843bba to 9fcc5b8 on August 17, 2020 18:09
@ololobus force-pushed the master branch 3 times, most recently from 7fb7114 to 9dd5d33 on September 2, 2020 20:15
@ololobus force-pushed the master branch 2 times, most recently from 2435350 to 5a0dfc6 on September 23, 2020 14:34
@ololobus force-pushed the master branch 2 times, most recently from 5a5411e to 63ab42a on November 10, 2020 16:10
@ololobus force-pushed the master branch 2 times, most recently from 2943f01 to c87ec8b on November 17, 2020 11:12