[MDEV-30671] innodb_undo_log_truncate=ON fails to wait for purge of transaction history

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.5, 10.6, 10.11, 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 11.0(EOL)
Fix Version/s: 10.11.3, 11.0.2, 10.5.20, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4
Component/s: Storage Engine - InnoDB
Labels:
- rr-profile-analyzed

Description

mleich produced an rr replay trace of an execution in which InnoDB crashes shortly after being started up on a freshly initialized database. No crash recovery was involved; the crash occurred soon after an undo tablespace was truncated:

10.5 5300c0fb7632e7e49a22997297bf731e691d3c24
Thread 5 hit Breakpoint 7, sql_print_information (format=format@entry=0x56404df92008 "InnoDB: %s") at /data/Server/bb-10.5-MDEV-30657/sql/log.cc:9283
9283 in /data/Server/bb-10.5-MDEV-30657/sql/log.cc
1: x/i $pc
=> 0x56404d8a3e20 <sql_print_information(char const*, ...)>: endbr64
(rr)
Continuing.
2023-02-15 19:11:55 0 [Note] InnoDB: Truncated .//undo002

Thread 14 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 3065874.3070558]
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:497
497 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
1: x/i $pc
=> 0x7f90c8b77955 <__memmove_avx_unaligned_erms+741>: vmovdqu (%rsi),%ymm0
(rr) bt
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:497
#1 0x000056404db25250 in memcpy (__len=<optimized out>, __src=0x7f90bb9123f7, __dest=0x7f8f8bf92078) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34
#2 mem_heap_dup (len=<optimized out>, data=0x7f90bb9123f7, heap=0x7f90a40643a0) at /data/Server/bb-10.5-MDEV-30657/storage/innobase/include/mem0mem.h:240
#3 trx_undo_rec_copy (heap=0x7f90a40643a0, undo_rec=0x7f90bb9123f7 "") at /data/Server/bb-10.5-MDEV-30657/storage/innobase/include/trx0rec.inl:70
#4 trx_undo_get_undo_rec_low (roll_ptr=<optimized out>, heap=0x7f90a40643a0) at /data/Server/bb-10.5-MDEV-30657/storage/innobase/trx/trx0rec.cc:1998
#5 0x000056404db27513 in trx_undo_get_undo_rec (undo_rec=<synthetic pointer>, name=..., trx_id=<optimized out>, heap=<optimized out>, roll_ptr=19984723349668855)
at /data/Server/bb-10.5-MDEV-30657/storage/innobase/trx/trx0rec.cc:2030
#6 trx_undo_prev_version_build (index_rec=index_rec@entry=0x7f90bbd98088 "\200", index_mtr=index_mtr@entry=0x7f909c074370, rec=rec@entry=0x7f90bbd98088 "\200", index=index@entry=0x7f90a805a198,
offsets=0x7f909c073d30, heap=heap@entry=0x7f90a40643a0, old_vers=0x7f909c073b98, v_heap=0x0, vrow=0x0, v_status=0) at /data/Server/bb-10.5-MDEV-30657/storage/innobase/trx/trx0rec.cc:2116
#7 0x000056404db0ce3c in row_vers_build_for_consistent_read (rec=0x7f90bbd98088 "\200", mtr=0x7f909c074370, index=0x7f90a805a198, offsets=0x7f909c073cf8, view=0x7f90c803e470, offset_heap=0x7f909c073cf0,
in_heap=0x7f90b407df70, old_vers=0x7f909c073f90, vrow=0x0) at /data/Server/bb-10.5-MDEV-30657/storage/innobase/row/row0vers.cc:1161
#8 0x000056404db016dc in row_search_mvcc (buf=0x7f9094059680 "\377\377\377", mode=PAGE_CUR_G, mode@entry=PAGE_CUR_UNSUPP, prebuilt=0x7f9094071b28, match_mode=<optimized out>, direction=1)
at /data/Server/bb-10.5-MDEV-30657/storage/innobase/row/row0sel.cc:5220

After extensive debugging, I concluded that undo tablespace truncation waits only until no active transactions (which could potentially be rolled back) remain in any rollback segment stored in the undo tablespace that is being considered for truncation. It does not wait for the committed transaction history in those rollback segments to be purged. There is a check like this in trx_purge_truncate_history():

    for (ulint i= 0; i < TRX_SYS_N_RSEGS; ++i)
    {
      trx_rseg_t *rseg= trx_sys.rseg_array[i];
      if (!rseg || rseg->space != &space)
        continue;
      mutex_enter(&rseg->mutex);
      ut_ad(rseg->skip_allocation);
      ut_ad(rseg->is_persistent());
      if (rseg->trx_ref_count)
      {
not_free:
        mutex_exit(&rseg->mutex);
        return;
      }

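At this point, rseg->needs_purge (which was added in MDEV-13536) can still hold; that is, the rollback segment may still contain committed history that has not been purged. The transaction reference count is decremented in trx_t::commit_in_memory(), so it can reach zero before purge has processed the undo log of the committed transaction.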
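
To make the gap concrete, here is a minimal sketch that contrasts a truncation check based only on the reference count with one that also consults needs_purge. RsegModel and both predicate names are hypothetical; the sketch models only the two fields mentioned above, assumes that needs_purge behaves like a flag, and leaves out the mutex and everything else in trx_rseg_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for trx_rseg_t: only the two fields
// discussed above are modeled, and needs_purge is assumed to be a flag.
struct RsegModel {
  uint32_t trx_ref_count;  // transactions currently using the rollback segment
  bool needs_purge;        // committed history not yet processed by purge
};

// Mirrors the quoted check: only active transactions defer truncation.
bool may_truncate_ref_count_only(const RsegModel &rseg) {
  return rseg.trx_ref_count == 0;
}

// Stricter variant: unpurged committed history defers truncation as well.
bool may_truncate_with_purge_check(const RsegModel &rseg) {
  return !rseg.needs_purge && rseg.trx_ref_count == 0;
}

// Walks through the three states that matter for this report.
void demo_truncation_checks() {
  // All transactions have committed, but purge has not caught up yet.
  const RsegModel committed_unpurged{0, true};
  assert(may_truncate_ref_count_only(committed_unpurged));     // truncates too early
  assert(!may_truncate_with_purge_check(committed_unpurged));  // waits for purge

  // A transaction is still active: both checks defer truncation.
  const RsegModel active{1, true};
  assert(!may_truncate_ref_count_only(active));
  assert(!may_truncate_with_purge_check(active));

  // Nothing active and nothing left to purge: both checks allow truncation.
  const RsegModel idle{0, false};
  assert(may_truncate_ref_count_only(idle));
  assert(may_truncate_with_purge_check(idle));
}
```

Only the first state distinguishes the two predicates, and it is the state in which a read view may still need undo records from the tablespace.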

At the time of the crash, DB_TRX_ID=0x2aea of the clustered index record whose history we are attempting to fetch is explicitly listed in purge_sys.view, which means that its history must not be purged yet:

(rr) f 7
#7  0x000056404db0ce3c in row_vers_build_for_consistent_read (rec=0x7f90bbd98088 "\200", mtr=0x7f909c074370, index=0x7f90a805a198, offsets=0x7f909c073cf8, view=0x7f90c803e470, offset_heap=0x7f909c073cf0,
    in_heap=0x7f90b407df70, old_vers=0x7f909c073f90, vrow=0x0) at /data/Server/bb-10.5-MDEV-30657/storage/innobase/row/row0vers.cc:1161
1161	/data/Server/bb-10.5-MDEV-30657/storage/innobase/row/row0vers.cc: No such file or directory.
(rr) p/x *rec@17
$38 = {0x80, 0x0, 0x5, 0x37, 0x0, 0x0, 0x0, 0x0, 0x2a, 0xea, 0x47, 0x0, 0x0, 0x0, 0x31, 0x3, 0xf7}
(rr) p/x purge_sys.view
$39 = {m_low_limit_id = 0x2b26, m_up_limit_id = 0x2ad5, m_ids = std::vector of length 8, capacity 41 = {0x2ad5, 0x2aea, 0x2b0a, 0x2b0f, 0x2b1b, 0x2b22, 0x2b24, 0x2b25}, m_low_limit_no = 0x2b26}
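
The two dumps above can be cross-checked with a small sketch. It assumes that the 17 bytes consist of a 4-byte key followed by the 6-byte DB_TRX_ID and the 7-byte DB_ROLL_PTR, all stored most significant byte first, and that the roll pointer packs a 1-bit insert flag, a 7-bit rollback segment number, a 32-bit page number and a 16-bit page offset. The names read_be, decode_roll_ptr and ReadViewModel are hypothetical, and the visibility rule is a simplified model of a read view, not the server's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// The 17 bytes printed by "p/x *rec@17" above.
const uint8_t kRec[17] = {0x80, 0x00, 0x05, 0x37, 0x00, 0x00, 0x00, 0x00, 0x2a,
                          0xea, 0x47, 0x00, 0x00, 0x00, 0x31, 0x03, 0xf7};

// Reads an n-byte unsigned integer stored most significant byte first.
uint64_t read_be(const uint8_t *p, unsigned n) {
  uint64_t v = 0;
  for (unsigned i = 0; i < n; i++) v = (v << 8) | p[i];
  return v;
}

struct RollPtr {
  bool is_insert;    // bit 55
  unsigned rseg_id;  // bits 48..54
  uint32_t page_no;  // bits 16..47
  uint16_t offset;   // bits 0..15
};

// Splits a 56-bit roll pointer under the layout assumed in the lead-in.
RollPtr decode_roll_ptr(uint64_t roll_ptr) {
  return {((roll_ptr >> 55) & 1) != 0, unsigned((roll_ptr >> 48) & 0x7f),
          uint32_t(roll_ptr >> 16), uint16_t(roll_ptr)};
}

// Simplified read view: limits plus a sorted list of transaction ids.
struct ReadViewModel {
  uint64_t low_limit_id;      // ids at or above this are not visible
  uint64_t up_limit_id;       // ids below this are visible
  std::vector<uint64_t> ids;  // sorted ids that are not visible to the view

  bool changes_visible(uint64_t id) const {
    if (id < up_limit_id) return true;
    if (id >= low_limit_id) return false;
    return !std::binary_search(ids.begin(), ids.end(), id);
  }
};

// Values copied from "p/x purge_sys.view" above.
const ReadViewModel kPurgeView{
    0x2b26, 0x2ad5,
    {0x2ad5, 0x2aea, 0x2b0a, 0x2b0f, 0x2b1b, 0x2b22, 0x2b24, 0x2b25}};

// Skip the assumed 4-byte key, then read the two system columns.
uint64_t rec_trx_id() { return read_be(kRec + 4, 6); }
uint64_t rec_roll_ptr() { return read_be(kRec + 10, 7); }
```

Under these assumptions, rec_trx_id() yields 0x2aea and rec_roll_ptr() yields 19984723349668855, which is the roll_ptr argument shown in frame #5 of the backtrace. kPurgeView.changes_visible(0x2aea) is false, which is consistent with the observation that the history of this record must not have been purged yet.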

I think that the following could fix the bug:

diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 4d84f295c0b..39095013457 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -627,7 +627,7 @@ static void trx_purge_truncate_history()
       mutex_enter(&rseg->mutex);
       ut_ad(rseg->skip_allocation);
       ut_ad(rseg->is_persistent());
-      if (rseg->trx_ref_count)
+      if (rseg->needs_purge || rseg->trx_ref_count)
 not_free:
         mutex_exit(&rseg->mutex);
@@ -753,6 +753,8 @@ static void trx_purge_truncate_history()
       ut_ad(rseg->id == i);
       ut_ad(rseg->is_persistent());
+      ut_ad(!rseg->trx_ref_count);
+      ut_ad(!rseg->needs_purge);
       ut_d(const auto old_page= rseg->page_no);
       buf_block_t *rblock= trx_rseg_header_create(&space, i,

Issue Links

blocks
- MDEV-29593 Purge misses a chance to free not-yet-reused undo pages (Closed)

causes
- MDEV-30863 Server freeze, all threads in trx_assign_rseg_low (Closed)
- MDEV-31234 InnoDB does not free UNDO after the fix of MDEV-30671, thus shared tablespace (ibdata1) may grow indefinitely for no good reason (Closed)

relates to
- MDEV-30753 Possible corruption due to trx_purge_free_segment() (Closed)
- MDEV-31355 innodb_undo_log_truncate=ON fails to wait for purge of enough transaction history (Closed)
- MDEV-11802 innodb.innodb_bug14676111 fails in buildbot due to InnoDB purge failing to start when there is work to do (Closed)
- MDEV-15608 Crash during transaction rollback when using optimistic parallel replication, few threads, non-durable configuration. (Closed)
- MDEV-22718 InnoDB: purge_sys.low_limit_no() is not protected (Stalled)

People

Assignee: Marko Mäkelä
Reporter: Marko Mäkelä
Votes: 1
Watchers: 6

Dates

Created: 2023-02-17 11:21
Updated: 2024-10-21 11:57
Resolved: 2023-02-27 12:15
