8000 more tests for replication consistency by jsteemann · Pull Request #13507 · arangodb/arangodb · GitHub
[go: up one dir, main page]

Skip to content

more tests for replication consistency #13507

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Feb 10, 2021

Conversation

jsteemann
Copy link
Contributor
@jsteemann jsteemann commented Feb 8, 2021

Scope & Purpose

Fixes replication bug which leads to data corruption. In the Merkle sync protocol there is the following problem: When there are large amounts of differences between source and destination, then the revisions in the source are shipped in batches to the destination. In that case, it is possible that a smaller (=earlier) revision is inserted in the destination before a larger (=later) revision with the same primary key is removed. As such this would not be a problem and is ultimately inevitable. However, some performance optimizations lead to the fact that this condition is not detected and so the data structures are corrupted. Specifically, the operations flag ignoreUniqueConstraint so far prevented any unique conflict detection in the primary index or any other unique constraint index. Furthermore, even if the conflict would be detected, another optimization (DisableIndexing) prevented the transaction from seeing its own writes, so the delete of the wrong revision was not detected and would lead to another unique constraint violation.

Finally, the case of multiple unique constraints being violated in this case was not considered. In the process of hunting down these problems some more minor problems (like a buffer overrun in ridBuffers), a bug in RocksDB which leads to an assertion failure and a bug in document count accounting which can lead to an assertion failure were found and are fixed in this PR.

Adds more tests for incremental sync.
Additional tests can be started as follows:

scripts/unittest replication_sync --test malarkey
  • 💩 Bugfix (requires CHANGELOG entry)
  • 🍕 New feature (requires CHANGELOG entry, feature documentation and release notes)
  • 🔥 Performance improvement
  • 🔨 Refactoring/simplification
  • 📖 CHANGELOG entry made

Backports:

Testing & Verification

(Please pick either of the following options)

  • The behavior in this PR was manually tested
  • This PR adds tests that were used to verify all changes:
    • Added new integration tests (e.g. in replication_sync)

Link to Jenkins PR run:
http://172.16.10.101:8080/view/PR/job/arangodb-matrix-pr/13892/

@jsteemann jsteemann added this to the devel milestone Feb 8, 2021
@jsteemann jsteemann requested a review from neunhoef February 8, 2021 01:29
@jsteemann jsteemann marked this pull request as draft February 8, 2021 01:37
@jsteemann jsteemann added the 9 WIP label Feb 8, 2021
Copy link
Member
@neunhoef neunhoef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: LGTM, see three minor comments.

Result res;

// non-unique indexes will not cause any constraint violation
if (_unique) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do:

if (!_unique) {
  return res;
}

to get fewer indentations.

@neunhoef
Copy link
Member
neunhoef commented Feb 9, 2021

One thing is still on my list: There is a possible assertion failure in maintainer mode if a thread tries to add the first document of a collection at the same time and another thread tries to remove it immediately. In this case the remove can "overtake" the insert and move the document count below zero for a brief moment, enough to trigger the assertion. Do we want to address this in this PR, too?

@jsteemann
Copy link
Contributor Author

@jsteemann jsteemann marked this pull request as ready for review February 9, 2021 22:10
@jsteemann
Copy link
Contributor Author

Copy link
Member
@neunhoef neunhoef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

jsteemann added a commit that referenced this pull request Feb 10, 2021
This is a preparation PR for backporting some further replication tests
and bugfixes from devel to 3.7.
It isolates some of the changes done in
#13507 into a small PR to
facilitate reviewing. A further PR with the added replication tests and
a bugfix for 3.7 will follow up.

This PR extends the length of some internal byte buffers from 11 to 21,
so that they cannot overflow, regardless of what input is coming in. The
overflow was theoretically possible before, but only with manipulated
timestamp input.

The PR also fixes an issue in a RocksDB heap helper class that does a
potentially unsafe self-move assignment. This issue was raised when
running tests with a debug STL build (`_GLIBCXX_DEBUG`).
@jsteemann jsteemann removed the 9 WIP label Feb 10, 2021
@jsteemann
Copy link
Contributor Author

Tests blue

@jsteemann jsteemann merged commit b699e89 into devel Feb 10, 2021
@jsteemann jsteemann deleted the experimental/replication-key-tests branch February 10, 2021 01:59
KVS85 added a commit that referenced this pull request Feb 11, 2021
* backport some changes from #13507

This is a preparation PR for backporting some further replication tests
and bugfixes from devel to 3.7.
It isolates some of the changes done in
#13507 into a small PR to
facilitate reviewing. A further PR with the added replication tests and
a bugfix for 3.7 will follow up.

This PR extends the length of some internal byte buffers from 11 to 21,
so that they cannot overflow, regardless of what input is coming in. The
overflow was theoretically possible before, but only with manipulated
timestamp input.

The PR also fixes an issue in a RocksDB heap helper class that does a
potentially unsafe self-move assignment. This issue was raised when
running tests with a debug STL build (`_GLIBCXX_DEBUG`).

* Update CHANGELOG

Co-authored-by: Vadim <vadim@arangodb.com>
elfringham pushed a commit to elfringham/arangodb that referenced this pull request Apr 20, 2021
Co-authored-by: Max Neunhöffer <max@arangodb.com>
Co-authored-by: Max Neunhoeffer <max@arangodb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0