fix: Handle max concurrent queries errors by erezrokah · Pull Request #20907 · cloudquery/cloudquery · GitHub

fix: Handle max concurrent queries errors #20907


Merged · erezrokah merged 8 commits into main from fix/handle_force_migration_false_positives on Jul 2, 2025

Conversation

@erezrokah (Member) commented Jun 24, 2025

Summary

This PR fixes an issue that's more evident on older ClickHouse versions (e.g. 22, which we no longer support), where max_concurrent_queries defaulted to 100. On newer versions the default is 1500.

The result of hitting the limit is that some tables can be dropped unexpectedly: checkPartitionOrOrderByChanged, when called from checkForced here

if err := c.checkForced(ctx, have, want, messages); err != nil {

will not return an error, while the same checkPartitionOrOrderByChanged can return an error (for example, when ClickHouse rejects the query) when called further down the line here

if unsafe := unsafeChanges(changes); len(unsafe) > 0 || c.checkPartitionOrOrderByChanged(ctx, want, c.spec.Partition, c.spec.OrderBy) != nil {

where any non-nil error is treated as a change that requires force migration.

I had to override the max_concurrent_queries configuration to reproduce this, so I'm using the docker compose file for that.
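For reference, one way to lower the limit is to mount a config.d fragment into the ClickHouse container. This is a minimal sketch; the service name, file names, and image tag are assumptions, not the actual files changed in this PR:

```yaml
# docker-compose.yml (sketch): run ClickHouse with a lowered
# max_concurrent_queries so the limit is easy to hit in tests.
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "9000:9000"
    volumes:
      # config.d fragments are merged into the server config at startup
      - ./max_concurrent_queries.xml:/etc/clickhouse-server/config.d/max_concurrent_queries.xml
```

where the mounted max_concurrent_queries.xml contains something like `<clickhouse><max_concurrent_queries>100</max_concurrent_queries></clickhouse>`.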

The solution implemented in this PR has 2 parts:

  1. Save the state from when we check which tables need to be force migrated, so we don't re-run the same queries again.
  2. Retry ClickHouse operations (I tried to create a connection wrapper for this purpose, but that's a bit tricky since the error can surface on connection.Exec, row.Scan, and so on, so I'd basically need to wrap all of ClickHouse's driver interfaces).

@cq-bot (Contributor) commented Jun 24, 2025

@erezrokah erezrokah force-pushed the fix/handle_force_migration_false_positives branch 3 times, most recently from 6778728 to 56298c6 Compare June 30, 2025 11:18
@erezrokah erezrokah marked this pull request as draft July 1, 2025 15:12
@erezrokah erezrokah removed the request for review from maaarcelino July 1, 2025 15:13
@erezrokah erezrokah force-pushed the fix/handle_force_migration_false_positives branch from 56298c6 to 10badf5 Compare July 1, 2025 17:55
@erezrokah erezrokah changed the title from "test: Add test that simulates multiple concurrent syncs" to "fix: Handle max concurrent queries errors" Jul 1, 2025
return nil
}

have := have.Get(want.Name)
if have == nil {
tableName := want.Name
@erezrokah (Member, Author) commented:

This is the part about reusing the table-changes state.

@@ -103,23 +142,21 @@ func unsafeChanges(changes []schema.TableColumnChange) []schema.TableColumnChang
}

func (c *Client) createTable(ctx context.Context, table *schema.Table, partition []spec.PartitionStrategy, orderBy []spec.OrderByStrategy) (err error) {
c.logger.Debug().Str("table", table.Name).Msg("Table doesn't exist, creating")
@erezrokah (Member, Author) commented:

This was confusing: force migration calls dropTable and then createTable, so this log was printed even when the table already existed.

@erezrokah erezrokah marked this pull request as ready for review July 1, 2025 18:10
@marianogappa (Contributor) left a comment

Congrats on finding and fixing this bug 👍

My only concern here is the retry logic.

Obviously, without more context there's no smartness possible in the retry (like we have on InsertSplitter), so we agree it's not going to get much smarter than "try running the same query again later".

However, retrying up to 10 times, 1s +/- .5s later in the case of "Too many simultaneous queries" sounds 💣 🤔 no? I think this will by default increase congestion. My proposed remediation isn't very smart...I'd 10x the delay or something like that.

@erezrokah (Member, Author) commented Jul 1, 2025

> However, retrying up to 10 times, 1s +/- .5s later in the case of "Too many simultaneous queries" sounds 💣 🤔 no? I think this will by default increase congestion. My proposed remediation isn't very smart... I'd 10x the delay or something like that.

The default strategy is backoff + random jitter, but I don't mind increasing the retry delay. Based on my tests with the current values and 2000 concurrent syncs (from the test), we hardly ever retry more than twice.

Comment on lines +29 to +31
retry.Attempts(5),
retry.Delay(3 * time.Second),
retry.MaxJitter(1 * time.Second),
A contributor commented:

Would it make sense to expose these variables in the spec? Not blocking, just thinking out loud.

@erezrokah (Member, Author) replied:

Not sure, actually; I think we can expose them based on user feedback. It could be too much control. The default retry logic is backoff + random jitter, so that should cover quite a lot of cases (especially on newer versions).

@maaarcelino (Contributor) left a comment

LGTM, one minor non-blocking suggestion

@erezrokah erezrokah added the automerge Automatically merge once required checks pass label Jul 2, 2025
@kodiakhq kodiakhq bot merged commit 92c2827 into main Jul 2, 2025
17 checks passed
@kodiakhq kodiakhq bot deleted the fix/handle_force_migration_false_positives branch July 2, 2025 17:02
kodiakhq bot pushed a commit that referenced this pull request Jul 2, 2025
🤖 I have created a release *beep* *boop*
---


## [7.1.1](plugins-destination-clickhouse-v7.1.0...plugins-destination-clickhouse-v7.1.1) (2025-07-02)


### Bug Fixes

* **deps:** Update golang.org/x/exp digest to b7579e2 ([#20935](#20935)) ([aac340d](aac340d))
* **deps:** Update module github.com/cloudquery/codegen to v0.3.29 ([#20947](#20947)) ([af179be](af179be))
* Handle max concurrent queries errors ([#20907](#20907)) ([92c2827](92c2827))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Labels
area/ci area/plugin/destination/clickhouse automerge Automatically merge once required checks pass
5 participants