[ESM] Improve SQS batch collection and flushing by gregfurman · Pull Request #12002 · localstack/localstack · GitHub

[ESM] Improve SQS batch collection and flushing #12002


Merged · 29 commits merged into master · Feb 26, 2025

Conversation

@gregfurman (Contributor) commented on Dec 9, 2024

Motivation

This PR allows the ESM poller to collect and send batches of SQS messages based on the number of messages collected and the duration of the batching window (up to 300s).

The motivations were as follows:

  1. The feature request lambda event source mapping for non-FIFO SQS queue has invalid max batch size #5042 required LocalStack to support a batch size over 10, which in turn depends on MaximumBatchingWindowInSeconds being larger than 1. While offering CRUD support for this parameter, LocalStack did not actually collect a batch for the specified duration. These functional changes mean that we can now collect records until the window elapses.

  2. Under high volumes of requests, we expect the latency of LocalStack's gateway to degrade. By supporting window batching, we have also introduced the ability for ESM pollers to long-poll an SQS event source for batches of data, which should help alleviate some pressure on the gateway.

Changes

  • Adds long-polling for SQS ReceiveMessage that continuously polls an SQS queue until either the desired number of messages (BatchSize) has been collected or the batching window (MaximumBatchingWindowInSeconds) has elapsed.
  • Once collected, this batch is potentially further split into chunks of up to 2.5K records each, as a stand-in for proper flushing based on Lambda payload quotas (see docs). A rough sketch of the collect-and-flush flow is shown below.
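
A minimal sketch of the collect-and-flush flow described above, assuming a boto3-style SQS client; the helper names (collect_and_flush, flush) and the overall structure are illustrative, not the actual LocalStack implementation:

import time

MAX_RECORDS_PER_CHUNK = 2500  # heuristic chunk size used in this PR

def collect_and_flush(sqs_client, queue_url, batch_size, batch_window_seconds, flush):
    """Collect messages until BatchSize is reached or the batching window elapses, then flush in chunks."""
    batch = []
    deadline = time.monotonic() + batch_window_seconds
    while len(batch) < batch_size and (remaining := deadline - time.monotonic()) > 0:
        # Long-poll: ReceiveMessage returns as soon as messages are available
        # or WaitTimeSeconds elapses, instead of returning immediately on a miss.
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(batch_size - len(batch), 10),  # AWS API caps this at 10
            WaitTimeSeconds=min(int(remaining), 20),  # AWS API caps this at 20
        )
        batch.extend(response.get("Messages", []))

    # Split the collected batch into chunks of up to 2.5K records before invoking the target.
    for start in range(0, len(batch), MAX_RECORDS_PER_CHUNK):
        flush(batch[start : start + MAX_RECORDS_PER_CHUNK])

(The PR additionally uses LocalStack-internal header overrides to allow larger batch sizes and longer wait times than the AWS API permits; the caps in the sketch reflect the plain AWS API.)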

Testing

  • Unskips the test_sqs_event_source_mapping_batching_reserved_concurrency since we can now collect and send batches of more than 10 records at a time.

Benchmarking

Some benchmarks were run to assess how the long-polling changes would affect the latency of the LocalStack gateway. See the benchmarking results here.

tl;dr: Long polling saw fewer requests being made and, in some cases, improved latencies at higher percentiles (P95 and above).

Assumptions

  • A higher load on the gateway means fewer available threads for processing new requests. Therefore, we'd expect short-polling calls with many poll-misses to add unnecessary load to the gateway (due to each processing call requiring a worker) that equivalent long-polling calls should be able to circumvent.
  • We poll an event source at an interval of once per second. Each long-polling call will block a single gateway thread (there are 1K available) while waiting for a request to complete. Therefore, we expect that (in extremis) many ESM pollers all running long-polling calls will cause resource contention and degrade performance.
  • For most cases, however, we assume that long-polling calls should reduce the number of requests made to the gateway within a given duration, therefore improving the availability of workers to process other requests.

Results

  • Performance with and without long polling was fairly similar in terms of latency across all experiments, with differences of 50–300 ms between the two approaches.
  • Large batch sizes coupled with high batch windows saw the largest improvements in performance.
  • Long polling resulted in fewer requests (~7%) being processed within the 5m test interval, likely due to resource contention while long-polling calls were in flight. However, this did not hold when maximising Batch Size (10k) and Batch Window (300s) with long polling, which saw the highest throughput and the best latency improvements.

Conclusions

  • The effects of long polling were sometimes visible at higher latency percentiles, where long-polling runs had lower latency than short-polling ones. Overall, however, no notable difference in performance was observed between long and short polling.
  • Performance was optimized in both cases by using larger batch sizes and longer batch windows, with the combination of long polling + large batches + long windows showing the best overall efficiency.
  • Importantly, while introducing long polling did not dramatically affect performance (positively or negatively), it provides LocalStack with an avenue for further optimisation.

@gregfurman gregfurman self-assigned this Dec 9, 2024
@gregfurman gregfurman added semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases aws:lambda:event-source-mapping AWS Lambda Event Source Mapping (ESM) labels Dec 9, 2024
@gregfurman gregfurman added this to the Playground milestone Dec 9, 2024
github-actions bot commented Dec 9, 2024

LocalStack Community integration with Pro

2 files ±0 · 2 suites ±0 · 1h 30m 45s ⏱️ (−23m 17s)
3 110 tests (−993): 2 890 ✅ (−880) · 220 💤 (−113) · 0 ❌ (±0)
3 112 runs (−993): 2 890 ✅ (−880) · 222 💤 (−113) · 0 ❌ (±0)

Results for commit 23d6cb6. ± Comparison against base commit 6f32581.

This pull request removes 994 tests and adds 1 test. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.lambda_.event_source_mapping.test_lambda_integration_sqs.TestSQSEventSourceMapping ‑ test_sqs_event_source_mapping_batching_window_size_override

♻️ This comment has been updated with latest results.

@gregfurman gregfurman marked this pull request as ready for review December 9, 2024 11:43
@dfangl (Member) left a comment


This already looks quite good, but I still have a concern:

We do not have an upper limit for the sleep times. As per your table, a queue that gets 1 message/s, with a 300s batching window and a batch size of 1, would trigger only once every 150s on average (with no latency), which causes a massive queue buildup. If someone sets a 300s batch window and a burst of messages occurs, we should still process them in a timely manner. I think AWS would do so as well.

Also please add unit descriptions to the table, as it makes it clearer :)

@gregfurman (Contributor, Author) commented

@dfangl Thanks for the review here. I've removed the adaptive backoff since we're looking at providing an override header in the boto3 requests to circumvent the 10-message limit.

Let me know if the (simplified) changes are alright!

@joe4dev (Member) left a comment


The simplified message collection is much clearer 👏 Thank you for these refinements @gregfurman

I'm looking forward to supporting batching windows and batching by size 🙌
However, I'm concerned about flooding the LS gateway within the while loop (see ⚠️). How do you assess the scenario outlined in the comment?

@gregfurman gregfurman requested a review from joe4dev February 12, 2025 08:30
@joe4dev (Member) left a comment


Just sharing some minor documentation suggestions to encode the lessons learned.

The code changes look good to me 👏👏👏 🚀

Two things:

  • Do we have a successful ext run?
  • I wanna look into the performance test results (haven't gotten to it yet ...)

@@ -97,28 +108,47 @@ def event_source(self) -> str:
def poll_events(self) -> None:
# SQS pipe source: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-pipes-sqs.html
# "The 9 Ways an SQS Message can be Deleted": https://lucvandonkersgoed.com/2022/01/20/the-9-ways-an-sqs-message-can-be-deleted/
# TODO: implement batch window expires based on MaximumBatchingWindowInSeconds
# TODO: implement invocation payload size quota
# TODO: consider long-polling vs. short-polling trade-off. AWS uses long-polling:
Member


docs: I would love to see this TODO resolved and replaced with a paragraph explaining the trade-offs and lessons learned. Our future selves thank us 😃

Suggestion:

# We adopted long-polling for the SQS poll operation `ReceiveMessage` for improved performance.
# * PR (2025-02): https://github.com/localstack/localstack/pull/12002
# * ESM blog mentioning long-polling: https://aws.amazon.com/de/blogs/aws/aws-lambda-adds-amazon-simple-queue-service-to-supported-event-sources/
# * Amazon SQS short and long polling: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-short-and-long-polling.html
# + Reduces latency because the `ReceiveMessage` call immediately returns once we reach the desired `BatchSize` or the `WaitTimeSeconds` elapses.
# + Matches the AWS behavior also using long-polling
# - Blocks a LocalStack gateway thread (default 1k) for every open connection, which could lead to resource contention if used at scale.
# * LocalStack shutdown works because the LocalStack gateway shuts down and terminates the open connection.
# * Our LS-internal optimizations using custom headers reduce the load on the LocalStack gateway by allowing for larger batch sizes and longer wait times than the AWS API.

LOG.debug("Polled %d events from %s", len(messages), self.source_arn)

messages = response.get("Messages", [])
LOG.debug("Polled %d events from %s", len(messages), self.source_arn)
Member


It's probably good for debugging to be explicit here. I'm just wondering whether it's intentional to log empty polls as well?

Contributor Author


I was thinking it could be useful for debugging to log explicitly whether nothing was polled from the event source. Perhaps we can distinguish this better with a "Polled no events from %s" -- wdyt?

Member


My main thought is around avoiding log pollution (imagine 100 ESMs printing every second), but it's probably worth keeping for now. For example: it would help to identify whether jitter around the 1s interval is needed 💡 .

The format is fine, being consistent is good 👍

Member


I think we should be very careful about this - many people have DEBUG=1 by default, and this can be a lot. I agree it is not urgent to remove - but especially the message for no events could be removed in the future.


messages = response.get("Messages", [])
LOG.debug("Polled %d events from %s", len(messages), self.source_arn)
# NOTE: Split up a batch into mini-batches of up to 2.5K records each. This is to prevent exceeding the 6MB size-limit
Member


nit: we could move the # TODO: implement invocation payload size quota here, clarifying that's only a heuristic and not a perfect parity implementation
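
For context on why this is only a heuristic: the fixed 2.5K-record chunking approximates the 6 MB synchronous invocation payload quota rather than measuring actual payload size. A size-based variant could look roughly like the sketch below (the chunk_by_payload_size helper is hypothetical and not part of this PR):

import json

MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # Lambda synchronous invocation payload quota

def chunk_by_payload_size(records, max_bytes=MAX_PAYLOAD_BYTES):
    # Yield sub-batches whose serialized size stays under the payload quota.
    batch, size = [], 0
    for record in records:
        record_size = len(json.dumps(record))
        if batch and size + record_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += record_size
    if batch:
        yield batch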

@gregfurman gregfurman requested a review from joe4dev February 12, 2025 15:46
@joe4dev (Member) left a comment


Thank you for the unlimited persistence and going the extra mile @gregfurman to gain more confidence and learn more about potential limitations of this impactful change 🙇

The code changes look good, we have iterated on them and all suggestions are taken into consideration 👏👏👏 .

The benchmark results indicate that this change does not negatively impact the LocalStack gateway latency, but we should keep an 👂 open to potential user feedback in large (>100 ESMs) and high-performance (>100 users) environments.

I love the out-of-the-box thinking around custom headers to go beyond AWS API restrictions to reduce the number of requests on LocalStack while still keeping remote API compatibility 🧠 💯

Strictly speaking, we don't have explicit aws-validated tests for the new MaximumBatchingWindowInSeconds feature, but I think that's fine here given the cost/value trade-off. It would be hard to test against AWS because their internal behavior is somewhat unpredictable (depending on internal performance optimizations) and probably not guaranteed (i.e., max is an upper bound). It would likely require multiple iterations of time-consuming testing.

Good to 🚢 from my side 🚀

nit: Any idea where that weird line mismatch (995 vs 999) comes from in the failing Lambda test?
It's clearly unrelated to these changes.

@dfangl (Member) left a comment


I think this is good to go, let's get it in!
I agree with Joel's final comment wholeheartedly, great work and patience with this PR!


@@ -1617,17 +1608,72 @@ def test_sqs_event_source_mapping_batch_size_override(
cleanups.append(lambda: aws_client.lambda_.delete_event_source_mapping(UUID=mapping_uuid))
_await_event_source_mapping_enabled(aws_client.lambda_, mapping_uuid)

expected_invocations = -(batch_size // -2500) # converts floor division to ceil
Member


Couldn't we just add math.ceil here? If I have to explain what a line does in a comment, it is not really well readable :)

Contributor Author



Lol yeah. Lemme change this now...
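
For reference, the two expressions are equivalent; a quick standalone illustration (not the test code itself):

import math

batch_size, chunk = 10_000, 2_500

# Negated floor division is a common ceiling-division idiom in Python...
assert -(batch_size // -chunk) == 4

# ...but math.ceil reads far more clearly, as suggested above.
assert math.ceil(batch_size / chunk) == 4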

@joe4dev (Member) commented Feb 25, 2025

@gregfurman Do we know why test_sqs_event_source_mapping_batching_reserved_concurrency fails? Is it flaky?
https://app.circleci.com/pipelines/github/localstack/localstack/31294/workflows/85e2e6dd-0f6e-4014-9612-2bbad5b4e266/jobs/278937/tests

@gregfurman (Contributor, Author) commented

@joe4dev Just ran some tests for this locally. There's a chance that the 10s batch window elapses before all 30 records have been processed, so we end up flushing 3 times instead of twice. Making the window larger would mitigate this; alternatively, if we load up the SQS queue before polling begins, the flake should disappear.

@gregfurman gregfurman merged commit e509f9d into master Feb 26, 2025
31 checks passed
@gregfurman gregfurman deleted the fix/esm/batching branch February 26, 2025 09:27
Labels
aws:lambda:event-source-mapping AWS Lambda Event Source Mapping (ESM) semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases