[WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer #52539

jiateoh · 2025-10-08T00:02:32Z

What changes were proposed in this pull request?

Update the TransformWithState StateServer's proto message reading to fully read the desired message using the correct readFully API rather than read which may not fully consume all data. We should instead rely on readFully which will fill up the allocated buffer with required length as per the InputStream API.

Why are the changes needed?

For large state values used in the TransformWithState API, inputStream.read is not guaranteed to read messageLen's bytes of data as per the InputStream API. For large values, read will return prematurely and the messageBytes will only be partially filled, yielding an incorrect and likely unparseable proto message.

This is not a common scenario, as testing also indicated that the actual proto messages had to be fairly large (MB) to consistently trigger this error.

Does this PR introduce any user-facing change?

No

How was this patch tested?

python/run-tests --testnames 'pyspark.sql.tests.pandas.test_pandas_transform_with_state TransformWithStateInPandasTests'
python/run-tests --testnames 'pyspark.sql.tests.pandas.test_pandas_transform_with_state TransformWithStateInPySparkTests'

The configured data size (4MB) was chosen to ensure that the fix is required and that there's a bit of buffer for machine differences:
Below is sample/testing results from using read only and adding a check on message length vs read bytes.

TransformWithStateInPySparkTests
        pyspark.errors.exceptions.base.PySparkRuntimeError: Error updating value state: TESTING: Failed to read message bytes: expected 4194356 bytes, but only read 2617220 bytes
    TransformWithStateInPandasTests
        pyspark.errors.exceptions.base.PySparkRuntimeError: Error updating value state: TESTING: Failed to read message bytes: expected 4194356 bytes, but only read 1856872 bytes

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-5-20250929)

…ateServer Use readFully() instead of read() to ensure the entire protobuf message is read from the input stream. The read() method may only read partial results (experimentally 8KB), which can cause failures when processing large state values. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add test case to verify that map state can handle large string values (10 KB) without issues. This test exercises the code path that requires the readFully() fix to prevent partial reads when processing large state values. The test: - Creates a 10 KB string - Updates map state with the large string value - Verifies the retrieved value matches the original in both length and content - Tests both Pandas and Row-based implementations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Enhance the test to comprehensively test all three state types (value, list, map) with large values instead of just map state: - Renamed MapStateLargeStringProcessor to LargeValueStatefulProcessor to reflect broader coverage - Test now verifies ValueState, ListState, and MapState all handle 10 KB string values correctly - Simplified assertions: processors return retrieved values instead of asserting internally, test compares actual vs expected values - ListState now returns full contents (comma-separated) instead of just count for better verification 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…fix for verification

This reverts commit 2af2eb0.

This reverts commit 898561d.

github-actions bot added SQL STRUCTURED STREAMING PYTHON labels Oct 8, 2025

jiateoh and others added 4 commits October 7, 2025 17:04

Update mapstate output to reflect map contents too

35a7976

jiateoh force-pushed the tws_readFully_fix branch from aee4e8c to 0ad023e Compare October 9, 2025 20:35

jiateoh added 9 commits October 9, 2025 16:34

Increase large values payload size, include debugging checks without …

bec886e

…fix for verification

reapply fix

8c50b4e

Revert "reapply fix"

aee97d9

This reverts commit 2af2eb0.

Reorder class insertion

2f5d3a6

lint reformatting

b305534

Bump from 512KB -> 4MB

ea0723b

testing-only: use read and error when mismatched

86c5007

temp skip tests

f99c25c

Revert "temp skip tests"

fe1a90f

This reverts commit 898561d.

jiateoh force-pushed the tws_readFully_fix branch from 5044b4a to fe1a90f Compare October 10, 2025 00:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer #52539

[WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer #52539

Uh oh!

Uh oh!

[WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer #52539

Are you sure you want to change the base?

[WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer #52539

Conversation

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Uh oh!