8000 [WIP][SPARK-XXXXX][PS][SS] Fix partial read bug for large state values in TransformWithStateInPySparkStateServer by jiateoh · Pull Request #52539 · apache/spark · GitHub
[go: up one dir, main page]

Skip to content

Conversation

jiateoh
Copy link
@jiateoh jiateoh commented Oct 8, 2025

What changes were proposed in this pull request?

Update the TransformWithState StateServer's proto message reading to fully read the desired message using the correct readFully API rather than read which may not fully consume all data. We should instead rely on readFully which will fill up the allocated buffer with required length as per the InputStream API.

Why are the changes needed?

For large state values used in the TransformWithState API, inputStream.read is not guaranteed to read messageLen's bytes of data as per the InputStream API. For large values, read will return prematurely and the messageBytes will only be partially filled, yielding an incorrect and likely unparseable proto message.

This is not a common scenario, as testing also indicated that the actual proto messages had to be fairly large (MB) to consistently trigger this error.

Does this PR introduce any user-facing change?

No

How was this patch tested?

python/run-tests --testnames 'pyspark.sql.tests.pandas.test_pandas_transform_with_state TransformWithStateInPandasTests'
python/run-tests --testnames 'pyspark.sql.tests.pandas.test_pandas_transform_with_state TransformWithStateInPySparkTests'

The configured data size (4MB) was chosen to ensure that the fix is required and that there's a bit of buffer for machine differences:
Below is sample/testing results from using read only and adding a check on message length vs read bytes.

TransformWithStateInPySparkTests
        pyspark.errors.exceptions.base.PySparkRuntimeError: Error updating value state: TESTING: Failed to read message bytes: expected 4194356 bytes, but only read 2617220 bytes
    TransformWithStateInPandasTests
        pyspark.errors.exceptions.base.PySparkRuntimeError: Error updating value state: TESTING: Failed to read message bytes: expected 4194356 bytes, but only read 1856872 bytes

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-sonnet-4-5-20250929)

jiateoh and others added 4 commits October 7, 2025 17:04
…ateServer

Use readFully() instead of read() to ensure the entire protobuf message
is read from the input stream. The read() method may only read partial
results (experimentally 8KB), which can cause failures when processing
large state values.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add test case to verify that map state can handle large string values
(10 KB) without issues. This test exercises the code path that requires
the readFully() fix to prevent partial reads when processing large
state values.

The test:
- Creates a 10 KB string
- Updates map state with the large string value
- Verifies the retrieved value matches the original in both length and content
- Tests both Pandas and Row-based implementations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhance the test to comprehensively test all three state types (value,
list, map) with large values instead of just map state:

- Renamed MapStateLargeStringProcessor to LargeValueStatefulProcessor
  to reflect broader coverage
- Test now verifies ValueState, ListState, and MapState all handle
  10 KB string values correctly
- Simplified assertions: processors return retrieved values instead of
  asserting internally, test compares actual vs expected values
- ListState now returns full contents (comma-separated) instead of
  just count for better verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jiateoh jiateoh force-pushed the tws_readFully_fix branch from aee4e8c to 0ad023e Compare October 9, 2025 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant
0