[XLA:CPU] Optimize XTile bufferization. #112153
Draft
copybara-service[bot] wants to merge 1 commit into master
- Implement static full-tile detection to avoid runtime bounds checks.
- Refactor InsertTileOp bufferization to enable compute buffer elision by the one-shot bufferizer.
- Allow unit-strided subviews with dynamic offsets to prevent forced allocations.
- Up to ~5x fewer instructions per op and up to ~47% lower time per op on elementwise benchmarks:

```
name                                    time/op       time/op       vs base
BM_AddF32/128/process_time              21.22µ ± 17%  14.57µ ± 25%  -31.35% (p=0.000 n=40)
BM_AddF32/256/process_time              36.68µ ± 11%  27.67µ ± 19%  -24.57% (p=0.000 n=40)
BM_AddF32/512/process_time              43.26µ ± 17%  44.67µ ± 42%  ~ (p=0.146 n=40)
BM_AddF32/1024/process_time             70.37µ ± 31%  57.72µ ± 27%  -17.97% (p=0.003 n=40)
BM_AddF32/8192/process_time             559.1µ ± 6%   524.3µ ± 5%   -6.24% (p=0.021 n=40)
BM_AddF32/16384/process_time            1.385m ± 7%   1.340m ± 5%   ~ (p=0.084 n=40)
BM_AddF32/32768/process_time            2.930m ± 8%   2.834m ± 5%   -3.30% (p=0.006 n=40)
BM_AddBF16/128/process_time             27.25µ ± 4%   18.83µ ± 1%   -30.89% (p=0.000 n=40)
BM_AddBF16/256/process_time             39.97µ ± 8%   30.62µ ± 6%   -23.40% (p=0.000 n=40)
BM_AddBF16/512/process_time             49.63µ ± 9%   37.54µ ± 8%   -24.36% (p=0.000 n=40)
BM_AddBF16/1024/process_time            65.73µ ± 17%  50.87µ ± 10%  -22.61% (p=0.000 n=40)
BM_AddBF16/8192/process_time            352.2µ ± 3%   259.8µ ± 5%   -26.23% (p=0.000 n=40)
BM_AddBF16/16384/process_time           730.0µ ± 5%   587.6µ ± 4%   -19.51% (p=0.000 n=40)
BM_AddBF16/32768/process_time           1.565m ± 3%   1.430m ± 4%   -8.58% (p=0.000 n=40)
BM_ConvertF32ToBF16/128/process_time    22.46µ ± 1%   11.83µ ± 3%   -47.35% (p=0.000 n=40)
BM_ConvertF32ToBF16/256/process_time    33.77µ ± 27%  21.84µ ± 24%  -35.31% (p=0.000 n=40)
BM_ConvertF32ToBF16/512/process_time    42.03µ ± 41%  28.66µ ± 10%  -31.81% (p=0.000 n=40)
BM_ConvertF32ToBF16/1024/process_time   62.92µ ± 68%  41.27µ ± 58%  -34.41% (p=0.000 n=40)
BM_ConvertF32ToBF16/8192/process_time   294.6µ ± 5%   221.5µ ± 8%   -24.81% (p=0.000 n=40)
BM_ConvertF32ToBF16/16384/process_time  645.2µ ± 5%   572.3µ ± 8%   -11.30% (p=0.000 n=40)
BM_ConvertF32ToBF16/32768/process_time  1.457m ± 5%   1.381m ± 4%   -5.22% (p=0.002 n=40)
geomean                                 148.9µ        116.9µ        -21.48%

name                                    INSTRUCTIONS/op  INSTRUCTIONS/op  vs base
BM_AddF32/128/process_time              238.5k ± 5%      119.5k ± 10%     -49.89% (p=0.000 n=40)
BM_AddF32/256/process_time              535.0k ± 1%      236.4k ± 7%      -55.82% (p=0.000 n=40)
BM_AddF32/512/process_time              847.5k ± 3%      367.5k ± 3%      -56.64% (p=0.000 n=40)
BM_AddF32/1024/process_time             1507.4k ± 2%     507.8k ± 4%      -66.31% (p=0.000 n=40)
BM_AddF32/8192/process_time             10.249M ± 1%     2.278M ± 2%      -77.77% (p=0.000 n=40)
BM_AddF32/16384/process_time            20.021M ± 0%     4.162M ± 1%      -79.21% (p=0.000 n=40)
BM_AddF32/32768/process_time            39.466M ± 0%     7.762M ± 0%      -80.33% (p=0.000 n=40)
BM_AddBF16/128/process_time             193.2k ± 0%      131.7k ± 0%      -31.86% (p=0.000 n=40)
BM_AddBF16/256/process_time             464.7k ± 3%      334.9k ± 2%      -27.94% (p=0.000 n=40)
BM_AddBF16/512/process_time             975.0k ± 4%      720.4k ± 1%      -26.11% (p=0.000 n=40)
BM_AddBF16/1024/process_time            1.699M ± 3%      1.213M ± 2%      -28.62% (p=0.000 n=40)
BM_AddBF16/8192/process_time            12.083M ± 0%     8.109M ± 0%      -32.89% (p=0.000 n=40)
BM_AddBF16/16384/process_time           23.80M ± 0%      16.05M ± 0%      -32.57% (p=0.000 n=40)
BM_AddBF16/32768/process_time           47.20M ± 0%      31.58M ± 0%      -33.09% (p=0.000 n=40)
BM_ConvertF32ToBF16/128/process_time    154.67k ± 0%     96.94k ± 0%      -37.32% (p=0.000 n=40)
BM_ConvertF32ToBF16/256/process_time    382.0k ± 5%      264.4k ± 3%      -30.79% (p=0.000 n=40)
BM_ConvertF32ToBF16/512/process_time    811.9k ± 4%      554.7k ± 1%      -31.68% (p=0.000 n=40)
BM_ConvertF32ToBF16/1024/process_time   1428.8k ± 2%     939.3k ± 4%      -34.26% (p=0.000 n=40)
BM_ConvertF32ToBF16/8192/process_time   9.736M ± 1%      6.024M ± 0%      -38.13% (p=0.000 n=40)
BM_ConvertF32ToBF16/16384/process_time  19.17M ± 0%      11.89M ± 1%      -37.97% (p=0.000 n=40)
BM_ConvertF32ToBF16/32768/process_time  37.81M ± 0%      23.15M ± 0%      -38.77% (p=0.000 n=40)
geomean                                 2.715M           1.410M           -48.07%
```

PiperOrigin-RevId: 881452648
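As a rough illustration of the static full-tile detection idea in the description, the sketch below checks at compile time whether every tile of a tensor is guaranteed to be full, so that per-tile bounds checks (masking) could be skipped. This is not the code from this change: `AllTilesAreFull` and `kDynamic` are hypothetical names, and the sketch assumes tiles start at offsets that are multiples of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical marker for a dimension whose size is only known at run time.
constexpr int64_t kDynamic = -1;

// Returns true only if every tile is provably full, i.e. each dimension is
// static and evenly divisible by the tile size in that dimension.
// Assumption: tile offsets are always tile_index * tile_size.
// A false result means "unknown", so the caller keeps the runtime check.
bool AllTilesAreFull(const std::vector<int64_t>& shape,
                     const std::vector<int64_t>& tile_shape) {
  assert(shape.size() == tile_shape.size());
  for (std::size_t d = 0; d < shape.size(); ++d) {
    assert(tile_shape[d] > 0);
    // A dynamic dimension cannot be proven full at compile time.
    if (shape[d] == kDynamic) return false;
    // A remainder means the last tile in this dimension is partial.
    if (shape[d] % tile_shape[d] != 0) return false;
  }
  return true;
}
```

For example, a 1024x256 tensor with 8x64 tiles passes the check, while a 1000x256 tensor with 16x64 tiles does not, because 1000 is not a multiple of 16 and the last row of tiles would be partial.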