[XLA:CPU] Optimize XTile bufferization. by copybara-service[bot] · Pull Request #112153 · tensorflow/tensorflow

[XLA:CPU] Optimize XTile bufferization. #112153

Draft
copybara-service[bot] wants to merge 1 commit into master from exported_pr_881452648

[XLA:CPU] Optimize XTile bufferization.

- Implement static full-tile detection to avoid runtime bounds checks.
- Refactor InsertTileOp bufferization to enable compute buffer elision by the one-shot bufferizer.
- Allow unit-strided subviews with dynamic offsets to prevent forced allocations.
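- Up to ~5x fewer instructions (-80.33% on BM_AddF32/32768) and up to ~47% lower process time (BM_ConvertF32ToBF16/128) on elementwise benchmarks: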

```
name                                     base time/op   new time/op   vs base
BM_AddF32/128/process_time                21.22µ ± 17%   14.57µ ± 25%  -31.35% (p=0.000 n=40)
BM_AddF32/256/process_time                36.68µ ± 11%   27.67µ ± 19%  -24.57% (p=0.000 n=40)
BM_AddF32/512/process_time                43.26µ ± 17%   44.67µ ± 42%        ~ (p=0.146 n=40)
BM_AddF32/1024/process_time               70.37µ ± 31%   57.72µ ± 27%  -17.97% (p=0.003 n=40)
BM_AddF32/8192/process_time               559.1µ ±  6%   524.3µ ±  5%   -6.24% (p=0.021 n=40)
BM_AddF32/16384/process_time              1.385m ±  7%   1.340m ±  5%        ~ (p=0.084 n=40)
BM_AddF32/32768/process_time              2.930m ±  8%   2.834m ±  5%   -3.30% (p=0.006 n=40)
BM_AddBF16/128/process_time               27.25µ ±  4%   18.83µ ±  1%  -30.89% (p=0.000 n=40)
BM_AddBF16/256/process_time               39.97µ ±  8%   30.62µ ±  6%  -23.40% (p=0.000 n=40)
BM_AddBF16/512/process_time               49.63µ ±  9%   37.54µ ±  8%  -24.36% (p=0.000 n=40)
BM_AddBF16/1024/process_time              65.73µ ± 17%   50.87µ ± 10%  -22.61% (p=0.000 n=40)
BM_AddBF16/8192/process_time              352.2µ ±  3%   259.8µ ±  5%  -26.23% (p=0.000 n=40)
BM_AddBF16/16384/process_time             730.0µ ±  5%   587.6µ ±  4%  -19.51% (p=0.000 n=40)
BM_AddBF16/32768/process_time             1.565m ±  3%   1.430m ±  4%   -8.58% (p=0.000 n=40)
BM_ConvertF32ToBF16/128/process_time      22.46µ ±  1%   11.83µ ±  3%  -47.35% (p=0.000 n=40)
BM_ConvertF32ToBF16/256/process_time      33.77µ ± 27%   21.84µ ± 24%  -35.31% (p=0.000 n=40)
BM_ConvertF32ToBF16/512/process_time      42.03µ ± 41%   28.66µ ± 10%  -31.81% (p=0.000 n=40)
BM_ConvertF32ToBF16/1024/process_time     62.92µ ± 68%   41.27µ ± 58%  -34.41% (p=0.000 n=40)
BM_ConvertF32ToBF16/8192/process_time     294.6µ ±  5%   221.5µ ±  8%  -24.81% (p=0.000 n=40)
BM_ConvertF32ToBF16/16384/process_time    645.2µ ±  5%   572.3µ ±  8%  -11.30% (p=0.000 n=40)
BM_ConvertF32ToBF16/32768/process_time    1.457m ±  5%   1.381m ±  4%   -5.22% (p=0.002 n=40)
geomean                                  148.9µ         116.9µ        -21.48%

name                                     base INSTR/op    new INSTR/op   vs base
BM_AddF32/128/process_time                238.5k ± 5%      119.5k ± 10%  -49.89% (p=0.000 n=40)
BM_AddF32/256/process_time                535.0k ± 1%      236.4k ±  7%  -55.82% (p=0.000 n=40)
BM_AddF32/512/process_time                847.5k ± 3%      367.5k ±  3%  -56.64% (p=0.000 n=40)
BM_AddF32/1024/process_time              1507.4k ± 2%      507.8k ±  4%  -66.31% (p=0.000 n=40)
BM_AddF32/8192/process_time              10.249M ± 1%      2.278M ±  2%  -77.77% (p=0.000 n=40)
BM_AddF32/16384/process_time             20.021M ± 0%      4.162M ±  1%  -79.21% (p=0.000 n=40)
BM_AddF32/32768/process_time             39.466M ± 0%      7.762M ±  0%  -80.33% (p=0.000 n=40)
BM_AddBF16/128/process_time               193.2k ± 0%      131.7k ±  0%  -31.86% (p=0.000 n=40)
BM_AddBF16/256/process_time               464.7k ± 3%      334.9k ±  2%  -27.94% (p=0.000 n=40)
BM_AddBF16/512/process_time               975.0k ± 4%      720.4k ±  1%  -26.11% (p=0.000 n=40)
BM_AddBF16/1024/process_time              1.699M ± 3%      1.213M ±  2%  -28.62% (p=0.000 n=40)
BM_AddBF16/8192/process_time             12.083M ± 0%      8.109M ±  0%  -32.89% (p=0.000 n=40)
BM_AddBF16/16384/process_time             23.80M ± 0%      16.05M ±  0%  -32.57% (p=0.000 n=40)
BM_AddBF16/32768/process_time             47.20M ± 0%      31.58M ±  0%  -33.09% (p=0.000 n=40)
BM_ConvertF32ToBF16/128/process_time     154.67k ± 0%      96.94k ±  0%  -37.32% (p=0.000 n=40)
BM_ConvertF32ToBF16/256/process_time      382.0k ± 5%      264.4k ±  3%  -30.79% (p=0.000 n=40)
BM_ConvertF32ToBF16/512/process_time      811.9k ± 4%      554.7k ±  1%  -31.68% (p=0.000 n=40)
BM_ConvertF32ToBF16/1024/process_time    1428.8k ± 2%      939.3k ±  4%  -34.26% (p=0.000 n=40)
BM_ConvertF32ToBF16/8192/process_time     9.736M ± 1%      6.024M ±  0%  -38.13% (p=0.000 n=40)
BM_ConvertF32ToBF16/16384/process_time    19.17M ± 0%      11.89M ±  1%  -37.97% (p=0.000 n=40)
BM_ConvertF32ToBF16/32768/process_time    37.81M ± 0%      23.15M ±  0%  -38.77% (p=0.000 n=40)
geomean                                  2.715M           1.410M        -48.07%
```
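
For readers who want to relate the "vs base" column to the "~5x" figure, here is a small sketch of the arithmetic. The helper names `percent_delta` and `ratio` are hypothetical and are not part of benchstat or XLA. The inputs are copied from the tables above. benchstat derives its deltas from unrounded medians, so values recomputed from the printed (rounded) numbers can differ in the last digit.

```python
def percent_delta(base: float, new: float) -> float:
    """Relative change from base to new, in percent (negative means lower)."""
    return (new - base) / base * 100.0


def ratio(base: float, new: float) -> float:
    """How many times smaller the new value is compared to the base value."""
    return base / new


# BM_AddF32/32768, INSTRUCTIONS/op: 39.466M -> 7.762M (values from the table).
instr_delta = percent_delta(39.466, 7.762)
instr_ratio = ratio(39.466, 7.762)

# BM_ConvertF32ToBF16/128, time/op: 22.46µ -> 11.83µ (values from the table).
time_delta = percent_delta(22.46, 11.83)
time_ratio = ratio(22.46, 11.83)

print(f"instructions: {instr_delta:.2f}% ({instr_ratio:.2f}x fewer)")
print(f"time:         {time_delta:.2f}% ({time_ratio:.2f}x faster)")
```

Read this way, the "~5x" figure corresponds to the largest instruction-count reduction, while the largest time reduction in the table is a little under 2x.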

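As an illustration of the first bullet (static full-tile detection), the following is a minimal sketch and not the XLA implementation. It assumes that tiles start at multiples of the tile size, so a dimension contains only full tiles exactly when its extent is divisible by the tile size. The names `all_tiles_full` and `load_tile` are hypothetical.

```python
from typing import List, Sequence


def all_tiles_full(shape: Sequence[int], tile_shape: Sequence[int]) -> bool:
    """Return True if every tile of `tile_shape` lies fully inside `shape`.

    Assumption: tiles start at multiples of the tile size, so a dimension has
    only full tiles exactly when its extent is divisible by the tile size.
    """
    if len(shape) != len(tile_shape):
        raise ValueError("shape and tile_shape must have the same rank")
    if any(t <= 0 for t in tile_shape):
        raise ValueError("tile sizes must be positive")
    return all(dim % tile == 0 for dim, tile in zip(shape, tile_shape))


def load_tile(data: List[int], offset: int, tile: int,
              statically_full: bool) -> List[int]:
    """Load `tile` elements of a 1-D buffer starting at `offset`."""
    if statically_full:
        # Fast path: the tile is known to be in bounds, so no check is needed.
        return data[offset:offset + tile]
    # Slow path: clamp at the buffer boundary and pad the remainder with zeros.
    valid = max(0, min(tile, len(data) - offset))
    return data[offset:offset + valid] + [0] * (tile - valid)


# The benchmark sizes above are powers of two, so a tile size of 8 divides them.
print(all_tiles_full([1024], [8]))          # every tile is full
print(all_tiles_full([1000, 64], [16, 8]))  # 1000 is not divisible by 16
```

The point of deciding this from static shapes is that the boundary handling in the slow path never has to be emitted for the full-tile case.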

PiperOrigin-RevId: 881452648