DTensor support for fused qkv matmul #140069
Comments
Depends on DTensor Strided Sharding:
cc: @XilunWu
@HDCharles sorry for catching this issue late. As @ad8e said, this can be done via the Strided Sharding feature in DTensor. I'll try the TP code you shared in DTensor and see if there's any implementation gap. Thanks for submitting the feature request!
@XilunWu Hey, wondering if there's any progress on this feature? Thank you!
Hi @yzhangcs, sorry, we currently haven't made a plan to support this, but I'll notify you if we have more updates.
@yzhangcs For fused QKV, indeed you might need strided sharding as a way to express things. It is likely that for your use case, some op support related to …
Sorry, I think this might be because I mentioned earlier that strided sharding could support fused QKV, which might have led the discussion toward strided sharding. After looking into this more, I think strided sharding and fused QKV sharding are two orthogonal problems. Strided sharding only describes the sharding order when sharding one tensor on multiple device mesh dimensions. Fused QKV sharding is different: it tries to shard a combined linear (one big tensor) as if it sharded three linears separately on one device mesh dimension. So essentially it shards three tensors into one big sharded tensor. This is not a regular sharded tensor, but one might want to treat it as a regular sharded tensor during runtime. Moreover, the three tensors might have different shapes, which need to be tracked manually. Given this is not a regular sharded tensor, we should not try to force it into a regular one during sharding initialization; instead we should treat it faithfully as sharding three tensors into one big sharded tensor, and just treat it as a single big DTensor during runtime. We can easily implement the fused QKV feature in the TP layer, i.e. something like the sketch below should work:
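A minimal sketch of that idea using only existing DTensor APIs (`DTensor.from_local`, `Shard`); the helper name `shard_fused_qkv_colwise`, the split sizes, and the module names in the usage comment are illustrative assumptions, not the actual TP-layer API:

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard


def shard_fused_qkv_colwise(linear: nn.Linear, q_dim: int, kv_dim: int, mesh) -> None:
    """Shard q/k/v separately along the output dim, then re-fuse the local
    shards into one DTensor so downstream ops see a single sharded weight.
    (Bias handling omitted for brevity.)"""
    rank = mesh.get_local_rank()
    world = mesh.size()
    # Split the fused weight back into its three logical pieces.
    q, k, v = linear.weight.detach().split([q_dim, kv_dim, kv_dim], dim=0)
    # Take this rank's row slice of each projection independently, so the
    # local fused weight is [q_r; k_r; v_r] rather than an arbitrary slice
    # of the naively concatenated weight.
    local = torch.cat(
        [t.chunk(world, dim=0)[rank] for t in (q, k, v)], dim=0
    ).contiguous()
    # Treat the re-fused local shard as one big Shard(0) DTensor at runtime.
    linear.weight = nn.Parameter(DTensor.from_local(local, mesh, [Shard(0)], run_check=False))


# Illustrative usage:
# mesh = init_device_mesh("cuda", (world_size,))
# shard_fused_qkv_colwise(model.attn.wqkv, q_dim=n_head * head_dim,
#                         kv_dim=n_kv_head * head_dim, mesh=mesh)
```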
cc @HDCharles @ad8e
My understanding of the proposed scheme is: the most representative shape of …
@ad8e Yep, that is right! Fused QKV sharding really needs to stay aligned between the pretraining and finetune/inference stages (normal sharding does not), i.e. if pretraining uses fused QKV sharding, then finetune/inference requires no change. But if pretraining uses separate QKV sharding, there needs to be surgery to convert the sharding, since fused QKV treats the whole QKV as one weight while the non-fused version keeps three separate weights. From a checkpointable-state perspective they are different.
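To make that "surgery" concrete, here is a hedged sketch of converting a checkpoint with separate q/k/v projections into the fused layout; the state-dict key names (`attn.wq.weight`, etc.) are hypothetical placeholders:

```python
import torch


def fuse_qkv_state_dict(sd: dict) -> dict:
    """Convert separate q/k/v projection weights into one fused weight.
    Key names here are hypothetical; adapt them to the actual model."""
    out = dict(sd)
    wq = out.pop("attn.wq.weight")
    wk = out.pop("attn.wk.weight")
    wv = out.pop("attn.wv.weight")
    # The fused layout stacks the three projections along the output dim,
    # matching a single nn.Linear that produces [q; k; v].
    out["attn.wqkv.weight"] = torch.cat([wq, wk, wv], dim=0)
    return out
```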
This PR adds fused QKV sharding in the TP layer. There should be no "strided" sharding involved, as a fused QKV linear layer is more about combining three layers into one. See design and discussion in #140069 (comment). Resolves #140069.
🚀 The feature, motivation and pitch
For the transformer architecture (for example https://github.com/pytorch-labs/gpt-fast/blob/main/model.py#L195-L211) it tends to be most performant to merge the q, k, and v matrices together. If you try to shard this concatenated tensor, the subsequent SDPA op won't be sharded correctly, since you need each column of q sharded together with the corresponding columns of k and v ([q1, k1, v1, ...]), but by default the sharding will be [q1, q2, q3, ...]. When not using DTensor this is relatively easy to get to work: https://github.com/pytorch-labs/gpt-fast/blob/main/tp.py#L73
But for DTensor the way to enable this is really unclear. Is there a way to handle this type of operation with DTensor parallelization, or should we just stick to normal tensor parallel support and figure out how to get it to work with our APIs?
This is currently blocking tensor parallel support in torchAO, so I wanted to centralize the discussion in a single location. A small illustration of the alignment issue is sketched below.
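To make the column-alignment issue described above concrete, here is a small self-contained illustration with made-up toy shapes (this is not code from gpt-fast); it contrasts a naive contiguous slice of the fused weight with the per-projection slices each rank actually needs:

```python
import torch

dim, world = 8, 2  # toy sizes

# Build toy q/k/v weights and the fused [q; k; v] weight.
q = torch.arange(dim * dim, dtype=torch.float32).reshape(dim, dim)
k = q + 1000
v = q + 2000
fused = torch.cat([q, k, v], dim=0)  # shape [3 * dim, dim]

# Naive dim-0 sharding: rank 0 gets fused[:12] == all of q plus half of k,
# so its local output no longer holds matching q/k/v slices for its heads.
naive_rank0 = fused.chunk(world, dim=0)[0]

# What rank 0 actually needs: the first half of each of q, k, and v,
# re-concatenated locally -- i.e. [q_0; k_0; v_0].
correct_rank0 = torch.cat([t.chunk(world, dim=0)[0] for t in (q, k, v)], dim=0)
```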
Alternatives
Don't use DTensor for tensor parallel.
Additional context
No response
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @tianyu-l @XilunWu