Drop interleave placement in QKV matrix #1013
base: main
This might actually fix my OLMo PR in #927 😅
You might want to merge OLMo with an interleaving conversion step, because this PR is very risky and a breaking change for all existing checkpoints.
@Andrei-Aksionov We need to evaluate whether we want to make this change, especially whether there are any performance differences and whether the risk is worth it. But there are two important considerations:
These are all good questions.
That's a good question, and the plan is to run some performance benchmarks once the PR is ready. Overall, the performance should be in the same ballpark, but I'll verify it with benchmarks to be on the safe side. Additionally, the current code caches keys and values after they are expanded, which is not very efficient.
Yep, agree.
I think I can make it by the end of Friday. 🤞
One recommendation, if we want to merge it at some point in the future.
In my opinion, this variant looks better:
Looks like a win-win situation.
The name is directly inherited from https://github.com/karpathy/nanoGPT/blob/master/model.py#L35. We took the liberty of dropping the convolutional past "c_".
I've just called Andrej, and he doesn't mind if we rename it to
Hope he approves the PR then
Hi there 👋
This PR changes the placement of `query`, `key` and `value` weights in the combined QKV matrix.
Intuitively one might assume that weights in the QKV matrix are combined in a sequential order $[Q, Q, ..., K, K, ..., V, V, ...]$, but in fact they are currently placed in an interleaved one $[Q, K, V, Q, K, V, ..., Q, K, V]$.
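To make the two placements concrete, here is a minimal pure-Python sketch that labels per-head blocks instead of moving real weights. The function names are hypothetical and not taken from the repository. For grouped-query attention it assumes that each group stores its query heads followed by one key and one value block; for multi-head attention (one query head per group) this reduces to the $[Q, K, V, Q, K, V, ...]$ order described above.

```python
def interleaved_layout(n_head, n_query_groups):
    """Block labels in the interleaved order (assumed per-group: queries, key, value)."""
    q_per_kv = n_head // n_query_groups
    blocks = []
    for g in range(n_query_groups):
        blocks += [f"Q{g * q_per_kv + i}" for i in range(q_per_kv)]
        blocks += [f"K{g}", f"V{g}"]
    return blocks


def sequential_layout(n_head, n_query_groups):
    """Block labels in the sequential order: all queries, then all keys, then all values."""
    return (
        [f"Q{h}" for h in range(n_head)]
        + [f"K{g}" for g in range(n_query_groups)]
        + [f"V{g}" for g in range(n_query_groups)]
    )


def interleaved_to_sequential(blocks, n_head, n_query_groups):
    """Reorder interleaved blocks into the sequential order."""
    q_per_kv = n_head // n_query_groups
    stride = q_per_kv + 2  # query heads of one group, plus one key and one value
    q, k, v = [], [], []
    for g in range(n_query_groups):
        group = blocks[g * stride:(g + 1) * stride]
        q += group[:q_per_kv]
        k.append(group[q_per_kv])
        v.append(group[q_per_kv + 1])
    return q + k + v


if __name__ == "__main__":
    print(interleaved_layout(2, 2))  # multi-head attention: one query head per group
    print(sequential_layout(2, 2))
```

A conversion step for existing checkpoints would apply the same reordering to the corresponding row slices of the real weight matrix, where each block spans `head_size` rows.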
I believe this placement was introduced by models that used GPTNeoX MLP layers (Pythia and Falcon), but none of the latest models use such interleaved placement.
That means that:
Benchmarks
Benchmarks were done on a `1xA10G` with the code from the main and this PR's branches.
Train was done `5 times` for each model/branch and the results were averaged. Args: 1 epoch, epoch size is 1k. Other args were left at their defaults.
Generate was done `100 times` and Generate with max_new_tokens=1950 - `10 times`.
Models: two variants with 1 billion parameters, simply to speed up benchmarks.
Pythia represents a model with multi-head attention, while TinyLlama represents one with grouped-query attention. For multi-query attention there should not be any significant difference.
In training mode the PR version is slightly faster, but for TinyLlama VRAM consumption was slightly larger, oddly enough. I tried to do `unsqueeze(...).expand(...).reshape(...)` but the VRAM consumption stayed the same.
In inference mode the PR version is again faster, but the biggest difference is in VRAM consumption, since we don't need to cache KV after the expansion, like it's done in the current main.
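The inference-time VRAM difference can be estimated with simple arithmetic. The sketch below compares a KV cache that stores keys and values after expansion to `n_head` heads with one that stores only `n_query_groups` heads. The helper name and every configuration value are illustrative assumptions, not numbers from the benchmarks above.

```python
def kv_cache_bytes(batch_size, seq_len, n_kv_heads, head_size, n_layer, bytes_per_elem=2):
    """Bytes needed to cache keys and values for all layers (factor 2 = K and V)."""
    return 2 * batch_size * n_kv_heads * seq_len * head_size * n_layer * bytes_per_elem


# Assumed, illustrative grouped-query attention configuration (not measured values).
n_head, n_query_groups = 32, 4
common = dict(batch_size=1, seq_len=2048, head_size=64, n_layer=22)

expanded = kv_cache_bytes(n_kv_heads=n_head, **common)         # cache after expansion
compact = kv_cache_bytes(n_kv_heads=n_query_groups, **common)  # cache before expansion

if __name__ == "__main__":
    print(f"expanded: {expanded / 2**20:.0f} MiB, compact: {compact / 2**20:.0f} MiB")
    print(f"ratio: {expanded // compact}x")
```

Under these assumptions the saving scales with `n_head / n_query_groups`, so multi-head attention models (where the two are equal) would see no cache-size change from this effect.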