vulkan: readd GGML_VULKAN_PERF #13761
The changes look reasonable to me. But is there a reason to prefer this over test-backend-ops?
This basically summarizes what's happening inside a real model, whereas test-backend-ops is more for checking specific ops. I get to see the exact ops that are run along with the matrix sizes and all that. It's also an easy way to make sure that most of the time is spent in mat mul or mat vec rather than doing something else.
I think the overhead of submitting each node will affect the results, particularly for small nodes. Timestamp queries should be able to give more precise results. Can you try this branch? master...jeffbolznv:llama.cpp:query_pool. (I also noticed the units were wrong while doing this.)
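For context, the per-node bracketing that timestamp queries enable can be sketched like this. The `node_timing` struct and `node_duration_ticks` helper are hypothetical names for illustration; in the actual branch the two tick values would be written with `vkCmdWriteTimestamp` around each node's dispatch and read back with `vkGetQueryPoolResults`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of the two query-pool slots bracketing one graph node.
// In real code these values would come from vkGetQueryPoolResults; here they
// are plain numbers so the accounting can be shown in isolation.
struct node_timing {
    const char* op_name;   // e.g. "MUL_MAT"
    uint64_t begin_ticks;  // timestamp written before the node's dispatch
    uint64_t end_ticks;    // timestamp written after the node's dispatch
};

// Per-node duration in raw GPU ticks (still needs scaling by timestampPeriod
// to become a wall-clock time).
uint64_t node_duration_ticks(const node_timing& t) {
    return t.end_ticks - t.begin_ticks;
}
```

The point of this scheme is that the GPU records the timestamps itself, so the measurement doesn't require submitting one op at a time.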
I basically tested my PR by running test-backend-ops and comparing the numbers. So for mul_mat it looks something like this:
I'm not sure why it shows a 2x at the end, but the seconds per run are pretty close. For short prompts or text generation it definitely runs way slower than it should, and as you said it's probably due to the fact that we're submitting one op at a time. With your branch everything runs at its normal speed but the numbers are way off. I would be ecstatic if my GPU could do a 4096x512x14336 matrix in 618 us, but sadly that's not the case 😞
Strange, I get accurate-looking numbers from the timestamps. Can you check if it's any better with eComputeShader or eBottomOfPipe? It sounds like a driver bug...
Nope, nothing changed when I replaced eAllCommands with those.
It's worth mentioning that my Intel integrated chip also has those really low timing numbers, so if it's a bug it's a pretty widespread one. Then again, this might be a Mesa thing.
Oh, I forgot to multiply by the timestampPeriod (which is 1.0 on NVIDIA), maybe that explains it. Can you try the latest again?
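A minimal sketch of the fix being described, assuming the raw query results are in ticks: `timestampPeriod` comes from `VkPhysicalDeviceLimits` and gives nanoseconds per tick, so skipping the multiply only happens to give correct numbers on hardware where the period is 1.0 (as on NVIDIA). The function name here is made up:

```cpp
#include <cassert>
#include <cstdint>

// Convert raw timestamp ticks to nanoseconds. The timestamp_period value is
// VkPhysicalDeviceLimits::timestampPeriod (nanoseconds per tick); forgetting
// this multiply makes the numbers look wrong on devices where it isn't 1.0.
double ticks_to_ns(uint64_t ticks, float timestamp_period) {
    return static_cast<double>(ticks) * timestamp_period;
}
```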
Sorry, I failed to push the branch when I commented a half hour ago, it's updated now.
Yeah, that was it; timestampPeriod isn't 1.0 on my hardware. Interestingly, my "crude" method of measuring still gives the correct general direction, at least for MUL_MAT(_VEC). For smaller ops it overestimates a lot due to submission/synchronization overhead. Good idea to use timestamp queries, I hadn't even heard of them. The only thing you could improve is to leave out noops; they don't provide any useful data (always something close to 1 us).
I didn't avoid inserting timestamps for the noops because it keeps the logic simple for finding the before/after timestamps. But I can probably use ggml_vk_is_empty before passing it to the logger.
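A sketch of that filtering, with `is_empty` standing in for the result of `ggml_vk_is_empty`; the `perf_logger` struct and its fields are made-up names for illustration, not the actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical aggregator that drops empty (noop) nodes before logging.
// Timestamps are still written for every node to keep the before/after
// pairing simple; the filtering happens only at reporting time.
struct perf_logger {
    // op name -> (total nanoseconds, number of occurrences)
    std::map<std::string, std::pair<double, uint64_t>> totals;

    void log(const std::string& op, double ns, bool is_empty) {
        if (is_empty) return; // noops always measure ~1 us and add no signal
        auto& entry = totals[op];
        entry.first  += ns;
        entry.second += 1;
    }
};
```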
OK, I've filtered them out in the latest commit. If we want to go ahead with the timestamp queries, I'll create a PR for it.
It's definitely a smarter way of doing this, so if it's alright with @netrunnereve it should be preferred. Let's also wait for them to test it.
It works great now, let me close this and @jeffbolznv can go submit a new PR. I also agree that it's smarter to get the data from the driver without affecting how the ops are run. BTW, I find it strange that "x" got replaced with "in", as that implies that the time indicated is the total time needed to complete all the ops. Like "10 in 100us" implies that the entire block of 10 occurrences finished in 100 us, whereas "10 x 100us" implies that each occurrence completed in 100 us.
You're right, I thought it was a total, but it's an average. I'll revert that part.
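To make the distinction concrete with made-up numbers: an average and a total are related by the run count, so "10 x 100us" and "10 in 1000us" would describe the same measurement. The helper names here are purely illustrative:

```cpp
#include <cassert>
#include <cstdint>

// "N x Tus" reads as per-occurrence time; "N in Tus" reads as a total.
// Given one of the two quantities and the run count, the other follows.
double total_us(uint64_t runs, double avg_us_per_run) {
    return static_cast<double>(runs) * avg_us_per_run;
}

double avg_us(uint64_t runs, double total) {
    return total / static_cast<double>(runs);
}
```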
This re-adds the GGML_VULKAN_PERF feature after it was removed in #9118. It's set up to submit the ops individually like GGML_VULKAN_CHECK_RESULTS (I think I did that right).
The tests are passing with GGML_VULKAN_PERF turned on.