common/llama: align structures to reduce cacheline size on 64-bit platforms #13710
base: master
Conversation
@USBhost, if you're not too lazy, can you test it? I'll try to post my results later; my PC configuration is not very powerful.
Are you sure this is something we need to handle ourselves? AFAIK all modern compilers already do this under the hood.
@ngxson, modern compilers may not always optimize this, so we make the alignment explicit and apply it.
Besides, I don't think these
C/C++ compilers typically cannot reorder structs. But ultimately, none of these structs are used in performance-sensitive paths, and it is not worth making the code less readable, introducing a breaking change, and risking new bugs for what is essentially going to save one cache miss out of a billion.
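A minimal sketch of that point (an editorial illustration, not code from this PR or from llama.cpp): since the compiler keeps members in declaration order, the same fields declared in a different order produce a different sizeof on a typical 64-bit ABI.

```cpp
// Editorial illustration: member order is preserved by the compiler, so
// padding (and therefore sizeof) follows directly from how fields are declared.
#include <cstdint>
#include <cstdio>

struct unordered_fields {   // typical 64-bit ABI assumed
    bool    flag;           // 1 byte, then 7 bytes of padding to align the int64_t
    int64_t count;          // 8 bytes
    int32_t id;             // 4 bytes, then 4 bytes of tail padding
};                          // sizeof == 24

struct ordered_fields {     // same members, widest first
    int64_t count;          // 8 bytes
    int32_t id;             // 4 bytes
    bool    flag;           // 1 byte, then 3 bytes of tail padding
};                          // sizeof == 16

int main() {
    std::printf("unordered: %zu bytes, ordered: %zu bytes\n",
                sizeof(unordered_fields), sizeof(ordered_fields));
}
```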
I also think this will not noticeably affect overall performance, but reordering the structures can change the generated assembly considerably, so it is advisable to test with a benchmark and decide based on the results.
If I understand correctly, you got from But tbh I think this brings less than 0.1% performance improvement while adding another task for other contributors: making sure the structs stay aligned. It just doesn't seem economically appealing to me.
… platforms
- llm_graph_context from 256 to 248 bytes
- llm_graph_params from 104 to 96 bytes
- llama_sampler_chain from 48 to 40 bytes
- llama_model_loader from 328 to 320 bytes (saved 1 cacheline)
- llama_model_params from 72 to 64 bytes (saved 1 cacheline)
- common_log_entry from 48 to 40 bytes
- templates_params from 112 to 96 bytes (saved 16 bytes)
- common_chat_params from 152 to 144 bytes
- common_chat_templates_inputs from 136 to 128 bytes (saved 1 cacheline)
- common_params from 4960 to 4888 bytes (saved 1 cacheline)
- common_params_sampling from 288 to 280 bytes
- common_grammar_trigger from 48 to 40 bytes
- cpu_params from 532 to 528 bytes
This PR decreases the cost of copying, moving, and creating these structures on common 64-bit processors, whose ABIs use 8-byte data alignment.
The smaller a structure or class is, the better its chance of fitting into the CPU cache. Most processors are already 64-bit, so the change won't make anything worse elsewhere.
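As a hedged sketch of the technique (the struct and field names below are made up for illustration and are not the actual common_params_sampling), ordering members from widest to narrowest removes the alignment holes, and a static_assert can turn the "keep the struct aligned" maintenance concern into a compile-time check:

```cpp
// Editorial sketch, not code from this PR: order members from widest to
// narrowest so no alignment holes appear, then pin the expected size so
// future edits that reintroduce padding are noticed at compile time.
#include <cstdint>

struct sampler_opts {          // assumes a 64-bit ABI (8-byte pointers)
    const char * grammar;      // offset  0, 8 bytes
    double       temp;         // offset  8, 8 bytes
    int64_t      seed;         // offset 16, 8 bytes
    int32_t      top_k;        // offset 24, 4 bytes
    float        top_p;        // offset 28, 4 bytes
    int16_t      n_prev;       // offset 32, 2 bytes
    bool         ignore_eos;   // offset 34, 1 byte
    bool         penalize_nl;  // offset 35, 1 byte
    // 4 bytes of tail padding to reach 8-byte alignment
};

static_assert(sizeof(sampler_opts) == 40,
              "sampler_opts grew or gained a padding hole; re-check field order");
```

If a later field addition changes the size, the assert message points the contributor at the field order instead of silently reintroducing a hole.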
Example pahole output:
/* XXX {n} bytes hole, try to pack */
This line shows where optimization is possible by rearranging the order of fields in structures and classes.

Master branch
This PR
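For reference, and only as a rough editorial illustration rather than actual output from either branch, a pahole report for a struct with a hole reads roughly like this, with each member's offset and size in the trailing comment and a summary at the end:

```cpp
// Illustrative only: approximate shape of a pahole report, not real output.
struct example_params {
        bool         use_defaults;        /*     0     1 */

        /* XXX 7 bytes hole, try to pack */

        void *       callback;            /*     8     8 */
        int          n_items;             /*    16     4 */

        /* size: 24, cachelines: 1, members: 3 */
        /* sum members: 13, holes: 1, sum holes: 7 */
        /* padding: 4 */
};
```

Moving use_defaults after n_items would remove the 7-byte hole and shrink this hypothetical struct from 24 to 16 bytes.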
Info about technique:
https://hpc.rz.rptu.de/Tutorials/AVX/alignment.shtml
https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2013_2014/epc-14-haase-svenhendrik-alignmentinc-presentation.pdf
https://en.wikipedia.org/wiki/Data_structure_alignment
https://stackoverflow.com/a/20882083
https://zijishi.xyz/post/optimization-technique/learning-to-use-data-alignment/
Affected structs: