Closed
The copy function of the grammar sampler specifically is O(n^2) in time complexity.
On 13b 4_K_M, with all layers fully offloaded to my GPU (RTX 3060, 12GB VRAM), I normally get a token speed of ~16 T/s. This degrades to ~10 T/s with grammar sampling on, regardless of the complexity of the grammar being used.
I'm not sure whether the sampler code is currently threaded, or whether threading would even help, but hopefully the grammar implementation could be refactored in some way to accommodate this.
I'm also not sure whether it's iterating over the entire list of 32,000 logits. If this time complexity is inherently necessary, maybe it would be smart to run the grammar sampler only after the truncation samplers (Top K, Min P, ...), so it only has to check the candidates that survive truncation.
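To illustrate the reordering idea, here is a minimal sketch (not llama.cpp's actual sampler code; `top_k`, `grammar_filter`, and the `allowed` set are hypothetical stand-ins). The point is simply that truncating first reduces the number of per-token grammar checks from the full vocabulary size down to k:

```python
import heapq

def top_k(logits, k):
    # Truncation sampler: keep only the k highest-logit token ids.
    return heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])

def grammar_filter(candidates, allowed):
    # Hypothetical grammar check: keep candidates the grammar accepts.
    # In llama.cpp the real check walks the grammar's parse stacks instead.
    return [t for t in candidates if t in allowed]

# Toy vocabulary of 32,000 pseudo-random logits.
vocab_size = 32_000
logits = [(i * 2654435761) % vocab_size / vocab_size for i in range(vocab_size)]
allowed = {5, 17, 999, 31_999}  # tokens the grammar would currently permit

# Order A: grammar over the full vocab -> 32,000 grammar checks per token.
checks_full = vocab_size

# Order B: truncate to top-k first -> only k grammar checks per token.
k = 40
survivors = top_k(logits, k)
legal = grammar_filter(survivors, allowed)
checks_truncated = len(survivors)

print(checks_full, checks_truncated)  # 32000 vs 40 checks
```

One caveat with this ordering: truncation may discard every grammar-legal token, so a real implementation would need a fallback that re-runs the grammar check over the full (or a wider) candidate list when the filtered set comes back empty.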