Closed
The copy function of the grammar sampler specifically is O(n^2) in time complexity.
On 13b 4_K_M, with all layers fully offloaded to my GPU (RTX 3060, 12GB VRAM), I normally get a token speed of ~16 T/s. This degrades to ~10 T/s with grammar sampling on, regardless of the complexity of the grammar being used.
I'm not sure whether the sampler code is currently threaded, or whether threading would even help, but hopefully the grammar implementation could be refactored in some way to accommodate this.
I'm also not sure whether it's iterating over the entire list of 32,000 logits. If this time complexity is inherently necessary, maybe it would be smart to run the grammar sampler only after the truncation samplers (Top K, Min P, ...), so it only has to check the candidates that survive truncation.
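To illustrate the reordering idea, here is a minimal sketch (not llama.cpp's actual sampler code; `top_k`, `grammar_filter`, and the `allowed` set are hypothetical stand-ins). The point is simply that truncating first reduces the number of per-token grammar checks from the full vocabulary size down to k:

```python
import heapq

def top_k(logits, k):
    # Truncation sampler: keep only the k highest-logit token ids.
    return heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])

def grammar_filter(candidates, allowed):
    # Hypothetical grammar check: keep candidates the grammar accepts.
    # In llama.cpp the real check walks the grammar's parse stacks instead.
    return [t for t in candidates if t in allowed]

# Toy vocabulary of 32,000 pseudo-random logits.
vocab_size = 32_000
logits = [(i * 2654435761) % vocab_size / vocab_size for i in range(vocab_size)]
allowed = {5, 17, 999, 31_999}  # tokens the grammar would currently permit

# Order A: grammar over the full vocab -> 32,000 grammar checks per token.
checks_full = vocab_size

# Order B: truncate to top-k first -> only k grammar checks per token.
k = 40
survivors = top_k(logits, k)
legal = grammar_filter(survivors, allowed)
checks_truncated = len(survivors)

print(checks_full, checks_truncated)  # 32000 vs 40 checks
```

One caveat with this ordering: truncation may discard every grammar-legal token, so a real implementation would need a fallback that re-runs the grammar check over the full (or a wider) candidate list when the filtered set comes back empty.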