Grammar sampler implementation causes non-trivial token speed degradation · Issue #3980 · ggml-org/llama.cpp
Closed
@kalomaze

Description

The copy function of the Grammar sampler specifically is O(n^2) in time complexity.

On a 13B Q4_K_M model with all layers offloaded to my GPU (RTX 3060, 12 GB VRAM), I normally get a token speed of ~16 T/s. This degrades to ~10 T/s with grammar sampling enabled, regardless of the complexity of the grammar being used.

I'm not sure whether the sampler code is threaded at the moment, or whether threading would help, but hopefully the grammar implementation can be refactored in some way to reduce this overhead.

I'm also not sure whether it's running through the entire list of 32,000 logits. If this time complexity is inherently necessary, maybe it would be smart to run the grammar sampler only after truncation samplers (Top K, Min P, ...).
