-
Notifications
You must be signed in to change notification settings - Fork 12.4k
Closed
Labels
Description
Refs:
In simple terms, after implementing batched decoding (a.k.a. parallel decoding) we can extend the inference functionality to support applying a custom attention mask to the batch. This can be used to create a causal tree mask that allows to evaluate a tree of continuations in a single pass, instead of a large batch of independent sequences.
This is useful for implementing advanced speculative strategies such as SpecInfer's token tree verification and Medusa heads
bobqianic, wsxiaoys, ps4vs, jhen0409, Mihaiii and 6 moreKerfuffleV2, Mihaiii, lin72h, PhilippeFerreiraDeSousa, lalcmellkmal and 1 more
53AE
div>