Draft-then-verify for dramatic speedups
Speculative decoding uses a small, fast draft model to propose multiple tokens, then verifies them with the target model in a single forward pass. Same outputs, fewer forward passes.
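The draft-then-verify loop can be sketched in a few lines. This is a toy greedy variant, not a production implementation: `target` and `draft` are hypothetical callables that map a token prefix to a next token, and a draft token is accepted only on exact agreement (the probabilistic acceptance rule is covered under Mathematical Equivalence below).

```python
def speculative_step(target, draft, prefix, k):
    """One draft-then-verify iteration (greedy sketch, hypothetical API)."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target model verifies all k positions; in a real system this is
    #    one batched forward pass, simulated here position by position.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        if target(ctx) == tok:           # draft agreed with target: keep it
            accepted.append(tok)
            ctx.append(tok)
        else:                            # first mismatch: emit the target's
            accepted.append(target(ctx)) # token instead and stop
            break
    else:
        accepted.append(target(ctx))     # bonus token when all k are accepted
    return accepted
```

Note that even a full rejection still yields one token from the target model, so each iteration makes progress.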
Speculative Decoding Flow
Mathematical Equivalence
Speculative decoding produces the exact same output distribution as standard autoregressive decoding. The verification step uses rejection sampling to ensure correctness. No quality loss, guaranteed.
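The rejection-sampling rule can be made concrete: accept a draft token x (drawn from the draft distribution q) with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(p − q, 0). This is the standard correction rule from the speculative decoding literature; `verify_token` and its dict-based distributions below are purely illustrative.

```python
import random

def verify_token(p, q, x, rng=random.random):
    """Accept draft token x ~ q with probability min(1, p[x]/q[x]);
    on rejection, resample from the normalized residual max(p - q, 0).
    p, q: dicts mapping token -> probability (q[x] > 0 since x was
    sampled from q). Returns (token, was_draft_accepted)."""
    if rng() < min(1.0, p.get(x, 0.0) / q[x]):
        return x, True
    # Rejected: sample from the residual distribution, which exactly
    # compensates for the bias of accepting draft tokens.
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
    z = sum(residual.values())
    r, acc = rng() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return t, False  # float-rounding fallback
```

Summing the accept and resample cases shows every token is emitted with exactly probability p(x), which is why the output distribution is unchanged.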
Acceptance Rate
Each draft token is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft's, so tokens the target model also finds likely usually survive verification. Typical per-token acceptance rates are 60-80%. Higher rates = more speedup. Rates depend on draft model quality and task domain.
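The payoff of a given acceptance rate follows a standard geometric-series formula: with per-token acceptance rate alpha and K drafted tokens, the expected number of tokens emitted per target forward pass is (1 - alpha^(K+1)) / (1 - alpha). A quick sketch:

```python
def expected_tokens(alpha, k):
    """Expected tokens generated per target forward pass, assuming each
    draft token is accepted independently with probability alpha and k
    tokens are drafted: (1 - alpha**(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1          # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

At alpha = 0.8 and K = 4, this gives roughly 3.4 tokens per target pass, which is where the commonly quoted 2-3x wall-clock speedups come from once draft-model overhead is subtracted.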
Speculation Length (K)
Number of tokens to propose per iteration. Higher K = more potential speedup per target pass, but the chance that all K drafts survive verification drops, so wasted draft work grows. Optimal K varies by workload. Typically 4-8 for chat, 16-32 for code.
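The trade-off can be explored numerically. A toy search, assuming each drafted token costs a fixed fraction of a target forward pass (`draft_cost`, a hypothetical cost model, not a measured quantity) and using the standard expected-accepted-tokens formula:

```python
def best_k(alpha, draft_cost, max_k=32):
    """Pick the K maximizing tokens per unit time, under a toy cost model:
    one iteration costs 1 target pass + k * draft_cost, and yields
    (1 - alpha**(k+1)) / (1 - alpha) expected tokens."""
    def throughput(k):
        e = (1 - alpha ** (k + 1)) / (1 - alpha) if alpha < 1 else k + 1
        return e / (1 + k * draft_cost)
    return max(range(1, max_k + 1), key=throughput)
```

Under this model the optimum shifts up sharply with acceptance rate, which matches the pattern above: highly predictable domains like code tolerate much larger K than open-ended chat.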
Draft Model Selection
A smaller model from the same family works best: Llama 7B drafts for Llama 70B, Mistral 7B drafts for Mixtral. Fine-tuned draft models can achieve higher acceptance rates.
Why 274x for Verified Synthesis?
Code generation with verification achieves extreme speedups because: (1) the draft model proposes code, (2) a verifier checks correctness cheaply by executing it, (3) if the code is wrong, restarting the draft is cheap. The target model runs only for accepted sequences.
Key insight: Verification doesn't need LLM reasoning. Execution is deterministic and fast.
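A minimal sketch of execution-as-verifier: run each drafted candidate against test predicates and keep the first that passes. All names here are illustrative, and a real harness would add sandboxing and timeouts.

```python
def verified_synthesis(draft_candidates, tests):
    """Execute-to-verify sketch: accept the first candidate source string
    whose definitions satisfy every test predicate."""
    for src in draft_candidates:
        ns = {}
        try:
            exec(src, ns)                    # execution is the verifier
            if all(t(ns) for t in tests):
                return src                   # accepted: no LLM judgment needed
        except Exception:
            pass                             # wrong draft: restart is cheap
    return None
```

Because the pass/fail signal is deterministic, rejected drafts cost only a cheap draft-model call plus an execution, never a large-model forward pass.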
Hand-optimized GPU operations
# Fused attention eliminates memory round-trips
@triton.jit
def fused_attention(q, k, v, output, ...):
    # Compute QK^T, softmax, and V multiplication
    # in a single kernel without intermediate writes
    acc = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
    # ... optimized implementation