Draft-then-verify for dramatic speedups
Speculative decoding uses a small, fast draft model to propose tokens, then verifies them with your target model in parallel. Same outputs, dramatically faster.
Speculative Decoding Flow
Why It Works
Small models generate tokens fast. Large models can verify multiple tokens in a single forward pass. When the draft tokens match what the large model would have produced, you get several tokens for the cost of one slow pass instead of generating each one sequentially.
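The loop above can be sketched in a few lines. This is a toy illustration, not the rotalabs-accel implementation: `draft_next` and `target_next` stand in for real models, and a production system scores all draft positions in one batched forward pass rather than one call per token.

```python
def draft_next(prefix):
    # Toy "small model": predicts a simple repeating pattern.
    return (prefix[-1] + 1) % 5 if prefix else 0

def target_next(prefix):
    # Toy "large model": here it happens to agree with the draft.
    return (prefix[-1] + 1) % 5 if prefix else 0

def speculative_decode(prefix, k=4, steps=3):
    """Propose k draft tokens per step, keep the longest verified run."""
    tokens = list(prefix)
    for _ in range(steps):
        # 1. Draft k tokens cheaply with the small model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: accept draft tokens while the target agrees.
        #    (A real system checks all k positions in parallel.)
        for t in draft:
            if target_next(tokens) == t:
                tokens.append(t)  # accepted: a "free" token
            else:
                # Rejected: fall back to the target model's own token.
                tokens.append(target_next(tokens))
                break
    return tokens
```

When the two models always agree, every step yields `k` tokens for one target-model pass; the worst case degrades gracefully to ordinary decoding.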
Mathematically Equivalent
Speculative decoding produces outputs from the exact same distribution as standard decoding: rejected draft tokens are replaced by the target model's own choices, so the verification step guarantees no quality loss. Faster without compromise.
Custom Triton Kernels
We write low-level GPU kernels optimized for your specific workload. Attention, MLP, and memory operations tuned for maximum throughput on your hardware.
Workload-Specific Tuning
Draft model selection, speculation length, and kernel parameters tuned for your use case. What works for code generation differs from what works for chat.
Verified Synthesis
274x speedup for code generation with verification
Conversational
3-5x typical speedup for chat applications
Research Foundation
Speculative decoding is a well-established technique with strong theoretical guarantees. Our implementation builds on this research while adding production-grade optimizations.
Open source: the rotalabs-accel toolkit is available at rotalabs.ai. Benchmark it yourself.