LLM Inference
The KV-cache stores the results of forward-pass computations (the attention keys and values) so that parts of the forward pass do not have to be rerun unnecessarily for each subsequent token
- The KV-cache can be huge (rough sizing sketch below)
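A rough sizing sketch: the cache holds one K and one V vector per layer, per KV head, per token. The model shape and batch size below are illustrative assumptions (roughly a 7B-class model in fp16), not measurements.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes/element)
gb = kv_cache_bytes(batch_size=16, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128) / 1e9
print(f"KV-cache: ~{gb:.0f} GB")  # ~34 GB, vs. ~14 GB for the fp16 weights themselves
```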
Key Metrics
- Time to first token (TTFT): How quickly the user starts seeing the first token after entering their query
- Time per output token (TPOT): Time to generate each output token for each user querying the model. Corresponds to how each user perceives the “speed” of the model
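A minimal sketch of measuring both metrics from a streamed generation; `stream_tokens` is a hypothetical generator standing in for whatever streaming client is actually used.

```python
import time

def measure_ttft_tpot(stream_tokens, prompt):
    """Return (TTFT, TPOT) in seconds for one streamed generation.

    `stream_tokens` is a hypothetical callable that yields output tokens
    one at a time -- swap in the real streaming client.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    if ttft is None:
        raise ValueError("stream produced no tokens")
    total = time.perf_counter() - start
    # TPOT: average time for each output token after the first
    tpot = (total - ttft) / max(n_tokens - 1, 1)
    return ttft, tpot
```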
Bandwidth vs. Compute Bound
At lower batch sizes LLM inference is bandwidth bound
- Limited by how quickly the weights can be paged from memory into cache in order to do the computation
At higher batch sizes LLM inference is compute bound
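A rough roofline-style sketch of where decode flips from bandwidth bound to compute bound: each decode step streams every weight once but does roughly 2·B FLOPs per weight at batch size B, so arithmetic intensity grows with batch size. The hardware peaks and model size below are illustrative A100-class assumptions, and KV-cache traffic is ignored.

```python
# Assumed hardware peaks (roughly A100-class); adjust for the actual GPU.
PEAK_FLOPS = 312e12   # fp16 tensor-core throughput, FLOP/s
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s
N_PARAMS = 7e9        # 7B-class model, fp16 (2 bytes per parameter)

t_mem = N_PARAMS * 2 / PEAK_BW  # time to stream all weights once per decode step
for batch in (1, 8, 64, 256):
    t_compute = 2 * batch * N_PARAMS / PEAK_FLOPS  # ~2 FLOPs per weight per sequence
    t_step = max(t_mem, t_compute)                 # the slower resource is the bottleneck
    bound = "compute" if t_compute > t_mem else "bandwidth"
    print(f"batch {batch:3d}: {t_step * 1e3:5.1f} ms/step ({bound}-bound)")
# With these assumed peaks the crossover lands around batch ~156.
```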
Model FLOPs Utilization (MFU) = (achieved FLOPs) / (peak FLOPs)
Model Bandwidth Utilization (MBU) = (achieved bandwidth) / (peak bandwidth)
- Achieved bandwidth = (total model parameter size + KV-cache size) / TPOT
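Putting the two formulas together in a small sketch; the model size, KV-cache size, TPOT, and hardware peak below are illustrative assumptions.

```python
def mfu(achieved_flops_per_s, peak_flops_per_s):
    return achieved_flops_per_s / peak_flops_per_s

def mbu(param_bytes, kv_cache_bytes, tpot_s, peak_bw_bytes_per_s):
    # Achieved bandwidth: weights plus KV cache moved once per output token
    achieved_bw = (param_bytes + kv_cache_bytes) / tpot_s
    return achieved_bw / peak_bw_bytes_per_s

# Illustrative: 7B fp16 model (~14 GB), 2 GB KV cache, 10 ms TPOT, 2 TB/s peak bandwidth
print(f"MBU = {mbu(14e9, 2e9, 0.010, 2.0e12):.0%}")  # -> 80%
```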
References
- LLM Inference Performance Engineering: Best Practices
  - Blog from MosaicML with some good information about LLM inference
- Serving Quantized LLMs on NVIDIA H100 Tensor Core GPUs
- Speeding up LLM Inference with TensorRT-LLM - video from GTC
  - Lots of good information about serving
  - Has benchmarking results for TRT-LLM
  - Results illustrate various tradeoffs; e.g. with a longer context size there is less memory for the KV cache, so the max batch size decreases (see the rough sketch after this list)
  - As batch size decreases, compute utilization decreases
- Blog post illustrating batching techniques
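A rough sketch of the context-length vs. max-batch-size tradeoff noted above, assuming a fixed memory budget and the same illustrative 7B-class shape as earlier; all numbers are assumptions, not benchmark results.

```python
def max_batch_size(context_len, total_mem_gb=80, weight_gb=14,
                   n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # Memory left after the weights, divided by the KV-cache footprint of one sequence
    kv_bytes_per_seq = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    free_bytes = (total_mem_gb - weight_gb) * 1e9
    return int(free_bytes // kv_bytes_per_seq)

for ctx in (2048, 8192, 32768):
    print(f"context {ctx:6d}: max batch ~{max_batch_size(ctx)}")
# Longer contexts shrink the max batch size, which in turn lowers compute utilization.
```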