LLM Inference

  • KV-Cache stores the attention key and value tensors computed during the forward pass, so those parts of the forward pass don't need to be unnecessarily rerun for each subsequent token

    • KV-Cache can be huge: it grows linearly with batch size, sequence length, layer count, and hidden dimension (see the sketch below)
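
A back-of-the-envelope sketch in Python, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128) and fp16 storage; the model shape and numbers are illustrative assumptions, not from the notes above:

```python
# KV cache size: 2 tensors (K and V) per layer, one entry per token per KV head
def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Example: batch 32 at a 4096-token context (illustrative numbers)
print(f"{kv_cache_bytes(32, 4096) / 1e9:.1f} GB")  # ~68.7 GB -- far larger than the ~14 GB of fp16 weights
```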

Key Metrics

  • Time to first token (TTFT): How quickly the user starts seeing the first token after entering their query

  • Time per output token (TPOT): Time to generate each output token for a user querying the model. Corresponds to how each user will perceive the “speed” of the model
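
The two metrics combine into end-to-end latency: the first token arrives after TTFT, and each later token adds one TPOT. A minimal sketch (function name and numbers are my own, purely illustrative):

```python
# End-to-end generation latency from TTFT and TPOT
def total_latency(ttft_s, tpot_s, n_output_tokens):
    # First token arrives after TTFT; each remaining token adds one TPOT
    return ttft_s + tpot_s * (n_output_tokens - 1)

# Example: 500 ms TTFT, 50 ms/token, 200-token reply
print(f"{total_latency(0.5, 0.05, 200):.2f} s")  # 10.45 s
```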

Bandwidth vs. Compute Bound

  • At low batch sizes LLM inference is bandwidth bound
    • Limited by how quickly weights can be paged in/out of memory into cache in order to do computation; each weight load does only a small amount of work
  • At high batch sizes LLM inference is compute bound, since each weight load is amortized across many tokens in the batch (see the roofline-style sketch below)
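
A roofline-style sketch of where the crossover sits, assuming ~2 FLOPs per parameter per generated token and A100-class hardware numbers (312 TFLOP/s fp16, 2 TB/s HBM); all constants are illustrative assumptions:

```python
# Is a decode step memory-bound or compute-bound?
PEAK_FLOPS = 312e12  # FLOPs/s (illustrative, roughly A100 fp16)
PEAK_BW = 2e12       # bytes/s (illustrative HBM bandwidth)

def decode_regime(batch, n_params=7e9, bytes_per_param=2):
    flops = 2 * n_params * batch               # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param   # weights read once per decode step
    intensity = flops / bytes_moved            # arithmetic intensity, FLOPs per byte
    ridge = PEAK_FLOPS / PEAK_BW               # hardware ops:byte ratio (~156 here)
    return "compute bound" if intensity > ridge else "bandwidth bound"

print(decode_regime(1))    # bandwidth bound
print(decode_regime(256))  # compute bound
```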

Model FLOPs Utilization (MFU) = (achieved FLOPs/s) / (peak FLOPs/s)
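
A worked example of MFU for single-stream decode, assuming a 7B-parameter model, ~2 FLOPs per parameter per token, 50 ms TPOT, and 312 TFLOP/s peak (all numbers illustrative):

```python
# MFU: achieved FLOPs/s over peak FLOPs/s
def mfu(n_params, tpot_s, peak_flops, batch=1):
    achieved = 2 * n_params * batch / tpot_s  # ~2 FLOPs per parameter per generated token
    return achieved / peak_flops

print(f"{mfu(7e9, 0.05, 312e12):.2%}")  # ~0.09% -- single-stream decode barely uses the FLOPs
```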

Model Bandwidth Utilization (MBU) = (achieved bandwidth) / (peak bandwidth)

  • Achieved bandwidth is (total model parameter size + KV cache size) / TPOT
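
And the matching MBU calculation, using the achieved-bandwidth definition above with assumed numbers (14 GB of fp16 weights, 2 GB KV cache, 50 ms TPOT, 2 TB/s peak bandwidth):

```python
# MBU: achieved memory bandwidth over peak bandwidth
def mbu(param_bytes, kv_cache_bytes, tpot_s, peak_bw):
    achieved = (param_bytes + kv_cache_bytes) / tpot_s  # bytes moved per token / time per token
    return achieved / peak_bw

print(f"{mbu(14e9, 2e9, 0.05, 2e12):.0%}")  # 16% of peak bandwidth
```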
