NVIDIA
TensorRT-LLM
TensorRT-LLM Architecture Docs
TensorRT engines embed the network weights
- However, TensorRT can refit engines to update the weights after compilation
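A minimal sketch of refitting with the plain TensorRT Python API, assuming the engine was built with the REFIT flag; the engine path, layer name, and weights file below are hypothetical.

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize an engine that was built with trt.BuilderFlag.REFIT set (hypothetical path).
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Swap in new weights for one layer without recompiling the engine.
refitter = trt.Refitter(engine, logger)
new_kernel = np.load("updated_fc1_weights.npy")  # hypothetical weights file
refitter.set_weights("fc1", trt.WeightsRole.KERNEL, trt.Weights(new_kernel))  # "fc1" is a hypothetical layer name
assert refitter.refit_cuda_engine()  # returns True when all required weights were supplied
```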
TensorRT compilation compiles the graph of operations into an engine of optimized CUDA kernels
Plugins are a way to extend TensorRT’s compilation mechanism with new ways of optimizing the graph
- e.g. FlashAttention is implemented as a plugin
- FlashAttention is a mechanism to fuse operations and interleave computation to make them more efficient
- A plugin is a node inserted into the graph that maps to a user-defined GPU kernel
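As a rough illustration that plugins are registered units mapping graph nodes to custom kernels, this sketch (plain TensorRT Python API) lists the plugin creators in the global registry; the exact names vary by TensorRT version.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Load the standard plugin library so the built-in plugins register themselves.
trt.init_libnvinfer_plugins(logger, "")

# Each creator corresponds to a node type that maps onto a user-defined GPU kernel.
registry = trt.get_plugin_registry()
for creator in registry.plugin_creator_list:
    print(creator.name, creator.plugin_version)
```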
- Runtime is responsible for loading the TensorRT engines and driving their execution
- Runtimes can be written in Python or C++
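A hedged sketch of the Python runtime path, assuming tensorrt_llm.runtime.ModelRunner and a Hugging Face tokenizer; the engine and tokenizer directories are placeholders.

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/engines/llama-7b"                                   # placeholder: output of trtllm-build
tokenizer = AutoTokenizer.from_pretrained("/models/llama-7b-hf")   # placeholder tokenizer dir

# The runner loads (deserializes) the compiled engine and drives its execution.
runner = ModelRunner.from_dir(engine_dir=engine_dir)

input_ids = tokenizer("What is TensorRT-LLM?", return_tensors="pt").input_ids.int()
output_ids = runner.generate(
    batch_input_ids=[input_ids[0]],   # list of 1-D token-id tensors, one per request
    max_new_tokens=64,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# Output shape is (batch, beams, sequence length); decode beam 0 of the first request.
print(tokenizer.decode(output_ids[0][0].tolist(), skip_special_tokens=True))
```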
Looks like this process is being refactored, with the goal of moving away from convert-checkpoint scripts that sit outside the core TensorRT-LLM library repository
- tensorrt_llm/models/llama is an example of the new way
trtllm-build builds a TensorRT engine from a TensorRT-LLM checkpoint
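A hedged sketch of that in-library flow for LLaMA, assuming a recent TensorRT-LLM version that exposes from_hugging_face and save_checkpoint; the paths are placeholders.

```python
from tensorrt_llm.models import LLaMAForCausalLM

# Convert Hugging Face weights into a TensorRT-LLM checkpoint from inside the library,
# rather than via a standalone convert_checkpoint.py script.
model = LLaMAForCausalLM.from_hugging_face("/models/llama-7b-hf", dtype="float16")  # placeholder HF dir
model.save_checkpoint("/checkpoints/llama-7b")                                       # placeholder output dir

# The resulting checkpoint directory (a config.json plus per-rank .safetensors weight
# shards) is what trtllm-build consumes.
```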
Triton Inference Server
- Supports serving multiple models
- Has queues and schedulers to handle routing requests to the different models
- Supports ensemble models
- Multiple backends for different frameworks (e.g. TensorFlow, ONNX, PyTorch)
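A small sketch of poking at a running Triton server with the tritonclient Python package (assuming the default HTTP port 8000); listing the model repository shows the multiple-models aspect directly.

```python
import tritonclient.http as httpclient

# Assumes a Triton server is listening on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())

# Each entry is a separately servable model (e.g. preprocessing, the TensorRT-LLM
# engine, postprocessing, and an ensemble tying them together).
for model in client.get_model_repository_index():
    print(model["name"], model.get("state"))
```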
TensorRT-LLM is a Python API for defining large language models
- It's optimized to execute them on NVIDIA GPUs
- You define models using the Python API and then compile them to TensorRT engines for NVIDIA GPUs
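A hedged sketch of the define-then-compile flow in Python (roughly what trtllm-build automates); the BuildConfig fields shown and the paths are placeholders and vary by version.

```python
import tensorrt_llm
from tensorrt_llm import BuildConfig
from tensorrt_llm.models import LLaMAForCausalLM

# Load the Python model definition from a previously converted TensorRT-LLM checkpoint.
model = LLaMAForCausalLM.from_checkpoint("/checkpoints/llama-7b")   # placeholder path

# Compile the Python-defined graph into a TensorRT engine and serialize it to disk.
build_config = BuildConfig(max_batch_size=8, max_input_len=1024)
engine = tensorrt_llm.build(model, build_config)
engine.save("/engines/llama-7b")                                     # placeholder output dir
```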
Deploying LLMs with TensorRT-LLM on Triton
- Download the weights (e.g. from Hugging Face)
- Download the example model code from the TensorRT-LLM repository
  - This repository contains convert_checkpoint.py to convert the weights into a TensorRT-LLM checkpoint
    - Question: what exactly is the output? Is it a set of floats? C++ code?
- Use trtllm-build to compile the model to a TensorRT engine ([docs](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#quick-start-guide-compile))
  - I think this compiles the model from the Python TensorRT-LLM API to specific kernels
- Fill in the model configuration
  - Model configuration is a proto
  - fill_template.py is a script to modify the text version of the proto (a rough sketch of this substitution appears after this list)
  - Specify things like:
    - Where the compiled model engine is
      - If your engine is going to be on a volume inside a container, the location should point to the location inside that volume
      - Question: Is this the model weights as well? How do you use Triton's API to dynamically load models?
    - What tokenizer to use
    - How to handle the KV cache when performing inference in batches
- Start the docker container
- Step 4 has you logging into Hugging Face to get the tokenizer and installing some pip dependencies before calling launch_triton_server.py. Why isn't that baked in?
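A rough approximation of what the fill_template.py step does, namely substituting placeholders in the text-format config protos; the placeholder names and paths below are illustrative, not necessarily the exact ones used in the repo.

```python
from string import Template

# Illustrative values; in practice these point at the engine and tokenizer locations
# as seen from inside the Triton container (i.e. paths on the mounted volume).
values = {
    "engine_dir": "/models/tensorrt_llm/1/engines",
    "tokenizer_dir": "/models/tokenizer",
    "triton_max_batch_size": "8",
}

# fill_template.py-style substitution over the text proto (config.pbtxt).
config_path = "triton_model_repo/tensorrt_llm/config.pbtxt"  # hypothetical template with ${...} placeholders
with open(config_path) as f:
    filled = Template(f.read()).safe_substitute(values)
with open(config_path, "w") as f:
    f.write(filled)
```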
When serving an LLM, Triton actually runs multiple models, e.g. a preprocessing model is responsible for tokenization ([ref](https://github.com/triton-inference-server/tensorrtllm_backend))
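Once the server is up, a hedged example of hitting the ensemble's generate endpoint over HTTP; the field names follow the tensorrtllm_backend quick start and depend on how the ensemble's config is defined.

```python
import requests

# The ensemble chains preprocessing (tokenization), the TensorRT-LLM engine, and
# postprocessing (detokenization) behind a single endpoint.
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "What is TensorRT-LLM?",
        "max_tokens": 64,
        "bad_words": "",
        "stop_words": "",
    },
)
print(resp.json()["text_output"])
```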