This article explains how TensorRT-LLM makes large language model inference more efficient through four techniques: quantization, in-flight batching, attention optimizations, and graph rewriting.