This article introduces how attention in the decode phase is optimized on GPUs, based on RTP-LLM practices.
This article introduces how to deploy an optimized LLM inference service in a cloud-native environment, using the TensorRT-LLM-optimized Llama-2-hf model as an example.
This article uses the Llama-2-7b-hf model as an example to demonstrate how to deploy the Triton inference framework with KServe on Alibaba Cloud ACK.
This article introduces how TensorRT-LLM improves the efficiency of large language model inference through quantization, in-flight batching, optimized attention kernels, and graph rewriting.