TensorRT-LLM

LLM Inference Acceleration: GPU Optimization for Attention in the Decode Phase

This article introduces how attention in the decode phase is optimized on the GPU, based on RTP-LLM practices.

Building a Large Language Model Inference Service Optimized by TensorRT-LLM Based on KServe on ASM

This article introduces how to deploy an optimized LLM inference service in a cloud-native environment, using a TensorRT-LLM-optimized Llama-2-hf model as an example.

Best Practices for Large Model Inference in ACK: TensorRT-LLM

This article uses the Llama-2-7b-hf model as an example to demonstrate how to deploy the Triton inference framework with KServe on Alibaba Cloud ACK.

Accelerating Large Language Model Inference: High-performance TensorRT-LLM Inference Practices

This article introduces how TensorRT-LLM improves the efficiency of large language model inference through quantization, in-flight batching, attention optimizations, and graph rewriting.