This article introduces how attention in the decode phase is optimized on GPUs, based on RTP-LLM practices.
This article introduces how to deploy an optimized LLM inference service in a cloud-native environment, using the TensorRT-LLM-optimized Llama-2-hf model as an example.
This article uses the Llama-2-7b-hf model as an example to demonstrate how to deploy the Triton inference framework with KServe on Alibaba Cloud ACK.
This article introduces how TensorRT-LLM improves the efficiency of large language model inference through quantization, in-flight batching, optimized attention kernels, and graph rewriting.