vLLM Office Hours – Using NVIDIA CUTLASS for High-Performance Inference – September 05, 2024

“Neural Magic”

In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8…
