vLLM Office Hours – Using NVIDIA CUTLASS for High-Performance Inference – September 05, 2024
#vLLM #Office #Hours #NVIDIA #CUTLASS
“Neural Magic”
In this session, we explored the exciting updates in the vLLM v0.6.0 release, including significant system changes that led to a 2.7x throughput increase and a 5x latency improvement. We then dove into how you can leverage NVIDIA CUTLASS to optimize high-performance inference with INT8 and FP8…