# vLLM
High-throughput LLM serving engine with OpenAI-compatible API.
vLLM is a high-throughput serving engine optimized for GPU inference. It uses continuous batching to keep the GPU saturated across concurrent requests and PagedAttention for efficient KV-cache memory management.
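The same continuous-batching scheduler also backs vLLM's offline Python API, which takes a list of prompts in one call and interleaves their generation. The following is a minimal sketch, assuming vLLM is installed locally and the model weights (the same model used in the configuration below) can be downloaded:

```python
# Minimal sketch of vLLM's offline Python API: a list of prompts is handed
# to the engine in one call and scheduled with continuous batching.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["What is PagedAttention?", "Why does continuous batching raise throughput?"],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())
```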
## Configuration
```toml
[backend]
name = "vllm"
url = "http://localhost:8000"
model = "meta-llama/Llama-3.2-3B-Instruct"
```
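As a quick sanity check of how these keys fit together, the sketch below loads the `[backend]` table and derives the chat-completions URL from `url`. The file name `config.toml` is an assumption (use whatever path your application reads), and `tomllib` requires Python 3.11+:

```python
# Minimal sketch: read the [backend] table and build the chat-completions URL.
# "config.toml" is a hypothetical file name for the configuration shown above.
import tomllib

with open("config.toml", "rb") as f:
    backend = tomllib.load(f)["backend"]

base_url = backend["url"].rstrip("/")           # e.g. http://localhost:8000
chat_url = f"{base_url}/v1/chat/completions"
print(backend["name"], backend["model"], chat_url)
```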
## Default Endpoint
```
POST http://localhost:8000/v1/chat/completions
```
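Because the endpoint is OpenAI-compatible, it can be called with the official `openai` Python client. A minimal sketch, assuming the vLLM server is already running at the configured URL and that no API key is enforced (the `api_key` value below is a placeholder):

```python
# Minimal sketch: call the default endpoint with the OpenAI client.
# Assumes the vLLM server is running at http://localhost:8000; the api_key
# is a placeholder, only checked if the server was started with auth enabled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```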