# vLLM
High-throughput LLM serving engine with OpenAI-compatible API.
vLLM is a high-throughput serving engine optimized for GPU inference. It uses continuous batching to keep the GPU saturated across concurrent requests and PagedAttention for efficient KV-cache memory management.
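The same continuous-batching scheduler also backs vLLM's offline Python API, which takes a list of prompts in one call and interleaves their generation. The following is a minimal sketch, assuming vLLM is installed locally and the model weights (the same model used in the configuration below) can be downloaded:

```python
# Minimal sketch of vLLM's offline Python API: a list of prompts is handed
# to the engine in one call and scheduled with continuous batching.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(
    ["What is PagedAttention?", "Why does continuous batching raise throughput?"],
    params,
)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())
```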
## Configuration
```toml
[backend]
name = "vllm"
url = "http://localhost:8000"
model = "meta-llama/Llama-3.2-3B-Instruct"
```
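As a quick sanity check of how these keys fit together, the sketch below loads the `[backend]` table and derives the chat-completions URL from `url`. The file name `config.toml` is an assumption (use whatever path your application reads), and `tomllib` requires Python 3.11+:

```python
# Minimal sketch: read the [backend] table and build the chat-completions URL.
# "config.toml" is a hypothetical file name for the configuration shown above.
import tomllib

with open("config.toml", "rb") as f:
    backend = tomllib.load(f)["backend"]

base_url = backend["url"].rstrip("/")           # e.g. http://localhost:8000
chat_url = f"{base_url}/v1/chat/completions"
print(backend["name"], backend["model"], chat_url)
```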
## Default Endpoint
```
POST http://localhost:8000/v1/chat/completions
```
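Because the endpoint is OpenAI-compatible, it can be called with the official `openai` Python client. A minimal sketch, assuming the vLLM server is already running at the configured URL and that no API key is enforced (the `api_key` value below is a placeholder):

```python
# Minimal sketch: call the default endpoint with the OpenAI client.
# Assumes the vLLM server is running at http://localhost:8000; the api_key
# is a placeholder, only checked if the server was started with auth enabled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```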