llama.cpp
High-performance local inference engine with OpenAI-compatible server mode.
llama.cpp provides efficient CPU and GPU inference for GGUF models. Its built-in server mode exposes an OpenAI-compatible API.
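As a rough sketch, a typical way to start the server looks like the following. It assumes a recent llama.cpp build where the server binary is named llama-server; the model path is a placeholder for any local GGUF file, and flag names can vary between releases.

# Start the llama.cpp HTTP server on port 8080 with a local GGUF model.
# ./models/model.gguf is a placeholder path; point it at your own file.
./llama-server -m ./models/model.gguf --port 8080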
Configuration
[backend]
name = "llamacpp"               # backend identifier
url = "http://localhost:8080"   # address of the running llama.cpp server
model = "default"               # model name sent in requests; llama.cpp serves the model loaded at startup
Default Endpoint
POST http://localhost:8080/v1/chat/completions
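A quick way to verify the endpoint is a standard OpenAI-style chat completion request. The sketch below assumes the server is running locally on port 8080 and reuses the model name from the configuration above.

# Send a minimal OpenAI-compatible chat request to the local server.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'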