llama.cpp
High-performance local inference engine with OpenAI-compatible server mode.
llama.cpp provides efficient CPU and GPU inference for GGUF models. Its built-in server mode exposes an OpenAI-compatible API.
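As a rough sketch, a typical way to start the server looks like the following. It assumes a recent llama.cpp build where the server binary is named llama-server; the model path is a placeholder for any local GGUF file, and flag names can vary between releases.

# Start the llama.cpp HTTP server on port 8080 with a local GGUF model.
# ./models/model.gguf is a placeholder path; point it at your own file.
./llama-server -m ./models/model.gguf --port 8080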
Configuration
[backend]
name = "llamacpp"               # backend identifier
url = "http://localhost:8080"   # address of the running llama.cpp server
model = "default"               # model name sent in requests; llama.cpp serves the model loaded at startup
Default Endpoint
POST http://localhost:8080/v1/chat/completions
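A quick way to verify the endpoint is a standard OpenAI-style chat completion request. The sketch below assumes the server is running locally on port 8080 and reuses the model name from the configuration above.

# Send a minimal OpenAI-compatible chat request to the local server.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'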