Inference Backend
Client interface, OpenAI-compatible API translation, streaming and structured response modes, per-request metrics, and transcription support.
The inference backend presents a simple C++ function call interface to the rest of the plugin. Callers (UI thread, audio thread) make ordinary function calls. The backend translates these into OpenAI-compatible HTTP API calls, sends them to the configured inference server, and delivers responses back to the caller, either as a stream of tokens (text) or as a complete response (structured output).
The caller never touches HTTP, JSON, or threading directly.
Call Flow
sequenceDiagram
participant Caller as Caller (UI)
participant Client as InferenceClient
participant BG as Background Thread
participant Server as Backend Server
Caller->>Client: chat(request)
Client-->>Caller: returns immediately
Client->>BG: enqueue request
BG->>Server: POST /v1/chat/completions
Server->>BG: SSE tokens
BG->>Client: token
Client->>Caller: onToken("Hi")
BG->>Client: token
Client->>Caller: onToken(" there")
BG->>Client: done
Client->>Caller: onComplete(response)
Two Response Modes
Streaming (text output)
For conversational/text responses. The backend sets stream: true in the API call and delivers tokens incrementally via a callback as they arrive over SSE.
client.chat(request, onToken, onComplete);
- onToken is called for each token as it arrives
- onComplete is called once when the full response is finished
- The call returns immediately (non-blocking)
- Returns a RequestHandle for cancellation
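A caller-side sketch of the streaming mode, using the types declared under Interface below; the message contents and the client instance name are illustrative:
// Sketch: streaming chat initiated from the UI thread.
ChatRequest request;
request.messages = {
    {"system", "You are a helpful music assistant."},
    {"user",   "Suggest a chord progression in D minor."}
};

RequestHandle handle = client.chat(
    request,
    [](const std::string& token) {
        // Append each token to the chat view as it arrives.
    },
    [](const ChatResponse& response) {
        // response.full_response holds the assembled text;
        // response.error / response.cancelled flag abnormal outcomes.
    });

// Later, e.g. when the user presses Stop:
client.cancel(handle);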
Complete (structured output)
For structured responses (JSON schema, MIDI data, analysis results). The backend sets response_format to a JSON schema and waits for the full response before delivering it.
client.complete(request, onComplete);
- onComplete is called once with the full parsed response
- No token-by-token streaming: the response arrives whole
- The call returns immediately (non-blocking)
- Returns a RequestHandle for cancellation
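And a sketch of the structured mode; the schema shown is illustrative, not one the plugin actually uses:
// Sketch: structured completion against a JSON schema.
StructuredRequest request;
request.messages = {
    {"system", "Return the requested notes as JSON."},
    {"user",   "Give me a four-note arpeggio in C major."}
};
request.schema_name = "note_list";
request.json_schema = R"({
  "type": "object",
  "properties": {
    "notes": { "type": "array", "items": { "type": "integer" } }
  },
  "required": ["notes"]
})";

client.complete(request, [](const StructuredResponse& response) {
    // response.raw_json holds the schema-conforming JSON string.
});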
OpenAI-Compatible API Translation
All backends speak the same wire protocol. The InferenceClient translates C++ calls into OpenAI-compatible API requests:
Streaming request
POST /v1/chat/completions
{
"model": "<configured model>",
"messages": [
{"role": "system", "content": "<system prompt>"},
{"role": "user", "content": "<user message>"}
],
"stream": true,
"temperature": 0.7,
"max_tokens": 512
}
Response: Server-Sent Events (SSE), each chunk containing a delta token.
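A sketch of how one SSE line can be reduced to a delta token; the use of nlohmann::json here is an assumption about the implementation, not part of the interface:
// Sketch: extract the delta token from a single SSE "data:" line.
#include <nlohmann/json.hpp>
#include <optional>
#include <string>

std::optional<std::string> parseSseDelta(const std::string& line) {
    const std::string prefix = "data: ";
    if (line.rfind(prefix, 0) != 0) return std::nullopt;  // not a data line
    const std::string payload = line.substr(prefix.size());
    if (payload == "[DONE]") return std::nullopt;         // end-of-stream marker

    auto json = nlohmann::json::parse(payload, nullptr, /*allow_exceptions=*/false);
    if (json.is_discarded()) return std::nullopt;         // malformed chunk

    auto& delta = json["choices"][0]["delta"];
    if (!delta.contains("content")) return std::nullopt;  // e.g. role-only delta
    return delta["content"].get<std::string>();
}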
Structured output request
POST /v1/chat/completions
{
"model": "<configured model>",
"messages": [
{"role": "system", "content": "<system prompt>"},
{"role": "user", "content": "<user message>"}
],
"stream": false,
"temperature": 0.0,
"max_tokens": 512,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "<schema name>",
"schema": { ... }
}
}
}
Response: Single JSON object with the complete response.
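The non-streaming path is simpler: the structured output arrives as a JSON string inside message.content. A sketch of pulling it out, again assuming nlohmann::json:
// Sketch: extract the schema-conforming payload from the full response body.
#include <nlohmann/json.hpp>
#include <string>

std::string extractStructuredContent(const std::string& body) {
    auto json = nlohmann::json::parse(body, nullptr, /*allow_exceptions=*/false);
    if (json.is_discarded() || !json.contains("choices"))
        return {};                       // surfaced to the caller as a parse error
    return json["choices"][0]["message"].value("content", std::string{});
}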
Interface
Types
struct ChatMessage {
std::string role; // "system", "user", "assistant"
std::string content;
};
struct ChatRequest {
std::vector<ChatMessage> messages;
float temperature = 0.7f;
int max_tokens = 512;
};
struct StructuredRequest {
std::vector<ChatMessage> messages;
std::string schema_name;
std::string json_schema; // JSON schema as string
float temperature = 0.0f;
int max_tokens = 512;
};
struct ChatResponse {
std::string full_response;
bool cancelled = false;
bool error = false;
std::string error_message;
};
struct StructuredResponse {
std::string raw_json;
bool cancelled = false;
bool error = false;
std::string error_message;
};
struct TranscriptionRequest {
std::vector<uint8_t> audio_data;
std::string format; // "wav", "flac", etc.
};
struct TranscriptionResponse {
std::string text;
bool error = false;
std::string error_message;
};
struct RequestMetrics {
double latency_ms = 0.0;
double time_to_first_token_ms = 0.0;
size_t tokens_generated = 0;
double tokens_per_second = 0.0;
};
struct TokenUsage {
size_t total_tokens = 0;
};
using RequestHandle = uint64_t;
InferenceClient
class InferenceClient {
public:
RequestHandle chat(
const ChatRequest& request,
std::function<void(const std::string& token)> onToken,
std::function<void(const ChatResponse&)> onComplete
);
RequestHandle complete(
const StructuredRequest& request,
std::function<void(const StructuredResponse&)> onComplete
);
void cancel(RequestHandle handle);
void transcribe(
const TranscriptionRequest& request,
std::function<void(const TranscriptionResponse&)> onComplete
);
bool isInFlight(RequestHandle handle) const;
void configure(const Config& config);
RequestMetrics lastRequestMetrics() const;
TokenUsage tokenUsage() const;
};
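A common caller pattern is to cancel a superseded request before issuing the next one; currentHandle and newRequest are illustrative names:
// Sketch: replace an in-flight request with a new one.
if (client.isInFlight(currentHandle))
    client.cancel(currentHandle);        // its onComplete fires with cancelled = true

currentHandle = client.chat(newRequest, onToken, onComplete);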
HttpClient (internal, mockable)
class HttpClient {
public:
virtual ~HttpClient() = default;
virtual HttpResponse send(const HttpRequest& request) = 0;
virtual HttpResponse sendStreaming(
const HttpRequest& request,
std::function<void(const std::string& chunk)> onChunk,
std::function<bool()> shouldCancel
) = 0;
};
In tests, a MockHttpClient replaces this to verify request URLs, headers, and bodies, and to simulate responses, SSE streams, and error conditions.
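A minimal sketch of what such a mock can look like; HttpRequest and HttpResponse are the internal types referenced above, and the fields captured here are illustrative:
// Sketch: test double that records requests and replays canned responses.
class MockHttpClient : public HttpClient {
public:
    HttpResponse send(const HttpRequest& request) override {
        lastRequest = request;                  // captured for assertions
        return cannedResponse;
    }

    HttpResponse sendStreaming(const HttpRequest& request,
                               std::function<void(const std::string& chunk)> onChunk,
                               std::function<bool()> shouldCancel) override {
        lastRequest = request;
        for (const auto& chunk : cannedChunks) {
            if (shouldCancel()) break;          // simulate mid-stream cancellation
            onChunk(chunk);
        }
        return cannedResponse;
    }

    HttpRequest lastRequest;
    HttpResponse cannedResponse;
    std::vector<std::string> cannedChunks;      // pre-baked SSE lines
};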
Audio Thread Access
The audio thread can read inference results (e.g., parsed MIDI from structured output) via a lock-free queue. It never calls chat() or complete() directly; those enqueue work on the background thread.
// Lock-free, called from audio thread only
std::optional<MidiResult> InferenceClient::readMidiResult();
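A sketch of how the audio callback consumes results; processBlock, the buffer types, and scheduleNotes are illustrative stand-ins for the plugin's own processing code:
// Sketch: non-blocking poll from the audio thread.
void processBlock(AudioBuffer& buffer, MidiBuffer& midi) {
    if (auto result = inferenceClient.readMidiResult()) {
        scheduleNotes(*result, midi);   // hypothetical helper that emits MIDI events
    }
    // ... normal audio processing continues whether or not a result arrived ...
}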
Error Handling
| Scenario | Behavior |
|---|---|
| Backend not running | onComplete called with error: true, message: "Connection refused" |
| Backend timeout | onComplete called with error: true, message: "Request timed out" |
| Invalid JSON response | onComplete called with error: true, message: "Failed to parse response" |
| SSE stream interrupted | onComplete called with partial response + error: true |
| Request cancelled | onComplete called with cancelled: true |
| Schema validation failure | onComplete called with error: true, message: "Response did not match schema" |
| Transcription failure | onComplete called with error: true, relevant message |
Errors are always delivered via the callback. They never throw, crash, or block.
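In practice the caller branches on the outcome inside onComplete; the UI helpers below are hypothetical:
// Sketch: handling the three outcomes in one callback.
client.chat(request, onToken, [](const ChatResponse& response) {
    if (response.cancelled)
        return;                                    // user aborted; nothing to show
    if (response.error) {
        showErrorBanner(response.error_message);   // hypothetical UI helper
        return;
    }
    appendToChatView(response.full_response);      // hypothetical UI helper
});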
Per-Request Metrics
Every completed request (streaming or structured) records timing metrics accessible via lastRequestMetrics():
| Metric | Description |
|---|---|
| latency_ms | Wall-clock time from request start to completion |
| time_to_first_token_ms | Time from request start to first streamed token (streaming only) |
| tokens_generated | Count of token chunks received |
| tokens_per_second | tokens_generated / (latency_ms / 1000), 0 on error |
Metrics reset at the start of each new request. Before any request, all values are zero.
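A sketch of reading the metrics once a request has finished; the log call is a hypothetical stand-in:
// Sketch: report throughput after completion.
client.chat(request, onToken, [&client](const ChatResponse&) {
    const RequestMetrics m = client.lastRequestMetrics();
    log("latency=" + std::to_string(m.latency_ms) + " ms, "
        + std::to_string(m.tokens_per_second) + " tok/s");
});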
Deterministic Inference
For evaluation and benchmarking, the backend supports deterministic settings:
- inference.temperature = 0: sent explicitly in the request body as "temperature": 0
- inference.seed: optional; when configured, sent as "seed": N in the request body; omitted when not set
- Same input + same config = identical request body (for reproducibility)
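A sketch of how these settings could map onto the request body; nlohmann::json, the toJson helper, and the optional seed field on the config are assumptions, since the structs above do not show them:
// Sketch: deterministic request body construction.
nlohmann::json body = {
    {"model",       config.model},
    {"messages",    toJson(request.messages)},   // hypothetical helper
    {"stream",      false},
    {"temperature", 0},                          // sent explicitly, even at zero
    {"max_tokens",  request.max_tokens}
};
if (config.seed.has_value())
    body["seed"] = *config.seed;                 // omitted entirely when not set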
Transcription (whisper.cpp)
The transcribe() method sends audio data to a whisper.cpp backend at its /inference endpoint. This uses a different wire protocol from chat: multipart/binary body instead of JSON. The response is a JSON object with a text field.
For audio eval pipelines, the caller chains transcribe() then chat(): transcribe audio, then feed the resulting text into an LLM prompt.
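A sketch of that chain; loadWavBytes is a hypothetical helper and the lambda capture is simplified for brevity:
// Sketch: transcribe audio, then prompt the LLM with the transcript.
TranscriptionRequest audio;
audio.audio_data = loadWavBytes("take1.wav");    // hypothetical helper
audio.format = "wav";

client.transcribe(audio, [&client](const TranscriptionResponse& tr) {
    if (tr.error) return;                        // surface tr.error_message in real code
    ChatRequest follow_up;
    follow_up.messages = {
        {"system", "You are a music analysis assistant."},
        {"user",   "Describe this performance: " + tr.text}
    };
    client.chat(follow_up,
                [](const std::string&) { /* stream tokens to the UI */ },
                [](const ChatResponse&) { /* done */ });
});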