Inference Backend

Client interface, OpenAI-compatible API translation, streaming and structured response modes, per-request metrics, and transcription support.

The inference backend presents a simple C++ function call interface to the rest of the plugin. Callers (UI thread, audio thread) make ordinary function calls. The backend translates these into OpenAI-compatible HTTP API calls, sends them to the configured inference server, and delivers responses back to the caller, either as a stream of tokens (text) or as a complete response (structured output).

The caller never touches HTTP, JSON, or threading directly.

Call Flow

sequenceDiagram
    participant Caller as Caller (UI)
    participant Client as InferenceClient
    participant BG as Background Thread
    participant Server as Backend Server

    Caller->>Client: chat(request)
    Client-->>Caller: returns immediately
    Client->>BG: enqueue request
    BG->>Server: POST /v1/chat/completions
    Server->>BG: SSE tokens
    BG->>Client: token
    Client->>Caller: onToken("Hi")
    BG->>Client: token
    Client->>Caller: onToken(" there")
    BG->>Client: done
    Client->>Caller: onComplete(response)

Two Response Modes

Streaming (text output)

For conversational/text responses. The backend sets stream: true in the API call and delivers tokens incrementally via a callback as they arrive over SSE.

client.chat(request, onToken, onComplete);
  • onToken is called for each token as it arrives
  • onComplete is called once when the full response is finished
  • The call returns immediately (non-blocking)
  • Returns a RequestHandle for cancellation
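
A minimal usage sketch from the UI thread, assuming a configured client; the prompt text and the shared string used for display are illustrative:

// Stream a chat response into a string owned by the UI (sketch).
ChatRequest request;
request.messages = {
    {"system", "You are a helpful music assistant."},
    {"user",   "Suggest a chord progression in D minor."}
};

auto displayText = std::make_shared<std::string>();

RequestHandle handle = client.chat(
    request,
    [displayText](const std::string& token) {
        *displayText += token;                    // append each token as it arrives
    },
    [displayText](const ChatResponse& response) {
        if (response.error)
            return;                               // see Error Handling below
        // *displayText and response.full_response now hold the same text
    });
// chat() has already returned; handle can be passed to cancel() if needed.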

Complete (structured output)

For structured responses (JSON schema, MIDI data, analysis results). The backend sets response_format to a JSON schema and waits for the full response before delivering it.

client.complete(request, onComplete);
  • onComplete is called once with the full parsed response
  • No token-by-token streaming: the response arrives whole
  • The call returns immediately (non-blocking)
  • Returns a RequestHandle for cancellation
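
A minimal sketch, assuming the caller supplies its own JSON schema string; the schema contents and prompt are illustrative:

// Request structured output and cancel it if the user backs out (sketch).
StructuredRequest request;
request.messages = {
    {"system", "Return the requested notes as JSON."},
    {"user",   "Generate a four-note arpeggio."}
};
request.schema_name = "midi_notes";
request.json_schema = R"({
    "type": "object",
    "properties": { "notes": { "type": "array", "items": { "type": "integer" } } },
    "required": ["notes"]
})";

RequestHandle handle = client.complete(
    request,
    [](const StructuredResponse& response) {
        if (response.error || response.cancelled)
            return;                               // see Error Handling below
        // response.raw_json conforms to the schema above; parse it with
        // whatever JSON library the plugin already uses.
    });

// Later, e.g. if the user presses Stop before the response arrives:
client.cancel(handle);                            // onComplete fires with cancelled: true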

OpenAI-Compatible API Translation

All backends speak the same wire protocol. The InferenceClient translates C++ calls into OpenAI-compatible API requests:

Streaming request

POST /v1/chat/completions
{
    "model": "<configured model>",
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user message>"}
    ],
    "stream": true,
    "temperature": 0.7,
    "max_tokens": 512
}

Response: Server-Sent Events (SSE), each chunk containing a delta token.
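
For reference, a typical OpenAI-compatible SSE stream looks like the following; exact fields vary slightly between backends, but each data: line carries a delta with the next content fragment:

data: {"choices": [{"delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"choices": [{"delta": {"content": "Hi"}, "finish_reason": null}]}

data: {"choices": [{"delta": {"content": " there"}, "finish_reason": null}]}

data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}

data: [DONE]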

Structured output request

POST /v1/chat/completions
{
    "model": "<configured model>",
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user message>"}
    ],
    "stream": false,
    "temperature": 0.0,
    "max_tokens": 512,
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "<schema name>",
            "schema": { ... }
        }
    }
}

Response: Single JSON object with the complete response.
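
An abbreviated example of such a response, assuming the structured output is read from choices[0].message.content (the note values and token counts are illustrative):

{
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "{\"notes\": [62, 65, 69, 74]}"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": { "prompt_tokens": 42, "completion_tokens": 12, "total_tokens": 54 }
}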

Interface

Types

struct ChatMessage {
    std::string role;    // "system", "user", "assistant"
    std::string content;
};

struct ChatRequest {
    std::vector<ChatMessage> messages;
    float temperature = 0.7f;
    int max_tokens = 512;
};

struct StructuredRequest {
    std::vector<ChatMessage> messages;
    std::string schema_name;
    std::string json_schema;   // JSON schema as string
    float temperature = 0.0f;
    int max_tokens = 512;
};

struct ChatResponse {
    std::string full_response;
    bool cancelled = false;
    bool error = false;
    std::string error_message;
};

struct StructuredResponse {
    std::string raw_json;
    bool cancelled = false;
    bool error = false;
    std::string error_message;
};

struct TranscriptionRequest {
    std::vector<uint8_t> audio_data;
    std::string format;  // "wav", "flac", etc.
};

struct TranscriptionResponse {
    std::string text;
    bool error = false;
    std::string error_message;
};

struct RequestMetrics {
    double latency_ms = 0.0;
    double time_to_first_token_ms = 0.0;
    size_t tokens_generated = 0;
    double tokens_per_second = 0.0;
};

struct TokenUsage {
    size_t total_tokens = 0;
};

using RequestHandle = uint64_t;

InferenceClient

class InferenceClient {
public:
    RequestHandle chat(
        const ChatRequest& request,
        std::function<void(const std::string& token)> onToken,
        std::function<void(const ChatResponse&)> onComplete
    );

    RequestHandle complete(
        const StructuredRequest& request,
        std::function<void(const StructuredResponse&)> onComplete
    );

    void cancel(RequestHandle handle);

    void transcribe(
        const TranscriptionRequest& request,
        std::function<void(const TranscriptionResponse&)> onComplete
    );

    bool isInFlight(RequestHandle handle) const;

    void configure(const Config& config);

    RequestMetrics lastRequestMetrics() const;

    TokenUsage tokenUsage() const;
};

HttpClient (internal, mockable)

class HttpClient {
public:
    virtual ~HttpClient() = default;

    virtual HttpResponse send(const HttpRequest& request) = 0;

    virtual HttpResponse sendStreaming(
        const HttpRequest& request,
        std::function<void(const std::string& chunk)> onChunk,
        std::function<bool()> shouldCancel
    ) = 0;
};

In tests, a MockHttpClient replaces this interface to verify the request URL, headers, and body, and to simulate responses, SSE streams, and error conditions.
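
A minimal sketch of such a mock; the member names are illustrative and assume HttpRequest and HttpResponse are plain copyable structs:

class MockHttpClient : public HttpClient {
public:
    HttpRequest lastRequest;              // captured so tests can assert on URL, headers, body
    HttpResponse cannedResponse;          // what the test wants the "server" to return
    std::vector<std::string> sseChunks;   // chunks replayed for streaming tests

    HttpResponse send(const HttpRequest& request) override {
        lastRequest = request;
        return cannedResponse;
    }

    HttpResponse sendStreaming(
        const HttpRequest& request,
        std::function<void(const std::string& chunk)> onChunk,
        std::function<bool()> shouldCancel) override {
        lastRequest = request;
        for (const auto& chunk : sseChunks) {
            if (shouldCancel())           // exercise mid-stream cancellation paths
                break;
            onChunk(chunk);
        }
        return cannedResponse;
    }
};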

Audio Thread Access

The audio thread can read inference results (e.g., parsed MIDI from structured output) via a lock-free queue. It never calls chat() or complete() directly; those enqueue work on the background thread.

// Lock-free, called from audio thread only
std::optional<MidiResult> InferenceClient::readMidiResult();
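
A sketch of how the audio callback might drain that queue each block; AudioEngine, MidiBuffer, and applyToBuffer are illustrative names, not part of the interface above:

// Real-time audio callback: poll for finished inference results without blocking.
void AudioEngine::processBlock(MidiBuffer& midiOut) {
    while (auto result = inferenceClient.readMidiResult()) {
        applyToBuffer(*result, midiOut);  // e.g. schedule the generated notes
    }
}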

Error Handling

Scenario                    Behavior
Backend not running         onComplete called with error: true, message: "Connection refused"
Backend timeout             onComplete called with error: true, message: "Request timed out"
Invalid JSON response       onComplete called with error: true, message: "Failed to parse response"
SSE stream interrupted      onComplete called with partial response + error: true
Request cancelled           onComplete called with cancelled: true
Schema validation failure   onComplete called with error: true, message: "Response did not match schema"
Transcription failure       onComplete called with error: true, relevant message

Errors are always delivered via the callback. They never throw, crash, or block.
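
A caller therefore handles all three terminal states in one place inside its onComplete callback; a sketch:

[](const ChatResponse& response) {
    if (response.cancelled) {
        // user abandoned the request: discard quietly
    } else if (response.error) {
        // surface response.error_message (e.g. "Connection refused") in the UI
    } else {
        // success: use response.full_response
    }
}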

Per-Request Metrics

Every completed request (streaming or structured) records timing metrics accessible via lastRequestMetrics():

Metric                   Description
latency_ms               Wall-clock time from request start to completion
time_to_first_token_ms   Time from request start to first streamed token (streaming only)
tokens_generated         Count of token chunks received
tokens_per_second        tokens_generated / (latency_ms / 1000); 0 on error

Metrics reset at the start of each new request. Before any request, all values are zero.
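
A sketch of reading them once a request finishes, assuming the metrics are up to date by the time onComplete fires (the logging call is illustrative):

client.chat(
    request,
    [](const std::string&) { /* tokens handled as in the streaming example */ },
    [&client](const ChatResponse& response) {
        if (!response.error && !response.cancelled) {
            const RequestMetrics m = client.lastRequestMetrics();
            // m.tokens_per_second == m.tokens_generated / (m.latency_ms / 1000.0)
            logMetrics(m.latency_ms, m.time_to_first_token_ms,
                       m.tokens_generated, m.tokens_per_second);
        }
    });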

Deterministic Inference

For evaluation and benchmarking, the backend supports deterministic settings:

  • inference.temperature = 0: sent explicitly in the request body as "temperature": 0
  • inference.seed: optional; when configured, sent as "seed": N in the request body; omitted when not set
  • Same input + same config = identical request body (for reproducibility)
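
For example, a structured request issued with temperature 0 and a configured seed of 42 (the seed value is illustrative) would carry a body like:

{
    "model": "<configured model>",
    "messages": [ ... ],
    "stream": false,
    "temperature": 0,
    "max_tokens": 512,
    "seed": 42,
    "response_format": { ... }
}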

Transcription (whisper.cpp)

The transcribe() method sends audio data to a whisper.cpp backend at its /inference endpoint. This uses a different wire protocol from chat: the audio is uploaded as a multipart/binary body rather than a JSON payload. The response is a JSON object with a text field.

For audio eval pipelines, the caller chains transcribe() then chat(): transcribe audio, then feed the resulting text into an LLM prompt.
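
A sketch of that chain; loadWavBytes and the prompt wording are illustrative:

// Transcribe a clip, then feed the transcript into a chat request (sketch).
TranscriptionRequest audio;
audio.audio_data = loadWavBytes("take_03.wav");   // hypothetical helper returning raw file bytes
audio.format = "wav";

client.transcribe(audio, [&client](const TranscriptionResponse& t) {
    if (t.error)
        return;                                   // surface t.error_message instead

    ChatRequest followUp;
    followUp.messages = {
        {"system", "You analyse vocal takes."},
        {"user",   "Summarise this transcript: " + t.text}
    };
    client.chat(followUp,
                [](const std::string& token) { /* stream to the UI */ },
                [](const ChatResponse&) { /* done */ });
});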
