Inference Backend

Client interface, OpenAI-compatible API translation, streaming and structured response modes, per-request metrics, and transcription support.

The inference backend presents a simple C++ function call interface to the rest of the plugin. Callers (UI thread, audio thread) make ordinary function calls. The backend translates these into OpenAI-compatible HTTP API calls, sends them to the configured inference server, and delivers responses back to the caller, either as a stream of tokens (text) or as a complete response (structured output).

The caller never touches HTTP, JSON, or threading directly.

Call Flow

sequenceDiagram
    participant Caller as Caller (UI)
    participant Client as InferenceClient
    participant BG as Background Thread
    participant Server as Backend Server

    Caller->>Client: chat(request)
    Client-->>Caller: returns immediately
    Client->>BG: enqueue request
    BG->>Server: POST /v1/chat/completions
    Server->>BG: SSE tokens
    BG->>Client: token
    Client->>Caller: onToken("Hi")
    BG->>Client: token
    Client->>Caller: onToken(" there")
    BG->>Client: done
    Client->>Caller: onComplete(response)

Two Response Modes

Streaming (text output)

For conversational/text responses. The backend sets stream: true in the API call and delivers tokens incrementally via a callback as they arrive over SSE.

client.chat(request, onToken, onComplete);
  • onToken is called for each token as it arrives
  • onComplete is called once when the full response is finished
  • The call returns immediately (non-blocking)
  • Returns a RequestHandle for cancellation
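
A minimal usage sketch from the UI thread, assuming a configured client; the prompt text and the shared string used for display are illustrative:

// Stream a chat response into a string owned by the UI (sketch).
ChatRequest request;
request.messages = {
    {"system", "You are a helpful music assistant."},
    {"user",   "Suggest a chord progression in D minor."}
};

auto displayText = std::make_shared<std::string>();

RequestHandle handle = client.chat(
    request,
    [displayText](const std::string& token) {
        *displayText += token;                    // append each token as it arrives
    },
    [displayText](const ChatResponse& response) {
        if (response.error)
            return;                               // see Error Handling below
        // *displayText and response.full_response now hold the same text
    });
// chat() has already returned; handle can be passed to cancel() if needed.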

Complete (structured output)

For structured responses (JSON schema, MIDI data, analysis results). The backend sets response_format to a JSON schema and waits for the full response before delivering it.

client.complete(request, onComplete);
  • onComplete is called once with the full parsed response
  • No token-by-token streaming: the response arrives whole
  • The call returns immediately (non-blocking)
  • Returns a RequestHandle for cancellation
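
A minimal sketch, assuming the caller supplies its own JSON schema string; the schema contents and prompt are illustrative:

// Request structured output and cancel it if the user backs out (sketch).
StructuredRequest request;
request.messages = {
    {"system", "Return the requested notes as JSON."},
    {"user",   "Generate a four-note arpeggio."}
};
request.schema_name = "midi_notes";
request.json_schema = R"({
    "type": "object",
    "properties": { "notes": { "type": "array", "items": { "type": "integer" } } },
    "required": ["notes"]
})";

RequestHandle handle = client.complete(
    request,
    [](const StructuredResponse& response) {
        if (response.error || response.cancelled)
            return;                               // see Error Handling below
        // response.raw_json conforms to the schema above; parse it with
        // whatever JSON library the plugin already uses.
    });

// Later, e.g. if the user presses Stop before the response arrives:
client.cancel(handle);                            // onComplete fires with cancelled: true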

OpenAI-Compatible API Translation

All backends speak the same wire protocol. The InferenceClient translates C++ calls into OpenAI-compatible API requests:

Streaming request

POST /v1/chat/completions
{
    "model": "<configured model>",
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user message>"}
    ],
    "stream": true,
    "temperature": 0.7,
    "max_tokens": 512
}

Response: Server-Sent Events (SSE), each chunk containing a delta token.
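
For reference, a typical OpenAI-compatible SSE stream looks like the following; exact fields vary slightly between backends, but each data: line carries a delta with the next content fragment:

data: {"choices": [{"delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"choices": [{"delta": {"content": "Hi"}, "finish_reason": null}]}

data: {"choices": [{"delta": {"content": " there"}, "finish_reason": null}]}

data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}

data: [DONE]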

Structured output request

POST /v1/chat/completions
{
    "model": "<configured model>",
    "messages": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user message>"}
    ],
    "stream": false,
    "temperature": 0.0,
    "max_tokens": 512,
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "<schema name>",
            "schema": { ... }
        }
    }
}

Response: Single JSON object with the complete response.
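
An abbreviated example of such a response, assuming the structured output is read from choices[0].message.content (the note values and token counts are illustrative):

{
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "{\"notes\": [62, 65, 69, 74]}"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": { "prompt_tokens": 42, "completion_tokens": 12, "total_tokens": 54 }
}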

Interface

Types

struct ChatMessage {
    std::string role;    // "system", "user", "assistant"
    std::string content;
};

struct ChatRequest {
    std::vector<ChatMessage> messages;
    float temperature = 0.7f;
    int max_tokens = 512;
};

struct StructuredRequest {
    std::vector<ChatMessage> messages;
    std::string schema_name;
    std::string json_schema;   // JSON schema as string
    float temperature = 0.0f;
    int max_tokens = 512;
};

struct ChatResponse {
    std::string full_response;
    bool cancelled = false;
    bool error = false;
    std::string error_message;
};

struct StructuredResponse {
    std::string raw_json;
    bool cancelled = false;
    bool error = false;
    std::string error_message;
};

struct TranscriptionRequest {
    std::vector<uint8_t> audio_data;
    std::string format;  // "wav", "flac", etc.
};

struct TranscriptionResponse {
    std::string text;
    bool error = false;
    std::string error_message;
};

struct RequestMetrics {
    double latency_ms = 0.0;
    double time_to_first_token_ms = 0.0;
    size_t tokens_generated = 0;
    double tokens_per_second = 0.0;
};

struct TokenUsage {
    size_t total_tokens = 0;
};

using RequestHandle = uint64_t;

InferenceClient

class InferenceClient {
public:
    RequestHandle chat(
        const ChatRequest& request,
        std::function<void(const std::string& token)> onToken,
        std::function<void(const ChatResponse&)> onComplete
    );

    RequestHandle complete(
        const StructuredRequest& request,
        std::function<void(const StructuredResponse&)> onComplete
    );

    void cancel(RequestHandle handle);

    void transcribe(
        const TranscriptionRequest& request,
        std::function<void(const TranscriptionResponse&)> onComplete
    );

    bool isInFlight(RequestHandle handle) const;

    void configure(const Config& config);

    RequestMetrics lastRequestMetrics() const;

    TokenUsage tokenUsage() const;
};

HttpClient (internal, mockable)

class HttpClient {
public:
    virtual ~HttpClient() = default;

    virtual HttpResponse send(const HttpRequest& request) = 0;

    virtual HttpResponse sendStreaming(
        const HttpRequest& request,
        std::function<void(const std::string& chunk)> onChunk,
        std::function<bool()> shouldCancel
    ) = 0;
};

In tests, a MockHttpClient replaces this interface to verify the request URL, headers, and body, and to simulate responses, SSE streams, and error conditions.
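
A minimal sketch of such a mock; the member names are illustrative and assume HttpRequest and HttpResponse are plain copyable structs:

class MockHttpClient : public HttpClient {
public:
    HttpRequest lastRequest;              // captured so tests can assert on URL, headers, body
    HttpResponse cannedResponse;          // what the test wants the "server" to return
    std::vector<std::string> sseChunks;   // chunks replayed for streaming tests

    HttpResponse send(const HttpRequest& request) override {
        lastRequest = request;
        return cannedResponse;
    }

    HttpResponse sendStreaming(
        const HttpRequest& request,
        std::function<void(const std::string& chunk)> onChunk,
        std::function<bool()> shouldCancel) override {
        lastRequest = request;
        for (const auto& chunk : sseChunks) {
            if (shouldCancel())           // exercise mid-stream cancellation paths
                break;
            onChunk(chunk);
        }
        return cannedResponse;
    }
};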

Audio Thread Access

The audio thread can read inference results (e.g., parsed MIDI from structured output) via a lock-free queue. It never calls chat() or complete() directly; those enqueue work on the background thread.

// Lock-free, called from audio thread only
std::optional<MidiResult> InferenceClient::readMidiResult();
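
A sketch of how the audio callback might drain that queue each block; AudioEngine, MidiBuffer, and applyToBuffer are illustrative names, not part of the interface above:

// Real-time audio callback: poll for finished inference results without blocking.
void AudioEngine::processBlock(MidiBuffer& midiOut) {
    while (auto result = inferenceClient.readMidiResult()) {
        applyToBuffer(*result, midiOut);  // e.g. schedule the generated notes
    }
}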

Error Handling

Scenario                    Behavior
Backend not running         onComplete called with error: true, message: "Connection refused"
Backend timeout             onComplete called with error: true, message: "Request timed out"
Invalid JSON response       onComplete called with error: true, message: "Failed to parse response"
SSE stream interrupted      onComplete called with partial response + error: true
Request cancelled           onComplete called with cancelled: true
Schema validation failure   onComplete called with error: true, message: "Response did not match schema"
Transcription failure       onComplete called with error: true, relevant message

Errors are always delivered via the callback. They never throw, crash, or block.
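
A caller therefore handles all three terminal states in one place inside its onComplete callback; a sketch:

[](const ChatResponse& response) {
    if (response.cancelled) {
        // user abandoned the request: discard quietly
    } else if (response.error) {
        // surface response.error_message (e.g. "Connection refused") in the UI
    } else {
        // success: use response.full_response
    }
}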

Per-Request Metrics

Every completed request (streaming or structured) records timing metrics accessible via lastRequestMetrics():

Metric                   Description
latency_ms               Wall-clock time from request start to completion
time_to_first_token_ms   Time from request start to first streamed token (streaming only)
tokens_generated         Count of token chunks received
tokens_per_second        tokens_generated / (latency_ms / 1000); 0 on error

Metrics reset at the start of each new request. Before any request, all values are zero.
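
A sketch of reading them once a request finishes, assuming the metrics are up to date by the time onComplete fires (the logging call is illustrative):

client.chat(
    request,
    [](const std::string&) { /* tokens handled as in the streaming example */ },
    [&client](const ChatResponse& response) {
        if (!response.error && !response.cancelled) {
            const RequestMetrics m = client.lastRequestMetrics();
            // m.tokens_per_second == m.tokens_generated / (m.latency_ms / 1000.0)
            logMetrics(m.latency_ms, m.time_to_first_token_ms,
                       m.tokens_generated, m.tokens_per_second);
        }
    });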

Deterministic Inference

For evaluation and benchmarking, the backend supports deterministic settings:

  • inference.temperature = 0: sent explicitly in the request body as "temperature": 0
  • inference.seed: optional; when configured, sent as "seed": N in the request body; omitted when not set
  • Same input + same config = identical request body (for reproducibility)
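
For example, a structured request issued with temperature 0 and a configured seed of 42 (the seed value is illustrative) would carry a body like:

{
    "model": "<configured model>",
    "messages": [ ... ],
    "stream": false,
    "temperature": 0,
    "max_tokens": 512,
    "seed": 42,
    "response_format": { ... }
}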

Transcription (whisper.cpp)

The transcribe() method sends audio data to a whisper.cpp backend at its /inference endpoint. This uses a different wire protocol from chat: the audio is uploaded as a multipart/binary body rather than a JSON payload. The response is a JSON object with a text field.

For audio eval pipelines, the caller chains transcribe() then chat(): transcribe audio, then feed the resulting text into an LLM prompt.
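
A sketch of that chain; loadWavBytes and the prompt wording are illustrative:

// Transcribe a clip, then feed the transcript into a chat request (sketch).
TranscriptionRequest audio;
audio.audio_data = loadWavBytes("take_03.wav");   // hypothetical helper returning raw file bytes
audio.format = "wav";

client.transcribe(audio, [&client](const TranscriptionResponse& t) {
    if (t.error)
        return;                                   // surface t.error_message instead

    ChatRequest followUp;
    followUp.messages = {
        {"system", "You analyse vocal takes."},
        {"user",   "Summarise this transcript: " + t.text}
    };
    client.chat(followUp,
                [](const std::string& token) { /* stream to the UI */ },
                [](const ChatResponse&) { /* done */ });
});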
