Server & Clients¶
Quincy is split into two parts: a server that does the thinking, and clients that you interact with.
The server runs the orchestrator, manages your agents, keeps the llama.cpp server alive, and holds your conversation sessions. The clients — the command-line interface, the macOS app, the iOS app — are lightweight frontends. They send your messages to the server and stream back the responses. All the AI work happens on the server side.
This split means you can talk to the same Quincy from multiple devices. Start a conversation on your Mac, pick it up from your iPhone. The agents, models, and session history all live on the server — clients just connect.
```mermaid
graph TB
    CLI[Command Line] --> Server[Quincy Server]
    Mac[macOS App] --> Server
    iOS[iOS App] --> Server
    Server --> Agents[Your Agents]
    Server --> LLM[llama.cpp server / Cloud Models]
    Server --> Sessions[Conversation Sessions]
```
The Quincy server manages all agents, models, and sessions. Clients are thin frontends that connect over the local network. Multiple clients can be connected simultaneously — what you approve on your iPhone is reflected on your Mac.
```mermaid
graph TD
    subgraph Server
        Agents[Agents & Models]
        Sessions[Sessions]
        Scheduler[Request Scheduler]
    end
    Mac[macOS App] <--> Server
    iOS[iOS App] <--> Server
    CLI[CLI] <--> Server
    Mac -.->|Bonjour| Server
    iOS -.->|Bonjour| Server
```
Built-in vs. External Server¶
Quincy can run in two modes depending on your setup.
The Built-in Server (macOS)¶
When you use Quincy on a Mac, the server starts automatically. You don't need to think about it — Quincy handles launching, connecting, and shutting down the server process behind the scenes.
This is the simplest setup. Everything runs on one machine: the server, the agents, the local models (accelerated by your Mac's GPU via Metal). It's great for personal use — one person, one Mac.
The trade-off is that the server only runs while your Mac is awake. If you close the lid, the server stops. When you open it again, Quincy picks up where it left off.
An External Server (Linux or another Mac)¶
If you want Quincy available around the clock — or shared across multiple people — you can run quincy-server on a separate machine. This is typically a Linux box, but it can be any machine that stays on.
An external server is useful when:
- You want always-on availability — A headless server doesn't sleep. Your agents are reachable from any device, any time.
- You're sharing with a household or team — Multiple people can connect their own clients to the same server. Each person gets their own sessions, but they share the same agents and models.
- You have better hardware elsewhere — A desktop with a dedicated GPU or more RAM can run larger models than a laptop. Run the server on the beefier machine, connect from wherever you are.
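On Linux, you would typically keep quincy-server running under a process supervisor. The unit file below is a minimal sketch, assuming the binary is installed at /usr/local/bin/quincy-server and runs as a dedicated user — both are assumptions for illustration, not documented defaults:

```
[Unit]
Description=Quincy server
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/quincy-server
Restart=on-failure
User=quincy

[Install]
WantedBy=multi-user.target
```

With `Restart=on-failure`, the server comes back automatically after a crash, which is what makes the always-on setup hands-off.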
How Clients Find the Server¶
When a Quincy client starts, it looks for a running server automatically. You don't need to type in an address.
```mermaid
flowchart LR
    A[Start] --> B{Bonjour\nscan}
    B -->|Found one| G{Setup\nmode?}
    B -->|Found multiple| D[Pick one]
    D --> G
    B -->|Found none| E{Probe\nlocalhost}
    E -->|Found| G
    E -->|Not found| F[Start server\nautomatically]
    F --> G
    G -->|Yes| H[Run setup\nwizard]
    H --> C[Connect]
    G -->|No| C
```
- Bonjour discovery — Quincy scans the local network for servers advertising themselves. If exactly one is found, the client connects automatically. If multiple servers are found, Quincy asks you to pick one (local servers sort first).
- Port fallback — If Bonjour doesn't find anything, the client checks whether a server is already running on this machine by probing the default ports.
- Auto-start — If no server is found anywhere, and you're on macOS, Quincy starts one for you.
- Setup mode — A newly started (or unconfigured) server responds with a 503 status indicating it needs setup. The client detects this and drives the first-run onboarding wizard over the server's REST API before proceeding to a normal connection. See Getting Started for what the wizard covers.
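The port-fallback and setup-mode steps can be sketched in a few lines of Python. The port numbers and the endpoint path probed here are assumptions for illustration — Quincy's actual defaults aren't documented in this section:

```python
import socket
from http.client import HTTPConnection

# Hypothetical default ports — the real values aren't documented here.
DEFAULT_PORTS = [8787, 8788]

def probe_localhost(ports=DEFAULT_PORTS, timeout=0.5):
    """Return the first port with a listening server on this machine, or None."""
    for port in ports:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=timeout):
                return port
        except OSError:
            continue  # nothing listening on this port, try the next
    return None

def needs_setup(host, port):
    """An unconfigured server answers 503, which tells the client to run
    the setup wizard first (the probed path is an assumption)."""
    conn = HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status == 503
    finally:
        conn.close()
```

The short per-port timeout is what keeps the whole discovery sequence within the roughly two-second budget mentioned below.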
This means the typical experience is: you launch Quincy and it just works. If the server hasn't been configured yet, the setup wizard runs automatically first. The discovery sequence runs in about two seconds.
Explicit Server URL¶
If you want to bypass discovery entirely — say, to connect to a specific remote server — pass the URL directly:
This skips Bonjour and port probing and connects straight to the given address.
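The exact flag is not documented in this section; assuming a hypothetical `--server-url` option, the invocation might look like:

```
# Flag name and address are illustrative, not confirmed Quincy syntax.
quincy --server-url http://192.168.1.50:8787
```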
What the Server Manages¶
The server is where the real work happens:
- Agents — The orchestrator and all sub-agents run on the server. When you send a message, the server's orchestrator decides how to handle it and delegates to the right sub-agent. See How the Agent System Works for details.
- Models — The server manages the llama.cpp server (including auto-start, idle shutdown, and GPU offloading) and connects to cloud providers. See Setting Up llama.cpp and Setting Up Anthropic.
- Sessions — Your conversation history lives on the server. Multiple clients can access the same sessions. Sessions can be created, listed, reset, and destroyed via the REST API. Each session tracks metadata including subscriber count, last activity, and an auto-generated title.
- Session titles — After a few exchanges, the server automatically generates a short title summarizing the conversation topic (e.g., "Email agent setup" or "Tax filing questions"). Clients receive a `sessionTitleChanged` event and update their UI. Title generation runs as a background housekeeping task using a lightweight model.
- Request scheduling — LLM requests are dispatched through a priority-based scheduler. Interactive chat gets the highest priority, background tasks (like memory extraction) run when capacity is available, and idle tasks (cron jobs, sweeps) run at the lowest priority. Per-provider concurrency limits prevent overloading any single backend.
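The scheduling behavior described above can be sketched as a priority queue with a per-provider concurrency cap. This is an illustrative model, not Quincy's implementation — the priority values and limits are assumptions:

```python
import heapq
import itertools

# Priority levels as described above; numeric values are illustrative.
INTERACTIVE, BACKGROUND, IDLE = 0, 1, 2

class RequestScheduler:
    """Sketch of a priority-based dispatcher with per-provider limits."""

    def __init__(self, provider_limits):
        self.provider_limits = provider_limits   # e.g. {"llama.cpp": 1}
        self.in_flight = {p: 0 for p in provider_limits}
        self.queue = []                          # (priority, seq, provider, request)
        self.seq = itertools.count()             # FIFO tie-break within a priority

    def submit(self, priority, provider, request):
        heapq.heappush(self.queue, (priority, next(self.seq), provider, request))

    def next_request(self):
        """Pop the highest-priority request whose provider has spare capacity."""
        skipped, result = [], None
        while self.queue:
            item = heapq.heappop(self.queue)
            provider = item[2]
            if self.in_flight[provider] < self.provider_limits[provider]:
                self.in_flight[provider] += 1
                result = item
                break
            skipped.append(item)                 # provider saturated, look further
        for item in skipped:                     # requeue anything we skipped over
            heapq.heappush(self.queue, item)
        return result

    def complete(self, provider):
        self.in_flight[provider] -= 1
```

The key property: an idle-priority cron job queued first still yields to an interactive chat message that arrives later, and a saturated provider never blocks requests destined for other providers.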
Provider fallback: If a cloud provider is temporarily unavailable, Quincy automatically retries with backoff and can fall back to an alternative provider. This means a brief Anthropic outage won't stop your workflow if you also have a local model or Gemini configured. The fallback is automatic — you don't need to configure anything beyond having multiple providers set up.
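The retry-with-backoff-then-fall-back behavior can be sketched like this. The function signature, retry counts, and delays are illustrative assumptions, not Quincy's actual API:

```python
import time

def call_with_fallback(providers, prompt, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try each provider in order; retry transient failures with exponential
    backoff before moving to the next provider in the list."""
    last_error = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)
            except Exception as err:
                last_error = err
                if attempt < retries:
                    sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("all providers failed") from last_error
```

Ordering the list is the only "configuration": put your preferred provider first and the fallback kicks in automatically when it errors out.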
Clients are intentionally thin. They handle input/output and rendering, but they don't run agents, load models, or store conversation history. If you switch from the CLI to the macOS app, you're talking to the same server with the same state.
Cross-Device Approval¶
When an agent needs your approval to run a tool (like sending an email or modifying a file), the approval request is routed to whichever client is connected. If you started a task from your Mac and step away, you can approve or deny the request from your iPhone — the request follows you across devices. You'll see the tool name, what it wants to do, and a summary of the arguments so you can make an informed decision.
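Conceptually, this is a broadcast with first-answer-wins semantics. A minimal sketch, with class and method names invented for illustration:

```python
class ApprovalRouter:
    """Sketch of cross-device approval: broadcast the request to every
    connected client; the first answer settles it."""

    def __init__(self):
        self.clients = []      # callables: client(request_dict) -> None
        self.pending = {}      # request_id -> decision (None while unanswered)

    def connect(self, client):
        self.clients.append(client)

    def request_approval(self, request_id, tool, summary):
        self.pending[request_id] = None
        for client in self.clients:         # every connected device sees the prompt
            client({"id": request_id, "tool": tool, "summary": summary})

    def respond(self, request_id, approved):
        if self.pending.get(request_id) is None:  # first answer wins
            self.pending[request_id] = approved
        return self.pending[request_id]
```

Because the server holds the pending request rather than any one client, it doesn't matter which device answers — the decision is recorded once and reflected everywhere.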