Setting Up llama.cpp¶
Quincy uses the llama.cpp server to run language models locally on your Mac. This gives you fast, private inference with no API costs — and Quincy manages the server process for you.
macOS app users
The Quincy macOS app includes a bundled copy of llama-server — you don't need to install it separately. This page is for Linux server setups, or if you want to use a newer or customised build.
Alternatives to llama.cpp¶
You don't have to use llama.cpp. Quincy works with any OpenAI-compatible inference server. If you'd prefer Ollama, vLLM, LM Studio, or any other server that speaks the OpenAI chat completions format, point Quincy at it during onboarding and it will work the same way.
llama.cpp is the default because it runs well on Apple Silicon and requires no extra setup on macOS. But the choice is yours.
Install llama.cpp¶
The easiest way to install is via Homebrew:
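```shell
brew install llama.cpp
```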
This installs the llama-server binary (along with other llama.cpp tools) to /opt/homebrew/bin/llama-server.
How Quincy Manages the llama.cpp Server¶
You don't need to start or stop llama-server manually. Quincy handles the full lifecycle:
- Auto-start: When you run a command that needs a local model, Quincy spawns `llama-server` automatically if it isn't already running
- Idle shutdown: After a configurable period of inactivity (default: 30 minutes), the model is unloaded to free memory
- GPU acceleration: By default, Quincy offloads all model layers to the GPU via Metal, which is dramatically faster than CPU inference on Apple Silicon
- Context size: The context window defaults to 32,768 tokens — enough for most agent interactions
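These defaults map onto standard `llama-server` flags. As a rough sketch of the equivalent manual invocation (the model path is illustrative, not something Quincy requires):

```shell
# Offload all layers to the GPU via Metal and use a 32,768-token context.
# The model path below is an example — substitute your own GGUF file.
llama-server \
  --model ~/Models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768
```

You never need to run this yourself; it's shown only to make the defaults concrete.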
Download a Model¶
You'll need at least one GGUF model file. During onboarding, Quincy will ask you to select a model file or pick from models it discovers on your system.
If you're comfortable with the terminal and want to download a model manually, a good starting point is an 8B parameter model with Q4_K_M quantization:
```shell
# Install the Hugging Face command-line tool
pip3 install huggingface-hub

# Download Meta-Llama-3.1-8B-Instruct (Q4_K_M quantization)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ~/Models
```
Router Mode¶
Quincy runs the llama.cpp server in router mode, which means a single server process can serve multiple models. This is what makes the sub-agent architecture work — the orchestrator might use a large, capable model while a specialist sub-agent uses a smaller, faster one, and they all share the same server instance.
Each agent's config specifies which model it prefers. When an agent makes a request, it includes the model name and the llama.cpp server loads it on demand.
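Concretely, model selection happens through the standard OpenAI chat-completions payload. Assuming the server is listening on llama.cpp's default port (8080), a request like this one (with an illustrative model name) tells the router which model to use:

```shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct-Q4_K_M",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

Quincy builds these requests for you; the `model` field is what distinguishes one agent's traffic from another's on the shared server.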
Verify the Setup¶
After onboarding, you can verify everything is working by starting a chat and asking which model is active — the agent will tell you. You can also use built-in tools during a chat session:
- `list_models` — Shows all available models
- `current_model` — Shows the active model for this agent
- `use_model` — Temporarily switch to a different model
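You can also query the server directly. Assuming it's running on the default port, the OpenAI-compatible models endpoint lists what the server knows about:

```shell
curl http://127.0.0.1:8080/v1/models
```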
Next Steps¶
- Choose the right models for your agents — size, quantization, and the hybrid approach
- Understand the agent system to see how models are assigned to sub-agents
Advanced: Building from Source¶
If you need a custom build (e.g., for specific Metal optimizations or to use a newer version than Homebrew provides):
```shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
```
The binary will be at build/bin/llama-server. Point Quincy to it during onboarding or ask Quincy to update the llama-server path.