Choosing Models

One of Quincy's key design decisions is that each agent can use a different model. The orchestrator might run on Claude while a file-reading sub-agent uses a fast 3B local model. Picking the right model for each job is how you get the best balance of speed, accuracy, and cost.

Model Preferences

Each agent has an ordered list of model preferences, configured through Quincy. Each entry specifies a provider and model identifier. The first viable entry wins at startup, so you can set up fallback chains — try a local model first, fall back to the cloud if it's unavailable.
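The "first viable entry wins" behavior can be sketched as a probe-then-pick loop. This is an illustrative sketch, not Quincy's actual implementation: the `pick_model` helper, the tuple shape of a preference entry, and the `/v1/models` health probe are all assumptions.

```python
import urllib.request
import urllib.error

def server_reachable(base_url, timeout=2.0):
    """Probe an OpenAI-compatible server's /v1/models endpoint."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/v1/models",
                                    timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

def pick_model(preferences, probe=server_reachable):
    """Return the first viable (provider, model) pair.

    `preferences` is an ordered list of (provider, model, base_url)
    tuples; entries with base_url=None (cloud providers) are treated
    as always viable in this sketch.
    """
    for provider, model, base_url in preferences:
        if base_url is None or probe(base_url):
            return provider, model
    raise RuntimeError("no viable model in preference list")
```

With a chain like local-first-then-Claude, the local model wins whenever its server is up, and the cloud entry is used otherwise.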

Quincy supports three provider types:

  • llama.cpp and OpenAI-compatible servers — Run models locally via llama.cpp, or point Quincy at any OpenAI-compatible inference server (Ollama, vLLM, LM Studio, Together, and others)
  • Anthropic Claude — Cloud-hosted models for high-accuracy tasks. See Setting Up Anthropic
  • Google Gemini — Google's cloud models via the Gemini API. See Setting Up Gemini
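A preference chain touching all three provider types might look something like the following. This is a sketch only: the file layout, field names (`provider`, `model`, `base_url`), and model identifiers are placeholders, not Quincy's actual schema — consult Quincy's configuration reference for the real format.

```json
{
  "model_preferences": [
    { "provider": "llamacpp", "base_url": "http://localhost:8080", "model": "llama-3.1-8b-instruct-q4_k_m" },
    { "provider": "anthropic", "model": "claude" },
    { "provider": "gemini", "model": "gemini" }
  ]
}
```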

The most effective setup for Quincy is a hybrid configuration: local models for routine work, cloud models for accuracy-critical tasks.

Agent                        Model                         Why
Orchestrator                 Anthropic Claude (cloud)      Needs to understand complex requests, plan multi-step tasks, and decide which sub-agent to use
General sub-agents           Llama 3.1 8B Q4_K_M (local)   Fast, and capable enough for focused tasks with good system prompts
Specialized sub-agents       Llama 3.1 8B Q4_K_M (local)   Same model, different system prompt and tools
Reasoning-heavy sub-agents   Anthropic Claude (cloud)      When a sub-agent's task genuinely requires strong reasoning
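Expressed as per-agent preferences, the hybrid setup above might look like the sketch below (again, the schema and agent names are illustrative, not Quincy's real config format). Note how the local-first sub-agent lists a cloud fallback after its preferred model:

```json
{
  "agents": {
    "orchestrator": {
      "model_preferences": [
        { "provider": "anthropic", "model": "claude" }
      ]
    },
    "file-reader": {
      "model_preferences": [
        { "provider": "llamacpp", "base_url": "http://localhost:8080", "model": "llama-3.1-8b-instruct-q4_k_m" },
        { "provider": "anthropic", "model": "claude" }
      ]
    }
  }
}
```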

Why Hybrid?

Local models are:

  • Private — Your data never leaves your machine
  • Free — No per-token API costs
  • Always available — No rate limits, no outages, works offline

Cloud models are:

  • Faster at this scale — purpose-built inference hardware serves large models far faster than local machines can run comparable ones
  • More accurate — Larger models with better training, better at complex reasoning
  • Better at planning — The orchestrator benefits from a model that can break down ambiguous requests
  • Larger context windows — Handle long conversations and big documents

Neither is strictly better. The right answer is both.

Getting Started

Start simple:

  1. Install llama.cpp and download an 8B Q4_K_M model (see Setting Up llama.cpp)
  2. Run with just the local model for a while — see where it works and where it falls short
  3. Add a cloud provider — Anthropic or Gemini — and assign it to the orchestrator
  4. Tune from there — move specific agents to cloud if they need better accuracy, or to smaller local models if they're doing simple work

The model preference fallback system makes this easy to iterate on. Change an agent's config and see how it performs — Quincy picks up config changes automatically.

Model Sizes

Parameters are the numbers inside a language model that determine how it responds — more parameters generally means more capable but slower and hungrier for memory. Models are grouped by parameter count into rough size tiers:

Small Models (1-3B parameters)

Examples: Llama 3.2 1B/3B, Phi-3 Mini, Qwen2.5 3B

Pros:

  • Extremely fast inference, even on older hardware
  • Minimal memory usage (~1-2 GB)
  • Good for highly focused, single-purpose tasks

Cons:

  • Limited reasoning and instruction-following
  • Struggle with ambiguous or multi-step requests
  • More likely to produce incorrect or nonsensical output

Best for: Simple extraction tasks, formatting, narrow classification — agents with a tight system prompt and a single, well-defined job.

Medium Models (7-8B parameters)

Examples: Llama 3.1 8B, Mistral 7B, Qwen2.5 7B

Pros:

  • Good balance of speed and capability
  • Fits comfortably in memory on most Apple Silicon Macs (~4-6 GB with Q4 quantization)
  • Competent at instruction-following, tool calling, and structured output

Cons:

  • Can still struggle with complex multi-step reasoning
  • Not as reliable for nuanced tasks as larger models

Best for: Most sub-agent work. This is the recommended starting point for Quincy — an 8B model with Q4_K_M quantization is the sweet spot for the majority of agent tasks.

Large Models (13B+ parameters)

Examples: Llama 3.1 70B, Mixtral 8x7B, Qwen2.5 32B/72B

Pros:

  • Significantly better reasoning, instruction-following, and accuracy
  • Handle ambiguity and complex tasks more reliably

Cons:

  • Much slower inference
  • High memory requirements (a 70B Q4 model needs ~40 GB)
  • May not fit on all machines without significant quantization
  • Diminishing returns for simple tasks — a 70B model reading a file is overkill

Best for: The orchestrator (if running locally), complex planning tasks, or situations where accuracy is critical and you don't want to use a cloud model.
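The memory figures quoted in the tiers above follow from simple arithmetic: the weights alone take roughly parameters × bits-per-weight / 8 bytes. The ~4.8 bits-per-weight figure used below is a typical value for Q4_K_M, and the helper itself is illustrative:

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB.

    1e9 params * (bits/8) bytes each = params_billions * bits/8 GB.
    Runtime use adds overhead (KV cache, activations) on top of this.
    """
    return params_billions * bits_per_weight / 8

# Typical Q4_K_M is ~4.8 bits per weight:
for name, params in [("3B", 3), ("8B", 8), ("70B", 70)]:
    print(f"{name} @ Q4_K_M: ~{weights_gb(params, 4.8):.1f} GB")
```

This reproduces the figures in the tier descriptions: ~1.8 GB for a 3B model, ~4.8 GB for an 8B, and ~42 GB for a 70B, before any runtime overhead.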


Advanced: Quantization

Quantization reduces a model's file size and memory usage by storing its parameters with lower numerical precision. The trade-off is a small loss in output quality — but for most practical purposes, quantized models perform nearly as well as their full-precision counterparts.

GGUF models come in various quantization levels:

Quantization   Size (vs. FP16 original)   Quality Impact
Q8_0           ~50%                       Negligible quality loss
Q6_K           ~40%                       Very minor quality loss
Q5_K_M         ~35%                       Minor quality loss
Q4_K_M         ~30%                       Good quality; recommended default
Q3_K_M         ~25%                       Noticeable quality loss
Q2_K           ~20%                       Significant quality loss

Q4_K_M is the recommended default. It's the best balance of file size, inference speed, and output quality. Go to Q5_K_M or Q6_K if you have the memory and want slightly better quality. Avoid Q2/Q3 unless you're very memory-constrained.
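The size percentages in the table follow from each format's typical bits per weight relative to a 16-bit original. The bits-per-weight values below are typical, approximate figures for llama.cpp's quantization formats (exact sizes vary slightly by model architecture):

```python
# Approximate bits per weight for common GGUF quantization levels.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.4,
}

# Size relative to the FP16 original is just bpw / 16:
for quant, bpw in BITS_PER_WEIGHT.items():
    pct = 100 * bpw / BITS_PER_WEIGHT["F16"]
    print(f"{quant}: ~{pct:.0f}% of the FP16 original")
```

Q4_K_M at ~4.8 bits per weight works out to ~30% of the FP16 size, matching the table above.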