Choosing Models¶
One of Quincy's key design decisions is that each agent can use a different model. The orchestrator might run on Claude while a file-reading sub-agent uses a fast 3B local model. Picking the right model for each job is how you get the best balance of speed, accuracy, and cost.
Model Preferences¶
Each agent has an ordered list of model preferences, configured through Quincy. Each entry specifies a provider and model identifier. The first viable entry wins at startup, so you can set up fallback chains — try a local model first, fall back to the cloud if it's unavailable.
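The "first viable entry wins" rule can be sketched as follows. This is an illustrative Python sketch, not Quincy's actual API — the entry fields and the availability check are assumptions:

```python
# Hypothetical sketch of first-viable model selection; the preference
# entry shape and the reachability check are illustrative, not Quincy's API.
def pick_model(preferences, reachable_providers):
    """Return the first preference whose provider is currently reachable."""
    for pref in preferences:
        if pref["provider"] in reachable_providers:
            return pref
    raise RuntimeError("no viable model for this agent")

# A fallback chain: try the local server first, fall back to the cloud.
prefs = [
    {"provider": "llamacpp", "model": "llama-3.1-8b-q4_k_m"},
    {"provider": "anthropic", "model": "claude-sonnet"},
]
pick_model(prefs, {"llamacpp", "anthropic"})  # local model wins
pick_model(prefs, {"anthropic"})              # local server down: cloud fallback
```

Because selection happens at startup, reordering the list is all it takes to flip an agent between local-first and cloud-first.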
Quincy supports three provider types:
- llama.cpp and OpenAI-compatible servers — Run models locally via llama.cpp, or point Quincy at any OpenAI-compatible inference server (Ollama, vLLM, LM Studio, Together, and others)
- Anthropic Claude — Cloud-hosted models for high-accuracy tasks. See Setting Up Anthropic
- Google Gemini — Google's cloud models via the Gemini API. See Setting Up Gemini
Recommended Setup¶
The most effective setup for Quincy is a hybrid configuration: local models for routine work, cloud models for accuracy-critical tasks.
| Agent | Model | Why |
|---|---|---|
| Orchestrator | Anthropic Claude (cloud) | Needs to understand complex requests, plan multi-step tasks, decide which sub-agent to use |
| General sub-agents | Llama 3.1 8B Q4_K_M (local) | Fast, capable enough for focused tasks with good system prompts |
| Specialized sub-agents | Llama 3.1 8B Q4_K_M (local) | Same model, different system prompt and tools |
| Reasoning-heavy sub-agents | Anthropic Claude (cloud) | When a sub-agent's task genuinely requires strong reasoning |
Why Hybrid?¶
Local models are:
- Private — Your data never leaves your machine
- Free — No per-token API costs
- Always available — No rate limits, no outages, works offline
Cloud models are:
- Faster — Purpose-built inference hardware typically outpaces consumer machines, especially at larger model sizes
- More accurate — Larger models with better training, better at complex reasoning
- Better at planning — The orchestrator benefits from a model that can break down ambiguous requests
- Larger context windows — Handle long conversations and big documents
Neither is strictly better. The right answer is both.
Getting Started¶
Start simple:
- Install llama.cpp and download an 8B Q4_K_M model (see Setting Up llama.cpp)
- Run with just the local model for a while — see where it works and where it falls short
- Add a cloud provider — Anthropic or Gemini — and assign it to the orchestrator
- Tune from there — move specific agents to cloud if they need better accuracy, or to smaller local models if they're doing simple work
The model preference fallback system makes this easy to iterate on. Change an agent's config and see how it performs — Quincy picks up config changes automatically.
Model Sizes¶
Parameters are the numbers inside a language model that determine how it responds — more parameters generally means more capable but slower and hungrier for memory. Models are grouped by parameter count into rough size tiers:
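A quick way to reason about these tiers is back-of-envelope memory arithmetic: parameter count times bits per weight. The bits-per-weight figure below (~4.8 for Q4_K_M) is an approximate average, and real files vary slightly by model:

```python
# Rough memory footprint: parameters (billions) x bits per weight / 8
# gives gigabytes. The 4.8 bits/weight for Q4_K_M is an approximation.
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

approx_size_gb(3, 4.8)   # 3B at Q4  -> ~1.8 GB
approx_size_gb(8, 4.8)   # 8B at Q4  -> ~4.8 GB
approx_size_gb(70, 4.8)  # 70B at Q4 -> ~42 GB
```

These estimates line up with the tier descriptions below; add a gigabyte or two of headroom for the KV cache and runtime overhead.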
Small Models (1-3B parameters)¶
Examples: Llama 3.2 1B/3B, Phi-3 Mini, Qwen2.5 3B
Pros:
- Extremely fast inference, even on older hardware
- Minimal memory usage (~1-2 GB)
- Good for highly focused, single-purpose tasks
Cons:
- Limited reasoning and instruction-following
- Struggle with ambiguous or multi-step requests
- More likely to produce incorrect or nonsensical output
Best for: Simple extraction tasks, formatting, narrow classification — agents with a tight system prompt and a single, well-defined job.
Medium Models (7-8B parameters)¶
Examples: Llama 3.1 8B, Mistral 7B, Qwen2.5 7B
Pros:
- Good balance of speed and capability
- Fits comfortably in memory on most Apple Silicon Macs (~4-6 GB with Q4 quantization)
- Competent at instruction-following, tool calling, and structured output
Cons:
- Can still struggle with complex multi-step reasoning
- Not as reliable for nuanced tasks as larger models
Best for: Most sub-agent work. This is the recommended starting point for Quincy — an 8B model with Q4_K_M quantization is the sweet spot for the majority of agent tasks.
Large Models (13B+ parameters)¶
Examples: Llama 3.1 70B, Mixtral 8x7B, Qwen2.5 32B/72B
Pros:
- Significantly better reasoning, instruction-following, and accuracy
- Handle ambiguity and complex tasks more reliably
Cons:
- Much slower inference
- High memory requirements (a 70B Q4 model needs ~40 GB)
- May not fit on all machines without significant quantization
- Diminishing returns for simple tasks — a 70B model reading a file is overkill
Best for: The orchestrator (if running locally), complex planning tasks, or situations where accuracy is critical and you don't want to use a cloud model.
Advanced: Quantization¶
Quantization reduces a model's file size and memory usage by storing its parameters with lower numerical precision. The trade-off is a small loss in output quality — but for most practical purposes, quantized models perform nearly as well as their full-precision counterparts.
GGUF models come in various quantization levels:
| Quantization | Size (vs. 16-bit original) | Quality Impact |
|---|---|---|
| Q8_0 | ~50% | Negligible quality loss |
| Q6_K | ~40% | Very minor quality loss |
| Q5_K_M | ~35% | Minor quality loss |
| Q4_K_M | ~30% | Good quality, recommended default |
| Q3_K_M | ~25% | Noticeable quality loss |
| Q2_K | ~20% | Significant quality loss |
Q4_K_M is the recommended default. It's the best balance of file size, inference speed, and output quality. Go to Q5_K_M or Q6_K if you have the memory and want slightly better quality. Avoid Q2/Q3 unless you're very memory-constrained.
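The percentages in the table fall out of simple arithmetic against the 16-bit original. The bits-per-weight averages below are rough assumptions (actual GGUF sizes vary by model architecture), but they show where the numbers come from:

```python
# Approximate average bits per weight for common GGUF quant levels,
# with the 16-bit unquantized model as the baseline. Rough figures only.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.2,
}

for quant, bits in BITS_PER_WEIGHT.items():
    size_gb = 8 * bits / 8                # for an 8B-parameter model
    pct_of_fp16 = 100 * bits / 16
    print(f"{quant:7s} ~{size_gb:.1f} GB ({pct_of_fp16:.0f}% of FP16)")
```

For an 8B model this puts Q4_K_M at roughly 4.8 GB versus 16 GB unquantized — which is why it fits so comfortably on consumer hardware.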