Choosing Models¶
One of Quincy's key design decisions is that each agent can use a different model. The orchestrator might run on Claude while a file-reading sub-agent uses a fast 3B local model. Picking the right model for each job is how you get the best balance of speed, accuracy, and cost.
Model Preferences¶
Each agent has an ordered list of model preferences, configured through Quincy. Each entry specifies a provider and model identifier. The first viable entry wins at startup, so you can set up fallback chains — try a local model first, fall back to the cloud if it's unavailable.
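The "first viable entry wins" rule can be sketched as follows. This is an illustrative Python sketch, not Quincy's actual API — the entry fields and the availability check are assumptions:

```python
# Hypothetical sketch of first-viable model selection; the preference
# entry shape and the reachability check are illustrative, not Quincy's API.
def pick_model(preferences, reachable_providers):
    """Return the first preference whose provider is currently reachable."""
    for pref in preferences:
        if pref["provider"] in reachable_providers:
            return pref
    raise RuntimeError("no viable model for this agent")

# A fallback chain: try the local server first, fall back to the cloud.
prefs = [
    {"provider": "llamacpp", "model": "llama-3.1-8b-q4_k_m"},
    {"provider": "anthropic", "model": "claude-sonnet"},
]
pick_model(prefs, {"llamacpp", "anthropic"})  # local model wins
pick_model(prefs, {"anthropic"})              # local server down: cloud fallback
```

Because selection happens at startup, reordering the list is all it takes to flip an agent between local-first and cloud-first.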
Quincy supports three provider types:
- llama.cpp and OpenAI-compatible servers — Run models locally via llama.cpp, or point Quincy at any OpenAI-compatible inference server (Ollama, vLLM, LM Studio, Together, and others)
- Anthropic Claude — Cloud-hosted models for high-accuracy tasks. See Setting Up Anthropic
- Google Gemini — Google's cloud models via the Gemini API. See Setting Up Gemini
Recommended Setup¶
The most effective setup for Quincy is a hybrid configuration: local models for routine work, cloud models for accuracy-critical tasks.
| Agent | Model | Why |
|---|---|---|
| Orchestrator | Anthropic Claude (cloud) | Needs to understand complex requests, plan multi-step tasks, decide which sub-agent to use |
| General sub-agents | Llama 3.1 8B Q4_K_M (local) | Fast, capable enough for focused tasks with good system prompts |
| Specialized sub-agents | Llama 3.1 8B Q4_K_M (local) | Same model, different system prompt and tools |
| Reasoning-heavy sub-agents | Anthropic Claude (cloud) | When a sub-agent's task genuinely requires strong reasoning |
Why Hybrid?¶
Local models are:
- Private — Your data never leaves your machine
- Free — No per-token API costs
- Always available — No rate limits, no outages, works offline
Cloud models are:
- Faster — Purpose-built inference hardware typically outpaces consumer machines, especially at larger model sizes
- More accurate — Larger models with better training, better at complex reasoning
- Better at planning — The orchestrator benefits from a model that can break down ambiguous requests
- Larger context windows — Handle long conversations and big documents
Neither is strictly better. The right answer is both.
Getting Started¶
Start simple:
- Install llama.cpp and download an 8B Q4_K_M model (see Setting Up llama.cpp)
- Run with just the local model for a while — see where it works and where it falls short
- Add a cloud provider — Anthropic or Gemini — and assign it to the orchestrator
- Tune from there — move specific agents to cloud if they need better accuracy, or to smaller local models if they're doing simple work
The model preference fallback system makes this easy to iterate on. Change an agent's config and see how it performs — Quincy picks up config changes automatically.
Model Sizes¶
Parameters are the numbers inside a language model that determine how it responds — more parameters generally means more capable but slower and hungrier for memory. Models are grouped by parameter count into rough size tiers:
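A quick way to reason about these tiers is back-of-envelope memory arithmetic: parameter count times bits per weight. The bits-per-weight figure below (~4.8 for Q4_K_M) is an approximate average, and real files vary slightly by model:

```python
# Rough memory footprint: parameters (billions) x bits per weight / 8
# gives gigabytes. The 4.8 bits/weight for Q4_K_M is an approximation.
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

approx_size_gb(3, 4.8)   # 3B at Q4  -> ~1.8 GB
approx_size_gb(8, 4.8)   # 8B at Q4  -> ~4.8 GB
approx_size_gb(70, 4.8)  # 70B at Q4 -> ~42 GB
```

These estimates line up with the tier descriptions below; add a gigabyte or two of headroom for the KV cache and runtime overhead.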
Small Models (1-3B parameters)¶
Examples: Llama 3.2 1B/3B, Phi-3 Mini, Qwen2.5 3B
Pros:
- Extremely fast inference, even on older hardware
- Minimal memory usage (~1-2 GB)
- Good for highly focused, single-purpose tasks
Cons:
- Limited reasoning and instruction-following
- Struggle with ambiguous or multi-step requests
- More likely to produce incorrect or nonsensical output
Best for: Simple extraction tasks, formatting, narrow classification — agents with a tight system prompt and a single, well-defined job.
Medium Models (7-8B parameters)¶
Examples: Llama 3.1 8B, Mistral 7B, Qwen2.5 7B
Pros:
- Good balance of speed and capability
- Fits comfortably in memory on most Apple Silicon Macs (~4-6 GB with Q4 quantization)
- Competent at instruction-following, tool calling, and structured output
Cons:
- Can still struggle with complex multi-step reasoning
- Not as reliable for nuanced tasks as larger models
Best for: Most sub-agent work. This is the recommended starting point for Quincy — an 8B model with Q4_K_M quantization is the sweet spot for the majority of agent tasks.
Large Models (13B+ parameters)¶
Examples: Llama 3.1 70B, Mixtral 8x7B, Qwen2.5 32B/72B
Pros:
- Significantly better reasoning, instruction-following, and accuracy
- Handle ambiguity and complex tasks more reliably
Cons:
- Much slower inference
- High memory requirements (a 70B Q4 model needs ~40 GB)
- May not fit on all machines without significant quantization
- Diminishing returns for simple tasks — a 70B model reading a file is overkill
Best for: The orchestrator (if running locally), complex planning tasks, or situations where accuracy is critical and you don't want to use a cloud model.
Advanced: Quantization¶
Quantization reduces a model's file size and memory usage by storing its parameters with lower numerical precision. The trade-off is a small loss in output quality — but for most practical purposes, quantized models perform nearly as well as their full-precision counterparts.
GGUF models come in various quantization levels:
| Quantization | Size (vs. 16-bit original) | Quality Impact |
|---|---|---|
| Q8_0 | ~50% | Negligible quality loss |
| Q6_K | ~40% | Very minor quality loss |
| Q5_K_M | ~35% | Minor quality loss |
| Q4_K_M | ~30% | Good quality, recommended default |
| Q3_K_M | ~25% | Noticeable quality loss |
| Q2_K | ~20% | Significant quality loss |
Q4_K_M is the recommended default. It's the best balance of file size, inference speed, and output quality. Go to Q5_K_M or Q6_K if you have the memory and want slightly better quality. Avoid Q2/Q3 unless you're very memory-constrained.
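The percentages in the table fall out of simple arithmetic against the 16-bit original. The bits-per-weight averages below are rough assumptions (actual GGUF sizes vary by model architecture), but they show where the numbers come from:

```python
# Approximate average bits per weight for common GGUF quant levels,
# with the 16-bit unquantized model as the baseline. Rough figures only.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.2,
}

for quant, bits in BITS_PER_WEIGHT.items():
    size_gb = 8 * bits / 8                # for an 8B-parameter model
    pct_of_fp16 = 100 * bits / 16
    print(f"{quant:7s} ~{size_gb:.1f} GB ({pct_of_fp16:.0f}% of FP16)")
```

For an 8B model this puts Q4_K_M at roughly 4.8 GB versus 16 GB unquantized — which is why it fits so comfortably on consumer hardware.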