Saving Money with Local LLM Configurations¶
Quincy runs several background tasks that consume LLM tokens: generating session titles, curating memories, enriching tool metadata, and assembling context for scheduled jobs. By default, these use whatever model the orchestrator uses — but they don't need to.
Why Background Tasks Are Good Candidates for Local Models¶
These tasks are formulaic, low-stakes, and high-volume. They usually don't need the reasoning power of a cloud model:
- Session title generation — Produces a short summary from conversation context. A small local model handles this easily.
- Memory curation — Extracts facts and observations from conversation logs. The output is structured and predictable.
- Tool enrichment — Generates human-readable display names and parameter labels for MCP tools. Follows a consistent pattern.
- Job context assembly — Gathers and formats context before a scheduled job runs. Mostly retrieval and light summarization.
Running these on a local model eliminates per-token cloud costs for work that happens frequently in the background.
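As an illustration of how small these requests are, consider title generation: it is a single short, bounded completion. The sketch below builds such a request for a local llama-server instance, which exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The function names, payload shape, prompt, and port are illustrative assumptions, not Quincy's actual internals:

```python
# Hypothetical sketch: a session-title request against a local llama-server.
# llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint;
# the endpoint port, prompt, and helper names here are illustrative.
import json
from urllib import request

LOCAL_ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed port

def build_title_payload(conversation: str) -> dict:
    """Build a small, bounded completion request for title generation."""
    return {
        "model": "local",  # llama-server serves whichever model it loaded
        "messages": [
            {"role": "system",
             "content": "Summarize the conversation in a title of at most 6 words."},
            {"role": "user", "content": conversation},
        ],
        "max_tokens": 16,     # titles are short; cap the output
        "temperature": 0.2,   # formulaic task, little creativity needed
    }

def generate_title(conversation: str) -> str:
    """Send the request to the local server (requires llama-server running)."""
    req = request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(build_title_payload(conversation)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()
```

Because the request is tiny and capped at a handful of output tokens, even a modest local model turns it around quickly.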
What Should Stay on Cloud Models¶
Some tasks benefit from the stronger reasoning and larger context windows that cloud models provide:
- Orchestrator decisions — Choosing which agent to delegate to, interpreting ambiguous requests
- Complex multi-step tool use — Chaining tool calls with conditional logic
- User-facing conversation — Where response quality directly affects your experience
- Policy evaluation for sensitive tools — Where mistakes could have real consequences
The hybrid approach gives you the best of both: cloud accuracy where it matters, local speed and savings where it doesn't.
How to Configure It¶
You can adjust background task models conversationally. Ask Quincy:
"Use my local model for memory curation and session titles"
Or be more specific:
"Set the memory curation agent to use llama-server with the 8B model"
Quincy updates the per-agent model preferences in your signed configuration. Each background agent can target a different model independently.
The hybrid setup in practice:
- Your orchestrator and user-facing agents use a cloud model (e.g., Anthropic's Claude or Google's Gemini)
- Background agents (session housekeeping, memory curation) use your local llama-server
- Scheduled jobs can use either, depending on the task complexity
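The routing behind this layout can be pictured as a simple per-agent lookup with a cloud fallback. The agent names and mapping shape below are hypothetical, only mirroring the hybrid setup described above; Quincy's signed configuration format is not shown here:

```python
# Illustrative sketch of per-agent model routing. Agent names and the
# mapping format are hypothetical; only the fallback idea is the point.
DEFAULT_MODEL = "cloud-model"  # orchestrator / user-facing default

AGENT_MODELS = {
    "session-housekeeping": "llama-server-8b",  # background: local
    "memory-curation": "llama-server-8b",       # background: local
    "tool-enrichment": "llama-server-8b",       # background: local
    # orchestrator and user-facing agents are absent on purpose:
    # they fall through to the cloud default
}

def model_for(agent: str) -> str:
    """Resolve an agent's model, falling back to the cloud default."""
    return AGENT_MODELS.get(agent, DEFAULT_MODEL)
```

The fallback design means you only list the agents you want to move to local; everything else keeps the stronger cloud model by default.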
To verify which model is handling what, ask Quincy: "show me the model assignments for all agents."
Recommended Local Models for Background Tasks¶
A 7–8B parameter model at Q4_K_M quantization handles all background tasks well. This is the sweet spot — small enough to run fast on most Macs with Metal GPU acceleration, large enough to produce reliable structured output.
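A rough sanity check on why this size fits consumer hardware: Q4_K_M averages roughly 4.5 bits per weight, so an 8B-parameter model occupies around 4.5 GB before context buffers. A back-of-envelope calculation (the bits-per-weight figure is an approximation):

```python
# Back-of-envelope memory estimate for a quantized model.
# Q4_K_M averages roughly 4.5 bits per weight (approximation).
def model_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, weights only

print(round(model_size_gb(8), 1))  # → 4.5
```

That comfortably fits in the unified memory of most Apple Silicon Macs, leaving headroom for the context cache.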
For more details on model sizes, quantization levels, and hardware requirements, see Choosing Models.