A Practical Architecture for Running Multiple Specialized Agents Locally (No Keys, No Credits, No External Metering)
DOI:
John Swygert
January 22, 2026
Abstract
This paper proposes a practical, buildable architecture for running a local Large Language Model (LLM) on a personal computer while simultaneously running multiple specialized AI agents as local modules that share the same model endpoint and the same workspace. The design explicitly avoids per-agent billing, external API keys, cloud dependency, and “points” systems by keeping both the inference engine and the agent layer entirely on-device. The system is framed as a “castle”: a local sovereign workspace with defined gates, walls, rooms, and stewardship rules. The paper provides a grounded hardware assessment based on two Windows 10 desktop towers equipped with GTX 1660-class GPUs (6 GB VRAM), explains realistic model sizing constraints, specifies a recommended software stack, and outlines a phased implementation plan. The result is a stable upstream/downstream pipeline where ideation and agentic work occur upstream in a controlled local sandbox, while publication remains downstream in canonical systems such as WordPress.
1. Purpose And Motivation
Many modern AI workflows are built around external services: a user sends prompts to a remote model, remote “agents” perform tasks, and the user pays per token, per agent, or per request. This model creates several structural problems for a creator who wants autonomy and continuity:
- Per-agent metering becomes a tax on creativity. If each agent is external, each agent becomes a billable consumer.
- Credentials become a chronic maintenance burden. Keys, passwords, scopes, and token refreshes become their own job.
- Provenance becomes blurry. External services can change, go down, silently update, or alter behavior.
- The attack surface expands. The moment tools operate across network boundaries, complexity rises and security deteriorates.
The desired alternative is simple and powerful:
- Run the model locally.
- Run all agents locally as modules/processes.
- Make the workspace local and shared.
- Keep publication downstream (WordPress or other canonical outlets) without embedding AI inside the publishing layer.
This design allows the user to treat AI as a sovereign instrument—like a local machine tool—rather than a metered utility.
2. The “Castle” Architecture (A Clear Mental Model)
The “castle” metaphor is not poetic decoration; it is a useful systems model:
- The Engine Room: where the LLM runs (local inference server).
- The Vault: the shared workspace directory where all content lives.
- Rooms: structured subfolders that define what type of work belongs where.
- Stewards: agents with narrowly defined roles and privileges.
- Gates: the interface boundaries—what can read, what can write, what can publish.
- Walls: enforcement mechanisms—file permissions, process constraints, audit logs.
A castle is not a prison. A castle is a stable boundary that allows power without chaos. In agent systems, chaos is the default unless roles and boundaries are explicit.
3. Upstream And Downstream (The Core Systems Advantage)
This architecture benefits from a key distinction:
- Upstream: exploration, drafting, brainstorming, synthesis, transformation—high entropy and high experimentation.
- Downstream: publication, archiving, canonical record—low entropy and high stability.
Many people incorrectly place AI downstream, inside publishing systems. That introduces risks: accidental edits, provenance confusion, and drift. The better pattern is:
- Keep AI upstream where error is permitted and reversible.
- Keep publication downstream where stability and authorship matter.
Under this model, WordPress is not a “brain.” WordPress is a ledger, a vault door, and a public record.
4. Hardware Reality Check: Two Consumer Towers
This build targets two existing Windows towers (summarized below without reproducing sensitive licensing identifiers).
Tower A
- Windows 10 Home
- Intel i7-6700K
- 32 GB RAM
- GTX 1660 Ti, 6 GB VRAM
- HDD storage, with an available M.2 NVMe slot
Tower B
- Windows tower (CyberPowerPC)
- Intel i5-9400F
- 40 GB RAM
- GTX 1660, 6 GB VRAM
- HDD storage
4.1 The Constraint That Matters Most: VRAM
For local LLM inference, GPU VRAM is the primary constraint. At 6 GB VRAM, the system can run modern 7B-class models (quantized) comfortably and can run larger models only with significant tradeoffs.
This yields a realistic and important conclusion:
- 7B/8B quantized models are the workhorse class on these GPUs.
- 13B models are possible but often slow and tight (reduced context, lower concurrency).
- 30B+ models are not the target class for this hardware if the goal is smooth multi-agent work.
This is not a weakness. Multi-agent specialization frequently outperforms single “huge” models in real workflows because it replaces brute force with structure.
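The sizing claims above can be checked with a back-of-envelope weight-memory estimate. This is an illustrative sketch, not a benchmark: the bits-per-weight figure is an assumed typical value for 4-bit quantization with overhead, and real usage adds KV cache, activations, and runtime overhead on top of the weights.

```python
# Rough VRAM estimate for quantized model weights (illustrative only).
# Real usage is higher: KV cache, activations, and runtime overhead add
# roughly 1-2 GB on top of the weights at modest context sizes.

def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at ~4.5 bits/weight (typical 4-bit quant with overhead)
# leaves headroom for cache on a 6 GB card; a 13B model already spills.
print(round(weight_vram_gb(7, 4.5), 2))    # ~3.94 GB
print(round(weight_vram_gb(13, 4.5), 2))   # ~7.31 GB -> partial CPU offload
```

This arithmetic is why 7B/8B quantized models are the workhorse class on these GPUs and 13B sits at the edge.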
5. The Single Most Important Physical Upgrade
Both towers are currently HDD-based for the OS and/or primary workspace. For local agent systems, disk I/O becomes visible: models load, logs write, files transform, embeddings store, revisions diff.
Installing an NVMe SSD in the available M.2 slot (Tower A) is the highest return upgrade per dollar because it:
- drastically reduces model load time
- speeds up file reads/writes across the workspace
- improves responsiveness of agent pipelines
- makes “local AI” feel like a usable instrument, not a sluggish experiment
If only one upgrade is done, this is the one.
6. Architectural Requirement: “No Keys, No Credits, No External Metering”
The requirement is explicit:
All agents must be local modules on the local LLM.
No agent should require external API keys, paid credits, or token systems to talk to the model.
This implies the following design constraints:
- The LLM must be hosted locally and exposed as a local endpoint (typically localhost).
- Agents should call the LLM via local IPC or local HTTP calls.
- Authentication between local agents and the model server can be minimal because the threat boundary is the machine itself—though local controls can still be applied for safety and discipline.
- Any external integration (WordPress, indexing services, publishing APIs) should be optional, gated, and downstream.
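The constraints above can be sketched in code. This is a minimal example assuming the local server exposes an OpenAI-compatible chat endpoint on localhost (a common convention for local inference servers); the URL, port, and model name are placeholder assumptions, and note that no API key appears anywhere.

```python
import json
import urllib.request

# Assumed: a local server exposing an OpenAI-compatible chat endpoint.
# The URL and model name below are placeholders; adjust to your server.
LOCAL_ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_request(agent_name: str, system_prompt: str, user_text: str) -> dict:
    """Payload shape shared by all local agents: no API key, no metering."""
    return {
        "model": "local-7b",  # whatever name the server gives the loaded model
        "messages": [
            {"role": "system", "content": f"[{agent_name}] {system_prompt}"},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.3,
    }

def call_local_llm(payload: dict) -> str:
    """POST to the shared local endpoint; every agent uses this one path."""
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_request("Researcher", "Summarize inputs into fact packs.", "raw notes here")
```

Because the threat boundary is the machine itself, the only “credential” is the loopback address.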
7. Recommended Software Stack (Practical, Not Theoretical)
The stack needs to satisfy four properties:
- easy to run on Windows
- stable local inference
- multi-agent orchestration
- clean workspace integration
7.1 Local Inference: One Model Endpoint
Use a local model server that runs persistently and provides a single endpoint that all agents share. Conceptually:
- One server hosts the model.
- All agents connect to the same endpoint.
- Concurrency is managed locally.
This ensures agents are not “customers.” They are modules calling an internal engine.
7.2 Multi-Agent Orchestration: Roles As First-Class Objects
A multi-agent orchestrator is what turns a local model into a productive system. The orchestrator should support:
- multiple agents with distinct prompts and responsibilities
- stateful flows (work can continue across steps)
- tool usage (file IO, diffing, indexing, etc.)
- routing (which agent handles which task)
The core idea is: the model is general; agents are specialized.
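A minimal sketch of that idea, assuming nothing beyond the roles defined in this paper: agents are data (a name, a prompt, a set of writable rooms), and routing is an explicit lookup. Task-type keys and prompts here are illustrative.

```python
from dataclasses import dataclass

# Minimal role registry: the model is general, agents are specialized.
# Names, prompts, and task keys are illustrative, not prescriptive.

@dataclass
class Agent:
    name: str
    system_prompt: str
    write_dirs: tuple  # rooms this agent may write into

AGENTS = {
    "research": Agent("Researcher",
                      "Convert messy input into usable knowledge.",
                      ("10_RESEARCH/", "50_MEMORY/")),
    "draft":    Agent("Writer",
                      "Convert research and inbox notes into drafts.",
                      ("20_DRAFTS/",)),
    "review":   Agent("Referee",
                      "Improve drafts via diffs; never rewrite wholesale.",
                      ("30_REVISIONS/",)),
}

def route(task_type: str) -> Agent:
    """Explicit routing: which agent handles which task."""
    try:
        return AGENTS[task_type]
    except KeyError:
        raise ValueError(f"No agent registered for task type: {task_type!r}")

print(route("draft").name)  # Writer
```

Keeping routing explicit (rather than letting the model decide) is what makes the orchestration auditable.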
7.3 Optional UI: Local Cockpit
A local UI is helpful but not necessary. The system can run headless. If used, the UI should remain local and should not force cloud dependencies.
8. Workspace Design: The Vault With Rooms
A shared workspace is the heart of this system. It makes the system real, tangible, and auditable.
A recommended structure:
AO_WORKSPACE/
├── 00_INBOX/ # raw inputs, voice dumps, pasted notes
├── 10_RESEARCH/ # citations, summaries, extracted facts
├── 20_DRAFTS/ # agent-written drafts and assemblies
├── 30_REVISIONS/ # edits, diffs, referee notes
├── 40_PUBLISHED/ # final “canonical” exports ready for posting
├── 50_MEMORY/ # embeddings, indices, persistent notes
└── 90_LOGS/ # agent actions, timestamps, audit trails
Why this matters
- It prevents content sprawl.
- It allows each agent to operate cleanly.
- It supports provenance and continuity.
- It matches upstream/downstream logic: inbox → research → drafts → revisions → published.
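The vault can be bootstrapped with a few lines. A minimal sketch: the room names mirror the layout above, and the root location is up to you (ideally the NVMe drive from Section 5).

```python
import pathlib

# One-time vault setup: creates the room structure under a given root.
ROOMS = [
    "00_INBOX", "10_RESEARCH", "20_DRAFTS",
    "30_REVISIONS", "40_PUBLISHED", "50_MEMORY", "90_LOGS",
]

def create_vault(root: str) -> pathlib.Path:
    """Create every room; safe to re-run (exist_ok)."""
    base = pathlib.Path(root)
    for room in ROOMS:
        (base / room).mkdir(parents=True, exist_ok=True)
    return base

vault = create_vault("AO_WORKSPACE")
```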
9. Agent Roles (A Minimal Powerful Set)
Start with three agents. This avoids complexity while immediately delivering value.
9.1 Researcher (Read-Heavy, Write Notes)
Goal: Convert messy input into usable knowledge.
Allowed write locations: 10_RESEARCH/ (and optionally 50_MEMORY/).
Outputs: summaries, bullet fact packs, source lists, open questions.
9.2 Writer (Draft Producer)
Goal: Convert research + inbox into coherent draft structures.
Allowed write locations: 20_DRAFTS/.
Outputs: full sections, outlines, coherent narrative drafts.
9.3 Editor / Referee (Diff-Only Authority)
Goal: Improve and correct drafts without becoming a second writer.
Allowed write locations: 30_REVISIONS/.
Outputs: diffs, revision notes, clarity improvements, consistency checks.
This “Referee” role is crucial because it prevents the system from collapsing into one agent rewriting everything endlessly.
10. Local Permissions: Simple Controls That Work
Because all agents are local, complicated OAuth is not required internally. The most effective controls are:
- folder permissions (read/write rules)
- process discipline (what each agent is allowed to do)
- audit logs (what happened and when)
- diff-based editing for any agent that should not directly overwrite canonical text
This is the practical version of “scopes.”
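A sketch of those practical scopes, assuming the role-to-room mapping from Section 9: every write is checked against the agent's allowed rooms and the decision is appended to an audit log in 90_LOGS/. Function names and the log format are illustrative.

```python
import datetime
import pathlib

# Practical "scopes": check an agent's target path against its allowed
# rooms before any write, and append the decision to an audit log.
ALLOWED = {
    "Researcher": ("10_RESEARCH", "50_MEMORY"),
    "Writer":     ("20_DRAFTS",),
    "Referee":    ("30_REVISIONS",),
}

def write_allowed(agent: str, rel_path: str) -> bool:
    """An agent may write only under its own top-level rooms."""
    top = pathlib.PurePosixPath(rel_path).parts[0]
    return top in ALLOWED.get(agent, ())

def audited_write(vault: pathlib.Path, agent: str, rel_path: str, text: str):
    """Log every attempt; refuse out-of-scope writes."""
    ok = write_allowed(agent, rel_path)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    log = vault / "90_LOGS" / "audit.log"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as f:
        f.write(f"{stamp}\t{agent}\t{rel_path}\t{'OK' if ok else 'DENIED'}\n")
    if not ok:
        raise PermissionError(f"{agent} may not write to {rel_path}")
    target = vault / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
```

Note that the gate lives in the orchestration layer, not the OS; stricter enforcement can add OS file permissions underneath.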
11. Concurrency Strategy On 6 GB VRAM
With 7B-class models quantized, concurrency becomes feasible but should be managed. Recommendations:
- Run one model server.
- Allow multiple agents to queue requests rather than load multiple models simultaneously.
- Use short, explicit agent prompts with consistent formats.
- Keep memory and context windows realistic; don’t try to brute force huge context sizes that degrade speed.
It is better to have agents that cooperate through files than agents that attempt to carry everything in context.
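The queueing recommendation can be sketched with a single worker thread that serializes all agent requests to the one model server. The model call here is a stub standing in for the local endpoint; the point is the structure, not the stub.

```python
import queue
import threading

# One model server, many agents: serialize requests through a queue so
# agents never compete for VRAM by loading separate models.

def fake_model(prompt: str) -> str:
    """Stub for the shared local endpoint (see Section 6)."""
    return f"reply:{prompt}"

def worker(jobs: "queue.Queue", results: dict):
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut down
            break
        job_id, prompt = item
        results[job_id] = fake_model(prompt)   # one request at a time
        jobs.task_done()

jobs: "queue.Queue" = queue.Queue()
results: dict = {}
t = threading.Thread(target=worker, args=(jobs, results))
t.start()
for i, prompt in enumerate(["outline", "summarize", "review"]):
    jobs.put((i, prompt))
jobs.put(None)
t.join()
```

Requests complete in arrival order, which also makes the audit log a faithful timeline of what the model was asked to do.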
12. Why This Beats “AI In WordPress” For This Workflow
If the author is drafting in ChatGPT or X (Grok) and then publishing to WordPress, embedding AI in WordPress adds little value and increases risk.
Under the castle model:
- WordPress remains downstream—canonical and stable.
- The local system remains upstream—creative and experimental.
- Publication becomes a deliberate human act: export from 40_PUBLISHED/ to WordPress.
This protects authorship, reduces drift, and keeps the publishing layer clean.
13. The Future Role Of OAuth (Later, Not Now)
OAuth becomes relevant when the castle needs external gates, for example:
- remote access from another device on the home network
- multiple human stewards or collaborators
- publish-only tokens to downstream services
- read-only tokens for indexing and mirroring systems
OAuth is the outer moat and gatehouse. It is not required for local internal movement between rooms.
14. Deployment Plan (Phased, Low Waste)
Phase 1 — Foundation (Fast, Minimal)
- Add NVMe SSD (Tower A recommended) and place workspace there
- Install local model server
- Verify you can chat locally with the model
- Create the workspace folder structure
Deliverable: You have a working local LLM + vault.
Phase 2 — Three Agents (The First Real System)
- Implement Researcher / Writer / Referee roles
- Route tasks by explicit commands or simple orchestration
- Start producing drafts and revision diffs into the workspace
Deliverable: Multi-agent output with provenance and logs.
Phase 3 — Memory And Indexing
- Add local embeddings store (optional)
- Build “search my vault” capability
- Create persistent summaries and context packs per project
Deliverable: Your own local knowledge base that grows.
Phase 4 — Stewardship And Publishing Automation (Still Local)
- Build a “Publisher” module that packages content from 40_PUBLISHED/
- Optionally generate metadata templates, DOI stubs, or HTML exports
- Keep WordPress posting manual, or add a controlled publish script later
Deliverable: Repeatable publishing without cloud AI inside WordPress.
Phase 5 — Expansion To Two-Tower Workflow
- Use Tower B for background jobs: indexing, batch conversions, video scripts, long-running tasks
- Keep Tower A as the interactive agent cockpit
- Add a lightweight sync strategy (local network share or scheduled mirroring)
Deliverable: A personal “micro-cluster” without overengineering.
15. Risk Management And Safety
15.1 Avoiding Drift
- Canonical output is stored only in 40_PUBLISHED/
- Referee edits are diffs, not overwrites
- Logs record agent actions and timestamps
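The diff-not-overwrite rule maps directly onto standard tooling. A minimal sketch using Python's difflib; the draft text and file names are illustrative.

```python
import difflib

# Referee output as a unified diff: proposals, not overwrites.
draft = ["The castle keeps agents inside.\n",
         "Publication is downstream.\n"]
edit  = ["The castle keeps agents in bounded roles.\n",
         "Publication stays downstream.\n"]

diff = list(difflib.unified_diff(draft, edit,
                                 fromfile="20_DRAFTS/section.md",
                                 tofile="30_REVISIONS/section.review"))
print("".join(diff))
```

Because the Referee emits a patch rather than a replacement file, a human (or the Writer) must apply it, which is exactly the friction that prevents drift.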
15.2 Avoiding Credential Leakage
- Keep the system local
- Don’t paste software keys or OS activation codes into agent prompts
- Treat the vault as sensitive: if it is ever synced, encrypt it
15.3 Avoiding “Agent Chaos”
- Start with 3 agents
- Add new agents only when the role is clearly distinct
- Make each agent’s outputs predictable (templates)
16. Why Patience Still Makes Sense
It is reasonable to expect packaging to improve rapidly over the next 1–3 years, because the market is converging on:
- local inference becoming normal
- multi-agent orchestration becoming standardized
- permissions and identity becoming cleaner and boring (which is good)
However, the foundation work described here is not wasted. A properly designed vault + roles + upstream/downstream flow will integrate into future packages with minimal change. The user is not building a fragile prototype; the user is building a durable architecture.
17. Conclusion
A local multi-agent LLM system is buildable today on consumer hardware with 6 GB VRAM GPUs, provided the design respects real constraints and prioritizes architecture over brute force. The most realistic success path is to standardize around 7B-class quantized models, run one local inference endpoint, and build a disciplined workspace where agents operate as local modules with defined roles and bounded write access. This approach eliminates per-agent billing, avoids cloud dependencies, preserves authorship, and creates a sovereign upstream creative engine feeding a downstream canonical publication layer (WordPress). With a single NVMe SSD upgrade and a phased deployment strategy, the system can evolve from a simple local chat model into a true multi-agent “castle” capable of producing sustained, organized output across multiple projects.
Appendix A — Minimal Agent Protocol (Suggested)
To keep outputs consistent across agents, define a protocol each agent must follow.
Researcher Output Template
- Summary
- Key facts (bulleted)
- Open questions
- Suggested next actions
- File written to: 10_RESEARCH/YYYYMMDD_topic.md
Writer Output Template
- Title
- Outline
- Draft section(s)
- “Needs research” flags
- File written to: 20_DRAFTS/YYYYMMDD_project_section.md
Referee Output Template
- Issues found (numbered)
- Proposed edits (diff or patch style)
- Consistency notes (terminology, tone, constraints)
- File written to: 30_REVISIONS/YYYYMMDD_draftname_review.md
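The templates above are easy to enforce mechanically. A sketch for the Researcher case: render the fields into the template and derive the dated filename from the naming convention; the topic, field values, and slug rule are illustrative.

```python
import datetime
import re

# Render a Researcher note following the template above, targeting
# 10_RESEARCH/ under the YYYYMMDD_topic.md naming convention.
TEMPLATE = """# {title}

## Summary
{summary}

## Key facts
{facts}

## Open questions
{questions}

## Suggested next actions
{actions}
"""

def researcher_note(topic: str, **fields) -> tuple:
    """Return (relative path, rendered body) for a Researcher output."""
    date = datetime.date.today().strftime("%Y%m%d")
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    path = f"10_RESEARCH/{date}_{slug}.md"
    return path, TEMPLATE.format(title=topic, **fields)

path, body = researcher_note(
    "Local VRAM limits",
    summary="6 GB VRAM favors 7B-class quantized models.",
    facts="- 7B at ~4.5 bits/weight -> ~3.9 GB of weights",
    questions="- How much KV cache at longer contexts?",
    actions="- Benchmark on Tower A after the SSD install",
)
```

Templated outputs are what make agent behavior predictable (Section 15.3): a malformed note fails loudly at render time instead of quietly polluting the vault.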
Appendix B — The One Rule That Keeps It Clean
AI upstream, publication downstream.
That one rule prevents 80% of confusion, drift, and wasted effort.
