A Practical Architecture for Running Multiple Specialized Agents Locally (No Keys, No Credits, No External Metering)
DOI:
John Swygert
January 22, 2026
Abstract
This paper proposes a practical, buildable architecture for running a local Large Language Model (LLM) on a personal computer while simultaneously running multiple specialized AI agents as local modules that share the same model endpoint and the same workspace. The design explicitly avoids per-agent billing, external API keys, cloud dependency, and “points” systems by keeping both the inference engine and the agent layer entirely on-device. The system is framed as a “castle”: a local sovereign workspace with defined gates, walls, rooms, and stewardship rules. The paper provides a grounded hardware assessment based on two Windows 10 desktop towers equipped with GTX 1660-class GPUs (6 GB VRAM), explains realistic model sizing constraints, specifies a recommended software stack, and outlines a phased implementation plan. The result is a stable upstream/downstream pipeline where ideation and agentic work occur upstream in a controlled local sandbox, while publication remains downstream in canonical systems such as WordPress.
1. Purpose And Motivation
Many modern AI workflows are built around external services: a user sends prompts to a remote model, remote “agents” perform tasks, and the user pays per token, per agent, or per request. This model creates several structural problems for a creator who wants autonomy and continuity:
- Per-agent metering becomes a tax on creativity. If each agent is external, each agent becomes a billable consumer.
- Credentials become a chronic maintenance burden. Keys, passwords, scopes, and token refreshes become their own job.
- Provenance becomes blurry. External services can change, go down, silently update, or alter behavior.
- The attack surface expands. The moment tools operate across network boundaries, complexity rises and security deteriorates.
The desired alternative is simple and powerful:
- Run the model locally.
- Run all agents locally as modules/processes.
- Make the workspace local and shared.
- Keep publication downstream (WordPress or other canonical outlets) without embedding AI inside the publishing layer.
This design allows the user to treat AI as a sovereign instrument—like a local machine tool—rather than a metered utility.
2. The “Castle” Architecture (A Clear Mental Model)
The “castle” metaphor is not poetic decoration; it is a useful systems model:
- The Engine Room: where the LLM runs (local inference server).
- The Vault: the shared workspace directory where all content lives.
- Rooms: structured subfolders that define what type of work belongs where.
- Stewards: agents with narrowly defined roles and privileges.
- Gates: the interface boundaries—what can read, what can write, what can publish.
- Walls: enforcement mechanisms—file permissions, process constraints, audit logs.
A castle is not a prison. A castle is a stable boundary that allows power without chaos. In agent systems, chaos is the default unless roles and boundaries are explicit.
3. Upstream And Downstream (The Core Systems Advantage)
This architecture benefits from a key distinction:
- Upstream: exploration, drafting, brainstorming, synthesis, transformation—high entropy and high experimentation.
- Downstream: publication, archiving, canonical record—low entropy and high stability.
Many people incorrectly place AI downstream, inside publishing systems. That introduces risks: accidental edits, provenance confusion, and drift. The better pattern is:
- Keep AI upstream where error is permitted and reversible.
- Keep publication downstream where stability and authorship matter.
Under this model, WordPress is not a “brain.” WordPress is a ledger, a vault door, and a public record.
4. Hardware Reality Check: Two Consumer Towers
This build targets two existing Windows towers (summarized below without reproducing sensitive licensing identifiers).
Tower A
- Windows 10 Home
- Intel i7-6700K
- 32 GB RAM
- GTX 1660 Ti, 6 GB VRAM
- HDD storage, with an available M.2 NVMe slot
Tower B
- Windows tower (CyberPowerPC)
- Intel i5-9400F
- 40 GB RAM
- GTX 1660, 6 GB VRAM
- HDD storage
4.1 The Constraint That Matters Most: VRAM
For local LLM inference, GPU VRAM is the primary constraint. At 6 GB VRAM, the system can run modern 7B-class models (quantized) comfortably and can run larger models only with significant tradeoffs.
This yields a realistic and important conclusion:
- 7B/8B quantized models are the workhorse class on these GPUs.
- 13B models are possible but often slow and tight (reduced context, lower concurrency).
- 30B+ models are not the target class for this hardware if the goal is smooth multi-agent work.
This is not a weakness. Multi-agent specialization frequently outperforms single “huge” models in real workflows because it replaces brute force with structure.
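The sizing claims above can be checked with a back-of-envelope weight-memory estimate. This is an illustrative sketch, not a benchmark: the bits-per-weight figure is an assumed typical value for 4-bit quantization with overhead, and real usage adds KV cache, activations, and runtime overhead on top of the weights.

```python
# Rough VRAM estimate for quantized model weights (illustrative only).
# Real usage is higher: KV cache, activations, and runtime overhead add
# roughly 1-2 GB on top of the weights at modest context sizes.

def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at ~4.5 bits/weight (typical 4-bit quant with overhead)
# leaves headroom for cache on a 6 GB card; a 13B model already spills.
print(round(weight_vram_gb(7, 4.5), 2))    # ~3.94 GB
print(round(weight_vram_gb(13, 4.5), 2))   # ~7.31 GB -> partial CPU offload
```

This arithmetic is why 7B/8B quantized models are the workhorse class on these GPUs and 13B sits at the edge.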
5. The Single Most Important Physical Upgrade
Both towers are currently HDD-based for the OS and/or primary workspace. For local agent systems, disk I/O becomes visible: models load, logs write, files transform, embeddings store, revisions diff.
Installing an NVMe SSD in the available M.2 slot (Tower A) is the highest return upgrade per dollar because it:
- drastically reduces model load time
- speeds up file reads/writes across the workspace
- improves responsiveness of agent pipelines
- makes “local AI” feel like a usable instrument, not a sluggish experiment
If only one upgrade is done, this is the one.
6. Architectural Requirement: “No Keys, No Credits, No External Metering”
The requirement is explicit:
All agents must be local modules on the local LLM.
No agent should require external API keys, paid credits, or token systems to talk to the model.
This implies the following design constraints:
- The LLM must be hosted locally and exposed as a local endpoint (typically localhost).
- Agents should call the LLM via local IPC or local HTTP calls.
- Authentication between local agents and the model server can be minimal because the threat boundary is the machine itself—though local controls can still be applied for safety and discipline.
- Any external integration (WordPress, indexing services, publishing APIs) should be optional, gated, and downstream.
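The constraints above can be sketched in code. This is a minimal example assuming the local server exposes an OpenAI-compatible chat endpoint on localhost (a common convention for local inference servers); the URL, port, and model name are placeholder assumptions, and note that no API key appears anywhere.

```python
import json
import urllib.request

# Assumed: a local server exposing an OpenAI-compatible chat endpoint.
# The URL and model name below are placeholders; adjust to your server.
LOCAL_ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_request(agent_name: str, system_prompt: str, user_text: str) -> dict:
    """Payload shape shared by all local agents: no API key, no metering."""
    return {
        "model": "local-7b",  # whatever name the server gives the loaded model
        "messages": [
            {"role": "system", "content": f"[{agent_name}] {system_prompt}"},
            {"role": "user", "content": user_text},
        ],
        "temperature": 0.3,
    }

def call_local_llm(payload: dict) -> str:
    """POST to the shared local endpoint; every agent uses this one path."""
    req = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_request("Researcher", "Summarize inputs into fact packs.", "raw notes here")
```

Because the threat boundary is the machine itself, the only “credential” is the loopback address.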
7. Recommended Software Stack (Practical, Not Theoretical)
The stack needs to satisfy four properties:
- easy to run on Windows
- stable local inference
- multi-agent orchestration
- clean workspace integration
7.1 Local Inference: One Model Endpoint
Use a local model server that runs persistently and provides a single endpoint that all agents share. Conceptually:
- One server hosts the model.
- All agents connect to the same endpoint.
- Concurrency is managed locally.
This ensures agents are not “customers.” They are modules calling an internal engine.
7.2 Multi-Agent Orchestration: Roles As First-Class Objects
A multi-agent orchestrator is what turns a local model into a productive system. The orchestrator should support:
- multiple agents with distinct prompts and responsibilities
- stateful flows (work can continue across steps)
- tool usage (file IO, diffing, indexing, etc.)
- routing (which agent handles which task)
The core idea is: the model is general; agents are specialized.
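A minimal sketch of that idea, assuming nothing beyond the roles defined in this paper: agents are data (a name, a prompt, a set of writable rooms), and routing is an explicit lookup. Task-type keys and prompts here are illustrative.

```python
from dataclasses import dataclass

# Minimal role registry: the model is general, agents are specialized.
# Names, prompts, and task keys are illustrative, not prescriptive.

@dataclass
class Agent:
    name: str
    system_prompt: str
    write_dirs: tuple  # rooms this agent may write into

AGENTS = {
    "research": Agent("Researcher",
                      "Convert messy input into usable knowledge.",
                      ("10_RESEARCH/", "50_MEMORY/")),
    "draft":    Agent("Writer",
                      "Convert research and inbox notes into drafts.",
                      ("20_DRAFTS/",)),
    "review":   Agent("Referee",
                      "Improve drafts via diffs; never rewrite wholesale.",
                      ("30_REVISIONS/",)),
}

def route(task_type: str) -> Agent:
    """Explicit routing: which agent handles which task."""
    try:
        return AGENTS[task_type]
    except KeyError:
        raise ValueError(f"No agent registered for task type: {task_type!r}")

print(route("draft").name)  # Writer
```

Keeping routing explicit (rather than letting the model decide) is what makes the orchestration auditable.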
7.3 Optional UI: Local Cockpit
A local UI is helpful but not necessary. The system can run headless. If used, the UI should remain local and should not force cloud dependencies.
8. Workspace Design: The Vault With Rooms
A shared workspace is the heart of this system. It makes the system real, tangible, and auditable.
A recommended structure:
AO_WORKSPACE/
├── 00_INBOX/ # raw inputs, voice dumps, pasted notes
├── 10_RESEARCH/ # citations, summaries, extracted facts
├── 20_DRAFTS/ # agent-written drafts and assemblies
├── 30_REVISIONS/ # edits, diffs, referee notes
├── 40_PUBLISHED/ # final “canonical” exports ready for posting
├── 50_MEMORY/ # embeddings, indices, persistent notes
└── 90_LOGS/ # agent actions, timestamps, audit trails
Why this matters
- It prevents content sprawl.
- It allows each agent to operate cleanly.
- It supports provenance and continuity.
- It matches upstream/downstream logic: inbox → research → drafts → revisions → published.
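The vault can be bootstrapped with a few lines. A minimal sketch: the room names mirror the layout above, and the root location is up to you (ideally the NVMe drive from Section 5).

```python
import pathlib

# One-time vault setup: creates the room structure under a given root.
ROOMS = [
    "00_INBOX", "10_RESEARCH", "20_DRAFTS",
    "30_REVISIONS", "40_PUBLISHED", "50_MEMORY", "90_LOGS",
]

def create_vault(root: str) -> pathlib.Path:
    """Create every room; safe to re-run (exist_ok)."""
    base = pathlib.Path(root)
    for room in ROOMS:
        (base / room).mkdir(parents=True, exist_ok=True)
    return base

vault = create_vault("AO_WORKSPACE")
```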
9. Agent Roles (A Minimal Powerful Set)
Start with three agents. This avoids complexity while immediately delivering value.
9.1 Researcher (Read-Heavy, Write Notes)
Goal: Convert messy input into usable knowledge.
Allowed write locations: 10_RESEARCH/ (and optionally 50_MEMORY/).
Outputs: summaries, bullet fact packs, source lists, open questions.
9.2 Writer (Draft Producer)
Goal: Convert research + inbox into coherent draft structures.
Allowed write locations: 20_DRAFTS/.
Outputs: full sections, outlines, coherent narrative drafts.
9.3 Editor / Referee (Diff-Only Authority)
Goal: Improve and correct drafts without becoming a second writer.
Allowed write locations: 30_REVISIONS/.
Outputs: diffs, revision notes, clarity improvements, consistency checks.
This “Referee” role is crucial because it prevents the system from collapsing into one agent rewriting everything endlessly.
10. Local Permissions: Simple Controls That Work
Because all agents are local, complicated OAuth is not required internally. The most effective controls are:
- folder permissions (read/write rules)
- process discipline (what each agent is allowed to do)
- audit logs (what happened and when)
- diff-based editing for any agent that should not directly overwrite canonical text
This is the practical version of “scopes.”
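A sketch of those practical scopes, assuming the role-to-room mapping from Section 9: every write is checked against the agent's allowed rooms and the decision is appended to an audit log in 90_LOGS/. Function names and the log format are illustrative.

```python
import datetime
import pathlib

# Practical "scopes": check an agent's target path against its allowed
# rooms before any write, and append the decision to an audit log.
ALLOWED = {
    "Researcher": ("10_RESEARCH", "50_MEMORY"),
    "Writer":     ("20_DRAFTS",),
    "Referee":    ("30_REVISIONS",),
}

def write_allowed(agent: str, rel_path: str) -> bool:
    """An agent may write only under its own top-level rooms."""
    top = pathlib.PurePosixPath(rel_path).parts[0]
    return top in ALLOWED.get(agent, ())

def audited_write(vault: pathlib.Path, agent: str, rel_path: str, text: str):
    """Log every attempt; refuse out-of-scope writes."""
    ok = write_allowed(agent, rel_path)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    log = vault / "90_LOGS" / "audit.log"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as f:
        f.write(f"{stamp}\t{agent}\t{rel_path}\t{'OK' if ok else 'DENIED'}\n")
    if not ok:
        raise PermissionError(f"{agent} may not write to {rel_path}")
    target = vault / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
```

Note that the gate lives in the orchestration layer, not the OS; stricter enforcement can add OS file permissions underneath.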
11. Concurrency Strategy On 6 GB VRAM
With 7B-class models quantized, concurrency becomes feasible but should be managed. Recommendations:
- Run one model server.
- Allow multiple agents to queue requests rather than load multiple models simultaneously.
- Use short, explicit agent prompts with consistent formats.
- Keep memory and context windows realistic; don’t try to brute force huge context sizes that degrade speed.
It is better to have agents that cooperate through files than agents that attempt to carry everything in context.
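The queueing recommendation can be sketched with a single worker thread that serializes all agent requests to the one model server. The model call here is a stub standing in for the local endpoint; the point is the structure, not the stub.

```python
import queue
import threading

# One model server, many agents: serialize requests through a queue so
# agents never compete for VRAM by loading separate models.

def fake_model(prompt: str) -> str:
    """Stub for the shared local endpoint (see Section 6)."""
    return f"reply:{prompt}"

def worker(jobs: "queue.Queue", results: dict):
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut down
            break
        job_id, prompt = item
        results[job_id] = fake_model(prompt)   # one request at a time
        jobs.task_done()

jobs: "queue.Queue" = queue.Queue()
results: dict = {}
t = threading.Thread(target=worker, args=(jobs, results))
t.start()
for i, prompt in enumerate(["outline", "summarize", "review"]):
    jobs.put((i, prompt))
jobs.put(None)
t.join()
```

Requests complete in arrival order, which also makes the audit log a faithful timeline of what the model was asked to do.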
12. Why This Beats “AI In WordPress” For This Workflow
If the author is drafting in ChatGPT or X (Grok) and then publishing to WordPress, embedding AI in WordPress adds little value and increases risk.
Under the castle model:
- WordPress remains downstream—canonical and stable.
- The local system remains upstream—creative and experimental.
- Publication becomes a deliberate human act: export from 40_PUBLISHED/ to WordPress.
This protects authorship, reduces drift, and keeps the publishing layer clean.
13. The Future Role Of OAuth (Later, Not Now)
OAuth becomes relevant when the castle needs external gates, for example:
- remote access from another device on the home network
- multiple human stewards or collaborators
- publish-only tokens to downstream services
- read-only tokens for indexing and mirroring systems
OAuth is the outer moat and gatehouse. It is not required for local internal movement between rooms.
14. Deployment Plan (Phased, Low Waste)
Phase 1 — Foundation (Fast, Minimal)
- Add NVMe SSD (Tower A recommended) and place workspace there
- Install local model server
- Verify you can chat locally with the model
- Create the workspace folder structure
Deliverable: You have a working local LLM + vault.
Phase 2 — Three Agents (The First Real System)
- Implement Researcher / Writer / Referee roles
- Route tasks by explicit commands or simple orchestration
- Start producing drafts and revision diffs into the workspace
Deliverable: Multi-agent output with provenance and logs.
Phase 3 — Memory And Indexing
- Add local embeddings store (optional)
- Build “search my vault” capability
- Create persistent summaries and context packs per project
Deliverable: Your own local knowledge base that grows.
Phase 4 — Stewardship And Publishing Automation (Still Local)
- Build a “Publisher” module that packages content from 40_PUBLISHED/
- Optionally generate metadata templates, DOI stubs, or HTML exports
- Keep WordPress posting manual, or add a controlled publish script later
Deliverable: Repeatable publishing without cloud AI inside WordPress.
Phase 5 — Expansion To Two-Tower Workflow
- Use Tower B for background jobs: indexing, batch conversions, video scripts, long-running tasks
- Keep Tower A as the interactive agent cockpit
- Add a lightweight sync strategy (local network share or scheduled mirroring)
Deliverable: A personal “micro-cluster” without overengineering.
15. Risk Management And Safety
15.1 Avoiding Drift
- Canonical output is stored only in 40_PUBLISHED/
- Referee edits are diffs, not overwrites
- Logs record agent actions and timestamps
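The diff-not-overwrite rule maps directly onto standard tooling. A minimal sketch using Python's difflib; the draft text and file names are illustrative.

```python
import difflib

# Referee output as a unified diff: proposals, not overwrites.
draft = ["The castle keeps agents inside.\n",
         "Publication is downstream.\n"]
edit  = ["The castle keeps agents in bounded roles.\n",
         "Publication stays downstream.\n"]

diff = list(difflib.unified_diff(draft, edit,
                                 fromfile="20_DRAFTS/section.md",
                                 tofile="30_REVISIONS/section.review"))
print("".join(diff))
```

Because the Referee emits a patch rather than a replacement file, a human (or the Writer) must apply it, which is exactly the friction that prevents drift.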
15.2 Avoiding Credential Leakage
- Keep the system local
- Don’t paste software keys or OS activation codes into agent prompts
- Treat the vault as sensitive: if it is ever synced, encrypt it
15.3 Avoiding “Agent Chaos”
- Start with 3 agents
- Add new agents only when the role is clearly distinct
- Make each agent’s outputs predictable (templates)
16. Why Patience Still Makes Sense
It is reasonable to expect packaging to improve rapidly over the next 1–3 years, because the market is converging on:
- local inference becoming normal
- multi-agent orchestration becoming standardized
- permissions and identity becoming cleaner and boring (which is good)
However, the foundation work described here is not wasted. A properly designed vault + roles + upstream/downstream flow will integrate into future packages with minimal change. The user is not building a fragile prototype; the user is building a durable architecture.
17. Conclusion
A local multi-agent LLM system is buildable today on consumer hardware with 6 GB VRAM GPUs, provided the design respects real constraints and prioritizes architecture over brute force. The most realistic success path is to standardize around 7B-class quantized models, run one local inference endpoint, and build a disciplined workspace where agents operate as local modules with defined roles and bounded write access. This approach eliminates per-agent billing, avoids cloud dependencies, preserves authorship, and creates a sovereign upstream creative engine feeding a downstream canonical publication layer (WordPress). With a single NVMe SSD upgrade and a phased deployment strategy, the system can evolve from a simple local chat model into a true multi-agent “castle” capable of producing sustained, organized output across multiple projects.
Appendix A — Minimal Agent Protocol (Suggested)
To keep outputs consistent across agents, define a protocol each agent must follow.
Researcher Output Template
- Summary
- Key facts (bulleted)
- Open questions
- Suggested next actions
- File written to: 10_RESEARCH/YYYYMMDD_topic.md
Writer Output Template
- Title
- Outline
- Draft section(s)
- “Needs research” flags
- File written to: 20_DRAFTS/YYYYMMDD_project_section.md
Referee Output Template
- Issues found (numbered)
- Proposed edits (diff or patch style)
- Consistency notes (terminology, tone, constraints)
- File written to: 30_REVISIONS/YYYYMMDD_draftname_review.md
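The templates above are easy to enforce mechanically. A sketch for the Researcher case: render the fields into the template and derive the dated filename from the naming convention; the topic, field values, and slug rule are illustrative.

```python
import datetime
import re

# Render a Researcher note following the template above, targeting
# 10_RESEARCH/ under the YYYYMMDD_topic.md naming convention.
TEMPLATE = """# {title}

## Summary
{summary}

## Key facts
{facts}

## Open questions
{questions}

## Suggested next actions
{actions}
"""

def researcher_note(topic: str, **fields) -> tuple:
    """Return (relative path, rendered body) for a Researcher output."""
    date = datetime.date.today().strftime("%Y%m%d")
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    path = f"10_RESEARCH/{date}_{slug}.md"
    return path, TEMPLATE.format(title=topic, **fields)

path, body = researcher_note(
    "Local VRAM limits",
    summary="6 GB VRAM favors 7B-class quantized models.",
    facts="- 7B at ~4.5 bits/weight -> ~3.9 GB of weights",
    questions="- How much KV cache at longer contexts?",
    actions="- Benchmark on Tower A after the SSD install",
)
```

Templated outputs are what make agent behavior predictable (Section 15.3): a malformed note fails loudly at render time instead of quietly polluting the vault.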
Appendix B — The One Rule That Keeps It Clean
AI upstream, publication downstream.
That one rule prevents 80% of confusion, drift, and wasted effort.
