The 4-Tier Agent Memory System
The file-based persistence architecture that makes AI agents remember everything
Most AI agents forget everything when the session ends. This blueprint gives you the exact 4-tier memory architecture used in a production platform running 40+ persistent agents: core identity, working state, long-term patterns, and event streams. File-based, zero-infrastructure, and battle-tested across thousands of agent runs.
Developers building AI agent systems (with GitHub Copilot, Claude, LangChain, CrewAI, or custom frameworks) who need their agents to maintain context across sessions. You've built agents that work great for one conversation but start from scratch every time. You need a production-proven persistence pattern that doesn't require a database, vector store, or complex infrastructure.
Every agent builder hits the same wall: your agent is brilliant for 20 minutes, then you close the session and it forgets everything. The next conversation starts from zero. You've tried stuffing everything into a system prompt (too big), using a database (too complex), or vector search (too lossy). You need a simple, file-based memory system that actually works in production, one that separates what an agent IS from what it's DOING from what it's LEARNED. This blueprint is that system.
Your AI agent is brilliant, for exactly one session. It writes great code, gives perfect advice, coordinates complex tasks. Then you close the terminal. Tomorrow morning, it has no idea who you are, what it was working on, or what it learned yesterday.
An agent without memory isn't an agent. It's a very expensive autocomplete that resets every time you blink.
I run more than 40 persistent agents on a single platform. They manage finances, coordinate content production, track health appointments, maintain repositories, and coach my family through daily life. Every single one of them remembers: across sessions, across days, across months. Not because they use a fancy vector database or a million-dollar infrastructure stack. Because they use four Markdown files.
This blueprint teaches you the exact memory architecture I built after months of iteration. It's file-based, zero-infrastructure, and designed for the way AI agents actually work, not the way database vendors wish they worked. You'll get the templates, the lifecycle rules, the pruning logic, and the anti-patterns I learned the hard way so you don't have to.
Clear problem statement: AI agents are stateless by default. This blueprint gives them a memory system that's simple enough to implement in an afternoon and robust enough to run in production for months.
The Problem: Why Agents Forget
Understanding why every approach you've tried doesn't work, and why the solution is simpler than you think.
The Stateless Default
Every major AI agent framework (LangChain, CrewAI, AutoGen, GitHub Copilot coding agents) starts you with the same architecture: a system prompt, a conversation history, and nothing else. When the session ends, the conversation history evaporates. Your agent wakes up tomorrow with total amnesia.
This isn't a bug. It's a design choice optimized for single-session interactions. But the moment you need an agent to:
- Remember what it was working on yesterday
- Track a project across weeks
- Learn from past mistakes
- Maintain relationships with users over time
- Coordinate with other agents that run on different schedules
...you need persistence. And the solutions most developers reach for are wrong.
The Approaches That Don't Work
Approach 1: Stuff Everything Into the System Prompt
The naive approach. Just keep adding context to the system prompt until the agent "remembers" everything. This works for about a week. Then your system prompt is 15,000 tokens, your agent is slow, expensive, and confused. It reads the same irrelevant context on every single run, whether it needs it or not. Token waste compounds: you're paying to remind the agent about a task it completed three weeks ago.
Approach 2: Database-Backed Memory
The enterprise approach. Stand up PostgreSQL, create a memory schema, build an ORM layer, add query logic, handle migrations. Now your agent can remember things, if you spend six weeks building infrastructure instead of building the agent. For most agent systems, this is like buying a semi-truck to deliver a pizza.
Approach 3: Vector Store + RAG
The trendy approach. Embed all your agent's memories into a vector database, then use retrieval-augmented generation to pull relevant context. Sounds elegant. In practice: the retrieval is lossy (it often misses the exact context you need), the embeddings are expensive, and you've added a dependency that's harder to debug than the agent itself. Vector search is great for finding similar documents; it's terrible for maintaining precise operational state.
Approach 4: Conversation History Persistence
Save the entire conversation history and reload it next session. This seems logical but scales terribly. After a few days, you're loading thousands of tokens of back-and-forth that are mostly irrelevant. The signal-to-noise ratio plummets. Your agent spends more time processing old conversation turns than doing useful work.
What Actually Works
The solution is embarrassingly simple: structured Markdown files with clear separation of concerns.
Instead of one giant memory blob, you separate agent memory into four tiers based on how it's used:
- Core Identity: who the agent is (loaded every time, rarely changes)
- Working State: what the agent is doing right now (loaded every time, changes constantly)
- Long-Term Patterns: what the agent has learned over time (loaded on demand)
- Event Stream: what has happened (append-only, never loaded in bulk)
Each tier has different load rules, size limits, and lifecycle patterns. The result is an agent that boots up in milliseconds with exactly the context it needs: no more, no less.
The rest of this blueprint shows you exactly how to build it.
The Architecture: Four Tiers, Four Files
The complete mental model: what each tier stores, when it loads, and why the boundaries matter.
The Mental Model: Separate What Changes at Different Speeds
The core insight behind the 4-tier system is simple: different types of memory change at different speeds, and they should be stored separately. An agent’s identity never changes mid-run. Its working state changes every run. Its accumulated wisdom changes weekly. Its event history is append-only.
When you store all of these in one file (or one database table, or one conversation history), you create a mess. The agent reloads its entire identity every time it just needs to check what it was working on. It scans through months of event history to find today’s state. Token waste compounds, latency increases, and the agent’s attention degrades.
The 4-tier system fixes this by giving each type of memory its own file with its own load rules:
Tier 1: core.md - Who the Agent Is
Core memory is the agent’s identity. It answers the questions: Who am I? What do I own? What are my rules? This file is written once when the agent is created and rarely changes afterward, maybe a few times per month when the agent’s responsibilities shift.
A production core.md contains these sections:
- Identity ΓÇö one or two sentences describing the agent’s role
- Mission ΓÇö what the agent exists to accomplish (2-3 bullet points)
- Ownership boundaries ΓÇö what it owns AND explicitly what it does not own
- Core heuristics ΓÇö the decision rules that guide every action
- Key rules ΓÇö critical behavioral constraints
Here’s what a real production core.md looks like. This is from a finance management agent tracking a family’s budget, bills, and debt across ten accounts:
# Finance Manager - Core Identity
## Last Updated
2026-04-23
## Identity
Family Budget & Bills Manager. Owns budget tracking,
bill payments, expense categorization, savings goals,
and debt management.
## Key Context
- Income: Primary earner is a software engineer.
Second earner recently lost job. Single-income household.
- Total debt: $112,790 across 10 accounts. $1,252/mo minimums.
- Strategy: DEBT SNOWBALL - smallest balance first,
one target at a time.
## Monthly Budget Targets
| Category | Budget |
|---|---|
| Housing | $2,500 |
| Groceries | $800 |
| Insurance | $500 |
| Transportation | $400 |
| Utilities | $350 |
| Dining | $300 |
| ...12 categories total... |
| **Total** | **$5,950** |
## Snowball Order
1. 🔴 TJX $214 ← ACTIVE TARGET
2. Case $323
3. Home Depot $367
4. Savor $783
...
Size limit: 3-5KB. Core memory must be small because it’s loaded on every single agent run. If you can’t fit your agent’s identity in 3-5KB, you’re probably storing operational state in the wrong tier. Move current data to working memory, move history to long-term.
Tier 2: working.md - What the Agent Is Doing Right Now
Working memory is the agent’s scratchpad. It changes every run, sometimes multiple times per run. It answers: What am I working on? What happened recently? What’s pending?
This is the tier most developers get wrong. They either skip it entirely (forcing the agent to rediscover its current state from scratch each time) or they let it grow unbounded (polluting the context window with weeks of stale data).
A production working.md follows this structure:
# Content Scheduler - Working Memory
## Last Updated
2026-05-14T16:45:00-05:00
## Current Queue State
- Total scheduled: ~1,238 posts (13 pages @ 100)
- Near-term window (May 14-20): 28 posts
- Near-term health: 0 collisions, 0 spacing violations
- Collision-free streak: 125 cycles
## Active Issues / Blockers
1. Brand watch: 8 far-out posts need content reframe
2. Analytics blocked: returns 402. Skip analytics calls.
3. Token warning: TikTok expires tomorrow 3:08 AM
## Current Ordering Decisions
- Monday lineup reviewed: 5 active clusters this week
- Campaign launch verified: all spacing/collision rules hold
- Held drafts: 2 pregnancy/twins drafts remain paused
## Active Scheduling Rules
1. No same-platform collisions
2. ≥2h spacing between same-platform posts
3. Cascade order: LinkedIn → YouTube → Twitter → TikTok
Size limit: 5KB max. This is the most important constraint in the entire system. When working memory exceeds 5KB, the agent starts spending more attention on old state than on the current task. I’ve seen agents degrade noticeably once working memory crosses 8KB: their responses get slower, less focused, and more prone to referencing stale information.
The 5KB limit forces discipline: if you can’t fit your current state in 5KB, you need to prune completed items, summarize patterns instead of listing every instance, and move validated learnings to long-term memory.
Tier 3: long-term.md ΓÇö What the Agent Has Learned
Long-term memory is the agent’s accumulated wisdom. Unlike working memory (which reflects current state), long-term memory captures validated patterns and lessons that apply across time. It changes infrequently: new entries are added weekly or monthly, not every run.
The critical word is validated. Not every observation belongs in long-term memory. An interesting finding from one run is a note in working memory. A pattern confirmed across five runs is a candidate for long-term. A lesson learned from a mistake that keeps recurring? That belongs here.
A production long-term.md looks like this:
# Finance Manager - Long-Term Memory
## Last Updated
2026-05-11
## History & Learnings
### 2026-04-14 - MILESTONE: Full Debt Profile Received
- 10 accounts, $111,454 total, $1,277/mo minimums.
- Credit cards: $22,747 across 8 cards (78.3% utilization).
- Resolved "Citi card" mystery = Home Depot CC.
- Strategy: Debt snowball confirmed by user.
### 2026-04-17 - TWINS BORN / NICU FINANCIAL EVENT
- Premature delivery at 29-30 weeks. NICU stay 6-10+ weeks.
- Financial impact: potentially $100K-$500K+ before insurance.
- Strategy shift: pausing snowball extra payments.
- Key lesson: Not having autopay set up on most accounts
is a major risk when a family crisis hits.
### 2026-04-25 - Credit Limit Pressure Pattern
- 2nd credit incident in 10 days: autopay bounced (NSF),
then credit card declined at vendor.
- Pattern: 83.3% CC utilization + emergency spending =
credit availability crisis.
### 2026-04 Monthly Close
- April closed at roughly $6,336 against $5,950 budget
(~106%) during the NICU transition month.
- Operational lesson: autopay coverage and bill-DB accuracy
mattered more than aggressive payoff during crisis mode.
Size limit: 10KB. Long-term memory is loaded on demand, only when the agent needs historical context for a specific decision. This means it can be larger than core or working memory, but it still has a ceiling. When long-term exceeds 10KB, consolidate similar lessons into single entries and remove patterns that have been promoted into the agent’s core rules or platform-level skills.
Tier 4: events.log - What Has Happened
The events log is an append-only audit trail. It’s the agent’s journal: one line per significant action, never edited, never bulk-loaded. It exists for debugging, auditing, and extracting patterns during maintenance cycles.
A production events.log follows a strict one-line format:
[2026-05-05T06:30:00-05:00] daily-review: Synced accounts. 3 budget categories over. Alerted user.
[2026-05-05T11:02:25-05:00] email-triage: Scanned inbox - no new receipts to log.
[2026-05-05T15:00:00-05:00] checkin: Payment confirmed. Applied Payment Logged=Clear Reminder rule.
[2026-05-05T17:00:00-05:00] checkin: Fixed auto_pay DB flags for 7 cards. Removed 5 duplicate txns.
[2026-05-06T06:30:00-05:00] daily-review: 6 txns, new merchant flagged. Entertainment at 347%.
[2026-05-07T06:30:00-05:00] daily-review: Dining hit 100% budget Day 7 - alert sent.
[2026-05-07T11:02:00-05:00] email-triage: Logged subscription. Logged freelance income. Score +12 pts.
Format rules:
- ISO-8601 timestamp with timezone in square brackets
- Lowercase action verb after the timestamp (scan:, create:, fix:, notify:, skip:)
- Each line under 120 characters
- One line per significant action, not every trivial step
Size limit: Unlimited, but prune entries older than 30 days. The events log is never loaded in bulk during normal operations. Agents write to it; humans and maintenance agents read it. Keep milestone entries regardless of age; they’re the backbone of the agent’s history.
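The 30-day prune could be sketched roughly like this. It is an illustration under stated assumptions, not the platform's actual pruner: the text above does not say how milestone entries are marked, so `MILESTONE_MARKER` is a hypothetical convention, and `prune_events` is a made-up name.

```python
from datetime import datetime, timedelta

# Assumption: milestone entries contain this marker somewhere in the line.
MILESTONE_MARKER = "MILESTONE"


def prune_events(lines, now, max_age_days=30):
    """Return the lines to keep: recent entries plus milestones of any age."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        try:
            # The timestamp sits between the leading "[" and the first "]".
            stamp = datetime.fromisoformat(line[1:line.index("]")])
            is_recent = stamp >= cutoff
        except (ValueError, TypeError):
            # Unparseable or timezone-less lines are kept, never silently dropped.
            is_recent = True
        if is_recent or MILESTONE_MARKER in line:
            kept.append(line)
    return kept
```

Pass a timezone-aware `now`. Because the log is append-only during normal runs, a rewrite like this belongs in a maintenance cycle rather than in the agent's own run.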
The Directory Convention
Every agent gets a directory under a shared root. The structure is identical for every agent, with no exceptions and no custom layouts:
data/agents/{agent-name}/
├── core.md # Tier 1 - identity, rules, preferences
├── working.md # Tier 2 - current state, today's context
├── long-term.md # Tier 3 - accumulated wisdom, patterns
└── events.log # Tier 4 - append-only audit trail
This consistency is load-bearing. When you have 40+ agents, any maintenance tool (auditor, pruner, template sync) can walk the directory tree and find exactly what it expects. No agent gets a special snowflake layout. The directory name IS the agent name. The file names are always the same four files.
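As an illustration of why the fixed layout helps, here is a minimal sketch of an auditor that walks the shared root and reports any agent directory missing one of the four files. The function name `find_incomplete_agents` is hypothetical; only the directory convention and file names come from the text above.

```python
from pathlib import Path

TIER_FILES = ("core.md", "working.md", "long-term.md", "events.log")


def find_incomplete_agents(root):
    """Map each agent directory name to the tier files it is missing."""
    problems = {}
    for agent_dir in sorted(Path(root).iterdir()):
        if not agent_dir.is_dir():
            continue  # ignore stray files next to the agent directories
        missing = [name for name in TIER_FILES if not (agent_dir / name).is_file()]
        if missing:
            problems[agent_dir.name] = missing
    return problems
```

Calling `find_incomplete_agents("data/agents")` returns an empty dict when every agent follows the convention.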
Why These Size Limits
The size limits aren’t arbitrary. They come from hard-won production experience:
- Core (3-5KB): Loaded every run. At 200 runs/week per agent across 40 agents, that’s 8,000 loads per week. Even small bloat compounds into significant token waste.
- Working (5KB): Also loaded every run, but changes constantly. The 5KB cap forces pruning discipline. I’ve measured attention degradation starting around 8KB: the agent starts referencing stale items and missing current context.
- Long-term (10KB): Loaded on-demand (maybe 10-20% of runs). Larger budget because it’s not always in the context window, but still bounded to prevent “digital hoarding.”
- Events (unlimited but pruned): Never loaded in bulk, so unbounded is fine. The 30-day prune keeps disk usage manageable and ensures the audit trail stays relevant.
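The caps above are easy to check mechanically. A minimal sketch, assuming 1KB means 1,024 bytes and using the upper bound of the 3-5KB range for core memory; `TIER_LIMITS_BYTES` and `over_limit` are illustrative names, not part of the platform.

```python
# Upper bounds per tier file, in bytes. events.log is intentionally absent
# because it is unbounded (it is pruned by age instead).
TIER_LIMITS_BYTES = {
    "core.md": 5 * 1024,
    "working.md": 5 * 1024,
    "long-term.md": 10 * 1024,
}


def over_limit(sizes):
    """Given {filename: size in bytes}, return the files that exceed their cap."""
    return sorted(
        name
        for name, size in sizes.items()
        if name in TIER_LIMITS_BYTES and size > TIER_LIMITS_BYTES[name]
    )
```

In practice the sizes would come from something like `Path(...).stat().st_size` for each file in the agent's directory.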
The Templates: Production-Ready Files
Copy-paste templates for every tier, with inline comments explaining every section.
The core.md Template
Copy this template for every new agent. The sections are ordered by importance (identity first, rules last) because agents process context top-to-bottom and attention is strongest at the beginning of the window.
# {Agent Name} - Core Identity
## Last Updated
{ISO-8601 timestamp, e.g. 2026-05-06}
## Identity
{1-2 sentences: who this agent is and what it does.
Be specific - "Family Budget & Bills Manager" not "Finance Agent"}
## Mission
- {Primary responsibility - the ONE thing this agent exists to do}
- {Secondary responsibility}
- {Tertiary responsibility, if applicable}
## Ownership Boundaries
### You own
- {Explicit list of domains this agent controls}
- {Be specific: "bill payment tracking" not "finance stuff"}
### You do NOT own
- {Explicit exclusions - prevents scope creep}
- {Example: "Investment decisions (user handles directly)"}
- {Example: "Meal planning (nutrition-chef agent owns this)"}
## Key Context
{3-5 bullet points of essential domain knowledge.
This is information the agent needs on EVERY run
regardless of what task it's performing.
Examples: account balances, family members, key dates.}
## Core Heuristics
1. {Decision rule that guides this agent's behavior}
2. {Example: "Always check both calendars before scheduling"}
3. {Example: "Auto-pay bills should never generate reminder tasks"}
4. {Example: "Notify user only for items requiring human action"}
## Key Rules
- {Critical behavioral constraints - things that must NEVER happen}
- {Example: "NEVER skip the memory save step at end of run"}
- {Example: "NEVER create duplicate tasks - always check first"}
Template notes:
- Last Updated: don’t skip this. It’s how maintenance tools detect stale core files. Use ISO-8601 format (YYYY-MM-DD minimum, full timestamp preferred).
- “You do NOT own”: this section prevents the most common multi-agent failure, scope creep. When agent A starts doing agent B’s job, neither does it well. Explicit exclusions make boundaries clear.
- Key Context: only include information that’s relevant to EVERY run. Current project status belongs in working.md, not here.
- Core Heuristics vs Key Rules: heuristics are “prefer X over Y” guidance. Rules are “NEVER do X” constraints. The distinction matters because rules are absolute while heuristics allow judgment.
The working.md Template
Working memory is the most frequently written file in the system. Design it for fast scanning: the agent reads this at the start of every run and needs to orient itself in seconds.
# {Agent Name} - Working Memory
## Last Updated
{Full ISO-8601 with timezone, e.g. 2026-05-06T07:30:00-05:00}
## Current State
{What's active RIGHT NOW - 3-5 bullet points max.
This section answers: "If I woke up with amnesia,
what do I need to know to continue my work?"}
- {Active task or project status}
- {Key metrics or thresholds being monitored}
- {Blockers or waiting-for items}
## Recent Actions
{What happened in the last 1-3 runs.
Keep this section FIFO - newest first, oldest pruned.}
- {Run timestamp}: {What was done, what was the outcome}
- {Run timestamp}: {What was done, what was the outcome}
## Pending / Deferred
{Items waiting for input or scheduled for later.
Each item should note WHY it's pending.}
- {Item}: waiting for {reason}
- {Item}: deferred until {date/condition}
## Active Rules
{Temporary rules or sprint-mode adjustments.
These override or augment core rules for a limited time.}
- {Example: "NICU mode - minimize non-urgent notifications"}
- {Example: "Content blitz - prioritize publishing over research"}
Template notes:
- Full timestamp with timezone: working memory timestamps must include the timezone. Agents running on different schedules (some at 6 AM, some at 9 PM) need to know exactly when the last update happened. “May 6” is ambiguous; “2026-05-06T07:30:00-05:00” is not.
- “3-5 bullet points max” in Current State: this is the first thing the agent reads. If it’s 20 bullet points, the agent will spend its attention budget on orientation instead of action. Ruthlessly prioritize.
- Recent Actions is FIFO: newest first, and actively remove entries older than 3 runs. If something from 5 runs ago is still relevant, it should be in Current State or promoted to long-term memory.
- Active Rules are temporary: if a rule persists for more than 2 weeks, move it to core.md. Active Rules is for situational overrides, not permanent behavior.
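The "older than 3 runs" rule for Recent Actions can be applied with a few lines of text processing. This is a minimal sketch that assumes each action is a single-line bullet, newest first, under a heading spelled exactly `## Recent Actions` as in the template; `trim_recent_actions` is a hypothetical helper name.

```python
def trim_recent_actions(markdown, keep=3):
    """Keep only the first `keep` bullets under the '## Recent Actions' heading."""
    out = []
    in_section = False
    seen = 0
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Entering a new section: track whether it is the one we trim.
            in_section = line.strip() == "## Recent Actions"
            seen = 0
        elif in_section and line.startswith("- "):
            seen += 1
            if seen > keep:
                continue  # older entry: drop it
        out.append(line)
    return "\n".join(out)
```

Bullets in every other section pass through untouched, so the same function is safe to run on the whole file before saving.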
The long-term.md Template
# {Agent Name} - Long-Term Memory
## Last Updated
{ISO-8601 timestamp}
## History & Learnings
### {Date} - {Milestone or Pattern Title}
- {What happened}
- {What was learned}
- {How this affects future decisions}
### {Date} - {Another Milestone}
- {Context: why this matters}
- {Pattern discovered: describe the repeatable insight}
- {Action taken: what changed as a result}
## Recurring Patterns
{Patterns that have been validated across multiple runs.
Each pattern should include WHEN it was first observed
and HOW MANY times it's been confirmed.}
- {Pattern}: first seen {date}, confirmed {N} times
- {Pattern}: first seen {date}, confirmed {N} times
## Decisions Made
{Significant decisions and their reasoning.
Future agents (or future you) will want to know
WHY a choice was made, not just WHAT was chosen.}
- {Decision}: {reasoning} ({date})
## Ideas & Improvements
{Potential improvements that haven't been validated yet.
These are hypotheses, not facts.}
The events.log Format Specification
Events.log is the simplest tier: no template, just a strict format for each line:
[{ISO-8601 timestamp with timezone}] {action-verb}: {description under 120 chars total}
# Examples:
[2026-05-06T07:30:00-05:00] scan: Email inbox - 12 messages, 3 receipts logged, 0 urgent
[2026-05-06T07:31:00-05:00] create: Task "pay water bill" due 2026-05-10, assigned to user
[2026-05-06T07:32:00-05:00] fix: Duplicate task merged - kept "insurance deadline" (more detail)
[2026-05-06T07:33:00-05:00] notify: Telegram sent - 3 budget categories over threshold
[2026-05-06T07:34:00-05:00] skip: Meal plan empty but not owned by this agent - no action
Action verb vocabulary: Use consistent verbs across all agents. The standard set is: scan, create, update, fix, notify, skip, close, prune, promote, escalate, error. You can extend this list, but keep verbs lowercase and single-word.
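The line format and verb vocabulary above can be enforced at write time. A minimal sketch, where `format_event` and its `allowed_verbs` parameter are illustrative names; the verb set and the 120-character ceiling are taken from the specification above.

```python
from datetime import datetime

STANDARD_VERBS = frozenset({
    "scan", "create", "update", "fix", "notify", "skip",
    "close", "prune", "promote", "escalate", "error",
})
MAX_LINE_LENGTH = 120  # each line must stay under this many characters


def format_event(when, verb, description, allowed_verbs=STANDARD_VERBS):
    """Build one events.log line, rejecting input that breaks the format rules."""
    if when.tzinfo is None:
        raise ValueError("timestamp must carry a timezone offset")
    if verb not in allowed_verbs:
        raise ValueError(f"unknown action verb: {verb!r}")
    line = f"[{when.isoformat(timespec='seconds')}] {verb}: {description}"
    if len(line) >= MAX_LINE_LENGTH:
        raise ValueError(f"line is {len(line)} characters; limit is under {MAX_LINE_LENGTH}")
    return line
```

Appending is then a plain `open(path, "a")` followed by writing the returned line plus a newline. To extend the vocabulary, pass a larger set as `allowed_verbs`.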
Full Agent Example: Platform Manager (All 4 Tiers)
Here’s how the four tiers work together for a real production agent: the platform manager, which oversees 40+ other agents, 20+ extensions, and 44+ cron jobs:
- core.md (Tier 1): “Meta-agent that owns the entire assistant platform. Operating philosophy: Detect → Fix → Report.” Contains the autonomous capabilities list (what it can fix without asking), the platform inventory counts, and critical rules like “CRON DISPATCH - Always launch fresh agents.”
- working.md (Tier 2): Current agent count (49), extension count (24), cron job count (44+), token health for 5 social media platforms (with exact expiry timestamps), tonight’s reflection results, and active issues. Changes every nightly cycle.
- long-term.md (Tier 3): Chronological log of every major platform improvement, from the first email triage pipeline to the autonomous nightly maintenance protocol. Contains the health log showing system state at key moments. Used when debugging recurring issues or planning new features.
- events.log (Tier 4): One-line entries for every nightly cycle: “Verified 44 cron jobs - no phantom entries”, “Auto-trimmed working memory 12.1→3.7KB”, “TikTok token recovered 🟢”. Pruned monthly, milestone entries preserved.
Notice the separation: the core file hasn’t changed since May 5. The working file changes every night. The long-term file gets a new entry maybe once a week. The events log gets 5-10 entries per nightly cycle. Each tier has its own rhythm.
The Lifecycle: Load, Use, Save
The exact sequence every agent follows: boot, load memory, do work, save state, shut down.
The Boot Sequence: Load Memory First, Always
Every agent run starts the same way: load Tier 1 and Tier 2 before doing anything else. This is non-negotiable. An agent that skips memory loading is flying blind, and an agent that loads in the wrong order will process identity through the lens of current state instead of the other way around.
The mandatory load sequence:
# Step 1: Load core identity (Tier 1) - ALWAYS
Read data/agents/{agent-name}/core.md
# Step 2: Load working state (Tier 2) - ALWAYS
Read data/agents/{agent-name}/working.md
# Step 3: Begin actual work
# (The agent now knows WHO it is and WHAT it was doing)
That’s it. Two file reads. No database connections, no API calls, no vector similarity searches. The agent reads two Markdown files and it’s oriented. In a system with 40+ agents running on cron schedules every 20-30 minutes, this boot sequence takes negligible time.
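For agents driven from a script rather than a prompt instruction, the same two reads might look like the sketch below. How the combined text reaches the model depends on your framework, so `load_boot_context` (a hypothetical name) just returns a string with identity first and working state second.

```python
from pathlib import Path


def load_boot_context(agents_root, agent_name):
    """Read Tier 1, then Tier 2, and return them as one context string."""
    agent_dir = Path(agents_root) / agent_name
    core = (agent_dir / "core.md").read_text(encoding="utf-8")
    working = (agent_dir / "working.md").read_text(encoding="utf-8")
    # Identity comes first so current state is read in light of it.
    return core.rstrip() + "\n\n" + working.rstrip() + "\n"
```

A missing file raises `FileNotFoundError`, which is deliberate here: an agent that cannot load its memory should stop rather than run blind.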
On-Demand Loading: Tier 3
Long-term memory (Tier 3) is NOT loaded at boot. It’s loaded only when the agent encounters a situation that requires historical context. The trigger conditions are specific:
- Pattern question: “Have we seen this before?” Check long-term for recurring patterns.
- Decision context: “Why did we choose X over Y last time?” Check long-term for decision history.
- Debugging: “This keeps happening.” Check long-term for previous occurrences and resolutions.
- Milestone check: “What’s our progress over time?” Check long-term for the history timeline.
If none of these conditions apply, which is the case for the majority of routine runs, Tier 3 stays on disk, saving token budget for actual work.
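One way to express the trigger conditions in code is a gate in front of the file read. This is only a sketch: the text does not say how a run gets classified, so the `task_kind` labels below are hypothetical stand-ins for the four conditions.

```python
from pathlib import Path

# Hypothetical labels for the four trigger conditions listed above.
LONG_TERM_TRIGGERS = frozenset({"pattern", "decision", "debugging", "milestone"})


def load_long_term_if_needed(agent_dir, task_kind):
    """Return long-term.md's text for trigger situations, otherwise None."""
    if task_kind not in LONG_TERM_TRIGGERS:
        return None  # routine run: Tier 3 stays on disk
    return (Path(agent_dir) / "long-term.md").read_text(encoding="utf-8")
```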
Why Tier 4 Is Never Bulk-Loaded
The events log exists for three consumers: humans reading it directly, maintenance agents scanning for patterns, and auditing tools checking agent behavior. The primary agent itself never reads its own events log during a normal run. Here’s why:
- Events are raw data. They need aggregation and analysis to be useful, which is the job of maintenance agents, not the primary agent
- After 30 days, an events.log for an active agent can be thousands of lines. Bulk-loading this into the context window would destroy attention on everything else
- Any insight from the events log that matters for the agent’s behavior should have already been promoted to long-term memory or encoded as a heuristic in core memory
The Save Sequence: State Persistence Before Shutdown
The save sequence is the mirror of the boot sequence, and it’s where most memory systems fail. Developers build great load logic and then forget to save. The agent does brilliant work, the session ends, and all of that context evaporates because nobody wrote it to disk.
The mandatory save sequence runs at the end of every agent run:
Step 1: Update working.md
Write the current state back to working.md. This includes:
- What actions were taken this run
- Any state changes (new data discovered, status updates, items completed)
- Deferred items or new blockers
- Updated “Last Updated” timestamp, always ISO-8601 with timezone
The timestamp format matters. Use full precision with timezone offset:
## Last Updated
2026-05-06T07:30:00-05:00

Not “May 6”. Not “2026-05-06”. Not “7:30 AM”. The full ISO-8601 timestamp with timezone. This is how maintenance tools detect stale memory: they compare the Last Updated timestamp to the current time and flag anything older than 3 days for agents with active cron schedules.
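For agents that compute the timestamp in code, here is a minimal Python sketch that produces exactly this format. The helper name last_updated_stamp is hypothetical; the only library behavior relied on is that datetime.now().astimezone() attaches the machine’s local UTC offset.

```python
from datetime import datetime
from typing import Optional


def last_updated_stamp(now: Optional[datetime] = None) -> str:
    """Format a timezone-aware ISO-8601 timestamp with second precision."""
    # astimezone() with no argument attaches the local UTC offset.
    current = now if now is not None else datetime.now().astimezone()
    if current.tzinfo is None:
        # Refuse naive datetimes instead of silently writing an ambiguous value.
        raise ValueError("timestamp must carry a timezone offset")
    return current.isoformat(timespec="seconds")
```

Rejecting naive datetimes keeps an ambiguous value from ever reaching working.md.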
Step 2: Append to events.log
Append one-line entries for each significant action taken during this run. “Significant” means actions that changed state, created artifacts, or made decisions. Don’t log “read file X” or “checked working memory”; log “created task for overdue bill” or “flagged 3 posts for brand safety review.”
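A minimal Python sketch of the append step, assuming the “[ISO timestamp] action: description” entry format that the memory management skill specifies. The function name append_event is illustrative.

```python
from datetime import datetime
from pathlib import Path


def append_event(log_path: Path, action: str, description: str) -> str:
    """Append one '[timestamp] action: description' line and return it."""
    stamp = datetime.now().astimezone().isoformat(timespec="seconds")
    # Collapse whitespace and newlines so each event stays on exactly one line.
    line = f"[{stamp}] {action}: {' '.join(description.split())}"
    with log_path.open("a", encoding="utf-8") as log:  # "a" never rewrites old entries
        log.write(line + "\n")
    return line
```

Opening the file in append mode is what keeps the log append-only: earlier entries are never rewritten by a normal run.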
Step 3: Promote to long-term.md (Only When Justified)
This step is optional and should be rare ΓÇö maybe once every 5-10 runs. Append to long-term.md only when:
- A new repeatable pattern was discovered and validated across multiple runs
- A significant milestone was reached (first sale, system migration complete, crisis resolved)
- A lesson was learned from a mistake that will affect future decisions
- A heuristic was proven correct (or wrong) across enough data points
Do NOT promote: one-off events that won’t recur, transient state that belongs in working memory, raw data dumps, or anything that hasn’t been validated.
What Happens When an Agent Crashes Mid-Run
This is the pragmatic question every memory system must answer. If the agent crashes (timeout, API failure, uncaught error) before completing its save sequence, what happens?
With the file-based system, the answer is simple: the previous state persists. Working.md still contains the last successful save. Events.log has all entries up to the crash point (if the crash happened after Step 2). Long-term.md is unchanged.
The next run boots up, loads the slightly-stale working memory, and continues. It might redo some work from the crashed run, but it won’t lose its identity, its history, or its accumulated wisdom. This is a fundamental advantage of file-based memory over database-backed systems: there are no half-committed transactions, no connection pool exhaustion, no migration failures. The worst case is “slightly stale working memory,” which the next successful run fixes automatically.
In practice, across thousands of agent runs on a production platform, crash recovery has never required manual intervention. The agent just picks up where the last successful save left off.
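The “previous state persists” guarantee depends on a crash never leaving working.md half-written. One common way to get that property, shown here as an assumption rather than as what the production platform does, is to write to a temporary file and swap it into place. The helper name save_atomically is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def save_atomically(path: Path, content: str) -> None:
    """Replace `path` with `content` in one step, or leave it untouched."""
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the swap
        os.replace(tmp_name, path)  # a successful rename is atomic on POSIX
    except BaseException:
        # Clean up the temp file; the previous working.md is still intact.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before os.replace runs, the next boot still reads the last complete save.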
The Memory Skill: Codifying the Rules
How to turn memory management into a reusable skill that every agent inherits automatically.
From Manual to Automatic: The Skill Pattern
The first version of the 4-tier memory system was manual. Each agent had its own copy of the load/save logic embedded directly in its instructions. This worked for 5 agents. At 15 agents, maintaining consistency was painful. At 30+, it was unsustainable: every time a rule changed (like adjusting the working memory size limit from 8KB to 5KB), I had to update every agent’s instructions individually.
The solution: extract the memory management logic into a skill, a self-contained instruction file that any agent can reference.
What Is a Skill?
A skill is a reusable instruction document that agents consume at runtime. It’s not a code library. It’s a set of rules, templates, and procedures written in Markdown that become part of an agent’s context when referenced. Think of skills as shared documentation that agents actually follow, not just read.
Skills live in a standard directory structure:
.github/skills/{skill-name}/
└── SKILL.md    # The complete skill definition

Each skill has YAML frontmatter (name + description with trigger phrases) followed by the complete self-contained instructions. A skill must contain everything an agent needs to follow it: no external dependencies, no “see also” references to other docs.
The Memory Management Skill
Here’s the actual skill definition that governs memory management across 30+ production agents. This is the single source of truth: when this file changes, every agent that references it picks up the new behavior automatically on its next run.
---
name: memory-management
description: >-
  4-tier memory system management for all stateful agents:
  loading, saving, pruning, promoting, and maintaining
  memory files. Use when user says "load memory",
  "save memory", "update working memory", "prune memory",
  "memory management", "tier system", or any agent
  memory lifecycle activity.
---
# Memory Management Skill
Standard 4-tier memory system used by all stateful agents.
## The 4-Tier System
| Tier | File | Purpose | Load Rule | Size Limit |
|------|------|---------|-----------|-----------|
| 1 | core.md | Identity, mission, heuristics | ALWAYS load first | 3-5KB |
| 2 | working.md | Current state, active context | ALWAYS load second | 5KB max |
| 3 | long-term.md | Historical patterns, lessons | On-demand only | 10KB max |
| 4 | events.log | Append-only event stream | Never bulk-load | Unlimited (prune >30 days) |
## First Action: Load Memory (Every Run)
MANDATORY for every stateful agent at run start:
1. Read data/agents/{agent-name}/core.md # Tier 1 - ALWAYS
2. Read data/agents/{agent-name}/working.md # Tier 2 - ALWAYS
Do NOT bulk-load Tier 3 (long-term.md).
Never load Tier 4 (events.log) in bulk.
## Last Action: Save Memory (Every Run)
### Step 1: Update Working Memory (Tier 2)
Update working.md with: what happened, state changes,
deferred items. Update "Last Updated" timestamp.
Keep under 5KB.
### Step 2: Append to Events Log (Tier 4)
One-line entries: [ISO timestamp] action: description
### Step 3: Promote to Long-Term (Tier 3) - ONLY when justified
Append ONLY for validated patterns, milestones, or lessons.

How Agents Reference the Skill
In the agent’s definition file (the instructions document that defines the agent), you add a skill reference directive. The exact syntax depends on your agent framework, but the pattern is the same: point the agent at the skill file and tell it to follow those rules.
In the production system, agent definitions include a directive like:
## Memory (4-Tier System) - see memory-management skill
**Load first:** data/agents/{agent-name}/core.md (Tier 1)
+ data/agents/{agent-name}/working.md (Tier 2).
On-demand: long-term.md (Tier 3).
**Save last:** Update working.md, append events.log,
promote to long-term.md only for validated patterns.

The agent doesn’t embed the full memory management logic; it references the skill. If the skill changes (say, the working memory size limit drops from 5KB to 4KB), every agent that references it automatically follows the new rule on its next run.
The Skill-First Scaling Principle
Memory management was the first skill extracted, but the pattern applies to any repeatable capability. In the production platform, 60+ skills cover everything from Telegram communication rules to content scheduling cadences to email encoding requirements.
The governing principle is: any repeatable process that more than one agent follows should be a skill, not embedded logic.
To determine whether something should be a skill, apply these criteria:
- Is it used by more than one agent? → Extract to a skill.
- Does it change independently of the agents that use it? → Extract to a skill. Memory rules might change monthly; the finance agent’s identity rarely changes.
- Would inconsistency between agents cause problems? → Extract to a skill. If one agent prunes at 5KB and another at 10KB, the system is unpredictable.
- Is it complex enough to get wrong? → Extract to a skill. Better to get the logic right once in a shared file than risk N different implementations with N different bugs.
The memory management skill was the proof-of-concept for this pattern. Once it worked, the platform went from 5 skills to 60+ in three weeks, and the consistency improvement across agents was immediately measurable.
Pruning and Promotion: Keeping Memory Clean
The rules that prevent memory bloat: when to prune, when to promote, and when to delete.
Why Pruning Is the Hardest Part
Building the memory system is the easy part. Keeping it clean is the hard part. Without active pruning, working memory grows past its 5KB limit, long-term memory becomes a dump of every observation ever made, and events.log becomes a multi-megabyte file nobody reads.
I learned this the hard way. Three weeks into running the platform, a task-coaching agent’s memory file had ballooned to 101KB. A content scheduling agent hit 77KB. The agents were slow, confused, and spending more time reading stale context than doing useful work. The solution: an emergency migration that compressed those files back down to under 5KB each, plus a set of pruning rules to prevent it from ever happening again.
Working Memory Pruning (Weekly Cadence)
Working memory pruning runs on a weekly cycle. The rules:
- Remove completed items older than 7 days. If a task was done last Tuesday, it doesn’t belong in current state. If the completion is significant, it was already logged in events.log.
- Collapse repeated entries into summaries. If the agent has checked the same metric 5 times in 5 runs, replace the 5 entries with: “Metric X stable at Y for past 5 checks.”
- Archive deferred items older than 14 days. If something has been “pending” for two weeks, either escalate it (create a task for the user), move it to long-term memory, or remove it. Working memory is not a parking lot.
- Target: always under 5KB. Measure the file size. If it’s over 5KB after pruning, be more aggressive: summarize paragraphs into bullet points, remove verbose notes, merge related items.
In the production system, a context auditor agent runs weekly scans across all 30+ agent memory files. It flags any working.md over 5KB, any core.md over 5KB, and any working memory with a “Last Updated” timestamp more than 3 days old for agents with active cron schedules. This automated pruning pressure is what keeps the system healthy at scale.
Long-Term Memory Consolidation (Monthly Cadence)
Long-term memory grows slowly, maybe one or two entries per week, but over months it accumulates. Monthly consolidation ensures it stays under 10KB and remains useful:
- Remove patterns now captured in skills. If a recurring lesson was extracted into a platform skill (e.g., “always check both calendars”), the long-term entry teaching that lesson is redundant. Remove it.
- Consolidate similar lessons. If three separate entries all teach “don’t create duplicate tasks,” merge them into one entry with the best example.
- Remove context that’s become standard. If a decision from three months ago is now embedded in the agent’s core.md as a rule, the long-term entry explaining the decision’s background can be shortened or removed.
Events Log Pruning (30-Day Rolling Window)
Events.log is the simplest to prune: delete entries older than 30 days, but preserve milestone entries regardless of age.
Milestone entries are marked by their content: entries about first occurrences, system migrations, crisis responses, or major achievements. In practice, you’ll know a milestone when you see it:
# Regular entry, safe to prune after 30 days:
[2026-04-15T11:00:00-05:00] email-triage: Scanned 20 messages, 2 receipts logged
# Milestone entry, keep forever:
[2026-04-17T08:00:00-05:00] milestone: TWINS BORN - switching to NICU crisis mode
[2026-04-20T21:00:00-05:00] milestone: FIRST AUTONOMOUS MAINTENANCE CYCLE - 20 fixes, 0 human input

The Promotion Decision Flowchart
The hardest judgment call in the memory system is: should this observation in working memory be promoted to long-term memory? Use this decision tree:
- Is this a one-off event? (Yes → log it in events.log, do NOT promote. Examples: “scanned inbox, found 3 receipts” or “API returned 500 error once.”)
- Has this pattern been confirmed across 3+ runs? (No → keep it in working memory as a hypothesis. It’s not validated yet.)
- Will this affect future decisions? (No → it’s interesting trivia, not actionable memory. Don’t promote.)
- Is this already captured in a skill or core rule? (Yes → the lesson is already encoded. No need to duplicate it in long-term.)
- If you answered “No, Yes (3+ runs), Yes, No” → promote it to long-term.md with the date, what was observed, what was learned, and how it affects future behavior.
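The four questions reduce to a short chain of early exits. A minimal Python sketch follows; the function and parameter names are illustrative, and in practice the agent or a maintenance agent supplies the answers through its own judgment.

```python
def should_promote(one_off: bool, confirmed_runs: int,
                   affects_future: bool, already_captured: bool) -> bool:
    """Apply the four promotion questions in order."""
    if one_off:               # Q1: one-off events stay in events.log
        return False
    if confirmed_runs < 3:    # Q2: under 3 runs it is still a hypothesis
        return False
    if not affects_future:    # Q3: trivia, not actionable memory
        return False
    if already_captured:      # Q4: a skill or core rule already covers it
        return False
    return True
```

Only a recurring, validated, actionable, not-yet-encoded observation makes it through all four checks.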
Staleness Detection
A working memory file is stale when it contains outdated information that could mislead the agent. The detection rules:
- Timestamp check: If the “Last Updated” timestamp is more than 3 days old AND the agent has an active cron schedule, the memory is stale. An agent running every 30 minutes shouldn’t have 3-day-old working memory.
- Temporal reference check: If working memory contains “today,” “this week,” or “tomorrow” and the Last Updated timestamp is from a different week, those references are misleading.
- Unresolved items check: If working memory references dates or events that have already passed without any resolution note, those items are stale.
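The timestamp check is the easiest of the three rules to automate. Below is a minimal Python sketch, assuming the timestamp sits on the line after a “## Last Updated” heading as in the earlier example. The names read_last_updated and is_stale are hypothetical, and treating a missing or timezone-free timestamp as stale is this sketch’s own choice.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=3)  # threshold from the text

# Assumes the timestamp follows the heading, optionally after blank lines.
_LAST_UPDATED = re.compile(r"^## Last Updated\s*\n(\S+)", re.MULTILINE)


def read_last_updated(working_md: str) -> Optional[datetime]:
    """Extract the 'Last Updated' timestamp, or None if absent or naive."""
    match = _LAST_UPDATED.search(working_md)
    if match is None:
        return None
    try:
        stamp = datetime.fromisoformat(match.group(1))
    except ValueError:
        return None
    return stamp if stamp.tzinfo is not None else None


def is_stale(working_md: str, now: datetime, has_cron: bool) -> bool:
    """True when a cron-scheduled agent's memory is more than 3 days old."""
    if not has_cron:
        return False
    stamp = read_last_updated(working_md)
    # A missing or timezone-free timestamp cannot be trusted either.
    return stamp is None or now - stamp > STALE_AFTER
```

The now argument must itself be timezone-aware, otherwise the subtraction raises a TypeError.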
When stale memory is detected (either by the agent itself or by a maintenance agent):
- Flag it: make the staleness visible (log entry, audit flag, or notification)
- On the next run, refresh all temporal references
- Remove completed items that weren’t cleaned up
- Update the timestamp to the current time
Staleness detection is one of those features that sounds optional until you’ve been bitten by an agent giving advice based on last week’s data. In the production system, the automated auditor catches 2-3 stale memory files per week, usually from agents that were temporarily disabled and then re-enabled without a memory refresh.
Anti-Patterns and Gotchas
The mistakes that will destroy your memory system, learned from hundreds of production agent runs.
Anti-Pattern 1: Bulk-Loading All Tiers at Boot
The most common mistake: loading core.md + working.md + long-term.md + events.log at the start of every run. Developers think “more context = better agent.” The opposite is true.
Loading all four tiers at boot means the agent starts every run with 20-30KB of context before it’s even seen the task. Most of that context is irrelevant to the current run. The model’s attention budget is finite: every token spent on three-month-old event logs is a token NOT spent on understanding the current task.
The fix: Load only Tier 1 + Tier 2 at boot. Load Tier 3 on-demand. Never load Tier 4 in bulk. This alone can cut boot-time token usage by 60-80%.
Anti-Pattern 2: Skipping the Save Step
This is the “it works in dev” anti-pattern. During development, you’re interacting with the agent in a single long session, so state is maintained in the conversation. You ship to production, the agent starts running on a cron schedule, and suddenly every run starts from scratch because nobody wrote the save logic.
The consequence: The agent does the same work over and over. It re-discovers the same blockers, re-creates the same tasks, and sends the same notifications because it doesn’t remember that it already handled them. I once had an agent send the same “overdue bill” alert seven times in one day because it had no working memory to record “already notified.”
The fix: Make the save sequence non-negotiable in the agent’s instructions. Use explicit language: “MANDATORY for every run: update working.md before ending.” Put it at the top of the agent’s instruction file, not buried at the bottom.
Anti-Pattern 3: Unbounded Working Memory
Working memory without a size limit grows linearly with every run. After a month, it’s 50KB of every observation the agent ever made. The agent loads this entire file at boot and spends 80% of its attention budget on stale historical data instead of the current task.
This was the single biggest performance problem in the first month of the production platform. Three agents had working memory files over 50KB. Their response quality had degraded so noticeably that I thought the underlying model had gotten worse. It hadn’t. The context was just too polluted.
The fix: Hard cap at 5KB. When approaching the limit: remove completed items older than 7 days, summarize patterns instead of listing every instance, move proven lessons to long-term, and trim verbose notes to essential facts.
Anti-Pattern 4: Storing Raw Data in Long-Term Memory
Long-term memory is for patterns and lessons, not raw data. I’ve seen agents store entire transaction lists, full API responses, and complete email contents in long-term memory. This defeats the purpose: long-term is supposed to be curated wisdom, not a data warehouse.
Example of wrong:
### 2026-05-01 - Transactions
- Chase: $43.20 DoorDash, $23.24 DoorDash, $62.77 restaurant,
  $53.53 restaurant, $30.63 fast food, $10 SaaS, $16.70 TikTok,
  $5.15 charity, $353.09 unknown vendor, $52 unknown vendor...

Example of right:
### 2026-05 Early Month Pattern
- Dining hit 100% of monthly budget by Day 7: chronic overspend.
- Two new unknown merchants flagged (Briggs Barrett, Diluu);
  verification tasks created. Pattern: ~2 unknown merchants/week.
- Subscriptions at 234% of budget. AWS and social tools
  are the primary drivers.

The first version is a data dump. The second is actionable intelligence. Long-term memory should tell the agent “here’s the pattern”, not “here’s every data point.”
Anti-Pattern 5: Never Pruning events.log
Events.log is append-only, which means it grows forever unless you actively prune it. An agent running 3 times per day generates roughly 90 log entries per month. After a year, that’s 1,000+ entries. Nobody reads a 1,000-line log file.
The fix: 30-day rolling prune. Delete entries older than 30 days, preserve milestones. Run the prune during monthly maintenance cycles.
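A minimal Python sketch of the rolling prune. It assumes entries follow the “[ISO timestamp] action: description” format and that milestones carry the action label “milestone”, as in the example entries; the text itself says milestones are recognized by content, so treat the label check as a simplification. The function name prune_events is hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Matches "[timestamp] action:" at the start of a log line.
_ENTRY = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s+(?P<action>[^:]+):")


def prune_events(lines: List[str], now: datetime, keep_days: int = 30) -> List[str]:
    """Drop entries older than keep_days; keep milestones and odd lines."""
    cutoff = now - timedelta(days=keep_days)
    kept = []
    for line in lines:
        match = _ENTRY.match(line)
        if match is None:
            kept.append(line)  # never delete what we cannot parse
            continue
        try:
            stamp = datetime.fromisoformat(match["stamp"])
        except ValueError:
            kept.append(line)
            continue
        if stamp.tzinfo is None:
            kept.append(line)  # ambiguous time: keep rather than guess
            continue
        is_milestone = match["action"].strip().lower() == "milestone"
        if is_milestone or stamp >= cutoff:
            kept.append(line)
    return kept
```

The sketch errs on the side of keeping anything it cannot classify, since a deleted log entry cannot be recovered.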
Anti-Pattern 6: Not Checking Timestamps
An agent reads its working memory and sees “Current state: Bill payment due tomorrow.” But working memory was last updated 5 days ago. “Tomorrow” was 4 days ago. The agent creates a frantic notification about a bill that was already paid.
The fix: Always read the “Last Updated” timestamp before trusting working memory content. If the timestamp is stale (>3 days old for cron agents), treat all temporal references as potentially wrong and verify before acting.
Anti-Pattern 7: Promoting Everything to Long-Term
The opposite of anti-pattern 4. Some developers treat long-term memory as “everything that ever happened.” Every run’s results get promoted. After a month, long-term memory is 30KB of undifferentiated observations.
Promote to long-term rarely, maybe once every 5-10 runs. If you’re promoting on every run, you’re treating long-term as a second events log. Use the promotion decision flowchart from Chapter 6: it must be a pattern (not a one-off), confirmed across 3+ runs, actionable for future decisions, and not already captured elsewhere.
Gotcha: Concurrent Agent Writes
What happens when two agents write to the same memory file at the same time? In a file-based system, the last write wins, and one agent’s changes get silently overwritten.
This sounds like a critical problem but it’s almost never an issue in practice, because of a fundamental design principle: each agent owns its own memory directory. The finance agent writes to data/agents/finance-manager/. The content scheduler writes to data/agents/content-scheduler/. They never write to each other’s files.
The one exception is when a maintenance agent (like a context auditor) prunes or updates another agent’s memory during a maintenance window. The solution: run maintenance agents during off-hours when the target agents aren’t scheduled. In the production system, the auditor runs at 2 AM, when no other agents are active.
Gotcha: Timezone Handling
Timestamps without timezones are ambiguous. Take “2026-05-06T07:30:00”. Is that UTC? Eastern? Central? If your agents run in different timezone contexts (or on a cloud server in UTC), ambiguous timestamps cause cascading confusion.
The rule is simple: always include the timezone offset. “2026-05-06T07:30:00-05:00” is unambiguous. In the production system, all agents compute the current time using system commands (PowerShell’s Get-Date) rather than guessing, and all timestamps include the offset.
Gotcha: Memory Format Drift
Over time, different agents develop slightly different working memory formats. One agent uses “## Current State”, another uses “## Active Context”, a third uses “## Status.” Maintenance tools that parse these files break silently.
The fix is the memory management skill from Chapter 5. By defining the canonical templates in a shared skill, all agents start from the same structure. New agents copy the template. Existing agents are gradually migrated during maintenance cycles. Format drift is caught by automated audits.
Migration Guide: Database to Files
Already using a database for agent memory? Here’s how to migrate to the 4-tier file system without losing data.
When File-Based Memory Beats a Database
File-based memory wins in these scenarios:
- AI agent systems where memory is loaded into a context window. The memory is going to become text anyway, so storing it as text from the start eliminates a serialization layer.
- Systems with fewer than 100 agents. At this scale, file I/O is negligible and the simplicity advantage of “read a file” over “connect to database, query, deserialize” is significant.
- Solo developers or small teams. No database to provision, no migrations to run, no connection strings to manage. The memory system is just files in a git repository.
- Systems where memory is human-readable. A product manager can open core.md in a text editor and understand what the agent thinks it is. Try that with a PostgreSQL row.
- Systems that benefit from version control. Memory files are git-tracked. You can see exactly when an agent’s identity changed, diff its working memory across days, and revert bad changes with a git checkout.
When You Should Keep the Database
File-based memory is not a universal solution. Keep your database if:
- You need transactional consistency. If multiple systems write to the same memory and ordering matters, a database’s ACID guarantees are worth the complexity.
- You have hundreds or thousands of agents. At 500+ agents, directory enumeration and file I/O start to matter. A database with proper indexing scales better.
- Memory items need querying by attributes. If you need “find all agents with working memory older than 3 days” as a SQL query rather than a file-walking script, a database is the right choice.
- You need real-time cross-agent data sharing. File-based memory is per-agent. If agents need to read each other’s state in real time (not on-demand), a shared database or message bus is better.
For most AI agent systems, including production platforms running dozens of agents, file-based memory is the simpler, more maintainable choice. The database becomes justified when you outgrow the file system’s capabilities, and for most teams that threshold is much further away than they think.
Step-by-Step Migration
If you’re currently storing agent memory in a database (or a single large JSON file, or conversation history), here’s the migration path:
Phase 1: Extract and Categorize
Export your current agent memory and categorize every piece of data into one of the four tiers:
- Tier 1 (Core): Anything that describes WHO the agent is: role descriptions, behavioral rules, ownership boundaries, fixed configuration. This data rarely changes.
- Tier 2 (Working): Anything that describes WHAT the agent is currently doing: active tasks, recent results, pending items, current metrics. This data changes every run.
- Tier 3 (Long-Term): Anything that represents WHAT the agent has learned: patterns observed, lessons from failures, significant milestones, validated heuristics. This data changes monthly.
- Tier 4 (Events): Anything that records WHAT happened: action logs, audit trails, timestamps of operations. This data is append-only.
If a piece of data doesn’t clearly fit into one tier, ask: “How often does this change?” Daily → Tier 2. Monthly → Tier 3. Never → Tier 1. Always → Tier 4.
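If you are categorizing a large export with a script, the tie-breaker question maps onto a small lookup table. A minimal Python sketch; the names TIER_BY_CHANGE_RATE and tier_for are illustrative.

```python
TIER_BY_CHANGE_RATE = {
    "never": 1,    # core.md: identity and fixed rules
    "daily": 2,    # working.md: current state
    "monthly": 3,  # long-term.md: patterns and lessons
    "always": 4,   # events.log: append-only action records
}


def tier_for(change_rate: str) -> int:
    """Map the answer to 'how often does this change?' to a tier number."""
    try:
        return TIER_BY_CHANGE_RATE[change_rate.strip().lower()]
    except KeyError:
        raise ValueError(f"unrecognized change rate: {change_rate!r}") from None
```

Anything that raises ValueError here is data that needs a human decision before it is filed.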
Phase 2: Create the Directory Structure
For each agent, create the standard directory:
mkdir -p data/agents/{agent-name}
touch data/agents/{agent-name}/core.md
touch data/agents/{agent-name}/working.md
touch data/agents/{agent-name}/long-term.md
touch data/agents/{agent-name}/events.log

Populate each file using the templates from Chapter 3. Fill in the content from your categorized data. Apply size limits immediately: if your extracted core content is 8KB, trim it to 5KB before writing it. The migration is the perfect time to be ruthless about what actually matters.
Phase 3: Dual-Write Transition
Don’t switch everything at once. Run a dual-write period where the agent writes to both the old system (database) and the new system (files) for 1-2 weeks. This gives you:
- A safety net: if the file-based system has issues, the database still has the data
- Validation: compare the file contents to the database contents to ensure nothing was lost
- Confidence: when the files have been consistent with the database for 2 weeks, you know the migration is solid
During the dual-write period, the agent still READS from the old system. It writes to both. This ensures zero behavior change while the new system is validated.
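The dual-write rule can be sketched as a thin wrapper around both stores. In this minimal Python sketch a plain dict stands in for the existing database, and the class name DualWriteMemory and its methods are hypothetical; swap in your own database client.

```python
from pathlib import Path
from typing import Dict, List


class DualWriteMemory:
    """Writes go to both stores; reads still come from the old one."""

    def __init__(self, old_store: Dict[str, str], agent_dir: Path) -> None:
        self.old_store = old_store  # stand-in for the existing database
        self.agent_dir = agent_dir

    def write(self, filename: str, content: str) -> None:
        self.old_store[filename] = content  # old system stays authoritative
        (self.agent_dir / filename).write_text(content, encoding="utf-8")

    def read(self, filename: str) -> str:
        return self.old_store[filename]  # no behavior change during transition

    def mismatches(self) -> List[str]:
        """Names whose file content differs from the old store."""
        out = []
        for name, content in self.old_store.items():
            path = self.agent_dir / name
            if not path.exists() or path.read_text(encoding="utf-8") != content:
                out.append(name)
        return out
```

Running mismatches() on a schedule during the transition gives you the validation step: an empty list for two weeks is the signal to cut over.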
Phase 4: Cut Over
Switch the agent to read from files instead of the database. The moment of truth: the agent boots up, reads core.md and working.md, and continues its work. If it behaves identically to the database-backed version, the migration is complete.
Keep the database for 30 days as a backup. After 30 days with no issues, decommission it.
Verifying Data Integrity
After migration, verify these properties:
- Completeness: Every piece of data that was in the database exists in one of the four tier files. Nothing was silently dropped.
- Correctness: The agent’s behavior is identical before and after migration. Run the same tasks and compare outputs.
- Size compliance: core.md ≤ 5KB, working.md ≤ 5KB, long-term.md ≤ 10KB. If any file exceeds its limit, the categorization step missed something.
- Timestamp freshness: All “Last Updated” timestamps are current (from the migration date, not from the original database creation date).
- Lifecycle validation: Run the agent 3-5 times on its cron schedule and verify that working.md updates correctly, events.log grows with each run, and long-term.md doesn’t grow on every run.
In the production platform’s migration, 16 domain agents were migrated from ad-hoc memory (various sizes, various formats) to the 4-tier system in a single day. The biggest compression: a task-coaching agent went from 101KB of unstructured memory to 6.8KB across all four tiers. Response quality improved immediately: the agent stopped referencing month-old context and started focusing on the current task.
Scaling to 40+ Agents
What changes when you have dozens of agents sharing a memory system: coordination, naming, auditing.
Directory Naming Conventions at Scale
With 40+ agents, naming conventions are load-bearing infrastructure. The production system uses these rules:
- Directory name = agent name. The directory data/agents/finance-manager/ belongs to the finance-manager agent. No aliases, no abbreviations, no creative names. If the agent’s definition file is finance-manager.agent.md, the memory directory is finance-manager/.
- Lowercase kebab-case exclusively. content-scheduler, not ContentScheduler or content_scheduler. This eliminates case-sensitivity issues across operating systems and makes directory enumeration predictable.
- Descriptive compound names. content-scheduler is better than scheduler. finance-manager is better than finance. When you have 40+ directories, short generic names create confusion.
- Group by function through naming, not subdirectories. The production system has content-scheduler, content-manager, content-editor, content-creative, content-researcher, and content-analytics, all at the same directory level. The content- prefix creates a natural grouping without adding nesting complexity.
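The kebab-case rule is easy to enforce mechanically. A minimal Python sketch, with the function name is_valid_agent_dir as an illustrative assumption; it checks only the casing and separator rule, not whether a name is descriptive enough.

```python
import re

# One or more lowercase alphanumeric words joined by single hyphens.
_KEBAB = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_agent_dir(name: str) -> bool:
    """True if `name` is lowercase kebab-case, e.g. 'content-scheduler'."""
    return _KEBAB.fullmatch(name) is not None
```

A check like this fits naturally into whatever script creates new agent directories, so a bad name never lands on disk.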
At 40+ agents, the full directory listing looks like this:
data/agents/
├── blog-writer/
├── cloud-advisor/
├── coding-agent/
├── content-analytics/
├── content-creative/
├── content-editor/
├── content-manager/
├── content-researcher/
├── content-scheduler/
├── context-auditor/
├── credit-coach/
├── dog-parent/
├── entrepreneur-coach/
├── entrepreneur-driver/
├── finance-manager/
├── fitness-coach/
├── health-coach/
├── home-manager/
├── luna/
├── nicu-care/
├── nutrition-chef/
├── parent-support/
├── parenting-coach/
├── platform-manager/
├── project-manager/
├── realtor-team/
├── repo-maintainer/
├── skill-optimizer/
├── task-coach/
├── teacher/
├── wellness-coach/
├── work-life-sync/
└── ... (40+ total, each with core.md, working.md,
    long-term.md, events.log)

Cross-Agent Memory Reads
Sometimes agent A needs context from agent B. A finance manager agent processing a medical bill might need to check the health-coach agent’s memory for insurance details. A content scheduler might need to read the content-editor’s working memory to see which videos are in production.
The rule: cross-agent reads are allowed. Cross-agent writes are not.
Any agent can read any other agent’s memory files. The file system doesn’t enforce access control; it’s enforced by convention in the agent instructions. The critical boundary is writes: only the owning agent writes to its own memory directory. If the finance-manager needs to update the health-coach’s memory (it doesn’t, but hypothetically), it sends a message or creates a task instead of directly editing the file.
In practice, cross-agent reads are rare, maybe 5-10% of runs. Most agents are self-contained. When a cross-agent read is needed, the pattern is straightforward:
# Agent: content-scheduler
# Needs to check content-editor's production state
# Read the other agent's working memory
Read data/agents/content-editor/working.md
# Extract the relevant info (active video runs, publish dates)
# Use it to schedule content, but DON'T write to content-editor's files

Data Domain Ownership
Beyond agent memory, most platforms have shared data directories ΓÇö financial data, content assets, family information, research outputs. These need explicit ownership rules to prevent write conflicts.
The ownership model has two layers:
- Agent memory (data/agents/): Owned exclusively by the named agent. One writer, many readers.
- Domain data (data/{domain}/): Owned by the responsible agent. The finance-manager owns data/finance/. The health-coach owns data/health/. Other agents can read but not write.
Exceptions are handled through the shared data governance rule: if two agents need to write to the same data, escalate to a shared coordination mechanism (message queue, task system, or explicit handoff). Don’t let two agents silently write to the same file.
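The one-writer rule is enforced by convention, but the same check is easy to express in code, for example inside a tool wrapper that guards file writes. Here is a minimal sketch, assuming the data/agents/{name}/ and data/{domain}/ layout described above. The domainOwners table and the canWrite helper are hypothetical names for illustration, not part of the production platform.

```typescript
// Hypothetical ownership table: which agent owns which domain directory.
const domainOwners: Record<string, string> = {
  finance: 'finance-manager',
  health: 'health-coach',
};

// Returns true when `agent` may write to `path` under the two-layer model.
// Reads are always allowed by convention, so only writes are checked.
function canWrite(agent: string, path: string): boolean {
  const parts = path.split('/').filter((p) => p.length > 0 && p !== '.');
  if (parts.includes('..')) return false; // refuse path traversal outright

  const [root, scope, target] = parts;
  if (root !== 'data' || scope === undefined || target === undefined) return false;

  // Layer 1: data/agents/{name}/... is writable only by {name}.
  if (scope === 'agents') return target === agent;

  // Layer 2: data/{domain}/... is writable only by the domain's owner.
  return domainOwners[scope] === agent;
}
```

A denied write is the signal to fall back to the coordination mechanism (message, task, or explicit handoff) rather than to retry the edit.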
Automated Auditing: The Context Auditor Pattern
At scale, manual memory maintenance is impossible. You need an automated auditor: an agent whose job is to monitor the health of all other agents’ memory files.
The production system runs a context auditor on two schedules: a daily quick scan and a weekly deep scan. Here’s what it checks:
- Size compliance: Any core.md over 5KB? Any working.md over 5KB? Any long-term.md over 10KB? Flag them.
- Staleness detection: Any working.md with a “Last Updated” timestamp more than 3 days old AND the agent has an active cron schedule? Flag it.
- Format consistency: Does every working.md have a “## Last Updated” section? Does every core.md follow the standard template? Flag deviations.
- Missing files: Does every agent that should have memory files actually have all four? Flag missing tiers.
- Contradiction detection: Does an agent’s working memory reference capabilities or rules that contradict its core memory? Flag inconsistencies.
The auditor produces a report: a list of issues ranked by severity. Critical issues (stale memory on an active agent, missing core.md) are fixed automatically. Advisory issues (working memory at 4.5KB, approaching the limit) are logged for the next maintenance cycle.
This pattern, an auditor agent that enforces memory system health, is what makes the 4-tier system sustainable at scale. Without it, drift is inevitable. With it, memory quality stays high across 30+ agents for months without manual intervention.
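To make the checks concrete, here is a rough sketch of the size, staleness, and missing-file rules as a pure function. It assumes file sizes and “Last Updated” timestamps have already been collected. The AgentSnapshot shape, the 90% "approaching the limit" threshold, and the choice to treat over-limit files and missing non-core tiers as advisory are assumptions made for this sketch, not details of the production auditor.

```typescript
type Severity = 'critical' | 'advisory';
type MemoryFile = 'core.md' | 'working.md' | 'long-term.md' | 'events.log';

interface FileInfo {
  sizeBytes: number;
  lastUpdated?: Date; // parsed from the "Last Updated" section when present
}

// Hypothetical input shape: one snapshot per agent, gathered before the audit.
interface AgentSnapshot {
  name: string;
  hasActiveCron: boolean;
  files: Partial<Record<MemoryFile, FileInfo>>;
}

interface Issue {
  agent: string;
  severity: Severity;
  message: string;
}

const KB = 1024;
const ALL_FILES: MemoryFile[] = ['core.md', 'working.md', 'long-term.md', 'events.log'];
const SIZE_LIMITS: Array<[MemoryFile, number]> = [
  ['core.md', 5 * KB],
  ['working.md', 5 * KB],
  ['long-term.md', 10 * KB],
];
const STALE_AFTER_MS = 3 * 24 * 60 * 60 * 1000; // 3 days

function auditAgent(agent: AgentSnapshot, now: Date): Issue[] {
  const issues: Issue[] = [];
  const flag = (severity: Severity, message: string): void => {
    issues.push({ agent: agent.name, severity, message });
  };

  // Missing files: a missing core.md is critical, other tiers advisory (assumption).
  for (const file of ALL_FILES) {
    if (!agent.files[file]) flag(file === 'core.md' ? 'critical' : 'advisory', `missing ${file}`);
  }

  // Size compliance, plus an early warning at 90% of the cap (assumed threshold).
  for (const [file, limit] of SIZE_LIMITS) {
    const info = agent.files[file];
    if (!info) continue;
    if (info.sizeBytes > limit) flag('advisory', `${file} exceeds ${limit / KB}KB`);
    else if (info.sizeBytes >= limit * 0.9) flag('advisory', `${file} is approaching ${limit / KB}KB`);
  }

  // Staleness only matters for agents that are supposed to be running.
  const working = agent.files['working.md'];
  if (agent.hasActiveCron && working?.lastUpdated) {
    const ageMs = now.getTime() - working.lastUpdated.getTime();
    if (ageMs > STALE_AFTER_MS) flag('critical', 'working.md is stale on an active agent');
  }

  // Rank by severity: critical first.
  const rank: Record<Severity, number> = { critical: 0, advisory: 1 };
  return issues.sort((a, b) => rank[a.severity] - rank[b.severity]);
}
```

Format consistency and contradiction detection are left out on purpose: both need to read file contents, and contradiction detection in particular is a judgment task better suited to the auditor agent itself than to a deterministic function.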
Memory System Metrics
Track these metrics to monitor memory system health over time:
- Average working memory size: should stay under 3KB (comfortable headroom below the 5KB cap). If the average creeps toward 4KB+, pruning discipline is slipping.
- Stale memory count: number of agents with working memory older than 3 days. Target: zero for agents with active cron schedules.
- Promotion rate: how often data is promoted from working → long-term. If it’s more than once per week per agent, agents are over-promoting. If it’s less than once per month, agents might be losing valuable lessons.
- Events.log size: total across all agents. Should stay bounded by the 30-day prune cycle. If total size keeps growing month over month, pruning isn’t running.
- Template compliance: percentage of agent memory files that match the standard templates. Target: 100%. Deviations indicate format drift.
- Cross-agent read frequency: how often agents read other agents’ memory. If this exceeds 20% of runs, consider whether the data model needs restructuring (maybe shared data should be in a domain directory instead of agent memory).
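A few of these thresholds reduce to simple arithmetic. The sketch below encodes the promotion-rate and cross-agent-read rules from the list; the function names and the 30-day counting window are assumptions chosen for illustration.

```typescript
type PromotionStatus = 'over-promoting' | 'healthy' | 'under-promoting';

// promotionsLast30Days: working → long-term promotions for one agent.
function classifyPromotionRate(promotionsLast30Days: number): PromotionStatus {
  // "More than once per week" is roughly more than 30 / 7 ≈ 4.3 per 30 days.
  if (promotionsLast30Days > 30 / 7) return 'over-promoting';
  // "Less than once per month" means no promotion at all in the window.
  if (promotionsLast30Days < 1) return 'under-promoting';
  return 'healthy';
}

// True when cross-agent reads exceed 20% of runs, the restructuring signal.
function crossReadsNeedReview(crossReadRuns: number, totalRuns: number): boolean {
  return totalRuns > 0 && crossReadRuns / totalRuns > 0.2;
}

// Mean working.md size in bytes across the fleet; 0 for an empty fleet.
function averageBytes(sizes: number[]): number {
  return sizes.length === 0 ? 0 : sizes.reduce((sum, size) => sum + size, 0) / sizes.length;
}
```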
Agent Skills as the Scaling Layer
Memory tells agents what they know. Skills tell them what they can do. Together, they form the complete agent architecture.
The Missing Piece: Reusable Procedures
You now have a memory system that gives every agent persistent context across sessions. But there’s a problem that memory alone doesn’t solve: how do agents learn new capabilities without bloating their prompts?
Consider this scenario. You have 40 agents. Ten of them need to send Telegram notifications. Five of them need to run quality checks before publishing. Eight of them interact with cron schedules. If each agent embeds its own copy of these procedures inline, you get:
- Duplication — the same 200-line workflow copy-pasted across 10 agent definitions
- Drift — one agent’s copy gets updated, the other nine fall behind
- Bloat — agent prompts grow to 15,000+ tokens, most of it procedural instructions the agent rarely needs
- Confusion — the agent can’t distinguish its identity from the procedures it follows
This is the monolith pattern applied to agent architecture. And just like monolithic codebases, it breaks down at scale.
The solution is Agent Skills — reusable procedure files that agents load on-demand based on what they need to do, not who they are. Microsoft shipped this exact pattern in Visual Studio’s Agent Skills feature in 2026. I’ve been running it in production across 71 skills and 50+ agents for months before that. The pattern works because it applies the same “load what you need” philosophy as the 4-tier memory system — but for capabilities instead of context.
How Skills Connect to Memory
Memory and skills serve fundamentally different purposes in an agent system, and understanding the boundary is critical:
Memory is agent-specific and stateful. Each agent has its own memory files that change over time. The finance manager’s working memory looks nothing like the content scheduler’s — because they track different domains.
Skills are shared and stateless. The memory-management skill teaches any agent how to load, save, prune, and promote memory. The telegram-communication skill teaches any agent how to format and send notifications. Skills don’t change based on who’s using them — they’re portable procedures.
Here’s the key insight: the memory-management skill is the skill that implements the 4-tier memory system you’ve been learning in this blueprint. The memory architecture is the design pattern. The skill is the executable instructions that agents follow to operate the pattern. Every agent in our production platform loads the memory-management skill to know exactly how to handle its own memory files — and every one of them follows the same rules, because the rules live in one canonical place.
Agent vs. Skill — The Decision Framework
The most important architectural decision in a multi-agent system is knowing what belongs in the agent definition versus what should be extracted into a reusable skill. Get this wrong and you’ll either have bloated agents that are slow and confused, or over-extracted skills that fragment the agent’s identity.
The mental model is simple: Agent = WHO. Skill = HOW.
Agent (WHO) — Keep in the Agent Definition
- Persistent identity — the agent’s name, role, personality, voice
- Memory across runs — what it knows, what it’s learned, what it’s working on
- Relationship context — how it interacts with specific users or other agents
- Autonomy and mission ownership — what it decides on its own vs. what it escalates
- Judgment that depends on accumulated context — decisions that get better with history
Skill (HOW) — Extract to a Reusable Skill
- Reusable step-by-step procedures — workflows that multiple agents follow
- Domain-specific integration patterns — how to call an API, format a message, process data
- Tool-use recipes — specific commands, flags, and patterns for external tools
- Stateless instructions portable across agents — any procedure that works the same regardless of who executes it
- Progressive disclosure content — detailed instructions that don’t need to be loaded every run
The Decision Test
Before extracting any content from an agent into a skill, ask these five questions:
- Does the behavior depend on persistent memory over time? If yes, keep it in the agent.
- Does the behavior require personality, relationship, or trust context? If yes, keep it in the agent.
- Is it a repeatable multi-step workflow? If yes, skill candidate.
- Could the same procedure help 2+ agents? If yes, strong skill candidate.
- Is the content bloating the agent prompt without adding identity? If yes, extract it immediately.
Score each candidate 0 or 1 for: reusable across agents, mostly procedural, low dependence on memory, low dependence on personality, easy to test deterministically. 4-5 = extract to skill. 2-3 = hybrid (split procedure into skill, keep judgment in agent). 0-1 = keep in agent.
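The scoring rule is mechanical enough to write down directly. A minimal sketch follows; the SkillCandidate field names are invented for this example, while the five criteria and the thresholds come from the paragraph above.

```typescript
// One boolean per scoring criterion (field names are illustrative).
interface SkillCandidate {
  reusableAcrossAgents: boolean;
  mostlyProcedural: boolean;
  lowMemoryDependence: boolean;
  lowPersonalityDependence: boolean;
  deterministicallyTestable: boolean;
}

type Verdict = 'extract-to-skill' | 'hybrid' | 'keep-in-agent';

function scoreCandidate(c: SkillCandidate): { score: number; verdict: Verdict } {
  const score = [
    c.reusableAcrossAgents,
    c.mostlyProcedural,
    c.lowMemoryDependence,
    c.lowPersonalityDependence,
    c.deterministicallyTestable,
  ].filter(Boolean).length;

  // 4-5 = extract, 2-3 = hybrid, 0-1 = keep in the agent.
  const verdict: Verdict = score >= 4 ? 'extract-to-skill' : score >= 2 ? 'hybrid' : 'keep-in-agent';
  return { score, verdict };
}
```

A notification-formatting procedure shared by ten agents would typically score 5 and get extracted. A coaching routine that leans on months of relationship history would score 0 or 1 and stay in the agent.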
Progressive Disclosure Through Skills
Here’s where skills and the 4-tier memory system share the same core philosophy: load what you need, when you need it.
Just as the memory system separates always-load (core.md) from on-demand (long-term.md), the skill system uses trigger phrase matching to load capabilities on demand. A platform with 71 skills doesn’t dump all 71 into every agent’s context window. That would be worse than the monolith — it would be 71 monoliths stacked on top of each other.
Instead, each skill has a description field with trigger phrases. When an agent encounters a task that matches those phrases, the platform loads that specific skill into context. The agent gets exactly the procedures it needs for the current task, and nothing else.
This is progressive disclosure applied to agent capabilities:
- Tier 1 (always loaded): Agent definition — identity, mission, core rules. ~3-5KB.
- Tier 2 (always loaded): Working memory — current state. ~5KB max.
- Tier 3 (on-demand): Long-term memory + skills — loaded when the task requires them.
The result: an agent that starts with a lean 8-10KB context (identity + working state), then dynamically expands to include only the capabilities it needs for the current run. A content scheduling agent doesn’t load the quality-gate skill until it’s time to review something. A finance agent doesn’t load the telegram-communication skill until it has something to report. The context window stays focused, the token costs stay low, and the agent’s attention stays sharp.
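How the matching happens depends on the platform, and this blueprint does not specify it. As one possible illustration, the sketch below treats every double-quoted phrase in a skill’s description as a trigger and uses plain case-insensitive substring matching. Both choices are assumptions; a real host may route on the description with the model itself rather than with string matching.

```typescript
interface SkillMeta {
  name: string;
  description: string;
}

// Pull every double-quoted phrase out of a description, lowercased.
function triggerPhrases(description: string): string[] {
  const quoted = description.match(/"([^"]+)"/g) ?? [];
  return quoted.map((phrase) => phrase.slice(1, -1).toLowerCase());
}

// Return the names of skills whose trigger phrases appear in the task text.
function matchSkills(task: string, skills: SkillMeta[]): string[] {
  const haystack = task.toLowerCase();
  return skills
    .filter((skill) => triggerPhrases(skill.description).some((phrase) => haystack.includes(phrase)))
    .map((skill) => skill.name);
}
```

Only the matched skills are then read from disk and added to context, which is what keeps the starting footprint at identity plus working state.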
The SKILL.md Anatomy
Every skill in the system follows the same file format — a single Markdown file with YAML frontmatter. Here’s the structure:
.github/skills/{skill-name}/SKILL.md
The directory convention is load-bearing (just like the memory directory convention). Every skill gets its own directory under .github/skills/, and the skill file is always named SKILL.md. Tooling, auditors, and agents can walk the directory tree and discover every skill programmatically.
Here’s what a real production SKILL.md looks like — this is the quality-gate skill that implements the CHECK → FIX → RECHECK → ESCALATE pattern used by content pipelines, code review, and deployment workflows:
---
name: quality-gate
description: >-
  Reusable quality review gate pattern with retry logic,
  failure escalation, and lessons-learned feedback loops.
  Use when implementing pre-publish quality checks,
  content validation gates, automated review-and-retry
  workflows, or any process that needs
  "check → fix → recheck → escalate" logic.
  Trigger phrases include "quality check",
  "quality gate", "review gate", "pre-publish check",
  "retry on failure", "quality loop".
---
# Quality Gate Skill
A quality gate is a mandatory checkpoint that a
deliverable must pass before proceeding to the next
pipeline stage.
## Core Pattern
CHECK → PASS? → Continue pipeline
        ↓ FAIL
FIX → RECHECK → PASS? → Continue pipeline
                ↓ FAIL (retry exhausted)
ESCALATE → Notify human, STOP
## Configuration Schema
Every quality gate needs these parameters:
- gate_name: descriptive identifier
- max_retries: how many fix attempts before escalating
- checks: array of validation functions
- failure_policy: stop_and_notify | continue_partial
## Rules
- Never skip the gate, even for "small" changes
- Log every gate result (pass/fail/retry)
- Escalate after max retries; don't loop forever
Frontmatter Anatomy
The YAML frontmatter is how the platform discovers and routes skills:
- name — the skill’s canonical identifier. Matches the directory name. Used for programmatic references.
- description — the most important field. Contains both a human-readable description AND trigger phrases. When an agent or user says something matching a trigger phrase, this skill gets loaded into context. The trigger phrases are the routing mechanism — they replace explicit imports with natural-language pattern matching.
The description field deserves special attention. Compare these two approaches:
# BAD — vague, no trigger phrases
description: "Helps with cron stuff"
# GOOD — specific, with trigger phrases
description: >-
  Cron job dispatch rules: always launch fresh agents
  via task tool, never write_agent for scheduled jobs.
  Use when user says "cron dispatch", "scheduled job",
  "launch cron", "cron architecture",
  "fresh agent for cron", "cron rule",
  "dispatch pattern", or any cron-triggered agent
  execution.
The difference is routing precision. A vague description means the skill gets loaded when it’s not needed (wasting tokens) or doesn’t get loaded when it is needed (causing the agent to improvise). Well-written trigger phrases are the difference between a skill system that works and one that’s ignored.
Production Skills in Action
Let’s look at four real skills from a 71-skill production platform to see how they solve different problems:
1. memory-management — The Skill That Runs This Blueprint
This is the meta-skill: the skill that codifies the 4-tier memory system into executable instructions. Every stateful agent in the platform loads this skill to know how to handle its memory files.
---
name: memory-management
description: >-
  4-tier memory system management for all stateful
  agents: loading, saving, pruning, promoting, and
  maintaining memory files. Use when user says
  "load memory", "save memory", "update working memory",
  "prune memory", "memory management", "tier system",
  or any agent memory lifecycle activity.
---
# Memory Management Skill
## The 4-Tier System
| Tier | File | Load Rule | Size Limit |
|------|-------------|----------------|-----------|
| 1 | core.md | ALWAYS load | 3-5KB |
| 2 | working.md | ALWAYS load | 5KB max |
| 3 | long-term.md| On-demand only | 10KB max |
| 4 | events.log | Never bulk-load| Unlimited |
## First Action: Load Memory (Every Run)
1. Read data/agents/{name}/core.md # Tier 1
2. Read data/agents/{name}/working.md # Tier 2
Do NOT bulk-load Tier 3 or Tier 4.
## Last Action: Save Memory (Every Run)
1. Update working.md with what happened
2. Append to events.log
3. Promote to long-term.md ONLY when justified
Without this skill, every agent would implement memory management differently. Some would forget to save. Some would over-save. Some would load all four tiers every run. The skill enforces consistency across the entire fleet: one canonical set of rules that every agent follows.
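For platforms where a harness, rather than the model, performs the file I/O, the first-action and last-action rules could be sketched like this. The loadMemory and saveMemory helpers are hypothetical, the events.log line format (ISO timestamp plus a summary) is an assumption, and promotion to long-term.md is deliberately left out because it is a judgment call rather than a mechanical step.

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

interface LoadedMemory {
  core: string;    // Tier 1
  working: string; // Tier 2
}

// First action of every run: load Tier 1 and Tier 2 only.
function loadMemory(agentDir: string): LoadedMemory {
  const read = (file: string): string => {
    const path = join(agentDir, file);
    return existsSync(path) ? readFileSync(path, 'utf8') : '';
  };
  return { core: read('core.md'), working: read('working.md') };
}

// Last action of every run: rewrite working.md, append one line to events.log.
function saveMemory(agentDir: string, workingBody: string, eventSummary: string, now: Date = new Date()): void {
  const stamp = now.toISOString();
  const working = ['# Working Memory', '', '## Last Updated', stamp, '', workingBody.trim(), ''].join('\n');
  writeFileSync(join(agentDir, 'working.md'), working, 'utf8');
  appendFileSync(join(agentDir, 'events.log'), `${stamp} ${eventSummary}\n`, 'utf8');
}

// Usage against a throwaway directory:
const dir = mkdtempSync(join(tmpdir(), 'agent-'));
saveMemory(dir, '## Active Tasks\n- none', 'run complete');
const memory = loadMemory(dir);
```

Note that loadMemory never touches long-term.md or events.log, which mirrors the "do NOT bulk-load" rule.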
2. quality-gate: The CHECK → FIX → RECHECK → ESCALATE Pattern
Content publishing, code deployment, and blueprint review all share the same quality pattern: check the work, flag issues, fix them, recheck, and escalate to a human if fixes fail. Instead of each pipeline embedding its own quality logic, they all reference the quality-gate skill.
The skill defines:
- A configuration schema (gate name, max retries, check list, failure policy)
- The core loop: CHECK → PASS? → Continue | FAIL → FIX → RECHECK → PASS? → Continue | FAIL → ESCALATE
- Logging requirements for every gate result
- Rules for when to stop retrying and involve a human
In the content blitz that produces this blueprint’s companion newsletter, the quality gate catches banned patterns (previous employer names, unfinished copy, missing CTAs) before any content goes live. The same skill — loaded by a different agent, for a different domain — catches deployment issues in CI pipelines. Same procedure, different context. That’s the power of skills.
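The skill itself is a set of Markdown instructions, but the core loop translates directly into code. Here is a minimal sketch, assuming checks return a failure reason or null and that a fix callback exists. Both are hypothetical shapes, and only the stop_and_notify failure policy is modeled.

```typescript
interface GateConfig<T> {
  gateName: string;
  maxRetries: number;
  checks: Array<(item: T) => string | null>; // failure reason, or null on pass
  fix: (item: T, failures: string[]) => T;   // hypothetical remediation step
}

type GateResult<T> =
  | { status: 'passed'; item: T; attempts: number }
  | { status: 'escalated'; item: T; attempts: number; failures: string[] };

function runQualityGate<T>(
  config: GateConfig<T>,
  initial: T,
  log: (line: string) => void = console.log,
): GateResult<T> {
  const retries = Math.max(0, config.maxRetries);
  let item = initial;

  for (let attempt = 0; attempt <= retries; attempt++) {
    // CHECK (or RECHECK on later attempts)
    const failures = config.checks
      .map((check) => check(item))
      .filter((reason): reason is string => reason !== null);
    log(`${config.gateName} attempt ${attempt + 1}: ${failures.length === 0 ? 'pass' : 'fail'}`);

    if (failures.length === 0) return { status: 'passed', item, attempts: attempt + 1 };
    // ESCALATE once retries are exhausted: stop and hand off to a human.
    if (attempt === retries) return { status: 'escalated', item, attempts: attempt + 1, failures };
    // FIX, then loop back to RECHECK.
    item = config.fix(item, failures);
  }
  throw new Error('unreachable: the loop always returns');
}
```

Every attempt is logged, and the loop has a hard upper bound, which covers the "log every gate result" and "don’t loop forever" rules.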
3. cron-dispatch — Fresh Agent Launch to Prevent Context Pollution
This skill encodes one of the hardest-won operational lessons in the platform: cron-triggered agents must always launch fresh.
---
name: cron-dispatch
description: >-
  Cron job dispatch rules: always launch fresh agents
  via task tool, never write_agent for scheduled jobs.
  Use when user says "cron dispatch", "scheduled job",
  "launch cron", "cron architecture",
  "fresh agent for cron", "dispatch pattern",
  or any cron-triggered agent execution.
---
# Cron Dispatch Skill
## The Rule (ABSOLUTE — zero exceptions)
Cron-dispatched agents MUST ALWAYS be launched as
NEW agents. NEVER inject into an existing agent for
cron dispatches. Each cron cycle gets a fresh agent
with clean context. No exceptions.
## Why This Matters
When cron fires, injecting messages into an already-
running agent:
- Pollutes context with irrelevant prior messages
- Degrades performance over time as context fills
- Creates unpredictable behavior from stale state
- Makes debugging impossible
Before this was extracted into a skill, the cron dispatch rule lived inline in three different agent definitions, each with slightly different wording. One agent’s version was outdated. Another missed the debugging rationale. Extracting it into a skill created a single source of truth that every agent references. When the rule needed updating (adding the debugging rationale), it was updated once and every consumer got the fix.
4. agent-skill-management — When to Extract vs. Keep Inline
This is the governance skill — the skill that tells agents (and developers) when something should become a skill versus staying in the agent definition. It encodes the Agent vs. Skill decision framework we covered earlier in this chapter, plus the safe refactoring workflow for extraction.
Key capabilities:
- Extraction audit workflow — inventory candidates, check for reuse and portability, separate identity from procedure, design the skill
- Contradiction detection — find rule conflicts, tool mismatches, workflow divergence, authority confusion, and boundary leakage between agents and skills
- Safe refactor pattern — snapshot current behavior, create skill first, refactor agent to reference skill, validate no constraints were lost
- Anti-patterns — never create a skill for a personality, never leave the full procedure in the agent after extraction, never let skill and agent copies drift separately
This skill is used by platform maintenance agents that periodically audit all agent definitions for bloat, contradiction, and extraction opportunities. It’s the quality control layer for the skill system itself — a skill about skills.
The Directory Structure
Skills follow the same consistency principle as memory files: predictable structure, no special snowflakes.
.github/
└── skills/
├── memory-management/
│ └── SKILL.md
├── quality-gate/
│ └── SKILL.md
├── cron-dispatch/
│ └── SKILL.md
├── agent-skill-management/
│ └── SKILL.md
├── telegram-communication/
│ └── SKILL.md
├── copilot-brand-safety/
│ └── SKILL.md
├── email-encoding/
│ └── SKILL.md
└── ... (71 skills total in production)
The .github/skills/ location is deliberate. It works natively with GitHub Copilot’s agent skills feature (shipped in Visual Studio 2026), and it’s also compatible with other AI coding tools that look for capability definitions in the .github directory. The pattern is IDE-agnostic: it’s just Markdown files in a predictable location.
Each skill gets its own directory (not just a flat file) for future extensibility — some skills may eventually include templates, example files, or test fixtures alongside the SKILL.md. The directory-per-skill convention accommodates this without restructuring.
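Because the layout is predictable, discovery takes only a few lines. The sketch below walks one level of directories, keeps those containing a SKILL.md, and checks that the frontmatter name matches the directory name. The discoverSkills helper is hypothetical, and it reads only the single-line name field rather than parsing the full YAML.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

interface DiscoveredSkill {
  name: string;         // directory name
  path: string;         // full path to SKILL.md
  nameMatches: boolean; // does the frontmatter "name" equal the directory name?
}

// Walk {root}/*/SKILL.md and report every skill found, sorted by name.
function discoverSkills(root: string): DiscoveredSkill[] {
  if (!existsSync(root)) return [];
  return readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({ name: entry.name, path: join(root, entry.name, 'SKILL.md') }))
    .filter((skill) => existsSync(skill.path))
    .map((skill) => {
      const declared = readFileSync(skill.path, 'utf8').match(/^name:\s*(\S+)\s*$/m);
      return { ...skill, nameMatches: declared !== null && declared[1] === skill.name };
    })
    .sort((a, b) => a.name.localeCompare(b.name));
}

// Usage against a throwaway directory:
const skillsRoot = join(mkdtempSync(join(tmpdir(), 'skills-')), '.github', 'skills');
for (const name of ['quality-gate', 'cron-dispatch']) {
  mkdirSync(join(skillsRoot, name), { recursive: true });
  writeFileSync(join(skillsRoot, name, 'SKILL.md'), `---\nname: ${name}\ndescription: demo\n---\n# ${name}\n`);
}
const discovered = discoverSkills(skillsRoot);
```

An auditor can use the nameMatches flag to catch skills whose directory was renamed without updating the frontmatter.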
Skills + Memory = The Complete Architecture
When you combine the 4-tier memory system with the skills pattern, you get the full production architecture for persistent, capable, scalable agents:
- Agent boots up — loads core.md (identity) + working.md (current state). ~8-10KB of focused context.
- Agent reads its task — trigger phrases in the task description match relevant skills. Platform loads those skills into context. Maybe 1-3 skills per run, 2-5KB each.
- Agent executes — uses its identity (from memory) and its capabilities (from skills) to complete the task. Loads long-term.md on-demand if historical context is needed.
- Agent saves state — updates working.md, appends to events.log, promotes validated patterns to long-term.md. The memory-management skill ensures this happens correctly.
Total context per run: 15-25KB. Compare that to the monolith approach where every agent loads a 50KB+ prompt with inline procedures for every possible capability. The lean architecture means faster responses, lower costs, sharper attention, and easier debugging.
The memory system gives agents persistence. Skills give agents capabilities. The combination gives agents expertise — the ability to remember what they’ve learned and know how to apply it, without wasting attention on procedures they don’t need right now.
Deep dive: For the full exploration of Agent Skills — including how Microsoft’s Visual Studio productized this exact pattern, cross-IDE compatibility, and the skills-first development methodology — read Newsletter Issue #3: Agent Skills — Microsoft Just Shipped What You’ve Been Building.
Want the full system? This blueprint covers the memory layer. For the complete agentic development architecture — agents, context engineering, delegation, testing, CI/CD workflows, and skills — see The Agentic Development Blueprint ($129).
Implementation Checklist
Ready to add skills to your agent platform? Follow these steps:
- ☐ Create the .github/skills/ directory in your repository
- ☐ Identify your first 3-5 skill candidates: look for procedures duplicated across 2+ agents
- ☐ Write each skill’s YAML frontmatter with specific trigger phrases — this is your routing layer
- ☐ Extract the first skill: start with your most-duplicated procedure (memory management is a great first candidate)
- ☐ Refactor consuming agents: replace inline procedures with skill references
- ☐ Validate: confirm agents still work correctly and no constraints were lost in extraction
- ☐ Set up periodic audits: review agent definitions monthly for new extraction opportunities
- ☐ Track your skill count and agent prompt sizes — both should trend in the right direction (skills up, prompt sizes down)
MCP Servers as Memory-Aware Tool Layers
Memory-aware MCP servers don’t just execute commands for GitHub Copilot. They remember preferences, avoid repeated mistakes, and turn every tool call into a context-rich interaction.
Why MCP + Memory Matters
Most developers discover MCP through the exciting part first: tools. You expose a send_sms tool, a get_location tool, a create_event tool, and suddenly GitHub Copilot can act on the world instead of only describing it. That leap matters. It is the difference between an assistant that talks and an assistant that helps.
But the second problem appears as soon as you run those tools in production: the tool layer forgets everything. GitHub Copilot may understand the immediate conversation, yet every tool invocation still behaves like a clean-room function call. The SMS tool doesn’t know it just sent the same text. The location tool doesn’t remember a good GPS fix from two minutes ago. The call log tool doesn’t recognize a number that has called three times this week. Stateless tools create avoidable friction.
This is why MCP plus the 4-tier memory system matters so much. Memory lets the tool layer accumulate useful context over time without forcing the agent prompt to carry every detail forever. Instead of stuffing operational history into the model context window, you persist it where it belongs: in files designed for identity, working state, learned patterns, and historical events.
Once you do that, tools stop behaving like disposable RPC calls and start acting like durable collaborators:
- They avoid repeats. Duplicate SMS sends, duplicate reminders, and duplicate device actions get blocked before they happen.
- They preserve preferences. The system remembers that “Mom” maps to a saved phone number, that a user prefers texts over calls, or that a cached location is acceptable when GPS is flaky.
- They learn from outcomes. A failed GPS lookup isn’t just an error. It becomes evidence that can improve the next run.
- They personalize responses. The same tool returns better, more specific results because it understands what happened before.
The Model Context Protocol makes tools first-class citizens in AI systems. Memory makes those tools persistent, adaptive, and trustworthy. Without memory, every call starts from zero. With memory, every call benefits from accumulated context.
Key mental model: Don’t think of memory-aware MCP as “adding storage” to tools. Think of it as giving every tool execution a briefing before the work starts and a debrief after the work ends.
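As a small illustration of the "avoid repeats" behavior, here is a sketch of the check a send_sms handler might run during its briefing step. The SentMessage shape, the isDuplicateSms helper, and the five-minute window are assumptions for this example.

```typescript
interface SentMessage {
  to: string;
  text: string;
  timestamp: string; // ISO 8601
}

// True when the same text went to the same recipient within the window.
function isDuplicateSms(
  recent: SentMessage[],
  to: string,
  text: string,
  now: Date,
  windowMs: number = 5 * 60 * 1000, // assumed 5-minute window
): boolean {
  const normalized = text.trim().toLowerCase();
  return recent.some(
    (message) =>
      message.to === to &&
      message.text.trim().toLowerCase() === normalized &&
      now.getTime() - Date.parse(message.timestamp) <= windowMs,
  );
}
```

The check needs nothing beyond the recent-message list, so it can run before the real side effect and block the send early.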
Architecture: The Memory-Aware MCP Server Pattern
The pattern is simple and powerful: the MCP server’s tool handler reads memory before execution, calls the real tool or device integration, then writes the outcome back into memory. The memory layer sits between the protocol handler and the actual side effect. That makes it middleware, not business logic.
Here’s the architecture in its simplest form:
AI Client → MCP Server → [Memory Layer] → Termux/API/Device
                               ↕
                     Agent Memory (4 Tiers)
                     ├── Tier 1: Core identity (safety rules, personality)
                     ├── Tier 2: Working memory (current session state)
                     ├── Tier 3: Long-term (learned patterns, history)
                     └── Tier 4: Archival (searchable past events)
The placement matters. If memory lives only in the agent prompt, the tool layer stays dumb. If memory lives only in the downstream system, the agent loses visibility and portability. When the memory layer lives inside the MCP server, you get the best of both worlds: GitHub Copilot calls a normal tool, while the server transparently enriches that tool call with memory-derived context.
Each tier has a distinct role in the server:
- Tier 1: Core identity. These are the rules the server should enforce every time: quiet hours, safety constraints, confirmation requirements, preferred tone, and non-negotiable guardrails.
- Tier 2: Working memory. This is the hot path. Recent SMS history, last known GPS fix, pending retries, or the last contact lookup belong here because they change frequently and are needed immediately.
- Tier 3: Long-term memory. This stores validated patterns: nickname mappings, preferred contact methods, repeated failure zones, common call windows, or known device quirks.
- Tier 4: Archival memory. This is the append-only event stream. It doesn’t have to be loaded every time, but it gives you the raw history you need for audits, summaries, and pattern promotion.
The result is a server that behaves more like an experienced operator than a fresh process. It knows the rules, remembers what just happened, recognizes longer-term patterns, and keeps a durable history.
Implementation: Memory Layer Middleware
The cleanest implementation is a dedicated middleware class that every tool handler can call. It should do four jobs well: load context, expose context in a typed shape, persist outcomes, and sanitize what gets logged. The actual tool handlers stay focused on their domain work.
The example below uses Node.js and TypeScript with file-backed memory tiers. It keeps Tier 2 as structured JSON inside working.md, reads Tier 3 from named Markdown sections, and appends sanitized execution records to events.log.
memory-layer.ts
import { appendFileSync, readFileSync, writeFileSync } from 'node:fs';
import { createHash } from 'node:crypto';
interface MemoryConfig {
workingMemoryPath: string; // Tier 2
longTermPath: string; // Tier 3
eventsLogPath: string; // Tier 4
}
interface WorkingMemory {
recentMessages: Array<{ to: string; text: string; timestamp: string }>;
lastLocation: { lat: number; lng: number; timestamp: string } | null;
toolState: Record<string, { lastRun: string; lastResult: string }>;
}
interface ToolContext {
recentMessages: Array<{ to: string; text: string; timestamp: string }>;
contactNicknames: Record<string, string>;
lastKnownLocation: { lat: number; lng: number; timestamp: string } | null;
locationTimestamp: string | null;
callPatterns: Record<string, { callsThisWeek: number; typicalHours: number[]; lastSeenAt: string }>;
}
export class MemoryLayer {
constructor(private readonly config: MemoryConfig) {}
async getContext(toolName: string): Promise<ToolContext> {
const working = this.readWorkingMemory();
const context: ToolContext = {
recentMessages: [],
contactNicknames: this.readSectionMap('contact_nicknames'),
lastKnownLocation: null,
locationTimestamp: null,
callPatterns: this.readSectionMap('call_patterns'),
};
if (toolName === 'send_sms') {
context.recentMessages = working.recentMessages;
}
if (toolName === 'get_location') {
context.lastKnownLocation = working.lastLocation;
context.locationTimestamp = working.lastLocation?.timestamp ?? null;
}
if (toolName === 'get_call_log') {
context.callPatterns = this.readSectionMap('call_patterns');
}
return context;
}
async recordOutcome(toolName: string, input: unknown, output: { success: boolean; summary: string; data?: unknown }): Promise<void> {
const working = this.readWorkingMemory();
working.toolState[toolName] = {
lastRun: new Date().toISOString(),
lastResult: output.summary,
};
if (toolName === 'send_sms' && this.isSmsPayload(input)) {
working.recentMessages.push({
to: input.to,
text: input.message,
timestamp: new Date().toISOString(),
});
working.recentMessages = working.recentMessages.slice(-25);
}
if (toolName === 'get_location' && output.success && this.isLocationPayload(output.data)) {
working.lastLocation = {
lat: output.data.lat,
lng: output.data.lng,
timestamp: new Date().toISOString(),
};
}
this.writeWorkingMemory(working);
const event = {
timestamp: new Date().toISOString(),
tool: toolName,
input: this.sanitize(input),
output: this.sanitize(output),
success: output.success,
};
// One JSON object per line; the newline must be an escape, not a literal line break.
appendFileSync(this.config.eventsLogPath, JSON.stringify(event) + '\n', 'utf8');
}
private readWorkingMemory(): WorkingMemory {
try {
const raw = readFileSync(this.config.workingMemoryPath, 'utf8');
const jsonStart = raw.indexOf('```json');
const jsonEnd = raw.indexOf('```', jsonStart + 7);
if (jsonStart === -1 || jsonEnd === -1) throw new Error('working.md is missing a JSON block');
const jsonText = raw.slice(jsonStart + 7, jsonEnd).trim();
const parsed = JSON.parse(jsonText) as Partial<WorkingMemory>;
return {
recentMessages: parsed.recentMessages ?? [],
lastLocation: parsed.lastLocation ?? null,
toolState: parsed.toolState ?? {},
};
} catch {
return {
recentMessages: [],
lastLocation: null,
toolState: {},
};
}
}
private writeWorkingMemory(memory: WorkingMemory): void {
const next = [
'# Working Memory',
'',
'## Last Updated',
new Date().toISOString(),
'',
'## State',
'```json',
JSON.stringify(memory, null, 2),
'```',
'',
].join('\n');
writeFileSync(this.config.workingMemoryPath, next, 'utf8');
}
private readSectionMap<T = Record<string, unknown>>(sectionName: string): T {
try {
const raw = readFileSync(this.config.longTermPath, 'utf8');
// Backslashes are doubled so the RegExp receives \s and \S rather than plain "s" and "S".
const pattern = new RegExp('## ' + sectionName + '\n```json\n([\\s\\S]*?)\n```', 'i');
const match = raw.match(pattern);
return match ? JSON.parse(match[1]) as T : {} as T;
} catch {
return {} as T;
}
}
private sanitize(value: unknown): unknown {
if (!value || typeof value !== 'object') {
return value;
}
return JSON.parse(JSON.stringify(value, (_key, currentValue) => {
if (typeof currentValue !== 'string') {
return currentValue;
}
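// Phone-like strings: optional leading "+", then at least 7 digits, dashes, parentheses, or spaces.
if (/^\+?[0-9\-() ]{7,}$/.test(currentValue)) {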
return 'phone:' + createHash('sha256').update(currentValue).digest('hex').slice(0, 12);
}
if (currentValue.length > 80) {
return currentValue.slice(0, 77) + '...';
}
return currentValue;
}));
}
private isSmsPayload(value: unknown): value is { to: string; message: string } {
return Boolean(value)
&& typeof value === 'object'
&& typeof (value as { to?: unknown }).to === 'string'
&& typeof (value as { message?: unknown }).message === 'string';
}
private isLocationPayload(value: unknown): value is { lat: number; lng: number } {
return Boolean(value)
&& typeof value === 'object'
&& typeof (value as { lat?: unknown }).lat === 'number'
&& typeof (value as { lng?: unknown }).lng === 'number';
}
}
This middleware is intentionally boring in the best way. It doesn’t know how to send an SMS or query GPS. It knows how to load the right context, persist the right facts, and protect the logs. That separation is what makes the pattern reusable across many tools and many servers.
Wrapping Any Tool with Memory
Once the memory layer exists, the next step is wrapping your normal MCP tool handlers so every execution follows the same read → execute → write lifecycle. The tool stays easy to reason about, but it stops acting like a blank slate.
In practice, the flow looks like this:
- GitHub Copilot calls a tool through MCP.
- The wrapper asks the memory layer for tool-specific context.
- The handler resolves nicknames, checks recent activity, or loads learned patterns.
- The real side effect happens only after the context checks pass.
- The wrapper records what happened so the next call is smarter.
memory-aware-tool.ts
import { readFileSync, writeFileSync } from 'node:fs';
import { McpServer } from '@modelcontextprotocol/server';
import * as z from 'zod/v4';
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { MemoryLayer } from './memory-layer.js';
const execFileAsync = promisify(execFile);
const server = new McpServer({
name: 'phone-memory-server',
version: '1.0.0',
});
const memory = new MemoryLayer({
workingMemoryPath: './data/agents/phone-agent/working.md',
longTermPath: './data/agents/phone-agent/long-term.md',
eventsLogPath: './data/agents/phone-agent/events.log',
});
async function termux(command: string, args: string[]): Promise<{ stdout: string }> {
return execFileAsync(command, args);
}
async function getGpsFix(): Promise<{ lat: number; lng: number }> {
const { stdout } = await termux('termux-location', ['gps']);
const parsed = JSON.parse(stdout) as { latitude: number; longitude: number };
return { lat: parsed.latitude, lng: parsed.longitude };
}
async function readDeviceCallLog(limit: number): Promise<CallEntry[]> {
const { stdout } = await termux('termux-call-log', ['-l', String(limit)]);
return JSON.parse(stdout) as CallEntry[];
}
async function persistLongTermPatterns(path: string, section: string, data: unknown): Promise<void> {
const raw = readFileSync(path, 'utf8');
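const nextSection = '## ' + section + '\n```json\n' + JSON.stringify(data, null, 2) + '\n```';
// Backslashes are doubled because the pattern is built from a string literal.
const pattern = new RegExp('## ' + section + '\\n```json\\n[\\s\\S]*?\\n```', 'i');
// The replacer function keeps any "$" in the JSON from being treated as a replacement pattern.
const updated = pattern.test(raw) ? raw.replace(pattern, () => nextSection) : raw.trimEnd() + '\n' + nextSection + '\n';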
writeFileSync(path, updated, 'utf8');
}
server.registerTool(
'send_sms',
{
title: 'Send SMS',
description: 'Send an SMS with memory-aware duplicate detection and nickname resolution',
inputSchema: z.object({
to: z.string().describe('Phone number or contact nickname'),
message: z.string().describe('Message text'),
}),
},
async ({ to, message }) => {
const ctx = await memory.getContext('send_sms');
const resolvedNumber = ctx.contactNicknames[to] ?? to;
const isDuplicate = ctx.recentMessages.some((msg) => {
const ageMs = Date.now() - new Date(msg.timestamp).getTime();
return msg.to === resolvedNumber && msg.text === message && ageMs < 300_000;
});
if (isDuplicate) {
return {
content: [{
type: 'text' as const,
text: 'Duplicate detected — identical message was already sent within the last 5 minutes. Skipped.',
}],
isError: false,
};
}
await termux('termux-sms-send', ['-n', resolvedNumber, message]);
await memory.recordOutcome(
'send_sms',
{ to: resolvedNumber, message },
{ success: true, summary: 'sms_sent' },
);
return {
content: [{
type: 'text' as const,
text: 'SMS sent to ' + resolvedNumber + ' (' + to + ')',
}],
};
},
);
This is the smallest useful version of the pattern. Even here, the win is obvious: nickname resolution comes from Tier 3, duplicate detection comes from Tier 2, and the outcome gets written back automatically for future runs.
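Once a second or third tool needs the same lifecycle, the read → execute → write steps can be factored into one wrapper. This is a sketch under stated assumptions: `MemoryPort` is a hypothetical interface that mirrors the `getContext` and `recordOutcome` calls used above, and `withMemory` is an illustrative name. Unlike the handler above, this version also records failures, which is a design choice rather than something the chapter's code does.

```typescript
// Hypothetical minimal interface mirroring the calls made on the chapter's MemoryLayer.
interface MemoryPort<Ctx> {
  getContext(tool: string): Promise<Ctx>;
  recordOutcome(tool: string, input: unknown, outcome: { success: boolean; summary: string }): Promise<void>;
}

// Result of the context check: either short-circuit with a result, or proceed.
type Decision<Out> = { skip: true; result: Out } | { skip: false };

function withMemory<In, Ctx, Out>(
  tool: string,
  memory: MemoryPort<Ctx>,
  check: (input: In, ctx: Ctx) => Decision<Out>,
  execute: (input: In, ctx: Ctx) => Promise<Out>,
): (input: In) => Promise<Out> {
  return async (input: In): Promise<Out> => {
    const ctx = await memory.getContext(tool); // 1. read
    const decision = check(input, ctx);
    if (decision.skip) {
      return decision.result; // context check failed: no side effect happens
    }
    try {
      const result = await execute(input, ctx); // 2. execute
      await memory.recordOutcome(tool, input, { success: true, summary: tool + '_ok' }); // 3. write
      return result;
    } catch (error) {
      // Assumption: failures are worth remembering too, so retries can be throttled later.
      await memory.recordOutcome(tool, input, { success: false, summary: tool + '_failed' });
      throw error;
    }
  };
}
```

The wrapped function has the same call shape as the original handler, so it can be passed to a tool registration unchanged while the duplicate check and the outcome write stay in one place.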
Real Example: Phone MCP + Memory Enhanced
The phone MCP server from Newsletter #4: Your Phone as an AI Tool is the perfect case study because mobile tools are full of repeated actions, flaky sensors, and preference-heavy workflows. Three enhancements show the pattern clearly.
1. Smart SMS (duplicate detection + nickname resolution)
send_sms should not behave like a fire-and-forget shell command. It should behave like a communication assistant. Before sending, the tool reads Tier 2 working memory to see whether the same message already went to the same person recently. This matters because tool retries, host reconnects, and model restarts all create accidental duplicates in the real world.
Tier 3 adds another improvement: nickname resolution. Users rarely say, “Send a message to +1-555-123-4567.” They say, “Text Mom,” “message the babysitter,” or “tell my brother I’m outside.” A memory-aware SMS tool resolves those nicknames from long-term memory instead of making GitHub Copilot rediscover the contact every time.
The practical outcome is huge: fewer repeated texts, faster tool calls, less prompt bloat, and more trust.
2. Cached Location (graceful degradation)
Location tools are a classic case for Tier 2 working memory. Live GPS can take 30 to 60 seconds, and it fails in exactly the places where users most want it to work: inside buildings, garages, elevators, and underground parking. A stateless location tool either waits too long or returns an error.
A memory-aware version does something smarter. It checks working memory first. If the last known location is fresh — say, under five minutes old — it returns that cached value immediately. If live GPS fails entirely, it falls back to the cached value with a staleness warning so GitHub Copilot can still answer usefully instead of collapsing into failure.
server.registerTool(
'get_location',
{
title: 'Get Location',
description: 'Return the current location, preferring fresh cache and graceful fallback',
inputSchema: z.object({}),
},
async () => {
const ctx = await memory.getContext('get_location');
if (ctx.lastKnownLocation) {
const ageMs = Date.now() - new Date(ctx.lastKnownLocation.timestamp).getTime();
if (ageMs < 300_000) {
return {
content: [{
type: 'text' as const,
text: JSON.stringify({
source: 'cache',
ageSeconds: Math.round(ageMs / 1000),
location: ctx.lastKnownLocation,
}),
}],
};
}
}
try {
const live = await getGpsFix();
await memory.recordOutcome(
'get_location',
{},
{ success: true, summary: 'live_gps', data: live },
);
return {
content: [{ type: 'text' as const, text: JSON.stringify({ source: 'gps', location: live }) }],
};
} catch {
if (ctx.lastKnownLocation) {
const ageMs = Date.now() - new Date(ctx.lastKnownLocation.timestamp).getTime();
return {
content: [{
type: 'text' as const,
text: JSON.stringify({
source: 'cache-fallback',
staleSeconds: Math.round(ageMs / 1000),
warning: 'GPS unavailable — using cached location',
location: ctx.lastKnownLocation,
}),
}],
};
}
return {
content: [{ type: 'text' as const, text: 'GPS unavailable and no cached location exists.' }],
isError: true,
};
}
},
);
This is exactly the kind of middleware win you want. The device integration remains simple, but the user experience improves dramatically because the tool remembers.
3. Call Pattern Learning (Tier 3 long-term)
Call history is where Tier 3 becomes more than convenience. Over time, the phone MCP server can build a pattern map: who calls often, which hours matter, which missed calls usually lead to follow-up texts, and which numbers are noise. GitHub Copilot doesn’t have to infer urgency from a single call log snapshot. The memory layer can attach context from longer history.
That enables a much better answer to questions like, “Should I return this call?” Instead of guessing, the system can say: This number has called three times this week, usually around 2 PM, and the last call was 30 minutes ago. That’s not model improvisation. That’s learned evidence.
type CallEntry = {
number: string;
direction: 'incoming' | 'outgoing' | 'missed';
timestamp: string;
};
function buildCallPatterns(entries: CallEntry[]) {
const patterns: Record<string, { callsThisWeek: number; typicalHours: number[]; lastSeenAt: string }> = {};
for (const entry of entries) {
const hour = new Date(entry.timestamp).getHours();
const current = patterns[entry.number] ?? {
callsThisWeek: 0,
typicalHours: [],
lastSeenAt: entry.timestamp,
};
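// Count only calls from the last 7 days so the field matches its name.
if (Date.now() - new Date(entry.timestamp).getTime() < 7 * 86_400_000) {
current.callsThisWeek += 1;
}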
current.typicalHours = Array.from(new Set([...current.typicalHours, hour])).sort((a, b) => a - b);
if (new Date(entry.timestamp) > new Date(current.lastSeenAt)) {
current.lastSeenAt = entry.timestamp;
}
patterns[entry.number] = current;
}
return patterns;
}
server.registerTool(
'get_call_log',
{
title: 'Get Call Log',
description: 'Return recent calls and attach learned importance signals',
inputSchema: z.object({
limit: z.number().int().min(1).max(50).default(10),
}),
},
async ({ limit }) => {
const entries = await readDeviceCallLog(limit);
const patterns = buildCallPatterns(entries);
await persistLongTermPatterns('./data/agents/phone-agent/long-term.md', 'call_patterns', patterns);
await memory.recordOutcome('get_call_log', { limit }, { success: true, summary: 'call_log_loaded' });
const enriched = entries.map((entry) => ({
...entry,
pattern: patterns[entry.number] ?? null,
}));
return {
content: [{ type: 'text' as const, text: JSON.stringify(enriched, null, 2) }],
};
},
);
That one enrichment changes downstream reasoning. GitHub Copilot can now answer with context instead of intuition because the tool layer has already converted raw call history into durable patterns.
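To make the "learned evidence" concrete, a stored pattern can be rendered as a sentence the model can quote directly. This is a small sketch; `describeCallPattern` is a hypothetical helper, and `CallPattern` simply restates the shape produced by `buildCallPatterns` above.

```typescript
type CallPattern = { callsThisWeek: number; typicalHours: number[]; lastSeenAt: string };

// Hypothetical helper: turn a stored pattern into evidence the model can cite.
function describeCallPattern(number: string, pattern: CallPattern | undefined, now: Date): string {
  if (!pattern) {
    return 'No history for ' + number + '.';
  }
  const minutesAgo = Math.round((now.getTime() - new Date(pattern.lastSeenAt).getTime()) / 60_000);
  // Hours are rendered exactly as stored (0-23); no timezone conversion happens here.
  const hours = pattern.typicalHours.map((h) => (h < 10 ? '0' : '') + h + ':00').join(', ');
  return number + ' has called ' + pattern.callsThisWeek + ' time(s) this week, usually around '
    + hours + '; last call was ' + minutesAgo + ' minute(s) ago.';
}
```

Passing `now` in as a parameter keeps the function deterministic, which makes it easy to test and keeps clock handling in the caller.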
The Scalability Insight: Shared Memory, Many Servers
The pattern becomes even more powerful when multiple MCP servers share the same memory system. One memory architecture, many specialized tool servers. The phone server knows what just happened on the device. The calendar server knows what the day looks like. The home automation server knows the state of the house. Shared memory gives them a common language.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Phone MCP │ │ Calendar MCP │ │ Home MCP │
│ (18 tools) │ │ (events, avail)│ │ (lights, locks)│
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
│ │ │
└──────────┬─────────┴──────────┬─────────┘
│ │
┌─────▼────────────────────▼─────┐
│ Shared Memory Layer │
│ Tier 1: Core rules (all servers)│
│ Tier 2: Working (per-server) │
│ Tier 3: Long-term (shared) │
│ Tier 4: Events (unified log) │
└────────────────────────────────┘
This is where emergent intelligence shows up. The calendar MCP doesn’t have to rediscover that you’re driving because the phone MCP already wrote fresh location state. The home MCP can infer arrival from the same shared context. The servers remain separate for clarity and security, but the memory system ties them together into a coherent agent environment.
The most practical version of this pattern is:
- Tier 1 shared by policy. Safety rules and global behavior constraints apply across all servers.
- Tier 2 partitioned per server. Each server keeps its own hot working state to avoid collisions and unnecessary churn.
- Tier 3 shared selectively. Learned patterns that help multiple servers — schedule norms, presence signals, communication preferences — become shared knowledge.
- Tier 4 unified for auditing. A single searchable execution stream makes it easier to debug cross-server behavior and promote useful patterns.
In other words: separate tools, shared understanding.
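One way to express that split is in how file paths are derived. The sketch below assumes a directory layout (`shared/` and `servers/<name>/`) that is purely illustrative; `resolveSharedMemoryPaths` and the `coreRulesPath` field are hypothetical names, not part of the chapter's `MemoryConfig`.

```typescript
type SharedMemoryPaths = {
  coreRulesPath: string;     // Tier 1: one file for every server
  workingMemoryPath: string; // Tier 2: one file per server
  longTermPath: string;      // Tier 3: shared
  eventsLogPath: string;     // Tier 4: unified log
};

// Assumed layout: <root>/shared/* for shared tiers, <root>/servers/<name>/* for per-server state.
function resolveSharedMemoryPaths(root: string, serverName: string): SharedMemoryPaths {
  // Reject anything that could escape the memory root (slashes, dots, spaces).
  if (!/^[a-z0-9-]+$/.test(serverName)) {
    throw new Error('Invalid server name: ' + serverName);
  }
  const base = root.replace(/\/+$/, '');
  return {
    coreRulesPath: base + '/shared/core.md',
    workingMemoryPath: base + '/servers/' + serverName + '/working.md',
    longTermPath: base + '/shared/long-term.md',
    eventsLogPath: base + '/shared/events.log',
  };
}
```

With this shape, the phone and calendar servers resolve to the same Tier 3 and Tier 4 files but never touch each other's working memory.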
Production Patterns
Once you move from a proof of concept to a real deployment, four production rules matter more than the happy-path demo.
1. Memory isolation per user
In a multi-user deployment, never let everyone share one memory directory. Scope memory by user or tenant. Your memory layer constructor should accept a userId and derive file paths from it. That gives you personalization without cross-user leakage.
2. Memory TTL
Tier 2 is working memory, not a junk drawer. Give entries an expiration policy. Location caches may only be useful for 5 minutes. Recent SMS dedupe windows might last 24 hours. Retry state may expire after an hour. If you do not enforce TTL, stale data will quietly become wrong data.
3. Memory compaction
Tier 4 grows forever unless you manage it. The right pattern is a nightly summarization job that reviews the raw event stream, promotes stable patterns to Tier 3, and archives old events. Raw logs are useful, but summarized knowledge is what keeps the system fast and readable.
4. Privacy-first logging
Not every detail belongs in archival memory. The sanitize() function should hash phone numbers, redact message content when necessary, and store summaries rather than verbatim personal data. Memory makes tools better, but privacy rules must improve at the same time.
function getUserMemoryConfig(userId: string): MemoryConfig {
const root = './data/users/' + userId + '/phone-agent';
return {
workingMemoryPath: root + '/working.md',
longTermPath: root + '/long-term.md',
eventsLogPath: root + '/events.log',
};
}
function isFresh(timestamp: string, ttlMs: number): boolean {
return Date.now() - new Date(timestamp).getTime() < ttlMs;
}
function pruneExpiredMessages(messages: Array<{ timestamp: string }>): Array<{ timestamp: string }> {
return messages.filter((message) => isFresh(message.timestamp, 86_400_000));
}
These patterns are what separate an interesting demo from a memory system you can actually trust over weeks and months.
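The snippet above covers isolation and TTL; compaction can be sketched the same way. This is a minimal illustration, assuming events are already parsed into objects with a `timestamp`, `tool`, and `summary`. `compactEvents`, `MemoryEvent`, and the promotion threshold are hypothetical; a real nightly job would also write the promoted patterns to Tier 3 and move archived events to cold storage.

```typescript
type MemoryEvent = { timestamp: string; tool: string; summary: string };

type CompactionResult = {
  kept: MemoryEvent[];              // still inside the retention window
  archived: MemoryEvent[];          // old events to move out of the hot log
  promoted: Record<string, number>; // "tool:summary" -> count, candidates for Tier 3
};

function compactEvents(
  events: MemoryEvent[],
  now: number,
  retentionMs: number,
  minOccurrences: number,
): CompactionResult {
  const kept: MemoryEvent[] = [];
  const archived: MemoryEvent[] = [];
  const counts: Record<string, number> = {};
  for (const event of events) {
    const key = event.tool + ':' + event.summary;
    counts[key] = (counts[key] ?? 0) + 1;
    const ageMs = now - new Date(event.timestamp).getTime();
    (ageMs < retentionMs ? kept : archived).push(event);
  }
  // Only outcomes seen often enough are treated as stable patterns.
  const promoted: Record<string, number> = {};
  for (const key of Object.keys(counts)) {
    const count = counts[key] ?? 0;
    if (count >= minOccurrences) {
      promoted[key] = count;
    }
  }
  return { kept, archived, promoted };
}
```

Counting happens over the full stream before anything is archived, so a pattern is not lost just because most of its evidence is older than the retention window.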
Key Takeaways
- MCP servers without memory are stateless function calls. With memory, they become adaptive tool layers that remember what just happened and what has been learned over time.
- The middleware placement is the breakthrough. Put the memory layer between the MCP protocol handler and the real side effect so every tool call gets context automatically.
- Tier 2 is the fastest starting point. Add working-memory reads and writes to your most-used tool first. That’s where deduplication, caching, and fallback behavior pay off immediately.
- Tier 3 turns repeated observations into judgment support. Nicknames, dead zones, call patterns, and preference maps should be promoted only after consistent evidence.
- Shared memory across multiple servers creates coherence. Phone, calendar, home, and other MCP servers become more useful because each server benefits from what the others have learned.
- Start simple, then expand. Make one tool memory-aware, validate the behavior, then extend the pattern across the rest of your MCP surface area.
Implementation advice: If you only do one thing after reading this chapter, add Tier 2 working-memory reads and writes to your most-used MCP tool. You will feel the difference immediately, and the rest of the architecture will make more sense once you see that first tool stop repeating itself.
This chapter pairs with Newsletter #4: Your Phone as an AI Tool — Building an 18-Tool MCP Server from Scratch. The newsletter covers the base server; this chapter shows how to make it memory-aware.
The complete MCP server patterns — including security, multi-transport support, deployment, and broader agent architecture — are covered in The Agentic Development Blueprint.
The Three-Layer Extension Architecture
Why production GitHub Copilot systems separate knowledge, capability, and enforcement into skills, extensions, and hooks.
Why Agents Need Architecture Layers
Once you solve memory, the next production problem appears immediately: the agent remembers more, but it still behaves like a prompt with side effects. It knows a rule, but it cannot reliably execute the right workflow. It knows a boundary, but it may still try the wrong tool. It knows a procedure, but every new agent has to relearn that procedure from duplicated prompt text.
That is why prompts alone are never enough. A prompt is a compressed wish list. It mixes identity, policies, procedures, capability hints, and temporary context into one blob and asks the model to keep all of it straight while doing real work. That might hold for a demo. It does not hold when GitHub Copilot is editing files, calling APIs, sending messages, creating events, or coordinating across a platform with dozens of agents.
The right move is separation of concerns. In a mature agent system, you split three jobs that prompt-only systems keep smashing together:
- Knowledge - what the agent should know how to do repeatedly
- Capability - what the agent can physically do at runtime
- Enforcement - what the agent is forbidden to do, regardless of what it “wants” to do
Those three jobs map cleanly to the three layers of the architecture:
┌────────────────────────────────────────────────────────────┐
│ Layer 1: SKILLS │
│ Knowledge layer - reusable procedures, rules, decisions │
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ Layer 2: EXTENSIONS │
│ Capability layer - tools, APIs, runtime state, side effects│
└────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────┐
│ Layer 3: HOOKS │
│ Enforcement layer - intercept, deny, log, inject feedback │
└────────────────────────────────────────────────────────────┘
This is the same progression I described in my article on context engineering: move important behavior out of fragile prose and into explicit architecture. Skills keep knowledge reusable. Extensions give GitHub Copilot real power. Hooks keep that power governed.
Another way to say it: memory makes the agent continuous, but layers make the agent operational. The 4-tier memory system tells the agent who it is, what it is doing, what it has learned, and what happened before. The three-layer extension architecture tells the agent how to act safely and at scale.
| If you put everything in the prompt… | What breaks | What the layered model fixes |
|---|---|---|
| Procedures live inline in every agent prompt | Duplication, drift, prompt bloat | Move them into shared skills |
| Capabilities are described in prose | No real runtime execution path | Expose governed extension tools |
| Safety rules are only written as instructions | They are probabilistic and can be ignored | Enforce them with deterministic hooks |
That is the core thesis of this chapter. If you want the fast companion version, see Newsletter Issue #5. This blueprint chapter is the implementation version: the code shapes, decision rules, and composition patterns you can lift into your own Copilot platform.
Key distinction: prompts are advisory, skills are instructional, extensions are operational, and hooks are authoritative. If a rule matters, it must eventually leave the prompt and become architecture.
Layer 1 - Skills (Knowledge Layer)
A skill is the cheapest, fastest, and cleanest way to scale agent knowledge. It is not a service. It is not a plugin. It is not a hidden prompt fragment. It is a Markdown file - usually named SKILL.md - with YAML frontmatter and complete instructions for a repeatable procedure.
This is the pattern I covered in my agent skills article, and it is still the most important scalability trick in the platform. If three agents need the same process, that process should not live in three prompts. It should live in one skill file that all three agents can invoke by reference.
That is why the platform runs 60+ skills in production. They cover content publishing, calendar checks, financial workflows, clarification protocols, Google Calendar availability, Vercel previews, research patterns, quality gates, and more. The power is not just the number. The power is that one update teaches every agent that uses the skill. Change the skill once and the behavior updates everywhere.
---
name: vercel-preview-workflow
description: >
Branch + pull request + Vercel preview review workflow for Vercel-connected repos.
Use when the user says "deploy", "preview URL", "ship to Vercel", "create PR",
or asks to update a site that must be reviewed before merge.
---
# Vercel Preview Workflow
## Purpose
Ship website changes through a governed preview-first flow instead of pushing directly to main.
## Rules
1. Never push directly to main for Vercel-connected repos.
2. Create a branch first.
3. Open a PR and wait for the preview URL.
4. Send the preview to the reviewer before merge.
## Procedure
1. Start the dev branch.
2. Stage and commit changes with the governed workflow tools.
3. Push the branch.
4. Create the Vercel PR.
5. Wait for preview validation before merge.
That example shows the entire idea in miniature:
- YAML frontmatter gives the skill a stable identity and a description rich with trigger phrases.
- The body contains the actual rules and procedure in human-readable Markdown.
- The skill stays focused on knowledge. It teaches the workflow. It does not itself create the branch, push the commit, or open the PR.
That last point matters. Skills are the knowledge layer, not the capability layer. A skill answers questions like:
- When should the agent ask for clarification instead of assuming?
- What is the canonical workflow for publishing content?
- How should a Vercel-connected repo be deployed?
- What rules govern a monthly finance review?
Those are exactly the sorts of procedures that get duplicated endlessly when teams rely on prompts alone. Skills stop that duplication. They also give you version control, diffs, code review, and progressive disclosure. GitHub Copilot does not need to preload every skill into every prompt. It can discover the relevant skill, load it when needed, and ignore the rest.
That progressive loading model is why skills scale so well. A prompt carrying every rule becomes what I called the new monolith. A skill system keeps shared knowledge modular. One skill teaches every agent. One edit updates the whole fleet.
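To see why trigger-rich descriptions matter, here is a deliberately naive sketch of discovery: read only the frontmatter of each skill, then select skills whose description quotes a phrase found in the request. Both functions are hypothetical. In practice the model judges relevance from the description itself, and a real loader would use a proper YAML parser; this version only handles `key: value` lines and folded `key: >` blocks with indented continuation lines.

```typescript
type SkillMeta = { name: string; description: string };

// Simplified frontmatter reader: "key: value" and folded "key: >" blocks only.
function parseSkillFrontmatter(source: string): SkillMeta | null {
  const frontmatter = source.match(/^---\n([\s\S]*?)\n---/)?.[1];
  if (frontmatter === undefined) return null;
  const fields: Record<string, string> = {};
  let currentKey = '';
  for (const line of frontmatter.split('\n')) {
    const keyed = line.match(/^([A-Za-z_-]+):\s*(.*)$/); // keys must start at column 0
    if (keyed) {
      currentKey = keyed[1] ?? '';
      const value = keyed[2] ?? '';
      fields[currentKey] = value === '>' ? '' : value;
    } else if (currentKey && line.trim()) {
      // Indented continuation of a folded block: join with single spaces.
      fields[currentKey] = ((fields[currentKey] ?? '') + ' ' + line.trim()).trim();
    }
  }
  const name = fields['name'];
  return name ? { name, description: fields['description'] ?? '' } : null;
}

// Naive stand-in for relevance: match quoted trigger phrases against the request text.
function matchSkills(request: string, skills: SkillMeta[]): string[] {
  const text = request.toLowerCase();
  return skills
    .filter((skill) => {
      const triggers = skill.description.match(/"([^"]+)"/g) ?? [];
      return triggers.some((quoted) => text.indexOf(quoted.slice(1, -1).toLowerCase()) >= 0);
    })
    .map((skill) => skill.name);
}
```

The point of the sketch is the cost profile: only names and descriptions are read up front, and the full skill body is loaded after a match.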
But skills have a hard boundary: they cannot enforce. A skill can say “never push directly to main,” but if the runtime does not back that up, the model can still improvise. Skills teach good behavior. They do not guarantee it. That is why they are layer 1, not the full architecture.
Layer 2 - Extensions (Capability Layer)
If skills are the playbooks, extensions are the hands. This is where GitHub Copilot stops being a text generator and becomes an operator. Extensions expose named tools, connect to APIs, hold runtime state, and perform real side effects. They are the capability layer of the system.
The standard runtime pattern is built on @github/copilot-sdk and the joinSession({ tools, hooks }) entrypoint. An extension joins the active Copilot session, registers tools, and optionally installs lifecycle hooks around that session. The shape is simple enough to learn in an afternoon and strong enough to power a real platform.
import { joinSession } from "@github/copilot-sdk/extension";
async function sendTelegram(chatId, text) {
// Call Telegram Bot API here
return { chatId, delivered: true, text };
}
async function createCalendarEvent(title, startIso, endIso, location) {
// Call Google Calendar API here
return { id: "evt_123", title, startIso, endIso, location };
}
await joinSession({
tools: [
{
name: "telegram_send_message",
description: "Send a Telegram message to a chat ID for alerts, summaries, or approvals.",
parameters: {
type: "object",
properties: {
chatId: { type: "string" },
text: { type: "string" }
},
required: ["chatId", "text"]
},
handler: async ({ chatId, text }) => {
return await sendTelegram(chatId, text);
}
},
{
name: "gcal_create_event",
description: "Create a Google Calendar event with start/end time and optional location.",
parameters: {
type: "object",
properties: {
title: { type: "string" },
startIso: { type: "string" },
endIso: { type: "string" },
location: { type: "string" }
},
required: ["title", "startIso", "endIso"]
},
handler: async ({ title, startIso, endIso, location }) => {
return await createCalendarEvent(title, startIso, endIso, location);
}
}
],
hooks: {
onSessionStart: async () => ({
additionalContext: "[google-services] Telegram + Calendar tools are loaded for this session."
})
}
});
That single snippet captures what extensions are for:
- Tool definitions give GitHub Copilot stable, named actions to call.
- Descriptions teach the model when a tool is appropriate.
- Parameters define a concrete input contract.
- Handlers do the actual work against live systems.
That separation is what turns capability into architecture instead of ad-hoc shell usage. Instead of vaguely telling the model “you can send a Telegram message somehow,” you give it a first-class telegram_send_message tool with a schema, a handler, and a predictable outcome.
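The contract can be illustrated without the SDK at all. The sketch below is a hypothetical dispatcher that looks a tool up by name, checks required parameters and primitive types against the declared schema, and only then calls the handler. It shows the idea of a schema-backed contract; it is not how the SDK routes calls internally.

```typescript
type ToolDefinition = {
  name: string;
  description: string;
  parameters: {
    type: 'object';
    properties: Record<string, { type: string }>;
    required?: string[];
  };
  handler: (args: Record<string, unknown>) => Promise<unknown>;
};

// Hypothetical dispatcher: validate against the declared contract, then run the handler.
async function callTool(
  tools: ToolDefinition[],
  name: string,
  args: Record<string, unknown>,
): Promise<unknown> {
  const tool = tools.find((candidate) => candidate.name === name);
  if (!tool) {
    throw new Error('Unknown tool: ' + name);
  }
  for (const key of tool.parameters.required ?? []) {
    if (args[key] === undefined) {
      throw new Error(name + ': missing required parameter "' + key + '"');
    }
  }
  for (const key of Object.keys(args)) {
    // Only primitive JSON Schema types that map onto typeof are checked here.
    const expected = tool.parameters.properties[key]?.type;
    if (expected && typeof args[key] !== expected) {
      throw new Error(name + ': parameter "' + key + '" must be of type ' + expected);
    }
  }
  return tool.handler(args);
}
```

Because bad input is rejected before the handler runs, the side effect never sees a malformed call.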
In production, the platform runs 30+ extensions. A representative slice looks like this:
- telegram-bridge for family notifications and alerts
- dev-workflow for governed branch, commit, push, and PR operations
- cron-scheduler for recurring jobs and next-run inspection
- google-services for Calendar, Gmail, and other Google APIs
- budget-tracker for structured finance operations
- shopping-list for household logistics
- twilio-sms for outbound messaging flows
Those extensions are not documentation artifacts. They are runtime capability. They actually send messages, create events, parse cron schedules, read finance data, and manage state. This is why extensions are the right home for anything that depends on live truth: today’s inbox, current calendar state, an API token, a webhook payload, a database read, or a preview deployment status.
That also clarifies the boundary between skills and extensions:
- Skill: “When a repo is connected to Vercel, always use a preview-first PR workflow.”
- Extension: start_dev_branch, dev_add, dev_commit, dev_push, and create_vercel_pr.
The skill teaches the procedure. The extension gives GitHub Copilot the buttons to press. You need both.
If you want the broader runtime patterns, the companion article GitHub Copilot CLI Extensions - Complete Guide goes deeper on the extension surface. For this chapter, the essential point is simple: extensions are where the agent stops describing work and starts doing work.
Layer 3 - Hooks (Enforcement Layer)
The third layer is where policy becomes real. Hooks intercept behavior before or after execution. That means they do something prompts and skills can never fully do: they can deny, redirect, warn, or inject context deterministically.
This is the difference between guidance and enforcement. A prompt can say “do not edit .env files.” A skill can explain why editing .env is dangerous. A hook can stop the edit from happening.
There are two closely related shapes to understand:
- SDK hook functions like onPreToolUse and onPostToolUse inside an extension built with joinSession()
- Repo-level hook routing through hooks.json, which dispatches CLI events like preToolUse and postToolUse into policy handlers such as gh-hookflow
Together, they give you both immediate interception and scalable workflow routing.
{
"version": 1,
"hooks": {
"preToolUse": [
{
"type": "command",
"command": "gh hookflow run --raw --event-type preToolUse",
"timeoutSec": 1800
}
],
"postToolUse": [
{
"type": "command",
"command": "gh hookflow run --raw --event-type postToolUse",
"timeoutSec": 1800
}
],
"sessionStart": [
{
"type": "command",
"command": "gh hookflow check-setup || gh hookflow init",
"timeoutSec": 1800
}
]
}
}
That hooks.json file is what makes repo-level governance durable. The CLI emits hook events. The router passes them to gh-hookflow. Hookflow then evaluates workflow files in the repository and decides whether to block, warn, or react.
Here is the canonical example the architecture must support: blocking direct modification of secret-bearing config files.
import { joinSession } from "@github/copilot-sdk/extension";
const PROTECTED_PATHS = [
/\.env($|\.)/,
/package-lock\.json$/,
/\.github\/workflows\//
];
await joinSession({
tools: [],
hooks: {
onPreToolUse: async (input) => {
if (input.toolName !== "edit" && input.toolName !== "create") {
return undefined;
}
const path = String(input.toolArgs?.path || "");
if (PROTECTED_PATHS.some((pattern) => pattern.test(path))) {
return {
deny: true,
reason: "Protected file blocked: " + path + ". Use environment variables or the approved workflow instead."
};
}
return undefined;
},
onPostToolUse: async (input) => {
if (input.toolName === "edit") {
return {
additionalContext: "[dev-guard] File changed. Run validation before commit."
};
}
return undefined;
}
}
});
That is the enforcement layer doing exactly what it is supposed to do:
- Before execution: inspect the tool call and block forbidden paths
- After execution: inject follow-up context, log, or trigger remediation
And this is the crucial production insight: unlike prompt text, hooks cannot be casually ignored by the model. If the runtime says the action is denied, the action is denied. That is why hooks are the correct home for protected files, git governance, pre-commit checks, approval gates, and audit logging.
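Because a guard like this is a pure decision, it is worth pulling the path check out of the hook so it can be tested directly. The sketch below is one possible version. `isProtectedPath` is a hypothetical helper, and its patterns are slightly stricter than the hook example above: they are anchored to path segments so that a file such as `docs/my.env.md` is not blocked by accident.

```typescript
// Assumption: paths are repo-relative. Patterns are anchored to a path segment boundary.
const PROTECTED_PATTERNS: RegExp[] = [
  /(^|\/)\.env($|\.)/,            // .env, .env.local, apps/web/.env.production
  /(^|\/)package-lock\.json$/,    // lockfile at any depth
  /(^|\/)\.github\/workflows\//,  // anything under .github/workflows/
];

function isProtectedPath(path: string): boolean {
  const normalized = path.replace(/\\/g, '/'); // tolerate Windows separators
  return PROTECTED_PATTERNS.some((pattern) => pattern.test(normalized));
}
```

A pre-tool hook can then call `isProtectedPath(path)` and return its deny result, while the patterns themselves stay covered by ordinary unit tests.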
This is also where gh-hookflow becomes important. Once you have more than a handful of guardrails, you do not want a giant monolithic extension file containing every policy in JavaScript. Hookflow lets you treat hook events the way GitHub Actions treats repo events: event in, workflow matched, steps executed, result returned. Complex policies stop hiding in prose and start living as reviewable artifacts.
name: block-env-edits
description: Prevent direct modifications to .env files and environment overlays
blocking: true
on:
file:
lifecycle: pre
types: [edit, create]
paths:
- ".env"
- ".env.*"
steps:
- name: explain denial
run: |
echo "Direct .env edits are blocked."
echo "Store secrets in the environment manager and use approved config workflows instead."
exit 2
The combination is powerful: hooks.json routes the event, hookflow evaluates the workflow, and the enforcement decision lands before the risky side effect happens. That is deterministic governance for an AI system.
The Decision Flowchart
The easiest way to keep these layers clean is to turn the choice into a flowchart. Every time you need to add behavior, ask the same sequence of questions:
START: I need to add behavior to my Copilot agent
│
├─ Do I need to BLOCK, REQUIRE, or INTERCEPT something?
│ ├─ YES -> Hook
│ └─ NO
│
├─ Do I need to ADD A TOOL, CALL AN API, HANDLE RUNTIME STATE,
│ or perform a REAL SIDE EFFECT?
│ ├─ YES -> Extension
│ └─ NO
│
├─ Do I need to TEACH a repeatable procedure, decision rule,
│ formatting convention, or workflow used by multiple agents?
│ ├─ YES -> Skill
│ └─ NO -> Inline instruction
That decision tree sounds almost too simple, but it is the architecture discipline that keeps the platform from collapsing back into prompt soup.
- Need to BLOCK something? Use a hook because only hooks can deny.
- Need to DO something? Use an extension because only extensions can provide runtime capability.
- Need to KNOW something? Use a skill because skills are the reusable knowledge layer.
The main anti-patterns all come from violating that flowchart. Teams use skills as fake APIs. They stuff policy into prompts instead of hooks. They bury repeated procedures inside agent definitions instead of extracting them into skills. Once you adopt the decision tree, the architecture gets clearer fast.
Fast rule of thumb: if the model could ignore it, it is not enforcement yet. If a human would describe it as “a procedure,” it probably belongs in a skill. If it touches the outside world, it probably belongs in an extension.
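The flowchart is small enough to encode directly, which is handy in design reviews or a lint step for agent definitions. This is only a sketch; `chooseLayer` and the `Need` fields are hypothetical names for the three questions above, asked in the same order.

```typescript
type Need = {
  mustBlockOrIntercept: boolean;   // BLOCK, REQUIRE, or INTERCEPT something
  needsRuntimeCapability: boolean; // add a tool, call an API, hold state, cause a side effect
  sharedProcedure: boolean;        // teach a repeatable procedure used by multiple agents
};

type Layer = 'hook' | 'extension' | 'skill' | 'inline instruction';

// Order matters: enforcement wins over capability, capability over knowledge.
function chooseLayer(need: Need): Layer {
  if (need.mustBlockOrIntercept) return 'hook';
  if (need.needsRuntimeCapability) return 'extension';
  if (need.sharedProcedure) return 'skill';
  return 'inline instruction';
}
```

A requirement that answers yes to more than one question usually splits into several concerns, each owned by its own layer, as the next section shows.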
Composing All Three
The real payoff is not any individual layer. It is composition. Production Copilot systems feel trustworthy when the knowledge layer, capability layer, and enforcement layer all point in the same direction.
The cleanest example is the deployment workflow used on Vercel-connected repos:
- vercel-preview-workflow skill teaches the correct process: branch first, open a PR, wait for the preview URL, get review, then merge.
- dev-workflow extension provides the governed runtime tools: start_dev_branch, dev_add, dev_commit, dev_push, and create_vercel_pr.
- dev-guard hook blocks the unsafe path: raw git mutation commands, direct pushes to protected branches, or edits to files that should flow through governed tooling.
That is the composition pattern in its purest form:
| Layer | Role in the workflow | Why it matters |
|---|---|---|
| Skill | Documents when and how to use the preview workflow | The agent learns the right sequence |
| Extension | Exposes the actual branch, commit, push, and PR tools | The agent has a safe paved road to execute |
| Hook | Intercepts unsafe commands and governance violations | The unsafe path is blocked even if the model improvises |
Imagine the user says, “ship this site update.” A prompt-only agent may remember some vague best practice and still take shortcuts. In the three-layer system, the path is explicit:
- The skill tells GitHub Copilot that Vercel repos must go through preview-first PR flow.
- The extension gives it the exact tools needed to follow that path.
- The hook blocks any attempt to bypass the governed workflow with raw git commands or direct protected-branch pushes.
The result is not just a smarter agent. It is an agent with a paved road. The good path is easy, and the bad path is structurally hard or impossible. That is the same platform engineering principle we apply to humans. We are just applying it to GitHub Copilot.
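To make the enforcement layer concrete, here is a minimal sketch of the kind of check a guard hook might run before a shell command executes. It is not the real dev-guard implementation: the function name and the deny-list are hypothetical, and the hook wiring itself is left out.

```python
import shlex

# Hypothetical deny-list for illustration; the real dev-guard rules are not shown here.
MUTATING_GIT_SUBCOMMANDS = {"add", "commit", "push", "merge", "rebase", "reset"}


def guard_shell_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a shell command the agent wants to run."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: deny rather than guess.
        return False, "could not parse command"
    # Look at every adjacent pair so chained commands (a && git push) are caught too.
    for current, following in zip(tokens, tokens[1:]):
        if current == "git" and following in MUTATING_GIT_SUBCOMMANDS:
            return False, f"raw 'git {following}' is blocked; use the dev-workflow tools"
    return True, "ok"
```

The check is deliberately naive. It would miss forms such as `git -C . push` or a shell alias, which is exactly why circumvention tests matter before you trust any guard.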
This same composition pattern repeats everywhere:
- Content publishing: content issue lifecycle skill + publishing extensions + quality gate hooks
- Finance ops: finance task lifecycle skill + budget tracker extension + reminder suppression hooks
- Family scheduling: calendar availability skill + Google services extension + stale-data hooks
Once you see the pattern, you stop asking “Should this be in the prompt?” and start asking the more useful question: Which layer owns this concern?
Open-Source Starter
You do not need to invent this architecture from scratch. The open-source copilot-hooks-starter repo exists to give you templates for all three patterns: skills, extensions, and hooks, plus the safety guidance around them.
The repo structure is intentionally opinionated because the goal is to shorten the path from “I understand the idea” to “I have a governed Copilot system running in my repo.”
copilot-hooks-starter/
├── README.md
├── extensions/
│ ├── basic-tool/
│ ├── multi-tool/
│ └── hook-based/
├── hooks/
│ ├── pre-edit-guard.md
│ ├── pre-commit-tests.md
│ ├── post-edit-lint.md
│ └── post-tool-context-injection.md
├── skills/
│ ├── skill-template/
│ │ └── SKILL.md
│ └── skill-vs-hook-decision.md
├── examples/
└── safety/
That starter repo gives you exactly what most teams are missing at the beginning:
- Hook templates for blocking protected files, linting after edits, and injecting follow-up context
- Extension templates for single-tool, multi-tool, and hook-centric extension patterns
- Skill templates with canonical frontmatter and instruction structure
- Safety guidance covering protected paths, circumvention tests, and sandbox strategy
The practical getting-started sequence is straightforward:
- Create one skill for a procedure you already repeat across agents.
- Expose one high-value extension tool that replaces vague shell usage.
- Add one pre-tool guard for an action that would actually hurt if misused.
- Wire hooks.json so the repo can route events into hookflow.
- Run circumvention tests before you trust the system.
That sequence gives you a real three-layer system quickly: knowledge, capability, and enforcement. Once it works for one workflow, the pattern expands naturally.
What This Layered Model Changes
The deepest shift is architectural, not tactical. You stop treating GitHub Copilot as a chatbot with a long prompt and start treating it like a platform component with its own governance model.
That mindset changes design decisions everywhere:
- You extract repeated procedures into skills instead of copying them into agent prompts.
- You expose real APIs and workflows through extensions instead of hoping shell access is enough.
- You move safety boundaries into hooks instead of treating warnings as enforcement.
- You review agent policy as code, because by this point it is code.
That is how GitHub Copilot becomes the hero of the system instead of a risky automation bolt-on. It remembers through the 4-tier memory model, learns through skills, acts through extensions, and stays inside policy through hooks. Each layer does one job well. Together they turn a helpful assistant into production infrastructure.
Next step: subscribe at /newsletter#subscribe for the faster weekly deep dives, pair this chapter with The Agentic Development Blueprint if you want the full operating model around agents and CI/CD, and if you want help implementing governed Copilot workflows for your team, check out my consulting page.
This chapter pairs with Newsletter Issue #5: The Three-Layer Agent Extension Architecture. Read the newsletter for the fast conceptual version; use this chapter when you are ready to build the real thing.
Together with the rest of this blueprint, you now have the full stack for persistent GitHub Copilot agents: memory to remember, skills to know, extensions to act, and hooks to enforce.
Multi-Agent Orchestration — The Production Playbook
53 agents, 57 cron jobs, zero chaos. The coordination patterns that make autonomous multi-agent systems actually work.
Why Multi-Agent Orchestration Matters
Everything in this blueprint so far has been about giving individual agents the ability to persist, learn, and act. The 4-tier memory model solves continuity. Skills solve capability reuse. Extensions solve tool access. MCP bridges solve ecosystem integration. But there is a fundamental challenge that none of those layers address on their own: what happens when dozens of agents need to coexist, coordinate, and not step on each other?
This is where most multi-agent projects collapse. Not because any single agent is poorly built, but because nobody designed the coordination layer. The result is duplicated work, conflicting outputs, noisy notifications, runaway fan-out, and humans drowning in agent messages instead of being freed by them. The gap between “one useful agent” and “a platform of 53 agents running autonomously” is not more agents — it is orchestration architecture.
This chapter covers the production patterns that make large-scale multi-agent systems actually work. You will learn a taxonomy that determines which agents get memory and which stay stateless, orchestration patterns for parallel fan-out and sequential pipelines, goal-oriented team coordination for outcomes that span months, cron architecture that keeps 57 scheduled jobs clean, and a cross-session communication mesh that lets agents collaborate across repository boundaries. These are not theoretical frameworks. They are the exact patterns running in production across 53 agents today — the same platform that backs every chapter of this blueprint.
The connection to the memory system is direct: orchestration determines who gets which memory tier. A stateless task agent does not need 4-tier memory. A team agent coordinating a 12-month goal needs all four tiers plus additional tracking files. The taxonomy in this chapter is what makes the memory architecture from Chapters 1 through 4 actionable at scale. Without orchestration rules, you either over-provision memory to agents that do not need it, or you starve agents that cannot function without continuity.
The premise of “53 agents, zero chaos” is not aspirational marketing. It is a design constraint enforced by the patterns in this chapter. Every pattern exists because an uncoordinated version was tried first, failed in a specific way, and forced a structural fix. This chapter is the field manual for those fixes.
The 4 Agent Patterns Taxonomy
Memory gives agents persistence. Skills give them capabilities. Extensions give them tools. But none of that matters if 53 agents are all running at the same time with no coordination. This chapter is about the orchestration layer — the patterns that turn a collection of independent agents into a coherent system.
Most multi-agent discussions jump straight to frameworks, message queues, or model selection. In production, the more important question comes first: what kind of agent are you actually building? A platform that treats every agent the same becomes expensive, noisy, and hard to reason about. Some agents need memory. Some should stay stateless forever. Some need to dispatch others. Some should never coordinate anything at all. The taxonomy is what keeps the system legible as it grows.
In this platform, every agent falls into one of four patterns. The pattern is not a naming preference. It is an architectural contract that determines whether the agent gets 4-tier memory, whether it is allowed to orchestrate other agents, how it is scheduled, and how long it should exist.
| Pattern | Example | Memory? | Orchestrates? | Owns a Goal? | Lifecycle |
|---|---|---|---|---|---|
| Domain Agent | finance-manager, nicu-care, dog-parent | ✅ 4-tier | ❌ | ❌ (owns a domain) | Permanent |
| Task Agent | daily-briefing, meal-planner, heartbeat | ❌ stateless | ❌ | ❌ (runs a procedure) | Permanent |
| Orchestrator | checkin | ❌ stateless | ✅ dispatches all | ❌ (generic coordination) | Permanent |
| Team Agent | realtor-team | ✅ 4-tier + manifest + progress | ✅ dispatches defined team | ✅ (owns a life goal) | Created → Active → Completed |
Domain Agents
Domain agents are the workhorses. They own a stable area of responsibility and come back to that responsibility over and over across days, weeks, and months. A finance-manager needs to remember account structure, payment patterns, autopay coverage, open tasks, and recent anomalies. A NICU support agent needs to remember feeding cadence, appointment cadence, and the operating mode the family is in right now. That is why domain agents get the full 4-tier memory model: identity, working state, long-term patterns, and event history.
The important constraint is that domain agents do domain work; they do not orchestrate the rest of the platform. Once you let every specialist dispatch other specialists, you lose control of fan-out, duplicate reporting, and accountability. The finance agent should decide what matters in finance. It should not also be discovering every other agent in the system and building a summary digest. That separation is what keeps the platform modular.
Here is what a domain agent definition looks like in practice. The agent file declares identity, memory paths, and behavioral boundaries:
# Finance Manager — Domain Agent
# .github/agents/finance-manager.agent.md
## Memory
Load first: data/agents/finance-manager/core.md (Tier 1)
Load next: data/agents/finance-manager/working.md (Tier 2)
On-demand: data/agents/finance-manager/long-term.md (Tier 3)
Append always: data/agents/finance-manager/events.log (Tier 4)
## Domain Ownership
- Budget tracking, bill payments, expense categorization
- Savings goals, debt management, income tracking
- Monthly reporting and anomaly detection
## Boundaries
- Do NOT orchestrate other agents
- Do NOT dispatch sub-agents for non-finance work
- Do NOT modify data files outside data/agents/finance-manager/
- If a finding impacts another domain, report it — do not act on it
The boundaries section is where architecture enforcement lives. Without explicit constraints, domain agents inevitably drift toward orchestration. They discover an interesting cross-cutting pattern, start dispatching helpers, and before long your clean taxonomy has decayed into a graph where every agent talks to every other agent. Boundaries make the taxonomy durable.
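A boundary written in an agent file is still an instruction the model could ignore. If you want the "do not modify files outside your own directory" rule to hold structurally, a guard could run a path check like the sketch below. The function name is hypothetical, and it assumes repo-relative paths with forward slashes.

```python
from pathlib import PurePosixPath


def is_inside_agent_dir(agent: str, target: str) -> bool:
    """True when target resolves to a path under data/agents/<agent>/."""
    allowed = PurePosixPath("data/agents") / agent
    parts: list[str] = []
    for part in PurePosixPath(target).parts:
        if part == "..":
            if not parts:
                return False  # climbs above the repo root
            parts.pop()
        elif part != ".":
            parts.append(part)
    resolved = PurePosixPath(*parts)
    # The allowed directory must be an ancestor of the resolved path.
    return allowed in resolved.parents
```

Normalising `..` before comparing matters: a write to `data/agents/finance-manager/../nicu-care/core.md` starts with the right prefix but lands in another agent's memory.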
Task Agents
Task agents are stateless procedures packaged as durable entry points. They are permanent in the sense that the capability always exists, but each run is disposable. A daily-briefing agent does not need to remember a private life story from the last ten runs. It just needs to gather the current weather, calendars, tasks, and alerts, then format a good briefing. A meal-planner needs current inputs, not months of longitudinal memory.
This is where many platforms waste resources. They give rich memory to procedural agents because “more context feels safer.” In practice, that produces bloated prompts, stale carryover, and confusing state. If the agent’s purpose is to execute a repeatable checklist and exit, keep it stateless. Let the data sources provide the fresh inputs each time. Stateless task agents stay fast, predictable, and cheap.
A task agent definition is deliberately minimal. No memory paths, no working files, just a procedure and an exit:
# Daily Briefing — Task Agent
# .github/agents/daily-briefing.agent.md
## Procedure
1. Get current weather for location
2. Merge Google Calendar + WorkIQ for today
3. Pull top 5 priority tasks
4. Scan unread emails for urgent items
5. Check bills due within 3 days
6. Format structured briefing
7. Send via Telegram
8. Exit — do not persist state
## No Memory
This agent is stateless by design. Each run gathers fresh data.
Do NOT create or update any files in data/agents/.
The explicit “No Memory” declaration prevents well-meaning drift. Without it, someone adding features might think: “What if we remember yesterday’s report to avoid duplicates?” That sounds reasonable but introduces stale-state pruning, edge cases around missed runs, and context window bloat. The better answer is idempotent procedures. Fresh data sources produce fresh output. No state management needed.
Orchestrators
Orchestrators do not perform the underlying domain work themselves. Their job is to discover the right workers, dispatch them, collect structured outputs, and compile a human-facing result. The checkin agent is the canonical example. It does not pretend to know whether the finance domain is quiet, whether a content pipeline has drifted, or whether the dog-food inventory changed. It asks the specialists, then turns many small reports into one coherent update.
That is why orchestrators stay stateless too. Their value is not memory accumulation; it is coordination logic. If you burden an orchestrator with long-lived memory, it starts carrying stale assumptions about who should run, what the previous summary looked like, or which agents were noisy last time. Better to rediscover the current world on each run, then compile based on fresh evidence.
Team Agents
Team agents are the most sophisticated pattern because they combine orchestration with persistent goal ownership. They exist for outcomes that take months, require multiple specialists, and have a meaningful finish line. Buying a house is not a domain like “finance” or a procedure like “send a morning briefing.” It is a bounded life goal with phases, dependencies, and exit criteria. That requires more than a generic orchestrator.
A team agent therefore gets memory plus orchestration plus a lifecycle. It needs 4-tier memory for identity and context, a roster manifest so it knows who is on the team, progress tracking so it knows which milestones have been hit, and an end state so the system can eventually retire it. Without that lifecycle, goal-oriented systems become immortal zombie projects that continue dispatching long after the goal is already complete.
The key insight is simple: the pattern you choose determines the memory tier, the orchestration capability, and the lifecycle. Get the pattern wrong and you either waste resources by giving memory to stateless procedures, or you lose crucial continuity by forcing a domain specialist to wake up from zero every time. Taxonomy is not bureaucracy. It is what makes a 53-agent platform understandable enough to keep shipping.
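One way to keep the taxonomy honest is to encode the table as data and derive provisioning from it. The sketch below does that. The class and function names are hypothetical; the values and file names come from the taxonomy table and the memory paths shown in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternContract:
    memory: bool
    orchestrates: bool
    owns_goal: bool
    lifecycle: str


# One entry per row of the taxonomy table.
CONTRACTS = {
    "domain": PatternContract(True, False, False, "permanent"),
    "task": PatternContract(False, False, False, "permanent"),
    "orchestrator": PatternContract(False, True, False, "permanent"),
    "team": PatternContract(True, True, True, "created -> active -> completed"),
}


def memory_files(pattern: str) -> list[str]:
    """Files to provision for an agent of the given pattern."""
    contract = CONTRACTS[pattern]
    if not contract.memory:
        return []  # stateless patterns get no memory directory at all
    files = ["core.md", "working.md", "long-term.md", "events.log"]
    if contract.owns_goal:
        files += ["team-manifest.md", "progress.md"]
    return files
```

With the contract in one place, "does this agent get memory?" stops being a per-agent debate and becomes a lookup.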
Orchestration Pattern — Parallel Dispatch + Compile
This is the checkin pattern, and it is the orchestration move most teams need first. The orchestrator’s job is not to do thirty things serially. Its job is to discover → filter → dispatch → collect → compile → notify. When you implement that as a clean pipeline, you get the leverage of many specialists without turning the human’s phone into a confetti cannon.
Step 1 — Discover
Discovery starts with the filesystem because the filesystem is the registry. The orchestrator walks .github/agents/*.agent.md, parses filenames into agent names, and builds the raw candidate set. That sounds almost embarrassingly simple until you compare it with systems that maintain a separate agent registry in a database or hard-code the roster inside the orchestrator prompt. File discovery wins because it stays close to source of truth. Add an agent file, and the platform can discover it immediately.
The discovery step also keeps orchestration resilient to growth. At 8 agents, hand-maintained allowlists feel manageable. At 53 agents, they drift. A discovery pass means the orchestrator can adapt to new domain agents without requiring its own prompt to be rewritten every time. The orchestrator is not manually curated on every cycle; it is reading the platform as it exists now.
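Outside the agent runtime, the same discovery pass is a few lines of Python. This is a sketch, not the platform's code: the helper names are hypothetical, and the only thing taken from the chapter is the `.github/agents/*.agent.md` convention.

```python
from pathlib import Path
from typing import Optional

AGENT_SUFFIX = ".agent.md"


def agent_name(filename: str) -> Optional[str]:
    """Turn 'finance-manager.agent.md' into 'finance-manager'; None for other files."""
    if filename.endswith(AGENT_SUFFIX) and len(filename) > len(AGENT_SUFFIX):
        return filename[: -len(AGENT_SUFFIX)]
    return None


def discover_agents(repo_root: str = ".") -> list[str]:
    """List agent names found under <repo_root>/.github/agents/."""
    agents_dir = Path(repo_root) / ".github" / "agents"
    names = (agent_name(p.name) for p in agents_dir.glob("*" + AGENT_SUFFIX))
    return sorted(n for n in names if n)
```

Because the roster is derived on every call, adding an agent file is the whole registration step.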
glob(pattern: ".github/agents/*.agent.md")
Step 2 — Filter Exclusions
Discovery gives you candidates. Filtering turns candidates into the correct execution set. This is where disciplined orchestration matters most, because the wrong inclusion rule creates loops, duplicate work, and noisy summaries.
| Exclusion Category | Examples | Why Excluded |
|---|---|---|
| Orchestrators and task agents | checkin, daily-briefing, budget-review, weekly-planner, meal-planner, heartbeat | Dispatching an orchestrator creates infinite loops; task agents run a procedure rather than own a domain |
| Team agents | realtor-team, any *-team agent | Run on their own cron schedule; generic orchestrators don’t own them |
| Team dedicated agents | credit-coach, listing-tracker (discovered from team-manifest.md) | Owned by their team, not the generic checkin |
| Utility/meta agents | skill-optimizer, context-auditor, platform-manager, coding-agent | Meta-agents that audit the platform itself |
| Test agents | test-hotreload, hotreload-proof | Not domain agents |
The most important filtering rule is the dynamic one: every *-team agent owns a roster, and some of those roster members are dedicated agents that should not be swept up by generic platform checkins. The orchestrator therefore reads data/agents/{team}/team-manifest.md, parses the roster table, and adds entries where Type = dedicated to the exclusion list. Shared agents are not excluded, because they still operate as normal domain agents outside the team context. That distinction is subtle and absolutely load-bearing.
This dynamic team exclusion prevents one of the easiest failure modes in multi-agent platforms: double ownership. Without it, a dedicated sub-agent may receive both team-dispatched work and generic checkin work, then produce overlapping or contradictory outputs. Once teams exist, your generic orchestrator must become team-aware even if it never dispatches the teams themselves.
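The filtering rules can be sketched as a static exclusion set plus one dynamic pass over the team manifests. Treat the details as assumptions: the chapter only says the roster is a table with a Type column, so the `Agent` column name, the function names, and passing manifests in as strings are all illustrative choices.

```python
# Static exclusions copied from the table above.
STATIC_EXCLUSIONS = {
    "checkin", "daily-briefing", "budget-review", "weekly-planner",
    "meal-planner", "heartbeat", "skill-optimizer", "context-auditor",
    "platform-manager", "coding-agent", "test-hotreload", "hotreload-proof",
}


def dedicated_members(manifest_text: str) -> set[str]:
    """Collect roster rows whose Type cell is 'dedicated' (assumed column names)."""
    header = None
    found: set[str] = set()
    for line in manifest_text.splitlines():
        if not line.strip().startswith("|"):
            header = None  # left the table
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = [c.lower() for c in cells]
            continue
        if set("".join(cells)) <= set("-: "):
            continue  # the |---|---| separator row
        row = dict(zip(header, cells))
        if row.get("type", "").lower() == "dedicated" and row.get("agent"):
            found.add(row["agent"])
    return found


def filter_roster(discovered: list[str], manifests: dict[str, str]) -> list[str]:
    """Drop static exclusions, team agents, and each team's dedicated members."""
    excluded = set(STATIC_EXCLUSIONS)
    for manifest_text in manifests.values():
        excluded |= dedicated_members(manifest_text)
    return [a for a in discovered if a not in excluded and not a.endswith("-team")]
```

Shared roster members fall through untouched, which is the point: they keep reporting to the generic checkin while dedicated members answer only to their team.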
What Goes Wrong Without Filtering
To understand why every exclusion category exists, consider what happens when you skip filtering and dispatch every discovered agent file. The orchestrator finds 53 agent definitions and launches all 53 in parallel. The first problem is self-dispatch: the orchestrator discovers its own agent file, launches a copy of itself, which discovers the same agent files, launches another round, and the system spirals into an infinite fan-out loop. That is why orchestrators are excluded — they coordinate, they do not get coordinated.
The second problem is team collision. A realtor-team agent runs on its own weekly cron. It dispatches its roster — credit-coach, listing-tracker, finance-manager — with carefully scoped prompts that carry team context: down payment targets, pre-approval deadlines, phase-specific priorities. If the generic checkin also dispatches credit-coach with a generic “what changed in your domain?” prompt, the credit-coach now runs twice in the same cycle with conflicting instructions. One run creates tasks aligned with the home-buying goal; the other creates generic financial tasks with no team context. The human receives two contradictory reports about the same domain. Worse, both runs may create duplicate tasks or overwrite each other’s working memory updates.
The third problem is noise from utility agents. A platform-manager or context-auditor exists to audit the system itself — agent definitions, memory files, skill consistency. These meta-agents are not domain specialists with user-facing status to report. Dispatching them during a checkin produces output like “12 agent files parsed, 0 contradictions found” — technically a report, but one that adds clutter without adding signal. The human’s compiled digest becomes longer without becoming more useful.
The fourth problem is test contamination. Test agents like test-hotreload exist purely for infrastructure validation. They have no domain, no meaningful status, and their outputs are implementation artifacts. Including them in a checkin dilutes the signal-to-noise ratio and trains the human to skim past items, which means they eventually skim past something that matters.
Each exclusion category exists because a real production failure taught the lesson. The filtering step is not bureaucratic caution — it is the scar tissue from watching an unfiltered orchestrator turn a calm daily digest into a noisy, duplicated, self-referencing mess.
Step 3 — Dispatch in Parallel
With the filtered roster ready, the orchestrator launches fresh agents in parallel. Fresh matters. Parallel matters. Fresh keeps contexts isolated; parallel keeps latency low enough that a checkin still feels like one operation instead of a slow serial crawl through the platform.
The dispatch prompt must force structured output. Free-form reporting makes compile logic fragile because every agent invents its own format. The pattern below standardizes what the collector will parse later:
Scheduled check-in. Current time: {CURRENT_TIME}.
Check your domain for updates, urgent items, and anything noteworthy.
TASK-FIRST: If you discover anything actionable, CREATE A TASK — don't just report it.
Return: STATUS: [updates/nothing], URGENT_SENT: [yes/no],
TASKS_CREATED: [list or none], REPORT: [2-4 bullets or "All clear."]
That prompt does three jobs at once. It reminds the agent to inspect its domain, enforces task-first behavior so findings become durable work instead of ephemeral text, and returns structured fields the orchestrator can safely aggregate. This is the difference between an orchestrator and a group chat. The workers are expected to produce machine-readable summaries, not vibes.
Step 4 — Collect
Once agents are running, the orchestrator waits for completion, reads their outputs, and parses the STATUS and REPORT fields. This is where background execution actually earns its keep. A checkin cycle may launch 25 to 30 agents in parallel, but the collection step turns that burst back into one dataset. The orchestrator is not interested in every intermediate thought. It is interested in whether each domain produced updates, whether urgent notifications were already sent, and what short bullets belong in the compiled digest.
Structured collection also gives you room for guardrails. If an agent omits fields, the orchestrator can mark it as malformed instead of blindly pasting the whole response into a message. If an agent says STATUS: nothing, it contributes silence, not clutter. Clean collection is what allows large fan-out without large cognitive overhead.
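Here is a minimal sketch of that collection guardrail. The four field names come from the dispatch prompt; the parser itself, the `malformed` flag, and the assumption that each field appears as `NAME: value` somewhere in the reply are mine.

```python
import re

FIELDS = ("STATUS", "URGENT_SENT", "TASKS_CREATED", "REPORT")
_LABEL = re.compile(r"\b(" + "|".join(FIELDS) + r"):")


def parse_reply(text: str) -> dict[str, object]:
    """Split an agent reply into its labelled fields and flag missing ones."""
    matches = list(_LABEL.finditer(text))
    parsed: dict[str, object] = {}
    for i, match in enumerate(matches):
        # A field's value runs until the next label or the end of the reply.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parsed[match.group(1)] = text[match.end():end].strip().rstrip(",").strip()
    parsed["malformed"] = any(field not in parsed for field in FIELDS)
    return parsed
```

A reply that skips the contract is marked malformed and can be kept out of the digest instead of being pasted in whole.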
Step 5 — Compile
Compilation is an editorial act. The orchestrator merges the useful outputs, removes duplicates, groups related items, and decides whether there is enough signal to notify at all. One of the strongest rules in this platform is that if all agents report nothing, the orchestrator sends nothing. Silence is signal. A healthy system should not manufacture a check-in message just to prove it ran.
When there are updates, the compiled message should feel like a single coherent report, not a concatenation of thirty mini-reports. Group domain bullets, surface urgency first, and summarize task creation where it matters. The human should be able to scan one Telegram message and understand the system’s state in seconds.
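The "silence is signal" rule is easy to express in code. The sketch below assumes each agent's reply has already been parsed into a dict with STATUS, URGENT_SENT, and REPORT keys; the function name and the digest layout are illustrative, not the platform's actual format.

```python
from typing import Optional


def compile_digest(reports: dict[str, dict]) -> Optional[str]:
    """Merge per-agent reports into one message, or None when there is nothing to say."""
    active = {name: r for name, r in reports.items() if r.get("STATUS") == "updates"}
    if not active:
        return None  # every agent reported nothing, so send nothing
    # Agents that flagged something urgent sort first, then alphabetical.
    ordered = sorted(active, key=lambda n: (active[n].get("URGENT_SENT") != "yes", n))
    lines = ["Check-in summary"]
    for name in ordered:
        lines.append(f"{name}:")
        lines.append(str(active[name].get("REPORT", "")).strip())
    return "\n".join(lines)
```

Returning `None` rather than an empty string makes the caller's decision explicit: no digest means no notification.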
Step 6 — Notify
The final step is one notification, not thirty. This is the production payoff of orchestration. Humans do not want agent democracy. They want coordinated outcomes. The platform can fan out aggressively under the hood, but the interface back to the human should stay compressed and calm.
In practice, this pattern runs eight times daily — every two hours from 7 AM to 9 PM. Each cycle launches roughly 25 to 30 fresh agents in parallel, collects results in 60 to 90 seconds, and delivers one message only when there is something worth sending. That rhythm is what makes a large autonomous system feel composed instead of frantic.
Orchestration Pattern — State Machine Pipeline
The second major orchestration pattern is the state machine. This is the content-blitz pattern: a long-running production workflow expressed as a finite set of explicit states, with one transition per cycle. Instead of a generic checkin asking many specialists for status, the state machine advances a single work item through a controlled pipeline.
Here is the shape of the pipeline:
idle → step1_idea → step2_newsletter → step2_review → step2_merge
→ step3_blueprint → step3_review → step3_merge
→ step4_blog → step4_review → step4_merge
→ step5_social → step5_review → step5_schedule
→ metrics → idle (next idea)
The power of this design is that every cycle is deterministic. The system reads the current state, performs the action for that state, writes the next state on success, or stays put on failure. There is no ambiguity about what should happen next, and that makes both automation and debugging dramatically easier.
The Hourly Loop
- READ working.md → pipeline_state field
- EXECUTE the action for that state
- ADVANCE pipeline_state on success
- STAY in current state on failure (increment retry_count)
- EXIT: next hourly cycle continues
That “one state per cycle” rule is deceptively important. Teams often try to be clever and let a scheduler advance through as many states as possible in one run. That looks efficient until a mid-pipeline failure leaves you guessing which side effects already happened and which baton fields are trustworthy. One state per cycle gives you clean checkpoints and trivial restartability.
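The loop can be sketched in a few lines. The state names follow the chapter's full 15-state table, and `pipeline_state` and `retry_count` are the working-memory fields named above. Everything else is an assumption for illustration: the in-memory dict standing in for working.md, the `actions` mapping, and treating a quiet idle cycle as a non-failure.

```python
from typing import Callable

PIPELINE = [
    "idle", "step1_idea",
    "step2_newsletter", "step2_review", "step2_merge",
    "step3_blueprint", "step3_review", "step3_merge",
    "step4_blog", "step4_review", "step4_merge",
    "step5_social", "step5_review", "step5_schedule",
    "metrics",
]


def run_cycle(working: dict, actions: dict[str, Callable[[], bool]]) -> dict:
    """Run exactly one state's action, then advance or stay. Never more than one."""
    state = working.get("pipeline_state", "idle")
    try:
        succeeded = bool(actions[state]())
    except Exception:
        succeeded = False  # a crash or a missing action counts as a failed cycle
    if succeeded:
        working["pipeline_state"] = PIPELINE[(PIPELINE.index(state) + 1) % len(PIPELINE)]
        working["retry_count"] = 0
    elif state != "idle":
        # Staying idle because no campaign is active is not treated as a failure here.
        working["retry_count"] = working.get("retry_count", 0) + 1
    return working
```

Note that `run_cycle` returns after a single transition even when the next action could run immediately. That is the checkpoint discipline: each cron tick leaves working memory in a state you can restart from.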
| State | Action | On Success | On Failure |
|---|---|---|---|
| idle | Check campaign active, pick next idea from queue | → step1_idea | Stay idle |
| step1_idea | Research topic, write baton fields to working.md | → step2_newsletter | Retry next cycle |
| step2_newsletter | Dispatch blog-writer in newsletter mode, full 2000+ word deep-dive | → step2_review | Retry next cycle |
| step2_review | Dispatch code-review agent on the PR for quality gate | → step2_merge (on pass) | Remediate or alert |
| step2_merge | Merge PR via squash merge, record in working.md | → step3_blueprint | Alert human (conflict/CI) |
| metrics | Pull analytics, archive day to history, reset for next idea | → idle | Retry next cycle |
The table above shows the abbreviated version. In the production implementation, the full pipeline has 15 distinct states because each content format (newsletter, blueprint chapter, blog article, and social media) has its own create and review steps plus a final merge step, or a schedule step in the case of social posts. That granularity matters. A state machine with five broad states hides too much work inside each state, making it impossible to know exactly where a failure happened. A state machine with 15 narrow states gives you surgical restartability: if the blueprint review fails, you retry exactly the blueprint review, not the entire content creation pipeline from scratch.
| # | State | Action | On Success | On Failure |
|---|---|---|---|---|
| 1 | idle | Check campaign active, pick next idea | → step1_idea | Stay idle |
| 2 | step1_idea | Research topic, write baton to working.md | → step2_newsletter | Retry |
| 3 | step2_newsletter | Dispatch blog-writer (newsletter mode) | → step2_review | Retry |
| 4 | step2_review | Quality gate on newsletter PR | → step2_merge | Remediate |
| 5 | step2_merge | Squash-merge newsletter PR | → step3_blueprint | Alert human |
| 6 | step3_blueprint | Dispatch blueprint-manager for chapter | → step3_review | Retry |
| 7 | step3_review | Quality gate on blueprint PR | → step3_merge | Remediate |
| 8 | step3_merge | Squash-merge blueprint PR | → step4_blog | Alert human |
| 9 | step4_blog | Dispatch blog-writer (article mode) | → step4_review | Retry |
| 10 | step4_review | Quality gate on blog PR | → step4_merge | Remediate |
| 11 | step4_merge | Squash-merge blog PR | → step5_social | Alert human |
| 12 | step5_social | Dispatch content-creative for social posts | → step5_review | Retry |
| 13 | step5_review | Review social content quality | → step5_schedule | Retry |
| 14 | step5_schedule | Queue social posts across platforms | → metrics | Retry |
| 15 | metrics | Pull analytics, archive, reset for next | → idle | Retry |
Baton Handoff via Working Memory
The state machine does not need a separate broker, queue, or RPC layer to hand work from one stage to the next. The baton lives in working.md. Each state writes outputs as explicit fields — things like today_topic, newsletter_pr_url, blueprint_slug, or retry_count. The next cycle reads those fields and knows exactly what it is continuing.
This is one of the most useful production lessons in the entire platform: the filesystem can be your message bus when the workflow is sequential and explicit. You do not need to over-architect inter-process communication for every multi-step automation. If the baton is small, structured, and durable, a Markdown working file is enough.
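As a sketch of what "the baton lives in working.md" can look like, the helpers below read and update fields in a Markdown file's text. The chapter does not show the file's exact layout, so the `key: value` line format and both function names are assumptions.

```python
import re

# Assumed layout: one "key: value" pair per line, lowercase snake_case keys.
_FIELD = re.compile(r"^(?P<key>[a-z_][a-z0-9_]*):\s*(?P<value>.*)$")


def read_baton(markdown: str) -> dict[str, str]:
    """Extract every key: value line; headings and prose are ignored."""
    fields: dict[str, str] = {}
    for line in markdown.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            fields[match["key"]] = match["value"].strip()
    return fields


def write_baton(markdown: str, updates: dict[str, str]) -> str:
    """Rewrite existing fields in place and append any new ones at the end."""
    remaining = dict(updates)
    out = []
    for line in markdown.splitlines():
        match = _FIELD.match(line.strip())
        if match and match["key"] in remaining:
            out.append(f"{match['key']}: {remaining.pop(match['key'])}")
        else:
            out.append(line)
    out.extend(f"{key}: {value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"
```

For example, one cycle might call `write_baton(text, {"pipeline_state": "step2_newsletter", "today_topic": "agent memory"})` and the next cycle picks both fields up with `read_baton`.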
Retry with Escalation
Every state owns its own retry behavior. On failure, the system does not jump somewhere fancy; it stays on the current state and increments retry_count. After three failures, it escalates to the human rather than retrying forever. That protects the system from silent loops where an hourly cron job hammers the same broken action all day and calls it resilience.
The retry logic is deliberately simple. On the first failure, the system stays put and tries again on the next hourly cycle. Transient issues — a rate-limited API, a temporary CI runner shortage, a model timeout — usually resolve themselves within one or two cycles. On the second failure, the system stays put again but logs a warning. By the third consecutive failure on the same state, the pattern switches from automated retry to human escalation: the system sends a Telegram notification explaining which state is stuck, what the last error was, and how many cycles have been consumed. At that point, the pipeline pauses. It does not advance, and it does not keep retrying. It waits for the human to intervene.
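That escalation ladder fits in one small function. This is a sketch under assumptions: the `paused` and `last_error` fields and the returned action labels are hypothetical, and sending the actual Telegram message is left to the caller.

```python
MAX_FAILURES = 3  # third consecutive failure on the same state escalates


def on_failure(working: dict, error: str) -> str:
    """Record a failed cycle and return what to do next: retry, warn, escalate, or paused."""
    if working.get("paused"):
        return "paused"  # already escalated; wait for the human
    count = int(working.get("retry_count", 0)) + 1
    working["retry_count"] = count
    working["last_error"] = error
    if count >= MAX_FAILURES:
        working["paused"] = True
        return "escalate"
    return "warn" if count == 2 else "retry"
```

The caller notifies the human on "escalate" and does nothing on "paused", which is what keeps an hourly cron from hammering the same broken action all day.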
There is also a stuck detection mechanism that operates independently of retry count. If a pipeline item has been in the same state for more than four hours without any state transition, the system flags it as potentially stuck even if no explicit failure was recorded. This catches a different class of problem: states where the dispatched agent completes successfully but produces output the state machine cannot parse, or states where an external dependency silently hangs without returning an error. The four-hour threshold is calibrated to the hourly cron cycle — four missed opportunities to advance is enough evidence that something is wrong without being so aggressive that normal human-in-the-loop review delays trigger false alarms.
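Stuck detection is a timestamp comparison. The sketch assumes the pipeline records when it entered the current state; that timestamp and the function name are hypothetical, while the four-hour threshold is the one described above.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(hours=4)  # four hourly cycles without a transition


def is_stuck(state_entered_at: datetime, now: datetime) -> bool:
    """True when the item has sat in one state for more than four hours."""
    return now - state_entered_at > STUCK_AFTER
```

Because the check ignores `retry_count` entirely, it still fires when an agent "succeeds" with unparseable output or a dependency hangs without raising an error.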
Escalation is not a failure of automation. It is one of the features that makes automation trustworthy. A production pipeline should recover from transient issues but stop pretending when it needs human judgment: a stuck CI run, a merge conflict, a broken publishing credential, or a quality gate that keeps failing for substantive reasons.
Cohesion Over Cleverness
The content-blitz pattern also enforces a cohesion rule: every step for an idea executes on the exact same topic. Each state reads today_topic from working memory and treats it as authoritative. The newsletter, blueprint, blog, and social outputs are all variations of the same campaign theme. No drift. No “the blog pivoted halfway through because another topic looked more interesting.”
That discipline is what turns a content assembly line into a real campaign system. The same rule generalizes beyond content. Any state machine that transforms a work item over time needs a canonical baton field that anchors the whole flow. Pick it once, write it early, and make every later state subordinate to it.
Orchestration Pattern — Team-Based Goal Tracking
Teams are the most sophisticated orchestration pattern because some outcomes are too big for a periodic checkin and too long-lived for a simple task agent. They require sustained, multi-agent coordination toward a specific result. That is where the team agent pattern comes in.
A team agent owns a goal with a target date, a roster of participating agents, phase-based milestones, its own schedule, and a lifecycle. It is not a general-purpose dispatcher. It is a bounded operating system for one life objective. In this platform, the home-buying effort is the canonical example. That goal touches credit, savings, listings, logistics, school zones, mortgage prep, paperwork, and move planning. No single domain agent owns the whole outcome, but a team agent can.
Team Agent Structure
- Goal: a concrete outcome with a target date, such as buying a house within 12 to 18 months
- Roster: dedicated and shared agents with clearly defined roles
- Milestones: phase-based progress markers with exit criteria
- Cron: its own schedule, separate from generic orchestrators
- Lifecycle: created → active → completed
.github/agents/{team-name}.agent.md # Agent definition
data/agents/{team-name}/core.md # Identity, goal, rules
data/agents/{team-name}/working.md # Current state
data/agents/{team-name}/team-manifest.md # Roster & phases
data/agents/{team-name}/progress.md # Milestones
data/agents/{team-name}/long-term.md # Patterns
data/agents/{team-name}/events.log # Event stream
This directory shape matters because it gives the team agent everything it needs to operate without leaking team logic into unrelated agents. The manifest answers “who is on the team?” Progress answers “how far along are we?” Working memory answers “what is the current focus?” Long-term memory captures strategic lessons, and the event log preserves the full history of decisions.
Dedicated vs Shared Agents
Team rosters include two classes of sub-agents. Dedicated agents exist specifically for the team. A credit-coach for a home-buying team is not a general platform utility; it was created because that goal requires an intensive, ongoing credit-improvement lane. Those agents may be retired when the goal ends.
Shared agents are existing domain specialists that the team borrows with extra context. Finance-manager may still manage the family’s broader budget, but when the realtor-team dispatches it, the prompt adds team context such as down payment targets, pre-approval milestones, or an upcoming offer window. The team leverages the shared expert without claiming permanent ownership of it.
Phase-Based Activation
The reason team agents need manifests and progress tracking is that goals evolve through phases. A home-buying team does not need listing analysis at full intensity before the financial foundation exists. Likewise, a move-planner does not need to wake up during early credit repair. Activating everybody at once is wasteful and noisy.
A representative phase progression looks like this:
- Phase 1 — Preparation: credit-coach + finance-manager active. Exit when credit is at least 720, down payment savings are on track, and pre-approval is ready.
- Phase 2 — Search: listing-tracker, school-zone-analyzer, mortgage-advisor activate. Exit when 3 to 5 properties are shortlisted.
- Phase 3 — Offer & Close: mortgage-advisor, move-planner, home-manager activate. Exit when the keys are in hand.
The subtle but powerful rule is that phases can overlap. Phase 2 activity can begin once the credit score reaches 680 and is trending upward, even if Phase 1 has not formally exited yet. This lets teams front-load preparation without waiting for a binary state flip. Real life is fuzzy; the orchestration model should be structured without becoming brittle.
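One way to express overlapping phases is to evaluate each phase's activation rule independently instead of tracking a single "current phase" value. The sketch below covers only Phases 1 and 2 and assumes a hypothetical `Snapshot` of goal metrics; the 720 and 680 thresholds and the agent names are taken from the phase list above, everything else is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of the goal metrics read from progress.md."""
    credit_score: int
    score_trending_up: bool
    savings_on_track: bool
    preapproval_ready: bool


PHASE_ROSTER = {
    1: ["credit-coach", "finance-manager"],
    2: ["listing-tracker", "school-zone-analyzer", "mortgage-advisor"],
}


def phase1_exited(s: Snapshot) -> bool:
    # Exit criteria for Preparation, as listed in the phase progression.
    return s.credit_score >= 720 and s.savings_on_track and s.preapproval_ready


def phase2_may_start(s: Snapshot) -> bool:
    # Overlap rule: Search may begin early at 680 with an upward trend.
    return phase1_exited(s) or (s.credit_score >= 680 and s.score_trending_up)


def active_agents(s: Snapshot) -> list[str]:
    """Return the sorted roster of agents that should wake up this cycle."""
    active = set()
    if not phase1_exited(s):
        active.update(PHASE_ROSTER[1])
    if phase2_may_start(s):
        active.update(PHASE_ROSTER[2])
    return sorted(active)
```

Because the two checks are independent, a score of 690 with an upward trend activates both rosters at once, which is the overlap the text describes.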
The Weekly Team Standup
The team standup protocol is simple: a weekly cron dispatches the team agent, the team agent dispatches its active roster agents, collects their status, checks milestone progress, and reports the current trajectory to the human. This creates one owner for the goal, one consistent summary, and one place where progress is interpreted against explicit exit criteria.
That ownership model is what separates a true team agent from a generic orchestrator. Generic orchestrators ask, “what changed across the platform?” Team agents ask, “are we closer to the goal, what is blocking progress, and which phase should be active now?” If a goal matters enough to measure, phase, and eventually finish, it deserves a team agent.
The Cron Architecture
Large autonomous systems are only useful if they wake themselves up at the right times. In this platform, 57 scheduled jobs power the automation layer. The scheduling model is intentionally boring, because boring is what you want at the foundation of an autonomous system.
- Configuration: all jobs live in one cron.json at the repo root
- Engine: a cron-scheduler extension reads the config, parses standard 5-field cron expressions, and checks every 60 seconds
- Firing: when a job matches, session.send() delivers the dispatch message to the main session
- Dispatch: the main session launches a new agent via the task tool
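The engine step in the list above (parse a 5-field expression, check it against the current minute) can be sketched in a few lines. This is an illustrative Python version, not the actual cron-scheduler extension: `expand_field`, `cron_matches`, and `due_jobs` are hypothetical names, and the matcher is deliberately simplified. It requires all five fields to match, does not treat 7 as Sunday, and assumes `now` is already expressed in the configured timezone.

```python
from datetime import datetime

# (low, high) bounds for minute, hour, day-of-month, month, day-of-week (0 = Sunday)
FIELD_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, low: int, high: int) -> set[int]:
    """Expand one cron field ('*', '5', '1-5', '7,9,11', '*/15') into a set of ints."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step_n))
    return values


def cron_matches(expr: str, now: datetime) -> bool:
    """True when every field of a 5-field expression matches `now`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expr!r}")
    # Python's weekday() uses Monday = 0; cron uses Sunday = 0.
    current = [now.minute, now.hour, now.day, now.month, (now.weekday() + 1) % 7]
    return all(
        value in expand_field(field, low, high)
        for field, (low, high), value in zip(fields, FIELD_BOUNDS, current)
    )


def due_jobs(config: dict, now: datetime) -> list[str]:
    """Return the ids of enabled jobs whose schedule matches this minute."""
    return [
        job["id"]
        for job in config["jobs"]
        if job.get("enabled", True) and cron_matches(job["schedule"], now)
    ]
```

A scheduler loop would call `due_jobs` once per 60-second tick and hand each returned id to the firing step. Nothing in the matcher needs to know what an agent is, which keeps the engine as simple as the text recommends.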
The central configuration file matters more than it seems. It gives you one inventory of automation, one timezone definition, one place to disable or adjust jobs, and one surface for audits. When jobs are scattered across shell scripts, external schedulers, and agent prompts, the system becomes impossible to reason about. A single config file makes the automation layer visible.
{
  "timezone": "America/Chicago",
  "jobs": [
    {
      "id": "morning-briefing",
      "schedule": "0 6 * * 1-5",
      "enabled": true,
      "agent": "daily-briefing"
    },
    {
      "id": "heartbeat",
      "schedule": "0 7,9,11,13,15,17,19,21 * * *",
      "enabled": true,
      "agent": "checkin"
    },
    {
      "id": "content-trend-scan",
      "schedule": "0 6 * * 1-5",
      "enabled": true,
      "agent": "content-manager",
      "prompt": "Run your morning trend scan..."
    }
  ]
}
The Absolute Rule: Always Launch Fresh Agents
This is the non-negotiable rule of cron architecture: always launch fresh agents. Never inject scheduled work into an already-running agent. Never treat cron as a stream of follow-up messages to the same long-lived worker. Every cycle gets its own clean context.
The reason is context pollution. Imagine a task-coach that runs every 20 minutes. If each new cycle is sent via write_agent into the same lingering process, that process accumulates stale instructions from prior runs: “stay silent,” “quiet hours,” “do not nudge yet,” “Paula is resting,” “hold until the next check.” Those instructions were valid in their original moments. They become corruption when silently carried into the next cycle. Fresh launches eliminate that bleed-through.
Even if a previous instance of the same agent is still running, the scheduler should launch a new one and let the older one finish naturally. Context isolation is more important than avoiding temporary overlap. Parallel instances are easier to reason about than a mega-context agent that grows slower, noisier, and more confused after every injection.
The Anti-Pattern
The tempting anti-pattern is to funnel every scheduled trigger through write_agent because the agent is “already running.” It feels efficient. It is not. You end up with one swollen process acting as a dumping ground for unrelated cycles, carrying stale assumptions, old constraints, and residual conversational state across hours or days. Performance degrades, behavior drifts, and debugging becomes miserable because the current output depends on an invisible pile of historical injections.
Here is what context pollution looks like in practice. A task-coach agent runs at 8:00 AM. During that cycle, the human responds “I am in a meeting until 10, hold all nudges.” The agent acknowledges and goes idle. At 8:20 AM, the cron fires again. If the scheduler uses write_agent to inject the new cycle into the same process, the agent’s context now contains both the original 8:00 AM instructions and the human’s “hold all nudges” override. The 8:20 AM cycle reads that override and stays silent — correct for the 8:00 AM context, wrong for the 8:20 AM cycle when the constraint may no longer apply. By 9:00 AM, the agent has accumulated three cycles of stale conversation, each carrying forward constraints that were valid for exactly one moment. The agent becomes increasingly conservative, increasingly confused, and increasingly wrong — all while appearing to work fine because it is not throwing errors.
The session.send() delivery mechanism is what makes clean dispatch possible. When the cron scheduler determines that a job should fire, it does not create a new terminal session or spawn a subprocess directly. It calls session.send() to deliver a dispatch message into the main Copilot CLI session. That session then uses the task tool to launch a fresh agent with a clean context window. The two-step delivery — scheduler to main session, main session to fresh agent — ensures that the cron engine stays simple (just a timer and a message sender) while the dispatch logic stays in the conversational layer where it has access to the full task tool and agent registry. This separation also means the cron engine itself never needs to understand agent definitions, memory tiers, or dispatch prompts. It just delivers a trigger. The intelligence lives in the session that receives it.
Cron is not a chat thread. It is a clean dispatch mechanism. Treat each fire as a new invocation with clean inputs, and the system stays deterministic. Break that rule, and you slowly build a haunted house of context nobody can fully inspect.
The Agent Mesh — Cross-Session Communication
Memory handles persistence for each agent across sessions. Skills handle capability sharing. But autonomous systems eventually spill beyond one repository and one terminal window. You end up with agents running in different Copilot CLI sessions, on different schedules, sometimes in different workspaces entirely. That is the problem the agent mesh solves.
The mesh uses a shared SQLite database as an asynchronous message bus. Every session can read from it and write to it through a small set of tools. No persistent sockets. No websocket broker. No requirement that all sessions be online at exactly the same instant. The database is the shared rendezvous point.
| Tool | Purpose |
|---|---|
| get_agents(status?) | Discover who’s online (active/stopped/all) |
| send_message(workspace?, content, priority?) | Send to another session by workspace name |
| reply_to_message(message_id, content) | Reply to a received message (threaded) |
| get_message(message_id) | Retrieve a message and its replies |
Workspace Naming
Each repository registers a workspace name so other sessions can address it without needing a transient session ID. The pattern looks like this:
rocha-family → Family life management (53 agents)
msix-home → Microsoft work assistant (MSX, Power BI, WorkIQ)
video-pipeline → Video processing and publishing
That naming layer is what turns the mesh from a debugging toy into usable infrastructure. Other agents do not need to know which exact session is currently active for the work assistant. They just need to know the workspace name and let the mesh route to the current live instance.
Workspace Discovery
For the mesh to be useful, workspaces need to find each other without hard-coded configuration. The discovery pattern uses a known-workspaces registry — a simple configuration that maps workspace names to descriptions and expected capabilities. When a new workspace comes online, it registers itself in the shared database with its name, session ID, and status. Other workspaces can then call get_agents() to see who is currently active, who was recently active, and who is offline.
This matters because workspace availability is inherently dynamic. The video-pipeline workspace may only be active during editing sessions. The work assistant workspace runs during business hours. The family workspace runs around the clock. An agent that needs to delegate work must be able to answer two questions: “Does this workspace exist?” and “Is it currently online?” If the workspace exists but is offline, the message is still delivered to the database — it will be picked up when the workspace next comes online. If the workspace does not exist, the sending agent should handle the delegation failure gracefully rather than silently dropping the request.
The known-workspaces pattern also prevents a common bootstrapping problem. Without it, every agent that wants to use the mesh must somehow know the exact workspace names of its potential collaborators. With the registry, a new agent can discover available workspaces programmatically and route messages based on capability descriptions rather than memorized identifiers. That makes the mesh self-documenting and extensible without requiring prompt rewrites when new workspaces are added.
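A registry like this can be sketched as a single SQLite table. The schema, the `register_workspace` helper, and the outcome labels below are assumptions for illustration; only `get_agents` and its active/stopped/all filter come from the tool list. The `delivery_outcome` helper answers the two questions from the text: does the workspace exist, and is it online?

```python
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the shared registry. The schema is a hypothetical sketch."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS workspaces ("
        "name TEXT PRIMARY KEY, description TEXT, session_id TEXT, status TEXT NOT NULL)"
    )
    return db


def register_workspace(db, name, description, session_id, status="active"):
    """Called when a workspace comes online or changes status."""
    db.execute(
        "INSERT OR REPLACE INTO workspaces VALUES (?, ?, ?, ?)",
        (name, description, session_id, status),
    )
    db.commit()


def get_agents(db, status="all"):
    """List workspaces, optionally filtered by 'active' or 'stopped'."""
    if status == "all":
        rows = db.execute("SELECT name, status FROM workspaces ORDER BY name")
    else:
        rows = db.execute(
            "SELECT name, status FROM workspaces WHERE status = ? ORDER BY name",
            (status,),
        )
    return [{"workspace": name, "status": state} for name, state in rows]


def delivery_outcome(db, workspace):
    """Does the workspace exist, and is it online right now?"""
    row = db.execute(
        "SELECT status FROM workspaces WHERE name = ?", (workspace,)
    ).fetchone()
    if row is None:
        return "unknown-workspace"       # sender must handle this explicitly
    return "deliver-now" if row[0] == "active" else "queue-until-online"
```

The point of returning an explicit "unknown-workspace" outcome is that a missing target becomes a visible branch in the sender's logic instead of a silently dropped request.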
Asynchronous Message Flow
The full lifecycle of a mesh message follows a predictable pattern: send, continue, check, act. The sending agent composes a message with a target workspace, content, and priority level, then calls send_message(). That call returns immediately with a message ID. The sender records that message ID in its own working memory — this is the receipt that lets it check for replies later — and continues its current task without waiting.
On the receiving side, the target workspace picks up the message during its next active cycle. It may be a cron-triggered heartbeat, a scheduled checkin, or a manual session — the mesh does not care which. The recipient processes the request using its own local tools, then calls reply_to_message() with the original message ID and the result. This creates a threaded reply chain that both sides can inspect.
The sender checks for replies by calling get_message() with the stored message ID. If a reply exists, it processes the result and updates its own state. If no reply exists yet, it moves on — the check is non-blocking and costs nothing beyond a single database read. This pattern means that neither workspace is ever blocked waiting for the other. Both continue their independent work, and coordination happens through the shared persistence layer whenever both happen to be active.
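The send, continue, check, act cycle maps naturally onto a single table in which replies point at their parent message. The following is a minimal sketch under stated assumptions: the schema, the `open_bus` helper, and the `msg_` ID format are hypothetical, and the three functions only mirror the names and arguments of the mesh tools rather than reproducing their real implementation.

```python
import sqlite3
import uuid


def open_bus(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared message table. The schema is an illustrative assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id TEXT PRIMARY KEY, workspace TEXT NOT NULL, content TEXT NOT NULL, "
        "priority TEXT NOT NULL DEFAULT 'normal', parent_id TEXT)"
    )
    return db


def send_message(db, workspace, content, priority="normal"):
    """Write the message and return its ID immediately; never wait for a reply."""
    message_id = "msg_" + uuid.uuid4().hex[:8]
    db.execute(
        "INSERT INTO messages (id, workspace, content, priority) VALUES (?, ?, ?, ?)",
        (message_id, workspace, content, priority),
    )
    db.commit()
    return message_id


def reply_to_message(db, message_id, content):
    """Attach a threaded reply to an existing message."""
    parent = db.execute(
        "SELECT workspace FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    if parent is None:
        raise KeyError(message_id)
    reply_id = "msg_" + uuid.uuid4().hex[:8]
    db.execute(
        "INSERT INTO messages (id, workspace, content, parent_id) VALUES (?, ?, ?, ?)",
        (reply_id, parent[0], content, message_id),
    )
    db.commit()
    return reply_id


def get_message(db, message_id):
    """Non-blocking check: one read, returns the message and any replies so far."""
    row = db.execute(
        "SELECT content FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
    if row is None:
        return None
    replies = [
        r[0]
        for r in db.execute(
            "SELECT content FROM messages WHERE parent_id = ? ORDER BY rowid",
            (message_id,),
        )
    ]
    return {"id": message_id, "content": row[0], "replies": replies}
```

An empty `replies` list is a normal result, not an error. The sender simply keeps the ID in working memory and asks again on a later cycle.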
Asynchronous by Design
The communication model is intentionally asynchronous. You send a message, continue your own work, and check for replies later. No blocking. No busy polling. No brittle assumption that the recipient will answer in the next five seconds. That design matches the way autonomous agents actually operate: they wake on their own schedules, they may need different credentials or tools, and some tasks are naturally deferred.
This is why the database-as-bus approach works so well. Durability matters more than low-latency streaming for most agent-to-agent coordination. A cross-workspace message should survive if the recipient session restarts, sleeps, or has not yet reached its next cycle. Persistence beats real-time theater.
Example Flow
A concrete example makes the pattern obvious. The work-life-sync agent in the family workspace notices a personal doctor’s appointment on Thursday from 2 PM to 4 PM. The family workspace cannot directly write to the Microsoft work calendar with the needed tooling. So it sends a mesh message to msix-home: “Block 2-4 PM Thursday as OOF on Outlook.”
The work assistant receives the message during its next cycle, performs the Outlook action with its own tools, then replies with confirmation. The family-side agent does not sit and wait. It records that it delegated the work, keeps moving, and checks the threaded reply later. That is what cross-session autonomy looks like in practice.
Message Flow Step by Step
Here is the exact tool-call sequence for a typical cross-session delegation, showing both the sender and receiver perspectives:
# SENDER (rocha-family workspace — work-life-sync agent)
# Step 1: Detect the need for cross-session work
# (Calendar scan finds personal appointment during work hours)
# Step 2: Send the mesh message
send_message(
  workspace: "msix-home",
  content: "Block Thursday 2-4 PM CT as OOF on Outlook. Reason: personal medical appointment.",
  priority: "normal"
)
# Returns: { message_id: "msg_a3f8k2" }
# Step 3: Record the message ID in working memory
# (Append to data/agents/work-life-sync/working.md)
# pending_mesh_replies:
#   - id: msg_a3f8k2
#     sent: 2026-07-20T14:30:00
#     target: msix-home
#     action: block-calendar
# Step 4: Continue other work; do NOT wait
# RECEIVER (msix-home workspace, next active cycle)
# Step 1: Check for incoming messages during heartbeat
get_messages(status: "unread")
# Returns: [{ id: "msg_a3f8k2", from: "rocha-family", content: "Block Thursday..." }]
# Step 2: Execute the requested action with local tools
outlook_create_event(
  subject: "Personal - OOF",
  start: "2026-07-23T14:00:00",
  end: "2026-07-23T16:00:00",
  showAs: "oof"
)
# Step 3: Reply with confirmation
reply_to_message(
  message_id: "msg_a3f8k2",
  content: "Done. Thursday 2-4 PM blocked as OOF on Outlook. Event ID: evt_x9k2m."
)
# SENDER (rocha-family workspace, later cycle)
# Step 5: Check for reply during next heartbeat
get_message(message_id: "msg_a3f8k2")
# Returns: { replies: [{ content: "Done. Thursday 2-4 PM blocked..." }] }
# Step 6: Update working memory — remove from pending
# pending_mesh_replies: [] (cleared)
This three-phase pattern (send, execute remotely, confirm) is the building block for all cross-session workflows. The sender never blocks. The receiver processes on its own schedule. The confirmation closes the loop. If no reply arrives within 24 hours, the sender can escalate or retry, but it never stalls its own pipeline waiting for external confirmation.
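The 24-hour fallback can be handled with a small triage pass over the pending list each time the sender wakes up. This sketch assumes the `pending_mesh_replies` entries have already been parsed into dictionaries with the `id` and `sent` fields shown above; `triage_pending` and its return shape are hypothetical.

```python
from datetime import datetime, timedelta

REPLY_TIMEOUT = timedelta(hours=24)  # after this, escalate or retry


def triage_pending(pending, replied_ids, now):
    """Split pending mesh entries into (resolved, overdue, waiting).

    pending     -- list of dicts with 'id' and ISO-8601 'sent' fields
    replied_ids -- set of message IDs for which get_message() returned a reply
    """
    resolved, overdue, waiting = [], [], []
    for entry in pending:
        if entry["id"] in replied_ids:
            resolved.append(entry)        # remove from working memory
        elif now - datetime.fromisoformat(entry["sent"]) > REPLY_TIMEOUT:
            overdue.append(entry)         # escalate or retry
        else:
            waiting.append(entry)         # check again next cycle
    return resolved, overdue, waiting
```

Only the `waiting` list is written back to working memory. Resolved entries are cleared, and overdue ones are handed to whatever escalation path the sender uses.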
Cross-Agent Delegation Rules
- Use local tools first. If the current workspace can do the work directly, do not introduce mesh overhead.
- Delegate via mesh when the task requires tools only available in another workspace.
- Do not block on replies. Send, continue, check later.
- Use explicit priority levels: urgent > high > normal > low.
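On the receiving side, the priority rule can be applied with a stable sort so that messages of equal priority keep their arrival order. The rank table and the `order_inbox` helper are illustrative assumptions; only the four levels and their ordering come from the rule above.

```python
# Lower rank is handled first: urgent > high > normal > low.
PRIORITY_RANK = {"urgent": 0, "high": 1, "normal": 2, "low": 3}


def order_inbox(messages):
    """Return messages sorted by priority; ties keep their arrival order."""
    return sorted(
        messages,
        key=lambda m: PRIORITY_RANK[m.get("priority", "normal")],
    )
```

Python's `sorted` is stable, so two "normal" messages are processed in the order they were received.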
The mesh is not a substitute for good boundaries. It is the bridge between good boundaries. Use it when a task genuinely crosses workspace capabilities, and you get a system that can coordinate across repositories without collapsing everything into one giant terminal session.
The Decision Framework
Once you understand the patterns, the practical question becomes: how do you decide which one to use when a new agent idea appears? The cleanest answer is a four-question taxonomy.
- Question 1: Does it own a domain? If yes, it is usually a domain agent. Domain ownership implies recurring responsibility, memory, and permanence.
- Question 2: Does it orchestrate others? If yes, it is either an orchestrator or a team agent.
- Question 3: Does it run a procedure and exit? If yes, it is a task agent.
- Question 4: Does it need persistence? Use that as a validation check. If the design says stateless but the workflow clearly depends on remembering prior state, your pattern choice is wrong.
Described as a flowchart, the decision path looks like this:
New agent needed →
  Does it own a permanent domain?
    YES → Domain Agent (4-tier memory, permanent)
    NO → Does it orchestrate other agents?
      YES → Does it have a goal with an end date?
        YES → Team Agent (manifest, phases, lifecycle)
        NO → Orchestrator (stateless, discover+dispatch)
      NO → Task Agent (stateless, procedure, permanent)
This framework does more than classify. It protects the system from architecture drift. Without it, every new need becomes “just add another agent,” and soon you have stateful procedures that should have been stateless, domain workers that accidentally orchestrate, and goal-driven teams that have no clear finish line. The taxonomy forces a sharper design conversation before implementation starts.
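The decision path is small enough to encode directly, which makes it easy to run as a design-review check. The function names and string labels below are hypothetical. The branching follows the flowchart, and `persistence_matches` is one possible reading of Question 4 as a validation step.

```python
# Patterns that carry persistent memory, per the taxonomy in this chapter.
PERSISTENT_PATTERNS = {"domain", "team"}


def classify_agent(owns_domain: bool, orchestrates: bool, has_end_date: bool) -> str:
    """Walk the flowchart and return 'domain', 'team', 'orchestrator', or 'task'."""
    if owns_domain:
        return "domain"
    if orchestrates:
        return "team" if has_end_date else "orchestrator"
    return "task"


def persistence_matches(pattern: str, needs_persistence: bool) -> bool:
    """Question 4 as a check: False means the pattern choice deserves a second look."""
    return (pattern in PERSISTENT_PATTERNS) == needs_persistence
```

For example, a proposed "task" agent whose workflow depends on remembering prior runs fails the check, which is the signal to revisit the pattern before any implementation starts.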
There is also a deeper design lesson here: persistence is not a badge of importance. Teams often assume that serious agents must have memory and lightweight agents can stay stateless. In reality, memory should follow operational need. Some of the most important workflows in the platform are stateless because they are deterministic procedures on fresh data. Some of the quietest specialists need memory because continuity is the whole point.
That is the production playbook. Choose the right pattern. Give it the right memory model. Define the right lifecycle. Use fresh dispatch for cron. Use structured fan-out for platform checkins. Use state machines for long pipelines. Use team agents for bounded goals. Use the mesh when the work crosses sessions. When these pieces lock together, multi-agent systems stop feeling chaotic and start behaving like real infrastructure.
Common Mistakes and Anti-Patterns
Every pattern in this chapter exists because a simpler version was tried first and failed. These are the most common mistakes teams make when building multi-agent systems, along with the structural fixes that resolve them.
Anti-Pattern 1: The God Orchestrator
The most common first attempt at multi-agent coordination is to build one central orchestrator that knows everything, dispatches everything, and decides everything. It starts clean. Then it accumulates special cases: “if it is Monday, also check the meal plan.” “If the finance domain reported a bill, also notify the calendar agent.” “If the content pipeline has a draft, also ping the editor.” Within weeks, the orchestrator’s prompt is 3,000 words of conditional logic, it takes 60 seconds just to parse its own instructions, and every change risks breaking three unrelated workflows.
The fix is pattern separation. Generic orchestrators handle generic coordination (discover, dispatch, compile). Domain-specific logic stays in domain agents. Goal-specific coordination lives in team agents. No single agent should carry the cognitive load of the entire platform. If your orchestrator prompt keeps growing, you are probably mixing coordination with domain logic.
Anti-Pattern 2: Memory Everywhere
The intuition that “more memory is safer” leads teams to give every agent 4-tier memory. The result is 53 sets of working files that need maintenance, pruning, and consistency checks. Agents carry forward stale assumptions from three weeks ago. Context windows fill with irrelevant history. Performance degrades because every invocation loads kilobytes of memory that the agent never references.
The fix is the taxonomy. Task agents and orchestrators are stateless by design. Only domain agents and team agents get persistent memory. That rule alone eliminates over half of all memory maintenance burden. If an agent does not need to remember anything between runs, do not give it memory. Absence of state is a feature, not a limitation.
Anti-Pattern 3: Cron as Chat
Using write_agent for scheduled dispatches feels efficient because the agent is already running. In reality, it produces haunted agents. After a dozen injected cycles, the agent’s context window contains instructions from hours ago that conflict with current state. “Do not nudge until 10 AM” persists past 10 AM. “Paula is resting” persists after she is awake. The agent becomes conservative, confused, and unreliable because it cannot distinguish current instructions from historical residue.
The fix is absolute: every cron cycle launches a fresh agent via the task tool. No exceptions. Fresh context means the agent operates on current-state inputs with zero bleed-through from prior cycles. If the previous instance is still running, launch the new one anyway. Parallel instances with clean contexts are infinitely easier to debug than a single instance with accumulated pollution.
Anti-Pattern 4: Unfiltered Fan-Out
Dispatching every discovered agent during a checkin creates cascading problems: orchestrators dispatch themselves (infinite loops), team-dedicated agents receive conflicting instructions from both their team and the generic orchestrator (double ownership), utility agents produce meaningless status reports (noise), and the human receives a 40-item digest they immediately learn to ignore (alert fatigue).
The fix is the filtering pipeline described in the Parallel Dispatch pattern. Every exclusion category exists because a specific production failure demanded it. Self-dispatch exclusion prevents loops. Team exclusion prevents double ownership. Utility exclusion prevents noise. Test exclusion prevents contamination. If you cannot explain why an agent is in your dispatch set, it probably should not be.
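The four exclusion categories can be sketched as a single pass that records a reason for every agent it drops. `AgentInfo`, its `kind` and `team` fields, and the reason labels are assumptions made for this example; the real discovery metadata may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentInfo:
    """Hypothetical discovery record for one agent definition."""
    name: str
    kind: str                   # e.g. "domain", "orchestrator", "utility", "test"
    team: Optional[str] = None  # set only for team-dedicated agents


def build_dispatch_set(agents, self_name):
    """Return (selected names, {excluded name: reason}) for one checkin cycle."""
    selected, excluded = [], {}
    for agent in agents:
        if agent.name == self_name:
            excluded[agent.name] = "self-dispatch"  # prevents loops
        elif agent.team is not None:
            excluded[agent.name] = "team-owned"     # prevents double ownership
        elif agent.kind == "utility":
            excluded[agent.name] = "utility"        # prevents noise
        elif agent.kind == "test":
            excluded[agent.name] = "test"           # prevents contamination
        else:
            selected.append(agent.name)
    return selected, excluded
```

Returning the exclusion reasons alongside the dispatch set gives the operator a direct answer to "why is this agent in, or not in, the checkin?" without reading the orchestrator prompt.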
Anti-Pattern 5: Implicit State Machines
Multi-step workflows often start as linear scripts: do step 1, then step 2, then step 3. When step 2 fails, the developer adds a retry. When step 3 needs human approval, they add a wait. When the process needs to resume after a restart, they add checkpoint logic. Eventually, the linear script has become an implicit state machine with undocumented states, unclear transitions, and no way to inspect current progress without reading the code.
The fix is to make the state machine explicit from the start. Define every state. Define every transition. Store the current state in a named field. Make progression visible and inspectable. An explicit 15-state pipeline is dramatically easier to operate than a “simple” script that has grown seven levels of nested retry logic.
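In its simplest form, an explicit state machine is a transition table plus one function that refuses anything the table does not allow. The five states below are hypothetical placeholders, not the platform's actual pipeline. The point is the shape: named states, declared transitions, and the current state stored in one inspectable field.

```python
# Every state and every legal transition is declared in one place.
TRANSITIONS = {
    "idea": {"drafted"},
    "drafted": {"in_review"},
    "in_review": {"approved", "drafted"},  # review can send work back
    "approved": {"published"},
    "published": set(),                    # terminal state
}


def allowed_next(item: dict) -> set:
    """Inspect where an item may go without reading any workflow code."""
    return TRANSITIONS[item["state"]]


def advance(item: dict, new_state: str) -> dict:
    """Return a copy of the item in its new state, or raise on an illegal move."""
    current = item["state"]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    history = item.get("history", []) + [current]
    return {**item, "state": new_state, "history": history}
```

Because `advance` returns a new dictionary rather than mutating in place, the caller decides when to persist the change, and a failed transition leaves the stored state untouched.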
Anti-Pattern 6: Synchronous Cross-Session Communication
Teams building multi-workspace systems often try to make communication synchronous: send a message and wait for a reply before continuing. This creates brittle dependencies. If the target workspace is offline, the sender blocks indefinitely. If the target is slow, the sender wastes compute time polling. If both workspaces try to call each other simultaneously, you get distributed deadlocks.
The fix is asynchronous-by-default communication. Send the message, record the message ID, continue your own work, and check for replies later. The mesh is a mailbox, not a phone call. Design every cross-session interaction to tolerate arbitrary reply latency, including “never replied” as a terminal state that the sender handles gracefully.
Getting Started: From Single-Agent to Multi-Agent
If you are reading this chapter and thinking “this is overwhelming — I have one agent and it barely works,” here is the practical path from a single agent to a coordinated multi-agent platform. You do not need to build all four patterns on day one. You need to grow into them as your system’s complexity demands them.
Step 1: Build Your First Domain Agent Right
Start with one agent that owns one domain completely. Give it 4-tier memory from the beginning — even if it feels like overkill for a single agent. The reason is simple: memory architecture is much harder to retrofit than to include from the start. A well-structured domain agent with core identity, working state, long-term patterns, and an event log will teach you how persistence works in practice before you need to scale it.
Pick the domain that matters most to you. If it is personal finance, build a finance agent. If it is content creation, build a content agent. Give it a clear core.md that defines what it owns and what it does not own. Give it a working.md that tracks current state. Let it run for two weeks. Watch how memory accumulates, when it becomes stale, and what pruning patterns emerge. That hands-on experience is more valuable than any design document.
# Your first domain agent structure:
.github/agents/my-first-agent.agent.md # Definition
data/agents/my-first-agent/core.md # Tier 1: Identity
data/agents/my-first-agent/working.md # Tier 2: Current state
data/agents/my-first-agent/long-term.md # Tier 3: Patterns (empty initially)
data/agents/my-first-agent/events.log # Tier 4: History (append-only)
Step 2: Add a Task Agent When You Need Automation
Once your domain agent is working, you will notice recurring procedures that do not need memory. A morning summary. A weekly report. A data sync. These become task agents. Keep them stateless. Let them read from the domain agent’s memory files as data sources, but never let them write to those files. That read-only relationship is the cleanest boundary between domain ownership and procedural execution.
At this stage, you have two agent types cooperating: a domain agent that persists and a task agent that executes. The task agent can read the domain agent’s working memory to build context-aware reports without needing its own persistence. This division will feel natural quickly, and it establishes the muscle memory for keeping orchestration separate from domain work.
Step 3: Add Orchestration When Agents Need Coordination
The moment you have three or more domain agents, you will feel the pull of coordination. Who checks on whom? How does the human get one coherent update instead of three separate messages? That is when you add your first orchestrator. It discovers agents, dispatches them with structured prompts, collects outputs, and compiles one notification. Start simple: a checkin that runs twice a day and sends you a digest. No filtering needed at three agents. Add filtering rules as you grow past ten.
The progression from there is organic. State machines appear when you have multi-step workflows. Team agents appear when you have bounded goals that span months. The mesh appears when your agents outgrow a single repository. Each layer earns its existence by solving a concrete coordination problem you are already experiencing — not by anticipating problems you might have someday.
The key insight is that orchestration complexity should grow with system complexity. A three-agent platform does not need a 15-state pipeline or a cross-session mesh. It needs one simple checkin. Let the coordination layer evolve alongside the agent layer, and you will build exactly the infrastructure your system actually requires — no more, no less.
This chapter pairs with Newsletter Issue #6: 53 Agents, Zero Chaos. Read the newsletter for the conceptual overview; use this chapter when you’re ready to build the orchestration layer.
For the complete operating model around agents, CI/CD, and governance, see The Agentic Development Blueprint.
If you want help implementing multi-agent orchestration for your team, check out my consulting page. And if you want the weekly architecture breakdowns in your inbox, subscribe at /newsletter#subscribe.
AI Agent Governance - The 7-Layer Security Stack
The control model that keeps tool-using agents useful, fast, and safe once they graduate from chatbots to real operators.
Why this chapter belongs in a memory blueprint
Memory is what makes an agent persistent. Governance is what makes that persistence trustworthy. The moment an agent can do more than answer questions - when it can write code, schedule meetings, publish content, send email, touch money, or act across multiple systems - you are no longer designing a clever prompt. You are designing an operating model.
That distinction is the governance gap. Most AI safety conversations still orbit chatbot problems: refusal behavior, toxic output filtering, jailbreak resistance, prompt injection hardening, and model alignment. Those are real concerns, but they are only the first mile. An operational agent has a different blast radius. The dangerous failure mode is not merely “the model said something wrong.” It is “the model did something wrong with a real permission.”
If your agent can open a PR, push to a deployment pipeline, message a client, create a calendar event, or queue a payment, prompt hygiene alone will not save you. You need a layered system that answers six production questions every single run:
- Who is this agent supposed to be?
- What actions may it take without asking?
- Which actions require an approval gate?
- Which situations require special safety protocols?
- What guardrails exist around code, data, and tools?
- How do you keep one run’s context from poisoning the next?
This chapter is the production answer: a 7-layer governance stack running around real GitHub Copilot agents. It is not a theoretical compliance artifact. It is the architecture you build when agents stop being demos and start becoming coworkers with keyboards.
This chapter pairs with Newsletter Issue #7: The 7-Layer AI Governance Stack. Read the newsletter for the fast executive version; use this chapter when you want the implementation details.
14.1 The Governance Gap
Chatbot safety research mostly assumes a conversational interface with a bounded outcome: the model produces text, a human reads it, and the human decides what to do next. In that world, the control surface is the answer itself. You can wrap the model in moderation filters, refusal policies, classifier checks, red-team prompts, and prompt injection detection. That work matters. It is just not enough when the model has agency over tools.
An operational agent changes the threat model in three ways. First, it converts language into action. A prompt is no longer just an instruction for text generation; it becomes an input to tool selection, state transitions, and external side effects. Second, it carries permissions. The agent might have access to repositories, inboxes, calendars, CRMs, payment dashboards, or internal APIs. Third, it persists. It remembers prior tasks, carries forward state, and often runs on a schedule without a human in the loop.
That is why governance has to move beyond “can the model refuse unsafe content?” and into “can the system constrain unsafe operations?” Prompt injection defense still matters, especially for content ingestion and browsing workflows, but it is only one layer. A compromised prompt should not be able to make a coding agent force-push to production, a finance agent move money above a threshold, or a family logistics agent state a child’s location as if it were live fact. Those protections do not live inside the model weights. They live in the operating system around the model.
The Cloud Security Alliance named this problem directly in its April 3, 2026 research note The AI Agent Governance Gap: What CISOs Need Now. Their argument is the same one practitioners have already learned the hard way: existing AI governance frameworks were written before autonomous, tool-calling agents became the production unit of work. They are useful foundations, but they do not yet give teams enough operational specificity for agent identity, authorization, approval flow, auditability, and multi-agent containment.
In practice, the gap shows up as bad architecture decisions. Teams build one giant system prompt instead of a constitution. They make every action approval-gated, turning agents into slow suggestion engines, or they make every action autonomous, turning agents into reckless interns with admin access. They rely on hooks without accounting for sub-agent limitations. They let cron workflows steer existing sessions because it feels efficient, then spend weeks debugging context pollution. They implement a brand review policy but forget to make it mandatory before publication. None of these failures are model failures. They are governance failures.
Operational governance therefore has to answer a different question than chatbot safety. Not “how do I stop a bad answer?” but “how do I keep a capable agent inside durable, inspectable boundaries while still letting it create leverage?” If your governance system makes agents too weak to help, you lose the upside. If it makes them too unconstrained to trust, you lose the platform. The only sustainable answer is layered control.
Core principle: chatbot safety is about response quality. Operational agent governance is about permission quality. The first protects outputs. The second protects actions.
14.2 Layer 1 - The Constitution
The first layer is a constitution: one binding principles document that every agent reads before it executes. Not a vague style guide. Not a wiki page nobody loads. A real operating contract that defines identity, communication rules, decision-making framework, autonomy boundaries, escalation rules, and cross-agent behavior.
This is the single most scalable governance artifact in a large agent system because it solves the consistency problem at the root. If you have 53 agents and each one carries a different interpretation of when to ask, how to communicate, how to escalate, or how to respect privacy, your platform will feel haunted. Some agents will be overly timid. Others will be overly aggressive. Some will create tasks automatically. Others will only suggest them. The human experiences this as randomness. The constitution removes that randomness.
A good constitution does four jobs. First, it defines identity: who the platform serves, what kind of assistant it is, and what values dominate edge cases. Second, it defines communication rules: tone, channel expectations, how to handle sensitive information, and how concise or verbose outputs should be. Third, it defines a decision-making framework: default-to-action versus default-to-clarification, escalation triggers, timing rules, and what “good judgment” means in this system. Fourth, it defines autonomy levels: which actions are safe to execute directly and which must stop for human approval.
The constitution is not a replacement for agent-specific memory. It is the top-level law that agent-specific memory sits beneath. A finance agent still has its own core.md. A scheduler agent still has its own working memory. But both inherit the same constitutional rules around privacy, task creation, escalation, and cross-agent delegation. That is why one document can govern dozens of agents without flattening their specialization.
Here is a representative structure:
# platform-constitution.md
## Identity
- You are a family and operations assistant for one household.
- Your job is to reduce cognitive load, not create more of it.
- Default to useful action, but never hide uncertainty.
## Communication Rules
- Be warm, concise, and direct.
- Respect user-specific channel preferences.
- Never leak private information across family members without authorization.
- When alerting Hector via Telegram, include speak text for TTS.
## Decision Framework
- If concrete data is missing, create a clarification task instead of guessing.
- If an action is routine and low-risk, do it and report afterward.
- If an action changes money, medical care, external messaging, or protected data, check the autonomy table.
## Autonomy Levels
- Calendar events from clear date/time mentions: autonomous.
- Task creation for actionable findings: autonomous.
- Sending email on behalf of a human: approval required.
- Purchases above threshold: approval required.
- Medical recommendations: escalate to human.
## Cross-Agent Rules
- Cron-dispatched work always launches a fresh agent.
- Never steer scheduled jobs into existing agent contexts.
- Use local tools first; use mesh only when another workspace owns the capability.
## Safety Rules
- Never state child location as current fact.
- Never invent dates from vague language.
- Never hardcode credentials or bypass governed tools.
Why does this scale so well? Because constitutions compress governance into one repeatedly loaded document. Update the constitution once and all 53 agents inherit the correction the next time they run. That beats editing 53 prompts. It also creates a stable place to persist lessons from human corrections. When Hector says, “from now on, never guess dates” or “always use TTS for my Telegram alerts,” that lesson belongs in the constitution so it becomes durable behavior rather than session trivia.
The constitution also gives you a review surface. If an agent behaves badly, you have somewhere concrete to ask, “Did the constitution fail, or did the agent fail to apply it?” Without a constitution, every incident turns into archaeology across prompts, memory files, and implied conventions. With one, governance becomes inspectable.
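The "constitution first, agent memory beneath it" ordering described above can be sketched as a small context-assembly step. This is a minimal illustration, not the platform's actual loader: `buildAgentContext` and its section labels are hypothetical names, and the inputs are passed as strings rather than read from disk.

```javascript
// Hypothetical helper: assemble the text an agent reads before it executes.
// The ordering (constitution first, agent memory second) is the point of the sketch.
function buildAgentContext(constitution, agentMemory, task) {
  if (!constitution || !constitution.trim()) {
    // Refuse to start an agent that has no governing document loaded.
    throw new Error('Constitution missing: refusing to start agent');
  }
  return [
    '# Constitution (binding)',
    constitution.trim(),
    '# Agent memory',
    (agentMemory || '').trim(),
    '# Task',
    task.trim(),
  ].join('\n\n');
}

const context = buildAgentContext(
  '## Safety Rules\n- Never invent dates from vague language.',
  '## core.md\n- Role: finance-manager',
  'Summarize the bills due this week.'
);
console.log(context);
```

Failing loudly when the constitution is missing is a deliberate choice in this sketch: an agent that runs without the top-level law is exactly the inconsistency the layer is meant to remove.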
14.3 Layer 2 - Tiered Autonomy
The second layer is tiered autonomy. This is where you stop talking about agent freedom in vague terms and start making explicit decisions about which actions deserve friction. Not all actions are equal. Creating a task from a detected action item is not the same as emailing a client. Scheduling a reminder is not the same as moving money. A healthy governance system encodes that difference instead of pretending “autonomous” and “manual” are the only two states.
The most practical framing is an act-first versus ask-first matrix. Low-risk, reversible, high-frequency actions should be autonomous. High-risk, irreversible, sensitive, or externally binding actions should stop for approval. This sounds obvious, yet many teams get it wrong in both directions. They either over-gate everything - making the agent ask permission for every tiny move until humans stop using it - or under-gate everything and discover too late that “send the recap” also meant “send it to the client.”
A good matrix is concrete enough that an agent can apply it without interpretation drift. Here is a production-style decision table:
| Action | Default | Why |
|---|---|---|
| Create calendar event from clear date/time mention | Just do it | High-frequency, reversible, low-risk, improves reliability |
| Create tasks from actionable findings | Just do it | Turns observations into execution without waiting |
| Add shopping items or routine reminders | Just do it | Low-risk household logistics |
| Relay family logistics between household members | Just do it | Shared coordination is the point of the system |
| Send email on behalf of a human | Ask first | External communication creates reputational and contractual risk |
| Purchases above $200 | Ask first | Financial threshold requires human intent |
| Medical decisions or recommendations | Always ask | Safety-critical and ethically sensitive |
| Delete data | Always ask | Irreversible loss |
The hidden advantage of this table is speed. When the agent does not have to negotiate low-risk actions, it becomes truly assistant-like. It detects, acts, and reports. That is the only mode in which an autonomous system actually removes work. But because the approval lines are explicit, the human still retains control over actions that change money, health, or external identity.
Tiered autonomy is not only for runtime actions. It should also govern how you change the platform itself. That is where the development pipeline tiers matter. A single-file copy tweak should not require a research committee. A new agent that touches money, medical workflows, or cross-session authority absolutely should.
| Tier | Scope | Pipeline |
|---|---|---|
| Tier 1 - Small | Single file, low-risk, under ~50 lines | Just do it |
| Tier 2 - Medium | Multi-file feature or moderate refactor | Plan -> Implement -> Review |
| Tier 3 - Large | Architecture changes, new systems, major integrations | Research -> Spec -> Implement -> Multi-model Review -> Fix |
| Tier 4 - Critical | Safety, financial, medical, or security-sensitive changes | Tier 3 plus dedicated safety review |
This is a subtle but important governance move: autonomy should increase with routine, not with confidence. An agent may feel “confident” about an email draft or a payment recommendation. That is irrelevant. What matters is the class of action. If the action affects identity, money, medical care, or protected data, the system inserts friction even when the model sounds certain. Conversely, if the action is low-risk and reversible, the system should let the agent move fast even if the work feels mundane. Governance is about action class, not model self-esteem.
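The development pipeline tiers can be encoded the same way as runtime actions. The sketch below assumes a hypothetical `change` descriptor (`filesTouched`, `linesChanged`, `touches`, `isArchitectural`); none of these field names come from the platform itself, and the Tier 4 step list is one reading of "Tier 3 plus dedicated safety review".

```javascript
// Sketch: map a proposed platform change to a pipeline tier.
// Classification is by class of change, never by how confident the author feels.
const SENSITIVE_AREAS = ['safety', 'financial', 'medical', 'security'];
const TIER_3_STEPS = ['Research', 'Spec', 'Implement', 'Multi-model Review', 'Fix'];

function pipelineForChange(change) {
  const sensitive = (change.touches || []).some((area) => SENSITIVE_AREAS.includes(area));
  if (sensitive) {
    // Tier 4: everything in Tier 3 plus a dedicated safety review.
    return { tier: 4, steps: [...TIER_3_STEPS, 'Safety Review'] };
  }
  if (change.isArchitectural) {
    return { tier: 3, steps: TIER_3_STEPS };
  }
  if (change.filesTouched === 1 && change.linesChanged < 50) {
    return { tier: 1, steps: ['Just do it'] };
  }
  return { tier: 2, steps: ['Plan', 'Implement', 'Review'] };
}

console.log(pipelineForChange({ filesTouched: 1, linesChanged: 12 }).tier);
console.log(pipelineForChange({ filesTouched: 1, linesChanged: 5, touches: ['financial'] }).tier);
```

Note that the sensitivity check runs first, so a one-line change to a payment rule still lands in Tier 4.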
Another common mistake is building the matrix in prose and never encoding it into prompts, skills, or hooks. If the autonomy rules are not machine-readable enough to appear in the constitution, the development pipeline, and the relevant skills, they are not rules. They are aspirations. Tiered autonomy works only when the agent encounters it every time it plans.
function decideAction(action) {
  // Order matters: the most sensitive action classes are checked first.
  if (action.type === 'medical') return 'escalate';
  if (action.type === 'purchase' && action.amount > 200) return 'approve';
  if (action.type === 'send_email_on_behalf') return 'approve';
  if (action.type === 'calendar_create' && action.hasClearDateTime) return 'execute';
  if (action.type === 'task_create') return 'execute';
  // Anything unrecognized or underspecified falls through to clarification.
  return 'clarify';
}
The outcome you want is simple: agents that feel decisive where they should be decisive, and cautious where they should be cautious. Tiered autonomy is how you get both at once.
14.4 Layer 3 - Approval Gates
The third layer is the approval gate. Tiered autonomy tells the system which classes of actions need review. Approval gates define how that review happens without destroying the agent’s momentum. This matters most in content systems, because content agents constantly cross source boundaries: public repos, published articles, private repos, enterprise codebases, internal notes, and live drafts.
In a production multi-agent content pipeline, not all source material should flow straight to public output. Public repos and already-published articles are often safe for autonomous reuse. Private repositories and enterprise code are not. Governance has to distinguish between “this source can be cited and synthesized automatically” and “this source requires human permission before it becomes public narrative.”
A practical source-tier model looks like this:
| Source Tier | Examples | Publish Rule |
|---|---|---|
| Tier 1 | Public GitHub repos owned by you | Autonomous publish allowed |
| Tier 2 | Published articles, newsletters, talks, docs | Autonomous publish allowed |
| Tier 3 | Platform patterns already documented publicly | Autonomous publish allowed |
| Tier 4 | Private repos, client workspaces, drafts not yet public | Approval required before publishing |
| Tier 5 | Enterprise repos, employer systems, internal corporate material | Approval required before publishing |
The approval workflow for Tier 4 and Tier 5 content should be explicit: draft, summarize what sensitive material is involved, send a review request over the human’s preferred channel, wait up to four hours, then fall back to public sources if no approval arrives. That timeout rule is more important than it looks. Without it, pipelines stall forever waiting on human approval. With it, the system preserves momentum while still respecting boundaries. The output may become less specific, but it remains publishable and safe.
The human review request itself should be optimized for action. Do not dump 2,000 words into Telegram and ask, “Thoughts?” Send a tight summary of what source class triggered the gate, what the proposed public claim is, and which response options are available. For Hector, the message should also include the speak parameter so TTS can read the decision request aloud while he is moving through the day.
Approval needed: Tier 5 source detected.
- Draft topic: AI agent governance patterns
- Sensitive source: enterprise repo architecture notes
- Proposed use: one paragraph on deployment controls
- Options:
  1. Approve use of this source
  2. Replace with public-source equivalent
  3. Skip this section entirely
Timeout: 4 hours, then fallback to public sources.
The timeout path is not “publish anyway.” It is “strip private material and rebuild from public evidence.” That distinction is what makes the gate trustworthy. Approval is only needed when the system wants to use the private source itself. If approval does not arrive, the system degrades gracefully instead of crossing the line.
Approval gates should also leave breadcrumbs in state. The agent should record that a gate was triggered, what source tier caused it, when the request was sent, and what fallback path was taken. Otherwise you lose auditability. Months later, when reviewing why a chapter became more generic than the draft promised, you want a clear answer: “Tier 5 source required approval, no response arrived in four hours, system fell back to public repos and published articles.”
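A breadcrumb of this kind can be as small as one record per triggered gate. The following is a sketch under assumptions: `recordGate`, the field names, and the outcome labels are illustrative and not the platform's actual schema.

```javascript
// Sketch: build the audit record for one triggered approval gate.
// A response that arrives after the deadline is treated the same as no response.
function recordGate({ sourceTier, requestedAt, respondedAt = null, approved = false, timeoutHours = 4 }) {
  const deadline = new Date(requestedAt.getTime() + timeoutHours * 60 * 60 * 1000);
  const answeredInTime = respondedAt !== null && respondedAt <= deadline;
  let outcome;
  if (answeredInTime && approved) outcome = 'approved_sensitive_source';
  else if (answeredInTime) outcome = 'declined_fallback_public';
  else outcome = 'timeout_fallback_public';
  return {
    gate: 'source_approval',
    sourceTier,
    requestedAt: requestedAt.toISOString(),
    deadline: deadline.toISOString(),
    outcome,
  };
}

const entry = recordGate({ sourceTier: 5, requestedAt: new Date('2026-01-01T08:00:00Z') });
console.log(entry);
```

Appending records like this to the agent's working state is what lets a later reviewer answer why a chapter fell back to public sources.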
This layer is where many teams confuse governance with bureaucracy. The point of approval gates is not to put humans back in every loop. The point is to reserve human review for moments where source provenance, external identity, or confidentiality actually matter. Everything else should keep moving.
if (sourceTier <= 3) {
  publishAutonomously();
} else {
  // Assumes requestApproval resolves to true only if a human approves before the timeout.
  const approved = await requestApproval({ via: 'telegram', speak: true, timeoutHours: 4 });
  if (approved) publishWithSensitiveSource();
  else rebuildFromPublicSources();
}
When implemented well, approval gates do not make the system feel slower. They make it feel grown up. The agent moves autonomously through the safe zones and knows exactly when to stop at the border.
14.5 Layer 4 - Safety Protocols
The fourth layer is where governance becomes domain-specific. A constitution gives you universal rules. Tiered autonomy gives you decision friction. Approval gates give you review boundaries. Safety protocols define what must happen when an agent enters a context where ordinary autonomy is not enough.
The easiest way to understand this layer is to think in terms of irreducible risks. Some domains are too sensitive to handle with generic “be careful” instructions. Child safety is one. Medical guidance is another. Financial commitments above a threshold are another. The failure mode in each domain is different, so the protocol has to be explicit enough to override normal agent instincts.
| Domain | Protocol | Why it exists |
|---|---|---|
| Child safety | Never state child location as current fact; add staleness caveat and create pickup reminder task | Location data decays quickly and wrong certainty is dangerous |
| Medical decisions | Always escalate to a human; no autonomous medical recommendations | Health risk and liability are too high |
| Financial actions | Purchases above $200 require approval | Prevents silent spend and preserves intent |
| Missing data | Create clarification task instead of guessing | Prevents confident fiction from driving action |
The child-safety rule is a perfect example of why generic prompting fails. A highly capable assistant wants to be helpful, and helpful often means sounding certain. But certainty about a child’s current location is precisely what you cannot allow unless the data is live and verified. The right response pattern is: state the latest known information as stale context, call out that it may have changed, and create a task or reminder that closes the loop. Governance here is not about suppressing information. It is about representing its freshness honestly.
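The response pattern for stale location data can be sketched as a formatter that never emits a present-tense claim. Everything here is illustrative: `describeChildLocation`, the `lastKnown` shape, and the task fields are assumptions, not the platform's real tool surface.

```javascript
// Sketch: report location only as stale context, with a caveat and a follow-up task.
function describeChildLocation(lastKnown, now) {
  const ageMinutes = Math.round((now.getTime() - lastKnown.recordedAt.getTime()) / 60000);
  return {
    message:
      `Last known: ${lastKnown.child} was at ${lastKnown.place} as of ${ageMinutes} minutes ago. ` +
      'This may have changed; please confirm before acting on it.',
    // Closing the loop: a reminder task instead of a confident assertion.
    followUpTask: {
      title: `Confirm pickup plan for ${lastKnown.child}`,
      category: 'reminder',
      priority: 'high',
    },
  };
}

const report = describeChildLocation(
  { child: 'Child A', place: 'school', recordedAt: new Date('2026-01-01T15:00:00Z') },
  new Date('2026-01-01T16:30:00Z')
);
console.log(report.message);
```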
Medical protocols work differently. The problem is not just stale data; it is the ethical line between support and advice. An agent can help organize appointment schedules, medication reminders, prep notes, and post-visit questions. It should not autonomously recommend a treatment path, dosage change, or diagnostic conclusion. That is a hard escalation boundary. The right governance move is to make “medical decision” a first-class action type that always routes to a human.
Financial thresholds are the same pattern translated into money. Let the agent log receipts, create reminder tasks, summarize bills, and detect anomalies. But once the action involves committing new spending above an explicit threshold, the human re-enters the loop. Thresholds matter because they turn “be careful with money” into a machine-usable rule.
The no-assumptions rule may be the most broadly useful safety protocol of all. Agents are language models. Their native instinct is completion. When information is missing, they will try to bridge the gap with whatever pattern seems plausible unless you explicitly train the system otherwise. In production, that is deadly. Wrong departure times, invented calendar context, guessed inventory levels, and assumed ownership boundaries all create downstream failures that look like “the agent was proactive” when what actually happened was “the agent hallucinated with confidence.”
The fix is brutally simple: if concrete data is missing and the missing data matters to the decision, the agent creates a clarification task and blocks dependent reasoning. This feels slower in the moment, but it is faster at system scale because it prevents the more expensive failure mode: carrying a bad assumption into multiple autonomous steps.
## Clarification Protocol
- Detect missing decision-critical data.
- Do not infer the missing value.
- Create a task with category: clarification and priority: high.
- Record which downstream action is blocked.
- Resume only when the clarification is answered.
Notice what all these protocols have in common: they convert vague safety intent into repeatable procedure. That is the real role of Layer 4. It takes the categories where mistakes are expensive and makes the safe path so explicit that the agent can follow it automatically.
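The clarification protocol above can be sketched as a planning guard. This is a minimal illustration: `planWithClarification` and the task shape are hypothetical, and a real system would create the task through its governed task tools rather than return it.

```javascript
// Sketch: block an action when decision-critical fields are missing,
// and emit a clarification task instead of inferring the value.
function planWithClarification(action, requiredFields) {
  const missing = requiredFields.filter(
    (field) => action[field] === undefined || action[field] === null || action[field] === ''
  );
  if (missing.length === 0) return { status: 'ready', action };
  return {
    status: 'blocked',
    blockedAction: action.type, // record which downstream action is waiting
    task: {
      title: `Clarify ${missing.join(', ')} for ${action.type}`,
      category: 'clarification',
      priority: 'high',
    },
  };
}

const plan = planWithClarification(
  { type: 'calendar_create', title: 'Dentist', date: '2026-03-02' },
  ['date', 'time']
);
console.log(plan.status, plan.task.title);
```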
14.6 Layer 5 - Code and Data Guards
The fifth layer moves from behavioral governance into technical enforcement. Up to this point, most controls have been prompt-level or workflow-level. Code and data guards make certain mistakes mechanically harder to commit.
The first class of guard is tool governance. In a coding platform, that often means intercepting tool calls before they execute and blocking dangerous paths. A clean example is a dev-guard hook that blocks raw git operations so agents must go through governed workflow tools instead of free-form shell commands.
export default {
  onPreToolUse: async ({ tool, args }) => {
    const blockedTools = ['git_commit', 'git_push', 'git_add'];
    if (blockedTools.includes(tool)) {
      return { blocked: true, reason: 'Use dev-workflow tools instead' };
    }
  }
};
The point is not that git is bad. The point is that ungoverned code paths bypass policy. If your governed workflow adds co-author trailers, preserves review discipline, records metadata, or prevents pushes to protected branches, then a raw command is effectively a side door around governance. Hooks close the side door.
But hooks are not enough on their own. In the Copilot SDK version used in this platform (v1.0.47), hooks do not propagate automatically to sub-agents. That limitation matters because it creates an illusion of safety: the parent session looks protected, while a launched sub-agent may still attempt raw operations unless the prompt itself repeats the rule. This is why strong platforms use prompt-level enforcement as backup. The guardrail must exist both in executable hooks and in the instructions every sub-agent receives.
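One way to make the prompt-level backup hard to forget is to wrap every sub-agent prompt before launch. The sketch below is an assumption about how that could look: `withGuardrails` and the rule text are hypothetical, and it does not call any Copilot SDK API.

```javascript
// Sketch: prepend the non-negotiable tool rules to every sub-agent prompt,
// because hooks on the parent session may not reach the child.
const GUARD_RULES = [
  'Do not run raw git commands (commit, push, add).',
  'Use dev-workflow tools for all repository changes.',
];

function withGuardrails(subAgentPrompt, rules = GUARD_RULES) {
  const header = ['## Non-negotiable tool rules', ...rules.map((rule) => `- ${rule}`)].join('\n');
  // Idempotent: do not stack the header if the prompt was already wrapped.
  if (subAgentPrompt.startsWith(header)) return subAgentPrompt;
  return `${header}\n\n${subAgentPrompt}`;
}

const guarded = withGuardrails('Refactor the billing module and open a PR.');
console.log(guarded);
```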
The second class of guard is data governance. Once agents start reading and writing shared files, you need a protected-files model. Not every path should be open for direct edits. Some files should only change through dedicated tools that validate schema, enforce ownership, and preserve auditability.
{
  "protectedFiles": [
    "data/tasks/tasks.json",
    "data/bills/bills.json",
    "data/calendar/availability.json",
    "data/agents/*/core.md"
  ],
  "writePolicy": {
    "directEditsAllowed": false,
    "requiredTools": {
      "data/tasks/tasks.json": "add_task|update_task|complete_task",
      "data/bills/bills.json": "bill_upsert",
      "data/calendar/availability.json": "calendar_sync"
    }
  }
}
This design solves three problems at once. First, it stops accidental corruption from free-form edits. Second, it forces schema validation into one place. Third, it creates cleaner domain boundaries. A scheduling agent should not directly mutate finance files. A finance agent should not directly edit child-care records. Domain ownership is not only a conceptual rule; it should be visible in the filesystem.
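A write check against the protected-files policy above might look like the following. It is a sketch, not the platform's enforcement code: `checkWrite` and `globToRegExp` are hypothetical helpers, and `*` is assumed to match exactly one path segment.

```javascript
// Sketch: decide whether a tool may write to a path under the protected-files policy.
const policy = {
  protectedFiles: [
    'data/tasks/tasks.json',
    'data/bills/bills.json',
    'data/calendar/availability.json',
    'data/agents/*/core.md',
  ],
  requiredTools: {
    'data/tasks/tasks.json': 'add_task|update_task|complete_task',
    'data/bills/bills.json': 'bill_upsert',
    'data/calendar/availability.json': 'calendar_sync',
  },
};

// Turn a simple glob into an anchored RegExp ("*" = one path segment, assumed).
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '[^/]+');
  return new RegExp(`^${escaped}$`);
}

function checkWrite(path, tool) {
  const isProtected = policy.protectedFiles.some((glob) => globToRegExp(glob).test(path));
  if (!isProtected) return { allowed: true };
  const allowedTools = (policy.requiredTools[path] || '').split('|').filter(Boolean);
  if (allowedTools.includes(tool)) return { allowed: true };
  return { allowed: false, reason: `${path} is protected; direct edits are not allowed` };
}

console.log(checkWrite('data/tasks/tasks.json', 'edit_file'));
console.log(checkWrite('data/tasks/tasks.json', 'add_task'));
```

In this sketch a protected path with no entry in `requiredTools`, such as an agent's `core.md`, rejects every tool. Whether those files should have a dedicated writer is a policy decision the original configuration leaves open.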
A simple ownership map is often enough:
data/agents/finance-manager/  -> finance-manager owns
data/agents/nicu-care/        -> nicu-care owns
data/agents/content-manager/  -> content-manager owns
data/shared/                  -> shared read; writes only via tools
data/tasks/                   -> task system writes only via task tools
Cross-domain writes then become explicit exceptions instead of accidental habits. If one domain genuinely needs to affect another, that interaction should go through a tool or a structured message, not a silent file write. That is how you preserve the ability to reason about causality. “Why did this value change?” is answerable when changes route through governed paths.
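The ownership map translates into a short lookup. This is a sketch with a hypothetical `canWriteDirectly` helper; it only answers whether a direct file write is permitted, and leaves tool-mediated writes to the tools themselves.

```javascript
// Sketch: direct writes are allowed only inside the directory an agent owns.
const OWNERS = {
  'data/agents/finance-manager/': 'finance-manager',
  'data/agents/nicu-care/': 'nicu-care',
  'data/agents/content-manager/': 'content-manager',
};

function canWriteDirectly(agent, path) {
  const prefix = Object.keys(OWNERS).find((dir) => path.startsWith(dir));
  // Paths outside the per-agent directories (data/shared/, data/tasks/) are tool-only.
  return prefix !== undefined && OWNERS[prefix] === agent;
}

console.log(canWriteDirectly('finance-manager', 'data/agents/finance-manager/working.md'));
console.log(canWriteDirectly('content-manager', 'data/agents/finance-manager/working.md'));
```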
The broader lesson of Layer 5 is that prompt instructions alone cannot bear the entire governance burden. The more important an invariant is, the closer it should live to the execution boundary. Prompts teach. Hooks intercept. Tools validate. File ownership constrains. You want all four.
When teams skip this layer, the platform still works for a while. Then a sub-agent writes the wrong file, a raw command bypasses the protected path, or a shared JSON document gets hand-edited into an invalid state. Those failures feel random. They are not random. They are what happens when the system has values but not enforcement.
14.7 Layer 6 - Context Isolation
The sixth layer is context isolation, and it is the one most teams discover only after they have already poisoned their own agents. When an agent system runs on schedules, retries, pipelines, or background sessions, context becomes a security boundary. If you let one cycle leak assumptions into the next, you are not just making the agent messy. You are giving stale instructions operational authority.
The most important rule here is brutally simple: every cron-fired job launches a fresh agent. Never steer a scheduled task into an existing context just because a process is already running. It feels efficient. It is governance debt.
Why? Because running agents accumulate situational instructions that are valid only for the moment they were issued. “Stay quiet during the meeting.” “Do not nudge until noon.” “Pause because the twins are sleeping.” “Wait for approval on this branch.” Those are not stable truths. They are ephemeral state. If the next scheduled cycle lands inside that old conversation, the new work inherits old constraints with no guarantee they are still correct.
This is why fresh-agent-per-cron is more than an orchestration preference. It is a security property. It guarantees that a scheduled job starts from the constitution, current inputs, and current memory - not from the accidental residue of a previous conversation. Context isolation turns each cycle into a deterministic invocation instead of a haunted continuation.
The same principle applies to state machine pipelines. Multi-step workflows should not rely on one immortal process staying alive forever. They should persist their baton in state, then let each step execute in isolation. State belongs in files or governed storage, not in the lingering context window of one lucky session.
state: step3_review
retry_count: 1
last_error: null
baton:
  topic: ai-agent-governance
  draft_pr: 418
  review_agent: quality-gate
next_if_success: step3_merge
next_if_failure: step3_remediate
That baton model buys you restartability, inspectability, and composability. If the process crashes, the next cycle reads the baton and continues. If a human wants to inspect progress, they read the state file instead of reverse-engineering a chat history. If another agent needs to take over, it can do so without inheriting irrelevant conversation debris.
Context isolation also requires limits. Retrying forever is not resilience. It is denial. A sane production rule is a maximum of three retries per state, then escalation to a human. Beyond that, the system is not “working through a transient issue.” It is silently consuming cycles while pretending persistence equals progress.
Stuck detection is the companion rule. Some failures do not increment retry counts because nothing explicitly throws. The workflow just never advances. That is why a second timer matters: if a pipeline stays in the same state for four or more hours, alert a human even if no hard failure was recorded. Long-running silence is its own error condition.
| Context Isolation Rule | Enforcement |
|---|---|
| Cron jobs always launch fresh agents | No write_agent for scheduled dispatch |
| Sequential pipelines persist baton state externally | Use working memory or governed state files |
| Retry loops are bounded | Max 3 retries per state |
| Silent stagnation is escalated | Alert after 4+ hours in same state |
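The last two rows of the table, bounded retries and stuck detection, can be checked by a watchdog that runs at the start of each cycle. The sketch assumes a pipeline state object with `state`, `retry_count`, and a hypothetical `enteredStateAt` timestamp, and it reads "max 3 retries" as "escalate once three retries have already been used".

```javascript
// Sketch: decide whether a pipeline may continue or must be escalated to a human.
const MAX_RETRIES = 3;
const STUCK_AFTER_MS = 4 * 60 * 60 * 1000; // four hours in the same state

function checkPipeline(pipeline, now) {
  if (pipeline.retry_count >= MAX_RETRIES) {
    return { action: 'escalate', reason: `retry limit reached in ${pipeline.state}` };
  }
  const msInState = now.getTime() - pipeline.enteredStateAt.getTime();
  if (msInState >= STUCK_AFTER_MS) {
    // No failure was recorded, but silence this long is its own error condition.
    return { action: 'escalate', reason: `no progress in ${pipeline.state} for 4+ hours` };
  }
  return { action: 'continue' };
}

const status = checkPipeline(
  { state: 'step3_review', retry_count: 1, enteredStateAt: new Date('2026-01-01T08:00:00Z') },
  new Date('2026-01-01T09:00:00Z')
);
console.log(status.action);
```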
One useful way to think about Layer 6 is that it protects the freshness of intent. The human’s intent at 8:00 AM is not necessarily the same as the human’s intent at 10:00 AM. The system’s operational state two hours ago is not necessarily the state now. Isolation preserves the right to re-evaluate with current evidence.
Without this layer, every other layer weakens. A strong constitution can be drowned out by stale messages. A careful autonomy table can be bypassed by old context. A safety protocol can be suppressed by an inherited “do not send anything yet” instruction that no longer applies. Governance has to include memory hygiene between runs, or memory itself becomes an attack surface.
14.8 Layer 7 - Brand and Content Safety
The seventh layer is brand and content safety. This is where the governance stack meets public output. Many teams think of brand review as marketing polish. In an agentic system, it is operational risk management. An autonomous content agent can publish incorrect framing at scale, leak private implementation detail, use the wrong comparison language, or slip in a forbidden employer reference faster than a human editor would ever notice manually.
That is why pre-publish review has to be mandatory for public content. Not “recommended when convenient.” Mandatory. Every public artifact should pass a separate review step before merge or scheduling. In practice the cleanest loop is:
Create -> Review (separate agent) -> Remediate -> Review again -> Merge
                                     ^
                                     max 2 remediation cycles
The review agent should not be the same agent that wrote the content. Separation matters. The creator is biased toward its own draft. A second agent can check URLs, claims, competitor framing, employer-name bans, source provenance, and placeholder hygiene from a different context window.
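The Create -> Review -> Remediate loop with its two-cycle cap can be sketched as a bounded loop. This is an illustration only: `runReviewLoop` is a hypothetical name, the reviewer and remediator are passed in as plain synchronous functions, and escalating to a human once the cap is hit is an assumption the source does not spell out.

```javascript
// Sketch: review, remediate at most `maxRemediations` times, then merge or escalate.
// `review(content)` is assumed to return { passed: boolean, issues: string[] }.
function runReviewLoop(draft, review, remediate, maxRemediations = 2) {
  let current = draft;
  let remediations = 0;
  for (;;) {
    const result = review(current);
    if (result.passed) {
      return { status: 'merge', content: current, remediations };
    }
    if (remediations >= maxRemediations) {
      // Cap reached: stop looping and hand the remaining issues to a human.
      return { status: 'escalate', content: current, remediations, issues: result.issues };
    }
    current = remediate(current, result.issues);
    remediations += 1;
  }
}
```

In a real pipeline `review` and `remediate` would be separate agent invocations, each launched with a fresh context, which is what keeps the reviewer from inheriting the creator's bias.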
Competitor framing is a good example of why generic “be professional” guidance is insufficient. If GitHub Copilot is the hero product in your ecosystem, then comparative content should position it favorably and fairly. That does not mean lying about competitors. It means the agent should not casually undermine the very product family you are publicly associated with, especially when the content is meant to reinforce a clear brand narrative. Brand safety is not censorship. It is coherence.
The same goes for employer-name bans. Public content may be allowed to reference enterprise patterns, internal lessons, or previous platform experience, but some company names should never appear. That rule must be explicit and enforced before publishing. The safe pattern is generic framing: “enterprise DevOps platform,” “Fortune 500 energy company,” “large internal platform team.” The unsafe pattern is naming protected employers or internal systems directly.
A strong quality gate should also reject unfinished content outright. Zero placeholders means zero placeholders. No TODOs, no “to be expanded,” no “V2,” no “coming soon,” no half-written comparison bullets waiting for a later pass. In public content, placeholders are not harmless scaffolding. They are evidence that nobody owned the finish line.
| Brand Safety Check | What the review agent verifies |
|---|---|
| Claim grounding | Assertions map to public sources or approved internal patterns |
| URL verification | Linked resources exist and resolve correctly |
| Competitor framing | Hero product remains positioned favorably and accurately |
| Employer name bans | No forbidden company names appear in public copy |
| Placeholder scan | No unfinished markers survive into mergeable content |
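Two of the checks in the table are mechanical enough to run as plain text scans before the review agent sees the draft. The sketch below is an assumption about how that could look: the placeholder list mirrors the markers named earlier in this section, and the banned-name list is supplied by the caller (the names in the usage example are made up).

```python
import re
from typing import Iterable, List

# Markers named in the text; extend the list for your own content.
PLACEHOLDER_PATTERNS = [
    r"\bTODO\b",
    r"\bto be expanded\b",
    r"\bV2\b",
    r"\bcoming soon\b",
]


def scan_placeholders(text: str) -> List[str]:
    """Return every unfinished marker found in the text (case-insensitive)."""
    hits: List[str] = []
    for pattern in PLACEHOLDER_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE))
    return hits


def scan_banned_names(text: str, banned: Iterable[str]) -> List[str]:
    """Return each banned name that appears in the text as a whole word."""
    lowered = text.lower()
    return [
        name for name in banned
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered)
    ]


if __name__ == "__main__":
    draft = "We learned this at ExampleCorp. Benchmarks coming soon."
    print(scan_placeholders(draft))
    print(scan_banned_names(draft, ["ExampleCorp"]))
```

A non-empty result from either function is a reason to reject the draft. The `V2` pattern will also flag legitimate version strings such as "API v2", so expect to tune the list per publication.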
This layer is also where GitHub Copilot becomes the positive example instead of the generic “AI tool.” If your platform is built around Copilot agents, Copilot should appear in the narrative as the trusted, production-grade tool that you wrap with governance - not as a neutral background detail. That positive framing matters because public architecture writing is part technical guidance, part market positioning.
The best brand-safety systems do not merely block bad language. They protect strategic consistency. They ensure every article, blueprint, post, and newsletter sounds like it came from the same serious operator, not from a swarm of unaligned drafts. In a multi-agent content platform, that consistency is not a soft skill. It is the final layer that keeps autonomous publishing from damaging the brand it is meant to amplify.
14.9 The Emerging Governance Ecosystem
The strongest signal that these patterns matter is that the broader ecosystem is converging on them. What started as “just prompt engineering” is becoming a real governance market because production agent systems need real controls.
UiPath’s coded agent documentation is a good example. Their model treats agents as governed processes inside Orchestrator folders, with scheduled execution, monitoring, and the same operational discipline applied to standard automation. Their credential-store documentation adds a second signal: secrets live behind managed stores and proxies, not inside agent code. That is exactly the direction serious agent platforms must take - identity, execution, and credential access governed by platform controls rather than local convention.
Coder’s AI Coder docs show the same pattern from the developer-platform side. Their architecture runs the agent loop in the control plane, not buried ad hoc inside every workspace, which enables network isolation, centralized model selection, admin-controlled prompts and tool permissions, and audit trails for prompts and tool invocations. In other words: model-agnostic governance plus workspace isolation. That is not a niche feature. It is the shape of enterprise readiness.
NIST is moving in the same direction. The AI Risk Management Framework and its playbook already provide the trustworthiness vocabulary - govern, map, measure, manage - that teams use to structure AI risk. More importantly for autonomous systems, NIST has now launched an AI Agent Standards Initiative focused specifically on secure, interoperable agent systems, agent identity, and multi-agent interaction. That is a formal admission that agentic systems require more specific controls than earlier AI deployments.
The Cloud Security Alliance made the gap even more explicit in April 2026, calling out an “AI Agent Governance Framework Gap” and arguing that organizations cannot wait for standards bodies to finish the work before building internal controls. That is exactly right. Standards are catching up. Production systems are already here.
The practical takeaway is not “wait for the ecosystem.” It is the opposite. Use the ecosystem as confirmation that the patterns in this chapter are not personal quirks. Constitutions, approval gates, audit trails, controlled credentials, network isolation, bounded autonomy, and explicit review loops are rapidly becoming the common language of serious agent deployment.
And this is why GitHub Copilot matters as the hero example. Copilot is not just valuable because it can write code. It is valuable because it sits at the point where AI usefulness meets enterprise expectations. The future of coding agents is not raw capability in a vacuum. It is high capability wrapped in governance people can actually trust.
14.10 Implementation Checklist
If you want to implement this stack without boiling the ocean, do it in priority order. Layering matters. Start with the pieces that create shared behavioral consistency, then add the pieces that enforce it technically.
- Write the constitution first. If you skip this, every other layer becomes fragmented because each agent improvises its own interpretation.
- Define the autonomy matrix. Decide what agents may do, what requires approval, and what always escalates.
- Install safety protocols. Encode the non-negotiables for child safety, medical actions, financial thresholds, and missing-data clarification.
- Add approval gates. Make source-tier review and external action review explicit, with timeout and fallback logic.
- Add code and data guards. Block side doors, protect shared files, and route sensitive writes through tools.
- Enforce context isolation. Fresh agents for cron, externalized baton state, retry limits, stuck detection.
- Finish with brand and quality gates. Protect the final public surface before you let publishing run autonomously.
Here is the practical build order as a deployment checklist:
| Step | What to build | Definition of done |
|---|---|---|
| 1 | Constitution document | Every agent loads the same core rules before acting |
| 2 | Autonomy table | Low-risk actions execute automatically; sensitive actions stop cleanly |
| 3 | Safety protocols | Child, medical, financial, and clarification cases follow explicit procedures |
| 4 | Approval workflow | Tier 4 and 5 sources request review, wait 4 hours, then fall back safely |
| 5 | Hooks and data tools | Protected paths and blocked tool calls are enforced mechanically |
| 6 | Isolation rules | Fresh cron dispatch, persisted baton state, bounded retries, stuck alerts |
| 7 | Brand and quality review | No public artifact publishes without review and placeholder scans |
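Step 4 is the row that most often gets implemented vaguely, so here is one way to make its "request review, wait 4 hours, then fall back" rule explicit. This is a sketch under assumptions: the function name and the four status strings are hypothetical, and what "fallback" actually does (skip the source, use a safer tier, and so on) is not specified by the table.

```python
from datetime import datetime, timedelta
from typing import Optional

APPROVAL_TIMEOUT = timedelta(hours=4)  # wait time from step 4 of the checklist
REVIEW_TIERS = {4, 5}                  # source tiers that require review


def approval_decision(
    source_tier: int,
    requested_at: datetime,
    now: datetime,
    approved: Optional[bool] = None,
) -> str:
    """Return 'proceed', 'wait', 'rejected', or 'fallback'.

    `approved` is None while the reviewer has not answered yet.
    """
    if source_tier not in REVIEW_TIERS:
        return "proceed"      # lower tiers do not need a human
    if approved is True:
        return "proceed"
    if approved is False:
        return "rejected"
    if now - requested_at >= APPROVAL_TIMEOUT:
        return "fallback"     # no answer in time: take the safe path
    return "wait"
```

Keeping the decision a pure function of timestamps means the waiting state can live in a file between runs instead of inside a long-lived agent context.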
Three common pitfalls show up almost every time:
- Pitfall 1: governance by prose only. If the rule never reaches hooks, tools, prompts, or workflows, it will drift.
- Pitfall 2: over-gating everything. Agents become too timid to help. Save human review for actions that actually justify it.
- Pitfall 3: treating context as harmless. Old context is not harmless. It is stale authority.
A fourth pitfall is subtler: teams often try to implement Layer 5 before Layer 1 because technical controls feel more concrete. Resist that impulse. Hooks without a constitution only enforce fragments. You need the governing logic first so the technical layer knows what it is defending.
Another practical rule: document incidents as governance lessons, not as one-off mistakes. If a bad behavior repeats, it belongs in the constitution, a safety protocol, a tool guard, or a review checklist. Mature agent systems get better by converting incidents into durable controls.
Implementation order:
1. Constitution
2. Tiered autonomy
3. Safety protocols
4. Approval gates
5. Code/data guards
6. Context isolation
7. Brand/content safety
Operational rule:
Every incident should either strengthen a rule, create a tool guard,
or improve a review gate. If it doesn't, the system didn't learn.
The reason to implement the stack in this order is that it maximizes useful autonomy early. Once the constitution, autonomy matrix, and safety protocols exist, agents can already move much faster on safe work. The later layers make that speed trustworthy at scale.
The final mental model is this: the 7-layer stack is not seven documents. It is seven kinds of control. Layer 1 tells the agent who it is. Layer 2 tells it how much freedom it has. Layer 3 tells it when to stop for a human. Layer 4 tells it which domains need special care. Layer 5 constrains dangerous tool and data paths. Layer 6 keeps contexts from contaminating each other. Layer 7 protects the public surface.
Together, they turn an AI agent from an impressive demo into a governable system.
Next step: read Newsletter Issue #7 for the short version, pair this chapter with The Agentic Development Blueprint for the broader operating model, and if you want help implementing governed GitHub Copilot workflows in your own team, check out my consulting page.
This chapter pairs with Newsletter Issue #7: The 7-Layer AI Governance Stack. The newsletter explains the model fast. This chapter shows how to operationalize it.
The throughline of this entire blueprint is now complete: memory keeps agents continuous, skills keep them reusable, extensions keep them controllable, orchestration keeps them coordinated, and governance keeps them safe enough to trust.
Quick Reference
Cheat sheets, decision flowcharts, and templates you can print and pin to your wall.
4-Tier Memory Cheat Sheet
Load/Save Sequence Checklist
At Run Start (MANDATORY):
- ☐ Read data/agents/{agent-name}/core.md (Tier 1)
- ☐ Read data/agents/{agent-name}/working.md (Tier 2)
- ☐ Check the “Last Updated” timestamp: is it stale (> 3 days)?
- ☐ If stale, refresh temporal references before trusting content
- ☐ Only load long-term.md if you need historical context
- ☐ Never bulk-load events.log
At Run End (MANDATORY):
- ☐ Update working.md: what happened, state changes, deferred items
- ☐ Update the “Last Updated” timestamp: full ISO-8601 with timezone
- ☐ Check size: is working.md under 5KB? If not, prune now
- ☐ Append to events.log: one line per significant action
- ☐ Promote to long-term.md ONLY if a validated pattern was discovered
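The run-end steps that involve timestamps, size, and the event line are easy to get subtly wrong by hand, so a few helper functions can carry them. This is a minimal sketch with assumed names; it treats 5KB as 5 × 1024 bytes and defaults to UTC when no time is passed, neither of which the checklist specifies.

```python
from datetime import datetime, timezone
from typing import Optional

WORKING_MEMORY_LIMIT_BYTES = 5 * 1024  # assumed meaning of the 5KB budget


def iso_timestamp(now: Optional[datetime] = None) -> str:
    """Full ISO-8601 timestamp with timezone offset, to the second."""
    now = now or datetime.now(timezone.utc)
    if now.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return now.isoformat(timespec="seconds")


def needs_pruning(working_md: str) -> bool:
    """True when working.md is not under the size budget (UTF-8 bytes)."""
    return len(working_md.encode("utf-8")) >= WORKING_MEMORY_LIMIT_BYTES


def format_event(action: str, detail: str, now: Optional[datetime] = None) -> str:
    """One events.log line in the '[timestamp] action: detail' shape."""
    return f"[{iso_timestamp(now)}] {action}: {detail}\n"
```

Appending the result of `format_event` with the file opened in `"a"` mode keeps the log append-only, which matches the rule that `events.log` is never bulk-loaded or rewritten.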
Pruning Decision Flowchart
For each item in working memory, ask in order:
- Is it completed AND older than 7 days? → Remove it.
- Has it been repeated 3+ times without change? → Collapse into one summary line.
- Has it been deferred for 14+ days? → Either escalate (create user task), move to long-term, or remove.
- Is it still relevant to the CURRENT state? → Keep it. Otherwise, remove.
- After all removals, is the file under 5KB? → Done. If not, shorten verbose entries to bullets.
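The per-item questions in the flowchart translate directly into an ordered chain of checks. The sketch below assumes a hypothetical `MemoryItem` record; in practice an agent would infer these fields from the Markdown entry. The final size check is file-level and is not part of this function.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """Hypothetical record for one working-memory entry."""
    completed: bool = False
    age_days: int = 0        # days since the item was completed or last touched
    repeat_count: int = 1    # unchanged repetitions across runs
    deferred_days: int = 0   # days spent in Pending / Deferred
    relevant: bool = True    # still bears on the CURRENT state


def prune_decision(item: MemoryItem) -> str:
    """Apply the flowchart questions in order; the first match wins."""
    if item.completed and item.age_days > 7:
        return "remove"
    if item.repeat_count >= 3:
        return "collapse"
    if item.deferred_days >= 14:
        return "escalate_move_or_remove"
    return "keep" if item.relevant else "remove"
```

Order matters: a completed, eight-day-old item is removed even if it was also repeated three times, because the first question fires before the second.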
Promotion Criteria Checklist
Promote from working → long-term ONLY when ALL are true:
- ☐ It’s a pattern, not a one-off event
- ☐ It’s been confirmed across 3+ runs or data points
- ☐ It will affect future decisions (it’s actionable)
- ☐ It’s NOT already captured in a skill, core rule, or platform convention
- ☐ It can be summarized in 3-5 bullet points (not a raw data dump)
Staleness Detection Checklist
Working memory is stale if ANY are true:
- ☐ “Last Updated” is > 3 days old AND agent has an active cron schedule
- ☐ Contains “today,” “tomorrow,” or “this week” from a previous week
- ☐ References events or deadlines that have already passed without resolution notes
- ☐ Contains items marked “pending” with no progress for 14+ days
When detected: Flag → Refresh temporal references → Remove completed items → Update timestamp.
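The first two staleness conditions can be checked mechanically; the other two need the agent to read the content. The sketch below covers only those first two, with assumed names and an assumed definition of "previous week" as a different ISO week from the current run.

```python
import re
from datetime import datetime, timedelta
from typing import List

STALE_AFTER = timedelta(days=3)
RELATIVE_TERMS = re.compile(r"\b(today|tomorrow|this week)\b", re.IGNORECASE)


def staleness_flags(
    working_md: str,
    last_updated: datetime,
    now: datetime,
    has_cron: bool,
) -> List[str]:
    """Return the mechanical staleness conditions that are currently true."""
    flags: List[str] = []
    if has_cron and now - last_updated > STALE_AFTER:
        flags.append("last_updated_over_3_days")
    # Relative dates written in an earlier ISO week no longer mean what they say.
    written_week = tuple(last_updated.isocalendar())[:2]
    current_week = tuple(now.isocalendar())[:2]
    if written_week != current_week and RELATIVE_TERMS.search(working_md):
        flags.append("relative_dates_from_previous_week")
    return flags
```

`datetime.fromisoformat` can parse the "Last Updated" value when it is written in the full ISO-8601 form the working-memory template asks for. Pass `now` and `last_updated` with the same timezone awareness, since Python refuses to subtract a naive datetime from an aware one.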
Complete Templates (Copy-Paste Ready)
All four templates in one block for easy setup of a new agent:
# ============================================
# FILE: data/agents/{agent-name}/core.md
# ============================================
# {Agent Name} - Core Identity
## Last Updated
{YYYY-MM-DD}
## Identity
{1-2 sentences: role and scope}
## Mission
- {Primary responsibility}
- {Secondary responsibility}
## Ownership Boundaries
### You own
- {Domain 1}
- {Domain 2}
### You do NOT own
- {Explicit exclusion 1}
- {Explicit exclusion 2}
## Core Heuristics
1. {Decision rule 1}
2. {Decision rule 2}
## Key Rules
- {Constraint 1}
- {Constraint 2}
# ============================================
# FILE: data/agents/{agent-name}/working.md
# ============================================
# {Agent Name} - Working Memory
## Last Updated
{YYYY-MM-DDTHH:MM:SS±HH:MM}
## Current State
- {Active item 1}
- {Active item 2}
- {Active item 3}
## Recent Actions
- {Last run}: {outcome}
## Pending / Deferred
- {Item}: waiting for {reason}
## Active Rules
- {Any temporary overrides}
# ============================================
# FILE: data/agents/{agent-name}/long-term.md
# ============================================
# {Agent Name} - Long-Term Memory
## Last Updated
{YYYY-MM-DD}
## History & Learnings
{Empty until first validated pattern}
## Recurring Patterns
{Empty until patterns are confirmed 3+ times}
## Decisions Made
{Empty until significant decisions are recorded}
# ============================================
# FILE: data/agents/{agent-name}/events.log
# ============================================
[{ISO timestamp}] create: Agent initialized with 4-tier memory
To set up a new agent’s memory system, create the directory and populate these four files. The entire setup takes less than 5 minutes. The agent is ready to maintain persistent context from its very first run.
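Creating the directory and the four files can also be scripted. The following is a minimal sketch, assuming the `data/agents/{agent-name}/` layout from the templates; the function names are hypothetical, the default clock is UTC, and only the title and "Last Updated" stubs are written. The remaining template sections still need to be filled in by hand.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional


def render_memory_files(agent_name: str, now: datetime) -> Dict[str, str]:
    """Return {filename: initial content} for the four memory tiers."""
    day = now.date().isoformat()
    stamp = now.isoformat(timespec="seconds")
    return {
        "core.md": f"# {agent_name} - Core Identity\n## Last Updated\n{day}\n",
        "working.md": f"# {agent_name} - Working Memory\n## Last Updated\n{stamp}\n",
        "long-term.md": f"# {agent_name} - Long-Term Memory\n## Last Updated\n{day}\n",
        "events.log": f"[{stamp}] create: Agent initialized with 4-tier memory\n",
    }


def scaffold_agent(root: Path, agent_name: str, now: Optional[datetime] = None) -> Path:
    """Create data/agents/{agent-name}/ under root without overwriting files."""
    now = now or datetime.now(timezone.utc)
    agent_dir = root / "data" / "agents" / agent_name
    agent_dir.mkdir(parents=True, exist_ok=True)
    for filename, content in render_memory_files(agent_name, now).items():
        path = agent_dir / filename
        if not path.exists():  # never clobber an existing agent's memory
            path.write_text(content, encoding="utf-8")
    return agent_dir
```

The existence check makes the script safe to re-run: an agent that already has memory keeps it, and only missing files are created.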