
What Is Context Engineering? A Practical Guide from Building 50 Production AI Agents

10 min read
AI · GitHub Copilot · Context Engineering · Multi-Agent Systems · Software Architecture

Most People Are Still Writing Prompts. The Real Skill Is Designing Context.

Here’s the uncomfortable truth about AI agent context: the model is rarely the bottleneck. The context is.

I’ve spent the last six months building what I call the “Rocha Family Home OS” — a platform of 50 autonomous AI agents and 71 reusable skills, all orchestrated by GitHub Copilot. These agents manage everything from family finances and meal planning to content publishing and home maintenance. They run on cron schedules, communicate across sessions, and maintain persistent memory.

And the single most important discipline I’ve developed isn’t prompt engineering. It’s context engineering — the art and science of designing what each agent sees, remembers, and acts on at every moment.

When Andrej Karpathy publicly advocated for “context engineering” over “prompt engineering,” he described it as “the delicate art and science of filling the context window with just the right information for the next step.” That framing changed how I build systems. I wrote about the theoretical foundations of context engineering earlier this year. This article is the production sequel — what context engineering actually looks like when you’re running 50 agents in the real world.

Why Context Engineering Matters More Than Prompt Engineering

Prompt engineering is writing a clever instruction for a single turn. Context engineering is designing the entire information architecture that determines what the model sees before it generates a single token — across many turns, many agents, and many sessions.

The distinction is critical once you move beyond chat:

| Dimension | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Scope | Single request | Entire agent lifecycle |
| Persistence | Stateless | Multi-session memory |
| Scale | One model, one turn | 50 agents, 71 skills, shared state |
| Approach | Craft a good question | Design what the model knows, remembers, and can do |

Anthropic’s engineering team defines context as “the set of tokens included when sampling from a large-language model” and the engineering challenge as “optimizing the utility of those tokens against the inherent constraints of LLMs.” In a multi-agent system, that challenge multiplies. Every agent has a different job, different tools, and different knowledge requirements. You can’t just write 50 good prompts — you have to design 50 context architectures.

Google’s Agent Development Kit blog frames it perfectly: context should be “a compiled view over a richer stateful system” — not a mutable string buffer you keep appending to. That’s exactly how my platform works.

The 4-Tier Memory System

The first context engineering pattern I’ll share is tiered memory. Every stateful agent in my platform uses a 4-tier memory system:

```
data/agents/{agent-name}/
├── core.md          # Tier 1 — identity, rules, heuristics (3-5KB)
├── working.md       # Tier 2 — current state, active context (5KB max)
├── long-term.md     # Tier 3 — historical patterns, lessons (10KB max)
└── events.log       # Tier 4 — append-only audit trail (unlimited)
```

The key insight: not all memory deserves context window space. Here’s how the tiers work:

  1. Tier 1 (core.md): identity, rules, and heuristics. Always loaded, and always first in the context window.
  2. Tier 2 (working.md): current state and active context. Loaded every run, capped at 5KB.
  3. Tier 3 (long-term.md): historical patterns and lessons. Loaded only when the agent needs deep history, capped at 10KB.
  4. Tier 4 (events.log): an append-only audit trail. Never loaded wholesale; it exists for auditing, not for context.

This tiered approach keeps each agent’s base context under 10KB while giving it access to deep history when needed. Without it, every agent would load its entire history every run — burning tokens on irrelevant context and hitting what Anthropic calls “context rot,” where recall accuracy degrades as context size grows.

Before tiered memory, my agents would occasionally “forget” rules from early in their instructions because those rules were buried under thousands of tokens of history. After implementing tiers, rule adherence became consistent because Tier 1 (core identity) is always at the top of the context window.
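To make the tiers concrete in code, here is a minimal sketch of the load/save cycle in Python. It assumes only the directory layout shown above; the helper names (load_base_context, load_history, log_event) are my illustration, not part of the platform itself.

```python
from datetime import datetime, timezone
from pathlib import Path

AGENT_ROOT = Path("data/agents")  # matches the directory layout above

def load_base_context(agent: str) -> str:
    """Tiers 1 and 2 only: the default context for every run."""
    root = AGENT_ROOT / agent
    core = (root / "core.md").read_text()        # Tier 1: identity, rules
    working = (root / "working.md").read_text()  # Tier 2: current state
    return f"{core}\n\n{working}"

def load_history(agent: str) -> str:
    """Tier 3 loads only when the agent needs deep history."""
    return (AGENT_ROOT / agent / "long-term.md").read_text()

def log_event(agent: str, event: str) -> None:
    """Tier 4 is append-only and never loaded into context wholesale."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(AGENT_ROOT / agent / "events.log", "a") as f:
        f.write(f"{stamp}\t{event}\n")
```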

Skills as Reusable Context Modules

The second pattern is skills — reusable context modules that any agent can invoke. I have 71 skills in .github/skills/, each a self-contained markdown file with YAML frontmatter:

```markdown
---
name: memory-management
description: >
  4-tier memory system management for all stateful agents —
  loading, saving, pruning, promoting, and maintaining memory files.
  Use when user says "load memory", "save memory", "update working
  memory", "prune memory", or any agent memory lifecycle activity.
---

# Memory Management Skill

Standard 4-tier memory system used by all stateful agents...
```

Skills are context engineering artifacts. Instead of embedding the same memory management instructions in all 50 agents, I write the pattern once as a skill and reference it: “Follow the memory-management skill.” When the agent invokes the skill, the right knowledge loads into its context window at exactly the right time.
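Here is a rough sketch of how a skill registry along these lines could work, assuming PyYAML for the frontmatter and that every skill file starts with a `---` block as shown above. The function names and index structure are illustrative; this is not how GitHub Copilot resolves skills internally.

```python
from pathlib import Path

import yaml  # assumes PyYAML is installed

SKILLS_DIR = Path(".github/skills")

def index_skills() -> dict[str, dict]:
    """Build a lightweight index from each skill's YAML frontmatter.
    Only names and descriptions sit in memory; bodies stay on disk."""
    index = {}
    for path in SKILLS_DIR.glob("*.md"):
        # "---\n<frontmatter>\n---\n<body>" splits into three parts
        _, frontmatter, _ = path.read_text().split("---", 2)
        meta = yaml.safe_load(frontmatter)
        index[meta["name"]] = {"description": meta["description"], "path": path}
    return index

def invoke_skill(index: dict[str, dict], name: str) -> str:
    """Pull the full skill body into context only at invocation time."""
    _, _, body = index[name]["path"].read_text().split("---", 2)
    return body.strip()
```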

This is the software engineering principle of DRY (Don’t Repeat Yourself), applied to AI context. Some of my skills: memory-management (the 4-tier lifecycle above), telegram-communication (messaging format and delivery), and copilot-brand-safety (brand and tone rules for content agents).

Before skills, I had formatting rules duplicated across 15+ agent definitions. When I needed to change the Telegram messaging format, I had to update every agent. After extracting the telegram-communication skill, I update one file and every agent gets the new behavior. Context engineering at scale is software engineering — you need abstractions.

Agent Instructions as Context Contracts

Every agent in my system has an .agent.md file that serves as its context contract — the complete definition of what this agent knows, owns, and can do:

```markdown
---
name: finance-manager
description: "Family Budget & Bills — owns budget tracking, bill
  payments, expense categorization, savings goals, and debt
  management for the Rocha family."
---

# Finance Manager — Rocha Family Budget & Bills

## Constitution
**Before doing ANYTHING else**, read the family constitution:
data/constitution.md

## Memory (4-Tier System) — see `memory-management` skill
**Load first:** core.md (Tier 1) + working.md (Tier 2)
**Save last:** Update working.md, append events.log

## Identity & Personality
You are the Rocha family's financial backbone. Practical,
no-nonsense, and protective of the family's money.

## Domain Ownership
### Budget Management
- Track all income and expenses
- Maintain monthly budgets
- Run budget-vs-actual reports
- Flag categories trending over budget at 50% and 80%

## Decision Framework
### Act Immediately (no confirmation needed)
- Log expenses and income
- Send bill reminders
### Ask First
- Major purchase decisions (>$200)
```

This structure is a context engineering pattern I call progressive context loading. The agent doesn’t get everything at once. It gets:

  1. Identity — Who am I? (loaded immediately)
  2. Constitution reference — What are the system-wide rules? (loaded on first action)
  3. Memory references — What do I currently know? (loaded per the memory skill)
  4. Domain ownership — What do I own? (always in context)
  5. Decision framework — What can I do autonomously? (always in context)
  6. Skill references — How do I do specific things? (loaded on demand)

This mirrors what Google’s engineering blog calls “separating storage from presentation” — the full knowledge base exists in files, but only the relevant slice compiles into each agent’s working context.
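In code terms, “compiling” a context under this contract could look something like the sketch below, reusing the memory helpers from the earlier sketch. The ordering mirrors the numbered list above; compile_context and its signature are my illustration.

```python
from pathlib import Path

def compile_context(agent: str, task: str, needs_history: bool = False) -> str:
    """Assemble only the relevant slice, in priority order:
    shared constitution, then identity and memory, then the task.
    Skills are deliberately absent here; they load on demand when invoked."""
    parts = [
        Path("data/constitution.md").read_text(),  # system-wide rules first
        load_base_context(agent),                  # Tiers 1 + 2 (see earlier sketch)
    ]
    if needs_history:
        parts.append(load_history(agent))          # Tier 3, only when needed
    parts.append(f"## Current task\n{task}")
    return "\n\n".join(parts)
```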

The Constitution Pattern

The most powerful context engineering pattern in my system is the constitution — a shared document that all production agents read before doing anything:

```markdown
# Rocha Family Constitution

*The foundational rules that govern ALL agents in this system.*

## Core Principles

1. **Task-First System.** Tasks are Hector's PRIMARY interface.
   Every actionable finding becomes a task.

2. **Act first, report after.** Detect → act → notify.
   Never say "would you like me to...?"

3. **No Assumptions — Clarification First.**
   NEVER fill knowledge gaps with assumptions.
   Create a clarification task instead of guessing.

4. **Child Location — SAFETY CRITICAL.**
   NEVER state a child's location as current fact.
```

The constitution solves a fundamental multi-agent problem: behavioral consistency. Without it, each agent develops its own interpretation of how to communicate, when to act, and what to escalate. With a shared constitution, all agents follow the same core principles while maintaining their individual domain expertise.

I wrote about why monolithic prompts fail in Your God Prompt Is the New Monolith. The constitution is the opposite pattern — shared governance without shared monolithic context. Each agent reads the constitution, internalizes the principles, then operates independently within its domain.

Practical Before/After Examples

Example 1: The Assumption Problem

Before context engineering:

Agent: “Leave at 5:15 PM for the NICU — it’s a 20-minute drive.”

The agent assumed I was at home. I was actually at the office, 45 minutes away. Following bad directions from an AI you trust is a real safety concern when you have premature twins in the NICU.

After adding the clarification-first principle to the constitution:

Agent: “I need your current location to calculate NICU departure time. Task created: ‘Where are you right now?’”

The agent now creates a clarification task instead of guessing. This single context engineering change — adding a “no assumptions” rule to the shared constitution — fixed the behavior across all 50 agents simultaneously.
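In my system this shows up as a small primitive any agent can reach for. A hypothetical sketch of what such a helper might look like, reusing log_event from the memory sketch; create_clarification_task and the data/tasks.md path are illustrative, not the platform’s actual interface:

```python
def create_clarification_task(agent: str, question: str) -> None:
    """Constitution rule 3 in code: never fill a knowledge gap with a
    guess. Surface the gap as a task for a human to answer."""
    log_event(agent, f"clarification-needed: {question}")
    with open("data/tasks.md", "a") as f:  # hypothetical task list location
        f.write(f"- [ ] ({agent}) {question}\n")
```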

Example 2: Skill Extraction

Before skills: My content-creative agent had 200+ lines of brand safety rules embedded directly. My blog-writer had a different version. My content-manager had yet another. When I needed to update the rules, I’d miss one and publish brand-inconsistent content.

After extracting the copilot-brand-safety skill: All three agents reference one canonical skill. The brand rules are in one place, always consistent, and any update propagates instantly. I covered this architectural pattern in my piece on the agentic development maturity curve — the experts return to simplicity by extracting reusable patterns.

Example 3: Memory Tier Optimization

Before tiered memory: My daily briefing agent loaded the full history of every briefing it had ever generated. By week three, its context was 80% stale historical data and 20% today’s actual briefing content. It started missing calendar events because relevant information was buried.

After implementing the 4-tier system: The briefing agent loads its core identity (Tier 1) and today’s state (Tier 2) — roughly 8KB total. Historical patterns (Tier 3) only load if it needs to reference a recurring issue. Result: briefings became more accurate and faster to generate because the agent’s attention was focused on what matters right now.
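The pruning and promoting that the memory-management skill describes can be as mechanical as enforcing the tier size caps. A sketch under the same file layout; the section-splitting heuristic and the keep_recent parameter are my simplification, not the skill itself:

```python
from pathlib import Path

AGENT_ROOT = Path("data/agents")
WORKING_CAP = 5 * 1024  # Tier 2 cap from the directory layout above

def prune_working_memory(agent: str, keep_recent: int = 5) -> None:
    """When working.md exceeds its cap, promote the oldest '##' sections
    into long-term.md rather than silently dropping them."""
    working = AGENT_ROOT / agent / "working.md"
    text = working.read_text()
    if len(text.encode()) <= WORKING_CAP:
        return  # under budget, nothing to prune
    header, *sections = text.split("\n## ")
    promoted, kept = sections[:-keep_recent], sections[-keep_recent:]
    if not promoted:
        return  # too few sections to promote safely
    with open(AGENT_ROOT / agent / "long-term.md", "a") as f:
        for section in promoted:
            f.write(f"\n## {section}")
    working.write_text(header + "".join(f"\n## {s}" for s in kept))
```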

The Context Engineering Checklist

If you’re building AI agents — whether it’s one or fifty — here’s the checklist I wish I’d had when I started:

Architecture

- Tier your memory: separate identity, working state, long-term patterns, and an append-only audit log.
- Extract instructions that repeat across agents into reusable skills.
- Give every agent a context contract: identity, domain ownership, decision framework, skill references.
- Put shared, non-negotiable rules in a constitution that every agent reads first.

Context Quality

- Keep each agent’s base context small (mine stays under 10KB).
- Load history and reference material on demand, not by default, to avoid context rot.
- Keep core identity and rules at the top of the context window so they never get buried.
- Replace assumptions with clarification tasks; never let an agent guess.

Production Operations

- Append every action to an audit log so you can reconstruct what an agent did and why.
- Change shared behavior in one place (a skill or the constitution) and let it propagate to every agent.
- Watch for drift: an agent “forgetting” early rules usually means its context has bloated.

The Bottom Line

Context engineering isn’t a buzzword — it’s the core discipline of building production AI systems. Jeremy Daly argues that context management and isolation are the ultimate determinants of reliability and cost at scale, and after building a 50-agent platform, I completely agree.

The patterns are straightforward: tier your memory, extract reusable skills, define agent contracts, and govern the system with a shared constitution. These aren’t theoretical frameworks — they’re the exact patterns running in production in my household right now, orchestrated by GitHub Copilot.

I’ve written about how I built this platform, how I applied context engineering to client work, and why I repurposed a coding agent as a life assistant. Context engineering is the thread that connects all of it.

Start small. Pick one agent. Give it an identity file, a working memory file, and a clear decision framework. Watch how much better it performs when it knows exactly what it is, what it knows, and what it’s allowed to do. Then scale from there.


Building with AI agents? I share context engineering patterns, production architectures, and lessons learned weekly. Follow me on LinkedIn or check out more on htek.dev.

