TEL301

Five Essentials of Agentic AI

Framework for designing, building, evaluating, and improving agentic AI systems based on five essential components: Agentic Harness, Unit of Work, Workflows, Memory, and Skills. Use this skill whenever someone is architecting an agentic system, evaluating whether an agent implementation is production-ready, designing an agentic development workflow, reviewing or auditing an existing agent's architecture, building a product that uses AI agents to do real work, or discussing what makes an agent different from a chatbot. Also trigger when users mention terms like 'agentic loop', 'agent architecture', 'AI agent design', 'agentic development', 'agent framework', or ask questions like 'what do I need to build an agent' or 'how do I make my agent production-ready'. This skill applies to both building agents AND doing development with agents — the same five essentials appear in both contexts.


This framework defines the five components required to build agentic AI systems that ship real work. Each essential builds on the previous one — together they form a complete system. The core insight is: the model is not the agent — the system is.

These essentials apply in two contexts:

  1. Building agents — designing systems where AI agents do work autonomously
  2. Doing development with agents — using agentic AI as a development tool (e.g. Claude Code)

Both contexts require the same five components. When advising on agentic systems, evaluate all five and identify which are missing or underdeveloped.


The Five Essentials

1. Agentic Harness

The runtime that manages the agentic loop, tool access, caching, and context compaction. This is the most commonly underestimated component — people focus on the model when they should focus on the harness.

What it does:

  • Runs the agentic loop: Plan → Act → Observe → Repeat
  • Controls which tools the agent can access and how they're invoked
  • Manages context window limits through caching and compaction
  • Handles error recovery and retry logic
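The loop and error handling above can be sketched in a few lines. This is a minimal illustration, not a real harness: `plan` and `act` are hypothetical stand-ins for the model call and tool dispatch, and the step budget stands in for smarter termination logic.

```python
# Minimal sketch of an agentic loop, assuming hypothetical plan/act callables;
# a real harness would call a model, dispatch real tools, and manage context.
from dataclasses import dataclass, field

@dataclass
class Harness:
    max_steps: int = 10                      # hard stop to prevent runaway loops
    history: list = field(default_factory=list)

    def run(self, goal, plan, act):
        for _ in range(self.max_steps):
            action = plan(goal, self.history)        # Plan
            if action is None:                       # planner signals completion
                return self.history
            try:
                observation = act(action)            # Act
            except Exception as exc:                 # error recovery: record, continue
                observation = f"tool error: {exc}"
            self.history.append((action, observation))  # Observe, then repeat
        return self.history                          # loop budget exhausted

# Usage: a toy planner that issues one action, then declares the goal met.
h = Harness()
result = h.run(
    goal="say hi",
    plan=lambda goal, hist: None if hist else "greet",
    act=lambda action: "hello",
)
```

Note the two termination paths: the planner returning `None`, and the step budget. A harness with only one of these tends to produce exactly the "stuck in loops" red flag described below.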

Key design questions:

  • How does the harness decide when to stop the loop?
  • What happens when a tool call fails?
  • How is the context window managed as work accumulates?
  • How are tool results cached to avoid redundant work?

Reference example: Claude Code is an agentic harness for software development. It manages tool access (file edit, bash, web search), handles context compaction automatically, runs the plan-act-observe loop, and integrates MCP servers for extensibility.

Red flags when this is missing or weak: The agent loses context mid-task, hits token limits unexpectedly, makes redundant tool calls, or gets stuck in loops without termination.


2. Unit of Work

The container that gives the agent scope, persistence, and the ability to finish real work. This is what separates a chatbot from a system that actually completes tasks.

Spectrum of complexity:

  Simple → Production
  • Chat session → Ticket / Job / Task
  • Starts and ends with conversation → Can span hours or days
  • Context is temporary → State is persistent
  • Work is ephemeral → Work is resumable, trackable, completable

Key design questions:

  • What defines the boundaries of one unit of work?
  • How does the agent know when the work is "done"?
  • Can work be paused and resumed?
  • How is progress tracked and reported?
  • What happens if the agent fails mid-unit?

The progression: Most teams start with chat sessions, but production systems need a more durable container. A ticket in a project management system, a job in a queue, or a task in a workflow engine — these give the agent something concrete to work against and a clear definition of done.
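A durable unit of work can be as simple as a record with a goal, a status, and a resumable trail of steps. The field names below are illustrative; in production this record would live in a database or job queue, not in memory.

```python
# Sketch of a durable unit of work, assuming hypothetical field names;
# the point is explicit status, a definition of done, and a resumable trail.
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    PAUSED = "paused"        # work can be suspended and resumed later
    DONE = "done"
    FAILED = "failed"

@dataclass
class UnitOfWork:
    id: str
    goal: str                                # what "done" means for this unit
    status: Status = Status.PENDING
    progress: list[str] = field(default_factory=list)  # trackable step history

    def record(self, step: str):
        self.status = Status.IN_PROGRESS
        self.progress.append(step)

    def complete(self):
        self.status = Status.DONE

# Usage: the agent works against a concrete ticket, not a chat transcript.
ticket = UnitOfWork(id="T-42", goal="migrate billing cron to queue")
ticket.record("audited existing cron jobs")
ticket.complete()
```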

Red flags when this is missing or weak: The agent can't do work that takes more than one session, there's no way to track what the agent has done, work gets lost when sessions end, or there's no concept of "completion."


3. Workflows & Commands

Predefined patterns that kick off the agentic loop with the right context. These are the playbooks that make agents repeatable and reliable.

The three-step pattern:

  1. Trigger — A command, event, or scheduled action initiates the workflow
  2. Context — The workflow loads relevant data, history, and constraints
  3. Execute — The agentic loop runs with clear goals and boundaries
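The trigger → context → execute pattern can be expressed as a small factory. `load_context` and `run_agent` are hypothetical hooks into your data layer and harness; the shape of the pattern is what matters.

```python
# Sketch of the trigger -> context -> execute workflow pattern; load_context
# and run_agent are assumed hooks, not an existing API.
def make_workflow(name, load_context, run_agent):
    def workflow(**params):                           # 1. Trigger: command or event
        context = load_context(**params)              # 2. Context: data + constraints
        return run_agent(goal=name, context=context)  # 3. Execute: run the loop
    return workflow

# Usage: a "triage bug" workflow parameterised by ticket id.
triage = make_workflow(
    "triage-bug",
    load_context=lambda ticket_id: {"ticket": ticket_id},
    run_agent=lambda goal, context: f"{goal} for {context['ticket']}",
)
result = triage(ticket_id="BUG-7")
```

Because workflows are just named functions over the loop, composition (one workflow calling another) falls out naturally.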

Key design questions:

  • What are the most common actions the agent needs to perform?
  • What context does each workflow need to load before starting?
  • How are workflows parameterised for different inputs?
  • Can users create their own workflows or only use predefined ones?
  • How do workflows compose (one workflow calling another)?

The insight: Without workflows, you're hoping the LLM figures out what to do from a vague prompt. Workflows encode your team's best practices into repeatable automation. They're the difference between "ask the AI to help" and "run this process."

Red flags when this is missing or weak: Every interaction starts from scratch, users get inconsistent results for the same type of task, there's no way to standardise common operations, or the agent requires extensive prompting to do routine work.


4. Memory

Not just "remember stuff" — memory must be self-learning, self-managing, and properly scoped. This is what compounds the agent's value over time. Agents without memory start from zero every time.

Three required properties:

Self-Learning — The memory system automatically updates from work the agent does. Every completed task, every correction, every observed pattern feeds back into what the agent knows. This shouldn't require manual curation — the system should learn from its own work.

Self-Managing — Memory needs to prune, prioritise, and organise itself. As the volume of remembered information grows, the system must decide what's still relevant, what can be compressed, and what should be forgotten. Unbounded memory becomes noise.

Properly Scoped — Different contexts need different memories. A well-designed memory system distinguishes between:

  • Personal — individual user preferences and history
  • Project — context specific to a piece of work
  • Organisation — shared knowledge across the team or company
  • Global — general knowledge applicable everywhere
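The scope hierarchy above can be sketched as a small store where the most specific scope wins on retrieval. This illustrates scope isolation and precedence only, not a production retrieval strategy (vector search, structured lookup, and so on are discussed below).

```python
# Minimal scoped memory store, assuming these four scope names; retrieval
# prefers the most specific scope when the same key exists at several levels.
class ScopedMemory:
    SCOPES = ("personal", "project", "organisation", "global")  # specific -> general

    def __init__(self):
        self.store = {scope: {} for scope in self.SCOPES}

    def remember(self, scope, key, value):
        if scope not in self.store:
            raise ValueError(f"unknown scope: {scope}")
        self.store[scope][key] = value

    def recall(self, key):
        # Most specific scope wins; scopes never silently merge.
        for scope in self.SCOPES:
            if key in self.store[scope]:
                return self.store[scope][key]
        return None

# Usage: a personal preference overrides an organisation-wide default.
mem = ScopedMemory()
mem.remember("global", "tone", "formal")
mem.remember("personal", "tone", "casual")
preference = mem.recall("tone")
```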

Key design questions:

  • How does the agent learn from completed work without explicit instruction?
  • What triggers memory consolidation and cleanup?
  • How are scope boundaries enforced (preventing project A's context from leaking into project B)?
  • What's the retrieval strategy (vector search, structured lookup, hybrid)?
  • How do you handle contradictions between old and new information?

Red flags when this is missing or weak: The agent asks the same questions repeatedly, doesn't improve at recurring tasks, bleeds context between unrelated projects, or requires users to manually maintain its knowledge base.


5. Skills

Reusable, testable capabilities the agent draws on, with a built-in feedback loop for continuous improvement. This is where organisational knowledge gets encoded and where the system gets better with use rather than staying static.

Skill properties:

  • System-wide — shared across the organisation, not locked to one user or project
  • Versioned — track changes and roll back when a skill regresses
  • Testable — validate against known scenarios before deploying
  • Composable — combine skills for complex tasks (a "write report" skill might use a "research" skill and a "format document" skill)
  • Self-improving — feedback from usage drives refinement

The continuous improvement loop:

  1. Deploy — Ship the skill into the system
  2. Observe — Monitor how the agent uses it and what outcomes it produces
  3. Evaluate — Measure quality against success criteria
  4. Refine — Update the skill based on what was learned
  5. Return to step 1
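The deploy → observe → evaluate → refine loop maps onto a versioned registry. The `Skill` and `SkillRegistry` names below are illustrative, not an existing library; note how refinement produces a new version rather than mutating the old one, which is what makes rollback possible.

```python
# Sketch of a versioned skill registry with a feedback loop; Skill and
# SkillRegistry are hypothetical names, not a real package.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    version: int
    run: callable
    scores: list = field(default_factory=list)       # outcomes observed in use

class SkillRegistry:
    def __init__(self):
        self.skills = {}          # name -> version history (enables rollback)

    def deploy(self, skill):                         # 1. Deploy
        self.skills.setdefault(skill.name, []).append(skill)

    def latest(self, name):
        return self.skills[name][-1]

    def feedback(self, name, score):                 # 2-3. Observe and evaluate
        self.latest(name).scores.append(score)

    def refine(self, name, new_run):                 # 4. Refine -> back to deploy
        old = self.latest(name)
        self.deploy(Skill(old.name, old.version + 1, new_run))

# Usage: poor feedback on v1 drives a refined v2; v1 stays available.
reg = SkillRegistry()
reg.deploy(Skill("summarise", 1, lambda text: text[:10]))
reg.feedback("summarise", 0.4)
reg.refine("summarise", lambda text: text[:20])
```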

Key design questions:

  • How are skills discovered and loaded by the agent?
  • What's the mechanism for skill authors to publish and share?
  • How do you measure whether a skill is working well?
  • What prevents skill bloat (too many skills degrading selection quality)?
  • How do skills handle edge cases they weren't designed for?

The analogy: Think of skills like packages in a package manager (npm, pip), but at the knowledge layer rather than the code layer. They're the reusable units that encode "how to do X well" and improve over time.

Red flags when this is missing or weak: The agent is equally mediocre at everything, there's no way to encode domain expertise, improvements aren't captured for reuse, or the system doesn't get better with use.


How They Fit Together

The five essentials form a stack, each building on the one below:

┌─────────────────────────────────┐
│  Skills — Give it capability    │
├─────────────────────────────────┤
│  Memory — Give it context       │
├─────────────────────────────────┤
│  Workflows — Tell it what to do │
├─────────────────────────────────┤
│  Unit of Work — Give it scope   │
├─────────────────────────────────┤
│  Harness — Run the loop         │
└─────────────────────────────────┘

The harness runs the loop. The unit of work gives it boundaries. Workflows tell it what to do. Memory gives it context. Skills give it capability.


Using This Framework

For Architecture Reviews

When reviewing an agentic system, evaluate each essential on a maturity scale:

  • Missing — Not present at all
  • Ad hoc — Present but informal, inconsistent, or manual
  • Defined — Deliberately designed with clear interfaces
  • Managed — Monitored, measured, and actively maintained
  • Optimising — Self-improving with feedback loops

A system doesn't need all five at "Optimising" to be useful, but any essential at "Missing" is a significant gap. Start by getting everything to "Defined" — that's where most of the value unlocks.
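A review against this scale can be mechanised as a simple scoring pass. The function below is a hypothetical sketch; the levels and the "Defined" target come straight from the framework above.

```python
# Hypothetical scoring pass over the maturity scale; levels are ordered
# least to most mature, and "Defined" is the recommended first target.
LEVELS = ["Missing", "Ad hoc", "Defined", "Managed", "Optimising"]

def review(scores):
    """scores maps each essential to one of LEVELS."""
    critical = [name for name, lvl in scores.items() if lvl == "Missing"]
    below_defined = [name for name, lvl in scores.items()
                     if LEVELS.index(lvl) < LEVELS.index("Defined")]
    return {"critical_gaps": critical, "work_toward_defined": below_defined}

# Usage: a typical early-stage system, strong on the harness, weak elsewhere.
report = review({
    "Harness": "Managed",
    "Unit of Work": "Defined",
    "Workflows": "Ad hoc",
    "Memory": "Missing",
    "Skills": "Ad hoc",
})
```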

For New Projects

Start with the harness and unit of work — these are the foundation. You can build a useful system with just these two (many chat-based agents operate here). Add workflows when you find yourself repeatedly setting up the same context. Add memory when you notice the agent re-learning things it should already know. Add skills when you have domain expertise worth encoding for reuse.

For Evaluating Tools and Platforms

When evaluating agentic AI tools or platforms, use the five essentials as a checklist. Most tools are strong on the harness and weak on everything else. The differentiation happens in how well they handle units of work, workflows, memory, and skills.

See references/evaluation-checklist.md for a detailed evaluation template.

TEL301 — Five Essentials of Agentic AI | Telos - Public | Skillbook