This is part of a 3-post series on AI infrastructure for GTM:
1. Context Graphs - The data foundation (memory, world model)
2. Agent Harness - The coordination infrastructure (policies, audit trails) (you are here)
3. Long Horizon Agents - The capability that emerges when you have both
Everyone's building AI agents. Almost no one's building the infrastructure to run them.
An agent harness is the infrastructure layer that provides AI agents with shared context, coordination rules, and audit trails. Without one, your agents will fail 3-15% of the time, contradict each other, and operate as black boxes you can't debug. We run 9 AI agents in production every day at Warmly. Here's what we learned about building the harness that makes them reliable.
The market is obsessed with making agents smarter. But intelligence isn't the bottleneck. Infrastructure is.
Quick Answer: Agent Harness Components by Use Case
Best for multi-agent coordination: Event-based routing with Temporal workflows - prevents agents from colliding or duplicating work.
Best for decision auditability: Decision ledger with full traces - every agent decision logged with reasoning, confidence scores, and context snapshots.
Best for context management: Unified context graph - single source of truth across CRM, intent signals, and website activity.
Best for policy enforcement: YAML-based policy engine - define rules once, enforce across all agents.
Best for continuous improvement: Outcome loop - link decisions to business results (meetings booked, deals closed) and learn from patterns.
Best for GTM teams getting started: Warmly's AI Orchestrator - production-ready agent harness with 9 workflows already built.
The Problem Nobody Talks About
Here's a stat that should worry you: tool calling - the mechanism by which AI agents actually do things - fails 3-15% of the time in production. That's not a bug. That's the baseline for well-engineered systems (Gartner 2025).
And it gets worse. According to RAND Corporation, over 80% of AI projects fail—twice the failure rate of non-AI technology projects. Gartner predicts 40%+ of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.
Why? Because most teams focus on the wrong problem.
They're fine-tuning prompts. Switching models. Adding more tools. But the agents keep failing in production because there's no infrastructure holding them together. (For more on what works, see our guide to agentic AI orchestration.)
Think about it this way: You wouldn't deploy a fleet of microservices without Kubernetes. You wouldn't run a data pipeline without Airflow. But somehow, we're deploying fleets of AI agents with nothing but prompts and prayers.
That's where the agent harness comes in.
What is an Agent Harness?
An agent harness is the infrastructure layer between your AI agents and the real world. It does three things:
- Context: Gives every agent access to the same unified view of reality
- Coordination: Ensures agents don't contradict or duplicate each other
- Constraints: Enforces policies and creates audit trails for every decision
The metaphor is intentional. A harness doesn't slow down a horse - it lets the horse pull. Same principle. A harness doesn't limit your agents. It gives them the structure they need to actually work.
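To make the three C's concrete, here's a minimal sketch of what a harness surface might look like in TypeScript. Every name and shape here is illustrative, not Warmly's actual API:

```typescript
// Hypothetical harness interface - names and shapes are illustrative,
// not Warmly's actual API.
type ProposedAction = { type: string; targetId: string; params?: Record<string, unknown> };
type PolicyResult = { allowed: boolean; reason: string; needsHumanReview: boolean };
type EntityContext = { entityId: string; facts: Record<string, unknown> };
type DecisionTrace = { decisionId: string; agentId: string; action: ProposedAction; reasoning: string };

interface AgentHarness {
  // Context: every agent reads the same unified view of reality
  getContext(entityId: string): Promise<EntityContext>;
  // Coordination: consult the shared event stream before acting
  canAct(agentId: string, action: ProposedAction): Promise<boolean>;
  // Constraints: evaluate policies, then log the full decision trace
  checkPolicy(action: ProposedAction): Promise<PolicyResult>;
  recordDecision(trace: DecisionTrace): Promise<void>;
}
```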
Without a harness, you get what I call the "demo-to-disaster" gap. Your agent works perfectly in a notebook. Then you deploy it, and within a week:
- Agent A sends an email. Agent B sends a nearly identical email two hours later.
- A customer asks "why did you reach out?" and nobody knows.
- Your agents burn through your entire TAM before anyone notices the personalization is broken.
With a harness, you get agents that operate like a coordinated team instead of a bunch of interns who've never met. This is the foundation of what we call agentic automation - AI that can actually run autonomously in production.
Why AI Agents Fail in Production (The Real Reasons)
Let me be specific about why agents fail. This isn't theoretical. We've seen all of these.
Failure Mode 1: Context Rot
Here's something the model spec sheets don't tell you: models effectively utilize only 8K-50K tokens regardless of what the context window promises. Information buried in the middle shows 20% performance degradation. Approximately 70% of tokens you're paying for provide minimal value (Princeton KDD 2024).
This is called "context rot." Your agent has access to everything, but can actually use almost nothing.
The fix isn't a bigger context window. It's better context engineering - giving the agent exactly what it needs, when it needs it, in a format it can actually use.
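As a sketch of what context engineering can mean in practice (the decay heuristic and names are assumptions, not a specific library): rank candidate facts by relevance and recency, then pack only the top few into the prompt.

```typescript
// Context-packing sketch - the decay heuristic is an assumption, not a
// prescribed method. The point is selection, not a bigger window.
type Fact = { text: string; relevance: number; ageDays: number };

function packContext(facts: Fact[], maxFacts = 10): string {
  return facts
    .map(f => ({ ...f, score: f.relevance / (1 + f.ageDays / 30) })) // decay stale facts
    .sort((a, b) => b.score - a.score)
    .slice(0, maxFacts) // give the agent only what it can actually use
    .map(f => `- ${f.text}`)
    .join("\n");
}
```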
Failure Mode 2: Agent Collision
This is the second-order problem that kills most multi-agent systems. You deploy Agent A to send LinkedIn messages. Agent B to send emails. Agent C to update the CRM. Each agent works perfectly in isolation. (This is exactly the problem that AI sales automation tools need to solve.)
Then Agent A messages a prospect at 9am. Agent B emails the same prospect at 11am. Agent C marks them as "contacted" but doesn't know which agent did what. The prospect gets annoyed. Your brand looks like a spam operation.
The agents aren't broken. They just have no idea what the others are doing.
Failure Mode 3: Black Box Decisions
A prospect asks: "Why did your AI reach out to me?"
If you can't answer that question with specifics - what signals the agent saw, what rules it applied, why it chose this action over alternatives - you have a black box problem.
Black boxes are fine for demos. They're disasters for production. You can't debug what you can't see. You can't improve what you can't measure. And you definitely can't explain to your legal team why the AI sent that message.
The Agent Harness Architecture
Here's the architecture we use to run 9 production agents at Warmly. It has four layers.
Layer 1: The Context Graph
A context graph is a unified data layer that gives every agent the same view of reality.
Most companies have their data scattered across a dozen systems. Intent signals in one tool. CRM data in another. Website activity somewhere else. Each agent has to query multiple APIs, stitch together partial views, and hope nothing changed in between.
That's a recipe for inconsistent decisions. Our context graph unifies three databases:
- Terminus (port 5444): Company data, buying committees, ICP tiers, audience memberships
- Warm Opps (port 5441): Website sessions, chat messages, intent signals, page visits
- HubSpot: Deal stages, contact properties, activity history
This unified view is what enables person-based signals - knowing not just which company visited, but who specifically and what they care about.
Every agent queries the same graph. When Agent A looks up a company, it sees the same data Agent B would see. No API race conditions. No stale caches. One source of truth.
The graph has four sub-layers:
Entity Layer: Core objects linked together
- Company → People → Employments → Buying Committee
- Signals → Sessions → Page Visits → Intent Scores
Ledger Layer: Immutable event stream (the "why" behind everything)
- Activity events: website_visit, email_sent, meeting_booked
- Signal events: new_hire, job_posting, bombora_surge
- State snapshots: intent_score_computed, icp_tier_assigned
Policy Layer: Rules that govern agent behavior
- "Only reach out if intent_score > 50 AND icp_tier IN ['Tier 1', 'Tier 2']"
- "Never contact accounts with active deals in Negotiation stage"
API Layer: Unified interface for all agents
- GET: getCompanyContext(), getBuyingCommittee(), getPriorityRanking()
- POST: syncToCRM(), addToLinkedInAds(), sendEmail()
- OBSERVE: onEvent(), recordDecision(), recordOutcome()
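Here's a minimal client sketch for that API layer. Method names mirror the list above; the request and response shapes are assumptions for illustration:

```typescript
// Hypothetical client for the unified API layer. Method names mirror the
// list above; all shapes are illustrative assumptions.
interface CompanyContext {
  domain: string;
  icpTier: "Tier 1" | "Tier 2" | "Tier 3" | null;
  intentScore: number;
  recentSignals: string[];
}

interface ContextGraphClient {
  // GET: read the same unified view every other agent sees
  getCompanyContext(domain: string): Promise<CompanyContext>;
  getBuyingCommittee(domain: string): Promise<string[]>;
  getPriorityRanking(organizationId: string): Promise<string[]>;
  // POST: side effects routed through one interface
  syncToCRM(domain: string, fields: Record<string, unknown>): Promise<void>;
  sendEmail(to: string, templateId: string): Promise<void>;
  // OBSERVE: decisions and outcomes land in the ledger
  recordDecision(trace: Record<string, unknown>): Promise<void>;
  recordOutcome(decisionId: string, event: string): Promise<void>;
}
```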
Layer 2: The Policy Engine
Policies are rules that constrain what agents can do.
This sounds limiting. It's actually liberating. When agents know their boundaries, they can operate with more autonomy inside those boundaries.
Here's what a policy looks like:
```yaml
policy:
  name: "outbound-qualification"
  version: "2.3"
  conditions:
    - field: "icpTier"
      operator: "in"
      value: ["Tier 1", "Tier 2"]
    - field: "intentScore"
      operator: "gte"
      value: 50
    - field: "dealStage"
      operator: "not_in"
      value: ["Negotiation", "Contracting", "Closed Won"]
  actions:
    allowed:
      - "send_email"
      - "add_to_salesflow"
      - "add_to_linkedin_audience"
    blocked:
      - "create_deal"
      - "update_deal_stage"
  human_review_threshold: 0.6
```
The policy engine evaluates every agent action against applicable policies before execution. If an action violates a policy, it's blocked. If confidence is below the review threshold, it's queued for human approval.
This is how you deploy agents without worrying they'll burn through your TAM or message the CEO of your biggest customer. (If you're evaluating AI SDR agents, this is the first thing to check: what policies can you set?)
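To show how an engine like this can work, here's a simplified evaluator over the same condition shape as the YAML above. It's a sketch under assumed types, not Warmly's implementation:

```typescript
// Simplified policy evaluator sketch - illustrative, not Warmly's engine.
type Condition = { field: string; operator: "in" | "not_in" | "gte"; value: unknown };
type Fields = Record<string, unknown>;

function conditionHolds(c: Condition, fields: Fields): boolean {
  const v = fields[c.field];
  switch (c.operator) {
    case "in":     return (c.value as unknown[]).includes(v);
    case "not_in": return !(c.value as unknown[]).includes(v);
    case "gte":    return typeof v === "number" && v >= (c.value as number);
  }
}

// Returns the verdict for a proposed action, mirroring the flow described above.
function evaluatePolicy(
  policy: { conditions: Condition[]; allowed: string[]; humanReviewThreshold: number },
  fields: Fields,
  action: string,
  confidence: number
): "approved" | "blocked" | "human_review" {
  if (!policy.conditions.every(c => conditionHolds(c, fields))) return "blocked";
  if (!policy.allowed.includes(action)) return "blocked";
  if (confidence < policy.humanReviewThreshold) return "human_review";
  return "approved";
}
```

Run against the outbound-qualification policy above, a Tier 1 account with an intent score of 72 and no active deal comes back approved; a confidence below the 0.6 threshold would route to human review instead.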
Layer 3: The Decision Ledger
Every agent decision gets recorded. Not just what happened - why it happened. Here's what a decision trace looks like:
```json
{
  "decisionId": "dec_7f8a9b2c",
  "timestamp": "2026-01-17T14:32:18Z",
  "agent": "lead-list-builder",
  "workflowId": "manual-list-sync-a0396ff9-1737135132975",
  "decisionType": "reach_out",
  "reasoning": {
    "summary": "High intent Tier 1 account with active buying committee, no recent outreach",
    "factors": [
      {"factor": "intentScore", "value": 72, "weight": 0.3, "contribution": "high"},
      {"factor": "icpTier", "value": "Tier 1", "weight": 0.25, "contribution": "high"},
      {"factor": "buyingCommitteeSize", "value": 4, "weight": 0.2, "contribution": "medium"},
      {"factor": "daysSinceLastContact", "value": 45, "weight": 0.15, "contribution": "high"},
      {"factor": "dealStage", "value": null, "weight": 0.1, "contribution": "neutral"}
    ],
    "confidence": 0.85
  },
  "contextSnapshot": {
    "company": "acme.com",
    "intentScore": 72,
    "icpTier": "Tier 1",
    "buyingCommittee": ["Sarah Chen (CRO)", "Mike Davis (RevOps)", "Lisa Park (VP Sales)"],
    "recentSignals": ["pricing_page_visit", "competitor_research", "new_sales_hire"]
  },
  "policyApplied": {
    "policyId": "outbound-qualification",
    "version": "2.3",
    "result": "approved"
  },
  "action": {
    "type": "add_to_sdr_list",
    "parameters": {
      "listId": "high-intent-2026-01-17",
      "assignedSDR": "[email protected]",
      "priority": "high"
    }
  },
  "methodology": {
    "approach": "Weighted scoring against closed-won deal patterns",
    "dataSourcesQueried": ["terminus", "warm_opps", "hubspot"],
    "modelUsed": "internal-scoring-v3",
    "tokensConsumed": 0
  }
}
```
When someone asks "why did we reach out to Acme?", you can pull up the exact decision trace. You can see the intent score was 72, the account was Tier 1, they had 4 buying committee members identified, and they hadn't been contacted in 45 days.
That's not a black box. That's a transparent, auditable decision system.
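As a sketch of how you might query such a ledger (the trace shape follows the JSON above; the in-memory store is a stand-in for whatever append-only storage you use):

```typescript
// Ledger query sketch - in-memory store stands in for an append-only DB.
type Trace = {
  decisionId: string;
  agent: string;
  reasoning: { summary: string; confidence: number };
  contextSnapshot: { company: string } & Record<string, unknown>;
};

const ledger: Trace[] = [];

function recordDecision(trace: Trace): void {
  ledger.push(trace); // in production: immutable, append-only writes
}

// "Why did we reach out to acme.com?" - answered from the ledger, not memory.
function whyDidWeContact(company: string): string[] {
  return ledger
    .filter(t => t.contextSnapshot.company === company)
    .map(t => `${t.agent}: ${t.reasoning.summary} (confidence ${t.reasoning.confidence})`);
}
```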
Layer 4: The Outcome Loop
The decision ledger captures what the agent decided. The outcome loop captures what actually happened.
```json
{
  "decisionId": "dec_7f8a9b2c",
  "outcomes": [
    {
      "timestamp": "2026-01-18T09:15:00Z",
      "event": "email_sent",
      "details": {"to": "[email protected]", "template": "high-intent-cro"}
    },
    {
      "timestamp": "2026-01-19T14:22:00Z",
      "event": "email_opened",
      "details": {"opens": 3}
    },
    {
      "timestamp": "2026-01-22T11:00:00Z",
      "event": "meeting_booked",
      "details": {"type": "demo", "attendees": 2}
    }
  ],
  "businessOutcome": {
    "result": "opportunity_created",
    "value": 45000,
    "daysToOutcome": 5
  }
}
```
Now you can answer the question: "Did that decision work?"
Over time, this creates a feedback loop. You can see which factors actually correlate with meetings booked. You can adjust the weights. You can A/B test policies. The system gets smarter because it learns from its own decisions.
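What "learning from its own decisions" can look like, as a toy pass (a naive success-rate comparison, not Warmly's actual learning pipeline): join decision traces to outcomes and see which high-contribution factors actually co-occur with booked meetings.

```typescript
// Toy feedback loop: which high-contribution factors co-occur with meetings?
// A naive rate comparison for illustration, not a real learning pipeline.
type DecisionOutcome = {
  factors: { factor: string; contribution: "high" | "medium" | "neutral" }[];
  meetingBooked: boolean;
};

function successRateByFactor(history: DecisionOutcome[]): Map<string, number> {
  const stats = new Map<string, { wins: number; total: number }>();
  for (const d of history) {
    for (const f of d.factors) {
      if (f.contribution !== "high") continue; // only count strong signals
      const s = stats.get(f.factor) ?? { wins: 0, total: 0 };
      s.total += 1;
      if (d.meetingBooked) s.wins += 1;
      stats.set(f.factor, s);
    }
  }
  return new Map([...stats].map(([factor, s]) => [factor, s.wins / s.total]));
}
```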
How We Coordinate 9 Agents Without Chaos
Running one agent is easy. Running nine agents that don't step on each other? That's where most teams fail.
Here's our approach.
The Second-Order Problem
When you have multiple agents operating in parallel, each agent makes locally optimal decisions that can be globally suboptimal.
Agent A sees high intent and sends an email.
Agent B sees high intent and adds them to a LinkedIn campaign.
Agent C sees the email was sent and updates the CRM.
Each agent did the right thing based on its view. But the prospect just got hit with three touches in 24 hours. That's not orchestration. That's spam.
This is the second-order problem: agents lose context of each other.
The Solution: Event-Based Coordination
We use Temporal for workflow orchestration. Every agent action publishes to a shared event stream. A routing layer watches the stream and prevents collisions.
```typescript
// Imports follow Temporal's standard workflow pattern; the module paths and
// timeout value here are illustrative, not our exact configuration.
import { proxyActivities } from '@temporalio/workflow';
import type * as activityTypes from './activities';
import type { GTMAgentConfig, GTMAgentResult } from './types';

const activities = proxyActivities<typeof activityTypes>({
  startToCloseTimeout: '5 minutes',
});

export async function gtmDailyWorkflow(input: {
  organizationId: string;
  config: GTMAgentConfig;
}): Promise<GTMAgentResult> {
  // Step 1: Identify high-intent accounts
  const highIntent = await activities.identifyHighIntentAccounts({
    organizationId: input.organizationId,
    lookbackDays: 7,
    minIntentScore: 50
  });

  // Step 2: Filter by policies (CRM status, recent contact, etc.)
  const qualified = await activities.applyQualificationPolicies({
    accounts: highIntent,
    policies: ['no-active-deals', 'no-recent-outreach', 'icp-tier-filter']
  });

  // Step 3: Get buying committees (parallel execution)
  const withCommittees = await Promise.all(
    qualified.map(account =>
      activities.getBuyingCommittee({
        domain: account.domain,
        organizationId: input.organizationId
      })
    )
  );

  // Step 4: Route to appropriate channels (with coordination)
  const routingDecisions = await activities.routeToChannels({
    accounts: withCommittees,
    availableChannels: ['email', 'linkedin', 'linkedin_ads'],
    coordinationRules: {
      maxTouchesPerDay: 1,
      channelCooldown: { email: 72, linkedin: 48 }, // hours
      requireDifferentChannels: true
    }
  });

  // Step 5: Execute actions (parallel, with rate limiting)
  const results = await activities.executeRoutedActions({
    decisions: routingDecisions,
    recordDecisionTraces: true
  });

  // Step 6: Sync outcomes to CRM
  await activities.syncToCRM({
    results,
    updateFields: ['last_contact_date', 'outreach_channel', 'agent_decision_id']
  });

  return {
    accountsProcessed: qualified.length,
    actionsExecuted: results.filter(r => r.success).length,
    decisionsRecorded: results.length
  };
}
```
The coordination rules are explicit:
- Max 1 touch per day per account
- 72-hour cooldown after email before another email
- 48-hour cooldown after LinkedIn
- Require different channels if multiple touches
The routing layer enforces these rules across all agents. Agent B can't send a LinkedIn message if Agent A sent an email 6 hours ago—the coordination layer blocks it.
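Here's a minimal sketch of that enforcement check, roughly what a routing layer like routeToChannels would run before any agent touches an account. The event shape and constants are illustrative:

```typescript
// Cooldown enforcement sketch - event shape and constants are illustrative.
type TouchEvent = { accountId: string; channel: "email" | "linkedin"; timestamp: number };

const COOLDOWN_HOURS = { email: 72, linkedin: 48 } as const;
const HOUR_MS = 3_600_000;

function canTouch(events: TouchEvent[], accountId: string, channel: TouchEvent["channel"], now: number): boolean {
  const recent = events.filter(e => e.accountId === accountId);
  // Max 1 touch per day per account, across all agents and channels
  if (recent.some(e => now - e.timestamp < 24 * HOUR_MS)) return false;
  // Same-channel cooldown before that channel can fire again
  return !recent.some(e => e.channel === channel && now - e.timestamp < COOLDOWN_HOURS[channel] * HOUR_MS);
}
```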
What This Looks Like in Practice
We run 9 workflows in production:
| Workflow | Trigger | What It Does |
|---|---|---|
| listSyncWorkflow | Hourly schedule | Syncs audience memberships to HubSpot |
| manualListSyncWorkflow | On-demand | Triggered list syncs for specific audiences |
| buyingCommitteeWorkflow | New high-intent account | Identifies decision makers, champions, influencers (see [AI Data Agent](/p/ai-agents/ai-data-agent)) |
| buyingCommitteePersonaFinderProcessingWorkflow | New company in ICP | Finds people matching buyer personas |
| buyingCommitteePersonaClassificationProcessingWorkflow | New person identified | Classifies persona (CRO, RevOps, etc.) |
| webResearchWorkflow | New target account | Researches company context for personalization |
| leadListBuilderWorkflow | Daily 6am | Builds prioritized SDR target lists (powers [AI Outbound](/p/blog/ai-outbound-sales-tools)) |
| linkedInAudienceWorkflow | New qualified contact | Adds contacts to LinkedIn Ads audiences |
| crmSyncWorkflow | Any outreach action | Updates HubSpot with agent activities |
All 9 workflows query the same context graph. All 9 publish to the same event stream. All 9 are constrained by the same policies.
That's how you get coordination without chaos.
Agent Harness vs. No Harness: What Changes
| Scenario | Without Harness | With Harness |
|---|---|---|
| **Agent A emails prospect** | No record of context or reasoning | Full decision trace: signals seen, policy applied, confidence score |
| **Agent B wants to message same prospect** | Has no idea Agent A already reached out | Sees Agent A's action in event stream, waits for cooldown |
| **Prospect asks "why did you contact me?"** | "Uh... our AI thought you'd be interested?" | "You visited our pricing page 3 times, matched our ICP, and your company just hired a new sales leader" |
| **Agent makes bad decision** | Black box—can't debug | Full trace—see exactly what went wrong |
| **New policy needed** | Update prompts across all agents | Update policy once, all agents comply |
| **Want to A/B test approach** | Manual tracking in spreadsheets | Built-in—compare outcomes by policy version |
When You Need a Harness (And When You Don't)
Let me be honest: not everyone needs this. You probably don't need a harness if:
- You have one agent doing one thing
- The agent doesn't make autonomous decisions
- You're in demo/prototype phase
- The cost of failure is low
You definitely need a harness if:
- You have multiple agents that could interact
- Agents make decisions that affect customers
- You need to explain decisions to stakeholders (legal, customers, executives)
- You want agents to improve over time
- The cost of failure is high (brand damage, TAM burn, compliance risk)
For most GTM teams, the answer is: you need a harness sooner than you think. (Not sure where to start? Check out our guide to AI for RevOps.)
The moment you deploy a second agent, you have a coordination problem. The moment an agent contacts a customer, you have an auditability requirement. The moment you want to improve performance, you need outcome tracking.
Build vs. Buy: What an Agent Harness Actually Costs
Let's talk numbers. Building an agent harness in-house is a significant investment.
Build It Yourself
| Component | Engineering Time | Ongoing Cost |
|---|---|---|
| Context graph (unified data layer) | 2-3 months | $2-5K/mo infrastructure |
| Event stream + coordination | 1-2 months | $500-2K/mo (Kafka/Redis) |
| Policy engine | 1-2 months | Minimal |
| Decision ledger | 1 month | $500-1K/mo (storage) |
| Outcome tracking + analytics | 1-2 months | $500-1K/mo |
| Workflow orchestration (Temporal) | 1 month | $500-2K/mo |
| **Total** | **8-12 months** | **$4-11K/mo** |
Plus: 1-2 senior engineers dedicated to maintenance, debugging, and improvements. At $200K+ fully loaded, that's $17-33K/mo in labor alone.
Realistic all-in cost to build: $250-500K first year, $150-300K/year ongoing.
Buy a Platform
Most enterprise agent platforms with harness capabilities:
| Platform Type | Annual Cost | What You Get |
|---|---|---|
| Point solutions (single agent) | $10-25K/yr | One agent, limited coordination |
| Mid-market platforms | $25-75K/yr | 2-4 agents, basic orchestration |
| Enterprise ABM/intent (6sense, Demandbase) | $100-200K/yr | Intent data + some automation |
| Full agent harness (Warmly) | [$10-25K/yr](/p/pricing) | 4+ agents, full orchestration, decision traces |
The math: If you have a RevOps or data engineering team that can dedicate 8+ months to building infrastructure, building might make sense. If you need agents in production in weeks, buy.
When Building Makes Sense
- You have unique data sources no platform supports
- You need custom compliance/audit requirements
- You have 3+ engineers who can dedicate 50%+ time
- You're already running Temporal or similar orchestration
When Buying Makes Sense
- You need results in weeks, not months
- Your team is <20 people (can't afford dedicated infra engineers)
- You want to focus on GTM strategy, not infrastructure
- You need proven coordination patterns (not experimenting)
Getting Started: The Minimum Viable Harness
You don't need to build all four layers on day one. Here's how to start:
Week 1: Unified Context
- Pick your 2-3 critical data sources
- Build a single API that queries all of them
- Every agent calls this API instead of querying sources directly
Week 2: Event Stream
- Every agent action publishes an event
- Events include: agent ID, action type, target (company/person), timestamp
- Simple coordination rule: block duplicate actions within N hours (see the sketch below)
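A sketch of that week-2 rule (the event shape is an assumption; back the stream with any shared store all your agents can read):

```typescript
// Week 2 sketch: publish every action, block duplicates within N hours.
// Event shape is an assumption; back it with Redis, Postgres, whatever.
type AgentEvent = { agentId: string; actionType: string; targetId: string; ts: number };

const stream: AgentEvent[] = [];

function tryPublish(event: AgentEvent, windowHours: number): boolean {
  const windowMs = windowHours * 3_600_000;
  const duplicate = stream.some(e =>
    e.targetId === event.targetId &&
    e.actionType === event.actionType &&
    event.ts - e.ts < windowMs
  );
  if (duplicate) return false; // another agent already did this recently
  stream.push(event);
  return true;
}
```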
Week 3: Decision Logging
- For every decision, log: what the agent saw, what it decided, why
- Doesn't need to be the full trace structure—start simple
- Make logs queryable (you'll need them for debugging)
Week 4: Outcome Tracking
- Link decisions to outcomes (email opened, meeting booked, deal created)
- Start measuring: which decisions led to good outcomes?
- Use this to refine policies
That's your minimum viable harness. Four weeks of work, and your agents go from "black boxes that might work" to "observable systems you can debug and improve."
The Long Horizon Connection
Everything we've described - context graphs, coordination, decision traces, outcome loops - serves one goal: enabling long horizon agents.
Long horizon agents are AI systems that complete complex, multi-step tasks spanning hours, days, or weeks. According to METR research, AI agent task completion capability is doubling every ~7 months. By late 2026, agents may routinely complete tasks requiring 50-500 sequential steps - the kind of complex workflows that define B2B sales cycles.
Why the harness enables long horizon agents: without one, they're impossible:
- No persistent memory → Agent forgets what it learned last week
- No coordination → Multiple agents contradict each other across days
- No decision traces → Can't debug why the agent went off-course
- No outcome loops → Agent never improves from experience
With a harness, agents can:
- Remember that they contacted Sarah 3 weeks ago and she said "not now, Q2"
- Coordinate with marketing agents so the prospect gets a consistent experience
- Explain why they prioritized this account over others
- Learn that LinkedIn outreach to VPs at high-intent accounts closes 40% better than cold email
The agentic loop: Long horizon agents operate through a perceive-think-act-reflect cycle that spans weeks:
Week 1: Perceive high-intent signal → Think about buying committee → Act with targeted outreach
Week 2: Perceive reply → Think about objection handling → Act with relevant case study
Week 3: Perceive meeting request → Think about deal strategy → Act with champion enablement
Week 4+: Reflect on outcome → Update policies for future accounts
The harness provides the infrastructure for each step. The [context graph](/p/blog/context-graphs-for-gtm) provides the perceive layer. The policy engine provides the think layer. The coordination layer provides the act layer. The outcome loop provides the reflect layer.
Short-horizon agents (1-15 steps in minutes) will become table stakes. Competitive advantage comes from agents that reason across quarters.
The Bigger Picture: Why Infrastructure Wins
Here's what I believe: the AI agent wars will be won by infrastructure, not intelligence.
Model capabilities are converging. GPT-4o, Claude, Gemini - they're all good enough for most GTM use cases. The marginal gains from switching models are shrinking. That's why we focus on agentic workflows rather than model selection.
What's not converging is infrastructure. The teams that build robust harnesses - unified context, coordination, auditability, learning loops - will compound their advantage over time.
Their agents will get smarter because they learn from outcomes. Their agents will be more reliable because they're constrained by policies. Their agents will be more trustworthy because every decision is traceable.
The teams without harnesses will keep chasing the next model upgrade, wondering why their agents still fail 10% of the time.
Build the harness. The agents will thank you.
FAQ
What is an agent harness?
An agent harness is the infrastructure layer that provides AI agents with shared context, coordination rules, and audit trails. It ensures multiple agents can work together without contradicting each other, while maintaining full traceability of every decision. The harness sits between your agents and the real world, handling context management, policy enforcement, decision logging, and outcome tracking.
How do you coordinate multiple AI agents?
Coordinate multiple AI agents using event-based routing with explicit coordination rules. Every agent action publishes to a shared event stream. A routing layer watches the stream and prevents collisions—for example, blocking Agent B from emailing a prospect if Agent A already messaged them within a cooldown period. Define rules like "max 1 touch per day" and "72-hour cooldown between same-channel touches" and enforce them centrally.
Why do AI agents fail in production?
AI agents fail in production for three main reasons: (1) Context rot—models effectively use only 8K-50K tokens regardless of context window size, so critical information gets lost. (2) Agent collision—multiple agents make locally optimal decisions that are globally suboptimal, like two agents messaging the same prospect within hours. (3) Black box decisions—no audit trail means you can't debug failures or explain decisions to stakeholders.
What's the difference between AI agent orchestration and an agent harness?
Orchestration is about sequencing tasks—making sure step B happens after step A. A harness provides the infrastructure that makes orchestration reliable: shared context so agents see the same data, coordination rules so agents don't collide, policy enforcement so agents stay within bounds, and decision logging so you can debug and improve. You need both, but the harness is the foundation.
How do you debug AI agent decisions?
Debug AI agent decisions using decision traces that capture the full reasoning chain. Each trace should include: (1) the context the agent saw (intent score, ICP tier, recent signals), (2) the policy that was applied, (3) the confidence score, (4) the action taken, and (5) the outcome. When something goes wrong, pull up the trace and see exactly what the agent knew and why it made that choice.
What is a context graph for AI agents?
A context graph is a unified data layer that gives every AI agent the same view of reality. Instead of each agent querying multiple APIs and stitching together partial views, all agents query a single graph that combines data from your CRM, intent signals, website activity, and other sources. This ensures consistent decisions and eliminates the "different agents seeing different data" problem.
How many AI agents can you run in production?
There's no hard limit, but complexity scales non-linearly. We run 9 agents in production with strong coordination. The key is having infrastructure (the harness) that scales with agent count. Without a harness, 2-3 agents become unmanageable. With a harness, you can run dozens - the coordination layer handles the complexity.
We're building the agent harness for GTM at Warmly. If you're running AI agents in production and want to compare notes, Book a demo or check out our Pricing.
Last updated: January 2026