This is part of a 3-post series on AI infrastructure for GTM:
1. Context Graphs - The data foundation (memory, world model)
2. Agent Harness - The coordination infrastructure (policies, audit trails) (you are here)
3. Long Horizon Agents - The capability that emerges when you have both
Everyone's building AI agents. Almost no one's building the infrastructure to run them.
An agent harness is the infrastructure layer that provides AI agents with shared context, coordination rules, and audit trails. Without one, your agents will fail 3-15% of the time, contradict each other, and operate as black boxes you can't debug. We run 9 AI agents in production every day at Warmly. Here's what we learned about building the harness that makes them reliable.
The market is obsessed with making agents smarter. But intelligence isn't the bottleneck. Infrastructure is.
Quick Answer: Agent Harness Components by Use Case
Best for multi-agent coordination: Event-based routing with Temporal workflows - prevents agents from colliding or duplicating work.
Best for decision auditability: Decision ledger with full traces - every agent decision logged with reasoning, confidence scores, and context snapshots.
Best for context management: Unified context graph - single source of truth across CRM, intent signals, and website activity.
Best for policy enforcement: YAML-based policy engine - define rules once, enforce across all agents.
Best for continuous improvement: Outcome loop - link decisions to business results (meetings booked, deals closed) and learn from patterns.
Best for GTM teams getting started: Warmly's AI Orchestrator - production-ready agent harness with 9 workflows already built.
The Problem Nobody Talks About
Here's a stat that should worry you: tool calling - the mechanism by which AI agents actually do things - fails 3-15% of the time in production. That's not a bug. That's the baseline for well-engineered systems (Gartner 2025).
And it gets worse. According to RAND Corporation, over 80% of AI projects fail—twice the failure rate of non-AI technology projects. Gartner predicts 40%+ of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.
Why? Because most teams focus on the wrong problem.
They're fine-tuning prompts. Switching models. Adding more tools. But the agents keep failing in production because there's no infrastructure holding them together. (For more on what works, see our guide to agentic AI orchestration.)
Think about it this way: You wouldn't deploy a fleet of microservices without Kubernetes. You wouldn't run a data pipeline without Airflow. But somehow, we're deploying fleets of AI agents with nothing but prompts and prayers.
That's where the agent harness comes in.
What is an Agent Harness?
An agent harness is the infrastructure layer between your AI agents and the real world. It does three things:
- Context: Gives every agent access to the same unified view of reality
- Coordination: Ensures agents don't contradict or duplicate each other
- Constraints: Enforces policies and creates audit trails for every decision
The metaphor is intentional. A harness doesn't slow down a horse - it lets the horse pull. Same principle. A harness doesn't limit your agents. It gives them the structure they need to actually work.
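To make the three C's concrete, here's a minimal sketch of what a harness surface might look like in TypeScript. Every name and shape here is illustrative, not Warmly's actual API:

```typescript
// Hypothetical harness interface - names and shapes are illustrative,
// not Warmly's actual API.
type ProposedAction = { type: string; targetId: string; params?: Record<string, unknown> };
type PolicyResult = { allowed: boolean; reason: string; needsHumanReview: boolean };
type EntityContext = { entityId: string; facts: Record<string, unknown> };
type DecisionTrace = { decisionId: string; agentId: string; action: ProposedAction; reasoning: string };

interface AgentHarness {
  // Context: every agent reads the same unified view of reality
  getContext(entityId: string): Promise<EntityContext>;
  // Coordination: consult the shared event stream before acting
  canAct(agentId: string, action: ProposedAction): Promise<boolean>;
  // Constraints: evaluate policies, then log the full decision trace
  checkPolicy(action: ProposedAction): Promise<PolicyResult>;
  recordDecision(trace: DecisionTrace): Promise<void>;
}
```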
Without a harness, you get what I call the "demo-to-disaster" gap. Your agent works perfectly in a notebook. Then you deploy it, and within a week:
- Agent A sends an email. Agent B sends a nearly identical email two hours later.
- A customer asks "why did you reach out?" and nobody knows.
- Your agents burn through your entire TAM before anyone notices the personalization is broken.
With a harness, you get agents that operate like a coordinated team instead of a bunch of interns who've never met. This is the foundation of what we call agentic automation - AI that can actually run autonomously in production.
Why AI Agents Fail in Production (The Real Reasons)
Let me be specific about why agents fail. This isn't theoretical. We've seen all of these.
Failure Mode 1: Context Rot
Here's something the model spec sheets don't tell you: models effectively utilize only 8K-50K tokens regardless of what the context window promises. Information buried in the middle shows 20% performance degradation. Approximately 70% of tokens you're paying for provide minimal value (Princeton KDD 2024).
This is called "context rot." Your agent has access to everything, but can actually use almost nothing.
The fix isn't a bigger context window. It's better context engineering - giving the agent exactly what it needs, when it needs it, in a format it can actually use.
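As a sketch of what context engineering can mean in practice (the decay heuristic and names are assumptions, not a specific library): rank candidate facts by relevance and recency, then pack only the top few into the prompt.

```typescript
// Context-packing sketch - the decay heuristic is an assumption, not a
// prescribed method. The point is selection, not a bigger window.
type Fact = { text: string; relevance: number; ageDays: number };

function packContext(facts: Fact[], maxFacts = 10): string {
  return facts
    .map(f => ({ ...f, score: f.relevance / (1 + f.ageDays / 30) })) // decay stale facts
    .sort((a, b) => b.score - a.score)
    .slice(0, maxFacts) // give the agent only what it can actually use
    .map(f => `- ${f.text}`)
    .join("\n");
}
```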
Failure Mode 2: Agent Collision
This is the second-order problem that kills most multi-agent systems. You deploy Agent A to send LinkedIn messages. Agent B to send emails. Agent C to update the CRM. Each agent works perfectly in isolation. (This is exactly the problem that AI sales automation tools need to solve.)
Then Agent A messages a prospect at 9am. Agent B emails the same prospect at 11am. Agent C marks them as "contacted" but doesn't know which agent did what. The prospect gets annoyed. Your brand looks like a spam operation.
The agents aren't broken. They just have no idea what the others are doing.
Failure Mode 3: Black Box Decisions
A prospect asks: "Why did your AI reach out to me?"
If you can't answer that question with specifics - what signals the agent saw, what rules it applied, why it chose this action over alternatives - you have a black box problem.
Black boxes are fine for demos. They're disasters for production. You can't debug what you can't see. You can't improve what you can't measure. And you definitely can't explain to your legal team why the AI sent that message.
The Agent Harness Architecture
Here's the architecture we use to run 9 production agents at Warmly. It has four layers.
Layer 1: The Context Graph
A context graph is a unified data layer that gives every agent the same view of reality.
Most companies have their data scattered across a dozen systems. Intent signals in one tool. CRM data in another. Website activity somewhere else. Each agent has to query multiple APIs, stitch together partial views, and hope nothing changed in between.
That's a recipe for inconsistent decisions. Our context graph unifies three databases:
- Terminus (port 5444): Company data, buying committees, ICP tiers, audience memberships
- Warm Opps (port 5441): Website sessions, chat messages, intent signals, page visits
- HubSpot: Deal stages, contact properties, activity history
This unified view is what enables person-based signals - knowing not just which company visited, but who specifically and what they care about.
Every agent queries the same graph. When Agent A looks up a company, it sees the same data Agent B would see. No API race conditions. No stale caches. One source of truth.
The graph has four sub-layers:
Entity Layer: Core objects linked together
- Company → People → Employments → Buying Committee
- Signals → Sessions → Page Visits → Intent Scores
Ledger Layer: Immutable event stream (the "why" behind everything)
- Activity events: website_visit, email_sent, meeting_booked
- Signal events: new_hire, job_posting, bombora_surge
- State snapshots: intent_score_computed, icp_tier_assigned
Policy Layer: Rules that govern agent behavior
- "Only reach out if intent_score > 50 AND icp_tier IN ['Tier 1', 'Tier 2']"
- "Never contact accounts with active deals in Negotiation stage"
API Layer: Unified interface for all agents
- GET: getCompanyContext(), getBuyingCommittee(), getPriorityRanking()
- POST: syncToCRM(), addToLinkedInAds(), sendEmail()
- OBSERVE: onEvent(), recordDecision(), recordOutcome()
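Here's a minimal client sketch for that API layer. Method names mirror the list above; the request and response shapes are assumptions for illustration:

```typescript
// Hypothetical client for the unified API layer. Method names mirror the
// list above; all shapes are illustrative assumptions.
interface CompanyContext {
  domain: string;
  icpTier: "Tier 1" | "Tier 2" | "Tier 3" | null;
  intentScore: number;
  recentSignals: string[];
}

interface ContextGraphClient {
  // GET: read the same unified view every other agent sees
  getCompanyContext(domain: string): Promise<CompanyContext>;
  getBuyingCommittee(domain: string): Promise<string[]>;
  getPriorityRanking(organizationId: string): Promise<string[]>;
  // POST: side effects routed through one interface
  syncToCRM(domain: string, fields: Record<string, unknown>): Promise<void>;
  sendEmail(to: string, templateId: string): Promise<void>;
  // OBSERVE: decisions and outcomes land in the ledger
  recordDecision(trace: Record<string, unknown>): Promise<void>;
  recordOutcome(decisionId: string, event: string): Promise<void>;
}
```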
Layer 2: The Policy Engine
Policies are rules that constrain what agents can do.
This sounds limiting. It's actually liberating. When agents know their boundaries, they can operate with more autonomy inside those boundaries.
Here's what a policy looks like:
```yaml
policy:
  name: "outbound-qualification"
  version: "2.3"
  conditions:
    - field: "icpTier"
      operator: "in"
      value: ["Tier 1", "Tier 2"]
    - field: "intentScore"
      operator: "gte"
      value: 50
    - field: "dealStage"
      operator: "not_in"
      value: ["Negotiation", "Contracting", "Closed Won"]
  actions:
    allowed:
      - "send_email"
      - "add_to_salesflow"
      - "add_to_linkedin_audience"
    blocked:
      - "create_deal"
      - "update_deal_stage"
  human_review_threshold: 0.6
```
The policy engine evaluates every agent action against applicable policies before execution. If an action violates a policy, it's blocked. If confidence is below the review threshold, it's queued for human approval.
This is how you deploy agents without worrying they'll burn through your TAM or message the CEO of your biggest customer. (If you're evaluating AI SDR agents, this is the first thing to check: what policies can you set?)
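To show how an engine like this can work, here's a simplified evaluator over the same condition shape as the YAML above. It's a sketch under assumed types, not Warmly's implementation:

```typescript
// Simplified policy evaluator sketch - illustrative, not Warmly's engine.
type Condition = { field: string; operator: "in" | "not_in" | "gte"; value: unknown };
type Fields = Record<string, unknown>;

function conditionHolds(c: Condition, fields: Fields): boolean {
  const v = fields[c.field];
  switch (c.operator) {
    case "in":     return (c.value as unknown[]).includes(v);
    case "not_in": return !(c.value as unknown[]).includes(v);
    case "gte":    return typeof v === "number" && v >= (c.value as number);
  }
}

// Returns the verdict for a proposed action, mirroring the flow described above.
function evaluatePolicy(
  policy: { conditions: Condition[]; allowed: string[]; humanReviewThreshold: number },
  fields: Fields,
  action: string,
  confidence: number
): "approved" | "blocked" | "human_review" {
  if (!policy.conditions.every(c => conditionHolds(c, fields))) return "blocked";
  if (!policy.allowed.includes(action)) return "blocked";
  if (confidence < policy.humanReviewThreshold) return "human_review";
  return "approved";
}
```

Run against the outbound-qualification policy above, a Tier 1 account with an intent score of 72 and no active deal comes back approved; a confidence below the 0.6 threshold would route to human review instead.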
Layer 3: The Decision Ledger
Every agent decision gets recorded. Not just what happened - why it happened. Here's what a decision trace looks like:
```json
{
  "decisionId": "dec_7f8a9b2c",
  "timestamp": "2026-01-17T14:32:18Z",
  "agent": "lead-list-builder",
  "workflowId": "manual-list-sync-a0396ff9-1737135132975",
  "decisionType": "reach_out",
  "reasoning": {
    "summary": "High intent Tier 1 account with active buying committee, no recent outreach",
    "factors": [
      {"factor": "intentScore", "value": 72, "weight": 0.3, "contribution": "high"},
      {"factor": "icpTier", "value": "Tier 1", "weight": 0.25, "contribution": "high"},
      {"factor": "buyingCommitteeSize", "value": 4, "weight": 0.2, "contribution": "medium"},
      {"factor": "daysSinceLastContact", "value": 45, "weight": 0.15, "contribution": "high"},
      {"factor": "dealStage", "value": null, "weight": 0.1, "contribution": "neutral"}
    ],
    "confidence": 0.85
  },
  "contextSnapshot": {
    "company": "acme.com",
    "intentScore": 72,
    "icpTier": "Tier 1",
    "buyingCommittee": ["Sarah Chen (CRO)", "Mike Davis (RevOps)", "Lisa Park (VP Sales)"],
    "recentSignals": ["pricing_page_visit", "competitor_research", "new_sales_hire"]
  },
  "policyApplied": {
    "policyId": "outbound-qualification",
    "version": "2.3",
    "result": "approved"
  },
  "action": {
    "type": "add_to_sdr_list",
    "parameters": {
      "listId": "high-intent-2026-01-17",
      "assignedSDR": "[email protected]",
      "priority": "high"
    }
  },
  "methodology": {
    "approach": "Weighted scoring against closed-won deal patterns",
    "dataSourcesQueried": ["terminus", "warm_opps", "hubspot"],
    "modelUsed": "internal-scoring-v3",
    "tokensConsumed": 0
  }
}
```
When someone asks "why did we reach out to Acme?", you can pull up the exact decision trace. You can see the intent score was 72, the account was Tier 1, they had 4 buying committee members identified, and they hadn't been contacted in 45 days.
That's not a black box. That's a transparent, auditable decision system.
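As a sketch of how you might query such a ledger (the trace shape follows the JSON above; the in-memory store is a stand-in for whatever append-only storage you use):

```typescript
// Ledger query sketch - in-memory store stands in for an append-only DB.
type Trace = {
  decisionId: string;
  agent: string;
  reasoning: { summary: string; confidence: number };
  contextSnapshot: { company: string } & Record<string, unknown>;
};

const ledger: Trace[] = [];

function recordDecision(trace: Trace): void {
  ledger.push(trace); // in production: immutable, append-only writes
}

// "Why did we reach out to acme.com?" - answered from the ledger, not memory.
function whyDidWeContact(company: string): string[] {
  return ledger
    .filter(t => t.contextSnapshot.company === company)
    .map(t => `${t.agent}: ${t.reasoning.summary} (confidence ${t.reasoning.confidence})`);
}
```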
Layer 4: The Outcome Loop
The decision ledger captures what the agent decided. The outcome loop captures what actually happened.
```json
{
  "decisionId": "dec_7f8a9b2c",
  "outcomes": [
    {
      "timestamp": "2026-01-18T09:15:00Z",
      "event": "email_sent",
      "details": {"to": "[email protected]", "template": "high-intent-cro"}
    },
    {
      "timestamp": "2026-01-19T14:22:00Z",
      "event": "email_opened",
      "details": {"opens": 3}
    },
    {
      "timestamp": "2026-01-22T11:00:00Z",
      "event": "meeting_booked",
      "details": {"type": "demo", "attendees": 2}
    }
  ],
  "businessOutcome": {
    "result": "opportunity_created",
    "value": 45000,
    "daysToOutcome": 5
  }
}
```
Now you can answer the question: "Did that decision work?"
Over time, this creates a feedback loop. You can see which factors actually correlate with meetings booked. You can adjust the weights. You can A/B test policies. The system gets smarter because it learns from its own decisions.
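What "learning from its own decisions" can look like, as a toy pass (a naive success-rate comparison, not Warmly's actual learning pipeline): join decision traces to outcomes and see which high-contribution factors actually co-occur with booked meetings.

```typescript
// Toy feedback loop: which high-contribution factors co-occur with meetings?
// A naive rate comparison for illustration, not a real learning pipeline.
type DecisionOutcome = {
  factors: { factor: string; contribution: "high" | "medium" | "neutral" }[];
  meetingBooked: boolean;
};

function successRateByFactor(history: DecisionOutcome[]): Map<string, number> {
  const stats = new Map<string, { wins: number; total: number }>();
  for (const d of history) {
    for (const f of d.factors) {
      if (f.contribution !== "high") continue; // only count strong signals
      const s = stats.get(f.factor) ?? { wins: 0, total: 0 };
      s.total += 1;
      if (d.meetingBooked) s.wins += 1;
      stats.set(f.factor, s);
    }
  }
  return new Map([...stats].map(([factor, s]) => [factor, s.wins / s.total]));
}
```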
How We Coordinate 9 Agents Without Chaos
Running one agent is easy. Running nine agents that don't step on each other? That's where most teams fail.
Here's our approach.
The Second-Order Problem
When you have multiple agents operating in parallel, each agent makes locally optimal decisions that can be globally suboptimal.
Agent A sees high intent and sends an email.
Agent B sees high intent and adds them to a LinkedIn campaign.
Agent C sees the email was sent and updates the CRM.
Each agent did the right thing based on its view. But the prospect just got hit with three touches in 24 hours. That's not orchestration. That's spam.
This is the second-order problem: agents lose context of each other.
The Solution: Event-Based Coordination
We use Temporal for workflow orchestration. Every agent action publishes to a shared event stream. A routing layer watches the stream and prevents collisions.
```typescript
// Imports follow Temporal's standard workflow pattern; the module paths and
// timeout value here are illustrative, not our exact configuration.
import { proxyActivities } from '@temporalio/workflow';
import type * as activityTypes from './activities';
import type { GTMAgentConfig, GTMAgentResult } from './types';

const activities = proxyActivities<typeof activityTypes>({
  startToCloseTimeout: '5 minutes',
});

export async function gtmDailyWorkflow(input: {
  organizationId: string;
  config: GTMAgentConfig;
}): Promise<GTMAgentResult> {
  // Step 1: Identify high-intent accounts
  const highIntent = await activities.identifyHighIntentAccounts({
    organizationId: input.organizationId,
    lookbackDays: 7,
    minIntentScore: 50
  });

  // Step 2: Filter by policies (CRM status, recent contact, etc.)
  const qualified = await activities.applyQualificationPolicies({
    accounts: highIntent,
    policies: ['no-active-deals', 'no-recent-outreach', 'icp-tier-filter']
  });

  // Step 3: Get buying committees (parallel execution)
  const withCommittees = await Promise.all(
    qualified.map(account =>
      activities.getBuyingCommittee({
        domain: account.domain,
        organizationId: input.organizationId
      })
    )
  );

  // Step 4: Route to appropriate channels (with coordination)
  const routingDecisions = await activities.routeToChannels({
    accounts: withCommittees,
    availableChannels: ['email', 'linkedin', 'linkedin_ads'],
    coordinationRules: {
      maxTouchesPerDay: 1,
      channelCooldown: { email: 72, linkedin: 48 }, // hours
      requireDifferentChannels: true
    }
  });

  // Step 5: Execute actions (parallel, with rate limiting)
  const results = await activities.executeRoutedActions({
    decisions: routingDecisions,
    recordDecisionTraces: true
  });

  // Step 6: Sync outcomes to CRM
  await activities.syncToCRM({
    results,
    updateFields: ['last_contact_date', 'outreach_channel', 'agent_decision_id']
  });

  return {
    accountsProcessed: qualified.length,
    actionsExecuted: results.filter(r => r.success).length,
    decisionsRecorded: results.length
  };
}
```
The coordination rules are explicit:
- Max 1 touch per day per account
- 72-hour cooldown after email before another email
- 48-hour cooldown after LinkedIn
- Require different channels if multiple touches
The routing layer enforces these rules across all agents. Agent B can't send a LinkedIn message if Agent A sent an email 6 hours ago—the coordination layer blocks it.
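Here's a minimal sketch of that enforcement check, roughly what a routing layer like routeToChannels would run before any agent touches an account. The event shape and constants are illustrative:

```typescript
// Cooldown enforcement sketch - event shape and constants are illustrative.
type TouchEvent = { accountId: string; channel: "email" | "linkedin"; timestamp: number };

const COOLDOWN_HOURS = { email: 72, linkedin: 48 } as const;
const HOUR_MS = 3_600_000;

function canTouch(events: TouchEvent[], accountId: string, channel: TouchEvent["channel"], now: number): boolean {
  const recent = events.filter(e => e.accountId === accountId);
  // Max 1 touch per day per account, across all agents and channels
  if (recent.some(e => now - e.timestamp < 24 * HOUR_MS)) return false;
  // Same-channel cooldown before that channel can fire again
  return !recent.some(e => e.channel === channel && now - e.timestamp < COOLDOWN_HOURS[channel] * HOUR_MS);
}
```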
What This Looks Like in Practice
We run 9 workflows in production:
| Workflow | Trigger | What It Does |
|---|---|---|
| listSyncWorkflow | Hourly schedule | Syncs audience memberships to HubSpot |
| manualListSyncWorkflow | On-demand | Triggered list syncs for specific audiences |
| buyingCommitteeWorkflow | New high-intent account | Identifies decision makers, champions, influencers (see [AI Data Agent](/p/ai-agents/ai-data-agent)) |
| buyingCommitteePersonaFinderProcessingWorkflow | New company in ICP | Finds people matching buyer personas |
| buyingCommitteePersonaClassificationProcessingWorkflow | New person identified | Classifies persona (CRO, RevOps, etc.) |
| webResearchWorkflow | New target account | Researches company context for personalization |
| leadListBuilderWorkflow | Daily 6am | Builds prioritized SDR target lists (powers [AI Outbound](/p/blog/ai-outbound-sales-tools)) |
| linkedInAudienceWorkflow | New qualified contact | Adds contacts to LinkedIn Ads audiences |
| crmSyncWorkflow | Any outreach action | Updates HubSpot with agent activities |
All 9 workflows query the same context graph. All 9 publish to the same event stream. All 9 are constrained by the same policies.
That's how you get coordination without chaos.
Agent Harness vs. No Harness: What Changes
| Scenario | Without Harness | With Harness |
|---|---|---|
| **Agent A emails prospect** | No record of context or reasoning | Full decision trace: signals seen, policy applied, confidence score |
| **Agent B wants to message same prospect** | Has no idea Agent A already reached out | Sees Agent A's action in event stream, waits for cooldown |
| **Prospect asks "why did you contact me?"** | "Uh... our AI thought you'd be interested?" | "You visited our pricing page 3 times, matched our ICP, and your company just hired a new sales leader" |
| **Agent makes bad decision** | Black box—can't debug | Full trace—see exactly what went wrong |
| **New policy needed** | Update prompts across all agents | Update policy once, all agents comply |
| **Want to A/B test approach** | Manual tracking in spreadsheets | Built-in—compare outcomes by policy version |
When You Need a Harness (And When You Don't)
Let me be honest: not everyone needs this. You probably don't need a harness if:
- You have one agent doing one thing
- The agent doesn't make autonomous decisions
- You're in demo/prototype phase
- The cost of failure is low
You definitely need a harness if:
- You have multiple agents that could interact
- Agents make decisions that affect customers
- You need to explain decisions to stakeholders (legal, customers, executives)
- You want agents to improve over time
- The cost of failure is high (brand damage, TAM burn, compliance risk)
For most GTM teams, the answer is: you need a harness sooner than you think. (Not sure where to start? Check out our guide to AI for RevOps.)
The moment you deploy a second agent, you have a coordination problem. The moment an agent contacts a customer, you have an auditability requirement. The moment you want to improve performance, you need outcome tracking.
Build vs. Buy: What an Agent Harness Actually Costs
Let's talk numbers. Building an agent harness in-house is a significant investment.
Build It Yourself
| Component | Engineering Time | Ongoing Cost |
|---|---|---|
| Context graph (unified data layer) | 2-3 months | $2-5K/mo infrastructure |
| Event stream + coordination | 1-2 months | $500-2K/mo (Kafka/Redis) |
| Policy engine | 1-2 months | Minimal |
| Decision ledger | 1 month | $500-1K/mo (storage) |
| Outcome tracking + analytics | 1-2 months | $500-1K/mo |
| Workflow orchestration (Temporal) | 1 month | $500-2K/mo |
| **Total** | **8-12 months** | **$4-11K/mo** |
Plus: 1-2 senior engineers dedicated to maintenance, debugging, and improvements. At $200K+ fully loaded, that's $17-33K/mo in labor alone.
Realistic all-in cost to build: $250-500K first year, $150-300K/year ongoing.
Buy a Platform
Most enterprise agent platforms with harness capabilities:
| Platform Type | Annual Cost | What You Get |
|---|---|---|
| Point solutions (single agent) | $10-25K/yr | One agent, limited coordination |
| Mid-market platforms | $25-75K/yr | 2-4 agents, basic orchestration |
| Enterprise ABM/intent (6sense, Demandbase) | $100-200K/yr | Intent data + some automation |
| Full agent harness (Warmly) | [$10-25K/yr](/p/pricing) | 4+ agents, full orchestration, decision traces |
The math: If you have a RevOps or data engineering team that can dedicate 8+ months to building infrastructure, building might make sense. If you need agents in production in weeks, buy.
When Building Makes Sense
- You have unique data sources no platform supports
- You need custom compliance/audit requirements
- You have 3+ engineers who can dedicate 50%+ time
- You're already running Temporal or similar orchestration
When Buying Makes Sense
- You need results in weeks, not months
- Your team is <20 people (can't afford dedicated infra engineers)
- You want to focus on GTM strategy, not infrastructure
- You need proven coordination patterns (not experimenting)
Getting Started: The Minimum Viable Harness
You don't need to build all four layers on day one. Here's how to start:
Week 1: Unified Context
- Pick your 2-3 critical data sources
- Build a single API that queries all of them
- Every agent calls this API instead of querying sources directly
Week 2: Event Stream
- Every agent action publishes an event
- Events include: agent ID, action type, target (company/person), timestamp
- Simple coordination rule: block duplicate actions within N hours (see the sketch below)
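A sketch of that week-2 rule (the event shape is an assumption; back the stream with any shared store all your agents can read):

```typescript
// Week 2 sketch: publish every action, block duplicates within N hours.
// Event shape is an assumption; back it with Redis, Postgres, whatever.
type AgentEvent = { agentId: string; actionType: string; targetId: string; ts: number };

const stream: AgentEvent[] = [];

function tryPublish(event: AgentEvent, windowHours: number): boolean {
  const windowMs = windowHours * 3_600_000;
  const duplicate = stream.some(e =>
    e.targetId === event.targetId &&
    e.actionType === event.actionType &&
    event.ts - e.ts < windowMs
  );
  if (duplicate) return false; // another agent already did this recently
  stream.push(event);
  return true;
}
```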
Week 3: Decision Logging
- For every decision, log: what the agent saw, what it decided, why
- Doesn't need to be the full trace structure—start simple
- Make logs queryable (you'll need them for debugging)
Week 4: Outcome Tracking
- Link decisions to outcomes (email opened, meeting booked, deal created)
- Start measuring: which decisions led to good outcomes?
- Use this to refine policies
That's your minimum viable harness. Four weeks of work, and your agents go from "black boxes that might work" to "observable systems you can debug and improve."
The Long Horizon Connection
Everything we've described - context graphs, coordination, decision traces, outcome loops - serves one goal: enabling long horizon agents.
Long horizon agents are AI systems that complete complex, multi-step tasks spanning hours, days, or weeks. According to METR research, AI agent task completion capability is doubling every ~7 months. By late 2026, agents may routinely complete tasks requiring 50-500 sequential steps - the kind of complex workflows that define B2B sales cycles.
Why the harness enables long horizon agents: without one, they're impossible:
- No persistent memory → Agent forgets what it learned last week
- No coordination → Multiple agents contradict each other across days
- No decision traces → Can't debug why the agent went off-course
- No outcome loops → Agent never improves from experience
With a harness, agents can:
- Remember that they contacted Sarah 3 weeks ago and she said "not now, Q2"
- Coordinate with marketing agents so the prospect gets a consistent experience
- Explain why they prioritized this account over others
- Learn that LinkedIn outreach to VPs at high-intent accounts closes 40% better than cold email
The agentic loop: Long horizon agents operate through a perceive-think-act-reflect cycle that spans weeks:
Week 1: Perceive high-intent signal → Think about buying committee → Act with targeted outreach
Week 2: Perceive reply → Think about objection handling → Act with relevant case study
Week 3: Perceive meeting request → Think about deal strategy → Act with champion enablement
Week 4+: Reflect on outcome → Update policies for future accounts
The harness provides the infrastructure for each step. The [context graph](/p/blog/context-graphs-for-gtm) provides the perceive layer. The policy engine provides the think layer. The coordination layer provides the act layer. The outcome loop provides the reflect layer.
Short-horizon agents (1-15 steps in minutes) will become table stakes. Competitive advantage comes from agents that reason across quarters.
The Bigger Picture: Why Infrastructure Wins
Here's what I believe: the AI agent wars will be won by infrastructure, not intelligence.
Model capabilities are converging. GPT-4o, Claude, Gemini - they're all good enough for most GTM use cases. The marginal gains from switching models are shrinking. That's why we focus on agentic workflows rather than model selection.
What's not converging is infrastructure. The teams that build robust harnesses - unified context, coordination, auditability, learning loops - will compound their advantage over time.
Their agents will get smarter because they learn from outcomes. Their agents will be more reliable because they're constrained by policies. Their agents will be more trustworthy because every decision is traceable.
The teams without harnesses will keep chasing the next model upgrade, wondering why their agents still fail 10% of the time.
Build the harness. The agents will thank you.
FAQ
What is an agent harness?
An agent harness is the infrastructure layer that provides AI agents with shared context, coordination rules, and audit trails. It ensures multiple agents can work together without contradicting each other, while maintaining full traceability of every decision. The harness sits between your agents and the real world, handling context management, policy enforcement, decision logging, and outcome tracking.
How do you coordinate multiple AI agents?
Coordinate multiple AI agents using event-based routing with explicit coordination rules. Every agent action publishes to a shared event stream. A routing layer watches the stream and prevents collisions—for example, blocking Agent B from emailing a prospect if Agent A already messaged them within a cooldown period. Define rules like "max 1 touch per day" and "72-hour cooldown between same-channel touches" and enforce them centrally.
Why do AI agents fail in production?
AI agents fail in production for three main reasons: (1) Context rot—models effectively use only 8K-50K tokens regardless of context window size, so critical information gets lost. (2) Agent collision—multiple agents make locally optimal decisions that are globally suboptimal, like two agents messaging the same prospect within hours. (3) Black box decisions—no audit trail means you can't debug failures or explain decisions to stakeholders.
What's the difference between AI agent orchestration and an agent harness?
Orchestration is about sequencing tasks—making sure step B happens after step A. A harness provides the infrastructure that makes orchestration reliable: shared context so agents see the same data, coordination rules so agents don't collide, policy enforcement so agents stay within bounds, and decision logging so you can debug and improve. You need both, but the harness is the foundation.
How do you debug AI agent decisions?
Debug AI agent decisions using decision traces that capture the full reasoning chain. Each trace should include: (1) the context the agent saw (intent score, ICP tier, recent signals), (2) the policy that was applied, (3) the confidence score, (4) the action taken, and (5) the outcome. When something goes wrong, pull up the trace and see exactly what the agent knew and why it made that choice.
What is a context graph for AI agents?
A context graph is a unified data layer that gives every AI agent the same view of reality. Instead of each agent querying multiple APIs and stitching together partial views, all agents query a single graph that combines data from your CRM, intent signals, website activity, and other sources. This ensures consistent decisions and eliminates the "different agents seeing different data" problem.
How many AI agents can you run in production?
There's no hard limit, but complexity scales non-linearly. We run 9 agents in production with strong coordination. The key is having infrastructure (the harness) that scales with agent count. Without a harness, 2-3 agents become unmanageable. With a harness, you can run dozens - the coordination layer handles the complexity.
We're building the agent harness for GTM at Warmly. If you're running AI agents in production and want to compare notes, Book a demo or check out our Pricing.
Last updated: January 2026