AI Agents · 11 min read · May 2026

Operating an Agent Fleet in 2026: The Practical Guide

By Thinklytics, AI Agent Practice

23 percent of organizations are scaling agentic AI somewhere in their enterprise (McKinsey, November 2025). Bank of America's Erica has crossed 3 billion interactions; Wells Fargo's Fargo has crossed 1 billion. Salesforce Agentforce 360 has 12,000 customers. And Klarna walked back its agent-only deployment after admitting that cost had weighed too heavily in its evaluation. Here is the practical 2026 guide to operating an agent fleet that actually works.

Frequently asked questions

Should we build agents on Anthropic Claude, OpenAI, or a smaller open-source model?

For most enterprise use cases in 2026, Claude Sonnet/Opus or GPT-5/GPT-5 Codex deliver the best capability-per-dollar. Open-source models (Llama, Qwen, DeepSeek) work well for lower-stakes use cases or where data residency requires on-prem inference. The platform decision matters more than the model decision because the platform is what runs the fleet.

Do we need an AI gateway?

Yes, if you have more than two agents in production. The Kong, Cloudflare, Microsoft Agent 365, and Databricks Agent Bricks gateways all serve the same architectural role: a single point for auth, audit, rate limits, and routing. Without a gateway, every agent is its own attack surface.
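
To make the "single point" idea concrete, here is a minimal in-process sketch of the four gateway duties named above: auth, rate limits, routing, and audit. It does not reflect how any of the listed products work; the `Gateway` class, its fields, and the one-minute sliding window are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Gateway:
    """Toy gateway: every agent call passes through handle()."""
    api_keys: Dict[str, str]                 # hypothetical: API key -> agent id
    routes: Dict[str, Callable[[Any], Any]]  # hypothetical: tool name -> backend
    max_calls_per_minute: int = 60
    audit_log: List[dict] = field(default_factory=list)
    recent_calls: Dict[str, List[float]] = field(default_factory=dict)

    def handle(self, api_key: str, tool: str, payload: Any,
               now: Optional[float] = None) -> dict:
        now = time.time() if now is None else now

        # 1. Auth: unknown keys never reach a backend.
        agent = self.api_keys.get(api_key)
        if agent is None:
            return self._finish(now, "unknown", tool, "denied: bad key")

        # 2. Rate limit: keep only calls from the last 60 seconds.
        window = [t for t in self.recent_calls.get(agent, []) if now - t < 60]
        self.recent_calls[agent] = window
        if len(window) >= self.max_calls_per_minute:
            return self._finish(now, agent, tool, "denied: rate limit")

        # 3. Routing: only registered tools are reachable.
        backend = self.routes.get(tool)
        if backend is None:
            return self._finish(now, agent, tool, "denied: no route")

        window.append(now)
        return self._finish(now, agent, tool, "ok", backend(payload))

    def _finish(self, now, agent, tool, status, result=None) -> dict:
        # 4. Audit: every decision, allowed or denied, leaves a record.
        self.audit_log.append(
            {"ts": now, "agent": agent, "tool": tool, "status": status}
        )
        return {"status": status, "result": result}
```

Because denials are logged alongside successes, the audit trail shows what agents attempted as well as what they did. A real deployment would also need persistence, error handling around the backend call, and per-tool permissions.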

How do we evaluate agents reliably?

Build your own evaluation set from real production traces. Public benchmarks (SWE-bench, GAIA, OSWorld, WebArena) are useful capability proxies, but the April 2026 UC Berkeley reward-hacking demonstration showed they are not reliable measures of production quality. Use Langfuse, LangSmith, Arize, or your runtime's native eval harness.
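
Building an evaluation set from production traces can be sketched without committing to a tool. The example below is tool-agnostic and does not use the Langfuse, LangSmith, or Arize APIs; the trace fields (`trace_id`, `input`, `reviewed_outcome`) and the exact-match scoring are assumptions, and a real harness would likely use a softer comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    trace_id: str
    user_input: str
    expected: str  # outcome confirmed by a human reviewer


def build_eval_set(traces: List[Dict]) -> List[EvalCase]:
    """Keep only human-reviewed traces and drop duplicate inputs."""
    seen = set()
    cases = []
    for trace in traces:
        if trace.get("reviewed_outcome") is None:
            continue  # unreviewed traces have no trusted label
        key = trace["input"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cases.append(
            EvalCase(trace["trace_id"], trace["input"], trace["reviewed_outcome"])
        )
    return cases


def run_eval(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Score an agent by exact match against the reviewed outcome."""
    failures = [c.trace_id for c in cases if agent(c.user_input) != c.expected]
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}
```

Running `run_eval` on every agent release and comparing `pass_rate` with the previous release gives a regression signal tied to your own traffic rather than to a public benchmark.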

What does Salesforce Agentforce or Microsoft Agent 365 cost?

Both are usage-based with platform fees. Agentforce charges per conversation; Microsoft Agent 365 is in early GA and is positioned as a centralized control plane. Plan for $5-25 per agent per month at the platform layer, plus inference costs.
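
The budgeting arithmetic can be sketched as a small helper. The $5-25 platform range comes from the answer above; the fleet size and the per-agent inference figure in the usage note are made-up inputs, not benchmarks.

```python
def monthly_fleet_cost(agents, inference_cost_per_agent,
                       platform_low=5.0, platform_high=25.0):
    """Return a (low, high) monthly estimate in dollars.

    platform_low and platform_high are the per-agent platform fees;
    inference_cost_per_agent is an assumption you supply from your own usage.
    """
    low = agents * (platform_low + inference_cost_per_agent)
    high = agents * (platform_high + inference_cost_per_agent)
    return low, high
```

For a hypothetical fleet of 40 agents at $120 of inference each per month, `monthly_fleet_cost(40, 120.0)` returns `(5000.0, 5800.0)`. In that example the platform fee is the smaller line item, so the inference assumption deserves most of the scrutiny.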

What's the right ratio of human:agent for customer service?

The Klarna walkback suggests that an all-agent model fails on quality at scale. The BofA Erica pattern (deflect predictable volume to the agent, escalate the long tail to humans) is the durable model. Most fleets converge to 60-80 percent agent-handled and 20-40 percent human-handled by ticket count, with the human-handled share biased toward higher-revenue or higher-stakes interactions.
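
The deflect-and-escalate pattern can be sketched as a routing rule. This is an illustration under stated assumptions, not a description of how Erica or any other product routes: the intent names, the 0.85 confidence threshold, and the $10,000 account-value cutoff are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of intents considered predictable enough to deflect.
PREDICTABLE_INTENTS = {"balance_inquiry", "order_status", "password_reset"}


@dataclass
class Ticket:
    intent: str
    confidence: float     # classifier confidence in the intent, 0.0 to 1.0
    account_value: float  # proxy for how high-stakes the interaction is


def route(ticket, min_confidence=0.85, high_value=10_000.0):
    """Send predictable, low-stakes tickets to the agent; everything else to a human."""
    if ticket.account_value >= high_value:
        return "human"  # high-stakes interactions always get a person
    if ticket.intent in PREDICTABLE_INTENTS and ticket.confidence >= min_confidence:
        return "agent"
    return "human"      # the long tail escalates by default


def agent_share(tickets):
    """Fraction of tickets the agent would handle, by ticket count."""
    if not tickets:
        return 0.0
    return sum(route(t) == "agent" for t in tickets) / len(tickets)
```

Tracking `agent_share` over time shows whether the fleet is drifting toward the all-agent end of the range, and the default-to-human branch keeps unrecognized requests away from the agent.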

What about responsible AI / safety?

Apply the 2026 AI Governance Operating Model to the fleet. That model combines NIST AI RMF, ISO 42001, and the OWASP LLM Top 10, plus an EU AI Act overlay. The agent runtime and AI gateway you choose need to support the documentation and audit-trail requirements of those frameworks.

If you want the longer version of this analysis, including the agent fleet inventory template, the AI gateway selection matrix, and the 30-day incident-response runbook, our AI Workflow Automation Consulting, AI Readiness, and Data Governance Consulting practices ship the operating model end-to-end.

Which observability tools are mature enough for production agent fleets?

Three categories. Action logging (LangSmith, Arize, Helicone) for the per-action trail. Model evaluation (Galileo, Patronus AI, Weights & Biases Weave) for output quality regression. Infrastructure (Datadog, New Relic, Grafana) for the wider system. Most production fleets need at least one from each category.
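
The "one from each category" rule is easy to check against a fleet inventory. The sketch below uses the grouping from the answer above; the `missing_categories` helper and the category keys are hypothetical names for illustration.

```python
# Tool grouping taken from the three categories described above.
CATEGORIES = {
    "action_logging": {"LangSmith", "Arize", "Helicone"},
    "model_evaluation": {"Galileo", "Patronus AI", "Weights & Biases Weave"},
    "infrastructure": {"Datadog", "New Relic", "Grafana"},
}


def missing_categories(tools_in_use):
    """Return the categories with no tool from the fleet's inventory."""
    in_use = set(tools_in_use)
    return [name for name, tools in CATEGORIES.items() if not (tools & in_use)]
```

For example, `missing_categories(["LangSmith", "Datadog"])` returns `["model_evaluation"]`: that fleet has a per-action trail and system monitoring but nothing watching for output quality regressions.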

How does Thinklytics scope fleet operations work?

Typical engagement: 6 to 10 weeks to instrument the shared services, write the operations playbook, and train the on-call team. Read more at [AI agent consulting](/services/ai-agent-consulting).