Agentic DevOps Glossary: Key Terms From Orchestrator to Tool Use, Defined

DevOS Platform Team | June 19, 2026 | 13 min read

Last week I watched a Slack thread where someone asked what "blast radius" meant in the context of AI agents. Six people replied. Four definitions. Two diagrams. One person insisted it was the same as "failure domain" (it's not, exactly).

That's where we are in mid-2026. The field moves faster than terminology can stabilize, and honestly? I got three definitions wrong when I started writing this. DevOps terms that meant one thing for a decade suddenly mean something different when an AI agent is doing the work.

This glossary focuses on agentic DevOps specifically — the vocabulary you need when AI agents handle infrastructure, deployments, and operations. Not a general AI agent glossary (we wrote that separately). Not a generic DevOps primer. The intersection.

Twenty-five terms. Defined for teams who run production systems. I probably got at least one of these slightly wrong too — the definitions are still settling. Bookmark it anyway.

Core Agentic DevOps Concepts

1. Orchestrator

The coordination layer that manages multiple DevOps agents. An orchestrator assigns tasks — this agent handles database provisioning, that one monitors deployment health, another manages rollbacks. It handles handoffs when one agent's work triggers another's responsibility.

Without an orchestrator, you get chaos. Two agents trying to scale the same service simultaneously. One agent rolling back while another deploys forward. I've seen this break a staging environment for an entire afternoon.

DevOS calls this the Super Orchestrator — the PM for your AI DevOps team, deciding who does what and when. If you're running CI/CD with AI agents, the orchestrator keeps the pipeline from eating itself.
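
To make the coordination problem concrete, here is a minimal Python sketch of one orchestrator duty: refusing to let two agents work on the same resource at once. The `Orchestrator` class, its method names, and the agent names are hypothetical. This is not the DevOS Super Orchestrator's API, just an illustration of the idea.

```python
class Orchestrator:
    """Tracks which agent currently owns which resource (illustrative only)."""

    def __init__(self) -> None:
        self._claims: dict[str, str] = {}  # resource -> agent holding it

    def assign(self, agent: str, resource: str) -> bool:
        """Grant `agent` exclusive work on `resource`; refuse if another agent holds it."""
        holder = self._claims.get(resource)
        if holder is not None and holder != agent:
            return False
        self._claims[resource] = agent
        return True

    def release(self, agent: str, resource: str) -> None:
        """Free the resource, but only if `agent` is the one holding it."""
        if self._claims.get(resource) == agent:
            del self._claims[resource]
```

With this in place, a rollback agent and a deploy agent cannot both claim `service/api`: the second `assign` call returns `False` until the first agent releases it.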

2. Dry-Run Gate

A checkpoint where an agent simulates a change before applying it. The agent generates a Terraform plan, a Kubernetes diff, or a database migration preview — then pauses.

You review the diff. If it looks wrong, the change never happens. If it looks right, you approve and the agent applies it.

Dry-run gates are how you get the speed of agent automation without the terror of "what did it just do to production." DevOS agents have configurable dry-run requirements — some changes auto-approve, others always require human sign-off. (Though if you're auto-approving production database migrations, we should talk.)
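
A rough sketch of the decision behind a dry-run gate, in Python. The `Change` fields, the environment names, and the auto-approve list are assumptions for illustration, not DevOS configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A proposed change plus the diff the agent produced in its dry run."""
    environment: str  # e.g. "staging" or "production" (hypothetical names)
    kind: str         # e.g. "deploy" or "db_migration" (hypothetical names)
    diff: str         # output of the simulated change: plan, diff, or preview


# Hypothetical policy: only these (environment, kind) pairs skip human review.
AUTO_APPROVE = {("dev", "deploy"), ("dev", "db_migration"), ("staging", "deploy")}


def gate(change: Change, human_approved: bool = False) -> str:
    """Return "apply" or "hold" for a change that should already have a dry-run diff."""
    if not change.diff.strip():
        return "hold"  # nothing to review means the dry run did not happen
    if (change.environment, change.kind) in AUTO_APPROVE:
        return "apply"
    return "apply" if human_approved else "hold"
```

Note that production database migrations are deliberately absent from `AUTO_APPROVE`, so they always wait for a human.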

3. Blast Radius

How much damage an agent can cause if something goes wrong. A narrow blast radius means the agent's worst-case mistake affects one service, one environment, maybe one customer segment. A wide blast radius means a mistake cascades across your entire production infrastructure.

Smart guardrails scope blast radius by design. Your staging agent can touch staging. Your production agent can touch one service in prod. Your "emergency" agent (which should exist, reluctantly) can touch more — but requires extra approvals and generates audit logs.

The goal isn't preventing all mistakes — that's impossible. It's ensuring mistakes stay contained. Small fires, not infernos.

4. Tool Use

An agent invoking external capabilities. For DevOps agents, that means: running kubectl, calling AWS/GCP/Azure APIs, executing Terraform, triggering GitHub Actions, querying Prometheus, sending PagerDuty alerts.

The tools an agent can access define what it can actually accomplish. A DevOps agent without tool use just writes YAML and shell scripts you have to run yourself. That's not autonomous — that's fancy autocomplete.

Real DevOps agents act. They provision infrastructure. Deploy code. Scale services. Tool use is what makes that possible — and also what makes them terrifying if you get guardrails wrong.

5. Autonomy Level

How much a DevOps agent handles without human approval. I'll borrow the L1-L5 framing from our general agent glossary:

  • L1: Agent suggests a change, human runs it
  • L2: Agent runs the change, human approves first (most production DevOps today)
  • L3: Agent handles specific categories autonomously (dev deploys, staging, non-critical scaling)
  • L4: Agent manages entire environments end-to-end
  • L5: Full autonomy across all infrastructure (doesn't exist safely in 2026)

Most teams run L2-L3 for DevOps agents. Staging gets L3. Production gets L2. Critical infrastructure gets L1 until you trust the agent more. (Spoiler: that takes longer than you'd expect. Maybe forever for databases.) We covered best practices for human-in-the-loop AI agent teams in depth.
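
One way to encode the split described above (staging at L3, production at L2, critical infrastructure at L1) is a lookup from target to autonomy level. This is a simplified Python sketch: the target names and return values are made up, and it treats L3 and above as "run without asking" even though real L3 only covers specific categories.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    L1 = 1  # agent suggests, human runs
    L2 = 2  # agent runs after human approval
    L3 = 3  # agent handles specific categories autonomously
    L4 = 4  # agent manages entire environments
    L5 = 5  # full autonomy


# Assumed mapping that follows the text; target names are hypothetical.
LEVEL_BY_TARGET = {
    "staging": Autonomy.L3,
    "production": Autonomy.L2,
    "critical": Autonomy.L1,
}


def next_step(target: str, approved: bool = False) -> str:
    """Decide what the agent may do for a change aimed at `target`."""
    level = LEVEL_BY_TARGET.get(target, Autonomy.L1)  # unknown targets get the strictest level
    if level == Autonomy.L1:
        return "suggest_only"
    if level == Autonomy.L2:
        return "run" if approved else "wait_for_approval"
    return "run"
```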

Infrastructure & Deployment Terms

6. Infrastructure Drift Detection

An agent continuously comparing actual infrastructure state to declared state. When drift happens — someone manually changed a security group, an auto-scaler added instances not in Terraform — the agent notices and either alerts or auto-remediates.

Before agents, drift detection was a scheduled job you checked weekly. Maybe. With agents, it's continuous. Still annoying, but at least you know immediately instead of discovering drift during an incident.
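
At its core, drift detection is a comparison of two maps. A minimal Python sketch, assuming both the declared and the actual state have already been loaded into dictionaries keyed by resource ID (the IDs and attributes in the usage example are invented):

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Classify every resource whose actual state differs from its declared state."""
    drift = {}
    for resource_id in declared.keys() | actual.keys():
        if resource_id not in declared:
            drift[resource_id] = "unmanaged"  # exists, but nobody declared it
        elif resource_id not in actual:
            drift[resource_id] = "missing"    # declared, but does not exist
        elif declared[resource_id] != actual[resource_id]:
            drift[resource_id] = "modified"   # attributes differ
    return drift
```

For example, a manually opened port and an instance added outside Terraform would show up like this:

```python
declared = {"sg-web": {"ingress": [443]}, "asg-api": {"size": 3}}
actual = {"sg-web": {"ingress": [443, 22]}, "asg-api": {"size": 3}, "i-extra": {"type": "small"}}
print(detect_drift(declared, actual))
```

Whether the agent alerts or auto-remediates on each category is a policy decision this sketch leaves out.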

7. GitOps Agent

An agent that treats Git as the source of truth for infrastructure. You push a change to your infra repo. The agent sees it, validates it, applies it to the target environment.

ArgoCD and Flux do this for Kubernetes. AI agents take it further — they can also propose changes to the repo based on observed issues, write the PR, and handle the review process.

8. Progressive Deployment Agent

An agent that handles canary, blue-green, or rolling deployments with automatic promotion or rollback based on metrics.

Old way: configure Argo Rollouts, define metric thresholds, hope you didn't mess up the YAML.

Agent way: tell the agent "deploy to 5% of traffic, watch error rate, promote to 25% if it stays under 0.1%, roll back if it spikes." The agent writes the config, monitors the metrics, handles the progression. Still feels weird trusting it. But so did trusting Kubernetes the first time.
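
The instruction in quotes boils down to a small state machine. Here is a hedged Python sketch of it. The 5% and 25% steps and the 0.1% limit come from the example above. The later steps (50%, 100%) and the function name are my assumptions.

```python
STEPS = [5, 25, 50, 100]  # percent of traffic; steps after 25 are assumed
MAX_ERROR_RATE = 0.001    # 0.1%, as a fraction of requests


def next_action(current_percent: int, error_rate: float) -> tuple[str, int]:
    """Return (action, traffic_percent) after observing the current step."""
    if error_rate >= MAX_ERROR_RATE:
        return ("rollback", 0)
    later_steps = [step for step in STEPS if step > current_percent]
    if not later_steps:
        return ("done", current_percent)
    return ("promote", later_steps[0])
```

A real agent would also wait for a minimum observation window before promoting. That part is omitted here.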

9. Self-Healing Infrastructure

Systems that detect failures and recover without human intervention. Auto-scaling. Pod restarts. Node replacement. Failover.

With agents, self-healing extends beyond predefined responses. An agent can diagnose why a pod is crashing, propose a fix, and either apply it or escalate with context. We covered this in how to stop AI agents from going rogue.

10. Environment Parity Agent

An agent that keeps dev, staging, and production environments in sync (where they should be synced) and intentionally different (where they should differ).

Staging needs the same service versions as prod but smaller resources and different secrets. Tracking all that manually is a spreadsheet nightmare. An agent can enforce parity rules and flag when environments drift.

Guardrails & Safety

11. Guardrail

A constraint that prevents unwanted agent behavior. "Don't touch production databases." "Don't modify IAM policies." "Don't deploy after 5pm Friday." (That last one has saved my weekend more than once.)

Guardrails are your safety net when agent judgment fails. And it will fail — not if, when. Good guardrails are explicit (written down), enforced (the agent literally cannot bypass them), and audited (you know when they fire).
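
Here is what "explicit, enforced, audited" can look like for the Friday rule, as a small Python sketch. The logger name is hypothetical, and extending the freeze through the weekend is my assumption. The rule as stated only mentions Friday after 5pm.

```python
import logging
from datetime import datetime

audit_log = logging.getLogger("guardrail.audit")  # hypothetical logger name


def deploy_allowed(now: datetime) -> bool:
    """Enforce "Don't deploy after 5pm Friday" and record every block."""
    friday_evening = now.weekday() == 4 and now.hour >= 17  # Monday is 0, Friday is 4
    weekend = now.weekday() >= 5  # assumption: the freeze lasts through the weekend
    if friday_evening or weekend:
        audit_log.warning("deploy blocked by freeze window at %s", now.isoformat())
        return False
    return True
```

The check only counts as enforced if it runs in the deploy path itself, somewhere the agent cannot skip it, rather than in the agent's prompt.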

12. Approval Gate

A point where human sign-off is required before the agent proceeds. Different from a dry-run gate — approval gates don't require previewing the change, just blessing it.

"Deploy to staging" might auto-approve. "Deploy to production" might require an approval gate.

13. Scope Constraint

A limit on which resources an agent can access. Your deployment agent can touch the api service but not the auth service. Your monitoring agent can read metrics but not modify dashboards.

Scope constraints implement least privilege for AI agents. The broader the scope, the wider the blast radius.
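
A minimal Python sketch of a scope check, using glob-style patterns from the standard library. The agent names, actions, and resource naming scheme are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical grants: agent name -> list of (action, resource pattern).
GRANTS = {
    "deploy-agent": [("write", "service/api"), ("read", "service/*")],
    "monitoring-agent": [("read", "metrics/*")],
}


def is_allowed(agent: str, action: str, resource: str) -> bool:
    """Allow only if some grant matches both the action and the resource."""
    return any(
        action == granted_action and fnmatchcase(resource, pattern)
        for granted_action, pattern in GRANTS.get(agent, [])
    )
```

Anything not explicitly granted is denied, including every request from an agent that has no entry at all.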

14. Rollback Trigger

A condition that automatically initiates rollback. Error rate above X%. Latency above Y ms. Health check failing for Z seconds.

Honestly? Set those thresholds yourself and let the agent execute. Letting the agent define when to roll back is scarier than letting it do the rollback. I don't fully trust my own threshold judgment half the time — why would I trust the agent's?
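
That split (human sets thresholds, agent evaluates them) could be sketched in Python like this. The metric keys and field names are assumptions, not any monitoring tool's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Human-chosen limits; the agent only evaluates them."""
    max_error_rate: float         # fraction of requests, e.g. 0.01 for 1%
    max_latency_ms: float
    max_unhealthy_seconds: float


def rollback_reasons(metrics: dict, limits: Thresholds) -> list[str]:
    """Return the triggers that fired; an empty list means keep the release."""
    reasons = []
    if metrics["error_rate"] > limits.max_error_rate:
        reasons.append("error_rate")
    if metrics["latency_ms"] > limits.max_latency_ms:
        reasons.append("latency_ms")
    if metrics["unhealthy_seconds"] > limits.max_unhealthy_seconds:
        reasons.append("unhealthy_seconds")
    return reasons
```

Returning the reasons instead of a bare boolean means the rollback shows up in the audit trail with its cause attached.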

Observability & Monitoring

15. Observability Agent

An agent that manages your monitoring, logging, and tracing infrastructure. Creates dashboards. Tunes alerts. Detects anomalies before they become incidents.

The dream: you deploy a new service and the agent auto-instruments it, creates dashboards, and sets alert thresholds. The reality: agents can do most of this, but "reasonable thresholds" still requires tuning. And then more tuning. And then someone complains about alert fatigue. You know the drill.

16. Alert Triage Agent

An agent that receives alerts and determines: is this actionable? Is it a duplicate? Should it page someone, or wait? Does it correlate with other alerts?

Alert fatigue is real. A triage agent filters the noise — routing real issues to humans and handling the false positives. For teams building this out, JustAnalytics tracks incident metrics that can feed back into triage rules.
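
The first two triage questions (duplicate? page or wait?) fit in a few lines of Python. This sketch assumes each alert is a dictionary with `service`, `name`, and `severity` keys, which is an invented shape. Correlation across alerts is left out.

```python
def triage(alerts: list[dict]) -> dict:
    """Split alerts into those that page, those that wait in a queue, and duplicates."""
    seen = set()
    result = {"page": [], "queue": [], "dropped": []}
    for alert in alerts:
        fingerprint = (alert["service"], alert["name"])
        if fingerprint in seen:
            result["dropped"].append(alert)  # same alert already handled
            continue
        seen.add(fingerprint)
        bucket = "page" if alert["severity"] == "critical" else "queue"
        result[bucket].append(alert)
    return result
```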

17. Root Cause Inference

An agent correlating symptoms to identify why a failure happened. Service A is slow → traces show Service B timeout → Service B logs show database connection exhaustion → database metrics show spike from a bad query deployed yesterday.

Humans do this manually during incidents — usually at 3am, bleary-eyed, with half the context. Agents can run the same correlation faster, especially when they have access to logs, metrics, and deployment history in one context. Not always right. But faster.
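
Stripped of the log and trace parsing, the correlation chain above is a walk down a dependency graph. A deliberately naive Python sketch follows. The dependency map and the set of unhealthy components are assumed inputs, and real inference would weigh evidence and not simply take the first unhealthy dependency.

```python
def trace_root_cause(symptom: str, depends_on: dict, unhealthy: set) -> list[str]:
    """Follow unhealthy dependencies from the symptom to the deepest unhealthy one."""
    path = [symptom]
    current = symptom
    while True:
        candidates = [
            dep for dep in depends_on.get(current, [])
            if dep in unhealthy and dep not in path
        ]
        if not candidates:
            return path  # the last entry is the best guess at the root cause
        current = candidates[0]  # naive: take the first unhealthy dependency
        path.append(current)
```

The returned path doubles as the explanation: it is the chain of symptoms a human would want to see in the incident report.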

Agent Coordination Terms

18. Handoff Protocol

How one agent transfers work to another. The Planner agent scopes infrastructure needs; the DevOps agent provisions them. What context gets passed?

Bad handoffs lose context and cause rework. Good handoffs include: what was decided, what's expected, what constraints exist.

19. Conflict Resolution

What happens when agents disagree or overlap. One agent wants to scale up; another wants to scale down based on different metrics. One agent deploys while another runs a load test.

The orchestrator needs rules. First-mover wins? Priority ranking? Human arbitration? Without explicit conflict resolution, agents step on each other. It's like giving two toddlers one toy. Someone has to referee. We addressed this in how to fix AI agent coordination in multi-agent sprints.
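
As one example of an explicit rule, here is a Python sketch that combines priority ranking with first-mover as the tie-breaker. The priority order and the request shape are assumptions. Your ranking will differ.

```python
# Hypothetical ranking: a lower number wins.
PRIORITY = {"rollback": 0, "deploy": 1, "scale": 2, "load_test": 3}


def rank(action: str) -> int:
    """Unknown actions rank last."""
    return PRIORITY.get(action, len(PRIORITY))


def resolve(requests: list[dict]) -> dict:
    """Pick one winning request per resource; `requests` is assumed to be in arrival order."""
    winners = {}
    for request in requests:
        current = winners.get(request["resource"])
        # Strictly-better priority replaces the holder; ties keep the earlier request.
        if current is None or rank(request["action"]) < rank(current["action"]):
            winners[request["resource"]] = request
    return winners
```

Requests that lose are not handled here. In practice they would be queued, rejected, or sent to a human.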

20. Escalation Path

The route an agent takes when it recognizes it's stuck and asks for help. Good escalation includes: what the agent tried, why it failed, what options exist, and how urgent it is.

Bad escalation: "I couldn't do the thing." Thanks, that's super helpful. Train your agents to escalate with context or don't let them escalate at all.
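
One way to force context into every escalation is to make the four fields mandatory in the data structure itself. A Python sketch, with invented field names and urgency values:

```python
from dataclasses import dataclass

URGENCY_LEVELS = ("low", "medium", "high")  # assumed scale


@dataclass(frozen=True)
class Escalation:
    tried: list[str]     # what the agent attempted
    failure_reason: str  # why it failed
    options: list[str]   # what the human could choose next
    urgency: str         # one of URGENCY_LEVELS

    def __post_init__(self) -> None:
        """Reject escalations that arrive without usable context."""
        if not self.tried:
            raise ValueError("escalation must list what was tried")
        if not self.failure_reason.strip():
            raise ValueError("escalation must explain why it failed")
        if not self.options:
            raise ValueError("escalation must offer at least one option")
        if self.urgency not in URGENCY_LEVELS:
            raise ValueError(f"urgency must be one of {URGENCY_LEVELS}")
```

An agent that tries to send "I couldn't do the thing" with empty `tried` and `options` lists gets a `ValueError` and never pages anyone.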

Advanced Concepts

21. Policy-as-Code Agent

An agent that enforces organizational policies written as code. Security policies. Compliance requirements. Cost constraints.

"Every S3 bucket must have encryption enabled." The agent checks proposed changes against the policy and blocks violations before they deploy. Like OPA or Sentinel, but with natural language policy understanding.

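The S3 encryption rule, written as a check over a proposed change, might look like the Python sketch below. The resource dictionaries are a made-up shape for illustration. This is not the Terraform plan format, and it is not OPA or Sentinel syntax.

```python
def unencrypted_buckets(planned_resources: list[dict]) -> list[str]:
    """Return the names of S3 buckets in a proposed change that lack encryption."""
    return [
        resource["name"]
        for resource in planned_resources
        if resource.get("type") == "s3_bucket" and not resource.get("encrypted", False)
    ]


def allow_change(planned_resources: list[dict]) -> bool:
    """Block the whole change if any bucket violates the policy."""
    return not unencrypted_buckets(planned_resources)
```

A bucket with no `encrypted` key counts as a violation, so forgetting the setting fails closed.
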
22. Chaos Engineering Agent

An agent that intentionally introduces failures to test resilience. Kills pods. Injects latency. Simulates region outages.

The agent doesn't just run chaos experiments — it observes results, identifies weaknesses, proposes fixes.

I remain skeptical of letting agents decide what chaos to introduce. Human-designed chaos experiments with agent execution? Sure. Agent-designed chaos experiments? That's how you get creative failures you never anticipated.

23. Cost Optimization Agent

An agent that analyzes infrastructure spend and recommends savings. Right-sizing instances. Deleting orphaned resources. Moving workloads to spot instances. Scheduling dev environments to shut down at night.

DevOS applies this to AI costs by routing requests across Anthropic, Google, DeepSeek, and OpenAI, picking the cheapest capable model per task. The same principle applies to infrastructure: not every job needs the premium hardware.

24. Compliance Audit Agent

An agent that continuously validates infrastructure against compliance requirements. SOC 2. HIPAA. PCI-DSS. GDPR data residency.

Instead of quarterly audits with spreadsheets, the agent runs continuous checks. When you need to prove compliance for a customer audit, the report already exists.

25. Multi-Model Routing

Sending different DevOps tasks to different AI models based on the job. Simple tasks (parse this log, generate this YAML) go to a fast, cheap model. Complex tasks (debug this intermittent failure, design this scaling strategy) go to a more capable model.

Not all DevOps work needs Claude Opus. Most doesn't. A simple log parse? Fast cheap model. Debugging a gnarly race condition? Bring in the big guns. Routing saves money without sacrificing quality where it matters. If you're worried about costs spiraling, see our guide on fixing AI agent cost explosion with token budgets.
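
The routing table itself can be very small. A Python sketch, with placeholder model names and task labels (none of these are real model identifiers or DevOS settings), plus one assumption: task types the router has never seen go to the more capable tier.

```python
# Placeholder tier names; swap in whatever model identifiers your providers use.
MODEL_BY_TIER = {"simple": "fast-cheap-model", "complex": "more-capable-model"}

# Task types taken from the examples in the text; the keys are hypothetical labels.
TIER_BY_TASK = {
    "parse_log": "simple",
    "generate_yaml": "simple",
    "debug_intermittent_failure": "complex",
    "design_scaling_strategy": "complex",
}


def pick_model(task_type: str) -> str:
    """Route a task to a model tier; unknown task types get the capable tier."""
    tier = TIER_BY_TASK.get(task_type, "complex")
    return MODEL_BY_TIER[tier]
```

Defaulting unknown work to the expensive tier costs more but avoids silently under-powering a hard task. Flip the default if cost matters more to you.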

Honorable Mentions

State Lock: Preventing concurrent modifications to infrastructure. Terraform uses state locking to stop two deploys from colliding. Agents need similar coordination.

Idempotency: The property of an operation that produces the same result whether it runs once or many times. Essential for agents that might retry failed operations. For more on agent memory between retries, see designing memory systems for coding agents.
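
A tiny Python illustration of the difference, using an in-memory dictionary as a stand-in for real infrastructure state (the function and service names are invented). "Set replicas to 3" is safe to retry. "Add 2 replicas" is not.

```python
def ensure_replicas(state: dict, service: str, replicas: int) -> bool:
    """Idempotent: set the desired count; return True only if something changed."""
    if state.get(service) == replicas:
        return False
    state[service] = replicas
    return True


def add_replicas(state: dict, service: str, extra: int) -> None:
    """Not idempotent: every retry adds more replicas."""
    state[service] = state.get(service, 0) + extra
```

Retrying `ensure_replicas(state, "api", 3)` after a timeout leaves the service at 3 replicas. Retrying `add_replicas(state, "api", 2)` keeps growing it.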

Blast Door: A maximum limit on what an agent can change in one action. "Modify up to 5 resources per plan. Bigger changes require human review."
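
As a Python sketch, using the limit of 5 from the example (the function name and return values are assumptions):

```python
MAX_RESOURCES_PER_PLAN = 5  # limit taken from the example in the text


def blast_door(planned_resource_ids: list[str], limit: int = MAX_RESOURCES_PER_PLAN) -> str:
    """Return "auto" for small plans and "human_review" for anything larger."""
    touched = set(planned_resource_ids)  # count each resource once
    return "auto" if len(touched) <= limit else "human_review"
```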

Quick Verdict

If you take one thing from this glossary: dry-run gate and blast radius are the concepts that matter most for safe DevOps automation with agents.

Dry-run gates give you review before damage. Blast radius scoping limits damage when review fails — and it will, eventually, because humans get tired and approve things they shouldn't. Everything else is important, but those two determine whether you can sleep at night while agents manage your infrastructure.

Or at least sleep slightly better.

Frequently Asked Questions

What is an orchestrator in agentic DevOps?

An orchestrator coordinates multiple AI agents working on DevOps tasks. It assigns deployments, manages handoffs between infrastructure agents, resolves conflicts when agents target the same resources, and tracks pipeline progress. Think of it as the conductor for your AI DevOps team — deciding which agent handles provisioning vs monitoring vs rollback.

What is a dry-run gate?

A dry-run gate is a checkpoint where an agent simulates a change without applying it. The agent generates a Terraform plan, Kubernetes diff, or deployment preview — then pauses for review. If the diff looks wrong, the change never reaches production. Dry-run gates are how teams let agents propose infrastructure changes without trusting them blindly.

What does blast radius mean for AI agents?

Blast radius describes how much damage an agent can cause if something goes wrong. A narrow blast radius means the agent can only affect one service or one environment. A wide blast radius means a mistake could cascade across production infrastructure. Smart guardrails scope blast radius by limiting which resources an agent can touch.

What is tool use in the context of DevOps agents?

Tool use means an AI agent invoking external capabilities — running kubectl commands, calling cloud provider APIs, executing Terraform, triggering CI pipelines. The tools an agent can access define what it can actually accomplish. A DevOps agent without tool use is just a chatbot that writes YAML you have to copy-paste.


Join the DevOS Waitlist

AI agents that work as employees inside your sprints — not single-task copilots. Planner, Developer, QA, and DevOps agents pick up tickets from the backlog, ship in branches, request review. Pre-launch, so join the waitlist if any of this glossary made you think "I want that."
