Letting an AI Agent Handle Your On-Call: Building Safe, Runbook-Driven Incident Response
The page came in at 4:17 AM. Redis connection timeout, production, severity high.
I fumbled for my phone, squinted at the screen, and did exactly what I'd done the last six times this alert fired: ran the cache flush runbook. Three commands. Maybe 90 seconds. Back to sleep.
The next morning, I stared at my PagerDuty history and thought: why am I the one doing this?
I'm not adding judgment. I'm not making decisions. I'm mechanically executing a documented procedure while half-asleep — and probably making typos, if I'm being honest about my 4 AM keyboard skills. An agent could do this. Probably better than I can at 4 AM.
That's what we're building here. An AI on-call agent that handles Tier-1 incidents: scoped permissions, runbook-driven actions, blast-radius limits, and automatic escalation to a human when things go sideways. By the end, you'll have a setup where simple incidents get resolved without waking anyone up, and complex ones still page a human within minutes.
Fair warning: this isn't "let the AI do whatever it wants." It's closer to "let the AI operate inside a very well-defined sandbox." The guardrails matter more than the intelligence.
Prerequisites
You'll need:
- A runbook repository with documented incident procedures (we'll use Markdown files)
- PagerDuty, Opsgenie, or similar alerting tool with webhook support
- An execution environment for the agent (we'll use a lightweight Node.js service)
- API credentials for your infrastructure (scoped, read-only where possible, limited write for specific actions)
- Slack or similar for incident logging and escalation — you can also pipe alerts through JustEmails for compliance-critical notifications
- 30 minutes to set this up, plus another hour to tune your runbooks
This tutorial assumes you already have runbooks. If you're winging it without documented procedures, stop here and write those first. An AI agent executing undocumented chaos isn't on-call — it's a liability.
(I learned this the hard way. Spent two weeks building an agent before realizing our runbooks were basically vibes and tribal knowledge. Had to pause and actually document things. Annoying, but necessary.)
What We're Building
The flow looks like this:
- Alert fires in PagerDuty
- Webhook triggers your on-call agent
- Agent reads the alert, matches it to a runbook
- Agent executes runbook steps sequentially, logging everything
- If a step fails twice, agent escalates to human
- If all steps succeed, agent resolves the alert and posts a summary
The key constraint: the agent never improvises. It only executes actions defined in runbooks you've written. Runbook-driven automation with an AI brain for selection and adaptation. Not freeform incident response.
Strong opinion: I think the "AI takes over your infrastructure" crowd is building the wrong thing. Autonomy without constraints is how you get 3 AM pages that say "AI agent deleted prod database." No thanks.
Step 1: Structure Your Runbooks for Agent Execution
Your runbooks probably exist. They're probably Markdown files or Confluence pages with headers like "Steps" and code blocks with commands. That's fine — we're going to add some structure so the agent knows what to do.
Create a runbook format the agent can parse:
# Runbook: Redis Connection Timeout
## Trigger Conditions
- Alert contains: "redis", "connection", "timeout"
- Severity: high or critical
- Service: api-gateway, user-service
## Blast Radius
- max_pod_restarts: 0
- max_cache_operations: 1
- requires_human_approval: false
## Steps
### Step 1: Flush the connection pool
```bash
# Note: FLUSHDB deletes every key in the currently selected Redis database
kubectl exec -n production deploy/redis-primary -- redis-cli FLUSHDB
```
Expected output: "OK"
Failure action: continue
### Step 2: Verify connections restored
```bash
kubectl exec -n production deploy/api-gateway -- curl -s localhost:8080/health | jq .redis
```
Expected output: contains "connected"
Failure action: escalate
## Escalation
If any step with "failure action: escalate" fails twice, page the on-call engineer via PagerDuty.
A few things to notice:
**Trigger conditions** tell the agent when this runbook applies. The agent matches alerts against these conditions to select the right runbook. Be specific — you don't want the redis runbook running for postgres alerts.
**Blast radius** limits what the agent can do. Even if the runbook says "restart all pods," the `max_pod_restarts: 0` setting stops the agent from doing that. (Why have the limit if the runbook doesn't include pod restarts? Defense in depth. Someone might edit the runbook later and forget the constraint.)
**Failure action** per step. "Continue" means try the next step. "Escalate" means page a human. Don't make every step escalate — that defeats the purpose. But do escalate when the fix didn't work and you're out of automated options.
Store these in a `/runbooks` directory. The agent will load them at startup.
Getting the runbook format right took us three iterations. First version was too loose — agent couldn't parse it reliably. Second version was too rigid — writing runbooks felt like filling out tax forms. Third version is what you see above. Still not perfect, but it works.
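For the "load them at startup" part, here is a minimal sketch of what a parser for this format could look like. It is not the loader we actually run: the function names (`parseRunbook`, `toCamel`, `splitList`, `parseScalar`) and the camelCased field names (`alertContains`, `maxPodRestarts`) are my assumptions for illustration. It relies on the headings and the "Expected output" / "Failure action" lines appearing exactly as in the example runbook.

```javascript
// runbook-loader.js (sketch): a hypothetical parser for the Markdown format above.
const FENCE = '`'.repeat(3); // the three-backtick marker that opens/closes a command block

// "Alert contains" -> "alertContains", "max_pod_restarts" -> "maxPodRestarts"
function toCamel(key) {
  return key.trim().toLowerCase().replace(/[_\s]+(\w)/g, (_, c) => c.toUpperCase());
}

// '"redis", "timeout"' or 'high or critical' -> lowercase array of values
function splitList(value) {
  return value
    .split(/\s*(?:,|\bor\b)\s*/)
    .map((v) => v.replace(/^"|"$/g, '').toLowerCase())
    .filter(Boolean);
}

// '0' -> 0, 'false' -> false, anything else stays a string
function parseScalar(value) {
  if (/^\d+$/.test(value)) return Number(value);
  if (value === 'true' || value === 'false') return value === 'true';
  return value;
}

function parseRunbook(markdown) {
  const runbook = { name: '', triggers: {}, blastRadius: {}, steps: [] };
  let section = '';
  let step = null;
  let inFence = false;

  for (const raw of markdown.split('\n')) {
    const line = raw.trim();
    let m;

    if (line.startsWith(FENCE)) {
      inFence = !inFence;
    } else if (inFence) {
      // Lines inside a code fence are the command for the current step
      if (step) step.command = step.command ? `${step.command}\n${line}` : line;
    } else if ((m = line.match(/^# Runbook:\s*(.+)$/))) {
      runbook.name = m[1];
    } else if ((m = line.match(/^### Step \d+:\s*(.+)$/))) {
      // Default to the safe option when a step omits its failure action
      step = { description: m[1], command: '', expected: '', failureAction: 'escalate' };
      runbook.steps.push(step);
    } else if ((m = line.match(/^## (.+)$/))) {
      section = m[1].toLowerCase();
      step = null;
    } else if ((m = line.match(/^- ([^:]+):\s*(.+)$/))) {
      if (section === 'trigger conditions') runbook.triggers[toCamel(m[1])] = splitList(m[2]);
      if (section === 'blast radius') runbook.blastRadius[toCamel(m[1])] = parseScalar(m[2]);
    } else if (step && (m = line.match(/^Expected output:\s*(.+)$/i))) {
      step.expected = m[1];
    } else if (step && (m = line.match(/^Failure action:\s*(\w+)$/i))) {
      step.failureAction = m[1].toLowerCase();
    }
  }
  return runbook;
}
```

A step with no "Failure action" line defaults to `escalate` here, on the theory that a typo in a runbook should wake a human rather than be silently skipped.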
Step 2: Build the Alert Handler
Create the webhook endpoint that receives alerts. I'm using Node.js because it's what we had lying around — adapt to your stack.
```javascript
// incident-agent.js
const express = require('express');
const { loadRunbooks, matchRunbook } = require('./runbook-loader');
const { executeStep } = require('./executor');
const { escalate, logToSlack, resolveAlert } = require('./integrations');

const app = express();
app.use(express.json());

const runbooks = loadRunbooks('./runbooks');

app.post('/webhook/pagerduty', (req, res) => {
  const alert = req.body;

  // Acknowledge right away so the webhook isn't blocked, then work in the background
  res.status(200).send({ received: true });

  // If the agent itself crashes, hand the incident to a human instead of failing silently
  handleIncident(alert).catch((err) =>
    escalate(alert, `Agent error: ${err.message}`).catch(console.error)
  );
});

async function handleIncident(alert) {
  const incidentId = alert.incident.id;
  const title = alert.incident.title;
  const severity = alert.incident.urgency;

  await logToSlack(`🚨 Incident received: ${title} (${severity})`);

  // Find the matching runbook
  const runbook = matchRunbook(runbooks, alert);
  if (!runbook) {
    await logToSlack(`⚠️ No runbook matched. Escalating to human.`);
    await escalate(alert, 'No matching runbook found');
    return;
  }
  await logToSlack(`📋 Matched runbook: ${runbook.name}`);

  // Execute steps with blast-radius enforcement
  let stepIndex = 0;
  for (const step of runbook.steps) {
    stepIndex++;
    await logToSlack(`▶️ Step ${stepIndex}: ${step.description}`);

    // Each step gets at most two attempts
    let attempts = 0;
    let success = false;
    while (attempts < 2 && !success) {
      attempts++;
      const result = await executeStep(step, runbook.blastRadius);
      if (result.success) {
        success = true;
        await logToSlack(`✅ Step ${stepIndex} succeeded`);
      } else {
        await logToSlack(`❌ Step ${stepIndex} attempt ${attempts} failed: ${result.error}`);
      }
    }

    if (!success) {
      if (step.failureAction === 'escalate') {
        await logToSlack(`🔴 Escalating after step ${stepIndex} failed twice`);
        await escalate(alert, `Runbook step failed: ${step.description}`);
        return;
      }
      // failure action is 'continue', move to next step
    }
  }

  // All steps completed
  await resolveAlert(incidentId);
  await logToSlack(`✅ Incident ${incidentId} resolved automatically`);
}

// The port is an assumption; use whatever your deployment expects
app.listen(process.env.PORT || 3000);
```
This is the core loop. Alert comes in, match it to a runbook, execute steps, log everything, escalate or resolve.
The executeStep function (not shown in full) runs the shell command from the runbook and compares output against expected patterns. The blastRadius object gets passed in so the executor can enforce limits — if you've hit max_cache_operations: 1 and the runbook wants another cache flush, the executor refuses and logs why.
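To make the enforcement concrete, here is a rough sketch of how an executor along these lines might look. It is an illustration, not our production code. The `COUNTED_ACTIONS` patterns, the camelCased limit names, and the "Expected output" syntax handling are assumptions. It also departs from the two-argument call in the handler above: it takes a third `usage` object that the caller would create once per incident so counts persist across steps, plus an injectable `runCommand` so it can be exercised without touching a cluster.

```javascript
// executor.js (sketch)
const { exec } = require('node:child_process');
const { promisify } = require('node:util');
const run = promisify(exec);

// Hypothetical mapping from command patterns to the blast-radius counter they consume
const COUNTED_ACTIONS = [
  { limit: 'maxCacheOperations', pattern: /\bFLUSH(DB|ALL)\b/i },
  { limit: 'maxPodRestarts', pattern: /\b(rollout restart|delete pod)\b/i },
];

// Refuses a command once its counter has reached the runbook's limit.
function checkBlastRadius(command, blastRadius, usage) {
  for (const { limit, pattern } of COUNTED_ACTIONS) {
    if (!pattern.test(command)) continue;
    const max = blastRadius[limit] ?? 0; // a limit the runbook doesn't list means "not allowed"
    const used = usage[limit] ?? 0;
    if (used >= max) return { allowed: false, reason: `${limit} limit of ${max} reached` };
    usage[limit] = used + 1;
  }
  return { allowed: true };
}

// Supports two forms: contains "text" (substring) or "text" (exact match after trimming)
function outputMatches(stdout, expected) {
  const contains = expected.match(/^contains\s+"(.*)"$/i);
  if (contains) return stdout.includes(contains[1]);
  return stdout.trim() === expected.replace(/^"|"$/g, '');
}

async function executeStep(step, blastRadius, usage, runCommand = run) {
  const gate = checkBlastRadius(step.command, blastRadius, usage);
  if (!gate.allowed) return { success: false, error: `Blocked: ${gate.reason}` };

  try {
    const { stdout } = await runCommand(step.command, { timeout: 30000 });
    return outputMatches(stdout, step.expected)
      ? { success: true, output: stdout }
      : { success: false, error: `Unexpected output: ${stdout.trim()}` };
  } catch (err) {
    return { success: false, error: err.message };
  }
}
```

One consequence worth knowing: because the counter is consumed when the command is attempted, a retry of a failed cache flush is refused under `max_cache_operations: 1`. That is conservative on purpose. If you'd rather count only successful operations, move the increment to after the command returns.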
Step 3: Scope Your Permissions Ruthlessly
This is where most teams screw up. And honestly? We screwed it up too, initially.
Teams give the agent broad credentials because "it's easier" and then wonder why their incident response agent accidentally terminated a production database. We didn't go that far, but our first agent had write access to more namespaces than it needed. Nothing bad happened, but the potential was there, so we tightened it up fast.
Create an IAM role (or equivalent) for the agent with exactly the permissions it needs:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster"
      ],
      "Resource": "arn:aws:eks:us-east-1:*:cluster/production"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:DescribeCacheClusters",
        "elasticache:RebootCacheCluster"
      ],
      "Resource": "arn:aws:elasticache:us-east-1:*:cluster:prod-redis-*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "rds:DeleteDBInstance",
        "eks:DeleteCluster",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}
```
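The explicit Deny statements aren't strictly necessary for this policy on its own (anything not allowed is denied by default), but an explicit Deny also wins over an Allow granted by any other policy attached to the role, and I like having them visible. Anyone reviewing this policy sees immediately: "Oh, this role cannot delete databases or modify IAM." Makes the blast-radius limits obvious.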
IAM only controls access to the AWS APIs. What the agent can do to pods is governed by Kubernetes RBAC, so for Kubernetes, use a ServiceAccount with a restricted Role:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: incident-agent
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
    # Note: this allows exec into any pod in this namespace; RBAC cannot limit it to specific containers
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get"]
# Note: no "delete", no "patch" for replicas
```
The agent can exec into pods to run commands (for the redis-cli example) but cannot restart pods, scale deployments, or delete anything. If a runbook requires pod restarts, you'd add that permission — but then your max_pod_restarts limit in the runbook becomes critical.
Yes, this is tedious. Yes, you have to maintain two sets of constraints (IAM + runbook limits). Deal with it. The alternative is trusting an AI agent with prod credentials at 4 AM.
Step 4: Add Human-in-the-Loop Escalation
The agent should never be the last line of defense. Escalation paths matter.
```javascript
// integrations.js
async function escalate(alert, reason) {
  // Log the escalation
  await logToSlack(`🔴 ESCALATING: ${reason}`);

  // Trigger PagerDuty to page the human on-call
  const response = await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_ROUTING_KEY,
      event_action: 'trigger',
      payload: {
        summary: `AI agent escalation: ${reason}`,
        severity: 'critical',
        source: 'incident-agent',
        custom_details: {
          original_alert: alert.incident.title,
          agent_actions: await getIncidentLog(alert.incident.id),
          reason_for_escalation: reason
        }
      }
    })
  });

  // A page that silently failed to send is the worst outcome, so surface it
  if (!response.ok) {
    throw new Error(`PagerDuty escalation failed with HTTP ${response.status}`);
  }
}
```
When the human gets paged, they see: what the original alert was, what the agent tried, and why it escalated. They're not starting from zero — they're picking up where the agent stopped.
You can also configure escalation triggers beyond runbook failures:
- Alert severity is "critical" and estimated downtime exceeds 5 minutes
- Agent detects the same alert firing 3+ times in an hour (possible flapping or deeper issue) — track these patterns in JustAnalytics for postmortem trend analysis
- Runbook execution takes longer than 10 minutes (something's stuck)
- Agent uncertainty: if using an LLM to match runbooks, confidence below 80% triggers human confirmation
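The triggers in that list can be folded into one check that runs before each step. The sketch below is illustrative only: the `incident` fields (`previousFirings`, `startedAt`, `estimatedDowntimeMs`, `matchConfidence`) are hypothetical names, and how you estimate downtime or obtain a match confidence is up to your own setup.

```javascript
// Hypothetical thresholds, taken from the list above
const LIMITS = {
  repeatsPerHour: 3,
  maxRuntimeMs: 10 * 60 * 1000,
  maxCriticalDowntimeMs: 5 * 60 * 1000,
  minMatchConfidence: 0.8,
};

// Returns a reason string if a human should be paged now, otherwise null.
function earlyEscalationReason(incident, now = Date.now()) {
  const hourAgo = now - 60 * 60 * 1000;
  // previousFirings holds timestamps (ms) of earlier firings of this alert; +1 counts this one
  const firings = incident.previousFirings.filter((t) => t >= hourAgo).length + 1;

  if (firings >= LIMITS.repeatsPerHour) {
    return `alert fired ${firings} times in the last hour`;
  }
  if (now - incident.startedAt > LIMITS.maxRuntimeMs) {
    return 'runbook execution exceeded 10 minutes';
  }
  if (incident.severity === 'critical' && incident.estimatedDowntimeMs > LIMITS.maxCriticalDowntimeMs) {
    return 'critical alert with estimated downtime over 5 minutes';
  }
  if (incident.matchConfidence !== undefined && incident.matchConfidence < LIMITS.minMatchConfidence) {
    return 'runbook match confidence below 80%';
  }
  return null;
}
```

Calling something like this at the top of the step loop, and passing any non-null reason to `escalate`, keeps the "when do we give up" logic in one place instead of scattered across the handler.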
Step 5: Audit Everything
The agent needs to explain itself. Not just for postmortems — for trust.
Here's the thing: if engineers don't know what the agent did, they won't trust it to run unsupervised. And they shouldn't. I've been on teams where "the automation handled it" meant nobody knew what actually happened. That's not incident response. That's hope.
```javascript
const incidentLogs = new Map();

async function logToSlack(message) {
  const timestamp = new Date().toISOString();
  const formatted = `[${timestamp}] ${message}`;

  // Store locally for the incident summary
  const currentIncident = getCurrentIncidentId();
  if (currentIncident) {
    const logs = incidentLogs.get(currentIncident) || [];
    logs.push(formatted);
    incidentLogs.set(currentIncident, logs);
  }

  // Post to Slack
  await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: formatted })
  });
}
```
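The `getCurrentIncidentId()` call matters more than it looks: if two alerts arrive at once, a plain global variable would mix their logs. One possible way to implement it, sketched here with Node's built-in `AsyncLocalStorage`, is to bind the incident ID to the async call chain. The `withIncident` wrapper is a hypothetical helper name, not something from the code above.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Holds the ID of the incident being handled on the current async call chain
const incidentContext = new AsyncLocalStorage();

// Returns the incident ID for the current call chain, or undefined outside one
function getCurrentIncidentId() {
  return incidentContext.getStore()?.incidentId;
}

// Runs `fn` so that every await inside it sees the same incident ID
function withIncident(incidentId, fn) {
  return incidentContext.run({ incidentId }, fn);
}
```

With this in place, the webhook handler would call `withIncident(alert.incident.id, () => handleIncident(alert))`, and every `logToSlack` call made during that incident lands in the right log without threading an ID through each function.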
At the end of every incident (resolved or escalated), post a full summary:
📋 Incident Summary: Redis Connection Timeout
Duration: 5s
Runbook: redis-connection-timeout
Steps executed: 2/2
Result: Resolved automatically
Timeline:
[04:17:23] 🚨 Incident received
[04:17:24] 📋 Matched runbook: redis-connection-timeout
[04:17:24] ▶️ Step 1: Flush the connection pool
[04:17:26] ✅ Step 1 succeeded
[04:17:26] ▶️ Step 2: Verify connections restored
[04:17:28] ✅ Step 2 succeeded
[04:17:28] ✅ Incident resolved automatically
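A summary like this is easy to assemble from the stored log lines. The following is a small sketch under assumed field names (`title`, `runbook`, `startedAt`, `endedAt`, `stepsExecuted`, `stepsTotal`, `result`, `timeline`). None of these come from a real API, so rename them to match whatever your handler tracks.

```javascript
// 154000 ms -> "2m 34s"; durations under a minute drop the minutes part
function formatDuration(ms) {
  const totalSeconds = Math.round(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return minutes > 0 ? `${minutes}m ${seconds}s` : `${seconds}s`;
}

// Builds the plain-text summary posted at the end of an incident
function buildSummary(incident) {
  const { title, runbook, startedAt, endedAt, stepsExecuted, stepsTotal, result, timeline } = incident;
  return [
    `📋 Incident Summary: ${title}`,
    `Duration: ${formatDuration(endedAt - startedAt)}`,
    `Runbook: ${runbook}`,
    `Steps executed: ${stepsExecuted}/${stepsTotal}`,
    `Result: ${result}`,
    'Timeline:',
    ...timeline,
  ].join('\n');
}
```

Deriving the duration from the same timestamps that appear in the timeline keeps the two from drifting apart, which is an easy mistake to make when the summary is assembled by hand.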
This log goes to Slack and to your incident management system. We pipe incident data to JustAnalytics for tracking resolution times and runbook effectiveness.
Common Errors and Fixes
Agent matches the wrong runbook
Your trigger conditions are too broad. "Alert contains: timeout" will match both redis timeouts and HTTP gateway timeouts. Be more specific: "Alert contains: redis AND timeout AND service:api-gateway".
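In code, "AND" means every keyword must be present and the service must match, not just any one of them. Here is a minimal sketch of a matcher that works that way. It assumes each runbook has been parsed into a `triggers` object with hypothetical `alertContains`, `severity`, and `service` arrays of lowercase strings, and that the service name arrives at `alert.incident.service.summary`; check the payload your alerting tool actually sends.

```javascript
// True only when ALL trigger conditions hold for this alert
function matchesTriggers(triggers, alert) {
  const title = alert.incident.title.toLowerCase();
  const keywords = triggers.alertContains ?? [];
  const severities = triggers.severity ?? [];
  const services = triggers.service ?? [];
  const serviceName = alert.incident.service?.summary ?? '';

  return (
    keywords.length > 0 && // a runbook with no keywords must never match everything
    keywords.every((k) => title.includes(k)) &&
    (severities.length === 0 || severities.includes(alert.incident.urgency)) &&
    (services.length === 0 || services.includes(serviceName))
  );
}

// Returns the first runbook whose triggers all match, or null
function matchRunbook(runbooks, alert) {
  return runbooks.find((rb) => matchesTriggers(rb.triggers, alert)) ?? null;
}
```

The `keywords.length > 0` guard is there because `[].every(...)` is always true: without it, a runbook with an empty trigger list would match every alert.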
Blast-radius limit blocks a necessary action
Good — the limit did its job. If the action is actually safe, increase the limit in that specific runbook. Don't raise global limits.
Agent succeeds but the alert keeps firing
The runbook fixed the symptom, not the cause. This isn't an agent problem — it's a runbook problem. Add a step that checks whether the underlying issue is actually resolved, not just whether the immediate symptom cleared.
This one bit us hard. Had a runbook that restarted a service, service came back, alert cleared. Twenty minutes later, same alert. Turns out there was a memory leak, and we were just cycling the OOM-kill. The agent did exactly what we asked. We asked for the wrong thing.
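One way to catch that kind of relapse is to make the final runbook step a stability check rather than a single health probe: sample the health check several times over a window and only report success if every sample passes. A rough sketch follows. The function name, the defaults, and the injectable `sleep` are all assumptions for illustration, and `check` stands for whatever health probe your runbook already uses.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Resolves true only if `check` passes on every sample across the window.
// Stops at the first failing sample so escalation isn't delayed.
async function staysHealthy(check, { samples = 5, intervalMs = 60000, sleep = defaultSleep } = {}) {
  for (let i = 0; i < samples; i++) {
    if (!(await check())) return false;
    if (i < samples - 1) await sleep(intervalMs);
  }
  return true;
}
```

A window like this would not have caught our twenty-minute memory leak with the defaults shown, so size `samples` and `intervalMs` to the failure mode you are worried about, and keep the total under whatever runtime limit triggers escalation.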
kubectl exec fails with permission denied
Your ServiceAccount Role is missing the create verb on the pods/exec resource, or it's scoped to the wrong namespace. Check kubectl auth can-i create pods/exec -n production --as=system:serviceaccount:production:incident-agent.
Next Steps
Once the basic agent is running, a few improvements worth considering:
Multiple runbook matching: Right now, the agent picks the first matching runbook. Consider scoring matches and asking for human confirmation when multiple runbooks match with similar confidence.
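A crude version of that scoring can be sketched with keyword overlap alone. The score here is simply the fraction of a runbook's trigger keywords found in the alert title. It is a stand-in for whatever confidence measure you end up using, and the `minScore` and `margin` defaults are arbitrary assumptions, as is the `triggers.alertContains` field name.

```javascript
// Score = fraction of the runbook's trigger keywords present in the alert title
function scoreRunbook(runbook, title) {
  const keywords = runbook.triggers.alertContains;
  if (keywords.length === 0) return 0;
  const lowered = title.toLowerCase();
  const hits = keywords.filter((k) => lowered.includes(k)).length;
  return hits / keywords.length;
}

// Picks the best-scoring runbook and flags the result when the runner-up is too close
function selectRunbook(runbooks, title, { minScore = 0.8, margin = 0.1 } = {}) {
  const ranked = runbooks
    .map((runbook) => ({ runbook, score: scoreRunbook(runbook, title) }))
    .sort((a, b) => b.score - a.score);
  const [best, runnerUp] = ranked;

  if (!best || best.score < minScore) return { runbook: null, needsConfirmation: false };
  const ambiguous = runnerUp !== undefined && best.score - runnerUp.score < margin;
  return { runbook: best.runbook, needsConfirmation: ambiguous };
}
```

An alert titled "Redis connection timeout on api-gateway" would fully match both a redis runbook and a gateway-timeout runbook, so `needsConfirmation` comes back true and a human picks.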
Runbook learning: Track which runbooks succeed and which escalate. If a runbook escalates 80% of the time, it's probably incomplete — DevOS's three-tier memory system could eventually help agents learn from incident history, though that's further out.
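The tracking half of this needs nothing fancy. Given a list of past outcomes, a sketch like the one below flags runbooks whose escalation rate crosses a threshold. The outcome shape (`{ runbook, result }`) and the `minRuns` floor are assumptions; the 80% threshold is the figure from the paragraph above.

```javascript
// Aggregates outcomes into per-runbook run counts and escalation rates
function escalationRates(outcomes) {
  const totals = new Map();
  for (const { runbook, result } of outcomes) {
    const t = totals.get(runbook) ?? { runs: 0, escalated: 0 };
    t.runs++;
    if (result === 'escalated') t.escalated++;
    totals.set(runbook, t);
  }
  return [...totals].map(([runbook, t]) => ({ runbook, runs: t.runs, rate: t.escalated / t.runs }));
}

// Names of runbooks that escalate at or above `threshold`, ignoring ones with too few runs
function needsReview(outcomes, { threshold = 0.8, minRuns = 5 } = {}) {
  return escalationRates(outcomes)
    .filter((r) => r.runs >= minRuns && r.rate >= threshold)
    .map((r) => r.runbook);
}
```

The `minRuns` floor keeps a runbook that has fired twice and escalated twice from being flagged on thin evidence.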
Agent-to-agent handoffs: Tier-1 agent resolves what it can, then hands off to a more capable agent for deeper investigation. This is closer to how DevOS structures multi-agent coordination — Planner, Developer, QA, DevOps agents with different capabilities and escalation paths.
Proactive alerting: The agent notices a pattern (disk usage climbing steadily) and creates an alert before it becomes an incident. That's beyond incident response — it's incident prevention.
The VDL engineering team runs a version of this setup across multiple products. It's not perfect — we still get false escalations, still have runbooks that need tuning, still occasionally wake up to find the agent made things slightly worse before escalating. But it handles about 40% of Tier-1 pages without human intervention. That's 40% fewer 4 AM phone calls. I'll take it.
Frequently Asked Questions
What incidents can an AI agent safely handle without human intervention?
Tier-1 incidents with well-defined runbooks and bounded blast radius. Pod restarts, cache flushes, feature flag toggles, certificate renewals, log rotation. Anything where the fix is documented, the action is reversible, and failure means "try the next step" not "production is down." Complex incidents — cascading failures, data corruption, anything requiring judgment calls — still need a human.
How do you prevent an AI on-call agent from making things worse?
Three layers: scoped permissions (the agent literally cannot delete databases because the IAM role doesn't allow it), blast-radius limits (max 2 pod restarts per incident, max 1 feature flag change), and mandatory escalation triggers (if the runbook step fails twice, page a human). The agent operates inside a sandbox — it can only execute what you've pre-approved.
What's the difference between AI on-call and traditional runbook automation?
Traditional automation is rigid: if condition X, do action Y. AI on-call adds judgment within bounds. The agent reads the alert, selects the appropriate runbook, adapts parameters based on context (which pod? which region?), handles edge cases the automation script didn't anticipate, and knows when to stop and escalate. It's runbook-guided, not runbook-hardcoded.
How do you audit what an AI on-call agent did during an incident?
Every action gets logged: the alert that triggered the agent, which runbook it selected, each command executed, the output, and the decision to continue or escalate. We write this to both Slack (for immediate visibility) and a structured incident log (for postmortems). The agent also explains its reasoning — "selected cache-flush runbook because alert mentions redis timeout" — so you can review the decision chain.