AI in Network Operations: Where It Helps (and Where It Hurts)
Practical patterns for AI adoption in NOC/SRE teams. When AI reduces toil. When it amplifies chaos. How to build guardrails that work.
AI is a powerful accelerator for operations — but it can also amplify bad signals, weak runbooks, and unclear ownership.
The goal is not "AI everywhere." The goal is faster incident resolution with lower risk.
Where It Actually Works
Alert Triage
When DNS latency spikes at 3 AM, most teams:
- Wake engineer
- Engineer logs into 4 different systems
- Engineer reads dashboards
- Engineer searches past incidents
- 10–20 minutes until action
With AI (guardrailed):
- Alert fires
- AI summarizes: "Latency spike 3x baseline. Config changed 15 min ago. Similar issue resolved by [rollback]. Escalation options: [A], [B], [C]."
- Engineer reads summary in 1 minute
- Engineer approves action
- 3 minutes total
Key: AI suggests, human decides.
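A minimal sketch of how the summary above could be assembled from facts gathered before any model is involved. The `TriageContext` fields and the `build_summary` function are hypothetical names chosen for illustration; the source of each value (metrics store, change log, incident history) is assumed to exist already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageContext:
    """Hypothetical inputs collected from monitoring and change history."""
    metric: str
    current_ms: float
    baseline_ms: float
    minutes_since_config_change: Optional[int]  # None if no recent change
    similar_incident_fix: Optional[str]         # None if nothing matched


def build_summary(ctx: TriageContext) -> str:
    """Assemble a short, factual summary for the on-call engineer."""
    ratio = ctx.current_ms / ctx.baseline_ms
    parts = [f"{ctx.metric} at {ratio:.1f}x baseline."]
    if ctx.minutes_since_config_change is not None:
        parts.append(f"Config changed {ctx.minutes_since_config_change} min ago.")
    if ctx.similar_incident_fix is not None:
        parts.append(f"Similar issue resolved by {ctx.similar_incident_fix}.")
    else:
        parts.append("No similar incident found; escalate.")
    # The summary never triggers an action by itself.
    parts.append("Action requires human approval.")
    return " ".join(parts)
```

The summary only restates values it was given, which keeps the "AI suggests, human decides" split explicit: anything the model adds on top can be checked against these facts.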
Runbook Execution
Structured runbooks with clear checkpoints work well:
BGP session down runbook:
1. Verify peer reachable? (ping -M do -s 1472 {neighbor_ip})
2. Check BGP status? (show bgp summary)
3. Clear session? (requires human approval)
4. Verify recovery? (BGP up in < 2 min)
5. Verify traffic? (packets flowing)
Each step has a deterministic check. If any check fails, stop and escalate. If all pass, the runbook completes in about 2 minutes.
Key: Deterministic + auditable + reversible.
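The stop-and-escalate behavior described above can be sketched as a small runner. This is an illustration under assumptions: `run_runbook`, the `Step` tuple, and the `approve` callback are hypothetical, and each check is assumed to be a function that returns True or False.

```python
from typing import Callable, List, Tuple

# Each step: (name, check function, whether the step changes state).
Step = Tuple[str, Callable[[], bool], bool]


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order; stop and escalate on the first failure or denial."""
    log: List[str] = []
    for name, check, changes_state in steps:
        # State-changing steps wait for a human decision before running.
        if changes_state and not approve(name):
            log.append(f"ESCALATE: approval denied for '{name}'")
            return log
        if not check():
            log.append(f"ESCALATE: '{name}' failed")
            return log
        log.append(f"OK: {name}")
    log.append("DONE")
    return log
```

The returned list doubles as the audit record: every step either appears as OK or as the reason the run stopped.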
Pre/Post Validation
Change validation is ideal for automation:
- Reachability before and after
- BGP session count stable
- Packet loss near zero
- Route count unchanged
- No error logs
All automated, all logged. If any check fails, flag it immediately.
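As a rough sketch, the checks above can be expressed as a comparison of two snapshots. The `Snapshot` fields, the `validate_change` function, and the 0.1% loss threshold are assumptions for illustration, not values from any specific tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical state captured before or after a change."""
    reachable: bool
    bgp_sessions: int
    packet_loss_pct: float
    route_count: int
    error_log_lines: int


def validate_change(pre: Snapshot, post: Snapshot, max_loss_pct: float = 0.1) -> List[str]:
    """Return the list of failed checks; an empty list means the change passed."""
    failures = []
    if not post.reachable:
        failures.append("reachability lost")
    if post.bgp_sessions != pre.bgp_sessions:
        failures.append(f"BGP sessions {pre.bgp_sessions} -> {post.bgp_sessions}")
    if post.packet_loss_pct > max_loss_pct:
        failures.append(f"packet loss {post.packet_loss_pct}%")
    if post.route_count != pre.route_count:
        failures.append(f"route count {pre.route_count} -> {post.route_count}")
    if post.error_log_lines > pre.error_log_lines:
        failures.append("new error logs")
    return failures
```

Returning every failed check, rather than stopping at the first, gives the reviewer the full picture in one pass.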
Where It Breaks
Hallucinations in Ambiguous Incidents
Alert: "Interface saturation on border-1."
Bad AI: "Solution: Upgrade to 400G interface." (You don't have 400G interfaces.)
Good AI: "Options from similar incidents: [1], [2], [3]. Unknown: root cause not in historical patterns. Escalate to senior engineer."
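One way to catch the "400G" kind of answer is to check each suggestion against what actually exists before a human ever sees it. The sketch below assumes suggestions arrive as dictionaries with `action` and `target` keys; `filter_suggestions`, `known_actions`, and `inventory` are hypothetical names.

```python
from typing import Dict, List, Set


def filter_suggestions(
    suggestions: List[Dict[str, str]],
    known_actions: Set[str],
    inventory: Set[str],
) -> Dict[str, List[Dict[str, str]]]:
    """Split model suggestions into those grounded in known data and those not."""
    accepted, rejected = [], []
    for s in suggestions:
        # A suggestion is grounded only if both the action and the target are known.
        grounded = s["action"] in known_actions and s["target"] in inventory
        (accepted if grounded else rejected).append(s)
    return {"accepted": accepted, "rejected": rejected}
```

If `accepted` comes back empty, that is the signal to escalate to a senior engineer rather than to ask the model again.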
Overconfidence with Weak Observability
Your monitoring sees dropped packets. AI concludes: "Congestion."
Reality: Fiber degradation or duplex mismatch.
Result: 30 minutes wasted troubleshooting the wrong thing.
Solution: Fix observability first, then add AI.
Undocumented Drift
Runbook says: "Update VLAN 100 gateway to 10.10.1.254."
Reality: VLAN 100 moved 6 months ago. Gateway is now 10.10.2.254.
AI applies the old address. Silent failure. Outage.
Solution: Codify runbooks in version control. Update them or mark as obsolete.
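A drift check can run before any runbook step executes. This sketch assumes the runbook's hard-coded values and the current inventory are both available as simple key/value mappings; `find_drift` and the key names are hypothetical.

```python
from typing import Dict, List


def find_drift(runbook_params: Dict[str, str], source_of_truth: Dict[str, str]) -> List[str]:
    """Compare values hard-coded in a runbook against current inventory data."""
    drift = []
    for key, expected in runbook_params.items():
        actual = source_of_truth.get(key)
        if actual is None:
            drift.append(f"{key}: not found in source of truth")
        elif actual != expected:
            drift.append(f"{key}: runbook has {expected}, live value is {actual}")
    return drift
```

With the VLAN 100 example above, a runbook value of 10.10.1.254 against a live value of 10.10.2.254 would be reported instead of executed.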
Shadow Automation Without Governance
One team uses Claude for network ACLs. Another uses ChatGPT for BGP debugging. No audit trail. No approval. No rollback capability.
Outcome: Someone got fired.
Solution: Central governance. All automation logged. All state changes approved.
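A minimal sketch of "all automation logged, all state changes approved", assuming a single shared log that every tool writes to. The `record_action` function and its field names are hypothetical; a real deployment would write to durable, access-controlled storage rather than a Python list.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def record_action(
    log: List[str],
    actor: str,
    tool: str,
    action: str,
    changes_state: bool,
    approved_by: Optional[str] = None,
) -> str:
    """Append one JSON line per action; refuse unapproved state changes."""
    if changes_state and not approved_by:
        raise PermissionError(f"'{action}' changes state and has no approver")
    entry = json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "tool": tool,
        "action": action,
        "changes_state": changes_state,
        "approved_by": approved_by,
    }, sort_keys=True)
    log.append(entry)
    return entry
```

Routing every tool through one function like this means the approval rule is enforced in one place, regardless of which assistant a team prefers.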
The Adoption Path
Phase 1: Fix Signals (Weeks 1–2)
Before AI touches anything:
- Reduce alert noise — tune thresholds, suppress false positives (goal: 90%+ actionable)
- Standardize dashboards — one dashboard per incident type (DNS, BGP, capacity, etc.)
- Define golden signals — what actually matters (latency, loss, success rate, reachability)
Don't add AI to a noisy system. You'll just automate bad decisions faster.
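To track the actionability goal, it helps to measure it. The sketch below assumes each alert record carries a `rule` name and an `action_taken` flag set during review; both field names and both functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def actionable_ratio(alerts: List[Dict]) -> float:
    """Fraction of alerts that led to a human action."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a["action_taken"]) / len(alerts)


def noisiest_rules(alerts: List[Dict], top: int = 3) -> List[Tuple[str, int]]:
    """Rules that fired most often without anyone acting on them."""
    noise = Counter(a["rule"] for a in alerts if not a["action_taken"])
    return noise.most_common(top)
```

The second function points at where to tune thresholds first: the rules that generate the most ignored alerts.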
Phase 2: Codify Runbooks (Weeks 3–4)
Document your runbooks with:
- Clear precondition checks (must be true before proceeding)
- Deterministic steps (yes/no outcomes only)
- Verification checks (proves step worked)
- Abort conditions (when to stop and escalate)
- Timeouts (how long to wait)
Example:
Step: Clear BGP session
Precondition: BGP neighbor reachable
Action: clear bgp {neighbor_ip}
Verification: BGP state = Established within 60s
Abort if: Verification fails after 2 retries
Rollback: None needed (session comes back up)
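The same step can be written as data plus a small executor. This is a sketch under assumptions: `RunbookStep` and `execute_step` are hypothetical names, the callables stand in for real device checks, and "2 retries" is read here as one initial verification plus up to two more attempts, each after a wait.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunbookStep:
    name: str
    precondition: Callable[[], bool]  # must be true before proceeding
    action: Callable[[], None]        # the state-changing operation
    verify: Callable[[], bool]        # proves the step worked
    max_retries: int = 2
    verify_wait_s: float = 60.0


def execute_step(step: RunbookStep, sleep: Callable[[float], None] = time.sleep) -> str:
    """Run one step and return 'ok' or an abort reason."""
    if not step.precondition():
        return "aborted: precondition failed"
    step.action()
    # One initial verification plus up to max_retries further attempts.
    for _ in range(step.max_retries + 1):
        sleep(step.verify_wait_s)
        if step.verify():
            return "ok"
    return "aborted: verification failed"
```

Passing `sleep` as a parameter keeps the executor testable without waiting out real timeouts.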
Phase 3: Automate Deterministic Tasks (Weeks 5–6)
Only automate:
- ✅ Clear sessions (reversible)
- ✅ Restart services (reversible)
- ✅ Rotate logs (low impact)
- ✅ Drain connections (reversible)
Never automate:
- ❌ Routing changes (hard to roll back)
- ❌ Firewall rule changes (security impact)
- ❌ Delete operations (dangerous)
Add approval gates for all state-changing steps.
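The two lists above can be encoded as a simple policy lookup. The action type names and the `automation_policy` function are hypothetical, and treating unknown action types as manual-only is an assumption of this sketch rather than something stated above.

```python
ALLOWED_WITH_APPROVAL = {
    "clear_session",
    "restart_service",
    "rotate_logs",
    "drain_connections",
}
NEVER_AUTOMATE = {
    "routing_change",
    "firewall_rule_change",
    "delete",
}


def automation_policy(action_type: str) -> str:
    """Decide whether an action type may be automated at all."""
    if action_type in NEVER_AUTOMATE:
        return "manual_only"
    if action_type in ALLOWED_WITH_APPROVAL:
        return "automate_with_approval"
    # Unknown action types are denied by default (an assumption of this sketch).
    return "manual_only"
```

Keeping the lists in version control alongside the runbooks makes any expansion of the allowed set a reviewed change.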
Phase 4: Add AI with Guardrails (Weeks 7+)
Three levels:
Summarization (safe):
- AI reads metrics, logs, past incidents
- AI generates: "This looks like X. Similar incident was [link]. Options: [A], [B], [C]."
- Human reads summary. Makes decision.
Suggestion (medium):
- AI suggests actions ranked by risk
- AI explains precedent: "Similar incident resolved by [action]. Success rate: 95%."
- Human approves. AI executes only after approval.
Execution (controlled):
- AI executes runbook step-by-step
- Every action logged
- Every verification checked
- Human approval required for state changes
- Automatic rollback if verification fails
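The controlled execution level can be sketched as a wrapper that ties approval, logging, verification, and rollback together. `execute_with_rollback` and its callback parameters are hypothetical; the callables stand in for real device operations and a real approval workflow.

```python
from typing import Callable, List


def execute_with_rollback(
    action: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
    approve: Callable[[], bool],
    audit: List[str],
) -> bool:
    """Run one state-changing action; undo it if verification fails."""
    if not approve():
        audit.append("approval denied")
        return False
    audit.append("approved")
    action()
    audit.append("executed")
    if verify():
        audit.append("verified")
        return True
    audit.append("verification failed")
    rollback()
    audit.append("rolled back")
    return False
```

Nothing runs before approval, and every outcome leaves a trace in `audit`, so the sequence of events can be reconstructed afterwards.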
What Gets Better
Teams that do this well see:
| Metric | Before | After |
|---|---|---|
| MTTR | 25 min | 8 min |
| Alert actionability | 70% | 95% |
| Runbook exec time | 15 min | 3 min |
| Manual investigation | 60% of time | 20% of time |
But only if you build the infrastructure first.
The Anti-Patterns
❌ "Use AI as a black box" — Feed raw logs to LLM, expect intelligence. Result: Hallucinations.
❌ "Automate without approvals" — Let AI execute without human gate. Result: Production outages.
❌ "Add AI to noisy systems" — 500 alerts/day → 500 bad AI decisions/day.
❌ "No audit trail" — Can't explain what AI did. Compliance failure.
❌ "AI decides incident priority" — AI doesn't have business context. Humans do.
The Real Bottom Line
AI helps operations teams that already have:
- Strong observability (golden signals working)
- Clear runbooks (documented, version-controlled)
- Predictable patterns (most incidents follow playbooks)
- Governance (approvals, audit logs, escalation paths)
If those foundations are missing: do not add AI yet.
Instead:
- Fix your alert noise
- Document your runbooks
- Establish approval workflows
- Build audit logging
Then AI becomes a multiplier: faster triage, faster execution, same human judgment.
Without those foundations, AI becomes a liability: faster wrong decisions, faster disasters, same on-call exhaustion.
Next Step
Audit your top 5 incident types. Pick one that's deterministic (clear steps, verifiable outcome). Document the runbook with preconditions and checks. Wire up approval gates. Then automate it.
That's your proof of concept.
Don't start with AI. Start with discipline.