AI in Network Operations: Where It Helps (and Where It Hurts)

AI is a powerful accelerator for operations — but it can also amplify bad signals, weak runbooks, and unclear ownership.

The goal is not "AI everywhere." The goal is faster incident resolution with lower risk.

Where It Actually Works

Alert Triage

When DNS latency spikes at 3 AM, most teams:

Wake engineer
Engineer logs into 4 different systems
Engineer reads dashboards
Engineer searches past incidents
10–20 minutes until action

With AI (guardrailed):

Alert fires
AI summarizes: "Latency spike 3x baseline. Config changed 15 min ago. Similar issue resolved by [rollback]. Escalation options: [A], [B], [C]."
Engineer reads summary in 1 minute
Engineer approves action
3 minutes total

Key: AI suggests, human decides.
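
The summary step above can be sketched as a read-only function. This is a minimal illustration, not a real integration: the Alert fields, the 30-minute correlation window, and the summarize helper are all assumptions made for the example, and the list of similar fixes would come from whatever incident history you already keep.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    metric: str
    value: float
    baseline: float
    fired_at: datetime


def summarize(alert, last_config_change, similar_fixes, window=timedelta(minutes=30)):
    """Build a short, read-only summary. It proposes options and never acts."""
    ratio = alert.value / alert.baseline
    lines = [f"{alert.metric} at {ratio:.1f}x baseline."]
    # Only mention a config change if it happened shortly before the alert.
    age = alert.fired_at - last_config_change
    if timedelta(0) <= age <= window:
        minutes = int(age.total_seconds() // 60)
        lines.append(f"Config changed {minutes} min ago.")
    if similar_fixes:
        lines.append("Similar incidents resolved by: " + ", ".join(similar_fixes) + ".")
    else:
        lines.append("No similar incident found; escalate.")
    lines.append("Awaiting human decision.")
    return " ".join(lines)


fired = datetime(2024, 1, 1, 3, 0)
alert = Alert("DNS latency", value=90.0, baseline=30.0, fired_at=fired)
summary = summarize(alert, fired - timedelta(minutes=15), ["rollback"])
print(summary)
```

The function returns text only. Anything that changes state stays behind the engineer's approval.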

Runbook Execution

Structured runbooks with clear checkpoints work well:

BGP session down runbook:
  1. Verify peer reachable? (ping -M do -s 1472 {peer_ip})
  2. Check BGP status? (show bgp summary)
  3. Clear session? (requires human approval)
  4. Verify recovery? (BGP up in < 2 min)
  5. Verify traffic? (packets flowing)

Each step has a deterministic check. If any fails, stop and escalate. If all pass, done in 2 minutes.

Key: Deterministic + auditable + reversible.
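
A runner for a runbook like this one can be very small. The sketch below assumes each step is a yes/no check and that approval is a callable standing in for a human gate; the step names and stubbed lambdas are placeholders, not real device calls.

```python
def run_runbook(steps, approve):
    """Run steps in order; stop and escalate at the first failed check.

    steps: list of (name, check, needs_approval); check() returns True/False.
    approve: callable(name) -> bool, standing in for a human approval gate.
    Returns (status, audit_log).
    """
    log = []
    for name, check, needs_approval in steps:
        if needs_approval and not approve(name):
            log.append((name, "denied"))
            return "escalate", log
        ok = check()
        log.append((name, "pass" if ok else "fail"))
        if not ok:
            return "escalate", log
    return "done", log


# Stubbed checks; real ones would query devices.
steps = [
    ("peer reachable", lambda: True, False),
    ("bgp status read", lambda: True, False),
    ("clear session", lambda: True, True),
    ("session re-established", lambda: False, False),  # simulated failure
    ("traffic flowing", lambda: True, False),
]
status, log = run_runbook(steps, approve=lambda name: True)
print(status, log[-1])
```

The log doubles as the audit trail: every step that ran is recorded with its outcome, and nothing runs after the first failure.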

Pre/Post Validation

Change validation is ideal for automation:

Reachability before and after
BGP session count stable
Packet loss near zero
Route count unchanged
No error logs

All automated, all logged. If any check fails, flag it immediately.
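
As a rough sketch, the five checks above reduce to a comparison of two snapshots. The snapshot keys and the 0.1% loss threshold are assumptions for illustration; how you collect the snapshots is up to your tooling.

```python
def validate_change(before, after, max_loss_pct=0.1):
    """Compare pre- and post-change snapshots; return a list of failed checks."""
    failures = []
    lost = before["reachable"] - after["reachable"]
    if lost:
        failures.append(f"unreachable after change: {sorted(lost)}")
    if after["bgp_sessions"] != before["bgp_sessions"]:
        failures.append("BGP session count changed")
    if after["loss_pct"] > max_loss_pct:
        failures.append("packet loss above threshold")
    if after["route_count"] != before["route_count"]:
        failures.append("route count changed")
    if after["error_logs"] > 0:
        failures.append("new error logs")
    return failures


before = {
    "reachable": {"10.0.0.1", "10.0.0.2"},
    "bgp_sessions": 4,
    "loss_pct": 0.0,
    "route_count": 1200,
    "error_logs": 0,
}
after = dict(before, reachable={"10.0.0.1"}, bgp_sessions=3)
failures = validate_change(before, after)
print(failures)
```

An empty list means the change passed. Anything else gets flagged with the specific check that failed.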

Where It Breaks

Hallucinations in Ambiguous Incidents

Alert: "Interface saturation on border-1."

Bad AI: "Solution: Upgrade to 400G interface." (You don't have 400G interfaces.)

Good AI: "Options from similar incidents: [1], [2], [3]. Unknown: root cause not in historical patterns. Escalate to senior engineer."
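
One way to push a model toward the "good" behavior is to filter its output against actions you have actually used before. The sketch below is an assumption about how that filter might look; ground_suggestions and the action names are hypothetical.

```python
def ground_suggestions(suggestions, known_actions):
    """Keep only suggestions that match a previously used action; flag the rest."""
    grounded = [s for s in suggestions if s in known_actions]
    rejected = [s for s in suggestions if s not in known_actions]
    return {
        "options": grounded,
        "rejected": rejected,
        # Nothing grounded means the pattern is unknown: hand off to a human.
        "escalate": not grounded,
    }


known_actions = {"shift traffic to border-2", "rate-limit top talker"}
model_output = ["upgrade to 400G interface", "rate-limit top talker"]
result = ground_suggestions(model_output, known_actions)
print(result)
```

The 400G suggestion lands in rejected instead of in front of the on-call engineer. If every suggestion is rejected, the result says to escalate.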

Overconfidence with Weak Observability

Your monitoring sees dropped packets. AI concludes: "Congestion."

Reality: Fiber degradation or duplex mismatch.

Result: 30 minutes wasted troubleshooting the wrong thing.

Solution: Fix observability first, then add AI.

Undocumented Drift

Runbook says: "Update VLAN 100 gateway to 10.10.1.254."

Reality: VLAN 100 moved 6 months ago. Gateway is now 10.10.2.254.

AI applies the old address. Silent failure. Outage.

Solution: Codify runbooks in version control. Update them or mark as obsolete.
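
Version control keeps the history, but a cheap pre-flight comparison catches drift at run time. This sketch assumes you have some source of truth to compare against (an IPAM export, a config database); the key names are made up for the example.

```python
def check_drift(runbook_params, source_of_truth):
    """Return params whose runbook value disagrees with the current record."""
    drift = {}
    for key, expected in runbook_params.items():
        actual = source_of_truth.get(key)
        if actual != expected:
            drift[key] = {"runbook": expected, "actual": actual}
    return drift


runbook_params = {"vlan100_gateway": "10.10.1.254"}
source_of_truth = {"vlan100_gateway": "10.10.2.254"}
drift = check_drift(runbook_params, source_of_truth)
if drift:
    print("Refusing to run; runbook is stale:", drift)
```

Any non-empty result should block execution and send the runbook back for an update.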

Shadow Automation Without Governance

One team uses Claude for network ACLs. Another uses ChatGPT for BGP debugging. No audit trail. No approval. No rollback capability.

Outcome: Someone got fired.

Solution: Central governance. All automation logged. All state changes approved.

The Adoption Path

Phase 1: Fix Signals (Weeks 1–2)

Before AI touches anything:

Reduce alert noise — tune thresholds, suppress false positives (goal: 90%+ actionable)
Standardize dashboards — one dashboard per incident type (DNS, BGP, capacity, etc.)
Define golden signals — what actually matters (latency, loss, success rate, reachability)

Don't add AI to a noisy system. You'll just automate bad decisions faster.
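
To know whether you are near the 90% goal, you have to measure it. A minimal sketch, assuming each alert has been labeled after the fact as actionable or not (the rule names and counts are invented):

```python
from collections import defaultdict


def actionability(alerts):
    """Share of alerts that led to an action, overall and per alert rule.

    alerts: list of (rule_name, was_actionable) pairs.
    """
    if not alerts:
        return 0.0, {}
    totals = defaultdict(lambda: [0, 0])  # rule -> [actionable, total]
    for rule, was_actionable in alerts:
        totals[rule][0] += int(was_actionable)
        totals[rule][1] += 1
    overall = sum(a for a, _ in totals.values()) / sum(t for _, t in totals.values())
    per_rule = {rule: a / t for rule, (a, t) in totals.items()}
    return overall, per_rule


alerts = (
    [("dns_latency", True)] * 9
    + [("disk_temp", False)] * 3
    + [("bgp_down", True)] * 8
)
overall, per_rule = actionability(alerts)
print(f"overall: {overall:.0%}")
print("noisiest rule:", min(per_rule, key=per_rule.get))
```

The per-rule breakdown tells you which thresholds to tune or suppress first.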

Phase 2: Codify Runbooks (Weeks 3–4)

Document your runbooks with:

Clear precondition checks (must be true before proceeding)
Deterministic steps (yes/no outcomes only)
Verification checks (proves step worked)
Abort conditions (when to stop and escalate)
Timeouts (how long to wait)

Example:

Step: Clear BGP session
  Precondition: BGP neighbor reachable
  Action: clear bgp {neighbor_ip}
  Verification: BGP state = Established within 60s
  Abort if: Verification fails after 2 retries
  Rollback: None (a clear cannot be undone; the session should re-establish on its own)
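
The same step can be expressed as data plus a small executor. This is a sketch under assumptions: the Step fields mirror the list above, the callables are stubs rather than device commands, and a "retry" here means another verification window without repeating the action (re-clearing a session on every retry would be a different, riskier choice).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    precondition: Callable[[], bool]
    action: Callable[[], None]
    verify: Callable[[], bool]
    timeout_s: float = 60.0
    retries: int = 2
    poll_s: float = 1.0


def run_step(step, sleep=time.sleep, clock=time.monotonic):
    """Return 'ok', 'precondition_failed', or 'abort'."""
    if not step.precondition():
        return "precondition_failed"
    step.action()  # performed once; retries only re-check the outcome
    for _ in range(1 + step.retries):
        deadline = clock() + step.timeout_s
        while clock() < deadline:
            if step.verify():
                return "ok"
            sleep(step.poll_s)
    return "abort"


clear_bgp = Step(
    name="Clear BGP session",
    precondition=lambda: True,   # stub: neighbor reachable
    action=lambda: None,         # stub: would issue the clear command
    verify=lambda: True,         # stub: state is Established
)
print(run_step(clear_bgp))
```

The sleep and clock parameters exist so the executor can be tested without waiting in real time.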

Phase 3: Automate Deterministic Tasks (Weeks 5–6)

Only automate:

✅ Clear sessions (reversible)
✅ Restart services (reversible)
✅ Rotate logs (low impact)
✅ Drain connections (reversible)

Never automate:

❌ Routing changes (hard to roll back)
❌ Firewall rule changes (security impact)
❌ Delete operations (dangerous)

Add approval gates for all state-changing steps.
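
The two lists plus the approval rule amount to a default-deny policy. A minimal sketch, assuming action names of our own invention and treating log rotation as the only non-state-changing entry:

```python
# Allowed actions; the value says whether the action changes state
# and therefore needs a human approval before it runs.
ALLOWED = {
    "clear_session": True,
    "restart_service": True,
    "drain_connections": True,
    "rotate_logs": False,  # assumption: low impact, treated as non-state-changing
}
BLOCKED = {"routing_change", "firewall_rule_change", "delete"}


def decide(action):
    """Return 'blocked', 'needs_approval', or 'auto' for a requested action."""
    # Default deny: anything not explicitly allowed is blocked.
    if action in BLOCKED or action not in ALLOWED:
        return "blocked"
    return "needs_approval" if ALLOWED[action] else "auto"


for name in ("rotate_logs", "clear_session", "routing_change", "reboot_core"):
    print(name, "->", decide(name))
```

Unknown actions fall through to blocked, so a new capability has to be added to the policy on purpose.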

Phase 4: Add AI with Guardrails (Weeks 7+)

Three levels:

Summarization (safe):

AI reads metrics, logs, past incidents
AI generates: "This looks like X. Similar incident was [link]. Options: [A], [B], [C]."
Human reads summary. Makes decision.

Suggestion (medium):

AI suggests actions ranked by risk
AI explains precedent: "Similar incident resolved by [action]. Success rate: 95%."
Human approves. AI executes only after approval.

Execution (controlled):

AI executes runbook step-by-step
Every action logged
Every verification checked
Human approval required for state changes
Automatic rollback if verification fails
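
The controlled-execution level could look roughly like this for a single step. It is a sketch, not a framework: the function name, the audit record shape, and the pool-weight example are assumptions, and the approved flag stands in for whatever approval workflow you use.

```python
def execute_with_rollback(name, action, verify, rollback, approved, audit):
    """Run one state-changing step; roll back automatically if verification fails."""
    if not approved:
        audit.append({"step": name, "event": "skipped_no_approval"})
        return False
    audit.append({"step": name, "event": "executing"})
    action()
    if verify():
        audit.append({"step": name, "event": "verified"})
        return True
    rollback()
    audit.append({"step": name, "event": "rolled_back"})
    return False


state = {"pool_weight": 100}
audit = []
ok = execute_with_rollback(
    "drain node",
    action=lambda: state.update(pool_weight=0),
    verify=lambda: False,  # simulate a failed verification
    rollback=lambda: state.update(pool_weight=100),
    approved=True,
    audit=audit,
)
print(ok, state, [entry["event"] for entry in audit])
```

Every path through the function writes to the audit list, including the path where nothing was executed.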

What Gets Better

Teams that do this well see:

Metric	Before	After
MTTR	25 min	8 min
Alert actionability	70%	95%
Runbook exec time	15 min	3 min
Manual investigation	60% of time	20% of time

But only if you build the infrastructure first.

The Anti-Patterns

❌ "Use AI as a black box" — Feed raw logs to LLM, expect intelligence. Result: Hallucinations.

❌ "Automate without approvals" — Let AI execute without human gate. Result: Production outages.

❌ "Add AI to noisy systems" — 500 alerts/day → 500 bad AI decisions/day.

❌ "No audit trail" — Can't explain what AI did. Compliance failure.

❌ "AI decides incident priority" — AI doesn't have business context. Humans do.

The Real Bottom Line

AI helps operations teams that already have:

Strong observability (golden signals working)
Clear runbooks (documented, version-controlled)
Predictable patterns (most incidents follow playbooks)
Governance (approvals, audit logs, escalation paths)

If those foundations are missing: do not add AI yet.

Instead:

Fix your alert noise
Document your runbooks
Establish approval workflows
Build audit logging

Then AI becomes a multiplier: faster triage, faster execution, same human judgment.

Without those foundations, AI becomes a liability: faster wrong decisions, faster disasters, same on-call exhaustion.

Next Step

Audit your top 5 incident types. Pick one that's deterministic (clear steps, verifiable outcome). Document the runbook with preconditions and checks. Wire up approval gates. Then automate it.

That's your proof of concept.

Don't start with AI. Start with discipline.