IJYALabs

AI in Network Operations: Where It Helps (and Where It Hurts)

Practical patterns for AI adoption in NOC/SRE teams. When AI reduces toil. When it amplifies chaos. How to build guardrails that work.

2026-05-20·5 min read·By Arun R Kaushik

AI is a powerful accelerator for operations — but it can also amplify bad signals, weak runbooks, and unclear ownership.

The goal is not "AI everywhere." The goal is faster incident resolution with lower risk.


Where It Actually Works

Alert Triage

When DNS latency spikes at 3 AM, most teams:

  • Wake engineer
  • Engineer logs into 4 different systems
  • Engineer reads dashboards
  • Engineer searches past incidents
  • 10–20 minutes until action

With AI (guardrailed):

  • Alert fires
  • AI summarizes: "Latency spike 3x baseline. Config changed 15 min ago. Similar issue resolved by [rollback]. Escalation options: [A], [B], [C]."
  • Engineer reads summary in 1 minute
  • Engineer approves action
  • 3 minutes total

Key: AI suggests, human decides.
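One way to keep "AI suggests, human decides" honest is to make the summary a plain data object that cannot act by itself. The sketch below is illustrative only: `TriageSummary` and `approved_action` are hypothetical names, and the fields simply mirror the example summary above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TriageSummary:
    """What the assistant hands to the on-call engineer. It has no authority to act."""
    alert: str
    observation: str
    recent_change: Optional[str] = None
    precedent: Optional[str] = None
    options: List[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"Alert: {self.alert}", f"Observed: {self.observation}"]
        if self.recent_change:
            lines.append(f"Recent change: {self.recent_change}")
        if self.precedent:
            lines.append(f"Similar incident resolved by: {self.precedent}")
        lines += [f"Option {i}: {opt}" for i, opt in enumerate(self.options, start=1)]
        lines.append("Decision: pending human approval")
        return "\n".join(lines)


def approved_action(summary: TriageSummary, choice: Optional[int]) -> Optional[str]:
    """Return an action only when a human picked one of the listed options."""
    if choice is None:
        return None  # nobody approved anything, so nothing runs
    if not 1 <= choice <= len(summary.options):
        raise ValueError("choice is not one of the summarized options")
    return summary.options[choice - 1]
```

The only path to an action is through `approved_action`, and it can only return something the engineer saw in the summary.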

Runbook Execution

Structured runbooks with clear checkpoints work well:

BGP session down runbook:
  1. Verify peer reachable? (ping -M do -s 1472 {neighbor_ip})
  2. Check BGP status? (show bgp summary)
  3. Clear session? (requires human approval)
  4. Verify recovery? (BGP up in < 2 min)
  5. Verify traffic? (packets flowing)

Each step has a deterministic check. If any check fails, stop and escalate. If all pass, the incident is closed in about 2 minutes.

Key: Deterministic + auditable + reversible.
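The stop-and-escalate rule can be sketched as a small runner. This is a minimal illustration, assuming each runbook step is wrapped in a function that returns True or False; `run_runbook` is a hypothetical name, and the device commands themselves are not shown.

```python
from typing import Callable, List, Tuple

Check = Callable[[], bool]


def run_runbook(steps: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run checks in order; stop at the first failure and record where it stopped."""
    log: List[str] = []
    for name, check in steps:
        ok = check()
        log.append(f"{name}: {'pass' if ok else 'FAIL'}")
        if not ok:
            log.append(f"escalate: stopped at '{name}'")
            return False, log  # later steps never run
    return True, log
```

The returned log doubles as the audit record: it shows every check that ran, in order, and the exact step where the run stopped.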

Pre/Post Validation

Change validation is ideal for automation:

  • Reachability before and after
  • BGP session count stable
  • Packet loss near zero
  • Route count unchanged
  • No error logs

All automated, all logged. If any check fails, flag it immediately.
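A sketch of the comparison, assuming the five measurements are collected into a snapshot before and after the change. `Snapshot`, `validate_change`, and the 0.1% loss threshold are assumptions made for illustration, and "no error logs" is interpreted here as "no new error lines since the pre-change snapshot".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snapshot:
    reachable: bool
    bgp_sessions: int
    packet_loss_pct: float
    route_count: int
    error_log_lines: int


def validate_change(pre: Snapshot, post: Snapshot, max_loss_pct: float = 0.1) -> List[str]:
    """Return the names of failed checks; an empty list means the change validated."""
    failures: List[str] = []
    if not (pre.reachable and post.reachable):
        failures.append("reachability")
    if post.bgp_sessions != pre.bgp_sessions:
        failures.append("bgp_sessions")
    if post.packet_loss_pct > max_loss_pct:
        failures.append("packet_loss")
    if post.route_count != pre.route_count:
        failures.append("route_count")
    if post.error_log_lines > pre.error_log_lines:
        failures.append("error_logs")
    return failures
```

Returning the list of failed checks, instead of a single boolean, tells the engineer which signal moved without a second round of investigation.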


Where It Breaks

Hallucinations in Ambiguous Incidents

Alert: "Interface saturation on border-1."

Bad AI: "Solution: Upgrade to 400G interface." (You don't have 400G interfaces.)

Good AI: "Options from similar incidents: [1], [2], [3]. Unknown: root cause not in historical patterns. Escalate to senior engineer."

Overconfidence with Weak Observability

Your monitoring sees dropped packets. AI concludes: "Congestion."

Reality: Fiber degradation or duplex mismatch.

Result: 30 minutes wasted troubleshooting the wrong thing.

Solution: Fix observability first, then add AI.

Undocumented Drift

Runbook says: "Update VLAN 100 gateway to 10.10.1.254."

Reality: VLAN 100 moved 6 months ago. Gateway is now 10.10.2.254.

AI applies the old address. Silent failure. Outage.

Solution: Codify runbooks in version control. Update them or mark as obsolete.
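Version control catches edits, but not a runbook that nobody edited when the network moved. A cheap additional guard is to compare what the runbook assumes against a source of truth before any step runs. The sketch below assumes both sides can be expressed as flat key/value pairs; `find_drift` and the key name `vlan100.gateway` are hypothetical.

```python
from typing import Dict, List


def find_drift(runbook_assumes: Dict[str, str], source_of_truth: Dict[str, str]) -> List[str]:
    """List the keys where the runbook's assumption no longer matches reality."""
    return sorted(k for k, v in runbook_assumes.items() if source_of_truth.get(k) != v)


# Values taken from the VLAN 100 example; the lookup itself is stubbed.
runbook_assumes = {"vlan100.gateway": "10.10.1.254"}
source_of_truth = {"vlan100.gateway": "10.10.2.254"}

drift = find_drift(runbook_assumes, source_of_truth)
if drift:
    print(f"abort: runbook is out of date for {drift}")
```

A missing key counts as drift too, so a runbook that refers to something that no longer exists is stopped the same way.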

Shadow Automation Without Governance

One team uses Claude for network ACLs. Another uses ChatGPT for BGP debugging. No audit trail. No approval. No rollback capability.

Outcome: Someone got fired.

Solution: Central governance. All automation logged. All state changes approved.
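As a rough sketch of what "all automation logged, all state changes approved" can mean in code: one function that every tool must call, which refuses unapproved state changes and writes a structured entry for everything else. `record_action` and the entry fields are assumptions, and a real deployment would write to durable, append-only storage, not to an in-memory list.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def record_action(log: List[str], actor: str, action: str,
                  state_changing: bool, approved_by: Optional[str] = None) -> dict:
    """Append one audit entry; refuse state changes that nobody approved."""
    if state_changing and not approved_by:
        raise PermissionError(f"'{action}' changes state and has no approver")
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "state_changing": state_changing,
        "approved_by": approved_by,
    }
    log.append(json.dumps(entry, sort_keys=True))  # one JSON object per line
    return entry
```

Which assistant a team uses matters less than whether every action passes through a single choke point like this one.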


The Adoption Path

Phase 1: Fix Signals (Weeks 1–2)

Before AI touches anything:

  • Reduce alert noise — tune thresholds, suppress false positives (goal: 90%+ actionable)
  • Standardize dashboards — one dashboard per incident type (DNS, BGP, capacity, etc.)
  • Define golden signals — what actually matters (latency, loss, success rate, reachability)

Don't add AI to a noisy system. You'll just automate bad decisions faster.
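To know whether the 90% target is met, actionability has to be measured per alert. A minimal sketch, assuming you can export a history of (alert name, did it lead to action) pairs from your paging tool; the function names and the sample alert names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def actionability(history: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each alert name to the fraction of its firings that led to action."""
    fired: Counter = Counter()
    acted: Counter = Counter()
    for name, led_to_action in history:
        fired[name] += 1
        if led_to_action:
            acted[name] += 1
    return {name: acted[name] / fired[name] for name in fired}


def needs_tuning(history: Iterable[Tuple[str, bool]], target: float = 0.9) -> List[str]:
    """Alerts whose actionability is below the target, sorted by name."""
    return sorted(n for n, ratio in actionability(history).items() if ratio < target)
```

Alerts returned by `needs_tuning` are the candidates for threshold changes or suppression before any AI is layered on top.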

Phase 2: Codify Runbooks (Weeks 3–4)

Document your runbooks with:

  • Clear precondition checks (must be true before proceeding)
  • Deterministic steps (yes/no outcomes only)
  • Verification checks (proves step worked)
  • Abort conditions (when to stop and escalate)
  • Timeouts (how long to wait)

Example:

Step: Clear BGP session
  Precondition: BGP neighbor reachable
  Action: clear bgp {neighbor_ip}
  Verification: BGP state = Established within 60s
  Abort if: Verification fails after 2 retries
  Rollback: None (a clear cannot be undone; the session re-establishes on its own)
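The same step can be expressed as data plus a small executor. This is a sketch under stated assumptions: `Step` and `run_step` are hypothetical names, "2 retries" is read as re-checking the verification (not re-running the action), and the 60-second timeout is assumed to live inside the `verify` callable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    precondition: Callable[[], bool]
    action: Callable[[], None]
    verify: Callable[[], bool]
    max_retries: int = 2  # verification re-checks after the first attempt


def run_step(step: Step) -> str:
    """Run one step: precondition, action, then verify with bounded retries."""
    if not step.precondition():
        return "aborted: precondition failed"  # the action never runs
    step.action()
    for _ in range(1 + step.max_retries):
        if step.verify():
            return "ok"
    return "aborted: verification failed, escalate"
```

Because the step is data, it can be reviewed and version-controlled like any other change.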

Phase 3: Automate Deterministic Tasks (Weeks 5–6)

Only automate:

  • ✅ Clear sessions (reversible)
  • ✅ Restart services (reversible)
  • ✅ Rotate logs (low impact)
  • ✅ Drain connections (reversible)

Never automate:

  • ❌ Routing changes (hard to roll back)
  • ❌ Firewall rule changes (security impact)
  • ❌ Delete operations (dangerous)

Add approval gates for all state-changing steps.
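The two lists translate naturally into a default-deny policy. In this sketch the action identifiers are made up for illustration, and every automatable action is treated as state-changing, so it still waits for approval.

```python
AUTOMATABLE = {"clear_session", "restart_service", "rotate_logs", "drain_connections"}
NEVER_AUTOMATE = {"routing_change", "firewall_rule_change", "delete"}


def decide(action: str, approved: bool = False) -> str:
    """Return 'manual_only', 'needs_approval', or 'run' for a requested action."""
    if action in NEVER_AUTOMATE or action not in AUTOMATABLE:
        return "manual_only"  # default deny: unknown actions are never automated
    return "run" if approved else "needs_approval"
```

Note that approval does not unlock anything on the never-automate list: `decide("routing_change", approved=True)` still returns `"manual_only"`.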

Phase 4: Add AI with Guardrails (Weeks 7+)

Three levels:

Summarization (safe):

  • AI reads metrics, logs, past incidents
  • AI generates: "This looks like X. Similar incident was [link]. Options: [A], [B], [C]."
  • Human reads summary. Makes decision.

Suggestion (medium):

  • AI suggests actions ranked by risk
  • AI explains precedent: "Similar incident resolved by [action]. Success rate: 95%."
  • Human approves. AI executes only the approved action.

Execution (controlled):

  • AI executes runbook step-by-step
  • Every action logged
  • Every verification checked
  • Human approval required for state changes
  • Automatic rollback if verification fails
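At the execution level, the approval gate and the automatic rollback can sit in the same wrapper. The following is a minimal sketch, assuming each step supplies its own `action`, `verify`, and `rollback` callables; `execute_step` is a hypothetical name.

```python
from typing import Callable, List


def execute_step(name: str, action: Callable[[], None], verify: Callable[[], bool],
                 rollback: Callable[[], None], approved: bool, log: List[str]) -> bool:
    """Run an approved step, verify it, and roll back if verification fails."""
    if not approved:
        log.append(f"{name}: skipped, no approval")
        return False
    action()
    log.append(f"{name}: executed")
    if verify():
        log.append(f"{name}: verified")
        return True
    rollback()
    log.append(f"{name}: verification failed, rolled back")
    return False
```

Every branch writes to the log before returning, so a skipped or rolled-back step is as visible afterward as a successful one.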

What Gets Better

Teams that do this well see:

Metric               | Before      | After
MTTR                 | 25 min      | 8 min
Alert actionability  | 70%         | 95%
Runbook exec time    | 15 min      | 3 min
Manual investigation | 60% of time | 20% of time

But only if you build the infrastructure first.


The Anti-Patterns

"Use AI as a black box" — Feed raw logs to LLM, expect intelligence. Result: Hallucinations.

"Automate without approvals" — Let AI execute without human gate. Result: Production outages.

"Add AI to noisy systems" — 500 alerts/day → 500 bad AI decisions/day.

"No audit trail" — Can't explain what AI did. Compliance failure.

"AI decides incident priority" — AI doesn't have business context. Humans do.


The Real Bottom Line

AI helps operations teams with:

  • Strong observability (golden signals working)
  • Clear runbooks (documented, version-controlled)
  • Predictable patterns (most incidents follow playbooks)
  • Governance (approvals, audit logs, escalation paths)

If those foundations are missing: do not add AI yet.

Instead:

  1. Fix your alert noise
  2. Document your runbooks
  3. Establish approval workflows
  4. Build audit logging

Then AI becomes a multiplier: faster triage, faster execution, same human judgment.

Without those foundations, AI becomes a liability: faster wrong decisions, faster disasters, same on-call exhaustion.


Next Step

Audit your top 5 incident types. Pick one that's deterministic (clear steps, verifiable outcome). Document the runbook with preconditions and checks. Wire up approval gates. Then automate it.

That's your proof of concept.

Don't start with AI. Start with discipline.