Infrastructure pillar · Module 5 of 6

Capacity and resilience

Here’s an uncomfortable truth: things WILL break. Hardware fails. Software crashes. People make mistakes. The question isn’t whether you’ll have problems—it’s whether you’ll be ready when they happen.

Controls stack visual kit

Reusable icons and a telemetry-to-audit diagram aligned to our fundamentals and operational guides.

Governance evidence

Use for control statements that cite ISO/IEC 42001 clause 6.3 change management, EU AI Act Articles 62–75, and SOC 2 trust service criteria.

Secure supply chain

Pair with SBOM, provenance, and intake guidance that references SPDX or CycloneDX formats, SLSA Level 3 attestations, and NIST SSDF tasks PS.3/PO.4.

Telemetry & evaluations

Highlight logging of prompts, responses, refusal rates, and safety filters alongside adversarial evaluation suites from NIST AI RMF playbooks or UK AISI guidance.

Assurance & resilience

Use for incident response and assurance artefacts that must meet OMB M-24-10 24-hour notifications, CIRCIA’s 72-hour clocks, and serious-incident duties under the EU AI Act.

Signals → Controls → Evidence → Audit

Signals: prompt traces, supplier advisories, and safety filter activations streamed into monitoring.
Controls: guardrails, change review, SBOM validation, and access enforcement tied to AI lifecycle gates.
Evidence: runbooks that capture artefacts for ISO/IEC 42001 management reviews and SOC 2 narratives.
Audit: regulator-facing packets that satisfy EU AI Act post-market monitoring, OMB M-24-10, and CIRCIA timelines.

Capacity planning: don’t run out of runway

Know your current usage. You can’t plan if you don’t measure. How much CPU, memory, storage, and bandwidth are you using? What’s the trend?
Understand your patterns. Does traffic spike on Black Friday? End of quarter? When marketing sends emails? Plan for peaks, not averages.
Leave headroom. Running at 90% capacity leaves almost no buffer for a traffic spike or a failed node. Many ops teams start to get nervous once sustained utilisation passes 70–80%.
Know your lead times. New servers might take weeks to arrive. Cloud can scale in minutes. Plan around the constraints you actually have.
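
To make the runway idea concrete, here is a minimal sketch that fits a straight line to daily peak utilisation and estimates how many days remain before the trend crosses a headroom threshold. The function name, the 80% default, and the sample numbers are illustrative assumptions, and real growth is rarely this linear, so treat the result as a rough prompt to compare against your lead times rather than a forecast.

```python
def days_until_threshold(daily_peak_pct, threshold_pct=80.0):
    """Estimate days until a least-squares trend line crosses threshold_pct.

    daily_peak_pct: one utilisation sample (%) per day, oldest first.
    Returns None if usage is flat or falling, 0.0 if the trend has
    already reached the threshold.
    """
    n = len(daily_peak_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least-squares fit of y = intercept + slope * day.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peak_pct) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_peak_pct))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no projected crossing

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Hypothetical data: daily peak disk utilisation, growing two points per day.
samples = [50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
runway = days_until_threshold(samples, threshold_pct=80.0)
print(f"Projected days until 80%: {runway:.1f}")
```

If the projected runway is shorter than the time it takes to add capacity (weeks for hardware, minutes for cloud), that is the signal to act now. Feeding in daily peaks rather than daily averages keeps the estimate aligned with planning for peaks.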

Resilience: planning for failure

The goal isn’t to prevent all failures—that’s impossible. The goal is to fail gracefully and recover quickly.

Redundancy patterns

N+1: Enough units to carry the load (N), plus one spare. For example, three servers doing the work and one standing by.
2N: A complete duplicate of everything. Expensive, but a whole set can fail and the duplicate still carries the full load.
Active-active: All systems doing work all the time. If one fails, others absorb the load.
Active-passive: Backup sits idle until needed. Simpler, cheaper, but failover takes time.
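
The patterns above come down to simple arithmetic. The sketch below sizes a fleet for N+1 or 2N and shows what happens to utilisation in an active-active setup when a node drops out. The function names, the requests-per-second figures, and the assumption that load spreads evenly across identical servers are all illustrative.

```python
import math


def servers_required(peak_load, per_server_capacity, pattern="N+1"):
    """Return the server count for a redundancy pattern.

    N is the minimum number of servers needed to carry peak_load.
    """
    n = math.ceil(peak_load / per_server_capacity)
    if pattern == "N+1":
        return n + 1
    if pattern == "2N":
        return 2 * n
    raise ValueError(f"unknown pattern: {pattern}")


def utilisation_after_failures(peak_load, per_server_capacity, servers, failed=1):
    """Utilisation (0.0 to 1.0+) of the surviving servers in an
    active-active pool, assuming load is spread evenly."""
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("no servers left to carry the load")
    return peak_load / (survivors * per_server_capacity)


# Hypothetical numbers: 2,700 req/s at peak, 1,000 req/s per server.
peak, capacity = 2700, 1000
print("N+1 servers:", servers_required(peak, capacity, "N+1"))
print("2N servers: ", servers_required(peak, capacity, "2N"))

fleet = servers_required(peak, capacity, "N+1")
print(f"Normal utilisation:    {utilisation_after_failures(peak, capacity, fleet, failed=0):.1%}")
print(f"After one node fails:  {utilisation_after_failures(peak, capacity, fleet, failed=1):.1%}")
```

With these numbers the N+1 fleet survives one failure, but the survivors end up running hot. That is the trade-off to weigh against the cost of 2N.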

The acronyms you need

RTO (Recovery Time Objective): How fast must we recover? Hours? Minutes? Seconds?
RPO (Recovery Point Objective): How much data can we afford to lose? Last day? Last hour? Nothing?
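MTTR (Mean Time to Recovery): How long does recovery actually take, on average? This is a measurement, so compare it against the RTO you promised.
MTBF (Mean Time Between Failures): How long, on average, does a system run between failures?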
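
RTO and RPO are targets you set; MTTR and MTBF are numbers you measure. As a rough sketch, the two measured values can be derived from an incident log, along with availability, which is uptime divided by total time (equivalently, MTBF / (MTBF + MTTR)). The log format, function name, and dates below are made up for illustration.

```python
from datetime import datetime, timedelta


def reliability_metrics(incidents, window_start, window_end):
    """Compute (MTTR, MTBF, availability) for an observation window.

    incidents: list of (failed_at, recovered_at) datetime pairs, all
    inside the window and not overlapping.
    """
    if not incidents:
        raise ValueError("no incidents in window; MTTR and MTBF are undefined")

    total = window_end - window_start
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = total - downtime

    mttr = downtime / len(incidents)   # average time to recover
    mtbf = uptime / len(incidents)     # average running time per failure
    availability = uptime / total      # fraction of the window spent up
    return mttr, mtbf, availability


# Hypothetical 30-day window with two outages (1 hour and 3 hours).
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 5, 0)),
]
mttr, mtbf, availability = reliability_metrics(
    incidents, datetime(2024, 1, 1), datetime(2024, 1, 31)
)
print(f"MTTR: {mttr}, MTBF: {mtbf}, availability: {availability:.3%}")

# Compare the slowest real recovery against an assumed 4-hour RTO.
rto = timedelta(hours=4)
slowest = max(end - start for start, end in incidents)
print("Slowest recovery within RTO:", slowest <= rto)
```

Checking the slowest recovery, not just the mean, is a deliberate choice here: an RTO is a promise about every incident, and an average can hide the one outage that broke it.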

Backups: your safety net

The 3-2-1 rule. Three copies of your data, on two different types of media, with one copy off-site. It’s simple but it works.
Test your restores. A backup you’ve never tested isn’t a backup—it’s a hope. Actually restore something regularly. You’ll find problems before they matter.
Consider ransomware. Backups connected to your network can be encrypted by ransomware too. Air-gapped or immutable backups are your real safety net.
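
The 3-2-1 rule is easy to check mechanically against an inventory of where your copies live. This is a minimal sketch: the BackupCopy record, its field names, and the sample inventory are hypothetical, and it assumes the production copy counts as one of the three, which is the common reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str               # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable: bool = False  # immutable or air-gapped


def check_3_2_1(copies):
    """Return a list of problems; an empty list means 3-2-1 is satisfied."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies are on the same type of media")
    if not any(c.offsite for c in copies):
        problems.append("no copy is stored off-site")
    return problems


def has_ransomware_safe_copy(copies):
    """True if at least one copy is immutable or air-gapped."""
    return any(c.immutable for c in copies)


# Hypothetical inventory, including the production copy.
inventory = [
    BackupCopy("production database", media="disk", offsite=False),
    BackupCopy("nightly snapshot on NAS", media="disk", offsite=False),
    BackupCopy("weekly archive", media="object-storage", offsite=True, immutable=True),
]
print("3-2-1 problems:", check_3_2_1(inventory) or "none")
print("Ransomware-safe copy present:", has_ransomware_safe_copy(inventory))
```

A check like this only confirms the copies exist on paper. It says nothing about whether they restore, so it complements restore testing and does not replace it.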

💡 The key insight

Resilience is a practice, not a product. You can’t buy it and forget it. It requires testing, updating, and practicing. The organisations that recover well from incidents are the ones that prepare and drill regularly—not the ones with the fanciest equipment.

Free resources to learn more

Classic reading: Google SRE Book — Free online, covers reliability engineering from Google’s perspective
Chaos engineering: Principles of Chaos Engineering — The practice of intentionally breaking things to find weaknesses
War stories: Public Post-Mortems — Learn from other people’s outages (invaluable!)
Hands-on: Chaos Monkey — Netflix’s tool for randomly killing production instances (start in a test environment!)
