Infrastructure pillar · Module 6 of 6

Infrastructure management

Having infrastructure is one thing. Running it well is another entirely. This module covers the practices that separate “it works” from “it works reliably.”

Controls stack visual kit

Reusable icons and a telemetry-to-audit diagram aligned to our fundamentals and operational guides.

Governance evidence

Use for control statements that cite ISO/IEC 42001 clause 6.3 change management, EU AI Act Articles 62–75, and SOC 2 trust service criteria.

Secure supply chain

Pair with SBOM, provenance, and intake guidance that references SPDX or CycloneDX formats, SLSA Level 3 attestations, and NIST SSDF tasks PS.3/PO.4.

Telemetry & evaluations

Highlight logging of prompts, responses, refusal rates, and safety filters alongside adversarial evaluation suites from NIST AI RMF playbooks or UK AISI guidance.

Assurance & resilience

Use for incident response and assurance artefacts that must meet OMB M-24-10 24-hour notifications, CIRCIA’s 72-hour clocks, and serious-incident duties under the EU AI Act.

Signals → Controls → Evidence → Audit
  • Signals: prompt traces, supplier advisories, and safety filter activations streamed into monitoring.
  • Controls: guardrails, change review, SBOM validation, and access enforcement tied to AI lifecycle gates.
  • Evidence: runbooks that capture artefacts for ISO/IEC 42001 management reviews and SOC 2 narratives.
  • Audit: regulator-facing packets that satisfy EU AI Act post-market monitoring, OMB M-24-10, and CIRCIA timelines.

The fundamentals of good operations

  • Change management. Most outages are caused by changes. Not by hardware failing randomly, but by someone changing something without fully understanding the impact. Good change management isn’t bureaucracy—it’s protection.
  • Documentation. If it’s not written down, it doesn’t exist. When something breaks at 2am, you don’t want to be figuring things out from scratch. Runbooks, diagrams, and decision logs save lives (and weekends).
  • Monitoring. You need to know when things are broken—ideally before users tell you. Good monitoring watches what matters, alerts appropriately (not too much noise!), and gives you the data to diagnose problems.
  • Automation. Humans make mistakes. Repeatable tasks should be automated. Not because automation is trendy, but because consistency prevents outages.
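
To make the automation point concrete, here is a minimal sketch of an idempotent task: one that can be run any number of times and always leaves the system in the same state. The function name `ensure_line` and the config setting in the usage note are hypothetical, chosen only for illustration. Real configuration tools do far more, but the core idea is the same.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` appears in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Calling `ensure_line(Path("app.conf"), "max_connections=100")` twice changes the file the first time and does nothing the second. That "check first, then act" shape is what makes a task safe to repeat, and it also gives you a simple signal of whether anything actually changed.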

Monitoring: what to watch

The basics

  • CPU, memory, disk usage on every server
  • Network throughput and errors
  • Application response times and error rates
  • Database query performance
  • Queue lengths and processing backlogs
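
One common way to watch basics like these without drowning in noise is to alert only when a metric stays over its limit for several samples in a row. The sketch below assumes you already collect samples somewhere; the names `Threshold` and `should_alert`, and the limits in the usage note, are hypothetical and not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str           # label only, e.g. "cpu_percent"
    limit: float          # alert when the value is above this
    consecutive: int = 3  # samples in a row required before alerting


def should_alert(samples: list[float], threshold: Threshold) -> bool:
    """Return True if the last `consecutive` samples all exceed the limit."""
    recent = samples[-threshold.consecutive:]
    if len(recent) < threshold.consecutive:
        return False  # not enough data yet
    return all(value > threshold.limit for value in recent)
```

With `Threshold("cpu_percent", limit=90.0)`, the samples `[50, 95, 60, 70]` do not alert because the spike was brief, while `[50, 91, 95, 99]` do because the last three readings are all over the limit.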

What actually matters

  • User-facing metrics (can users do what they need?)
  • Business metrics (are sales processing?)
  • SLO/SLA metrics (are we meeting commitments?)
  • Cost metrics (is spending as expected?)
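
For the SLO question, a widely used framing is the error budget: if the target is 99.9% success, the remaining 0.1% is how much failure you can afford. Here is a rough sketch of that arithmetic. The function name and the numbers in the usage note are made up for illustration, and real SLO tooling usually works over rolling time windows rather than a single count.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the success ratio you committed to, e.g. 0.999.
    """
    if total == 0:
        return 1.0  # no traffic means nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures
```

For example, with a 0.999 target, 1,000,000 requests, and 250 failures, about 1,000 failures are allowed, so roughly three quarters of the budget is left. A negative result means the commitment has already been missed for that period.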

When things go wrong: incident management

  • Detection. Something’s broken. Maybe monitoring caught it, maybe a user reported it. Clock starts now.
  • Triage. How bad is it? Who’s affected? What’s the immediate priority—restoring service or understanding the cause?
  • Response. Fix it. Or at least stabilise it. Communication is key—keep stakeholders informed.
  • Review. After things calm down: what happened? Why? How do we prevent it? No blame—just learning.
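
The four stages above map naturally onto a few timestamps, and recording them makes the review stage much easier. Below is a minimal sketch of an incident record. The class and field names are hypothetical, and "time to restore" is measured here from detection, which is one convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    started: datetime                    # when the fault actually began
    detected: datetime                   # when monitoring or a user flagged it
    resolved: Optional[datetime] = None  # when service was restored

    def time_to_detect(self) -> timedelta:
        """How long the problem went unnoticed."""
        return self.detected - self.started

    def time_to_restore(self) -> Optional[timedelta]:
        """How long it took to restore service once detected."""
        if self.resolved is None:
            return None  # incident is still ongoing
        return self.resolved - self.detected
```

Tracking these two durations across incidents shows whether monitoring is catching problems sooner and whether response is getting faster, which is useful input for a blameless review.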

💡 The key insight

Great infrastructure teams are proactive, not reactive. They automate routine work so they have time for improvements. They build systems that alert them to problems before they become outages. They learn from every incident. This doesn’t happen by accident—it takes deliberate investment in tools, processes, and culture.

Nice work! What’s next?

You’ve covered a lot of ground. You now understand the building blocks of IT infrastructure, from physical data centres to cloud services to the practices that keep everything running. Here’s where to go from here.

Want to go hands-on?

  • Sign up for free tiers on AWS, Azure, or GCP and actually build something
  • Set up a home lab (even a Raspberry Pi counts!)
  • Try network simulators like GNS3 or Packet Tracer
  • Follow along with the Google SRE books

Want to go deeper?