Infrastructure Fundamentals

A comprehensive reference for infrastructure leads who need a single, defensible playbook for the resilience lifecycle, capacity planning, supplier risk management, and measurable operational metrics.

The content aligns with ISO 22301, NIST SP 800-34, Uptime Institute and EN 50600 design principles, CHIPS Act guardrails, export controls, and EU energy and sustainability reporting rules, so teams can defend decisions to regulators, insurers, and boards.

Resilience lifecycle

Embed resilience across design, operations, and recovery so critical services survive outages, vendor disruptions, or regulatory audits. Anchor playbooks to ISO 22301, NIST SP 800-34, and sector reliability rules such as NERC EOP and CIP.

Lifecycle controls

  • Design for failure with dual power and cooling paths mapped to Uptime Tier or EN 50600 targets, plus alternate network carriers and fuel contracts validated through method-of-procedure testing.
  • Dependency mapping that traces workloads to facilities, network routes, spares, and people, with change-control gates that require validated rollback plans before maintenance windows.
  • Exercise and adapt: quarterly tabletop tests for grid loss, transit carrier failure, supplier lockout, and cyber incidents, with corrective actions tracked in the CMDB and risk register.
  • Recovery discipline that binds RTO/RPO to business impact analysis outputs, synchronises backup verification with immutability checks, and documents handoffs to incident command systems.

Evidence and metrics

  • RTO/RPO coverage per service with the percentage proven in the last two test cycles.
  • Concurrent maintainability for power and cooling chains, measured as the share of capacity that can be serviced without downtime.
  • Drill close-out: completion rate and time-to-remediate for issues logged during exercises and post-incident reviews.
  • Runbook freshness tracked by median days since last validation for failover, black-start, and telecom reroute playbooks.
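
Two of the metrics above reduce to simple arithmetic. The sketch below is illustrative only: the record layout, service names, and dates are hypothetical, and it assumes "proven" means the RTO/RPO target was met in both of the last two test cycles.

```python
from datetime import date
from statistics import median

# Hypothetical records: (runbook name, date of last validation).
runbooks = [
    ("failover", date(2024, 1, 10)),
    ("black-start", date(2024, 3, 1)),
    ("telecom-reroute", date(2023, 11, 20)),
]

# Hypothetical test history per service, oldest cycle first.
# True means the RTO/RPO target was met in that cycle.
test_history = {
    "billing": [True, True],
    "dns": [False, True],
    "storage": [True, True],
    "telemetry": [True],  # only one cycle on record, so not yet proven
}


def runbook_freshness_days(records, today):
    """Median days since last validation across all runbooks."""
    return median((today - validated).days for _name, validated in records)


def proven_coverage_pct(history):
    """Percentage of services that met RTO/RPO in both of the last two cycles."""
    if not history:
        return 0.0
    proven = sum(
        1 for results in history.values()
        if len(results) >= 2 and all(results[-2:])
    )
    return 100.0 * proven / len(history)


today = date(2024, 4, 1)
freshness = runbook_freshness_days(runbooks, today)
coverage = proven_coverage_pct(test_history)
print(f"median runbook age: {freshness} days, proven coverage: {coverage:.0f}%")
```

Using the median rather than the mean keeps one long-neglected runbook from hiding behind several fresh ones, though a maximum-age figure may be worth reporting alongside it.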

Lifecycle cadence

Tie critical changes to a resilience calendar: quarterly integrated exercises, semiannual supplier failover tests, and annual program reviews that refresh the business impact analysis and dependency maps. Use these cadences to evidence the ISO 22301 exercise programme (clause 8.5) and performance evaluation (clause 9) requirements, and to meet insurer-required risk engineering checkpoints.

Figure: Five disaster recovery tiers progressing from cold backups to active-active, with RTO and RPO targets plus operational controls for each tier.

Map business impact analysis outputs to the right disaster recovery tier, pairing RTO/RPO targets with the controls, telemetry, and test cadence required to defend resilience claims to auditors and regulators.
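
One way to make the BIA-to-tier mapping repeatable is a small lookup that picks the least demanding tier able to meet both targets. This is a sketch under stated assumptions: the per-tier RTO/RPO figures and the labels for the three middle tiers are hypothetical placeholders, not values taken from this reference; only the end points (cold backups and active-active) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    number: int
    label: str
    rto_minutes: float  # recovery time this tier is assumed to achieve
    rpo_minutes: float  # data-loss window this tier is assumed to achieve


# Hypothetical tier table, least demanding first. Replace with your own figures.
TIERS = [
    Tier(1, "cold backups", 72 * 60, 24 * 60),
    Tier(2, "tier 2", 24 * 60, 4 * 60),
    Tier(3, "tier 3", 4 * 60, 60),
    Tier(4, "tier 4", 60, 15),
    Tier(5, "active-active", 5, 0),
]


def select_tier(required_rto_minutes, required_rpo_minutes):
    """Return the lowest tier whose assumed RTO and RPO both meet the BIA targets."""
    for tier in TIERS:
        if (tier.rto_minutes <= required_rto_minutes
                and tier.rpo_minutes <= required_rpo_minutes):
            return tier
    raise ValueError("no tier meets the stated RTO/RPO; revisit the BIA targets")


# A service whose BIA tolerates 8 hours of downtime and 2 hours of data loss.
chosen = select_tier(required_rto_minutes=8 * 60, required_rpo_minutes=2 * 60)
print(chosen.number, chosen.label)
```

Scanning from the least demanding tier upward avoids over-provisioning: a service only lands in active-active when no cheaper tier satisfies its targets.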

Capacity planning and demand shaping

Plan power, thermal, and compute headroom alongside network and logistics lead times so construction, retrofit, and procurement decisions stay ahead of demand and regulatory milestones.

Controls

  • Load envelopes that model power, cooling, and space by phase, referencing ASHRAE TC 9.9 thermal ranges and utility interconnect milestones before locking rack densities.
  • Network and compute planning that pairs GPU/CPU procurement roadmaps with transit diversity, optical budgets, and export-control determinations for accelerators.
  • Change governance that blocks high-risk work during peak seasons, requires updated method-of-procedure documents, and pre-positions spares based on mean time between failures (MTBF) curves.
  • Energy and water budgeting tied to local disclosure rules and heat-reuse opportunities, with contingency fuel and water contracts that match generator and cooling runtime assumptions.

Metrics

  • Headroom: available kW, tons of cooling, and cross-connect capacity versus committed loads by site and phase.
  • Lead-time adherence: percentage of critical-path items (switchgear, transformers, chillers, GPUs, optics) ordered on or before required dates.
  • Forecast accuracy: rolling three-month variance between planned and actual rack power draw, network utilisation, and shipment arrivals.
  • Change stability: number of maintenance-induced incidents per 100 changes and percentage of changes with validated rollback steps.
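
The forecast, lead-time, and change-stability metrics above can be computed along these lines. This is a minimal sketch with hypothetical sample data; it assumes "variance" means the mean absolute percentage difference between plan and actuals over the window, which is one of several reasonable definitions.

```python
from datetime import date


def forecast_variance_pct(planned, actual):
    """Mean absolute percentage variance of actuals against plan over the window."""
    if not planned or len(planned) != len(actual):
        raise ValueError("planned and actual must be non-empty and equal in length")
    return 100.0 * sum(abs(a - p) / p for p, a in zip(planned, actual)) / len(planned)


def lead_time_adherence_pct(orders):
    """Percentage of critical-path items ordered on or before their required date."""
    if not orders:
        return 0.0
    on_time = sum(1 for required, ordered in orders if ordered <= required)
    return 100.0 * on_time / len(orders)


def incidents_per_100_changes(incidents, changes):
    """Maintenance-induced incidents normalised per 100 changes."""
    return 100.0 * incidents / changes if changes else 0.0


# Hypothetical rolling three-month window of rack power draw in kW.
planned_kw = [400.0, 420.0, 450.0]
actual_kw = [380.0, 441.0, 450.0]

# Hypothetical orders: (required-by date, actual order date).
orders = [
    (date(2024, 2, 1), date(2024, 1, 15)),  # switchgear, ordered early
    (date(2024, 2, 1), date(2024, 2, 1)),   # transformer, ordered on the day
    (date(2024, 3, 1), date(2024, 3, 20)),  # chiller, ordered late
    (date(2024, 3, 15), date(2024, 3, 1)),  # optics, ordered early
]

variance = forecast_variance_pct(planned_kw, actual_kw)
adherence = lead_time_adherence_pct(orders)
stability = incidents_per_100_changes(incidents=3, changes=150)
print(f"variance {variance:.1f}%, adherence {adherence:.0f}%, "
      f"{stability:.1f} incidents per 100 changes")
```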

What good looks like

Scenario models show flexible headroom of 20–30% or more for critical circuits, align procurement gates to DOE and utility interconnection timelines, and link IT load growth to carbon- and water-intensity targets under local disclosure schemes. Teams share a single, versioned capacity plan that facilities, network, finance, and compliance all use during steering reviews.
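
A headroom check against that range might look like the following. The site names and kW figures are hypothetical, headroom is assumed to mean uncommitted capacity as a share of total capacity, and the 20% floor is taken from the lower end of the range above.

```python
HEADROOM_FLOOR_PCT = 20.0  # assumption: lower bound of the 20-30% range


def headroom_pct(capacity_kw, committed_kw):
    """Uncommitted capacity as a percentage of total capacity."""
    if capacity_kw <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * (capacity_kw - committed_kw) / capacity_kw


# Hypothetical critical circuits: name -> (capacity kW, committed kW).
sites = {
    "site-a": (1000.0, 750.0),
    "site-b": (800.0, 700.0),
}

headroom = {name: headroom_pct(cap, used) for name, (cap, used) in sites.items()}
below_floor = sorted(name for name, pct in headroom.items() if pct < HEADROOM_FLOOR_PCT)
print(headroom, "below floor:", below_floor)
```

Running the same check per phase, not only per site, shows when a planned build-out would push a circuit under the floor before new capacity is energised.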

Supplier risk and continuity

Manage silicon, colocation, network, fuel, and critical-spares providers with the same rigor used for internal controls. Align due diligence with NIST SP 800-161 Rev 1, FedRAMP supply-chain expectations, and EU DORA outsourcing principles.

Controls

  • Multi-source strategies across fabs, OSATs, carriers, and fuel suppliers, with contractual right-of-audit and exit plans that include design escrow and data repatriation steps.
  • Critical spares governance that inventories transformers, switchgear, optics, and facility controllers, sets reorder triggers, and validates alternates through periodic substitution tests.
  • Contractual notification and evidence clauses that require 24-hour incident notice, quarterly service-level reporting, and attestations for export-control, sanctions, and data-handling obligations.
  • Geopolitical and regulatory screening of suppliers against EAR/ITAR restrictions, CHIPS Act guardrails, OFAC sanctions, and localisation requirements affecting design data or telemetry.

Metrics

  • Single-point exposure: percentage of critical inputs sourced from one vendor or country; targets typically keep any one geography below 40–50%.
  • Critical spare autonomy: days of coverage for transformers, PDUs, and optics at current failure rates, compared against replenishment lead times.
  • Contract compliance: rate of on-time SLA reports, incident notifications, and audit responses from strategic suppliers.
  • Exit readiness: time to migrate workloads or reroute traffic using existing cross-connects, secondary fabs, or alternate logistics hubs.
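
Single-point exposure and spare autonomy can be sketched as follows. Supplier names, regions, volumes, and failure rates are hypothetical, and the 50% concentration limit is taken from the upper end of the typical range mentioned above.

```python
from collections import defaultdict

CONCENTRATION_LIMIT_PCT = 50.0  # assumption: upper end of the 40-50% range


def geography_shares(inputs):
    """Percentage of total critical-input volume sourced from each geography."""
    totals = defaultdict(float)
    for _supplier, geography, volume in inputs:
        totals[geography] += volume
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {geo: 100.0 * vol / grand_total for geo, vol in totals.items()}


def spare_coverage_days(on_hand, failures_per_year):
    """Days the on-hand spares last at the current annual failure rate."""
    if failures_per_year <= 0:
        return float("inf")
    return on_hand * 365.0 / failures_per_year


# Hypothetical sourcing data: (supplier, geography, volume).
inputs = [
    ("fab-a", "region-1", 40.0),
    ("fab-b", "region-1", 20.0),
    ("fab-c", "region-2", 25.0),
    ("fab-d", "region-3", 15.0),
]

shares = geography_shares(inputs)
over_limit = sorted(geo for geo, pct in shares.items() if pct > CONCENTRATION_LIMIT_PCT)

# Hypothetical optics pool: 6 spares, 12 failures a year, 120-day lead time.
coverage_days = spare_coverage_days(on_hand=6, failures_per_year=12)
covers_lead_time = coverage_days >= 120
print(shares, over_limit, coverage_days, covers_lead_time)
```

Comparing coverage days with the replenishment lead time is the useful test: spares that run out before a reorder can arrive leave a gap no matter how large the stock looks on paper.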

Regulatory hooks

Supplier controls should demonstrate alignment with NIST SP 800-161 Rev 1 supply-chain risk management, NERC CIP-013 procurement expectations for critical infrastructure, EU DORA operational resilience outsourcing rules, and CHIPS Act guardrail covenants for funded semiconductor projects.

Metrics, reporting, and regulatory hooks

Pair operational metrics with the evidence regulators, auditors, and insurers request so findings can be closed quickly and leadership dashboards stay defensible.

Operational metrics to publish

  • Availability SLOs per site and region with dependency-adjusted error budgets.
  • Energy and water performance: PUE, WUE, and heat-reuse ratios with monthly variance explanations.
  • Asset and firmware currency for OT and DCIM components, including time-to-remediate security advisories.
  • Incident readiness: mean time to detect (MTTD), mean time to recover (MTTR), and drill coverage across power, network, and supplier-failure scenarios.
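
For reference, the efficiency and availability figures above follow standard ratios: PUE is total facility energy divided by IT equipment energy, WUE is site water use in litres per kWh of IT energy, and an error budget is the downtime an SLO permits over a period. The sketch below uses hypothetical meter readings and a plain (not dependency-adjusted) error budget.

```python
def pue(total_facility_kwh, it_kwh):
    """Power usage effectiveness: total facility energy over IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh


def wue(site_water_litres, it_kwh):
    """Water usage effectiveness in litres per kWh of IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return site_water_litres / it_kwh


def error_budget_minutes(slo, period_days=30):
    """Downtime in minutes that an availability SLO allows over the period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_days * 24 * 60


# Hypothetical monthly readings for one site.
site_pue = pue(total_facility_kwh=1500.0, it_kwh=1000.0)
site_wue = wue(site_water_litres=1800.0, it_kwh=1000.0)
budget = error_budget_minutes(slo=0.9995, period_days=30)
print(f"PUE {site_pue:.2f}, WUE {site_wue:.2f} L/kWh, budget {budget:.1f} min")
```

A dependency-adjusted budget would need a model of how upstream services consume the allowance; that model is specific to each estate and is left out here.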

Evidence to retain

  • Change records with pre/post readings for critical maintenance, black-starts, and bypass events.
  • Tabletop and failover reports summarising objectives, outcomes, corrective actions, and ownership.
  • Supplier attestations covering export-control classifications, incident notifications, penetration tests, and financial viability checks.
  • Regulatory filings such as EU Energy Efficiency Directive facility reports, CSRD/ESRS E1 and E5 disclosures, or utility benchmarking submissions.

Figure: Dashboard architecture showing facility telemetry, IT workload data, a data platform layer, and reporting panels for PUE, WUE, carbon intensity, demand response, and compliance evidence.

Energy and sustainability dashboards pull calibrated facility meters, workload telemetry, emissions factors, and maintenance records into a single evidence set for EU Energy Efficiency Directive Article 11, ESRS E1/E5, ISO 50001, and utility program filings.

Regulatory mapping

Use these fundamentals to show how controls satisfy ISO 22301 continuity clauses, ISO 27001 Annex A physical security requirements, EN 50600 design and performance reporting, EU Energy Efficiency Directive and ESRS reporting, FedRAMP logging and resilience expectations for federal workloads, and EAR/ITAR or CHIPS Act guardrail covenants for advanced silicon.