
nvidia-datacenter-bringup

Bring up NVIDIA HGX/DGX datacenter GPU hosts on Ubuntu 24.04 LTS — air-gapped or connected, Secure Boot enabled. Covers B300/B200/H100/A100/L40S/L4 driver+fabricmanager+NVLSM+DOCA-OFED install order and exact package set from NVIDIA CUDA repo + DOCA repo. Triggers on B300/B200/HGX/DGX install, "fabricmanager won't start", "system not yet initialized" / cudaErrorSystemNotReady, NVLSM missing, ib_umad not loading, DOCA-OFED before NVIDIA driver, nvidia-driver-pinning-XXX, nvlink5-XXX, nvidia-open vs cuda-drivers, "Blackwell requires open kernel modules", ConnectX-7/8 bridge device, FM exact-version-match, gpu-operator cuda-validator CrashLoopBackOff, B300 PCI ID 0x3182, air-gap CUDA + DOCA mirror, three-tier DOCA GPG key, MOK enrollment, DKMS sign, Dell PowerEdge XE9780/XE9785 baseboard firmware v1.4.30, iDRAC Redfish virtual AC cycle DellOemChassis.ExtendedReset, generic "install nvidia driver ubuntu 24.04 datacenter".
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# nvidia-datacenter-bringup

Opinionated greenfield recipe for **NVIDIA datacenter GPUs on Ubuntu 24.04 LTS**: get from a clean OS install to a healthy host where `nvidia-smi` reports all GPUs, `nvidia-fabricmanager` is `active (running)`, and the gpu-operator `cuda-validator` pod passes. Air-gap is the primary case; connected sites use the same packages from the same upstream URLs.

## Decision tree

| Question | Answer | Read |
|---|---|---|
| Has Blackwell silicon (B300/B200/B100)? | Yes | Open kernel modules **mandatory**; proprietary is unsupported [[open-modules-transition]] |
| Grace Hopper (GH200)? | Yes | Open kernel modules **mandatory** (same as Blackwell). Otherwise 8-GPU SXM path [[hopper-recipe]] |
| HGX 8-GPU SXM with 3rd-gen NVSwitch (H100/H200/H800 in XE9680 or similar)? | Yes | Open recommended (not mandatory). Use `cuda-drivers-fabricmanager-<branch>` meta; **skip** `nvlink5-<branch>`, NVLSM, DOCA-OFED entirely. Min driver 525+ for H100, 535+ for H200 [[hopper-recipe]] |
| HGX 4-GPU SXM (H100 in XE8640, A100 4-GPU)? | Yes | **No NVSwitch on this baseboard**: direct NVLink mesh between 4 GPUs. **Skip fabricmanager entirely** + DOCA + NVLSM. Three-package install: driver + container-toolkit [[hopper-recipe]] |
| HGX A100 8-GPU (2nd-gen NVSwitch)? | Yes | Same as 8-GPU SXM H100 path. Min driver 450.xx. ALI not available; FM trains NVLinks at boot |
| L40S, L40, L4, H200 NVL, or other PCIe-only? | Yes | Driver + container-toolkit only. **Skip** DOCA-OFED, f