Infrastructure operations · shared across sites
Reports to: Director, Hardware Engineering
Location: Pleasanton, CA (hybrid) or assigned site; travel up to 25%
Department: Infrastructure & DC Operations / Systems Engineering
The Hardware Engineer owns the hardware lifecycle for GPU and supporting infrastructure assets, including fleet health monitoring, RMA workflows, firmware management, and long-range capacity planning. The role is the technical owner of the physical compute platform.
Monitor GPU and server health, including thermals, error rates, and component failures
Drive the end-to-end RMA process with vendors (NVIDIA, Supermicro, HPE, and others)
Manage firmware, BIOS, and BMC upgrade campaigns across the fleet
Develop hardware burn-in and acceptance test procedures, including NCCL tests and stress tests
Investigate hardware failures and produce vendor-grade root cause analyses
Maintain hardware inventory, asset records, and CMDB accuracy
Drive capacity planning across compute, storage, and networking
Coordinate with Procurement on spare parts strategy and stocking levels
Author hardware engineering runbooks and operational procedures
Support new platform bring-up, qualification, and reference architecture validation
5+ years in hardware engineering, systems engineering, or data center engineering
Deep knowledge of x86 server architecture, GPU systems, and modern storage
Hands-on experience with NVIDIA HGX, DGX, or hyperscale-class systems
Strong Linux fundamentals and scripting skills (Python, Bash)
Bachelor's degree in computer science, electrical engineering, or a related field
Experience with NVIDIA Mission Control, Base Command Manager, or Bright Cluster Manager
Familiarity with IPMI, Redfish, and vendor management interfaces
Knowledge of liquid cooling and high-density power architectures
Experience operating fleets of 1,000+ GPUs