Hardware Engineer

STN

📍 San Francisco, United States 🇺🇸

full-time

senior

hybrid

Key Skills

GPU, NVIDIA, Python, Linux, BIOS

Industry

Consumer Electronics, Data Centers

Job Description

Infrastructure operations · shared across sites

Reports to: Director, Hardware Engineering

Location: Pleasanton, CA (hybrid) or assigned site; travel up to 25%

Department: Infrastructure & DC Operations / Systems Engineering

Position summary

The Hardware Engineer owns the hardware lifecycle for GPU and supporting infrastructure assets, including fleet health monitoring, RMA workflows, firmware management, and long-range capacity planning. The role is the technical owner of the physical compute platform.

Key responsibilities

Monitor GPU and server health, including thermals, error rates, and component failures
Drive the RMA process with vendors (NVIDIA, Supermicro, HPE, and others) end-to-end
Manage firmware, BIOS, and BMC upgrade campaigns across the fleet
Develop hardware burn-in and acceptance test procedures, including NCCL and stress tests
Investigate hardware failures and produce vendor-grade root cause analyses
Maintain hardware inventory, asset records, and CMDB accuracy
Drive capacity planning across compute, storage, and networking
Coordinate with Procurement on spare parts strategy and stocking levels
Author hardware engineering runbooks and operational procedures
Support new platform bring-up, qualification, and reference architecture validation

Required qualifications

5+ years in hardware engineering, systems engineering, or data center engineering
Deep knowledge of x86 server architecture, GPU systems, and modern storage
Hands-on experience with NVIDIA HGX, DGX, or hyperscale-class systems
Strong Linux fundamentals and scripting skills (Python, Bash)
Bachelor's degree in computer science, electrical engineering, or a related field

Preferred qualifications

Experience with NVIDIA Mission Control, Base Command Manager, or Bright Cluster Manager
Familiarity with IPMI, Redfish, and vendor management interfaces
Knowledge of liquid cooling and high-density power architectures
Experience operating fleets of 1,000+ GPUs