Hardware Engineer

STN 

📍 San Francisco, United States 🇺🇸

full-time
senior
hybrid

Key Skills

GPU, NVIDIA, Python, Linux, BIOS

Industry

Consumer Electronics, Data Centers

Job Description

Hardware Engineer

Infrastructure operations · shared across sites

Reports to: Director, Hardware Engineering

Location: Pleasanton, CA (hybrid) or assigned site; travel up to 25%

Department: Infrastructure & DC Operations / Systems Engineering

Position summary

The Hardware Engineer owns the hardware lifecycle for GPU and supporting infrastructure assets, including fleet health monitoring, RMA workflows, firmware management, and long-range capacity planning. This role is the technical owner of the physical compute platform.

Key responsibilities
  • Monitor GPU and server health, including thermals, error rates, and component failures

  • Drive the RMA process with vendors (NVIDIA, Supermicro, HPE, and others) end-to-end

  • Manage firmware, BIOS, and BMC upgrade campaigns across the fleet

  • Develop hardware burn-in and acceptance test procedures, including NCCL and stress tests

  • Investigate hardware failures and produce vendor-grade root cause analyses

  • Maintain hardware inventory, asset records, and CMDB accuracy

  • Drive capacity planning across compute, storage, and networking

  • Coordinate with Procurement on spare parts strategy and stocking levels

  • Author hardware engineering runbooks and operational procedures

  • Support new platform bring-up, qualification, and reference architecture validation

Required qualifications
  • 5+ years in hardware engineering, systems engineering, or data center engineering

  • Deep knowledge of x86 server architecture, GPU systems, and modern storage

  • Hands-on experience with NVIDIA HGX, DGX, or hyperscale-class systems

  • Strong Linux fundamentals and scripting skills (Python, Bash)

  • Bachelor's degree in computer science, electrical engineering, or a related field

Preferred qualifications
  • Experience with NVIDIA Mission Control, Base Command Manager, or Bright Cluster Manager

  • Familiarity with IPMI, Redfish, and vendor management interfaces

  • Knowledge of liquid cooling and high-density power architectures

  • Experience operating fleets of 1,000+ GPUs