Problem Solving and Deep-Level Troubleshooting: Investigating and troubleshooting problems and hardware faults within our GPU platforms that our automation can’t diagnose. This will involve pulling data from system logs, kernel logs and BMC Redfish APIs, and, if the data is not there, working with hardware and kernel engineers to add the information you need to make accurate determinations.
Coordination and Collaboration: Working closely with our Data Centre Operations, Hardware Engineering and Capacity Planning teams to repair and remediate failed hardware, ensure consistent delivery of new hardware to customers, and roll out new upgrades across the fleet
Automation and Tool Development: Automate routine processes and build hardware diagnostics, provisioning and repair tooling
Build Processes and Documentation: When you figure out the best way to do something, you’ll build the processes, documentation and tooling to help the next person who runs into the same problem
Validate and Test New Hardware: Crusoe is often the first company in the world to get the latest generation of AI hardware, before it’s fully tested. You’ll conduct rigorous testing and validation on this cutting-edge hardware and on servers that come back from repair
On-Call: Participate in our on-call rotation, partnering with our US teams to provide follow-the-sun coverage