
Yotta Labs

GPU Cloud Platform Engineer

📌 Canada 🇨🇦

⏱ full-time

🧙‍♂️ mid-level

remote

Location: Remote (Global)

Type: Full-time

Company: Yotta Labs

Apply: careers@yottalabs.ai

🧠 About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at a planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

🛠️ Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters. If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

🎯 Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
  • Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems; define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization (see the sketch after this list).
  • Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover. Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
  • Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.
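
As a simplified illustration of the scheduling work above (node affinity plus taints/tolerations so that GPU nodes stay reserved for GPU workloads), here is a minimal sketch using the official Kubernetes Python client. The label key gpu.product, the container image, the namespace, and the H100 value are illustrative assumptions for this sketch, not Yotta Labs' actual configuration.

    from kubernetes import client, config

    # Illustrative only: label key "gpu.product", the image, and the namespace
    # are assumptions for this sketch, not a prescribed configuration.
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="llm-worker-0"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="inference",
                    image="nvcr.io/nvidia/pytorch:24.05-py3",
                    # Request one GPU from the device plugin.
                    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                )
            ],
            # Tolerate the taint that keeps non-GPU workloads off GPU nodes.
            tolerations=[
                client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
            ],
            # Require nodes labeled with the target GPU product (node affinity).
            affinity=client.V1Affinity(
                node_affinity=client.V1NodeAffinity(
                    required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
                        node_selector_terms=[
                            client.V1NodeSelectorTerm(
                                match_expressions=[
                                    client.V1NodeSelectorRequirement(
                                        key="gpu.product", operator="In", values=["NVIDIA-H100"]
                                    )
                                ]
                            )
                        ]
                    )
                )
            ),
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

In practice the same affinity and toleration stanzas would typically live in Helm chart values applied per cluster; the Python client is used here only to keep the sketch self-contained.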

✅ Qualifications

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm; expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
  • Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
  • Experience with monitoring tools such as Prometheus and Grafana; practical experience with GPU fault monitoring and alerting (see the exporter sketch after this list).
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, Karmada is a plus.
  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
  • Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.
  • Strong communication skills, self-motivation, and a collaborative team mindset.
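
To make the GPU monitoring requirement above concrete, here is a minimal sketch of a Prometheus exporter for per-GPU memory and utilization, assuming the pynvml NVML bindings and prometheus_client are available. The metric names and scrape port are illustrative choices for this sketch, not an established convention.

    import time

    import pynvml
    from prometheus_client import Gauge, start_http_server

    # Metric names and the port below are illustrative assumptions.
    GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])
    GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])

    def collect() -> None:
        # Read memory and utilization for every GPU visible to NVML.
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            GPU_MEM_USED.labels(gpu=str(i)).set(mem.used)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)

    if __name__ == "__main__":
        pynvml.nvmlInit()
        start_http_server(9400)  # Prometheus scrapes this endpoint
        while True:
            collect()
            time.sleep(15)

Alert rules (for example, on sustained high memory usage or NVML errors) would then be defined in Prometheus/Alertmanager and visualized in Grafana.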

🌟 Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs like H100.
  • Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python (see the sketch after this list).
  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
  • Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.
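
As a rough sketch of the kind of standardized inference API mentioned above, the snippet below wraps vLLM behind a small FastAPI endpoint. The model name, route, and request fields are illustrative assumptions rather than a prescribed interface.

    from fastapi import FastAPI
    from pydantic import BaseModel
    from vllm import LLM, SamplingParams

    app = FastAPI()
    # Model name is an illustrative assumption; tensor_parallel_size shards
    # weights across the node's GPUs when set above 1.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 256
        temperature: float = 0.7

    @app.post("/v1/generate")
    def generate(req: GenerateRequest):
        params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
        outputs = llm.generate([req.prompt], params)  # synchronous, batch of one
        return {"text": outputs[0].outputs[0].text}

Saved as server.py, this could be run with uvicorn server:app --host 0.0.0.0 --port 8000; a production variant would more likely use vLLM's async engine or its OpenAI-compatible server, with a gRPC front end where needed.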

🌍 Why Join Yotta Labs?

  • Be part of a visionary team aiming to redefine AI infrastructure.
  • Work on cutting-edge technologies that bridge AI and decentralized computing.
  • Collaborate with experts from leading institutions and tech companies.
  • Enjoy a flexible, remote work environment that values innovation and autonomy.

📩 How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to careers@yottalabs.ai. Please include links to any relevant projects or contributions.

