Our client Agay Barho is looking for a GPU Engineer for a remote position.
Agay Barho is seeking a highly skilled GPU Engineer to lead the performance and scalability efforts of its real-time image processing pipeline. This role involves owning and optimizing throughput on hardware platforms such as NVIDIA Jetson Orin and x86 systems with discrete GPUs, ensuring that high-frame-rate data streams run without dropped frames or added latency. The ideal candidate brings over five years of deep expertise in performance-critical modern C++ programming and CUDA architecture, coupled with a strong foundation in algorithm design and embedded system optimization. As the GPU Engineer, you will manage a dedicated team of five professionals, driving technical excellence in kernel development, bottleneck elimination, and hardware-aware tuning to push GPU hardware to its maximum potential.
In this position, you will design, implement, and tune custom CUDA or Triton kernels to meet demanding performance goals. You will perform comprehensive bottleneck analysis using advanced NVIDIA profiling tools, develop efficient data movement strategies to minimize host-to-device lag, and restructure algorithms to take advantage of specific GPU architectures. Your responsibilities will extend to optimizing mathematical throughput for inference applications with high-performance libraries like TensorRT, cuBLAS, cuFFT, and Thrust without compromising precision. Experience with computer vision geometry, motion capture technologies, and device drivers is also beneficial, as it will enhance the overall robustness and synchronization of sensor data streams across multiple GPUs.
Responsibilities
- Design, develop, and optimize low-level CUDA or Triton kernels for real-time image processing tasks requiring high throughput and low latency.
- Conduct in-depth bottleneck analysis using NVIDIA Nsight Systems, Nsight Compute, nvprof, and other profiling tools to identify and eliminate performance stalls related to memory bandwidth, kernel launches, and warp occupancy.
- Architect and implement efficient memory orchestration strategies including zero-copy, unified memory, and pinned memory to minimize host-to-device latency and ensure seamless data flow.
- Adapt and restructure algorithms to align with hardware architectures such as NVIDIA Ampere and Hopper, optimizing register pressure, cache utilization, and parallel computation efficiency.
- Utilize and integrate high-performance mathematical and inference libraries such as TensorRT, cuBLAS, cuFFT, and Thrust to maximize computational throughput while maintaining accuracy.
- Manage a team of five GPU engineers, providing technical guidance, task prioritization, and mentorship to drive continuous performance improvements and innovation.
- Apply strong linear algebra and parallel algorithm knowledge to develop efficient reduction, prefix sum, and other parallel processing methods tailored for GPU execution.
- Optimize workloads specifically for embedded environments like Linux for Tegra (L4T) and other resource-constrained platforms, ensuring robust performance under strict hardware limits.
- Leverage knowledge of computer vision geometry including camera intrinsics, lens distortion, and 3D triangulation to assist in advanced image processing and sensor fusion tasks.
- Incorporate motion capture expertise such as sub-pixel centroid extraction and high-speed marker tracking to enhance the precision and reliability of tracking systems.
- Work with camera drivers including GMSL2, MIPI, and V4L2 capture pipelines to support video input at the hardware level, ensuring compatibility and performance stability.
- Develop synchronization logic to coordinate hardware-level frame capture across multiple GPUs and sensors, maintaining consistent timing and data integrity in real-time workflows.