GPU Engineer - TJ / 1802676

Agay Barho! 

📍 Pakistan 🇵🇰

remote
mid-level
Expired
This job posting has expired.

Key Skills

CUDA, C++, NVIDIA, Tegra, computer vision

Industry

Consumer Electronics, Aerospace

Job Description

Our client Agay Barho is looking for a GPU Engineer for a remote position.

Agay Barho is seeking a skilled GPU Engineer to take ownership of the performance and scalability of its real-time image processing pipeline. This role focuses on maximizing the efficiency of NVIDIA Jetson Orin modules and discrete GPUs on x86 hosts, pushing the hardware toward its theoretical limits. The successful candidate will eliminate bottlenecks and reduce latency to the microsecond range, sustaining high-frame-rate image streams exceeding 200 FPS with zero dropped frames. The position requires deep expertise in modern C++ (C++17/20) and CUDA architecture, particularly in designing and tuning low-level kernels and managing memory orchestration for real-time applications.

This role is well suited to a candidate with at least two years of professional experience in a performance-critical environment and a strong foundation in parallel algorithm design and advanced mathematical concepts. The GPU Engineer will not manage a team but will be a key individual contributor, developing and optimizing performance-critical kernels and algorithms. Close familiarity with NVIDIA developer tools for performance profiling and tuning is essential, as is experience with embedded platforms such as Linux for Tegra (L4T). Additional expertise in computer vision geometry, motion capture (mocap) technologies, driver-level camera interfaces, and hardware synchronization is considered a significant advantage.

Responsibilities

  • Design, implement, and optimize custom CUDA or Triton kernels to enhance the performance of real-time image processing tasks.
  • Conduct bottleneck analysis using NVIDIA Nsight Systems and Nsight Compute to identify and resolve performance impediments such as memory bandwidth stalls, kernel launch overhead, and warp occupancy issues.
  • Architect and manage efficient data movement strategies including Zero-copy, Unified Memory, and Pinned Memory to minimize Host-to-Device latency in high-speed data streams.
  • Adapt and restructure algorithms specifically for NVIDIA architectures such as Ampere and Hopper, focusing on managing register pressure and maximizing L1 and L2 cache hit rates.
  • Leverage TensorRT and high-performance CUDA libraries such as cuBLAS, cuFFT, and Thrust to optimize mathematical computations without compromising accuracy or precision.
  • Develop and maintain a deep understanding of Streaming Multiprocessors, Warps, Thread Blocks, and shared memory access patterns to improve kernel efficiency and execution throughput.
  • Perform advanced profiling and roofline analysis using tools such as nvprof, Nsight, and Visual Profiler to guide performance tuning efforts.
  • Implement parallel algorithms built on techniques such as reduction and prefix sum, ensuring high-performance execution that remains mathematically sound.
  • Optimize workloads specifically for embedded and resource-constrained environments, including Linux for Tegra (L4T) platforms.
  • Apply knowledge of computer vision geometry, including camera intrinsics and extrinsics, lens distortion correction, and 3D triangulation to support image data processing requirements.
  • Incorporate motion capture data processing techniques, such as sub-pixel centroid extraction and high-speed marker tracking, into GPU accelerated pipelines where applicable.
  • Collaborate with hardware engineers and software developers to integrate driver-level camera support for interfaces such as GMSL2 and MIPI and to manage video capture pipelines with V4L2.
  • Develop and implement synchronization logic at the hardware level to coordinate frame capture across multiple GPU-connected sensors, ensuring coherent multi-sensor data flow.