NVIDIA - Santa Clara, CA
posted 4 months ago
We are the GPU Communications Libraries and Networking team at NVIDIA. Our mission is to deliver cutting-edge communication libraries such as NCCL, NVSHMEM, and UCX, which are essential for Deep Learning (DL) and High-Performance Computing (HPC). Applications in these fields demand immense computational power and often run at scales of tens of thousands of GPUs. Within a single node, these GPUs are interconnected by high-speed technologies such as NVLink and PCIe; across nodes, they communicate over high-speed networks such as InfiniBand and Ethernet. The performance of GPU-to-GPU communication is critical because it directly influences overall application performance, especially at large scale.

In this role, we are seeking a technical leader to manage our NVSHMEM and UCX libraries. This is an exceptional opportunity to push the boundaries of technology and contribute to the development of platforms that have never been seen before.

As a Software Engineering Manager, you will lead, mentor, and grow your library engineering team, overseeing the planning and execution of projects while ensuring the quality and performance of your libraries. This position requires active participation in feature design and implementation, as well as collaboration with internal and external partners to understand their use cases and requirements. You will work closely with engineering teams, program and product management, and partners to define the product roadmap. You will also continuously review established processes, infrastructure, and practices and identify opportunities for improvement, so that your teams execute as efficiently and transparently as possible.