NVIDIA - Santa Clara, CA

posted 3 months ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

We are the GPU Communications Libraries and Networking team at NVIDIA. Our mission is to deliver cutting-edge communication libraries such as NCCL, NVSHMEM, and UCX, which are essential for Deep Learning (DL) and High-Performance Computing (HPC). Applications in these fields demand immense computational power and often run at scales of tens of thousands of GPUs. Within a single node, these GPUs are interconnected using high-speed technologies like NVLink and PCIe; across nodes, they communicate over high-speed networks such as InfiniBand and Ethernet. The performance of communication between GPUs is critical because it directly influences overall application performance, especially at large scales.

In this role, we are seeking a technical leader to manage our NVSHMEM and UCX libraries. This is an exceptional opportunity to push the boundaries of technology and contribute to the development of platforms that have never been built before. As a Software Engineering Manager, you will lead, mentor, and grow your library engineering team, overseeing the planning and execution of projects while ensuring the quality and performance of your libraries.

This position requires active participation in feature design and implementation, as well as collaboration with internal and external partners to understand their use cases and requirements. You will work closely with engineering teams, program and product management, and partners to define the product roadmap. You will also continuously review established processes, infrastructure, and practices and identify opportunities for improvement, so that your teams execute as efficiently and transparently as possible.

Responsibilities

  • Lead, mentor, and grow the library engineering team.
  • Plan and execute projects while ensuring quality and performance of libraries.
  • Participate in feature design and implementation.
  • Interact with internal and external partners to understand use cases and requirements.
  • Collaborate with engineering teams, program and product management, and partners to define the product roadmap.
  • Review and identify improvement opportunities in established processes, infrastructure, and practices.

Requirements

  • 10+ years of experience in the software industry with specialization in HPC networking or system software.
  • 4+ years of management experience.
  • BS, MS, or Ph.D. in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent experience.
  • Prior experience in systems software, communication runtime, or high-performance networking software development, with a successful track record of managing complex software features or products through the full product life cycle.
  • Strong understanding of computer system architecture, operating systems principles, hardware-software interactions, and performance analysis/optimizations.
  • Excellent C/C++ programming and debugging skills in Linux.
  • Experience balancing multiple projects with competing priorities.
  • Flexibility to work and communicate effectively across different teams and time zones.

Nice-to-haves

  • Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC).
  • Programming experience with CUDA, MPI, OpenMP, OpenACC, and pthreads.
  • Background with RDMA and high-performance networking technologies (InfiniBand, RoCE, Ethernet, EFA), network architecture, and network topologies.
  • Knowledge of HPC and ML/DL fundamentals.
  • Experience with deep learning frameworks such as PyTorch and TensorFlow.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Diversity and inclusion programs
  • Professional development opportunities