Senior Network Software Engineer

$220,000 - $419,750/Yr

NVIDIA - Redmond, WA

posted 12 days ago

Full-time - Senior
Redmond, WA
Computer and Electronic Product Manufacturing

About the position

The Senior Network Software Engineer at NVIDIA will play a crucial role in co-designing and implementing innovative solutions that enhance AI applications at scale. This position is part of the AI Efficiency Team, which focuses on optimizing the efficiency and resiliency of machine learning workloads and developing scalable AI infrastructure tools and services. The engineer will collaborate with multi-functional teams to drive the development of networking software and hardware, ensuring a stable and scalable environment for NVIDIA's AI researchers.

Responsibilities

  • Collaborate with multi-functional teams to analyze, co-design, and develop networking software and hardware for innovative AI platforms.
  • Drive the development of new networking algorithms and protocols for point-to-point and collective operations at scale.
  • Identify bottlenecks and inefficiencies in application code, proposing optimizations to enhance performance and network utilization.
  • Design and implement performance benchmarks and testing methodologies to evaluate performance at scale.
  • Provide guidance and recommendations for optimizing AI applications for speed, scalability, and resource efficiency.
  • Share knowledge with domain expert teams as they develop applications for the next generation of AI platforms.
  • Contribute to the development of tools and frameworks to facilitate network optimization.

Requirements

  • PhD in Computer Science, Computer Engineering, or related field, or equivalent experience.
  • 10+ years of experience with a focus on high-performance networking and AI applications.
  • Expertise in RDMA networking (InfiniBand, RoCE), Ethernet, and PCIe.
  • Experience with at least one high-performance networking library: NCCL, UCX, libfabric, MPI, UCC.
  • Deep understanding of various aspects of high-performance networking, including network technologies, debugging, and performance analysis.
  • Experience in developing and optimizing deep learning frameworks such as PyTorch and TensorFlow.
  • Proficiency in Python and C/C++.
  • Experience in CUDA programming.
  • Track record of delivering performance improvements for software used in large-scale deployments.
  • Knowledge of Kubernetes (k8s) and cloud-native application principles is a plus.
  • Familiarity with continuous integration and delivery practices for performance optimization.

Nice-to-haves

  • Hands-on experience in optimizing networking building blocks for DL frameworks like PyTorch and TensorFlow.
  • Experience in developing communication libraries such as NCCL, UCX, UCC, MPI.
  • In-depth knowledge of RDMA, GPU-Direct, and network technologies.
  • References to your code contributions.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Diversity and inclusion programs
  • Ongoing professional development opportunities