Nvidia - Santa Clara, CA

posted about 1 month ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The Senior Software Architect for Deep Learning and HPC Communications at NVIDIA will play a pivotal role in co-designing next-generation data center platforms and scalable communications software. The position focuses on improving communication performance for AI and HPC workloads, which demand ever more compute resources. The architect will investigate bottlenecks, design new communication technologies, and explore hardware and software solutions that improve performance across large GPU clusters.

Responsibilities

  • Investigate opportunities to improve communication performance by identifying bottlenecks in today's systems.
  • Design and implement new communication technologies to accelerate AI and HPC workloads.
  • Explore innovative solutions in hardware and software for next-generation platforms as part of co-design efforts involving GPU, Networking, and Software architects.
  • Build proofs of concept, conduct experiments, and perform quantitative modeling to evaluate and drive innovations.
  • Use simulation to explore performance of large GPU clusters, scaling to hundreds of thousands of GPUs.

Requirements

  • M.S./Ph.D. degree in Computer Science, Computer Engineering, or equivalent experience.
  • 5+ years of relevant experience in software architecture and development.
  • Excellent C/C++ programming and debugging skills.
  • Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC).
  • Deep understanding of operating systems and of computer and system architecture.
  • Solid fundamentals of network architecture, topology, algorithms, and communication scaling relevant to AI and HPC workloads.
  • Strong experience with Linux.
  • Ability to work and communicate effectively in a multi-national, multi-time-zone corporate environment.

Nice-to-haves

  • Expertise in related technology and a passion for the field.
  • Experience with CUDA programming and NVIDIA GPUs.
  • Knowledge of high-performance networks and interconnects such as InfiniBand, RoCE, and NVLink.
  • Experience with Deep Learning Frameworks such as PyTorch and TensorFlow.
  • Knowledge of deep learning parallelism strategies and how they map to the communication subsystem.
  • Experience with HPC applications.
  • Strong collaborative and interpersonal skills with a proven track record of guiding and influencing in a dynamic environment.

Benefits

  • Equity options as part of compensation package.
  • Comprehensive health benefits including medical, dental, and vision insurance.
  • Flexible work hours and remote work options.
  • Paid time off and holidays.
  • Professional development opportunities.