NVIDIA - Santa Clara, CA

posted 2 months ago

Full-time - Mid Level
Santa Clara, CA
5,001-10,000 employees
Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer on the GPU AI/HPC Infrastructure team at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support AI research. This role focuses on enhancing the reliability, efficiency, and performance of these clusters while driving automation to improve researcher productivity. You will tackle strategic challenges related to compute, networking, and storage for large-scale workloads, contributing to the evolution of NVIDIA's private/public cloud strategy.

Responsibilities

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Tackle strategic challenges in large-scale, high-performance computing environments
  • Troubleshoot and diagnose system failures, identify their root causes, and isolate failing components and failure scenarios
  • Build automation for AI/HPC GPU cluster bring-up and scaled-up operation
  • Write and review code, develop documentation and capacity plans, debug complex systems
  • Implement remediations across the software and hardware stack according to plan

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Passion for solving complex technical challenges and optimizing system performance
  • Experience with advanced AI/HPC job schedulers, ideally Slurm
  • Working knowledge of cluster configuration management tools such as BCM or Ansible
  • In-depth understanding of container technologies like Docker and Enroot
  • Experience programming in Python and Bash scripting

Nice-to-haves

  • Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Familiarity with InfiniBand, including IPoIB and RDMA
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity options