Senior Site Reliability Engineer - AI Research Clusters

$148,000 - $276,000/Yr

Nvidia - Santa Clara, CA

posted 2 months ago

Full-time - Mid Level

Santa Clara, CA

5,001-10,000 employees

Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer on the GPU AI/HPC Infrastructure team at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support AI research. This role focuses on enhancing the reliability, efficiency, and performance of these clusters while driving automation to improve researcher productivity. You will tackle strategic challenges related to compute, networking, and storage for large-scale workloads, contributing to the evolution of NVIDIA's private/public cloud strategy.

Responsibilities

Design and implement state-of-the-art GPU compute clusters
Optimize cluster operations for maximum reliability, efficiency, and performance
Drive foundational improvements and automation to enhance researcher productivity
Tackle strategic challenges in large-scale, high-performance computing environments
Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios
Build automation for AI/HPC GPU cluster bring-up and scaled-up operation
Write and review code, develop documentation and capacity plans, debug complex systems
Implement planned remediations across the software and hardware stack

Requirements

Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
Minimum 5 years of experience designing and operating large scale compute infrastructure
Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5,000 GPUs
Deep understanding of GPU computing and AI infrastructure
Passion for solving complex technical challenges and optimizing system performance
Experience with advanced AI/HPC job schedulers, ideally Slurm
Working knowledge of cluster configuration management tools such as BCM or Ansible
In-depth understanding of container technologies like Docker and Enroot
Experience programming in Python and scripting in Bash

Nice-to-haves

Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
Familiarity with InfiniBand, including IPoIB and RDMA
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
Familiarity with deep learning frameworks like PyTorch and TensorFlow

Benefits

Highly competitive salaries
Comprehensive benefits package
Equity options
