Senior Site Reliability Engineer - AI Research Clusters

$180,000 - $339,250/Yr

NVIDIA - Santa Clara, CA

posted 2 months ago

Full-time - Mid Level

Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer at NVIDIA, you will lead the design and implementation of cutting-edge GPU compute clusters that support AI research. This role focuses on building and operating these clusters with high reliability, efficiency, and performance, while driving automation and foundational improvements to enhance researcher productivity. You will be part of a diverse team that values intellectual curiosity and problem-solving, working in a collaborative environment that encourages innovation and self-direction.

Responsibilities

Design and implement state-of-the-art GPU compute clusters.
Optimize cluster operations for maximum reliability, efficiency, and performance.
Drive foundational improvements and automation to enhance researcher productivity.
Tackle strategic challenges in large-scale, high-performance computing environments.
Troubleshoot, diagnose, and root-cause system failures, isolating the affected components and failure scenarios.
Scale systems sustainably through automation and push for changes that improve reliability and velocity.
Practice sustainable incident response and conduct blameless postmortems.
Participate in an on-call rotation to support production systems.
Write and review code, develop documentation and capacity plans, and debug complex systems.
Implement remediations across software and hardware stacks according to plan, maintaining thorough procedural records and data logs.
Manage upgrades and automated rollbacks across all clusters.

Requirements

Bachelor's degree in Computer Science, Electrical Engineering or related field, or equivalent experience.
6+ years of experience designing and operating large-scale compute infrastructure.
Proven experience in site reliability engineering for high-performance computing environments, including operating a cluster of at least 5,000 GPUs.
Deep understanding of GPU computing and AI infrastructure.
Passion for solving complex technical challenges and optimizing system performance.
Experience with advanced AI/HPC job schedulers, ideally Slurm.
Working knowledge of cluster configuration management tools such as BCM or Ansible, and of infrastructure-level applications such as Kubernetes, Terraform, and MySQL.
In-depth understanding of container technologies like Docker and Enroot.
Experience programming in Python and Bash scripting.

Nice-to-haves

Interest in crafting, analyzing and fixing large-scale distributed systems.
Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
Familiarity with InfiniBand, including IPoIB and RDMA.
Experience with cloud deployment, BCM, and Terraform.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Familiarity with deep learning frameworks like PyTorch and TensorFlow.
Multi-cloud experience.

Benefits

Equity options
Comprehensive health insurance
401k retirement plan
Paid time off and holidays
Flexible work hours
Professional development opportunities
