NVIDIA - Santa Clara, CA

posted 2 months ago

Full-time - Mid Level
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

As a Site Reliability Engineer at NVIDIA, you will lead the design and implementation of cutting-edge GPU compute clusters that support AI research. This role focuses on building and operating these clusters with high reliability, efficiency, and performance, while driving automation and foundational improvements to enhance researcher productivity. You will be part of a diverse team that values intellectual curiosity and problem-solving, working in a collaborative environment that encourages innovation and self-direction.

Responsibilities

  • Design and implement state-of-the-art GPU compute clusters.
  • Optimize cluster operations for maximum reliability, efficiency, and performance.
  • Drive foundational improvements and automation to enhance researcher productivity.
  • Tackle strategic challenges in large-scale, high-performance computing environments.
  • Troubleshoot, diagnose, and root-cause system failures, isolating faulty components and failure scenarios.
  • Scale systems sustainably through automation and push for changes that improve reliability and velocity.
  • Practice sustainable incident response and conduct blameless postmortems.
  • Participate in an on-call rotation to support production systems.
  • Write and review code, develop documentation and capacity plans, and debug complex systems.
  • Implement remediations across software and hardware stacks according to plan, maintaining thorough procedural records and data logs.
  • Manage upgrades and automated rollbacks across all clusters.

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering or related field, or equivalent experience.
  • 6+ years of experience designing and operating large-scale compute infrastructure.
  • Proven experience in site reliability engineering for high-performance computing environments, including operational experience with clusters of at least 5K GPUs.
  • Deep understanding of GPU computing and AI infrastructure.
  • Passion for solving complex technical challenges and optimizing system performance.
  • Experience with advanced AI/HPC job schedulers, ideally Slurm.
  • Working knowledge of cluster configuration management tools such as BCM or Ansible, and of infrastructure-level applications such as Kubernetes, Terraform, and MySQL.
  • In-depth understanding of container technologies like Docker and Enroot.
  • Experience programming in Python and Bash scripting.

Nice-to-haves

  • Interest in crafting, analyzing and fixing large-scale distributed systems.
  • Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Familiarity with InfiniBand, including IPoIB and RDMA.
  • Experience with cloud deployment, BCM, and Terraform.
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow.
  • Multi-cloud experience.

Benefits

  • Equity options
  • Comprehensive health insurance
  • 401(k) retirement plan
  • Paid time off and holidays
  • Flexible work hours
  • Professional development opportunities