NVIDIA - Austin, TX

posted 7 days ago

Full-time - Senior
Austin, TX
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of large-scale GPU clusters that power AI workloads across multiple teams and projects. This role is crucial for ensuring the efficiency, scalability, and reliability of GPU clusters, which significantly impact the future of machine learning and artificial intelligence at NVIDIA. The ideal candidate will have a passion for operational excellence and automation, working in a multi-cloud environment, and collaborating with a diverse team to improve infrastructure provisioning and resiliency.

Responsibilities

  • Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.
  • Continuously improve infrastructure provisioning, management, and monitoring through automation.
  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
  • Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.
  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
  • Participate in the team's on-call rotation to support critical infrastructure.
  • Drive the evaluation and integration of new GPU technologies and cloud technologies to improve system performance.

Requirements

  • Minimum BS degree in Computer Science (or equivalent experience).
  • 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.
  • Expertise in designing, deploying, and running production-level cloud services.
  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
  • Strong proficiency with Linux operating systems and TCP/IP fundamentals.
  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
  • Diligent, with strong communication and documentation skills.

Nice-to-haves

  • Experience managing large-scale Slurm and/or BCM deployments in production environments.
  • Expertise in modern container networking and storage architectures.
  • Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

Benefits

  • Equity and benefits eligibility.