Nvidia • posted 2 months ago
$184,000 - $356,500/Yr
Full-time • Senior
Austin, TX
Computer and Electronic Product Manufacturing

About the position

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.

Responsibilities

  • Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.
  • Continuously improve infrastructure provisioning, management, and monitoring through automation.
  • Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
  • Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.
  • Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
  • Participate in the team's on-call rotation to support critical infrastructure.
  • Drive the evaluation and integration of new GPU technologies (such as GB200) and cloud technologies to improve system performance.

Requirements

  • Minimum BS degree in Computer Science (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.
  • Expertise in designing, deploying, and running production-level cloud services.
  • Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
  • Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
  • Strong proficiency with Linux operating systems and TCP/IP fundamentals.
  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
  • Diligent, with strong communication and documentation skills.

Nice-to-haves

  • Experience managing large-scale Slurm and/or BCM deployments in production environments.
  • Expertise in modern container networking and storage architectures.
  • Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

Benefits

  • Equity and benefits.

Job Keywords

Hard Skills
  • Ansible
  • Docker
  • Go
  • Kubernetes
  • Linux