NVIDIA - Santa Clara, CA
posted 2 months ago
As a Site Reliability Engineer on the GPU AI/HPC Infrastructure team at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support AI research. This role focuses on enhancing the reliability, efficiency, and performance of these clusters while driving automation to improve researcher productivity. You will tackle strategic challenges related to compute, networking, and storage for large-scale workloads, contributing to the evolution of NVIDIA's private/public cloud strategy.