This job is closed

We regret to inform you that the job you were interested in has been closed. Although this specific position is no longer available, we encourage you to continue exploring other opportunities on our job board.

Senior AI-HPC Cluster Engineer

NVIDIA, posted 5 months ago

$148,000 - $339,250/Yr

Full-time • Senior

Santa Clara, CA

Computer and Electronic Product Manufacturing

About the position

As a Senior AI-HPC Cluster Engineer at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support demanding deep learning and high-performance computing workloads. This role involves addressing strategic challenges related to compute, networking, and storage design, while also focusing on effective resource utilization and evolving cloud strategies within a global computing environment.

Responsibilities

Building and improving the ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
Maintaining and building deep learning clusters at scale.
Supporting researchers in running their workflows on clusters, including performance analysis and optimizations of deep learning workflows.
Conducting root cause analysis and suggesting corrective actions for problems of various scales.
Proactively identifying and resolving potential problems before they affect users.

Requirements

Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
Minimum 5 years of experience designing and operating large-scale compute infrastructure.
Experience analyzing and tuning performance for a variety of AI/HPC workloads.
Working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt.
Experience with advanced AI/HPC job schedulers, ideally Slurm, K8s, RTDA, or LSF.
In-depth understanding of container technologies like Docker, Singularity, Shifter, or Charliecloud.
Proficiency in CentOS/RHEL and/or Ubuntu Linux distributions, as well as Python programming and Bash scripting.
Experience with AI/HPC workflows that use MPI.

Nice-to-haves

Experience with NVIDIA GPUs, CUDA Programming, NCCL, and MLPerf benchmarking.
Experience with Machine Learning and Deep Learning concepts, algorithms, and models.
Familiarity with InfiniBand with IBOP and RDMA.
Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
Familiarity with deep learning frameworks like PyTorch and TensorFlow.

Benefits

Highly competitive salaries
Comprehensive benefits package
Equity options
Ongoing application acceptance
