This job is closed

Senior AI-HPC Cluster Engineer

$148,000 - $339,250/Yr

NVIDIA - Santa Clara, CA

posted 2 months ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

As a Senior AI-HPC Cluster Engineer at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters that support demanding deep learning and high-performance computing workloads. This role involves addressing strategic challenges related to compute, networking, and storage design, while also focusing on effective resource utilization and evolving cloud strategies within a global computing environment.

Responsibilities

  • Building and improving the ecosystem around GPU-accelerated computing, including developing large-scale automation solutions.
  • Maintaining and building deep learning clusters at scale.
  • Supporting researchers in running their workflows on clusters, including performance analysis and optimizations of deep learning workflows.
  • Conducting root cause analysis and suggesting corrective actions for problems of various scales.
  • Proactively identifying and addressing potential problems before they occur.

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience.
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure.
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.
  • Working knowledge of cluster configuration management tools such as Ansible, Puppet, or Salt.
  • Experience with advanced AI/HPC job schedulers, ideally Slurm, K8s, RTDA, or LSF.
  • In-depth understanding of container technologies like Docker, Singularity, Shifter, or Charliecloud.
  • Proficiency in CentOS/RHEL and/or Ubuntu Linux distributions, along with Python programming and Bash scripting.
  • Experience with AI/HPC workflows that use MPI.

Nice-to-haves

  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Experience with Machine Learning and Deep Learning concepts, algorithms, and models.
  • Familiarity with InfiniBand, including IPoIB and RDMA.
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow.

Benefits

  • Highly competitive salaries
  • Comprehensive benefits package
  • Equity options
  • Ongoing application acceptance
© 2024 Teal Labs, Inc