NVIDIA - Santa Clara, CA

posted 25 days ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a highly motivated Data Center System Software Architect to join the DGX Cloud Software Team. This role involves leading the architecture, design, and implementation of next-generation DGX Cloud clusters, focusing on hybrid deployments between cloud and on-premises environments. The ideal candidate will have a strong programming background, a deep understanding of distributed systems, and excellent communication skills, contributing to the advancement of NVIDIA's AI infrastructure solutions.

Responsibilities

  • Lead technical activities for data centers with a focus on hybrid deployments between cloud and on-prem.
  • Provide expertise in infrastructure workflows, including hardware, workload orchestration, and application tuning.
  • Provide fast and creative solutions for complex problems and write effective, clear, and reliable architecture specifications.
  • Translate requirements into vision, architecture, and roadmap.
  • Work with engineering teams across NVIDIA to ensure seamless integration of software from hardware to AI training applications.

Requirements

  • Master's or PhD in Computer Science, Computer Engineering, Physics, or equivalent experience.
  • 10+ years of experience in data science, deep learning, or machine learning.
  • Ability to seamlessly shift between Linux system environments and Python programming.
  • Programming skills in one or more high-level languages (C, C++, Go, Rust, etc.).
  • System-level experience with both hardware and software.
  • Strong customer-facing communication skills.
  • Strong design, coding, analytical, debugging, and problem-solving skills.
  • Passion for continuous learning and knowledge transfer.
  • Ability to work concurrently with multiple groups locally and abroad.

Nice-to-haves

  • Experience with GPU-accelerated deep learning and data science.
  • Experience using TensorFlow, PyTorch, or other deep learning frameworks.
  • Experience working with Docker containers, Slurm, Terraform, and Kubernetes.
  • CUDA programming and NCCL experience.
  • HPC programming experience including MPI, OpenACC, or other parallel programming tools.
  • Hands-on experience with DGX Cloud, NVIDIA AI Enterprise software, Base Command Manager, NeMo, and NVIDIA Inference Microservices (NIM).

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Opportunities for professional development
  • Diversity and inclusion programs