NVIDIA - Seattle, WA

posted about 1 month ago

Full-time - Senior
Seattle, WA
Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a Cloud Platform Engineer to design and build foundational elements of high-performing cloud services for Artificial Intelligence and high-performance computing under the DGX Cloud umbrella. This role involves innovating on behalf of clients by providing scalable managed self-service APIs and integrating the latest GPU technology into cloud offerings. The engineer will work collaboratively with software engineers and product teams to create a unified platform that leverages both HPC and Kubernetes technologies.

Responsibilities

  • Design and build platforms for DGX Cloud services.
  • Integrate best practices from HPC and Kubernetes to create a unified platform.
  • Collaborate with software engineers, product teams, and engineering teams across NVIDIA on DGX Cloud AI Compute services.
  • Write Infrastructure as Code (IaC) and work on Kubernetes.
  • Design and implement release pipelines.
  • Use GitOps and pipelines effectively.

Requirements

  • BS in Computer Science, Information Systems, Computer Engineering, or equivalent experience.
  • 12+ years of platform engineering experience on large-scale production systems.
  • Solid technical foundation in distributed computing and storage, including experience with server systems, storage, I/O, networking, and system software.
  • Hands-on engineering expertise in Kubernetes and IaC.
  • Ability to communicate complex designs and infrastructure requirements to peers, customers, and vendors.
  • General knowledge of shared storage systems such as NFS, Lustre, and GlusterFS.
  • Familiarity with system-level architecture, including interconnects, memory hierarchy, interrupts, and memory-mapped I/O.

Nice-to-haves

  • Proven experience in high-performance computing, deep learning, and/or GPU-accelerated computing domains.
  • Experience with large-scale distributed systems, HPC, ML, and training using Slurm and Kubernetes.
  • Deep knowledge of both software and hardware in HPC and ML infrastructure.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Ongoing professional development opportunities