Nvidia - Santa Clara, CA

posted 16 days ago

Full-time - Senior
Santa Clara, CA
Computer and Electronic Product Manufacturing

About the position

The Senior Software Developer for HPC Cluster Management at NVIDIA is responsible for developing and enhancing the software that manages hardware integration and bare-metal provisioning in Linux-based cluster environments. This role involves working on NVIDIA's Bright Cluster Manager, which supports a wide range of Linux clusters, from small setups to large-scale deployments. The position requires a strong background in software development, particularly in Linux systems, and offers the opportunity to work with cutting-edge technologies in high-performance computing.

Responsibilities

  • Develop the head node and compute node installation and provisioning processes.
  • Build functionality for edge site deployment.
  • Integrate our product with the latest hardware (e.g., GPUs, DPUs, accelerators, and high-speed interconnects such as InfiniBand).
  • Work on features related to deployable infrastructure management.
  • Develop new features for BIOS and firmware upgrade management.
  • Enhance functionality to make Bright clusters usable for a wider range of workloads and increase scalability.
  • Add support for new Linux distributions.
  • Improve support for alternative CPU architectures such as ARM.
  • Add features to our Ansible collections for cluster installation and management.
  • Assist the support team with customer support requests related to the mentioned features.

Requirements

  • Degree in Computer Science or related field (or equivalent experience).
  • 7+ years of experience in software development and/or related roles.
  • Familiarity with the Linux operating system and networking concepts in Linux.
  • Proficient in Python and familiar with object-oriented software design, design patterns, and concurrent programming techniques.
  • Emphasis on high quality of work and producing clean code.
  • Eager to learn and use new technologies.

Nice-to-haves

  • Experience with Ansible.
  • Experience with high-performance computing and system administration.
  • Knowledge of Kubernetes, AWS, Azure, GCE, OpenStack, Jenkins, and distributed programming.
  • Proficiency in C++.

Benefits

  • Equity and benefits package.