NVIDIA - Santa Clara, CA

posted about 1 month ago

Full-time - Senior
Computer and Electronic Product Manufacturing

About the position

The HPC Operations Manager at NVIDIA is responsible for leading a dynamic infrastructure team to develop and maintain global HPC clusters that support hardware design teams. This role involves collaboration with various partners to enhance storage, networking, and compute capabilities, ensuring high reliability and efficiency in the computing environment. The manager will mentor a multi-national team, oversee program schedules, and drive continuous improvement initiatives while evaluating new technologies to evolve the infrastructure.

Responsibilities

  • Collaborate with partners to develop programs for storage, networking, and compute in data centers.
  • Lead, cultivate, and mentor a multi-national team of sysadmins and DevOps engineers.
  • Ensure the highest reliability of HPC clusters.
  • Develop critical metrics and program schedules to measure program health and achievements.
  • Identify failures and lead retrospective analysis to develop improvement action plans.
  • Build standard methodologies for continuous improvement across NVIDIA and its partners.
  • Evaluate the latest technologies and recommend how the infrastructure should evolve.
  • Plan deployments and refreshes of hardware and the associated software stack.
  • Work with hardware engineering leaders to support chip design needs and engineer an efficient HPC environment.
  • Lead all aspects of the HPC scheduler (LSF) and ensure delivery of forecasted compute demand.
  • Track software licensing servers and drive efficient license utilization.
  • Develop and manage program schedules, milestones, and deliverables.
  • Regularly communicate program status and key issues to senior management.

Requirements

  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science, or equivalent experience.
  • 15+ years of overall experience in IT infrastructure.
  • 5+ years managing IT infrastructure teams of 10+ people.
  • 10+ years of experience running Linux servers, NFS storage, and Ethernet networks.
  • Knowledge of HPC schedulers (IBM LSF preferred).
  • Knowledge of hardware design workflows (EDA tools and methodology).
  • Experience using project management and capacity planning software.
  • Datacenter operations experience (rack and stack, maintenance).

Nice-to-haves

  • Experience with HPC storage solutions (e.g., NetApp, Pure Storage, Lustre, ZFS, Isilon).
  • InfiniBand operations, debugging, and performance tuning experience.
  • Software development experience, especially in a DevOps context.
  • Knowledge of relational databases and data lakes.
  • Experience with metrics/visualization/analytics platforms.
  • Experience deploying and maintaining FlexLM-based software license servers.
  • Established relationships with enterprise-level equipment suppliers.

Benefits

  • Equity options
  • Comprehensive health benefits
  • Flexible work hours
  • Diversity and inclusion programs