HPC Operations Manager, Hardware Engineering

$272,000 - $419,750/Yr

Nvidia - Santa Clara, CA

posted about 2 months ago

Full-time - Senior

Computer and Electronic Product Manufacturing

About the position

NVIDIA is seeking a highly motivated HPC Operations Manager to join our innovative infrastructure team. This role is pivotal in building the global, dynamic High-Performance Computing (HPC) clusters that our hardware design teams depend on. As an industry leader in High-Performance Computing, Artificial Intelligence, and Visualization, NVIDIA is at the forefront of technology, and this position plays a crucial role in enabling our hardware designers to build the next generation of GPUs and Systems on Chips (SoCs).

The HPC Operations Manager will be responsible for ensuring the highest reliability of HPC clusters, developing critical metrics, and leading a multi-national team of sysadmins and DevOps engineers. This position requires collaboration with various partners to develop programs focused on storage, networking, and computing within our growing fleet of data centers. In this role, you will lead the evaluation of the latest technologies, plan deployments and refreshes of hardware, and work cross-functionally with hardware engineering leaders to support their future chip design needs. You will also manage the HPC scheduler (LSF), track software licensing servers, and communicate program status and key issues to senior management.

The ideal candidate will have extensive experience managing IT infrastructure teams and running Linux servers, knowledge of HPC schedulers, and a strong background in hardware design workflows. This position offers an exciting opportunity to influence continuous improvement across NVIDIA and its partners, ensuring that our computing environment meets the evolving needs of our hardware design teams.

Responsibilities

Collaborate with partners to develop programs focused on storage, networking, and computing in data centers.
Lead, cultivate, and mentor a multi-national team of sysadmins and DevOps engineers in support of chip design teams.
Ensure the highest reliability of HPC clusters and develop critical metrics to measure program health and achievements.
Identify failures, lead retrospective analysis, and develop improvement action plans.
Build standard methodologies that simplify complexity, can be used across NVIDIA, and influence partners toward continuous improvement.
Evaluate the latest technologies and recommend future evolution of the infrastructure.
Plan deployments and refreshes of hardware, storage, and network equipment, along with the associated software stack.
Work with hardware engineering leaders to support future chip design needs and engineer an efficient HPC environment.
Lead all aspects of the HPC scheduler (LSF) and ensure delivery of forecasted compute demand to each hardware division.
Track software licensing servers and drive efficient license utilization.
Develop and manage program schedules, milestones, and deliverables, adjusting as needed based on customer product roadmaps.
Regularly communicate program status and key issues to senior management.

Requirements

B.S. or M.S. in Computer Science, Computer Engineering, Information Science, or equivalent experience.
15+ years of overall experience in IT infrastructure.
5+ years managing IT infrastructure teams of 10+ people.
10+ years of experience running Linux servers, NFS storage, and Ethernet networks.
Knowledge of HPC schedulers, preferably IBM LSF.
Knowledge of hardware design workflows, including EDA tools and methodology.
Experience using project management and capacity planning software.
Experience in data center operations, including rack-and-stack and maintenance.

Nice-to-haves

Experience with HPC storage solutions such as NetApp, Pure Storage, Lustre, ZFS, and Isilon.
Knowledge of InfiniBand operations, debugging, and performance tuning.
Experience in software development, especially in a DevOps context.
Knowledge of relational databases, data lakes, and metrics/visualization/analytics platforms.
Experience deploying and maintaining FlexLM-based software license servers.
Established relationships with enterprise-level equipment suppliers.

Benefits

Equity options
Comprehensive health benefits
Flexible work hours
Paid time off
Retirement savings plan
Employee discounts
Professional development opportunities
