NVIDIA - Santa Clara, CA
posted about 2 months ago
NVIDIA is seeking a highly motivated HPC Operations Manager to join our infrastructure team. This role is pivotal in building the global High-Performance Computing (HPC) clusters that our hardware design teams depend on. NVIDIA is an industry leader in High-Performance Computing, Artificial Intelligence, and Visualization, and this position plays a crucial role in enabling our hardware designers to build the next generation of GPUs and Systems on Chips (SoCs).

The HPC Operations Manager will be responsible for ensuring the highest reliability of HPC clusters, developing critical metrics, and leading a multi-national team of sysadmins and DevOps engineers. The position requires collaboration with various partners to develop programs focused on storage, networking, and computing within our growing fleet of data centers. In this role, you will lead the evaluation of the latest technologies, plan hardware deployments and refreshes, and work cross-functionally with hardware engineering leaders to support their future chip design needs. You will also manage the HPC scheduler (LSF), track software licensing servers, and communicate program status and key issues to senior management.

The ideal candidate will have extensive experience managing IT infrastructure teams and running Linux servers, knowledge of HPC schedulers, and a strong background in hardware design workflows. This position offers an opportunity to influence continuous improvement across NVIDIA and its partners, ensuring that our computing environment meets the evolving needs of our hardware design teams.