Engineering Manager, Hardware Health

$360,000 - $530,000/Yr

OpenAI - San Francisco, CA

posted about 1 month ago

Full-time - Mid Level


Professional, Scientific, and Technical Services

About the position

As an Engineering Manager on the Hardware Health team at OpenAI, you will lead a group of engineers focused on ensuring the reliability, performance, and scalability of our custom-built hyperscale supercomputers. This role involves building infrastructure and automation to manage the lifecycle and operations of these supercomputers, which are critical for supporting cutting-edge AI research. You will collaborate with various engineering teams to optimize system health and minimize downtime, while fostering a culture of innovation and inclusivity within your team.

Responsibilities

Lead a team of software engineers to manage and optimize the critical infrastructure of our hyperscale supercomputers.
Develop and implement strategies to monitor and maintain system health, ensuring minimal downtime and optimal performance.
Collaborate closely with hardware engineers, systems engineers, and researchers to understand the infrastructure requirements and challenges of our AI training workloads.
Drive the development of tools and automation to detect, diagnose, and mitigate hardware health-related issues.
Build a team culture that prioritizes reliability, scalability, and performance, while fostering innovation and continuous learning.
Create a diverse, equitable, and inclusive environment that encourages team members to contribute their best ideas and challenge assumptions.

Requirements

5+ years of experience in engineering management, focused on large-scale infrastructure.
Deep expertise in managing and optimizing hardware and software in large-scale, high-performance computing environments.
Strong background in troubleshooting complex system issues and developing solutions to prevent future occurrences.
Excellent communication skills, with the ability to convey complex technical concepts to both technical and non-technical audiences.

Nice-to-haves

Familiarity with modern AI infrastructure technologies (e.g., A100s, H100s).
Experience working in large-scale HPC environments.
Skilled in collaborating across teams to solve complex problems and deliver high-impact results.
A humble attitude, an eagerness to help colleagues, and a desire to do whatever it takes to make the team succeed.

Benefits

Relocation assistance
Hybrid work model (3 days in the office per week)
