OpenAI - San Francisco, CA

posted about 1 month ago

Full-time - Mid Level
San Francisco, CA
Professional, Scientific, and Technical Services

About the position

As an Engineering Manager on the Hardware Health team at OpenAI, you will lead a group of engineers focused on ensuring the reliability, performance, and scalability of our custom-built hyperscale supercomputers. This role involves building infrastructure and automation to manage the lifecycle and operations of these supercomputers, which are critical for supporting cutting-edge AI research. You will collaborate with various engineering teams to optimize system health and minimize downtime, while fostering a culture of innovation and inclusivity within your team.

Responsibilities

  • Lead a team of software engineers to manage and optimize the critical infrastructure of our hyperscale supercomputers.
  • Develop and implement strategies to monitor and maintain system health, ensuring minimal downtime and optimal performance.
  • Collaborate closely with hardware engineers, systems engineers, and researchers to understand the infrastructure requirements and challenges of our AI training workloads.
  • Drive the development of tools and automation to detect, diagnose, and mitigate hardware health-related issues.
  • Build a team culture that prioritizes reliability, scalability, and performance, while fostering innovation and continuous learning.
  • Create a diverse, equitable, and inclusive environment that encourages team members to contribute their best ideas and challenge assumptions.

Requirements

  • 5+ years of experience in engineering management, focused on large-scale infrastructure.
  • Deep expertise in managing and optimizing hardware and software in large-scale, high-performance computing environments.
  • Strong background in troubleshooting complex system issues and developing solutions to prevent future occurrences.
  • Excellent communication skills, with the ability to convey complex technical concepts to both technical and non-technical audiences.

Nice-to-haves

  • Familiarity with modern AI infrastructure hardware (e.g., A100 and H100 GPUs).
  • Experience working in large-scale HPC environments.
  • Skill in collaborating across teams to solve complex problems and deliver high-impact results.
  • A humble attitude, an eagerness to help colleagues, and a desire to do whatever it takes to make the team succeed.

Benefits

  • Relocation assistance
  • Hybrid work model (3 days in the office per week)