Meta - Menlo Park, CA

posted 5 months ago

Full-time - Manager
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

AI Training and Inference is a core pillar of Meta's success. To achieve Meta's AI goals, the network infrastructure, from the networking software stack through to the network switches, must operate with a high level of reliability. Production Engineers play a key role in driving the reliability of this network by deep diving into production issues through the entire stack and building software systems to ensure that operations can be scaled appropriately. Production Engineering Managers, in turn, play a critical role in supporting and growing the organization to ensure the success of shared goals across the domain.

As a Manager of Production Engineering (Network), you will support and lead engineers who are responsible for reliably scaling Meta's AI/HPC networking operations. You will partner with teams across Meta's AI/HPC environment to ensure alignment on operational priorities and approaches across the domain. Your role will involve understanding and contributing to technical architectures, capacity plans, tooling needs, automation plans, and product launch plans, as well as creating comprehensive plans for prioritizing technical and resourcing challenges. You will drive technical architecture discussions, even on subjects you haven't had direct experience working with, and help define and drive a technical roadmap to meet organizational objectives.

In addition, you will help engineers develop their careers by assigning them to projects tailored to their skill levels, long-term skill development, personalities, and work styles. You will also play a vital role in building and enriching an inclusive work environment made up of people from diverse backgrounds. Regularly assessing employee performance, addressing under-performance, and recognizing and promoting strong performance will be part of your responsibilities. Balancing the need to keep operations running with allocating time to long-term, high-impact projects will be crucial in this role.

Responsibilities

  • Support and lead engineers responsible for reliably scaling Meta's AI/HPC networking operations.
  • Partner with teams across Meta's AI/HPC environment to ensure alignment on operational priorities and approaches across the domain.
  • Understand and contribute to technical architectures, capacity plans, tooling needs, automation plans, and product launch plans.
  • Create comprehensive plans for prioritizing technical and resourcing challenges.
  • Drive technical architecture discussions, even on subjects you haven't had direct experience working with.
  • Help define and drive a technical roadmap to meet organizational objectives.
  • Help engineers develop their careers by assigning them to projects tailored to their skill levels and work styles.
  • Build and enrich an inclusive work environment made up of people from diverse backgrounds.
  • Assess employee performance frequently, address under-performance, and recognize and promote strong performance.
  • Balance the need to keep operations running with allocating time to long-term, high-impact projects.

Requirements

  • 4+ years of direct management experience in a technology role.
  • BS or MS in Computer Science, Engineering, or a related technical discipline, or equivalent experience.
  • Experience with operating, designing, implementing, and troubleshooting servers and networking components.
  • Experience drafting and reviewing code.
  • Experience with building teams and/or organizations, including hiring and managing performance.
  • Experience working in a cross-functional domain with high collaboration demands.

Nice-to-haves

  • Expert knowledge of data center networking concepts (routing, switching, etc.).
  • Experience operating an IB/RDMA/RoCE network in production.
  • Understanding of host-side communication libraries that enable running AI training workloads.
  • Experience building infrastructure automation software.
  • Experience coding efficiently in at least one programming language.