Meta - Menlo Park, CA

posted 4 months ago

Full-time - Manager
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

In this role, you will be a member of the Network AI Software team, part of the larger DC networking organization. The team develops and owns the software stack around collective communication libraries at Meta. Its overarching goal is to enable Meta-wide machine learning (ML) products and innovations to leverage our large-scale training and inference fleet through an observable, reliable, and high-performance distributed AI communication stack.

Currently, one of the team's primary focuses is building customized features, software benchmarks, performance tuners, and software stacks around PyTorch to improve full-stack distributed ML reliability and performance, particularly for large-scale Generative AI (GenAI) and Large Language Model (LLM) training. We are seeking leaders who can contribute to the scaling, reliability, and performance of GenAI and LLM initiatives.

As a Software Engineering Manager in AI Networking, you will help define the technical roadmap for the team, drive the execution of associated tasks, and support the team in resolving dependencies. You will collaborate with other groups such as Hardware, Infrastructure, and Operations, and interact with external partners as needed to resolve dependencies associated with objectives. You will also guide team members in developing the skill sets they need to grow in their careers, addressing underperformance where necessary. Effective cross-functional communication and driving engineering efforts will be key components of your role.

Responsibilities

  • Help define the technical roadmap for the team and drive execution of associated tasks.
  • Support the team in resolving dependencies and collaborating effectively with other groups such as Hardware, Infrastructure, and Operations.
  • Interact with external partners as needed to resolve dependencies associated with objectives.
  • Guide and help team members develop appropriate skill sets to grow in their careers.
  • Address underperformance where necessary and communicate cross-functionally to drive engineering efforts.

Requirements

  • BS or MS in Computer Science or a related technical discipline, or equivalent experience.
  • 2+ years of experience managing a networking-related software engineering team.
  • Working knowledge of a network transport stack such as RoCE (RDMA over Converged Ethernet).
  • Experience with software development for distributed and embedded systems.

Nice-to-haves

  • Experience with NCCL and with improving distributed GPU reliability and performance on RoCE/InfiniBand.
  • Experience working with DL frameworks like PyTorch, Caffe2, or TensorFlow.
  • Knowledge of ML, deep learning, and LLMs.

Benefits

  • Competitive salary ranging from $177,000/year to $251,000/year + bonus + equity + benefits.
  • Comprehensive health insurance coverage.
  • 401(k) retirement savings plan with company matching contributions.
  • Paid time off and holidays.
  • Opportunities for professional development and career growth.