Meta - Menlo Park, CA
posted 4 months ago
In this role, you will be a member of the Network AI Software team, part of the larger DC networking organization. The team develops and owns the software stack around collective communication libraries at Meta. Its overarching goal is to enable Meta-wide machine learning (ML) products and innovations to leverage our large-scale training and inference fleet through an observable, reliable, and high-performance distributed AI communication stack. Currently, one of the team's primary focuses is building customized features, software benchmarks, performance tuners, and software stacks around PyTorch to enhance full-stack distributed ML reliability and performance, particularly in areas such as large-scale Generative AI (GenAI) and Large Language Model (LLM) training.

We are seeking leaders who can contribute to scaling the reliability and performance of GenAI and LLM initiatives. As a Software Engineering Manager in AI Networking, you will help define the technical roadmap for the team, drive the execution of associated tasks, and support the team in resolving dependencies. You will collaborate with other groups such as Hardware, Infrastructure, and Operations, and interact with external partners as needed to resolve dependencies tied to the team's objectives.

Additionally, you will guide team members and help them develop the skill sets they need to grow in their careers, addressing underperformance where necessary. Effective cross-functional communication and driving engineering efforts will be key components of your role.