Meta - Seattle, WA

posted 4 days ago

Full-time - Entry Level
Seattle, WA
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Software Engineer, SystemML - AI Networking role sits within the AI Networking Software team at Meta, which develops and enhances the software stack around the NVIDIA Collective Communications Library (NCCL). The position is central to enabling reliable and scalable distributed machine learning (ML) training on Meta's large-scale GPU infrastructure, particularly for generative AI (GenAI) and large language models (LLMs). The team works to improve the performance and reliability of distributed ML workloads so that Meta's ML products can make effective use of its extensive GPU resources.

Responsibilities

  • Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience. The degree must be completed prior to joining Meta.
  • Specialized experience in one or more of the following domains: distributed ML training, GPU architecture, ML systems, AI infrastructure, high-performance computing, performance optimization, or machine learning frameworks (e.g., PyTorch).

Nice-to-haves

  • PhD in Computer Science, Computer Engineering, or a relevant technical field.
  • Experience with NCCL and with improving distributed GPU reliability/performance over RoCE/InfiniBand.
  • Experience working with deep learning frameworks like PyTorch, Caffe2, or TensorFlow.
  • Experience with both data-parallel and model-parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel.
  • Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models.
  • Experience in HPC and parallel computing.
  • Knowledge of GPU architectures and CUDA programming.
  • Knowledge of ML, deep learning, and LLMs.

Benefits

  • $56.25/hour to $173,000/year + bonus + equity + benefits
  • Individual compensation is determined by skills, qualifications, experience, and location.