Meta - Seattle, WA

posted 4 days ago

Full-time - Entry Level
Seattle, WA
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Software Engineer, SystemML - AI Networking role sits within the AI Networking Software team at Meta, which develops and enhances the software stack around the NVIDIA Collective Communications Library (NCCL). The position is central to enabling reliable and scalable distributed machine learning (ML) training on Meta's large-scale GPU infrastructure, particularly for generative AI (GenAI) and large language models (LLMs). The team works to improve the performance and reliability of distributed ML workloads so that Meta's ML products can make effective use of its extensive GPU resources.

Responsibilities

  • Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience. The degree must be completed prior to joining Meta.
  • Specialized experience in one or more of the following domains: distributed ML training, GPU architecture, ML systems, AI infrastructure, high-performance computing, performance optimization, or machine learning frameworks (e.g., PyTorch).

Nice-to-haves

  • PhD in Computer Science, Computer Engineering, or a relevant technical field.
  • Experience with NCCL and with improving distributed GPU reliability/performance over RoCE/InfiniBand.
  • Experience working with deep learning frameworks like PyTorch, Caffe2, or TensorFlow.
  • Experience with both data-parallel and model-parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel.
  • Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models.
  • Experience in HPC and parallel computing.
  • Knowledge of GPU architectures and CUDA programming.
  • Knowledge of ML, deep learning, and LLMs.

Benefits

  • $56.25/hour to $173,000/year + bonus + equity + benefits
  • Individual compensation is determined by skills, qualifications, experience, and location.