This job is closed

Meta - Menlo Park, CA

posted about 2 months ago

Full-time
Menlo Park, CA
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Software Engineer, SystemML - Scaling / Performance role at Meta sits within the Network.AI Software team and works on the software stack around the NVIDIA Collective Communications Library (NCCL). The position focuses on enabling reliable and scalable distributed machine learning (ML) training, particularly for Generative AI (GenAI) and Large Language Models (LLMs). The team is responsible for improving the performance and reliability of distributed ML workloads across Meta's GPU infrastructure, which supports innovation in Meta's ML products and applications.

Responsibilities

  • Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
  • Specialized experience in one or more of the following domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch).

Nice-to-haves

  • PhD in Computer Science, Computer Engineering, or relevant technical field.
  • Experience with NCCL and with improving distributed GPU reliability and performance on RoCE/InfiniBand.
  • Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow.
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel.
  • Experience developing AI frameworks and trainers that accelerate large-scale distributed deep learning models.
  • Experience in HPC and parallel computing.
  • Knowledge of GPU architectures and CUDA programming.
  • Knowledge of ML, deep learning, and LLMs.

Benefits

  • $70.67/hour to $208,000/year + bonus + equity + benefits
  • Individual compensation determined by skills, qualifications, experience, and location.