
Meta - Bellevue, WA

posted about 2 months ago

Full-time - Mid Level
Bellevue, WA
Web Search Portals, Libraries, Archives, and Other Information Services

About the position

The Software Engineer, SystemML - Scaling / Performance role at Meta sits within the Network.AI Software team, which builds and enhances the software stack around the NVIDIA Collective Communications Library (NCCL). The position focuses on enabling reliable and scalable distributed machine learning (ML) training, particularly for Generative AI (GenAI) and Large Language Models (LLMs). The team is responsible for improving the performance and reliability of distributed ML workloads across Meta's extensive GPU infrastructure, so that ML innovations can make effective use of it.
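
For context, the workloads described here revolve around collective communication primitives such as all-reduce. Below is a minimal sketch of a single NCCL all-reduce through PyTorch's torch.distributed API; it is illustrative only and says nothing about Meta's internal stack (the script name and tensor shape are assumptions):

```python
# Minimal sketch: one NCCL all-reduce via torch.distributed.
# Launch with `torchrun --nproc_per_node=<num_gpus> allreduce_demo.py`
# (script name assumed). torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, etc.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL sums them across all GPUs.
    t = torch.full((4,), float(dist.get_rank() + 1), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: {t.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```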

Responsibilities

  • Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure with a focus on GenAI/LLM scaling.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience.
  • Specialized experience in one or more of the following domains: distributed ML training, GPU architecture, ML systems, AI infrastructure, high-performance computing, performance optimization, or machine learning frameworks (e.g., PyTorch).

Nice-to-haves

  • PhD in Computer Science, Computer Engineering, or a relevant technical field.
  • Experience with NCCL and with improving distributed GPU reliability/performance on RoCE/InfiniBand networks.
  • Experience working with deep learning frameworks like PyTorch, Caffe2, or TensorFlow.
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel (a minimal DDP sketch follows this list).
  • Experience developing AI frameworks and trainers that accelerate large-scale distributed deep learning models.
  • Experience in HPC and parallel computing.
  • Knowledge of GPU architectures and CUDA programming.
  • Knowledge of ML, deep learning, and LLMs.
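
As a hedged illustration of the data-parallel item above, here is a minimal PyTorch DistributedDataParallel training step over the NCCL backend. The toy model, batch size, and learning rate are assumptions made for the sketch, not anything specified in the posting:

```python
# Hedged sketch: one DDP training loop over NCCL.
# The linear layer, dimensions, and hyperparameters are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # NCCL carries the gradient all-reduce
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks via NCCL here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=8 train.py` (script name assumed). Swapping `DDP` for `torch.distributed.fsdp.FullyShardedDataParallel` moves from replicating parameters on every rank to sharding them, which is the FSDP variant named in the list above.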

Benefits

  • Compensation ranging from $70.67/hour to $208,000/year, plus bonus and equity.
  • Comprehensive benefits package including health insurance, retirement plans, and paid time off.