Wayve - Sunnyvale, CA

posted about 1 month ago

Full-time - Mid Level
Remote - Sunnyvale, CA

About the position

Wayve is seeking skilled engineers to join our Machine Learning Platform team, focusing on optimizing large-scale training jobs to enhance the efficiency of our AI models. The role involves working with GPU training infrastructure and collaborating with research teams to improve training throughput and performance.

Responsibilities

  • Maximizing the Model FLOPs Utilization (MFU) of large-scale training jobs.
  • Profiling and identifying bottlenecks in training code.
  • Implementing GPU kernels to improve training throughput.
  • Working closely with research teams to integrate and test training efficiency improvements.
  • Owning and improving GPU training clusters.

Requirements

  • 5+ years of experience in performance optimization or ML engineering.
  • Experience optimizing large-scale training jobs on GPU compute clusters.
  • Experience working on platform teams and collaborating with research teams.
  • Experience in reporting and tracking benchmarked performance in an open and accessible way.
  • Ability to write high-quality, well-structured, and tested Python code.
  • BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience.

Nice-to-haves

  • Solid experience working with concurrent, parallel and distributed computing.
  • Experience using NVIDIA Nsight Systems.
  • Experience implementing GPU kernels.
  • Knowledge of computing fundamentals: what makes code fast, secure, and reliable.

Benefits

  • Hybrid working policy combining office and remote work.
  • Core working hours for flexible scheduling.