Waymo - Mountain View, CA

posted about 1 month ago

Full-time - Mid Level
Mountain View, CA
Administrative and Support Services

About the position

The Machine Learning Engineer, Training at Waymo is responsible for developing infrastructure components necessary for distributed training of machine learning models, particularly in the context of autonomous driving technology. This role involves implementing automation solutions, monitoring system health, diagnosing issues, and optimizing performance to enhance the developer experience and the efficiency of the ML framework.

Responsibilities

  • Develop the infrastructure components necessary for distributed training, including job scheduling, resource management, data distribution, and model synchronization.
  • Implement automation solutions for provisioning, deployment, monitoring, and scaling of distributed training infrastructure to improve operations and reliability.
  • Monitor system health, diagnose and troubleshoot issues, and perform routine maintenance tasks to ensure the reliability of the distributed training infrastructure.
  • Identify performance bottlenecks and optimization opportunities.
  • Improve the developer experience and performance of our scalable ML framework.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field, or 2+ years of equivalent experience.
  • Understanding of distributed systems principles and experience building distributed systems for production environments.
  • Solid Python or C++ skills.
  • Prior experience with Machine Learning frameworks (e.g., TensorFlow, PyTorch) and distributed training algorithms.
  • Ability to debug complex distributed systems issues.
  • Experience communicating updates and resolutions to customers and other partners.

Nice-to-haves

  • Hands-on experience using ML accelerator profiling tools to uncover performance bottlenecks.
  • Familiarity with cloud computing platforms (e.g., AWS, Azure, GCP) and experience deploying and managing distributed systems in cloud environments.
  • Knowledge of optimization and deep learning algorithms.

Benefits

  • Discretionary annual bonus program
  • Equity incentive plan
  • Generous company benefits program