Zoom - Seattle, WA

posted 3 months ago

Full-time - Mid Level
Remote - Seattle, WA
Administrative and Support Services

About the position

As a Machine Learning Platform Engineer at Zoom, you will play a crucial role in developing and managing the AI infrastructure and framework that supports our machine learning initiatives. Your primary focus will be on enhancing the training, deployment, and operational capabilities of our AI systems, ensuring they are functional, scalable, and reliable. This position is pivotal in shaping and optimizing Zoom's AI capabilities, which are integral to our product offerings and overall mission to improve communication and collaboration.

You will be part of a dedicated AI infrastructure team that is responsible for managing the entire Machine Learning Platform, including model training processes and the underlying infrastructure that supports them. Your work will directly contribute to improving the efficiency of GPU training and enhancing the throughput and latency of language model inference. The team is committed to pushing the boundaries of what is possible with AI, and your contributions will be essential in achieving these goals.

In this role, you will develop the Machine Learning Platform management system, which involves building the necessary toolchains, services, and pipelines for model development workflows and the model serving architecture. You will prioritize various metrics for monitoring model training and inference, ensuring that our systems operate at peak performance. Additionally, you will develop and maintain high-performance GPU infrastructure for large language model (LLM) training, manage autoscaling for inference services, and handle dynamic loading of multiple models. Your expertise will also be critical in supporting, troubleshooting, and resolving any issues that arise during training and inference.

Responsibilities

  • Developing the Machine Learning Platform management system.
  • Building the toolchains, services, and pipelines for model development workflows and the model serving architecture.
  • Prioritizing metrics for monitoring model training and inference.
  • Developing and maintaining high-performance GPU infrastructure and clusters for LLM training.
  • Managing autoscaling for inference services and dynamic loading of multiple models.
  • Supporting, troubleshooting, and resolving issues that arise during training and inference.

Requirements

  • Completed an undergraduate program in Computer Science or a comparable program in a related field.
  • Deep understanding of AI, software engineering, or machine learning concepts.
  • Skilled in Python and PyTorch, with familiarity with Git and software development practices.
  • Experience with cloud computing platforms (e.g., AWS, Azure, Google Cloud).
  • Experience with frameworks such as TensorFlow and PyTorch, and with Nvidia/CUDA.
  • Expertise in Docker and hands-on experience with Kubernetes, including YAML manifests, Deployments, ConfigMaps, and PV/PVCs.
  • Experience with Linux operating systems such as Ubuntu, and proficiency in shell scripting.

Benefits

  • Comprehensive health benefits including medical, dental, and vision insurance.
  • 401(k) retirement plan with company matching.
  • Flexible work hours and hybrid work environment.
  • Generous paid time off and holidays.
  • Employee wellness programs and mental health support.
  • Professional development opportunities and tuition reimbursement.