Zoom - Seattle, WA
posted 3 months ago
As a Machine Learning Platform Engineer at Zoom, you will play a crucial role in developing and managing the AI infrastructure and framework that supports our machine learning initiatives. Your primary focus will be on enhancing the training, deployment, and operational capabilities of our AI systems, ensuring they are functional, scalable, and reliable. This position is pivotal in shaping and optimizing Zoom's AI capabilities, which are integral to our product offerings and overall mission to improve communication and collaboration.

You will be part of a dedicated AI infrastructure team that is responsible for managing the entire Machine Learning Platform. This includes overseeing model training processes and the underlying infrastructure that supports these activities. Your work will directly contribute to improving efficiency in GPU training and enhancing the throughput and latency of language model inference. The team is committed to pushing the boundaries of what is possible with AI, and your contributions will be essential in achieving these goals.

In this role, you will be tasked with developing the Machine Learning Platform management system, which involves building the necessary toolchains, services, and pipelines for model development workflows and model serving architecture. You will prioritize various metrics for monitoring model training and inference, ensuring that our systems operate at peak performance. Additionally, you will be responsible for developing and maintaining a high-performance GPU infrastructure for large language model (LLM) training, as well as understanding autoscaling for inference services and managing multiple models for dynamic loading. Your expertise will also be critical in supporting, troubleshooting, and resolving any issues that arise during the training and inference processes.