TikTok - San Jose, CA
posted 26 days ago
As a Tech Lead Machine Learning Ops Engineer on the Global SRE team at TikTok, you will ensure the stability and efficiency of the machine learning systems that support our Global Monetization Products and Technology organization. Your primary responsibility is to oversee the operation of machine learning models throughout their lifecycle, from data preparation and development through training, deployment, and serving. The role is pivotal to the performance and reliability of our online and offline machine learning systems, which are essential for delivering a seamless user experience on our platform.

In this role, you will set Service Level Objectives (SLOs) for online machine learning serving systems and ensure their stability and performance. You will also maintain the stability of offline machine learning training tasks and work to improve their success rates. A significant part of the job is rolling out GPU model training in regions outside of China, which requires careful planning and execution. Additionally, you will oversee the stability of AIGC-related machine learning tasks and manage resources for both online and offline operations, including budgeting and improving resource efficiency.

Your expertise will be critical in troubleshooting application issues and production operations so that our machine learning systems run smoothly and efficiently. You will collaborate with cross-functional teams to drive improvements and innovations in our machine learning infrastructure, contributing to TikTok's mission of inspiring creativity and bringing joy to users around the globe.