TikTok - San Jose, CA

posted 26 days ago

Full-time - Mid Level
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a Tech Lead Machine Learning Ops Engineer within the Global SRE team at TikTok, you will play a crucial role in ensuring the stability and efficiency of machine learning systems that support our Global Monetization Products and Technology organization. Your primary responsibility will be to oversee the operations of machine learning models throughout their lifecycle, from data preparation and development to training, deployment, and serving. This position is pivotal in maintaining the performance and reliability of our online and offline machine learning systems, which are essential for delivering a seamless user experience on our platform.

In this role, you will be responsible for setting Service Level Objectives (SLOs) for online machine learning serving systems, ensuring their stability and performance. You will also focus on maintaining the stability of offline machine learning training tasks, working to improve their success rates. A significant aspect of your job will involve rolling out GPU model training in regions outside of China, which requires careful planning and execution. Additionally, you will oversee the stability of AIGC-related machine learning tasks and manage resources effectively, including budgeting and enhancing resource efficiency for both online and offline operations.

Your expertise will be critical in troubleshooting application issues and production operations, ensuring that our machine learning systems operate smoothly and efficiently. You will collaborate with cross-functional teams to drive improvements and innovations in our machine learning infrastructure, contributing to TikTok's mission of inspiring creativity and bringing joy to users around the globe.

Responsibilities

  • Responsible for setting SLOs for online machine learning serving systems and maintaining the stability of those systems.
  • Responsible for maintaining the stability of offline machine learning training tasks and improving their success rate.
  • Responsible for rolling out GPU model training in non-China regions.
  • Responsible for the stability of AIGC-related machine learning tasks.
  • Responsible for management and planning of machine learning resources, including cost and budget, resource efficiency improvements, and offline and online resource tides.

Requirements

  • Bachelor's degree in Computer Science, Software Engineering, or a similar technical field of study, or equivalent practical experience.
  • Expertise in Linux operating systems, networking, and storage.
  • Experience programming in at least one of the following programming languages: Python, Go, C, C++, or Java.
  • Experience in troubleshooting application issues or in production operations.
  • Effective communication skills and a sense of ownership and drive.

Nice-to-haves

  • Experience in SRE of machine learning systems.
  • Experience in SRE of ads/recommendation/search systems.

Benefits

  • 100% premium coverage for employee medical insurance and approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options, such as Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure).
  • 10 paid sick days per year.
  • 12 weeks of paid Parental leave.
  • 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.