TikTok - San Jose, CA

posted 27 days ago

Full-time - Senior
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, and our mission is to inspire creativity and bring joy. The MLOps - Global SRE team plays a crucial role in ensuring the stability and efficiency of machine learning systems under the Global Monetization Products and Technology organization. This position focuses on the operational aspects of machine learning models, encompassing data preparation, development, training, deployment, and serving.

As a Senior Machine Learning Ops Engineer, you will be responsible for setting Service Level Objectives (SLOs) for online machine learning serving systems and maintaining their stability. You will also oversee the stability of offline machine learning training tasks, working to improve their success rates. Additionally, you will roll out GPU model training in non-China regions and ensure the stability of AIGC-related machine learning tasks. Resource management and planning for machine learning resources, including cost and budget considerations, will also be part of your responsibilities.

At TikTok, we believe that every challenge is an opportunity to learn, innovate, and grow as a team. We are committed to creating an inclusive environment where employees are valued for their unique skills and perspectives. Our platform connects people globally, and we strive to reflect the diverse communities we serve. Join us in our mission to inspire creativity and bring joy, and be part of a team that drives impact for ourselves, our company, and the communities we serve.

Responsibilities

  • Set SLOs for online machine learning serving systems and maintain their stability.
  • Maintain the stability of offline machine learning training tasks and improve their success rate.
  • Roll out GPU model training in non-China regions.
  • Ensure the stability of AIGC-related machine learning tasks.
  • Manage and plan machine learning resources, including cost and budget, resource efficiency improvements, and offline and online resource tides.

Requirements

  • Bachelor's degree in Computer Science or Software Engineering, similar technical field of study, or equivalent practical experience.
  • Expertise in Linux operating systems, networking, and storage.
  • Experience programming in at least one of the following programming languages: Python, Go, C, C++, or Java.
  • Experience troubleshooting application issues or running production operations.
  • Effective communication skills and a sense of ownership and drive.

Nice-to-haves

  • Experience in SRE of machine learning systems.
  • Experience in SRE of ads/recommendation/search systems.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life and AD&D insurance plans.
  • Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure).
  • 10 paid sick days per year.
  • 12 weeks of paid Parental leave.
  • 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.