ByteDance
Posted 12 days ago
Mid Level
San Jose, CA
Publishing Industries

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, innovation, and teamwork in achieving its goals. The Doubao (Seed) Team, established in 2023, focuses on pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, and AI safety. The Machine Learning (ML) System sub-team is dedicated to developing and maintaining distributed ML training and inference systems globally, ensuring high performance and reliability.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi-data center, multi-region, and multi-cloud scenarios.
  • Manage resources and planning, including computing and storage resources, while overseeing cost and budget.
  • Oversee global system disaster recovery, cluster machine governance, and stability of business services.
  • Improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide system and business on-call support as part of the global team.

Requirements

  • Strong experience in systems engineering and machine learning.
  • Proficiency in developing and maintaining distributed systems.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business service stability.
  • Ability to build software tools for infrastructure management.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage systems.
  • Familiarity with large-scale heterogeneous systems.
  • Background in AI safety and multimodal capabilities.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative global team environment.
  • Access to substantial data and computing resources.

Job Keywords

Hard Skills
  • Build Tools
  • Data Centers
  • Machine Learning
  • Systems Engineering
  • TikTok