ByteDance
Posted 6 days ago
Mid Level
San Jose, CA
Publishing Industries

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, innovation, and teamwork in achieving its goals. The Doubao (Seed) Team, established in 2023, focuses on pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, and AI safety. The Machine Learning (ML) System sub-team is dedicated to developing and maintaining distributed ML training and inference systems globally, ensuring high performance and reliability.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi-data center, multi-region, and multi-cloud scenarios.
  • Manage resources and planning, including computing and storage resources, while overseeing cost and budget.
  • Oversee global system disaster recovery, cluster machine governance, and stability of business services.
  • Improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide system and business on-call support as part of the global team.

Requirements

  • Strong experience in system engineering and machine learning.
  • Proficiency in developing and maintaining distributed systems.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business service stability.
  • Ability to build software tools for infrastructure management.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage systems.
  • Familiarity with large-scale heterogeneous systems.
  • Background in AI safety and multimodal capabilities.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative global team environment.
  • Access to substantial data and computing resources.

Job Keywords

Hard Skills
  • Build Tools
  • Data Centers
  • Machine Learning
  • Systems Engineering
  • TikTok