This job is closed

ByteDance
Posted 12 days ago
Mid Level
Seattle, WA
Publishing Industries

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, innovation, and teamwork in achieving its goals. The Doubao (Seed) Team, established in 2023, focuses on pioneering advanced AI foundation models, with research areas spanning deep learning, reinforcement learning, and AI safety. The Machine Learning (ML) System sub-team is dedicated to developing and maintaining distributed ML training and inference systems globally, ensuring high performance and reliability.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi-data center, multi-region, and multi-cloud scenarios.
  • Manage resources and planning, including computing and storage resources, while overseeing costs and budgets.
  • Oversee global system disaster recovery, cluster machine governance, and improve resource utilization and operational efficiency.
  • Build software tools, products, and systems to monitor and manage ML infrastructure and services effectively.
  • Participate in the global team's on-call rotation, supporting both systems and business services.

Requirements

  • Strong experience in system engineering and machine learning.
  • Proficiency in managing distributed systems and large-scale ML training environments.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business service stability practices.
  • Ability to build monitoring and management tools for ML infrastructure.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage integration.
  • Familiarity with AIGC/AGI systems and their operational requirements.
  • Previous experience in a global team environment.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative work environment with a global team.
  • Access to substantial data and computing resources for research.

Job Keywords

Hard Skills
  • Build Tools
  • Data Centers
  • Machine Learning
  • Systems Engineering
  • TikTok
© 2024 Teal Labs, Inc