ByteDance
Posted 12 days ago
Mid Level
Seattle, WA
Publishing Industries

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, not just in its products but also within its teams, fostering an environment where challenges are seen as opportunities for learning and innovation.

The ByteDance Doubao (Seed) Team, established in 2023, is focused on pioneering advanced AI foundation models, with research areas that include deep learning, reinforcement learning, and AI safety. The team operates globally, leveraging substantial data and computing resources to develop proprietary models that power numerous ByteDance applications.

The Machine Learning (ML) System sub-team is dedicated to creating and maintaining distributed ML training and inference systems, ensuring high performance and reliability across various regions and cloud environments.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi-data center, multi-region, and multi-cloud scenarios.
  • Manage and plan resources, including computing and storage, while overseeing costs and budgets.
  • Oversee global system disaster recovery, cluster machine governance, and improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide on-call support for system and business operations as part of a global team.

Requirements

  • Strong experience in system engineering and machine learning.
  • Proficiency in managing distributed systems and large-scale infrastructure.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business continuity planning.
  • Ability to develop software tools for monitoring and managing ML systems.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage integration.
  • Familiarity with AIGC/AGI systems and technologies.
  • Previous work in a global team environment.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative work environment with a global team.
  • Access to substantial data and computing resources for research.

Job Keywords

Hard Skills
  • Build Tools
  • Data Centers
  • Machine Learning
  • Systems Engineering
  • TikTok