ByteDance
Posted 6 days ago
Mid Level
Seattle, WA
Publishing Industries

About the position

ByteDance, founded in 2012, is on a mission to inspire creativity and enrich life through its diverse suite of products, including TikTok and various platforms tailored for the Chinese market. The company emphasizes the importance of creation, not just in its products but also within its teams, fostering an environment where challenges are seen as opportunities for learning and innovation.

The ByteDance Doubao (Seed) Team, established in 2023, is focused on pioneering advanced AI foundation models, with research areas that include deep learning, reinforcement learning, and AI safety. The team operates globally, leveraging substantial data and computing resources to develop proprietary models that power numerous ByteDance applications.

The Machine Learning (ML) System sub-team is dedicated to creating and maintaining distributed ML training and inference systems, ensuring high performance and reliability across various regions and cloud environments.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain the stability of offline tasks and services across multi-data-center, multi-region, and multi-cloud scenarios.
  • Manage and plan resources, including computing and storage, while overseeing costs and budgets.
  • Oversee global system disaster recovery and cluster machine governance, and improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide on-call support for system and business operations as part of a global team.

Requirements

  • Strong experience in systems engineering and machine learning.
  • Proficiency in managing distributed systems and large-scale infrastructure.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business continuity planning.
  • Ability to develop software tools for monitoring and managing ML systems.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage integration.
  • Familiarity with AIGC/AGI systems and technologies.
  • Previous work in a global team environment.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative work environment with a global team.
  • Access to substantial data and computing resources for research.

Job Keywords

Hard Skills
  • Build Tools
  • Data Centers
  • Machine Learning
  • Systems Engineering
  • TikTok