ByteDance - Seattle, WA

posted 2 days ago

- Mid Level

About the position

ByteDance, founded in 2012, aims to inspire creativity and enrich life through its products, which include TikTok and a range of platforms built for the Chinese market. The company values creation in its teams as well as its products, and treats challenges as opportunities to learn and innovate.

The ByteDance Doubao (Seed) Team, established in 2023, builds advanced AI foundation models, with research areas that include deep learning, reinforcement learning, and AI safety. The team operates globally and draws on substantial data and computing resources to develop proprietary models that power numerous ByteDance applications. Within it, the Machine Learning (ML) System sub-team builds and maintains distributed ML training and inference systems, and is responsible for their performance and reliability across regions and cloud environments.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services across multi-data center, multi-region, and multi-cloud scenarios.
  • Manage and plan resources, including computing and storage, while overseeing costs and budgets.
  • Oversee global system disaster recovery, cluster machine governance, and improve resource utilization and operational efficiency.
  • Build software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
  • Provide on-call support for system and business operations as part of a global team.

Requirements

  • Strong experience in system engineering and machine learning.
  • Proficiency in managing distributed systems and large-scale infrastructure.
  • Experience with resource management and planning in cloud environments.
  • Knowledge of disaster recovery and business continuity planning.
  • Ability to develop software tools for monitoring and managing ML systems.

Nice-to-haves

  • Experience with GPU/NPU/RDMA/Storage integration.
  • Familiarity with AIGC/AGI systems and technologies.
  • Previous work in a global team environment.

Benefits

  • Opportunity to work with cutting-edge AI technologies.
  • Collaborative work environment with a global team.
  • Access to substantial data and computing resources for research.