ByteDance - Seattle, WA

posted about 2 months ago

Full-time - Mid Level
Seattle, WA
Professional, Scientific, and Technical Services

About the position

The Software Engineer, ML System Architecture position at ByteDance involves working with the Doubao (Seed) Team, which is dedicated to building industry-leading AI foundation models. Established in 2023, the team conducts research in areas including natural language processing (NLP), computer vision (CV), and speech recognition and generation. It operates across multiple locations, including China, Singapore, and the US, drawing on substantial data and computing resources to develop proprietary general-purpose models with multimodal capabilities. These models already power more than 50 ByteDance applications in the Chinese market and have been launched to external enterprise clients through Volcano Engine. The Doubao app is recognized as the most used AIGC app in China.

As part of the AML Machine Learning Systems team, the role involves providing end-to-end machine learning experiences and resources for the company. The team builds heterogeneous ML training and inference systems based on GPUs and AI chips and advances the state of the art in ML systems technology, including accelerating models such as Stable Diffusion and large language models (LLMs). It also researches and develops hardware acceleration technologies for AI and cloud computing, using distributed systems, high-performance computing (HPC), and RDMA networking. The team has a strong publication record at top-tier conferences, reflecting its commitment to innovation in the field.

In this role, the engineer will design and develop machine learning infrastructure for LLM and AIGC applications, including building a very large-scale machine learning system that integrates GPUs, RDMA networking, and high-performance storage. The engineer will also tackle technical challenges related to system stability and availability, and will coordinate efforts across multiple teams, including the data center, network, computing, storage, and resource teams.

Responsibilities

  • Design and develop machine learning infrastructure for LLM and AIGC applications.
  • Build a very large-scale machine learning system integrating GPUs, RDMA networking, and high-performance storage.
  • Solve technical problems related to the stability and availability of the system.
  • Organize and coordinate multiple teams to complete construction of the system, including the data center, network, computing, storage, and resource teams.

Requirements

  • Proficiency in one or two programming languages, such as C++, Go, Python, or Shell, in a Linux environment.
  • Understanding of the principles of distributed systems, with experience in the design, development, and maintenance of large-scale machine learning systems.
  • Familiarity with Kubernetes architecture and extensive experience in system-level development and tuning.
  • Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly.
  • A strong sense of responsibility, good learning ability, communication skills, and self-drive.

Nice-to-haves

  • Familiarity with ML infrastructure for large-model training and inference.
  • Experience in one of the following fields: AI infrastructure, HW/SW co-design, high-performance computing, or ML hardware architecture (GPUs, accelerators, networking).

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options, including Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) and 10 paid sick days per year.
  • 12 weeks of paid Parental leave and 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match, plus gym and cellphone service reimbursements.