ByteDance - Seattle, WA
posted about 2 months ago
The Software Engineer, ML System Architecture position at ByteDance sits within the Doubao (Seed) Team, which is dedicated to building industry-leading AI foundation models. Established in 2023, the team conducts research in areas including natural language processing (NLP), computer vision (CV), and speech recognition and generation. It operates across multiple locations, including China, Singapore, and the US, and draws on substantial data and computing resources to develop proprietary general-purpose models with multimodal capabilities. These models already power over 50 ByteDance applications in the Chinese market and have been launched to external enterprise clients through Volcano Engine. The Doubao app is recognized as the most used AIGC app in China.
As part of the AML Machine Learning Systems team, the role involves providing end-to-end machine learning experiences and resources for the company. The team builds heterogeneous ML training and inference systems based on GPUs and AI chips, advancing the state of the art in ML systems technology. This includes accelerating models such as Stable Diffusion and large language models (LLMs). The team also researches and develops hardware acceleration technologies for AI and cloud computing, using distributed systems, high-performance computing (HPC), and RDMA networking. Its strong publication record at top-tier conferences reflects its commitment to innovation in the field.
In this role, the engineer will design and develop machine learning infrastructure for LLM and AIGC applications. This includes building a very large-scale machine learning system that integrates GPUs, RDMA networking, and high-performance storage. The engineer will also tackle technical challenges related to system stability and availability, while coordinating across multiple teams, including data center, network, computing, storage, and resource teams.