Engineering Manager Machine Learning Infrastructure

$210,000 - $358,000/Yr

ByteDance - San Jose, CA

posted about 2 months ago

Full-time - Manager

San Jose, CA

Professional, Scientific, and Technical Services

About the position

ByteDance is seeking an Engineering Manager for its Machine Learning Infrastructure team, which is dedicated to advancing the next-generation AI infrastructure and recommendation platform for various ranking systems including ads, search, and e-commerce. The successful candidate will lead a team responsible for designing and implementing distributed systems that support machine learning models, ensuring their reliability and scalability. This role is pivotal in enhancing the performance of core business systems and driving substantial impact across the company.

As an Engineering Manager, you will oversee the development of tools for monitoring and managing machine learning infrastructure, identifying system inefficiencies, and leading efforts to optimize performance. You will collaborate closely with product teams to provide tailored solutions that meet their specific needs. The ideal candidate will have a strong background in machine learning systems, experience in leading engineering teams, and a passion for solving complex challenges in a collaborative environment.

At ByteDance, we believe in the power of creativity and innovation. Our mission is to inspire creativity and enrich life, and we are committed to fostering an inclusive workplace that values diverse perspectives. We encourage candidates who are excited about pushing the boundaries of technology and making a meaningful impact to apply for this role.

Responsibilities

Lead the team in designing and implementing distributed infrastructure for inference, training, scheduling, orchestration, storage, and parameter servers, supporting feeds, ads, and search ranking models.
Oversee the development of monitoring and management tools to ensure the reliability and scalability of machine learning infrastructure.
Manage the identification and prioritization of system inefficiencies and bottlenecks, leading efforts to enhance system performance.
Lead the team in creating tools to analyze bottlenecks and sources of instability, formulating and implementing effective solutions.
Collaborate with product teams, offering comprehensive solutions tailored to their specific requirements.

Requirements

Experience in leading an engineering team.
Experience in developing and deploying large-scale machine learning systems.
Strong sense of responsibility, with good communication and teamwork skills.
Passionate about solving complex and challenging problems.
Experience contributing to an open-source machine learning framework (TensorFlow / JAX / PyTorch / TorchScript / MXNet / TensorRT).
Experience with big data frameworks (e.g., Spark/Hadoop/Flink) and with resource management and task scheduling for large-scale distributed systems.
Experience with parameter server system optimization, or with index structure optimization for search systems.
Strong background in one of the following fields: Hardware-Software Co-Design, High Performance Computing, ML Hardware Acceleration (e.g., GPU/RDMA) or ML for Systems.

Benefits

100% premium coverage for employee medical insurance
Approximately 75% premium coverage for dependents
Health Savings Account (HSA) with company match
Dental insurance
Vision insurance
Short- and long-term disability insurance
Basic Life insurance
Voluntary Life and AD&D insurance plans
Flexible Spending Account (FSA) options
10 paid holidays per year
17 days of Paid Personal Time Off (PPTO)
10 paid sick days per year
12 weeks of paid Parental leave
8 weeks of paid Supplemental Disability
Mental and emotional health benefits through EAP and Lyra
401(k) company match
Gym reimbursement
Cellphone service reimbursement
