Bytedance - San Jose, CA

posted about 2 months ago

Full-time - Manager
San Jose, CA
Professional, Scientific, and Technical Services

About the position

ByteDance is seeking an Engineering Manager for its Machine Learning Infrastructure team, which is dedicated to advancing the next-generation AI infrastructure and recommendation platform for various ranking systems including ads, search, and e-commerce. The successful candidate will lead a team responsible for designing and implementing distributed systems that support machine learning models, ensuring their reliability and scalability. This role is pivotal in enhancing the performance of core business systems and driving substantial impact across the company.

As an Engineering Manager, you will oversee the development of tools for monitoring and managing machine learning infrastructure, identifying system inefficiencies, and leading efforts to optimize performance. You will collaborate closely with product teams to provide tailored solutions that meet their specific needs. The ideal candidate will have a strong background in machine learning systems, experience in leading engineering teams, and a passion for solving complex challenges in a collaborative environment.

At ByteDance, we believe in the power of creativity and innovation. Our mission is to inspire creativity and enrich life, and we are committed to fostering an inclusive workplace that values diverse perspectives. We encourage candidates who are excited about pushing the boundaries of technology and making a meaningful impact to apply for this role.

Responsibilities

  • Lead the team to design and implement distributed inference/training/scheduling/orchestration/storage/parameter server infrastructure for feeds, ads, and search ranking models.
  • Oversee the development of monitoring and management tools to ensure the reliability and scalability of machine learning infrastructure.
  • Manage the identification and prioritization of system inefficiencies and bottlenecks, leading efforts to enhance system performance.
  • Lead the team in creating tools to analyze bottlenecks and sources of instability, formulating and implementing effective solutions.
  • Collaborate with product teams, offering comprehensive solutions tailored to their specific requirements.

Requirements

  • Experience in leading an engineering team.
  • Experience in developing and deploying large-scale machine learning systems.
  • Strong sense of responsibility, with good communication and teamwork skills.
  • Passionate about solving complex and challenging problems.
  • Experience contributing to an open-source machine learning framework (TensorFlow / JAX / PyTorch / TorchScript / MXNet / TensorRT).
  • Experience with big data frameworks (e.g., Spark/Hadoop/Flink) and with resource management and task scheduling for large-scale distributed systems.
  • Experience with parameter server system optimization, or index structure optimization for search systems.
  • Strong background in one of the following fields: Hardware-Software Co-Design, High Performance Computing, ML Hardware Acceleration (e.g., GPU/RDMA) or ML for Systems.

Benefits

  • 100% premium coverage for employee medical insurance
  • Approximately 75% premium coverage for dependents
  • Health Savings Account (HSA) with company match
  • Dental insurance
  • Vision insurance
  • Short- and long-term disability insurance
  • Basic Life insurance
  • Voluntary Life and AD&D insurance plans
  • Flexible Spending Account (FSA) options
  • 10 paid holidays per year
  • 17 days of Paid Personal Time Off (PPTO)
  • 10 paid sick days per year
  • 12 weeks of paid Parental leave
  • 8 weeks of paid Supplemental Disability
  • Mental and emotional health benefits through EAP and Lyra
  • 401(k) company match
  • Gym reimbursement
  • Cellphone service reimbursement