ByteDance - San Jose, CA

posted 4 months ago

Full-time - Senior
San Jose, CA
Professional, Scientific, and Technical Services

About the position

As a Senior Machine Learning Ops Engineer at ByteDance, you will play a crucial role in ensuring the efficiency and reliability of our machine learning systems, which are pivotal for large model development, training, evaluation, and inference. You will be part of the Machine Learning (ML) System sub-team, which combines system engineering with machine learning to develop and maintain massively distributed ML training and inference systems and services globally. This position offers the opportunity to work with cutting-edge technology and contribute to the development of advanced large-model AI technology.

Your responsibilities will include ensuring that our ML systems operate efficiently across multiple data centers, regions, and cloud environments. You will manage and plan computing and storage resources, oversee global system disaster recovery, and improve operational efficiency. You will also build software tools and systems to monitor and manage the ML infrastructure, keeping our systems stable and reliable.

In this role, you will collaborate with a global team of experts from the United States, China, and Singapore, working toward a unified project direction. You will have the chance to deepen your expertise in coding, performance analysis, and distributed systems, and to take part in the decision-making process. ByteDance values creativity and innovation, and you will be encouraged to embrace challenges as opportunities for growth and learning.

Responsibilities

  • Ensure ML systems operate efficiently for large model development, training, evaluation, and inference.
  • Maintain stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios.
  • Manage and plan resources, including computing and storage resources, while overseeing cost and budget.
  • Oversee global system disaster recovery and cluster machine governance, and improve resource utilization and operational efficiency.
  • Build software tools, products, and systems to monitor and manage ML infrastructure and services efficiently.
  • Provide system and business on-call support as part of the global team roster.

Requirements

  • Bachelor's degree or above in computer science, computer engineering, or a related field.
  • Strong proficiency in at least one programming language such as Go, Python, or Shell in a Linux environment.
  • Hands-on experience with Kubernetes and containers, with more than 2 years of relevant operation and maintenance experience.
  • Strong analytical skills, with the ability to abstract and decompose business logic effectively.
  • Sound documentation habits, including writing and updating workflow and technical documentation promptly.
  • Strong sense of responsibility, good learning ability, communication skills, self-drive, and team spirit.

Nice-to-haves

  • Experience in the operation and maintenance of large-scale ML distributed systems.
  • Experience in the operation and maintenance of GPU servers.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% for dependents.
  • Health Savings Account (HSA) with company match.
  • Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options for healthcare and dependent care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO).
  • 10 paid sick days per year.
  • 12 weeks of paid parental leave and 8 weeks of paid supplemental disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.