TikTok - New York, NY

posted 3 days ago

Full-time - Mid Level
New York, NY
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, and our mission is to inspire creativity and bring joy. U.S. Data Security (USDS) is a subsidiary of TikTok in the U.S., created to enhance focus and governance on data protection policies and content assurance protocols to ensure the safety of U.S. users. The USDS team is dedicated to providing oversight and protection of the TikTok platform and U.S. user data, allowing millions of Americans to continue using TikTok for learning, earning, self-expression, and entertainment. The teams within USDS include Trust & Safety, Security & Privacy, Engineering, User & Product Ops, and Corporate Functions, all working together to fulfill this commitment.

As a Site Reliability Engineer (SRE) within the AML (Applied Machine Learning) team, you will combine system engineering with machine learning to develop and operate a massively distributed AI/ML recommendation system for users in the United States and globally. This role offers the opportunity to enhance your skills in coding, performance analysis, and large-scale systems operation, while also allowing you to shape the future of AML systems and make a significant impact on TikTok users.

In this position, you will be responsible for designing, building, and maintaining highly available, scalable, and fault-tolerant systems. You will monitor and analyze system performance, proactively identifying and resolving issues before they affect users. Additionally, you will develop and maintain automated monitoring, alerting, and incident response systems, collaborating closely with software engineering teams to ensure applications are designed with reliability, scalability, and performance in mind. Security best practices will be a priority, and you will participate in on-call rotations to respond to incidents, conducting root cause analyses and implementing preventative measures to minimize future risks.

Responsibilities

  • Design, build, and maintain highly available, scalable, and fault-tolerant systems.
  • Monitor and analyze system performance, identifying and resolving issues before they affect users.
  • Develop and maintain automated monitoring, alerting, and incident response systems.
  • Collaborate closely with software engineering teams to ensure that applications are designed with reliability, scalability, and performance in mind.
  • Implement and maintain security best practices and ensure compliance with regulatory requirements.
  • Participate in on-call rotations and respond to issues and incidents within and outside of normal business hours.
  • Conduct root cause analysis of incidents, hold post-mortem reviews with stakeholders, and implement preventative measures to minimize the risk of similar incidents occurring in the future.

Requirements

  • Expertise in analyzing and troubleshooting Linux-based distributed systems.
  • Bachelor's/Master's degree in Computer Science, Computer Engineering, or equivalent years of experience in an SRE or software engineering role.
  • Experience programming with at least one commonly used language (C, C++, Python, Go).
  • Strong understanding of data structures and algorithms.
  • Competent knowledge of relational database systems.

Nice-to-haves

  • Ability to design and maintain large-scale systems.
  • Strong understanding of code optimization and routine task automation.
  • Proficiency in at least one machine learning framework: TensorFlow, PyTorch, MXNet or PaddlePaddle.

Benefits

  • 100% premium coverage for employee medical insurance and approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options, including Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure).
  • 10 paid sick days per year.
  • 12 weeks of paid Parental leave and 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.