TikTok - Seattle, WA

posted 3 days ago

Full-time - Mid Level
Seattle, WA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, and our mission is to inspire creativity and bring joy. U.S. Data Security (USDS) is a subsidiary of TikTok in the U.S., created to enhance focus and governance on data protection policies and content assurance protocols to keep U.S. users safe. The teams within USDS are dedicated to providing oversight and protection of the TikTok platform and U.S. user data, ensuring that millions of Americans can continue to use TikTok for learning, earning, expressing creativity, or entertainment.

The Site Reliability Engineering (SRE) team within the AML (Applied Machine Learning) division combines system engineering with machine learning to develop and operate a large-scale AI/ML recommendation system for users in the United States and globally. As a Site Reliability Engineer, you will have the opportunity to sharpen your skills in coding, performance analysis, and large-scale systems operation. You will play a crucial role in shaping the future of AML systems and making a tangible impact on TikTok users.

The SRE team is committed to collaboration and cross-functional partnerships, and currently follows a hybrid work schedule requiring employees to work in the office three days a week, with flexibility as directed by management. This model is regularly reviewed, and specific requirements may change over time.

In this role, you will be responsible for designing, building, and maintaining highly available, scalable, and fault-tolerant systems. You will monitor and analyze system performance, identifying and resolving issues proactively to prevent user impact. Additionally, you will develop and maintain automated monitoring, alerting, and incident response systems, collaborating closely with software engineering teams to ensure applications are designed with reliability, scalability, and performance in mind. You will also implement and maintain security best practices, ensuring compliance with regulatory requirements, and participate in on-call rotations to respond to issues and incidents as they arise.

Responsibilities

  • Design, build, and maintain highly available, scalable, and fault-tolerant systems.
  • Monitor and analyze system performance, identifying and resolving issues before they cause user impact.
  • Develop and maintain automated monitoring, alerting, and incident response systems.
  • Collaborate closely with software engineering teams to ensure that applications are designed with reliability, scalability, and performance in mind.
  • Implement and maintain security best practices and ensure compliance with regulatory requirements.
  • Participate in on-call rotations and respond to issues and incidents within and outside of normal business hours.
  • Conduct root cause analysis of incidents, hold post-mortem reviews with stakeholders, and implement preventative measures to minimize the risk of similar incidents occurring in the future.

Requirements

  • Expertise in analyzing and troubleshooting Linux-based distributed systems.
  • Bachelor's/Master's degree in Computer Science, Computer Engineering, or equivalent years of experience in an SRE or software engineering role.
  • Experience programming with at least one commonly used language (C, C++, Python, Go).
  • Strong understanding of data structures and algorithms.
  • Competent knowledge of relational database systems.

Nice-to-haves

  • Ability to design and maintain large-scale systems.
  • Strong understanding of code optimization and routine task automation.
  • Proficiency in at least one machine learning framework: TensorFlow, PyTorch, MXNet or PaddlePaddle.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options, including Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure).
  • 10 paid sick days per year.
  • 12 weeks of paid Parental leave.
  • 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.