TikTok - Seattle, WA

posted 3 days ago

Full-time - Mid Level
Seattle, WA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a Site Reliability Engineer (SRE) within TikTok's U.S. Data Security (USDS) division, you will play a crucial role in enhancing the reliability and scalability of our recommendation systems. TikTok is a leading platform for short-form mobile video, and our mission is to inspire creativity and bring joy to our users. The USDS division was established to ensure the protection of U.S. user data and to uphold our data protection policies and content assurance protocols. This role is pivotal in maintaining the integrity and performance of our services, allowing millions of users to engage with TikTok safely and effectively.

In this position, you will engage in the entire lifecycle of recommendation systems, from system design consulting to deployment, operation, and refinement. You will be responsible for delivering tools and software that enhance the reliability of our services, automate operations, and improve research and development efficiency. Your work will involve building the availability of large-scale services deployed across global data centers, planning and managing cloud resource utilization, and ensuring that service level agreements (SLAs) for large-scale clusters are met. You will also measure and monitor service health, latency, and availability, and practice sustainable incident response and postmortems to continuously improve our systems.

Joining TikTok means being part of a team that values creativity and innovation. We believe that every challenge is an opportunity to learn and grow, and we encourage our employees to collaborate and inspire one another. Our hybrid work model requires employees to work in the office three days a week, fostering collaboration and cross-functional partnerships. We regularly review this model to adapt to the needs of our teams and the organization.

Responsibilities

  • Engage in and improve the whole lifecycle of recommendation systems, from system design consulting through launch reviews, deployment, operation, and refinement
  • Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R&D efficiency
  • Build availability of large-scale services deployed across global data centers
  • Plan, manage and optimize cloud resource utilization, ensuring the SLAs of large-scale clusters are met
  • Measure and monitor availability, latency and overall service health
  • Practice sustainable incident response and postmortems

Requirements

  • Bachelor's degree or above in Computer Science or a related field, with at least 2 years of related work experience
  • SRE experience deploying large-scale systems with high reliability and scalability
  • Familiarity with Linux system operations and networking
  • Experience programming in at least one of the following languages: Python, Perl, Go, or C/C++
  • Experience in designing, analyzing and troubleshooting large-scale distributed systems
  • Familiarity with popular CI/CD procedures and environments
  • Effective communication skills and a sense of ownership and drive

Benefits

  • 100% premium coverage for employee medical insurance
  • Approximately 75% premium coverage for dependents
  • Health Savings Account (HSA) with company match
  • Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life and AD&D insurance plans
  • Flexible Spending Account (FSA) options such as Health Care, Limited Purpose and Dependent Care
  • 10 paid holidays per year
  • 17 days of Paid Personal Time Off (PPTO)
  • 10 paid sick days per year
  • 12 weeks of paid Parental leave
  • 8 weeks of paid Supplemental Disability
  • Mental and emotional health benefits through EAP and Lyra
  • 401(k) company match
  • Gym and cellphone service reimbursements