TikTok - San Jose, CA

posted 4 months ago

Full-time - Mid Level
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, with a mission to inspire creativity and bring joy. The company operates globally, with headquarters in Los Angeles and Singapore, and offices in major cities including New York, London, and Tokyo.

The Recommendation Infrastructure Team at TikTok is tasked with building and optimizing the architecture for the recommendation system, ensuring a stable and high-quality experience for users. Site Reliability Engineers (SREs) within this team are responsible for maintaining system availability and creating automated systems and pipelines to enhance operational efficiency.

In this role, you will engage in and improve the entire lifecycle of recommendation systems, from system design consulting to launch reviews, deployment, operation, and refinement. You will deliver tools and software aimed at improving the reliability and scalability of services, automating operations, and enhancing research and development efficiency. Additionally, you will be responsible for ensuring the availability of large-scale services deployed across global data centers, managing and optimizing cloud resource utilization, and ensuring service level agreements (SLAs) for large-scale clusters. Monitoring and measuring service health, latency, and overall availability will also be key components of your responsibilities, along with practicing sustainable incident response and conducting postmortems to learn from incidents.

Responsibilities

  • Engage in and improve the whole lifecycle of recommendation systems, from system design consulting through to launch reviews, deployment, operation, and refinement.
  • Deliver tools/software to improve the reliability and scalability of services, automate operations, and improve R&D efficiency.
  • Ensure the availability of large-scale services deployed across global data centers.
  • Plan, manage, and optimize cloud resource utilization, ensuring SLAs for large-scale clusters.
  • Measure and monitor availability, latency, and overall service health.
  • Practice sustainable incident response and postmortems.

Requirements

  • Bachelor's degree or higher in Computer Science or a related field.
  • At least 2 years of SRE experience deploying large-scale systems with high reliability and scalability.
  • Familiarity with Linux system operations and networking.
  • Experience programming in at least one of the following languages: Python, Perl, Go, or C/C++.
  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Familiarity with popular CI/CD procedures and environments.
  • Effective communication skills and a sense of ownership and drive.

Benefits

  • 100% premium coverage for employee medical insurance
  • Approximately 75% premium coverage for dependents
  • Health Savings Account (HSA) with company match
  • Dental insurance
  • Vision insurance
  • Short- and long-term disability insurance
  • Basic Life insurance
  • Voluntary Life and AD&D insurance plans
  • Flexible Spending Account (FSA) options
  • 10 paid holidays per year
  • 17 days of Paid Personal Time Off (PPTO)
  • 10 paid sick days per year
  • 12 weeks of paid Parental leave
  • 8 weeks of paid Supplemental Disability
  • Mental and emotional health benefits through EAP and Lyra
  • 401(k) company match
  • Gym reimbursement
  • Cellphone service reimbursement