Senior Site Reliability Engineer, Recommendation Infrastructure

$334,000 - $435,000/Yr

TikTok - San Jose, CA

posted 4 months ago

Full-time - Mid Level

Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, with a mission to inspire creativity and bring joy. The company operates globally, with headquarters in Los Angeles and Singapore, and offices in major cities including New York, London, and Tokyo.

The Recommendation Infrastructure Team at TikTok is tasked with building and optimizing the architecture for the recommendation system, ensuring a stable and high-quality experience for users. Site Reliability Engineers (SREs) within this team are responsible for maintaining system availability and creating automated systems and pipelines to enhance operational efficiency.

In this role, you will engage in and improve the entire lifecycle of recommendation systems, from system design consulting to launch reviews, deployment, operation, and refinement. You will deliver tools and software aimed at improving the reliability and scalability of services, automating operations, and enhancing research and development efficiency. You will also be responsible for ensuring the availability of large-scale services deployed across global data centers, managing and optimizing cloud resource utilization, and meeting service level agreements (SLAs) for large-scale clusters. Monitoring and measuring service health, latency, and overall availability will be key components of your responsibilities, along with practicing sustainable incident response and conducting postmortems to learn from incidents.

Responsibilities

Engage in and improve the whole lifecycle of recommendation systems, from system design consulting through launch reviews, deployment, operation, and refinement.
Deliver tools/software to improve the reliability and scalability of services, automate operations, and improve R&D efficiency.
Ensure the availability of large-scale services deployed across global data centers.
Plan, manage, and optimize cloud resource utilization, ensuring that large-scale clusters meet their SLAs.
Measure and monitor availability, latency, and overall service health.
Practice sustainable incident response and postmortems.

Requirements

Bachelor's degree or higher in Computer Science or a related field.
At least 2 years of SRE experience deploying large-scale systems with high reliability and scalability.
Familiarity with Linux system operations and networking.
Experience programming in at least one of the following languages: Python, Perl, Go, or C/C++.
Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
Familiarity with popular CI/CD procedures and environments.
Effective communication skills and a sense of ownership and drive.

Benefits

100% premium coverage for employee medical insurance
Approximately 75% premium coverage for dependents
Health Savings Account (HSA) with company match
Dental insurance
Vision insurance
Short-term and long-term disability insurance
Basic Life insurance
Voluntary Life and AD&D insurance plans
Flexible Spending Account (FSA) options
10 paid holidays per year
17 days of Paid Personal Time Off (PPTO)
10 paid sick days per year
12 weeks of paid Parental leave
8 weeks of paid Supplemental Disability
Mental and emotional health benefits through EAP and Lyra
401(k) company match
Gym reimbursement
Cellphone service reimbursement
