Site Reliability Engineer, TikTok Server Architecture

$145,000 - $250,000/Yr

TikTok - San Jose, CA

posted 3 days ago

Full-time - Mid Level

Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, and our mission is to inspire creativity and bring joy. As a Site Reliability Engineer (SRE) on the Server Architecture team, you will play a crucial role in ensuring the reliability and performance of our services.

The SRE team at TikTok combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. In this position, you will tackle complex challenges of scale, drawing on your expertise in coding, algorithms, complexity analysis, and large-scale system design. Your responsibilities will span the entire service lifecycle, from inception and design through development, capacity planning, launch reviews, deployment, operation, and refinement. You will design and implement software platforms and monitoring frameworks that enable efficient, automated, and intelligent service-oriented architecture (SOA) governance. You will also scale systems sustainably through automation and improve system reliability, efficiency, and velocity by advocating for necessary changes. Finally, you will practice sustainable user support and incident response, and conduct blameless postmortems so that every incident feeds back into the continuous improvement of our systems.

At TikTok, we believe that every challenge is an opportunity to learn, innovate, and grow as a team. We are committed to creating an inclusive environment where employees are valued for their skills, experiences, and unique perspectives. Join us in our mission to inspire creativity and bring joy to our users around the globe.

Responsibilities

Engage in and improve the whole lifecycle of services from inception and design, throughout development, capacity planning, and launch reviews, to deployment, operation, and refinement.
Design and implement software platforms and monitoring frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance.
Scale systems sustainably through mechanisms such as automation; evolve system reliability, efficiency, and velocity by pushing for changes.
Practice sustainable user support, incident response, and blameless postmortems.

Requirements

Bachelor's degree in Computer Science or a related technical field, and 3+ years of experience.
Experience programming in at least one of the following languages: C, C++, Java, Python, Go, or Rust.
Familiarity with Unix/Linux system internals, networking, and distributed systems.
Experience designing and analyzing large-scale distributed systems is preferred.
Strong problem-solving and communication skills are preferred.

Nice-to-haves

Experience with cloud services and infrastructure management.
Familiarity with containerization technologies such as Docker and Kubernetes.
Knowledge of monitoring and logging tools such as Prometheus, Grafana, or the ELK stack.

Benefits

100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents.
Health Savings Account (HSA) with a company match.
Dental, Vision, Short/Long-term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care.
10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) and 10 paid sick days per year.
12 weeks of paid parental leave and 8 weeks of paid supplemental disability.
Mental and emotional health benefits through EAP and Lyra.
401(k) company match, plus gym and cellphone service reimbursements.