TikTok - San Jose, CA

posted 3 days ago

Full-time - Senior
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a Senior Site Reliability Engineer (SRE) on TikTok's Global E-Commerce team, you will play a crucial role in ensuring the reliability and performance of our mission-critical e-commerce platform. This position is part of a global SRE on-call rotation, where you will be responsible for Tier-1 online incident response and DevOps support. Your primary focus will be maintaining service levels for our revenue-generating e-commerce platform and its supporting infrastructure, with an emphasis on service reliability, highly scalable design, and effective release management in a cloud-native environment.

In this role, you will define service level indicators and data-driven objectives, and develop DevOps and SRE standards, processes, and methodologies to improve uptime, latency, and overall system health. Collaboration is key: you will work closely with engineering and product teams to ensure that essential stability and maintainability requirements, such as capacity planning and launch reviews, are met, enabling transparent service delivery to our customers.

You will also design strategies for risk detection and mitigation, disaster recovery, release management, cost optimization, and engineering quality. Automation will be a significant part of your responsibilities, with a focus on infrastructure-as-code, scalability, and service resiliency. Additionally, you will implement best practices for incident management and post-mortems while participating in on-call rotations, ensuring that we learn from incidents and continuously improve our systems and processes.

Responsibilities

  • Be part of the global SRE on-call rotation and be responsible for Tier-1 online incident response and DevOps support.
  • Be responsible for the service levels of a mission-critical, revenue-generating e-commerce platform, as well as all supporting infrastructure and services.
  • Define service level indicators and data-driven objectives, and develop DevOps/SRE standards, processes, and methodologies to uphold and improve the uptime, latency, and system health of a core global e-commerce production platform.
  • Collaborate across teams with engineering and product to ensure that key stability and maintainability requirements, such as capacity planning and launch reviews, are met to enable transparent service delivery to customers.
  • Design strategies for risk detection and mitigation, disaster recovery and simulation, release management, cost optimization, engineering quality, etc.
  • Build automation geared toward infrastructure-as-code, scalability, and service resiliency.
  • Implement best practices around incident management and post-mortems while participating in on-call rotations.

Requirements

  • Bachelor's or higher degree in Computer Science, similar technical field of study, or equivalent practical experience.
  • 5+ years of experience developing, provisioning, or maintaining production-grade, large-scale distributed systems.
  • High level of proficiency in Linux OS internals, networking, microservices, databases, caches, etc., in cloud-native environments.
  • Demonstrable familiarity with programming or scripting languages (Go, Python, Bash, C++, etc.).
  • Demonstrable experience in the development and implementation of DevOps and SRE methodologies.

Nice-to-haves

  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.

Benefits

  • 100% premium coverage for employee medical insurance and approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) and 10 paid sick days per year.
  • 12 weeks of paid parental leave and 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match, plus gym and cellphone service reimbursements.