TikTok - San Jose, CA

Posted 3 days ago

Full-time - Mid Level
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a Site Reliability Engineer (SRE) on TikTok's Global E-Commerce team, you will play a crucial role in ensuring the reliability and performance of our mission-critical e-commerce platform. This position is part of a global on-call rotation, where you will be responsible for Tier-1 online incident response and DevOps support. Your primary focus will be on maintaining service levels for our revenue-generating e-commerce platform, including all supporting infrastructure and services. You will work in a cloud-native environment that emphasizes service reliability, highly scalable design, and effective release management.

In this role, you will define service level indicators and data-driven objectives, and develop DevOps and SRE standards, processes, and methodologies to improve uptime, latency, and overall system health. Collaboration is key: you will work closely with engineering and product teams to ensure that essential stability and maintainability requirements, such as capacity planning and launch reviews, are met, enabling transparent service delivery to our customers.

You will also design strategies for risk detection and mitigation, disaster recovery simulations, release management, cost optimization, and engineering quality. Automation will be a significant part of your responsibilities, with a focus on infrastructure-as-code, scalability, and service resiliency. As a member of the on-call rotation, you will also implement best practices for incident management and conduct post-mortems to keep our services operational and efficient.

Responsibilities

  • Be part of the global SRE on-call rotation and be responsible for Tier-1 online incident response and DevOps support.
  • Be responsible for the service levels of a mission-critical, revenue-generating e-commerce platform, as well as all supporting infrastructure and services.
  • Define service level indicators and data-driven objectives, and develop DevOps/SRE standards, processes, and methodologies to uphold and improve the uptime, latency, and system health of a core global e-commerce production platform.
  • Collaborate across teams with engineering and product to ensure that key stability and maintainability requirements, such as capacity planning and launch reviews, are met, enabling transparent service delivery to customers.
  • Design strategies for risk detection and mitigation, disaster recovery and simulation, release management, cost optimization, engineering quality, etc.
  • Build automation geared toward infrastructure-as-code, scalability, and service resiliency.
  • Implement best practices for incident management and post-mortems while serving on on-call rotations.

Requirements

  • Bachelor's or higher degree in Computer Science, similar technical field of study, or equivalent practical experience.
  • 5+ years of experience developing, provisioning, or maintaining production-grade, large-scale distributed systems.
  • High level of proficiency in Linux OS internals, networking, microservices, databases, caches, etc. in cloud-native environments.
  • Demonstrable familiarity with programming or scripting languages (Go, Python, Bash, C++, etc.).
  • Demonstrable experience in the development and implementation of DevOps and SRE methodologies.

Nice-to-haves

  • Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents.
  • Health Savings Account (HSA) with a company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure).
  • 10 paid sick days per year.
  • 12 weeks of paid parental leave and 8 weeks of paid supplemental disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match.
  • Gym and cellphone service reimbursements.