TikTok - San Jose, CA

posted 3 days ago

Full-time - Mid Level
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, with a mission to inspire creativity and bring joy. Our Compute Platform Site Reliability Engineering (SRE) team plays a crucial role in supporting all Big Data services and products across the company. As a newly established team, we are looking for talented individuals to help shape our future. The team is responsible for ensuring the reliability of TikTok's major data warehouse products, services, and query engines, including ClickHouse, Spark, Presto, and Doris. We serve business needs across various domains within TikTok, and we are excited to welcome you to our team.

In this role, you will be responsible for upholding Service Level Agreements (SLAs) and ensuring that all service level objectives from ByteDance's Data Platform services are met. You will respond promptly to any system outages or issues, and continuously analyze service performance to identify potential bottlenecks. Your proactive measures will help prevent service disruptions, and you will work closely with development teams to optimize application performance. You will lead efforts in incident management, troubleshooting, and resolving service incidents while coordinating with cross-functional teams to mitigate service-impacting events.

Automation will be a key focus, as you will automate infrastructure provisioning, scaling, and management processes to enhance service quality. Collaboration with product and development teams will be essential to integrate reliability and performance considerations into the software lifecycle. Additionally, you will assess and forecast infrastructure needs based on growth patterns and upcoming initiatives, while staying updated with industry trends and emerging technologies related to site reliability and infrastructure engineering.

Responsibilities

  • Reliability: Own the reliability of all of TikTok's major data warehouse products, services, and query engines, such as ClickHouse, Spark, Presto, and Doris.
  • Uphold Service Level Agreements (SLAs): Ensure that all service level objectives and agreements from ByteDance's Data Platform services are met. Respond promptly to any system outages or issues.
  • Continuous Performance Optimization: Analyze service performance and reliability patterns to identify potential performance bottlenecks. Implement proactive measures to prevent service disruptions. Work with development teams to optimize application performance, ensuring that services run efficiently and that resources are utilized effectively.
  • Incident Management: Lead efforts to troubleshoot and resolve service incidents, and conduct postmortems afterward. Coordinate with cross-functional teams to manage and mitigate service-impacting events.
  • Infrastructure Automation: Automate infrastructure provisioning, scaling, and management processes to reduce manual interventions and improve service quality.
  • Collaboration: Engage with product and development teams to integrate reliability and performance considerations into the software lifecycle.
  • Capacity and Demand Planning: Assess and forecast infrastructure needs based on growth patterns and upcoming initiatives.
  • Stay Updated: Keep current with industry trends, best practices, and emerging technologies related to site reliability and infrastructure engineering.

Requirements

  • Bachelor's degree or above in Computer Science, Engineering, or a related field.
  • In-depth understanding of Linux, computer networking, and databases.
  • Proficient in common SRE/DevOps open-source toolsets, system monitoring tools, and container orchestration platforms like Kubernetes.
  • Experience or familiarity with open-source or commercial technologies such as ClickHouse, Hadoop, Doris, Spark, Presto and Kubernetes.
  • Strong coding skills in at least one scripting or programming language, such as Python, Shell, Java, or Go.
  • Excellent problem-solving skills and the ability to think critically under pressure.
  • Strong written and verbal communication skills and a customer-first mindset.

Benefits

  • 100% premium coverage for employee medical insurance
  • Approximately 75% premium coverage for dependents
  • Health Savings Account (HSA) with a company match
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans
  • Flexible Spending Account (FSA) options such as Health Care, Limited Purpose, and Dependent Care
  • 10 paid holidays per year
  • 17 days of Paid Personal Time Off (PPTO)
  • 10 paid sick days per year
  • 12 weeks of paid Parental leave
  • 8 weeks of paid Supplemental Disability
  • Mental and emotional health benefits through EAP and Lyra
  • 401(k) company match
  • Gym and cellphone service reimbursements