TikTok - San Jose, CA

posted 4 days ago

Full-time - Manager
San Jose, CA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

TikTok is the leading destination for short-form mobile video, with a mission to inspire creativity and bring joy. Our Compute Platform SRE team is a newly established group responsible for supporting all Big Data services and products across the company. We ensure the reliability of TikTok's major data warehouse products, services, and query engines, including ClickHouse, Spark, Presto, and Doris.

As a Tech Lead Manager, you will lead a global SRE team distributed across the US and Singapore, focusing on service reliability and performance optimization. You will be responsible for upholding Service Level Agreements (SLAs), managing incident response, and developing robust incident management mechanisms. Your role will also involve continuous performance optimization, infrastructure automation, and collaboration with product and development teams to integrate reliability into the software lifecycle.

In this position, you will assess and forecast infrastructure needs based on growth patterns and upcoming initiatives, while staying current with industry trends and best practices. You will lead efforts to troubleshoot and resolve service incidents, coordinate with cross-functional teams, and implement proactive measures to prevent service disruptions.

Your leadership will be crucial in shaping the future of the Compute Platform SRE team, driving impact for TikTok and the communities we serve. We are looking for someone who is passionate about computer science and Internet technology, with a strong sense of ownership and the ability to collaborate effectively across time zones.

Responsibilities

  • Lead a global SRE team for TikTok's Data Platform, distributed across the US and Singapore.
  • Ensure that all service level objectives and agreements for ByteDance's Data Platform services are met.
  • Lead team members to respond promptly to any system outages or issues.
  • Analyze service performance and reliability patterns to identify potential performance bottlenecks.
  • Implement proactive measures to prevent service disruptions and optimize application performance.
  • Build robust incident management mechanisms and lead efforts to troubleshoot and resolve service incidents.
  • Coordinate with cross-functional teams to manage and mitigate service-impacting events.
  • Develop highly efficient toolchains covering end-to-end deployment and reliability assurance operations.
  • Automate infrastructure provisioning, scaling, and management processes to improve service quality.
  • Engage with product and development teams to integrate reliability and performance considerations into the software lifecycle.
  • Assess and forecast infrastructure needs based on growth patterns and upcoming initiatives.
  • Stay current with industry trends, best practices, and emerging technologies related to site reliability and infrastructure engineering.

Requirements

  • Bachelor's Degree or above in Computer Science, Engineering, or a related field.
  • 5+ years of experience in the SRE domain and 2+ years of experience in team management.
  • In-depth understanding of Linux, computer networking, and databases.
  • Proficient in common SRE/DevOps open-source toolsets, system monitoring tools, and container orchestration platforms like Kubernetes.
  • Experience or familiarity with technologies such as ClickHouse, Hadoop, Doris, Spark, Presto, and Kubernetes.
  • Strong coding skills in at least one scripting or programming language, such as Python, Shell, Java, or Go.
  • Excellent problem-solving skills and the ability to think critically under pressure.
  • Strong written and verbal communication skills with a customer-first mindset.
  • Ability to collaborate effectively with partners and team members across time zones.

Nice-to-haves

  • Experience with chaos engineering and disaster recovery drills.
  • Familiarity with cloud platforms and services.
  • Knowledge of data warehousing concepts and technologies.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% for dependents.
  • Health Savings Account (HSA) with company match.
  • Dental, Vision, Short/Long-Term Disability, Basic Life, Voluntary Life, and AD&D insurance plans.
  • Flexible Spending Account (FSA) options for healthcare and dependent care.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO).
  • 10 paid sick days per year.
  • 12 weeks of paid Parental Leave and 8 weeks of paid Supplemental Disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match, plus gym and cellphone service reimbursements.