TikTok - San Jose, CA
posted 3 days ago
TikTok is the leading destination for short-form mobile video, with a mission to inspire creativity and bring joy. Our Compute Platform Site Reliability Engineering (SRE) team plays a crucial role in supporting all Big Data services and products across the company. As a newly established team, we are looking for talented individuals to help shape our future. The team is responsible for ensuring the reliability of TikTok's major data warehouse products, services, and query engines, including ClickHouse, Spark, Presto, and Doris. We serve business needs across various domains within TikTok, and we are excited to welcome you to our team.

In this role, you will be responsible for upholding Service Level Agreements (SLAs) and ensuring that all service level objectives (SLOs) for ByteDance's Data Platform services are met. You will respond promptly to system outages and issues, and continuously analyze service performance to identify potential bottlenecks. Your proactive measures will help prevent service disruptions, and you will work closely with development teams to optimize application performance.

You will lead incident management, troubleshooting and resolving service incidents while coordinating with cross-functional teams to mitigate service-impacting events. Automation will be a key focus: you will automate infrastructure provisioning, scaling, and management processes to enhance service quality. You will collaborate with product and development teams to integrate reliability and performance considerations into the software lifecycle. Additionally, you will assess and forecast infrastructure needs based on growth patterns and upcoming initiatives, while staying current with industry trends and emerging technologies in site reliability and infrastructure engineering.