TikTok - San Jose, CA
posted 4 months ago
TikTok is the leading destination for short-form mobile video. Its mission is to inspire creativity and bring joy. The company operates globally, with headquarters in Los Angeles and Singapore and offices in major cities including New York, London, and Tokyo.

The Recommendation Infrastructure Team at TikTok builds and optimizes the architecture of the recommendation system, ensuring a stable, high-quality experience for users. Site Reliability Engineers (SREs) on this team are responsible for maintaining system availability and for building automated systems and pipelines that improve operational efficiency.

In this role, you will engage in and improve the entire lifecycle of the recommendation systems, from system design consulting through launch reviews, deployment, operation, and refinement. You will deliver tools and software that improve the reliability and scalability of services, automate operations, and increase research and development efficiency. You will also be responsible for ensuring the availability of large-scale services deployed across global data centers, managing and optimizing cloud resource utilization, and meeting service level agreements (SLAs) for large-scale clusters. Other key responsibilities include monitoring and measuring service health, latency, and overall availability, practicing sustainable incident response, and conducting postmortems to learn from incidents.