TikTok - San Jose, CA
posted 3 months ago
At TikTok, we are on a mission to inspire creativity and bring joy to our users. As a leading destination for short-form mobile video, our platform is designed to help imaginations thrive. Our global headquarters are located in Los Angeles and Singapore, with offices in major cities around the world including New York, London, and Tokyo. We believe that every challenge is an opportunity for growth and innovation, and we are committed to creating an environment where our teams can collaborate and drive impact together.

The role of Site Reliability Engineer (SRE) is crucial to our success, as it involves providing support for the deployment and maintenance of our machine learning (ML) systems and platforms. This includes overseeing training, inference, and pipeline orchestration in a production environment, all while working under the guidance of senior-level SREs. The ideal candidate will be responsible for designing and implementing software platforms and infrastructures, ensuring system health through effective monitoring, and developing large-scale distributed ML training and serving systems. In addition to technical responsibilities, the SRE will assist in managing frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance.

We value sustainable user support, incident response, and conducting blameless postmortems to continuously improve our processes. This position offers an exciting opportunity to be part of a dynamic team that is tackling new challenges and developing innovative solutions in a fast-paced environment.