TikTok - Los Angeles, CA

posted 3 months ago

Full-time - Mid Level
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Intelligent Creation Site Reliability Engineering (SRE) Team at TikTok is on a mission to enhance the content creation platform through the application of visual intelligence and artificial intelligence. As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability and performance of our services, which are essential for empowering content creators and users alike. This position is ideal for individuals who are passionate about software reliability and enjoy tackling complex challenges in a dynamic environment. You will work closely with product teams to implement the latest AI Generative Content, Intelligent Editing, and Content Understanding technologies, making a tangible impact on TikTok users around the world.

In this role, you will be responsible for deploying and maintaining the content creation platform, which includes training, inference, and pipeline orchestration in a production environment. You will continuously integrate and deploy services to the cloud, ensuring optimal performance and reliability. Your expertise will be vital in developing and maintaining software, identifying performance bottlenecks, and debugging issues. Additionally, you will engage in service capacity planning, demand forecasting, and system tuning to enhance the overall efficiency of our services.

The SRE team is dedicated to monitoring the health and performance of over 100 microservices that power TikTok's content creation platform. You will intervene as needed to rectify outages or issues, ensuring that our platform remains robust and reliable. This position requires a collaborative mindset, as you will work closely with cross-functional teams to foster effective partnerships and enhance our service-oriented architecture governance.

TikTok promotes a hybrid work schedule, requiring employees to work in the office three days a week, with flexibility based on departmental needs.

Responsibilities

  • Provide site reliability engineering support to deploy and maintain the content creation platform, including training, inference, and pipeline orchestration in the production environment, under the guidance of senior SREs.
  • Continuously integrate and deploy our services to the cloud environment, ensuring optimal performance and reliability.
  • Develop and maintain software while looking into performance bottlenecks and debugging software issues.
  • Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning.
  • Assist the team in managing frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance.
  • Monitor health and performance of 100+ microservices that power TikTok's content creation platform; intervene as needed to rectify outages or issues.

Requirements

  • A minimum of 2 years of previous experience as an SRE or in a similar software engineering role.
  • Ability to write clean, maintainable code, with proficiency in languages like Python, Java, or Go.
  • Extensive experience working within a cloud environment, on platforms such as AWS, GCP, or Azure.
  • Strong understanding of software development and cloud architecture best practices.
  • Experience supporting microservices at scale; familiarity with observability tools is desirable.
  • Excellent problem-solving skills and an ability to manage complex tasks efficiently.
  • Good communication skills for effective collaboration within the team and with other departments.

Nice-to-haves

  • Prior experience using tools like Kubernetes, Docker, Prometheus, or other similar technologies.
  • Knowledge of or experience in DevOps methodologies and continuous integration/continuous deployment (CI/CD) processes.
  • Familiarity with network protocols, security, and DNS.
  • Certifications from recognized bodies in relevant fields, e.g. Google Certified Professional - Cloud Architect, AWS Certified DevOps Engineer.
  • Knowledge of machine learning and AI concepts.

Benefits

  • Inclusive workplace culture
  • Reasonable accommodations for candidates with disabilities or for other legally protected reasons
  • Opportunities for professional development and growth