ByteDance - San Jose, CA
posted 3 days ago
As a Site Reliability Engineer at CapCut, you will play a crucial role in ensuring the stability and performance of our large-scale systems. Your primary responsibility will be to design and develop solutions that automate technical operations, working closely with other teams to improve system stability across the software development lifecycle. You will strengthen the stability of CapCut systems through monitoring, logging, dashboards, and diagnostic tools. You will conduct regular drills and create remediation plans so that services can be restored quickly, and you will be expected to take shifts responding to production issues across different regions.

In addition to operational responsibilities, you will define key performance indicators to evaluate system performance and runtime, improving observability and supporting system development and troubleshooting. You will also plan system capacity in line with business expansion and scheduled promotions. This position requires a proactive approach to problem-solving and a strong sense of ownership, as you will address system issues and collaborate with teams to implement effective solutions.

At CapCut, we are committed to fostering a culture of creativity and innovation. Our team is passionate about learning and taking on challenges, and we value good ideas that drive impact for our users. As part of a young and dynamic team, you will have the opportunity to contribute to the development of cutting-edge AI technology that enhances content creation while ensuring user privacy and data security.