Splunk
posted 3 months ago
Splunk is dedicated to building a safer and more resilient digital world, and as an early-career Site Reliability Engineer (SRE), you will play a crucial role in this mission. The Cloud organization at Splunk develops and maintains robust, resilient platform solutions for the Software as a Service (SaaS) hosting of Splunk's enterprise software. This position is part of the TechOps team, which monitors and resolves issues that affect the availability and performance of Splunk for our cloud customers around the clock. As a member of this team, you will be the authority on customer experience, providing support and guidance to ensure that all technical issues are addressed promptly and effectively.

This is a fully remote position. You will work four 10-hour shifts (4 x 10), Wednesday through Saturday, 4 PM to 2 AM.

Your primary responsibilities will include providing technical support for the Splunk Cloud fleet, performing impact assessments, documenting issues and remediation steps, and leading support cases. You will also communicate with TechOps engineers and business partners about cloud-related issues, assist colleagues with complex tasks, and represent the TechOps team in meetings, where you will recommend new procedures and processes. During escalated incidents, you will use internal tools to restore normal service operations quickly, ensuring a quality customer experience at all times.

The ideal candidate has a passion for large, complex systems and experience working with distributed systems. You will be expected to think critically about automation and efficiency, constantly asking how processes can be improved and scaled across thousands of machines. Data-driven decision-making is key, and you will strive to identify issues before they impact customers. This role requires a proactive approach to problem-solving and a commitment to maintaining high standards of service.