Sharpedge Solutions
posted 2 months ago
The Site Reliability Engineer (SRE) on the DevOps platform will play a crucial role in ensuring the reliability, performance, and availability of the Digital Sales & Marketing platforms. This position requires a strong background in software engineering, particularly Java development, and a deep understanding of Site Reliability Engineering principles.

The SRE will build and maintain dashboards, set up alerts, and proactively monitor application performance using tools such as Splunk, Grafana, and GCL. As a core member of the SRE support team, the engineer will use the latest tools to write code, develop test cases, and work with API specifications to automate processes that improve platform resiliency. The role involves collaborating with engineering teams, including Security, Networking, and Infrastructure, to address issues that may affect platform health. The SRE will also represent the platform engineering teams during production outages, working closely with stakeholders to conduct root cause analysis (RCA) and implement permanent resolutions.

The ideal candidate will have extensive experience in production support and a proven track record of improving platform health. They will be expected to identify opportunities to adopt new technologies, drive efficiency, and optimize processes while maintaining compliance with governance programs. The SRE will also be responsible for maintaining service level agreements (SLAs) and service level objectives (SLOs), continually seeking ways to improve platform metrics and communicating those improvements to stakeholders.

This position requires the ability to work shifts in a 12/7 support organization, ensuring continuous support and availability of services.