Site Reliability Engineering on DevOps Platform

Sharpedge Solutions

posted 3 months ago

Full-time - Mid Level

Professional, Scientific, and Technical Services

About the position

The Site Reliability Engineer (SRE) on the DevOps platform will play a crucial role in ensuring the reliability, performance, and availability of Digital Sales & Marketing platforms. This position requires a strong background in software engineering, particularly in Java development, and a deep understanding of Site Reliability Engineering principles.

The SRE will be responsible for building and maintaining dashboards, setting up alerts, and proactively monitoring application performance using tools such as Splunk, Grafana, and GCL. As a core member of the SRE support team, the engineer will utilize the latest technology tools to write code, develop test cases, and work with API specifications to automate processes that enhance platform resiliency.

The role involves collaborating with various engineering teams, including Security, Networking, and Infrastructure, to address challenges that may impact platform health. The SRE will also represent the platform engineering teams during production outages, working closely with stakeholders to conduct root cause analysis (RCA) and implement permanent resolutions.

The ideal candidate will have extensive experience in production support and a proven track record of improving platform health. They will be expected to identify opportunities for adopting new technologies, drive efficiency, and optimize processes while maintaining compliance with governance programs. The SRE will also be responsible for maintaining service level agreements (SLAs) and service level objectives (SLOs), constantly seeking ways to enhance platform metrics and communicate improvements to stakeholders.

This position requires the ability to work shifts in a 12/7 support organization, ensuring continuous support and availability of services.

Responsibilities

Build and maintain dashboards and set up alerts using Splunk, Grafana, and GCL.
Proactively monitor application performance through various APM tools.
Collaborate with engineering teams to resolve production outages and conduct root cause analysis (RCA).
Identify opportunities to adopt new technologies and improve operational efficiency.
Maintain service level agreements (SLAs) and service level objectives (SLOs).
Support the governance programs and processes in the functional area.
Communicate and mitigate risks originating from non-compliance and operational errors.
Work with legacy and cloud infrastructure to ensure platform resiliency.
Influence SRE practices to foster a strong DevOps culture within the organization.

Requirements

10+ years of Software Engineering experience or equivalent.
10+ years of experience in Production support/Site Reliability Engineering.
Hands-on expertise with automated testing and process automation.
Experience with distributed systems, algorithms, relational databases, and NoSQL databases.
Knowledge of caching tools (Redis, Memcached) and messaging tools (MQ, Kafka).
Working knowledge of monitoring and APM tools such as Splunk, GCL, ELK, Grafana, and Prometheus.
Experience with source control and CI/CD tools such as Git and Jenkins.
Ability to work with engineering teams across various functions.
Proficiency in shell scripting and DevOps tools like Ansible.

Nice-to-haves

Experience with distributed storage technologies like NFS.
Familiarity with dynamic resource management frameworks like PCF, Kubernetes, or OpenShift.
Knowledge of cloud platforms such as AWS or Azure.
Experience with API styles such as SOAP and REST, and with microservices architectures.

Benefits

Competitive salary
Health insurance
401k plan with matching contributions
Flexible working hours
Opportunities for professional development
Paid time off and holidays
