Kr Elixir - Alpharetta, GA
posted 3 months ago
The Site Reliability Engineer (SRE) Lead is a critical role within the engineering team, focused on developing and delivering a comprehensive data operations management solution for the client's Data Fabric Platform. The role requires an experienced engineer who can operate independently with limited guidance and oversight and who has a passion for end-user experience. The SRE will be involved in the entire Software Development Life Cycle (SDLC), including coding, scaling, ensuring production stability, and responding to on-call incidents.
The SRE will contribute to development by participating in the design, development, testing, deployment, and operation of both frontend and backend systems. Collaboration with global teams is essential to integrate with existing internal systems and the Google Cloud Platform. The SRE will also triage and resolve product or system issues, ensure quality and performance, and write technical documentation, support guides, and runbooks. The role follows Agile practices, including participation in sprint planning, retrospectives, and other Agile activities.
The SRE will ensure that software meets secure development guidelines and engineering standards, applying coding, automation, and software engineering principles to achieve scalability, performance, and reliability. This includes building infrastructure as code (IaC) patterns with technologies such as Terraform, scripting with the cloud CLI, and programming with the cloud SDK. The role also involves building CI/CD pipelines for application and cloud architecture patterns, automating tooling for service requests, and managing change in compliance with security policies.
Incident management is a key responsibility: the SRE will troubleshoot and triage issues across complex distributed architecture service maps and lead root cause analysis and blameless postmortems. The SRE will focus on customer needs, ensure service disruptions are addressed, own the reliability roadmap, participate in Production Readiness Reviews, and ensure disaster recovery plans are in place.