Site Reliability Engineer Lead - Alpharetta, GA - Locals, F2F

Kr Elixir - Alpharetta, GA

posted 3 months ago

Full-time - Mid Level

Alpharetta, GA

Administrative and Support Services

About the position

The Site Reliability Engineer Lead is a critical role within the engineering team, focused on developing and delivering a comprehensive data operations management solution for the client's Data Fabric Platform. The role requires an experienced engineer who can operate independently with limited guidance and oversight and who has a passion for the end-user experience.

The SRE is involved in the entire Software Development Life Cycle (SDLC), from coding and scaling through production stability and on-call incident response. This includes participating in the design, development, testing, deployment, and operation of both frontend and backend systems, and collaborating with global teams to integrate with existing internal systems and the Google Cloud Platform. The SRE also triages and resolves product or system issues, ensures quality and performance, and writes technical documentation, support guides, and runbooks. Agile practices are part of the role, including participation in sprint planning, retrospectives, and other agile activities.

The SRE ensures that software meets secure development guidelines and engineering standards, applying coding, automation, and software engineering principles to achieve scalability, performance, and reliability. This includes building infrastructure as code (IaC) patterns with technologies such as Terraform, scripting with the cloud CLI, and programming with the cloud SDK. The role also involves building CI/CD pipelines for application and cloud architecture patterns, automating tooling for service requests, and managing change in compliance with security policies.

Incident management is a key responsibility: the SRE solves problems and triages complex distributed architecture service maps, and leads root cause analysis and blameless postmortems. The SRE focuses on customer needs, ensures service disruptions are addressed, owns the reliability roadmap, participates in Production Readiness Reviews, and ensures disaster recovery plans are in place.

Responsibilities

Participate in SDLC activities including design, development, testing, deployment, and operation.
Collaborate with global teams to integrate with existing internal systems and Google Cloud Platform.
Triage and resolve product or system issues, ensuring quality and performance.
Write technical documentation, support guides, and runbooks.
Participate in sprint planning, retrospectives, and other agile activities.
Ensure software meets secure development guidelines and engineering standards.
Use coding, automation, and software engineering principles to ensure scalability, performance, and reliability.
Build infrastructure as code (IaC) patterns using technologies like Terraform and cloud CLI.
Build CI/CD pipelines for application and cloud architecture patterns using Jenkins and cloud-native toolchains.
Build automated tooling to deploy service requests and manage changes in production.
Work closely with the dev team to address DevSecOps issues in compliance with security policies.
Solve problems and triage complex distributed architecture service maps during incidents.
Lead root cause analysis and blameless postmortems to remediate recurrences.
Monitor the SRE golden signals and ensure SLOs, SLIs, and SLAs are honored within error budgets.
Own the reliability roadmap and participate in Production Readiness Reviews.

Requirements

5-7 years of experience in software engineering, systems administration, database administration, and networking.
System administration skills, including automation and orchestration of Linux/Windows using Terraform, Chef, Ansible, and/or containers (Docker, Kubernetes).
3+ years of experience in developing and supporting cloud-native applications.
3+ years of experience as an SRE supporting an end-user-facing application, including UI, APIs, and backend systems.
2+ years of general proficiency with Java or JavaScript/Node.js.
Experience with Angular, JavaScript, TypeScript, or modern web application development frameworks.
Understanding of modular systems, performance, scalability, and security.
Agile development mindset and experience.
Knowledge of RESTful web services, JSON, and Avro.
Strong debugging, performance tuning, and production support skills.
Strong written and verbal communication skills.
Experience with CI/CD concepts and tools including Jenkins/Bamboo, and release management concepts.
Understanding of Google Cloud Platform services related to big data like BigQuery, Dataflow, Pub/Sub, GCS, Composer/Airflow, or similar solutions in AWS.
