Team Lead, Site Reliability

Kubra - Tempe, AZ

posted 3 months ago

Full-time - Senior

Professional, Scientific, and Technical Services

About the position

Are you an experienced Site Reliability Engineer with a passion for enhancing platform stability, reliability, and efficiency? KUBRA is growing, and we're looking for a skilled Team Lead, Site Reliability, to guide our DevOps team in optimizing our customer experience management platforms. In this dynamic role, you will work collaboratively with cross-functional teams to apply SRE principles and drive continuous improvement. Your technical expertise will be pivotal in identifying potential issues, resolving complex problems, and leading technical and business discussions. You will leverage your experience in IT Service Delivery and Management to standardize operations, enhance service levels, and support technology system evolution. This is a hybrid opportunity in Tempe, AZ.

Responsibilities

Ensure that infrastructure and applications perform within established Service Level Agreements (SLA) and Service Level Objectives (SLO).
Maintain well-documented standards and best practices to ensure services are built for high availability and security.
Implement appropriate automation and observability to achieve low and continuously improving mean time to recovery (MTTR) for service-impacting incidents.
Document any incidents thoroughly, along with corresponding problem records and corrective actions.
Participate in the Architectural Review Process for new and existing services, ensuring compliance with high-availability, observability, security, and cost efficiency standards.
Enhance governance processes to ensure all platform components meet current standards.
Lead root cause analysis for major incidents, communicating with senior stakeholders, driving problem-solving, and debugging using best practice techniques.
Design and conduct fault injection experiments to identify potential weak points in high-availability architecture and work with engineering teams to remediate findings.
Collaborate with engineering teams to optimize infrastructure for security, resiliency, and cost targets based on collected feedback.
Document processes and maintain records related to infrastructure procedures and strategies, ensuring appropriate alerts and support procedures are in place for quick incident remediation.

Requirements

Bachelor's degree in Computer Science, Engineering, Information Technology, or equivalent experience.
5+ years of experience in site reliability engineering or a related field.
Proven leadership and team management experience.
Experience with systems programming languages, such as Go or Python, and shell scripting.
Proficient with Terraform and infrastructure as code principles.
Demonstrated proficiency in public cloud environments, particularly AWS.
Hands-on experience with Kubernetes management within AWS EKS.
Experience with CI/CD automation tools, such as CircleCI and ArgoCD.
Experience with monitoring and logging using tools like Prometheus, Grafana, OpenTelemetry, CloudWatch, and Honeycomb.

Nice-to-haves

AWS and Kubernetes Certifications (Solutions Architect, SysOps Administrator, DevOps Engineer, CKA, CKS, CKAD, KCNA) are desirable.

Benefits

Tuition reimbursement
Flexible schedule
Paid day off for your birthday
Access to LinkedIn Learning courses
Bi-annual performance-based bonus
Continued education with our education reimbursement program
Free unlimited access to our refreshment stations (fully stocked with tea, coffee and other beverages)
Two paid days for volunteer opportunities
A free premium membership for Headspace, an app geared towards mental health and wellbeing
Access to Perkopolis retail discounts
Generous benefit coverage with low premiums (+ a Health Care Spending Account)
RRSP Matching
