Site Reliability Engineer II

Zip Co Limited
Posted 19 days ago

$120,000 - $140,000/Yr

Mid Level

About the position

Join Zip's Infrastructure Engineering team and play a key role in building the reliability and scalability of our cloud-native platform, which serves millions of customers and processes billions in payments. We value pragmatic problem solvers who use code, systems design, and operational best practices to improve both developer experience and product uptime.

As a Site Reliability Engineer at Zip, you'll help ensure the reliability, performance, and scalability of our Azure-based infrastructure. You'll collaborate closely with software engineers to embed SRE best practices across the development lifecycle, define and track SLIs/SLOs, and maintain robust observability systems. This hands-on role involves building automated deployment pipelines using tools like Azure DevOps and Terraform, supporting a Kubernetes-based platform, and contributing to incident response and recovery efforts.

We're looking for someone with strong experience in cloud infrastructure, container orchestration, and infrastructure as code. In return, you'll join a fast-paced, supportive environment where you're trusted to drive impact, grow your skills, and be yourself.

Responsibilities

Ensure service reliability, availability, and performance in a growing Azure-based infrastructure
Collaborate with software engineers to integrate SRE practices across the development lifecycle
Define and track SLIs/SLOs and contribute to reliability goals using metrics and monitoring tools
Build and maintain automated deployment pipelines using Azure DevOps, Terraform, Env0, and Atlantis
Support a Kubernetes-based platform including service mesh technologies
Help design self-healing infrastructure and automated recovery using metrics and health checks
Participate in on-call rotations and contribute to incident response and post-incident reviews
Continuously improve observability, monitoring, and alerting systems

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles
2+ years of hands-on experience with Kubernetes or similar container orchestration platforms
Strong experience working with Azure services and cloud infrastructure
Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or similar
Understanding of CI/CD pipelines and experience with tools like Azure DevOps
Solid foundation in networking concepts, load balancing, and service communication
Experience with observability tools (e.g., Prometheus, Grafana, Azure Monitor) is a plus
Willingness to participate in an on-call rotation and respond to incidents