Staff Site Reliability Engineer - Incident Response

$136,500/Yr

Zscaler - Boston, MA

posted 16 days ago

Full-time - Mid Level

Remote - Boston, MA

Professional, Scientific, and Technical Services

About the position

The Staff Site Reliability Engineer - Incident Response at Zscaler is responsible for leading the transformation of the Site Reliability Engineering (SRE) organization, ensuring high service reliability and operational efficiency. This role involves coordinating critical incident responses, promoting a customer-focused approach, and developing scalable processes and observability strategies. The position requires collaboration with product teams to analyze failures and improve service reliability, making it essential for maintaining Zscaler's reputation as a leader in cloud security.

Responsibilities

Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.
Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.
Promote a customer-focused approach by addressing and mitigating issues in global customer environments, and foster a culture of continuous learning and technical excellence within the SRE team.
Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability.
Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.

Requirements

5+ years of experience as a Site Reliability Engineer, with relevant experience in an Operations or Engineering environment.
Hands-on experience troubleshooting Linux-based systems.
Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues.
Coding experience (preferably Python) building tools, scripting, or automation.
Bachelor's degree in Computer Science, a related technical field involving computer systems engineering, or equivalent practical experience.

Nice-to-haves

Experience supporting High/Moderate FedRAMP environments.
Understanding of observability practices and tools such as Grafana, Datadog, and Splunk.
Experience leading major incidents in large-scale, high-uptime environments.

Benefits

Various health plans
Time off plans for vacation and sick time
Parental leave options
Retirement options
Education reimbursement
In-office perks, and more!
