Zscaler - Boston, MA

posted 16 days ago

Full-time - Mid Level
Remote - Boston, MA
Professional, Scientific, and Technical Services

About the position

The Staff Site Reliability Engineer - Incident Response at Zscaler is responsible for leading the transformation of the Site Reliability Engineering (SRE) organization, ensuring high service reliability and operational efficiency. This role involves coordinating critical incident responses, promoting a customer-focused approach, and developing scalable processes and observability strategies. The position requires collaboration with product teams to analyze failures and improve service reliability, work that is essential to maintaining Zscaler's reputation as a leader in cloud security.

Responsibilities

  • Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.
  • Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.
  • Promote a customer-focused approach by addressing and mitigating issues in customer environments worldwide, and foster a culture of continuous learning and technical excellence within the SRE team.
  • Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability.
  • Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.

Requirements

  • 5+ years of experience as a Site Reliability Engineer, with relevant experience in an Operations or Engineering environment.
  • Hands-on experience troubleshooting Linux-based systems.
  • Networking knowledge, including the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues.
  • Coding experience (preferably Python) building tools, scripting, or automation.
  • Bachelor's degree in Computer Science, a related technical field involving computer systems engineering, or equivalent practical experience.

Nice-to-haves

  • Experience supporting High/Moderate FedRAMP environments.
  • Understanding of observability practices and tools such as Grafana, DataDog, and Splunk.
  • Experience leading major incidents in large-scale, high-uptime environments.

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks, and more!