Zscaler - Boston, MA

posted 3 months ago

Full-time - Mid Level
Remote - Boston, MA
Professional, Scientific, and Technical Services

About the position

As a Staff Site Reliability Engineer - Technical Duty Officer at Zscaler, you will play a pivotal role in leading the transformation of our Site Reliability Engineering (SRE) organization. This position is designed for an experienced professional who is passionate about promoting SRE principles within the Engineering Department. You will be responsible for providing expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution of issues. Your focus will be on fostering a customer-centric approach by addressing and mitigating global customer environment issues, while also promoting a culture of continuous learning and technical excellence within the SRE team.

In this role, you will develop and implement scalable process frameworks and observability strategies that ensure rapid problem diagnosis, response, and service reliability. Collaboration with product teams will be essential as you analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency. Your expertise will be crucial in guiding the team through complex technical challenges and ensuring that our services remain robust and reliable for our customers.

This position requires a strong background in Site Reliability Engineering, with a minimum of 5 years of experience in an Operations or Engineering environment. You will need to have hands-on experience troubleshooting Linux-based systems, as well as a solid understanding of networking concepts and protocols. Coding experience, particularly in Python, will be beneficial as you build tools, scripts, and automation processes to enhance our operational capabilities.

Responsibilities

  • Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.
  • Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.
  • Promote a customer-focused approach by addressing and mitigating global customer environment issues.
  • Foster a culture of continuous learning and technical excellence within the SRE team.
  • Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability.
  • Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.

Requirements

  • 5+ years of experience as a Site Reliability Engineer, with relevant experience in an Operations or Engineering environment.
  • Hands-on experience troubleshooting Linux-based systems.
  • Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues.
  • Coding experience (preferably Python) building tools, scripting, or automation.
  • Bachelor's degree in Computer Science, a related technical field involving computer systems engineering, or equivalent practical experience.

Nice-to-haves

  • Experience supporting High/Moderate FedRAMP environments.
  • Understanding of observability practices and tools (Grafana, Datadog, Splunk, etc.).
  • Experience leading major incidents in large-scale, high-uptime environments.

Benefits

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks and more