Senior Director - Site Reliability Engineering

Visa • Posted 5 days ago

$192,700 - $301,150/Yr

Full-time • Senior

Foster City, CA

About the position

As the Senior Director of Site Reliability Engineering (SRE), you will lead a team of SREs to ensure the highest level of performance and reliability of our services. You will be responsible for the end-to-end availability and performance of mission-critical services and building automation to prevent problem recurrence. The role requires a strategic leader who can create a vision for the SRE function and drive a culture of ‘automation first’ to improve the scalability and stability of our systems.

Responsibilities

Lead and scale the SRE team, setting objectives and key results that align with the company’s strategic goals.
Develop and implement SRE policies, standards, and best practices for enterprise-wide systems.
Define standards for building reliable applications that are highly available and resilient.
Drive the adoption of a DevSecOps culture, fostering collaboration between development, security, and operations teams.
Oversee the design and implementation of solutions for system monitoring, logging, alerting, and incident response.
Collaborate with product development teams to ensure reliability and scalability are considered at the design phase.
Manage on-call rotations, incident management processes, and post-mortem analyses to ensure continuous improvement.
Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for all critical services.
Work closely with the security team to ensure compliance with industry standards and regulatory requirements.
Lead initiatives to improve CI/CD pipelines and automate infrastructure provisioning and deployment.
Provide technical leadership and mentorship to team members, encouraging professional growth and technical excellence.

Requirements

12 or more years of work experience with a Bachelor’s Degree, at least 10 years of work experience with an Advanced Degree (e.g., Master’s, MBA, JD, or MD), or a minimum of 5 years of work experience with a PhD.
Minimum of 10 years in a site reliability engineering role with at least 5 years in a leadership position managing large SRE teams.
Proficiency in system design and architecture, particularly in a cloud environment.
Expertise in automation and orchestration systems like Kubernetes, Terraform, and Ansible.
Strong coding skills in languages such as Go, Python, Ruby, or Java.
Deep understanding of networking concepts and protocols.
Experience with continuous integration and continuous deployment (CI/CD) pipelines and tools.
Proven track record of leading teams through complex system outages and scalability challenges.
Ability to mentor and grow an SRE team, fostering a culture of continuous learning and innovation.
Strong project management skills, with experience in Agile methodologies.
Excellent verbal and written communication abilities.
Proficiency in creating technical documentation and system diagrams.
Experience presenting to C-level executives and stakeholders.
Demonstrated experience in incident management and post-mortem analysis.
Commitment to high availability, fault tolerance, and reliability in all aspects of work.
Knowledge of compliance and security best practices in a highly regulated industry.

Nice-to-haves

15 or more years of experience with a Bachelor’s Degree, 12 or more years of experience with an Advanced Degree (e.g., Master’s, MBA, JD, or MD), or a PhD with 9 or more years of experience in Computer Science, Engineering, or a related technical field.
Certifications in cloud technologies (AWS, GCP, Azure).
Contributions to open-source projects or public speaking at relevant tech conferences.
Strategic thinker with a vision for the future of SRE within the organization.
Resilient and adaptable in the face of changing technology landscapes.
Collaborative mindset with a focus on cross-functional partnerships.