Visa • posted 5 days ago
$192,700 - $301,150/Yr
Full-time • Senior
Foster City, CA

About the position

As the Senior Director of Site Reliability Engineering (SRE), you will lead a team of SREs to ensure the highest level of performance and reliability of our services. You will be responsible for the end-to-end availability and performance of mission-critical services and building automation to prevent problem recurrence. The role requires a strategic leader who can create a vision for the SRE function and drive a culture of ‘automation first’ to improve the scalability and stability of our systems.

Responsibilities

  • Lead and scale the SRE team, setting objectives and key results that align with the company’s strategic goals.
  • Develop and implement SRE policies, standards, and best practices for enterprise-wide systems.
  • Define standards for building reliable applications that are highly available and resilient.
  • Drive the adoption of a DevSecOps culture, fostering collaboration between development and operations teams.
  • Oversee the design and implementation of solutions for system monitoring, logging, alerting, and incident response.
  • Collaborate with product development teams to ensure reliability and scalability are considered at the design phase.
  • Manage on-call rotations, incident management processes, and post-mortem analyses to ensure continuous improvement.
  • Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for all critical services.
  • Work closely with the security team to ensure compliance with industry standards and regulatory requirements.
  • Lead initiatives to improve CI/CD pipelines and automate infrastructure provisioning and deployment.
  • Provide technical leadership and mentorship to team members, encouraging professional growth and technical excellence.

Requirements

  • 12 or more years of work experience with a Bachelor’s Degree, at least 10 years of work experience with an Advanced Degree (e.g. Masters/MBA/JD/MD), or a minimum of 5 years of work experience with a PhD.
  • Minimum of 10 years in a site reliability engineering role with at least 5 years in a leadership position managing large SRE teams.
  • Proficiency in system design and architecture, particularly in a cloud environment.
  • Expertise in automation and orchestration systems like Kubernetes, Terraform, and Ansible.
  • Strong coding skills in languages such as Go, Python, Ruby, or Java.
  • Deep understanding of networking concepts and protocols.
  • Experience with continuous integration and continuous deployment (CI/CD) pipelines and tools.
  • Proven track record of leading teams through complex system outages and scalability challenges.
  • Ability to mentor and grow an SRE team, fostering a culture of continuous learning and innovation.
  • Strong project management skills, with experience in Agile methodologies.
  • Excellent verbal and written communication abilities.
  • Proficiency in creating technical documentation and system diagrams.
  • Experience presenting to C-level executives and stakeholders.
  • Demonstrated experience in incident management and post-mortem analysis.
  • Commitment to high availability, fault tolerance, and reliability in all aspects of work.
  • Knowledge of compliance and security best practices in a highly regulated industry.

Nice-to-haves

  • 15 or more years of experience with a Bachelor’s Degree, 12 years of experience with an Advanced Degree (e.g. Masters, MBA, JD, or MD), or a PhD with 9+ years of experience in Computer Science, Engineering, or a related technical field.
  • Certifications in cloud technologies (AWS, GCP, Azure).
  • Contributions to open-source projects or public speaking at relevant tech conferences.
  • Strategic thinker with a vision for the future of SRE within the organization.
  • Resilient and adaptable in the face of changing technology landscapes.
  • Collaborative mindset with a focus on cross-functional partnerships.

Benefits

  • Medical
  • Dental
  • Vision
  • 401(k)
  • FSA/HSA
  • Life Insurance
  • Paid Time Off
  • Wellness Program

Job Keywords

Hard Skills
  • Ansible
  • Go
  • Java
  • Kubernetes
  • Python