Principal Site Reliability Engineer (Advanced Threat Prevention Infrastructure)

$147,000 - $225,000/Yr

Palo Alto Networks - Santa Clara, CA

posted 2 months ago

Full-time - Principal

Professional, Scientific, and Technical Services

About the position

The Principal Site Reliability Engineer at Palo Alto Networks will enhance the Advanced Threat Prevention (ATP) Infrastructure team by developing mission-critical platforms, tools, and processes to ensure high availability and reliability of applications. This role requires innovative problem-solving skills and a deep understanding of cloud infrastructure, particularly within Google Cloud Platform (GCP). The engineer will collaborate with developers to improve service usability and automate infrastructure management.

Responsibilities

Write automation code for provisioning and operating infrastructure at massive scale
Design, build, and operate cloud infrastructure for reliable and rapid deployment of microservices
Work with development teams to ensure applications are production ready, scalable, and reliable
Identify and drive opportunities to improve automation for code deployment and management
Develop tools and frameworks to automate operational tasks and deployment of services
Establish end-to-end monitoring and alerting on critical application components
Participate in on-call rotation supporting the platform and production applications
Direct root cause analysis of critical business and production issues
Develop and mentor other SREs on standard methodologies
Represent SRE in design reviews and collaborate with Engineering teams on operational readiness

Requirements

BS or MS in Computer Science or related field, or equivalent professional/military experience
Expertise in infrastructure-as-code and configuration management tools such as Terraform, Ansible, and Helm
Strong experience with Kubernetes
Strong Linux administration and network troubleshooting skills
Expertise in Google Cloud Platform (GCP) and resource management
Proficiency in programming and scripting languages such as Python and shell
Strong experience with CI/CD pipelines, GitHub, Jenkins, and Artifactory
Experience with metrics and monitoring tools such as Grafana and Prometheus
Ability to diagnose and troubleshoot complex distributed systems
Strong fundamentals in API gateways and proxies such as Nginx or Envoy
Experience with cloud infrastructure performance and cost optimizations
Excellent interpersonal skills and teamwork ability
Passionate about learning new technology stacks

Nice-to-haves

Experience with AWS
Experience in building and managing large relational database clusters (MySQL/Percona)

Benefits

FLEXBenefits wellbeing spending account
Mental and financial health resources
Personalized learning opportunities
Restricted stock units
Bonus opportunities
