Principal Site Reliability Engineer (Cortex Data Lake)

Palo Alto Networks • posted about 1 month ago

$146,000 - $230,000/Yr

Full-time • Senior

Santa Clara, CA

About the position

Palo Alto Networks runs a large infrastructure and is one of the largest GCP customers. As a Principal Site Reliability Engineer for the CDL/SLS team, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps, Prometheus, Grafana, Loki, Docker, GCP, Vault, Kafka, MySQL, Python, Bash, and Go.

Responsibilities

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the robust deployment of services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

Requirements

6+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
4+ years building high availability, scalable cloud-native applications on AWS and GCP
BS or MS in Computer Science or a related field, or equivalent professional or military experience, required
Expertise in configuration management with a framework such as Ansible, Terraform, or Helm
Passion for infrastructure and monitoring as code
Solid experience in container workloads and Kubernetes
Familiarity with PKI and networking concepts
In-depth knowledge of different security controls (App-ID, User-ID, security profile, URL category, content, SSL decryption, firewall MFA, etc.)
Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks
Proficiency with CI/CD pipelines, including ArgoCD and GitLab CI/CD
Ability to diagnose and troubleshoot complex distributed systems handling high volume transactions
Experience with managing Kafka is a plus
Excellent written and verbal communication, able to collaborate and rally support
Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Ready to understand and dissect new technology stacks quickly

Benefits

FLEXBenefits wellbeing spending account with over 1,000 eligible items selected by employees
Mental and financial health resources
Personalized learning opportunities