Palo Alto Networks - Santa Clara, CA

posted 22 days ago

Full-time - Principal
Onsite - Santa Clara, CA
Professional, Scientific, and Technical Services

About the position

The Principal Site Reliability Engineer at Palo Alto Networks will play a crucial role in enhancing the reliability and scalability of the company's systems, particularly within the CDSS Advanced URL Filtering team. This position involves optimizing infrastructure costs, defining service-level objectives, and collaborating with cross-functional teams to ensure high availability of applications. The role requires a strong focus on automation, monitoring, and continuous improvement to support the company's mission of protecting the digital way of life.

Responsibilities

  • Optimize infrastructure costs by monitoring resource utilization, rightsizing instances, and reducing waste to improve cost-efficiency.
  • Define and manage service-level objectives (SLOs) and related metrics to ensure service reliability and alignment with business goals.
  • Design and maintain secure cloud infrastructure that prioritizes reliability, scalability, and efficiency.
  • Develop expertise in new technologies to enhance infrastructure and operations.
  • Collaborate with cross-functional teams to ensure applications are production-ready and highly available.
  • Automate deployments, monitoring, and alerting to streamline operations and improve reliability.
  • Diagnose and resolve critical issues, driving optimization and continuous improvement.
  • Participate in on-call rotations to support seamless service operations.
  • Contribute to design reviews to enhance system performance and scalability.

Requirements

  • Expertise in provisioning and managing cloud infrastructure on public or private cloud platforms (GCP, AWS, or Azure preferred).
  • Strong proficiency in tools like Kubernetes, Terraform, and Ansible.
  • Proficiency in managing and optimizing SQL and NoSQL databases, including operational tasks such as provisioning, scaling, monitoring, backups, and troubleshooting.
  • Deep understanding of distributed systems, high-availability architecture, and strategies for scaling and optimizing system performance.
  • Proven experience defining and managing SLAs, SLOs, and SLIs to ensure service reliability and business alignment.
  • Expertise in monitoring and optimizing cloud infrastructure costs, including allocating resources and implementing cost-efficient practices.
  • Hands-on experience with Envoy or similar load balancing technologies, along with strong Linux system administration and advanced network troubleshooting skills.
  • Advanced skills in programming and automation using Python, Golang, or shell scripting.
  • Proven experience managing production deployments, ensuring system stability, and enforcing DevOps best practices.
  • Familiarity with CI/CD pipelines (GitLab CI preferred) and expertise in designing robust monitoring and alerting systems.
  • Exceptional ability to work with cross-functional teams, communicate effectively, and provide technical leadership.
  • BS/MS in Computer Science, Computer Engineering, or a related field, with 8+ years of hands-on industry experience in Site Reliability Engineering or a similar role.

Nice-to-haves

  • Experience with platforms like BigQuery, MongoDB, Cloud SQL, Firestore, Bigtable, and MySQL.
  • Strong communication skills and a drive to make a meaningful impact.

Benefits

  • FLEXBenefits wellbeing spending account with over 1,000 eligible items.
  • Mental and financial health resources.
  • Personalized learning opportunities.