Palo Alto Networks - Santa Clara, CA

posted 22 days ago

Full-time - Mid Level
Onsite - Santa Clara, CA
Professional, Scientific, and Technical Services

About the position

As a Site Reliability Engineer on the CDSS Advanced URL Filtering team at Palo Alto Networks, you will be instrumental in enhancing the reliability and scalability of our systems. This role involves working with cutting-edge technologies to tackle complex challenges and contribute to innovative solutions that protect our customers. You will optimize infrastructure costs, define service-level objectives, and collaborate with cross-functional teams to ensure high availability of applications.

Responsibilities

  • Optimize infrastructure costs by monitoring resource utilization, rightsizing instances, and reducing waste to improve cost-efficiency.
  • Define and manage service-level objectives (SLOs) and related metrics to ensure service reliability and align with business goals.
  • Design and maintain secure cloud infrastructure that prioritizes reliability, scalability, and efficiency.
  • Develop expertise in new technologies to enhance infrastructure and operations.
  • Collaborate with cross-functional teams to ensure applications are production-ready and highly available.
  • Automate deployments, monitoring, and alerting to streamline operations and improve reliability.
  • Diagnose and resolve critical issues, driving optimization and continuous improvement.
  • Participate in on-call rotations to support seamless service operations.
  • Contribute to design reviews to enhance system performance and scalability.

Requirements

  • Creative thinker and collaborative team player with strong communication skills and a drive to make a meaningful impact.
  • Expertise in provisioning and managing cloud infrastructure on public or private cloud platforms (GCP, AWS, or Azure preferred), with strong proficiency in tools like Kubernetes, Terraform, and Ansible.
  • Proficiency in managing and optimizing SQL and NoSQL databases, including operational tasks such as provisioning, scaling, monitoring, backups, and troubleshooting.
  • Deep understanding of distributed systems, high-availability architecture, and strategies for scaling and optimizing system performance.
  • Proven experience defining and managing SLAs, SLOs, and SLIs to ensure service reliability and business alignment.
  • Expertise in monitoring and optimizing cloud infrastructure costs, including allocating resources and implementing efficient practices.
  • Hands-on experience with Envoy or similar load balancing technologies, along with strong Linux system administration and advanced network troubleshooting skills.
  • Advanced skills in programming and automation using Python, Golang, or shell scripting to streamline operations and enhance system reliability.
  • Proven experience managing production deployments, ensuring system stability, and enforcing DevOps best practices.
  • Familiarity with CI/CD pipelines (GitLab CI preferred) and expertise in designing robust monitoring and alerting systems.
  • Exceptional ability to work with cross-functional teams, communicate effectively, and provide technical leadership.
  • Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive.

Nice-to-haves

  • Experience with platforms like BigQuery, MongoDB, Cloud SQL, Firestore, Bigtable, and MySQL.
  • BS/MS in Computer Science, Computer Engineering, or a related field, with 8+ years of hands-on industry experience in Site Reliability Engineering or a similar role managing and improving complex systems at scale.

Benefits

  • FLEXBenefits wellbeing spending account with over 1,000 eligible items selected by employees.
  • Mental and financial health resources.
  • Personalized learning opportunities.