Site Reliability Engineer, Security Service Edge

$117,000 - $153,000/Yr

Check Point Software Technologies - Seattle, WA

posted 3 months ago

Full-time - Mid Level

Professional, Scientific, and Technical Services

About the position

As a Site Reliability Engineer (SRE) at Check Point Software Technologies, you will play a crucial role in ensuring the reliability and performance of our security services. This position involves investigating complex production issues, enhancing system resilience, and expanding our monitoring coverage. You will collaborate with customer-facing teams to support technical investigations and implementations, while also automating processes to reduce workload and improve system stability. Your expertise will contribute to maintaining a high level of service for our customers, ensuring that we meet their real-time needs in the ever-evolving landscape of cyber security.

In this role, you will lead investigations into cross-functional production issues, working closely with other experts to identify root causes and implement effective solutions. You will be responsible for maintaining 100% monitoring coverage, developing strategies that focus on alerting for symptoms rather than just outages. Your efforts will help reduce workload and improve uptime, as well as enhance SLA response times through automation of production issue management.

Additionally, you will act as the R&D extension in North America, providing support for critical production issues during business hours. Your advanced troubleshooting skills will be essential in resolving complex network problems and recurring platform issues. You will also support Account Managers and the Customer Success team with complex and strategic implementations, ensuring that our infrastructure can grow and adapt to meet customer demands.

Responsibilities

Lead investigations of complex cross-functional production issues in collaboration with other group experts
Maintain 100% monitoring coverage, including building a monitoring strategy that alerts on symptoms rather than only on outages
Reduce workload and improve uptime and SLA response time by implementing automation processes for production issues
Act as the R&D extension in North America, supporting critical production issues during North American business hours
Perform advanced troubleshooting of complex network problems and recurring platform issues
Support Account Managers and the Customer Success team with complex and strategic implementations
Design, build, and maintain core infrastructure that enables growth

Requirements

Strong experience with AWS
Strong experience with observability and monitoring systems (Datadog, Prometheus, Grafana, etc.), including designing and building advanced monitoring
Working experience in large-scale network and system engineering environments (ISP, Cloud Providers)
Experience with Linux system administration
Experience with networking technologies and protocols (TCP/IP, LAN, NAT, BGP, VPN, DNS, iSCSI)
Experience with Configuration Management and IaC tools (Ansible, Terraform)
Experience with coding complex automation and runbooks
Good familiarity with virtualization environments (Proxmox, OpenStack)
Scripting experience with Bash, Python, or similar
Proficiency with virtualized and containerized environments (ECS / Kubernetes)
Proven network debugging and problem-solving skills
Must be eligible to work in the United States without sponsorship now or in the future

Nice-to-haves

Experience with HashiCorp tools (Consul, Vault, Nomad)

Benefits

Healthcare benefits
401(k) plan and company match
Short-term and long-term disability coverage
Basic life insurance
Stock awards
Employee stock purchase plan
