Sr. Site Reliability Engineer

$152,700 - $190,600/Yr

People Connect - Bellevue, WA

posted 10 days ago

Full-time - Mid Level
Bellevue, WA
Administrative and Support Services

About the position

As a Senior Site Reliability Engineer at Classmates.com, you will be responsible for designing, implementing, and maintaining the infrastructure and systems necessary to support our applications and services. This role involves collaborating with cross-functional teams to drive operational excellence, automate processes, and continuously improve system reliability. Your expertise in cloud technologies, automation, and performance optimization will be crucial for the success of our engineering and operations efforts. The position requires a mix of persistence, innovative thinking, and strong interpersonal skills, all while fostering a fun work environment.

Responsibilities

  • Provide thought leadership, mentorship, and technical vision related to site reliability, DevOps, and a cloud-first culture.
  • Analyze and implement cloud services to meet business goals, focusing on cost optimization, efficiency, and scalability.
  • Drive orchestration efforts for cloud services, design self-service aspects, and stay updated with emerging cloud technologies.
  • Collaborate on designing, building, and maintaining scalable infrastructure across cloud and on-prem environments.
  • Automate provisioning and configuration using tools like Terraform, Terragrunt, and Puppet.
  • Develop automation scripts, maintain CI/CD pipelines, and plan for scalability and capacity, conducting load testing as needed.
  • Ensure system reliability, availability, and performance through monitoring, alerting, and incident response.
  • Implement and manage SLOs/SLIs to meet reliability standards.
  • Identify and address performance bottlenecks across the infrastructure and application stack.
  • Build and maintain observability solutions (e.g., monitoring, logging, and tracing) and improve system health dashboards.
  • Implement security measures for cloud-native applications and ensure compliance with industry standards (SOC 2, PCI, etc.).
  • Collaborate with security teams to audit and monitor systems, continuously updating security configurations and dashboards.
  • Participate in on-call rotations to provide 24/7 support for the production environment.
  • Lead incident response activities and perform root cause analysis to prevent recurring incidents.
  • Conduct and document post-incident retrospectives (postmortems) to drive continuous improvement.
  • Create and maintain runbooks and operational documentation for continuous improvement.
  • Proactively test system resilience through Chaos Engineering experiments and failure injection.
  • Design and test disaster recovery (DR) and business continuity strategies, ensuring backup and failover mechanisms are effective.
  • Monitor cloud usage and implement financial optimization practices (FinOps) to control infrastructure costs.
  • Collaborate across teams to ensure alignment and effective project implementation.
  • Communicate during incidents and changes, providing transparency to stakeholders.
  • Mentor and share knowledge with team members to foster a collaborative and continuous learning environment.
  • Maintain comprehensive documentation of system configurations, processes, and best practices.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.
  • Proficiency in AWS and containerization technologies like Kubernetes and Docker.
  • Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.
  • Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).
  • Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.
  • Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.
  • Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.
  • Strong communication and collaboration skills.

Nice-to-haves

  • Experience with Rundeck, Java, Spring Framework, Terragrunt, Puppet, Vector, Loki, VictoriaMetrics, and additional cloud platforms (e.g., Google Cloud Platform, Azure).
  • Relevant certifications such as AWS Solutions Architect or Certified Kubernetes Administrator (CKA).

Benefits

  • Competitive salary range from $152,700 to $190,600 based on experience and qualifications.
  • Hybrid work model with 2-3 days in the Bellevue, WA office.