Senior Site Reliability Engineer

$131,000/Yr

Toast - Boston, MA

posted 2 months ago

Full-time - Mid Level
Boston, MA
Food Services and Drinking Places

About the position

At Toast, our Site Reliability Engineers (SREs) play a crucial role in ensuring that our customer-facing services and production systems operate smoothly. This position requires a blend of operational expertise and software engineering skills, allowing SREs to apply sound engineering principles and mature automation practices to our environments and codebase.

The SRE team is responsible for implementing and evolving a world-class observability technology stack that facilitates rapid detection of issues and enables thorough root cause analysis. This includes providing scalable metrics and dashboarding solutions, distributed tracing capabilities, and log aggregation insights using best-in-class technology. Additionally, SREs are tasked with providing a global view of the customer experience through Real-User Monitoring and external cloud-based solutions.

SREs act as champions for reliability, collaborating with partner teams across various business lines to enhance the resiliency and reliability of all services. They facilitate production triage, incident resolution, and retrospective analysis to maintain the reliability and uptime of our platform. A strong understanding of Cloud Architecture is essential, as is experience in developing and operating software on the JVM to diagnose performance bottlenecks and implement optimizations across infrastructure, databases, and applications. SREs will also implement strategies to increase system reliability and performance through on-call rotations and process optimizations, leading incident post-mortems to identify reliability improvements.

Furthermore, SREs support the adoption of platforms that enable service resilience testing and chaos engineering, validating that Toast's architecture is resilient to failure. They will build and own a performance testing framework to help R&D teams understand service constraints and improve performance.

Responsibilities

  • Implement and evolve a world-class observability technology stack that allows rapid detection of issues in our system and enables root cause analysis.
  • Provide scalable metrics and dashboarding solutions for R&D.
  • Provide distributed tracing capabilities to visualize and track issues across our complex system.
  • Provide log aggregation and insights for R&D using best-in-class technology.
  • Provide a global view of the true customer experience through Real-User Monitoring and external cloud-based solutions.
  • Act as a champion for reliability and work with partner teams in different lines of business to improve resiliency and reliability of all services.
  • Facilitate and drive production triage, incident resolution, and retrospective/root cause analysis to maintain the reliability and uptime of our platform.
  • Leverage a strong understanding of Cloud Architecture.
  • Apply experience developing and operating software on the JVM to triage and understand issues within services.
  • Diagnose performance bottlenecks and implement optimizations across infrastructure, database, web, and mobile applications.
  • Implement strategies to increase system reliability and performance through on-call rotation and process optimization.
  • Lead incident post-mortems/retrospectives to surface reliability improvements and drive them to completion.
  • Support and enable the adoption of a platform for service resilience testing/chaos engineering to validate that Toast's architecture is resilient to failure.
  • Build and own a performance testing framework/environment to enable our R&D teams to understand the constraints of their services and improve performance.

Requirements

  • Extensive and broad industry experience, including 3-7 years building and running production systems and participating in incident calls.
  • Deep understanding of cloud and microservice architecture, and the JVM.
  • Comfortable reading, writing, and debugging code.
  • Experience with observability platforms (Datadog, Splunk, New Relic, etc.), including APM, RUM, and synthetic monitoring.
  • Demonstrated experience working with at least one major cloud platform (AWS, GCP, or Azure).
  • Exposure to complex, mission-critical, and large-scale distributed systems.
  • Polyglot technologist/generalist with a thirst for learning.

Benefits

  • Competitive compensation and benefits programs
  • Cash compensation (overtime, bonus/commissions if eligible)
  • Equity options
  • Flexible lifestyle benefits