Toast - Boston, MA
posted 2 months ago
At Toast, Site Reliability Engineers (SREs) play a crucial role in keeping our customer-facing services and production systems running smoothly. The position blends operational expertise with software engineering skills: SREs apply sound engineering principles and mature automation practices to our environments and codebase.

The SRE team implements and evolves a world-class observability stack that supports rapid detection of issues and thorough root cause analysis. This includes scalable metrics and dashboarding, distributed tracing, and log aggregation, all built on best-in-class technology. SREs also provide a global view of the customer experience through Real-User Monitoring and external cloud-based solutions.

SREs act as champions for reliability, collaborating with partner teams across business lines to improve the resiliency and reliability of all services. They facilitate production triage, incident resolution, and retrospective analysis to maintain the reliability and uptime of our platform. They also implement strategies to increase system reliability and performance through on-call rotations and process optimizations, and they lead incident post-mortems to identify reliability improvements.

A strong understanding of Cloud Architecture is essential, as is experience developing and operating software on the JVM, which SREs draw on to diagnose performance bottlenecks and implement optimizations across infrastructure, databases, and applications.

SREs also support the adoption of platforms for service resilience testing and chaos engineering, validating that Toast's architecture is resilient to failure. They build and own a performance testing framework that helps R&D teams understand service constraints and improve performance.