Diverse Lynx - New York, NY
posted 5 months ago
The Site Reliability Engineer (SRE) is responsible for the reliability, availability, and performance of our systems and services. The SRE owns operational automation and monitoring and acts as a Subject Matter Expert (SME) in identifying toil within existing systems and processes, whether in Software Development Life Cycle (SDLC) or IT operations environments. The primary goal is to implement automated solutions that significantly reduce that toil, so that our systems are not only reliable but also efficient and scalable.
The ideal candidate has strong cloud engineering experience, particularly with Google Cloud Platform (GCP), and can define Customer User Journeys (CUJ), Service Level Indicators (SLI), Service Level Objectives (SLO), and error budgets based on Non-Functional Requirements (NFR). A solid understanding of Infrastructure as Code (IaC) tools such as Terraform, along with version control using Git and GitHub, is essential. Hands-on experience with containerization and orchestration technologies, particularly Kubernetes, is expected, as is the ability to design and implement automated workflows that streamline operations. Proficiency in scripting languages such as Bash, PowerShell, and Python, and in automation tools such as Ansible, is also required.