Oak Street Health - Chicago, IL
posted 4 months ago
As a Lead Engineer - Site Reliability Engineer (SRE) at Oak Street Health, you will lead the design, implementation, and maintenance of highly available, scalable, and reliable systems and applications. The role requires a deep understanding of Site Reliability Engineering principles and practices, and you will draw on your extensive experience to drive best practices within the team.

You will provide technical leadership and guidance to the SRE team, ensuring that best practices, standards, and processes are defined and implemented effectively. You will also mentor junior team members, foster a culture of continuous learning and improvement, and collaborate with cross-functional teams to ensure the reliability and performance of our infrastructure.

You will architect and implement monitoring and observability solutions using tools like Grafana to support proactive detection and resolution of issues. You will oversee Azure infrastructure management, including resource provisioning, configuration, and optimization, and you will develop and execute comprehensive performance and load testing strategies to identify and address bottlenecks. Automation is a key focus: you will develop and maintain scripts and tools using PowerShell, JavaScript, and other scripting languages.

You will lead incident response during outages and other incidents, ensuring timely resolution and minimizing downtime. You will also document architectures, processes, and procedures to facilitate knowledge sharing and support system reliability.