Lead, Site Reliability Engineer

Oak Street Health - Chicago, IL

posted 4 months ago

Full-time - Mid Level

Remote - Chicago, IL

101-250 employees

Ambulatory Health Care Services

About the position

As a Lead Site Reliability Engineer (SRE) at Oak Street Health, you will lead the design, implementation, and maintenance of highly available and scalable systems and applications. The role requires a deep understanding of Site Reliability Engineering principles and practices, which you will draw on to define and drive best practices, standards, and processes within the team. You will provide technical leadership and guidance to the SRE team, mentor junior team members, foster a culture of continuous learning and improvement, and collaborate with cross-functional teams to ensure the reliability and performance of our infrastructure.

You will architect and implement monitoring and observability solutions using tools like Grafana to support proactive detection and resolution of issues. You will also oversee Azure infrastructure management, including resource provisioning, configuration, and optimization, and develop and execute comprehensive performance and load testing strategies to identify and address bottlenecks.

Automation will be a key focus: you will develop and maintain scripts and tools using PowerShell, JavaScript, and other scripting languages. You will also lead incident response during outages or incidents, ensuring timely resolution and minimizing downtime, and document architectures, processes, and procedures to facilitate knowledge sharing and ensure system reliability.

Responsibilities

Lead the design, implementation, and maintenance of scalable and reliable systems and applications.
Provide technical leadership and guidance to the SRE team, including mentoring junior engineers and fostering a culture of continuous learning and improvement.
Collaborate with development, operations, and other cross-functional teams to define and implement SRE best practices, standards, and processes.
Architect and implement monitoring and observability solutions using Grafana and other tools to ensure proactive detection and resolution of issues.
Oversee Azure infrastructure management, including resource provisioning, configuration, and optimization.
Develop and execute comprehensive performance and load testing strategies to identify and address bottlenecks and optimize system performance.
Drive automation efforts by developing and maintaining scripts and tools using PowerShell, JavaScript, and other scripting languages.
Implement systems integration solutions to enable seamless communication and interoperability between different systems and services.
Lead incident response efforts during outages or incidents, ensuring timely resolution and minimizing downtime.
Document architectures, processes, and procedures to facilitate knowledge sharing and ensure system reliability.

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
5+ years of experience working in a Site Reliability Engineering or similar role.
Extensive experience with observability tools such as Grafana and with administration of cloud platforms such as Azure.
Strong expertise in performance and load testing methodologies and tools.
Proficiency in scripting languages such as PowerShell and JavaScript.
Demonstrated leadership skills with the ability to lead and mentor a team effectively.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams.
Proven track record of designing and implementing scalable and reliable systems.
Ability to thrive in a fast-paced environment and effectively prioritize multiple tasks.
US Work Authorization.

Nice-to-haves

Master's degree in Computer Science, Engineering, or a related field.
Certification in Azure administration or related fields.
Experience with containerization technologies such as Docker and Kubernetes.
Familiarity with CI/CD pipelines and DevOps principles.
Deep understanding of networking concepts and protocols.

Benefits

Paid vacation, sick time, and a 401(k) retirement plan with match options.
Health insurance, vision, and dental benefits.
