Site Reliability Engineer III - Remote

Pointright - Pittsburgh, PA

posted 5 months ago

Full-time - Mid Level

Remote - Pittsburgh, PA

Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

As a Site Reliability Engineer III at Net Health, you will play a crucial role in managing the performance, stability, and redundancy of our Platform systems and infrastructure. This position is designed for individuals who are proactive and relentless in their pursuit of identifying and implementing infrastructure solutions that ensure high degrees of observability, availability, and reliability. You will be part of a collaborative team that is responsible for remediating system instability and slowness through effective monitoring, fault tolerance, tooling, capacity management, and automation. Your partnership with development teams will be critical in ensuring that Net Health Platforms are performant, scalable, fault tolerant, and compliant with HIPAA regulations.

In this role, you will lead emergency response efforts in conjunction with Engineering, Infrastructure, and Database teams to establish root causes of incidents. You will also be responsible for building robust monitoring solutions and expanding our current monitoring and alerting capabilities. Your participation in the design of solutions will be essential in increasing the holistic stability of Net Health Platforms while identifying potential risks.

Conducting Blameless Postmortems and Anomaly Investigations after incidents will be part of your responsibilities, allowing you to analyze root causes and create permanent solutions to improve serviceability and prevent future outages. You will establish a culture of learning from past issues by promoting a Don't Repeat Incidents (DRI) approach, always looking to improve monitoring and dashboarding capabilities.

Collaborating with development teams and architecture will be necessary to ensure applications are performing efficiently and to resolve any application performance issues. Additionally, you will consult with management to analyze short- and long-range business requirements and recommend innovations. Championing automation efforts to reduce or eliminate repetitive, manual processes will be a key focus, as will partnering with project management to define Service Level Objectives (SLO) and implement Service Level Indicators (SLI) to track compliance. Finally, you will lead capacity management and disaster recovery testing efforts to ensure the resilience of our systems.

Responsibilities

Lead emergency response efforts in conjunction with Engineering, Infrastructure, and Database teams to establish root cause.
Lead efforts to build robust monitoring solutions while expanding our current monitoring and alerting footprint.
Participate in the design of solutions that increase the holistic stability of Net Health Platforms and identify potential risks.
Conduct Blameless Postmortems and Anomaly Investigations after incidents to further analyze root cause and create permanent solutions to improve serviceability and prevent future outages.
Establish a Don't Repeat Incidents (DRI) culture by learning from past issues and always looking to improve monitoring and dashboarding capabilities.
Ensure applications perform efficiently by collaborating with development teams and architecture to resolve application performance issues.
Consult with management in the analysis of short- and long-range business requirements and recommend innovations.
Champion automation efforts to reduce or eliminate repetitive, manual processes.
Partner with project management to define Service Level Objectives (SLOs) and to identify and implement Service Level Indicators (SLIs) to track compliance.
Champion capacity management and disaster recovery testing efforts.

Requirements

Bachelor's degree in computer science, or equivalent 6+ years' progressive experience in IT Operations and/or systems management.
6+ years' direct experience in a technical role dealing with complex enterprise software landscapes (DevOps-focused development).
6+ years' experience with scripting and automating technical activities.
Experience with best-in-class application monitoring (APM) tooling (New Relic, Dynatrace, AppDynamics).
Direct, hands-on experience with automated software and system management.
Strong knowledge of change control best practices and methodologies.

Nice-to-haves

Experience with Ansible, Terraform, Python, or Docker (or similar) is a plus.
Experience with Agile development methodology and/or ITIL ITSM is a plus.

Benefits

Unlimited PTO
Comprehensive Benefits Package
Employee Resource Groups
Casual Dress Code
Prioritized Employee Wellness
Diversity and Inclusion Programs
Career Development Opportunities
Educational Assistance
Employee Referral Bonus
Progressive Parental Leave
