Staff Site Reliability Engineer

$109,300 - $218,500/Yr

Abbott Laboratories - Pleasanton, CA

posted 4 months ago

Full-time - Senior

10,001+ employees

Miscellaneous Manufacturing

About the position

As a Staff Site Reliability Engineer at Abbott, you will be a senior member of the Site Reliability Engineering (SRE) team, playing a pivotal role in establishing and executing a site reliability strategy specifically for the Heart Failure Division Medical Device Mobile and Cloud Digital Software portfolio, which includes both Class II and Class III devices. Your primary responsibility will be to partner with and influence our Architecture and Engineering teams to deliver highly resilient software solutions that meet the needs of our customers. In this role, you will implement SRE improvement processes and procedures, driving change within the organization.

A strong software engineering background in a highly secured environment is essential, along with experience in DevOps, formal test automation, load testing, or SRE practices. You will leverage your extensive technical knowledge in the development, delivery, and implementation of complex and critical software systems. Your expertise in SRE principles, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), Error Budgets, Toil, Observability, and Release Engineering, will be critical to your success. You will be expected to develop, communicate, and execute a vision that fosters the adoption of practices and tooling, thereby strengthening Abbott's position as a leader in the Heart Failure business.

Your responsibilities will include developing a culture of SRE within our software development and operational practices, implementing a comprehensive SRE strategy in collaboration with the HF Digital team, and identifying critical KPIs and metrics to execute on the SRE roadmap. You will assist software engineering teams and business stakeholders in establishing and evolving reliability goals, automating manual processes, and managing continuous execution of tests. Additionally, you will work closely with various teams to resolve critical issues, evaluate service tiers, and participate in blameless postmortems to enhance future incident responses. Your role will also involve building robust CI/CD pipelines and partnering with customer support for rapid issue resolution.

Responsibilities

Develop and enable a culture of SRE in our software development, delivery and operational practices.
Implement a comprehensive SRE strategy and roadmap in partnership with the HF Digital team.
Identify critical KPIs and metrics and execute on the SRE roadmap.
Help software engineering teams and business stakeholders establish and evolve reliability goals and measure progress against those goals using SLIs/SLOs.
Automate manual processes via scripting and/or tools. Create automated regression and sustaining test suites and manage continuous execution of the tests.
Work closely with the Development, Network, Infrastructure, and Requirements teams on critical issues: evaluate the problem, propose and implement resolutions, and partner with other teams as needed.
Evaluate our applications' current service tiers, reliability standards, and practices, and define steps to continuously improve them.
Participate in blameless postmortems on critical incidents and help teams use their learnings to better predict, detect and prevent future issues.
Build robust continuous integration and continuous delivery (CI/CD) pipelines using on-premises and cloud tools.
Partner with the customer support team for rapid resolution of issues.
Evaluate and monitor social signals for site reliability. Monitor app store comments and other social channels.
Build a practice of rapid detection and root cause determination while keeping stakeholders informed.

Requirements

Bachelor's Degree in Computer Science, Information Technology, or a relevant field, or an equivalent combination of education and work experience.
Master's Degree in a technical discipline is preferred.
Minimum of 12 years of experience in the field.
At least 10 years of experience developing large-scale digital software systems.
Exposure to cloud development and deployment technologies, including containerization, infrastructure as code, and multi-cloud configurations.
Deep understanding of DevOps and SRE best practices.
Hands-on experience with CI/CD tools such as Jenkins, Azure DevOps, Helm, Chef, or Terraform.
Hands-on experience with Kubernetes, container orchestration, Docker, and cloud-native applications.
Experience managing application configuration using configuration management tools like Ansible, Chef, or Azure App Configuration.
Hands-on experience with performance monitoring and load testing tools such as Grafana or JMeter.
Experience creating, maintaining, and managing Git source code repositories on platforms such as Bitbucket or GitHub.
Strong scripting skills in languages such as Shell, Python, Ruby, Golang, or Java.
Experience with Git, Jira, Confluence, and similar issue tracking and collaboration tools.

Benefits

Career development with an international company.
Free medical coverage for employees via the Health Investment Plan (HIP) PPO.
Excellent retirement savings plan with high employer contribution.
Tuition reimbursement and education benefits.
Recognition as a great place to work in various countries.
Support for diversity and inclusion in the workplace.
