Sr. SRE (Site Reliability Engineer)

E-Solutions Group - Seattle, WA

posted 2 months ago

Full-time - Senior

Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Senior Site Reliability Engineer (SRE) role is focused on ensuring the health and performance of production systems within a cloud-based environment. The position requires a strong technical background in Linux, microservices, and NoSQL databases, along with excellent troubleshooting skills. The SRE will be responsible for monitoring production systems, creating dashboards, configuring alerts, and leading troubleshooting efforts. This role involves collaboration with cross-functional teams to implement scalable solutions and improve system reliability.

Responsibilities

Own the health of production systems
Develop monitoring dashboards
Configure alerts and automate processes for system recovery
Monitor alerts and take proactive steps to resolve system issues
Troubleshoot production issues
Lead production troubleshooting calls
Manage patches and updates on production systems
Design and build cutting-edge, multi-microservice solutions to support Starbucks's growth worldwide
Work with cross-functional teams for ongoing design efforts and systems support
Automate password and certificate rotations on application and DB servers
Help the CI/CD team roll out applications and infrastructure globally
Collaborate with the development team and developer leads from other IT teams
Initiate process improvements for new and existing systems
Coach and mentor other team members
Participate in a production support rotation that includes pager responsibilities
Break down complex application designs into component deliverables and estimate design and development timelines

Requirements

10-12 years of experience in the IT industry
9+ years of software development and DevOps engineering
Experience working in a cloud environment (Azure preferred)
Experience with Kubernetes (Azure Kubernetes Service (AKS) preferred)
Experience with Kafka, Event Hub, NATS, or another message broker
Experience with Cassandra, PostgreSQL, MongoDB, Elasticsearch, Cosmos DB
Experience with Azure DevOps, Jenkins, Python, Terraform, Ansible
Experience with Databricks
Experience with Datadog, Splunk, or other logging and APM tools
Experience working in a Linux environment
In-depth understanding of computer science fundamentals, including object-oriented design, data structures, algorithms, and problem solving
Experience building complex, scalable, high-performance software systems
Demonstrated knowledge of best practices for the design and implementation of large-scale systems
Experience building and operating mission-critical, highly available (24x7) systems
Ability to work well with a team in a fast-paced agile development environment
Bachelor's degree in Computer Science or equivalent work experience
Excellent communication, analytical, and problem-solving skills
Extensive understanding of the SDLC and Scrum methodologies
