Site Reliability Engineer

Egen Solutions - Naperville, IL

posted 3 months ago

Full-time

Professional, Scientific, and Technical Services

About the position

Egen is a fast-growing and entrepreneurial company with a data-first mindset. We bring together the best engineering talent working with the most advanced technology platforms, including Google Cloud and Salesforce, to help clients drive action and impact through data and insights. We are committed to being a place where the best people choose to work so they can apply their engineering and technology expertise to envision what is next for how data and platforms can change the world for the better. We are dedicated to learning, thrive on solving tough problems, and continually innovate to achieve fast, effective results.

We are seeking a Site Reliability Engineer to ensure system reliability and provide infrastructure support. You will be responsible for scalability, performance optimization, incident management, and analysis.

Responsibilities

Ensure system reliability and application uptime in line with SLAs
Monitor system performance metrics and determine how to optimize the system
Lead incident management efforts using the available methodology, and document the root cause analysis (RCA), lessons learned, and any SOPs for resolving the issue in the future
Work closely with DevOps and Application teams to align priorities, share knowledge and drive continuous improvement initiatives
Prioritize response efforts based on issue severity, potential impact on users, and business priorities
Evaluate and approve changes to production systems, balancing the need for innovation against the need for stability and reliability
Optimize resource usage and manage costs by identifying inefficiencies, rightsizing infrastructure resources, and implementing cost-saving measures

Requirements

3+ years of SRE experience
A bachelor's degree is preferred, but relevant experience will be considered as an equivalent
Scripting and programming: Python, Bash/Shell, Ruby, Java, .NET, SQL
Experience with monitoring tools: Datadog, New Relic, Splunk, Grafana
Familiarity with containerization and orchestration: Docker, Kubernetes
Proficient in Linux environments
Experience with incident management tools: VictorOps, PagerDuty
Version control experience: Git, Bitbucket
Ability to troubleshoot complex, intertwined distributed services
Strong attention to detail
Experience with testing, monitoring, logging, and alerting
Strong documentation skills
Experience in incident management
