Site Reliability Engineer

OverDrive - Cleveland, OH

posted 3 months ago

Full-time - Entry Level

Sporting Goods, Hobby, Musical Instrument, Book, and Miscellaneous Retailers

About the position

The Site Reliability Engineer (SRE) position at OverDrive is a critical role that focuses on ensuring the availability, performance, and efficiency of our services. This position requires a hybrid work schedule, with two days on campus in Cleveland, Ohio, and three days working from home.

The SRE will be responsible for various aspects of service management, including change management, monitoring, emergency response, and capacity planning for both existing and future services. The role demands a proactive approach to predicting performance issues and implementing solutions before they impact end-users. Collaboration with application developers is essential to ensure that applications meet their uptime requirements, and SREs will also participate in an on-call rotation, which may require incident response during off-hours.

In this role, you will engage in small projects and individual tasks, receiving regular guidance from more senior developers. You will provide day-to-day support for development teams, which includes building and running deployments, answering questions, and monitoring Service-Level Indicators for applications and systems. Continuous learning is encouraged, and you will independently train in the systems and technologies utilized by the team. Your feedback will be invaluable to application developers, helping them meet performance objectives from a systems perspective. Additionally, you will work with applications in various programming languages within a Linux environment, contributing to the overall reliability and performance of our services.

Responsibilities

Work on small projects and individual tasks with regular guidance from more senior developers.
Provide day-to-day support for development teams, such as building and running deployments and answering questions.
Monitor Service-Level Indicators for applications and systems.
Independently train in the systems and technologies that the team uses.
Provide feedback to application developers from a system perspective to help meet application performance objectives.
Participate in on-call rotations to provide first-line support during incidents.
Work with applications in various languages in a Linux environment.

Requirements

2+ years of experience in software development or system administration; 1+ years of experience working in Linux environments.
Proficient understanding of how modern networks and the Internet function.
Experience in identifying and resolving outages and performance issues in complex, networked applications.
Experience working with large cloud providers; Amazon Web Services (AWS) experience a plus.
Experience using a scripting language (e.g., Ruby or Python) to automate tasks.
Experience with configuration automation tools; Ansible experience a plus.
Experience with logging and monitoring frameworks.
Able to participate in on-call rotations that require responding to incidents outside of business hours.
Able to work with a geographically distributed team, with infrequent in-person communication.
