Site Reliability Engineer

Bose - Atlanta, GA

posted 2 months ago

Full-time

Furniture, Home Furnishings, Electronics, and Appliance Retailers

About the position

At Bose, we believe that sound is the most powerful force on earth, and our Information Technology team is dedicated to delivering valuable and reliable business and technology solutions. This role involves designing, implementing, and managing systems to ensure high availability and performance of production services. The successful candidate will develop and maintain monitoring, alerting, and logging systems to proactively identify and address issues, ensuring that our services meet the highest standards of reliability and performance. The position requires the creation and enforcement of Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Key Performance Indicators (KPIs).

You will lead the response to production incidents, which includes troubleshooting, resolution, and conducting post-incident analysis. Additionally, you will develop and maintain incident response procedures and runbooks, conduct root cause analysis, and implement corrective actions to prevent recurrence.

Automation is a key focus of this role, as you will be responsible for automating repetitive tasks and processes to improve efficiency and reduce human error. You will also develop and maintain tools for deployment, configuration management, and system monitoring, collaborating closely with development teams to integrate automation into the software delivery pipeline. Capacity planning and designing scaling strategies to accommodate changes in demand will be essential, as will monitoring resource utilization to optimize infrastructure for cost efficiency.

Collaboration is crucial in this role, as you will work with development teams to design scalable and reliable system architectures, participate in architectural reviews, and provide guidance on reliability and performance considerations. You will evaluate and recommend new technologies and approaches to enhance system reliability and performance, document system configurations, processes, and procedures, and create operational runbooks and knowledge base articles.

Training and mentorship of team members and stakeholders on reliability best practices will also be part of your responsibilities. Effective communication about system status, incident responses, and reliability improvements is essential, and participation in on-call rotations will be required to respond to incidents as needed.

Responsibilities

Design, implement, and manage systems to ensure high availability and performance of production services.
Develop and maintain monitoring, alerting, and logging systems to proactively identify and address issues.
Create and enforce Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Key Performance Indicators (KPIs).
Lead the response to production incidents, including troubleshooting, resolution, and post-incident analysis.
Develop and maintain incident response procedures and runbooks.
Conduct root cause analysis and implement corrective actions to prevent recurrence.
Automate repetitive tasks and processes to improve efficiency and reduce human error.
Develop and maintain tools for deployment, configuration management, and system monitoring.
Collaborate with development teams to integrate automation into the software delivery pipeline.
Perform capacity planning to ensure systems can handle current and future workloads.
Design and implement scaling strategies to accommodate changes in demand.
Monitor resource utilization and optimize infrastructure to achieve cost efficiency.
Collaborate with development teams to design scalable and reliable system architectures.
Participate in architectural reviews and provide guidance on reliability and performance considerations.
Evaluate and recommend new technologies and approaches to enhance system reliability and performance.
Document system configurations, processes, and procedures.
Create and maintain operational runbooks and knowledge base articles.
Provide training and mentorship to team members and other stakeholders on reliability best practices.
Work closely with software engineers, operations teams, R&D, automotive, and other stakeholders to ensure smooth deployment and operation of services.
Communicate effectively about system status, incident responses, and reliability improvements.
Participate in on-call rotations and be available to respond to incidents as needed.

Requirements

Proficiency in scripting and programming languages (e.g., Python, Go, Java) and familiarity with data formats such as JSON.
Experience with monitoring and observability tools (e.g., LogicMonitor, Prometheus, New Relic, Grafana, Datadog) preferred.
Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).
Familiarity with cloud platforms (e.g., AWS, Azure, Google Cloud).
Experience with configuration management and infrastructure-as-code tools (e.g., Terraform, Ansible) preferred.
Excellent problem-solving and analytical skills.
Strong communication and collaboration abilities.
3+ years of experience in a similar role, with a strong background in systems engineering, software development, or operations.
Bachelor's degree in Computer Science, Information Technology, or a related field.
Advanced degree or relevant certifications (e.g., AWS Certified DevOps Engineer, Google Professional DevOps Engineer) preferred.
