Bose - Atlanta, GA
posted 2 months ago
At Bose, we believe that sound is the most powerful force on earth, and our Information Technology team is dedicated to delivering valuable and reliable business and technology solutions. This role involves designing, implementing, and managing systems to ensure high availability and performance of production services.
The successful candidate will develop and maintain monitoring, alerting, and logging systems to proactively identify and address issues, ensuring that our services meet the highest standards of reliability and performance. The position requires the creation and enforcement of Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Key Performance Indicators (KPIs).
You will lead the response to production incidents, which includes troubleshooting, resolution, and conducting post-incident analysis. Additionally, you will develop and maintain incident response procedures and runbooks, conduct root cause analysis, and implement corrective actions to prevent recurrence.
Automation is a key focus of this role, as you will be responsible for automating repetitive tasks and processes to improve efficiency and reduce human error. You will also develop and maintain tools for deployment, configuration management, and system monitoring, collaborating closely with development teams to integrate automation into the software delivery pipeline.
Capacity planning and designing scaling strategies to accommodate changes in demand will be essential, as will monitoring resource utilization to optimize infrastructure for cost efficiency.
Collaboration is crucial in this role, as you will work with development teams to design scalable and reliable system architectures, participate in architectural reviews, and provide guidance on reliability and performance considerations.
You will evaluate and recommend new technologies and approaches to enhance system reliability and performance, document system configurations, processes, and procedures, and create operational runbooks and knowledge base articles. Training and mentorship of team members and stakeholders on reliability best practices will also be part of your responsibilities. Effective communication about system status, incident responses, and reliability improvements is essential, and participation in on-call rotations will be required to respond to incidents as needed.