Bose - Atlanta, GA

posted 2 months ago

Full-time - Senior
Atlanta, GA
Furniture, Home Furnishings, Electronics, and Appliance Retailers

About the position

As a Lead Site Reliability Engineer at Bose, you will play a pivotal role in managing and mentoring a team of Site Reliability Engineers (SREs). Your primary responsibility will be to provide guidance, support, and performance evaluations to your team, fostering a culture of collaboration, continuous improvement, and innovation. You will define and communicate clear goals and objectives for the SRE team, ensuring alignment with the overall business objectives of the organization.

In this role, you will develop and execute strategies aimed at improving system reliability, availability, and performance. You will drive the adoption of best practices and standards for SRE across the organization, participating in and leading strategic planning for capacity management, disaster recovery, and infrastructure investments. Your leadership will be crucial in conducting post-incident reviews to identify root causes and implement preventive measures, as well as developing and enforcing incident response procedures and runbooks.

Collaboration will be key as you work closely with engineering and architecture teams to design scalable and resilient system architectures. You will optimize system performance and reliability through proactive monitoring, tuning, and enhancements, while also evaluating and implementing new technologies and tools to improve system capabilities and efficiency. Automation of operational processes will be a priority, as you aim to improve efficiency and reduce manual intervention. You will oversee the development and maintenance of tools for deployment, monitoring, and configuration management, promoting the use of Infrastructure-as-Code (IaC) and Continuous Integration/Continuous Deployment (CI/CD) practices.

Additionally, you will lead efforts in capacity planning to ensure that infrastructure can support current and future business needs, designing and implementing scaling strategies to handle variations in demand and growth. Monitoring and optimizing resource utilization will be essential to balance performance and cost-effectiveness.

You will communicate effectively about system status, performance metrics, and ongoing improvements to stakeholders, providing technical guidance and support to other teams as needed. Thorough documentation of systems, processes, and procedures will be expected, along with the creation and maintenance of operational runbooks, knowledge base articles, and training materials. Sharing knowledge and best practices with the team and organization through training sessions and workshops will also be part of your responsibilities.

Responsibilities

  • Lead, mentor, and manage a team of Site Reliability Engineers, providing guidance, support, and performance evaluations.
  • Foster a culture of collaboration, continuous improvement, and innovation within the team.
  • Define and communicate clear goals and objectives for the SRE team, aligning with overall business objectives.
  • Develop and execute strategies to improve system reliability, availability, and performance.
  • Drive the adoption of best practices and standards for SRE across the organization.
  • Participate in and lead strategic planning for capacity management, disaster recovery, and infrastructure investments.
  • Lead post-incident reviews to identify root causes and implement preventive measures.
  • Develop and enforce incident response procedures and runbooks.
  • Collaborate with engineering and architecture teams to design scalable and resilient system architectures.
  • Optimize system performance and reliability through proactive monitoring, tuning, and enhancements.
  • Evaluate and implement new technologies and tools to improve system capabilities and efficiency.
  • Drive the automation of operational processes to improve efficiency and reduce manual intervention.
  • Oversee the development and maintenance of tools for deployment, monitoring, and configuration management.
  • Promote the use of Infrastructure-as-Code (IaC) and Continuous Integration/Continuous Deployment (CI/CD) practices.
  • Lead efforts in capacity planning to ensure infrastructure can support current and future business needs.
  • Design and implement scaling strategies to handle variations in demand and growth.
  • Monitor and optimize resource utilization to balance performance and cost-effectiveness.
  • Work closely with cross-functional teams, including development, operations, and product management, to ensure alignment on reliability and performance goals.
  • Communicate effectively about system status, performance metrics, and ongoing improvements to stakeholders.
  • Provide technical guidance and support to other teams as needed.
  • Ensure thorough documentation of systems, processes, and procedures.
  • Create and maintain operational runbooks, knowledge base articles, and training materials.
  • Share knowledge and best practices with the team and organization through training sessions and workshops.

Requirements

  • 5+ years of experience in Site Reliability Engineering, Systems Engineering, or related roles, with at least 2 years in a leadership or management capacity.
  • Bachelor's degree in Computer Science, Engineering, or a related field. Advanced degree or relevant certifications (e.g., AWS Certified DevOps Engineer, Google Professional DevOps Engineer) preferred.
  • Advanced proficiency in scripting and programming languages such as Python, Go, Bash, or Java.
  • Extensive experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog).
  • In-depth knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Strong familiarity with cloud platforms (e.g., AWS, Azure, Google Cloud).
  • Expertise in configuration management and Infrastructure-as-Code tools (e.g., Terraform, Ansible).
  • Strong understanding of networking, distributed systems, and databases.
  • Proven ability to lead and manage technical teams effectively.
  • Excellent problem-solving, analytical, and communication skills.