Software Engineering Manager-SRE

$262,800 - $262,800/Yr

Citizens Bank

posted 2 months ago

Full-time - Manager
Credit Intermediation and Related Activities

About the position

As the Manager of Site Reliability Engineering (SRE) at Citizens, you will play a critical role in ensuring the performance, reliability, and scalability of our systems. Leveraging the principles of Site Reliability Engineering pioneered by Google, you will lead a team of talented engineers in implementing best practices for application performance monitoring, toil reduction, and system stability. Your focus will extend to both complex cloud-based and on-premises applications, ensuring high system uptime and availability. Collaboration with other SRE teams, departments, and business units across the organization will be essential to achieving our goals.

In this role, you will engage in deep discussions with technologists to understand the intricacies of our systems and discuss strategic, long-term goals to drive innovation and growth. Your experience with AWS and Azure technologies, as well as proficiency in industry-standard tools, will be crucial for success.

You will be responsible for developing and implementing strategies for application performance monitoring, proactively identifying and resolving performance bottlenecks, and driving initiatives to reduce toil and automate repetitive tasks. This will allow your team to focus on high-impact projects that improve system reliability and scalability. You will also establish and enforce best practices for incident management, post-mortem analysis, and continuous improvement, ensuring that lessons learned are applied to prevent future outages. Implementing robust monitoring and alerting systems using tools like Datadog, ELK, and OpenTelemetry will be part of your responsibilities, with a focus on meeting or exceeding defined service level objectives (SLOs) and service level agreements (SLAs).

Your leadership will foster collaboration and knowledge sharing with other SRE teams and departments across the organization, leveraging their expertise and resources to drive improvements in system reliability and performance.

Responsibilities

  • Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and excellence.
  • Develop and implement strategies for application performance monitoring to proactively identify and resolve performance bottlenecks.
  • Drive initiatives to reduce toil and automate repetitive tasks, allowing the team to focus on high-impact projects that improve system reliability and scalability.
  • Collaborate closely with cross-functional teams including software engineering, infrastructure, and product management to design, deploy, and maintain highly available and resilient systems.
  • Establish and enforce best practices for incident management, post-mortem analysis, and continuous improvement, ensuring that lessons learned are applied to prevent future outages.
  • Implement robust monitoring and alerting systems using tools like Datadog, ELK, and OpenTelemetry to track system uptime and availability for complex cloud and on-premises applications, focusing on meeting or exceeding defined SLOs and SLAs.
  • Foster collaboration and knowledge sharing with other SRE teams and departments across the organization, leveraging their expertise and resources to drive improvements in system reliability and performance.
  • Engage in deep discussions with technologists to understand the intricacies of our systems and discuss strategic, long-term goals to drive innovation and growth.
  • Utilize expertise in AWS and Azure technologies to architect, deploy, and optimize cloud-based solutions, ensuring scalability, reliability, and cost-effectiveness.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or related field.
  • Proven experience leading a team of Site Reliability Engineers in a fast-paced and dynamic environment.
  • Deep understanding of application performance monitoring principles and tools, with hands-on experience in designing and implementing monitoring solutions.
  • Strong background in system architecture, infrastructure automation, and cloud technologies, with expertise in AWS and Azure.
  • Expertise in incident management, with the ability to effectively lead and coordinate response efforts during critical incidents.
  • Experience managing system uptime and availability for complex cloud-based and on-premises applications, with a track record of meeting or exceeding defined SLOs and SLAs.
  • Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams and influence decision-making at all levels of the organization.
  • Strong problem-solving skills and a passion for driving continuous improvement and innovation.

Benefits

  • Competitive pay
  • Comprehensive medical, dental and vision coverage
  • Retirement benefits
  • Maternity/paternity leave
  • Flexible work arrangements
  • Education reimbursement
  • Wellness programs
  • Paid time off policy that exceeds mandatory requirements