SRE Manager

$110,000 - $180,000/Yr

Ally Financial - Charlotte, NC

posted 4 months ago

Full-time - Manager
Onsite - Charlotte, NC
Credit Intermediation and Related Activities

About the position

As a Site Reliability Engineer (SRE) Manager at Ally Financial, you will play a pivotal role in ensuring the reliability and scalability of our complex systems. You will manage the SRE team, which includes both Ally employees and contractors, and collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems. You will advocate for reliability best practices throughout the application development lifecycle so that our systems are not only functional but also resilient and efficient.

In this role, you will design and implement monitoring and alerting systems that provide real-time visibility into user experience and system health. You will monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability. You will also develop and maintain automated tools and processes that streamline operational tasks and reduce the need for manual intervention. Your participation in incident response and post-mortems will contribute to our continuous improvement efforts, ensuring that we learn from past incidents and enhance our systems accordingly.

You will also conduct capacity planning and resource optimization to handle the growing demands on our infrastructure, and continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems. Your leadership will be crucial in guiding the SRE team through these challenges and fostering a culture of problem-solving and innovation.

Responsibilities

  • Manage the SRE Team including Ally employees and contractors.
  • Collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems.
  • Work closely with development teams and architects to advocate for reliability best practices during the application development lifecycle.
  • Design and implement monitoring and alerting to provide real-time visibility into user experience and system health and performance.
  • Monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability.
  • Develop and maintain automated tools and processes to streamline operational tasks and reduce manual interventions.
  • Participate in incident response and post-mortems, contributing to continuous improvement efforts.
  • Conduct capacity planning and resource optimization to handle growing demands on our infrastructure.
  • Continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field preferred (or equivalent practical experience).
  • Strong verbal and written communication skills.
  • 2-4 years of managing an SRE or DevOps team with an observability workload.
  • 2-4 years of Agile Management owning SRE roadmaps and deliverables using Scrum/Kanban.
  • 2-4 years of delivering projects alongside a constant flow of side intake and production response workloads.
  • 5+ years of experience as a Site Reliability Engineer or similar role in a production environment.
  • 5+ years of experience with AWS services (ASG, Fargate, Lambda, Aurora, DynamoDB, ALB/NLB).
  • 5+ years of working experience with CI/CD pipelines (GitLab) and developing infrastructure-as-code (Terraform, Ansible, etc.).
  • Working knowledge of observability platforms like Splunk, Dynatrace, Datadog, Sumo Logic, or New Relic.
  • Working experience designing observability for enterprise applications.
  • Solid knowledge of system administration and DevOps practices.

Nice-to-haves

  • Strong knowledge of Linux/Unix systems and network protocols.
  • Experience with distributed systems and microservices architecture.
  • Proficiency in programming or scripting languages such as Python, Java, or Bash.
  • Hands-on experience with monitoring and logging tools (Dynatrace, CloudWatch, Prometheus, Grafana, etc.).
  • Familiarity with cybersecurity best practices and principles.
  • Certifications in AWS.

Benefits

  • 11 paid holidays
  • 20 paid time off days
  • 8 hours of volunteer time off yearly
  • 401K retirement savings plan with matching and company contributions
  • Student loan pay downs and 529 educational save up assistance programs
  • Tuition reimbursement
  • Employee stock purchase plan
  • Flexible health and insurance options including medical, dental, and vision
  • Employee, spouse, and child life insurance
  • Short- and long-term disability
  • Pre-tax Health Savings Account with employer contributions
  • Healthcare FSA
  • Critical illness, accident & hospital indemnity insurance
  • Total well-being program
  • Adoption, surrogacy, and fertility assistance
  • Paid parental and caregiver leave
  • Dependent Day Care FSA
  • Childcare discounts
  • Mentally Fit Employee Assistance Program
  • Subsidized and discounted Weight Watchers® program
  • Travel allowances
  • Relocation assistance
  • Signing bonus
  • Equity opportunities