SRE Manager

Ally - Raleigh, NC

posted 4 months ago

Full-time - Mid Level

Onsite - Raleigh, NC

Printing and Related Support Activities

About the position

Ally Financial is seeking a talented and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role is crucial for ensuring the reliability and scalability of complex systems, and the ideal candidate will thrive on implementing efficient solutions to prevent and resolve incidents. As an SRE, you will be responsible for managing the SRE Team, which includes both Ally employees and contractors. You will collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems. Your work will involve close collaboration with development teams and architects to advocate for reliability best practices throughout the application development lifecycle.

In this position, you will design and implement monitoring and alerting systems to provide real-time visibility into user experience and system health. You will monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability. Additionally, you will develop and maintain automated tools and processes to streamline operational tasks and reduce manual interventions. Participation in incident response and post-mortems will be part of your responsibilities, contributing to continuous improvement efforts. You will also conduct capacity planning and resource optimization to handle growing demands on our infrastructure, continuously researching and evaluating new technologies and practices to enhance the reliability and efficiency of our systems.

At Ally, we pride ourselves on fostering a culture that values diverse thinking and supports one another. We are relentless in finding new ways technology can help make experiences better and help people. If you are passionate about technology and want to make a real impact, this is the opportunity for you.

Responsibilities

Manage the SRE Team including Ally employees and contractors.
Collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems.
Work closely with development teams and architects to advocate for reliability best practices during the application development lifecycle.
Design and implement monitoring and alerting to provide real-time visibility into user experience and system health and performance.
Monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability.
Develop and maintain automated tools and processes to streamline operational tasks and reduce manual interventions.
Participate in incident response and post-mortems, contributing to continuous improvement efforts.
Conduct capacity planning and resource optimization to handle growing demands on our infrastructure.
Continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems.

Requirements

Bachelor's degree in Computer Science, Engineering, or related fields preferred (or equivalent practical experience).
Strong verbal and written communication skills.
2-4 years' experience managing an SRE or DevOps team with an observability workload.
2-4 years' experience in Agile Management owning SRE roadmaps and deliverables using Scrum/Kanban.
2-4 years' experience delivering projects alongside a constant flow of side intake and production response workloads.
5+ years' experience as a Site Reliability Engineer or similar role in a production environment.
5+ years' experience with AWS services (ASG, Fargate, Lambda, Aurora, DynamoDB, ALB/NLB).
5+ years' working experience with CI/CD pipelines (GitLab) and developing infrastructure-as-code (Terraform, Ansible, etc.).
Working knowledge of observability platforms like Splunk, Dynatrace, Datadog, Sumo Logic, or New Relic.
Working experience with designing Observability for enterprise applications.
Working knowledge of containers in ECS, EKS, or K8s.
Solid knowledge of system administration and DevOps.
Development experience, along with experience working on cloud and physical servers.
Understanding of and experience working with business, product, and engineering teams in developing SLIs, SLOs, and SLAs.

Nice-to-haves

Strong knowledge of Linux/Unix systems and network protocols.
Experience with distributed systems and microservices architecture.
Proficiency in programming or scripting languages such as Python, Java, or Bash.
Hands-on experience with monitoring and logging tools (Dynatrace, CloudWatch, Prometheus, Grafana, etc.).
Familiarity with cybersecurity best practices and principles.
Certifications in AWS.
Ability to lead triage calls including working across multiple divisions to resolve issues.

Benefits

Competitive holiday and flexible paid-time-off, including time off for volunteering and voting.
Industry-leading 401(k) retirement savings plan with matching and company contributions.
Student loan and 529 educational assistance programs.
Tuition reimbursement and other financial well-being programs.
Flexible health and insurance options including dental and vision.
Pre-tax Health Savings Account with employer contributions.
Total well-being program that helps you and your family stay on track physically, socially, emotionally, and financially.
Adoption, surrogacy, and fertility support.
Parental and caregiver leave, back-up child and adult/elder day care program, and childcare discounts.
LifeMatters® Employee Assistance Program, subsidized and discounted Weight Watchers® program, and other employee discount programs.
