Ally Financial - Charlotte, NC

posted 4 months ago

Full-time - Mid Level
Onsite - Charlotte, NC
Credit Intermediation and Related Activities

About the position

As a Senior Site Reliability Engineer (SRE) at Ally Financial, you will play a crucial role in ensuring the reliability and scalability of our complex systems. This position is designed for individuals who are passionate about implementing efficient solutions to prevent and resolve incidents. You will be part of a dynamic team that embodies a startup feel while benefiting from the stability of a well-established company. Your work will directly impact the user experience and system performance, making it essential to advocate for reliability best practices throughout the application development lifecycle.

In this role, you will collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems. You will work closely with development teams and architects to ensure that reliability is a priority from the outset of application development. Your responsibilities will include designing and implementing monitoring and alerting systems to provide real-time visibility into user experience and system health. You will also monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability.

Additionally, you will develop and maintain automated tools and processes to streamline operational tasks, participate in incident response and post-mortems, and contribute to continuous improvement efforts. Conducting capacity planning and resource optimization will be key to handling the growing demands on our infrastructure. You will continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems, ensuring that Ally remains at the forefront of technological innovation.

Responsibilities

  • Collaborate with cross-functional teams to design, build, and maintain robust, scalable, and fault-tolerant systems
  • Work closely with development teams and architects to advocate for reliability best practices during the application development lifecycle
  • Design and implement monitoring and alerting to provide real-time visibility into user experience and system health and performance
  • Monitor and analyze system performance, proactively identifying potential issues and implementing solutions to ensure optimal performance and reliability
  • Develop and maintain automated tools and processes to streamline operational tasks and reduce manual interventions
  • Participate in incident response and post-mortems, contributing to continuous improvement efforts
  • Conduct capacity planning and resource optimization to handle growing demands on our infrastructure
  • Continuously research and evaluate new technologies and practices to enhance the reliability and efficiency of our systems

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related field preferred (or equivalent practical experience)
  • Strong verbal and written communication skills
  • Ability to collaborate effectively in a team environment and communicate technical concepts to non-technical stakeholders
  • 3+ years of proven experience as a Site Reliability Engineer or in a similar role in a production environment
  • 3+ years of experience with AWS services (ASG, Fargate, Lambda, Aurora, DynamoDB, ALB/NLB)
  • 3+ years of working experience with CI/CD pipelines (GitLab) and developing infrastructure-as-code (Terraform, Ansible, CloudFormation, etc.)
  • Working knowledge of observability platforms like Splunk, Dynatrace, Datadog, Sumo Logic or New Relic
  • Working experience designing observability for enterprise applications
  • Working knowledge of containers in ECS, EKS, or Kubernetes (K8s)
  • Solid knowledge of system administration and DevOps practices
  • Development experience, along with experience working on cloud and physical servers
  • Understanding of, and experience working with, business, product, and engineering teams to develop SLIs, SLOs, and SLAs

Nice-to-haves

  • Strong knowledge of Linux/Unix systems and network protocols
  • Experience with distributed systems and microservices architecture
  • Proficiency in programming or scripting languages such as Python, Java, or Bash
  • Hands-on experience with monitoring and logging tools (Dynatrace, CloudWatch, Prometheus, Grafana, etc.)
  • Familiarity with cybersecurity best practices and principles
  • Certifications in AWS
  • Ability to lead triage calls, including working across multiple divisions to resolve issues

Benefits

  • 11 paid holidays
  • 20 paid time off days
  • 8 hours of volunteer time off yearly
  • 401(k) retirement savings plan with matching and company contributions
  • Student loan paydown and 529 education savings assistance programs
  • Tuition reimbursement
  • Employee stock purchase plan
  • Flexible health and insurance options including medical, dental and vision
  • Employee, spouse and child life insurance
  • Short- and long-term disability
  • Pre-tax Health Savings Account with employer contributions
  • Healthcare FSA
  • Critical illness, accident & hospital indemnity insurance
  • Total well-being program
  • Adoption, surrogacy and fertility assistance
  • Paid parental and caregiver leave
  • Dependent Day Care FSA
  • Childcare discounts
  • Mentally Fit Employee Assistance Program
  • Subsidized and discounted Weight Watchers® program
  • Travel allowances
  • Relocation assistance
  • Signing bonus
  • Equity