Senior Site Reliability Engineer - ATL

The Matlen Silver Group - Atlanta, GA

posted 11 days ago

Full-time - Mid Level

Atlanta, GA

Professional, Scientific, and Technical Services

About the position

The Senior Site Reliability Engineer will focus on enhancing Delta's reliability engineering practices by driving cross-team initiatives aimed at improving application resiliency, uptime, and performance. The role requires a strong background in modern reliability disciplines and experience in implementing observability plans around logs, metrics, and traces.

Responsibilities

Set SLOs / SLIs / error budgets and manage reliability for infrastructure and applications.
Utilize scripting languages and related tooling such as JavaScript, Node.js, Python, Bash, Maven, and Ansible.
Handle diverse systems with configuration management systems like Puppet, Chef, and Ansible.
Eliminate toil by leveraging automation.
Manage incidents using tools like PagerDuty.
Implement monitoring and alerting systems such as Prometheus, Grafana, and Dynatrace.
Understand standard networking protocols and components, including HTTP, DNS, TCP/IP, and load-balancing strategies.
Work with Serverless Application Framework and containerized workloads using Docker or Kubernetes.
Work with distributed systems and microservices.
Utilize infrastructure automation tools like CloudFormation and Terraform.
Understand CI/CD processes and deployment automation tools like CodePipeline, CodeDeploy, Jenkins, and Bamboo.
Debug, troubleshoot, and solve problems effectively.
Communicate and collaborate with various business units and third parties.
Liaise with developers, operations staff, and third-party resources.
Integrate APIs and coach/mentor team members on reliability engineering.

Requirements

5+ years of experience in DevOps practices.
Hands-on experience with AWS Cloud and DevOps principles.
2+ years of experience working with DevOps tools (GitLab CI, AWS CodePipeline).
2+ years of experience in scripting tools (Bash, Python, etc.).
1+ year of experience in developing Node.js or TypeScript applications.
2+ years of experience in building and supporting applications in AWS using their native services.
1+ year of experience in AWS CDK.
Ability to troubleshoot and resolve problems with existing AWS Cloud Controls.

Nice-to-haves

1+ year of experience in containerization technologies like Kubernetes, OpenShift, Docker.
1+ year of experience in application resiliency evaluation using AWS FIS.
1+ year of experience using Litmus for Chaos Engineering methods.
Exposure to Red Hat OpenShift Service on AWS (ROSA).
