3i People - Atlanta, GA

posted 4 months ago

Full-time - Mid Level
Atlanta, GA
Administrative and Support Services

About the position

We have a position for a Sr. Site Reliability Engineer with one of our clients in Atlanta, GA, for an initial contract duration of 5 months. The role leads and mentors a team of Site Reliability Engineers (SREs), fostering a culture of collaboration, continuous learning, and operational excellence. The selected candidate will drive the adoption of SRE best practices and ensure adherence to reliability and performance standards across the organization.

The Sr. Site Reliability Engineer will design and implement highly available, scalable, and fault-tolerant systems on AWS, collaborating with software engineering teams and other SREs to influence design and architecture decisions that improve system reliability and performance. The role also involves developing and maintaining automation scripts and tools to streamline operations, deployments, and monitoring processes, and managing infrastructure with Infrastructure as Code (IaC) and automation tools such as Terraform, CloudFormation, and GitHub Actions.

The engineer will implement and maintain robust monitoring, alerting, and logging systems using tools like Splunk, Grafana, or New Relic; lead incident response efforts; conduct root cause analysis; and implement measures to prevent recurrence. The engineer will also oversee the design and maintenance of CI/CD pipelines using tools like Jenkins, GitLab CI, or CircleCI, ensuring seamless and efficient code deployment processes that reduce time to market and increase system reliability. Performance tuning and capacity planning round out the role, ensuring systems can handle growing workloads, along with troubleshooting to identify and resolve performance bottlenecks in infrastructure and applications.

Responsibilities

  • Lead and mentor a team of SREs, fostering a culture of collaboration and continuous learning.
  • Drive the adoption of SRE best practices and ensure adherence to reliability and performance standards.
  • Design and implement highly available, scalable, and fault-tolerant systems using AWS.
  • Collaborate with software engineering teams and other SREs to influence design and architecture decisions.
  • Develop and maintain automation scripts and tools to streamline operations, deployments, and monitoring processes.
  • Use Infrastructure as Code (IaC) and automation tools such as Terraform, CloudFormation, and GitHub Actions to manage infrastructure.
  • Implement and maintain robust monitoring, alerting, and logging systems using tools like Splunk, Grafana, or New Relic.
  • Lead incident response efforts, conduct root cause analysis, and implement measures to prevent recurrence.
  • Oversee the design and maintenance of CI/CD pipelines using tools like Jenkins, GitLab CI, or CircleCI.
  • Ensure seamless and efficient code deployment processes, reducing time to market and increasing system reliability.
  • Conduct performance tuning and capacity planning to ensure systems can handle growing workloads.
  • Identify and resolve performance bottlenecks in infrastructure and applications.

Requirements

  • 5 years of experience in Site Reliability Engineering or a related field.
  • Extensive AWS experience designing, deploying, and managing scalable, reliable cloud-based infrastructure.
  • Software engineering background, with experience in languages such as Python, JavaScript, and Bash.
  • In-depth knowledge of Infrastructure as Code (IaC) and automation tools like Terraform, CloudFormation, Ansible, and GitHub Actions.
  • Strong automation and scripting skills, with a solid understanding of CI/CD pipelines (e.g., Jenkins).