SRE

Insight Global - Bellevue, WA

posted 3 months ago

Full-time

Bellevue, WA

Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) position at Insight Global in Bellevue, Washington, is intended for candidates with a strong background in software development and DevOps best practices. The ideal candidate has more than five years of experience in these areas, particularly on an Enterprise or Shared Service DevOps team.

The role emphasizes automation and requires proficiency in scripting languages such as Bash and Python. The SRE will implement and manage CI/CD pipelines using tools like Jenkins and GitLab CI/CD to keep software delivery smooth and efficient. The role also demands a solid understanding of infrastructure-as-code (IaC) principles, with hands-on experience using tools like Ansible and Terraform for infrastructure automation. Familiarity with Amazon Web Services (AWS) is crucial, as the SRE will work extensively with cloud technologies.

The candidate should have strong problem-solving skills and a proactive approach to maintaining system health, troubleshooting complex issues, and responding to incidents. Experience in post-incident analysis and implementing preventive measures is essential for improving system reliability and performance. The SRE will also work with observability, monitoring, and alerting tools to track service level indicators (SLIs) and ensure that service level agreements (SLAs) and service level objectives (SLOs) are met.

A commitment to balancing reliability with continuous innovation and development is a key aspect of this role: the SRE will contribute to a robust and scalable infrastructure that supports the company's growth and operational goals.

Responsibilities

Implement and manage CI/CD pipelines using Jenkins and GitLab CI/CD.
Automate infrastructure using tools like Ansible and Terraform.
Monitor and maintain system health, troubleshooting complex issues as they arise.
Respond to incidents and conduct post-incident analysis to prevent future occurrences.
Work with observability tools to track SLIs and ensure SLAs and SLOs are met.
Collaborate with development teams to enhance software delivery processes.
Develop and define metrics for Site Reliability Engineering (SRE).
Utilize AWS cloud technologies for infrastructure management.

Requirements

5+ years of experience in Software Development and DevOps best practices.
Previous experience working on an Enterprise / Shared Service DevOps team.
Proficiency in scripting languages, especially Bash and Python, for automation.
Experience with CI/CD tools like Jenkins and GitLab CI/CD, and strong pipeline management skills.
Familiarity with version control systems, particularly Git, and collaboration platforms.
Knowledge of infrastructure-as-code (IaC) principles, with hands-on experience using Ansible or Terraform for infrastructure automation.
Experience with Amazon Web Services (AWS) cloud technologies.
Strong problem-solving skills and a proactive approach to system health.
A solid background in system administration, infrastructure management, or software engineering.
Experience in incident response, post-incident analysis, and implementing preventive measures.
Familiarity with observability tools, monitoring, and alerting systems.
