Site Reliability Engineer

Turing It Labs - San Jose, CA

posted 3 months ago

Full-time

About the position

The Site Reliability Engineer (SRE) position in San Jose, CA, is a critical role that combines software engineering and systems engineering to build and run large-scale, distributed, fault-tolerant systems. The SRE will be responsible for ensuring the reliability, availability, and performance of the company's services and infrastructure. This role requires a deep understanding of cloud computing, particularly with AWS, and experience with container orchestration using Kubernetes. The SRE will also be involved in automating operational tasks and improving system performance through coding and scripting, primarily using tools like Ansible and Python.

In this position, the SRE will work closely with development teams to design and implement monitoring solutions using tools such as Dynatrace, Apica, and Grafana. The goal is to proactively identify and resolve issues before they impact customers. The SRE will also participate in on-call rotations, responding to incidents and outages, and will be expected to contribute to post-mortem analyses to prevent future occurrences. This role is essential for maintaining the high standards of service reliability that our customers expect.

The ideal candidate will have over 14 years of IT experience, with a strong background in coding and scripting, cloud computing, and monitoring tools. The SRE will be expected to work onsite 2-3 days a week, collaborating with team members to enhance the reliability and performance of our systems. This position offers an exciting opportunity to work with cutting-edge technologies and contribute to the overall success of the organization.

Responsibilities

Ensure the reliability, availability, and performance of services and infrastructure.
Automate operational tasks and improve system performance through coding and scripting.
Design and implement monitoring solutions using tools like Dynatrace, Apica, and Grafana.
Collaborate with development teams to enhance system reliability and performance.
Participate in on-call rotations and respond to incidents and outages.
Conduct post-mortem analyses to prevent future incidents.

Requirements

14+ years of IT experience.
Proficiency in coding/scripting, particularly with Ansible and Python.
Strong knowledge of cloud computing, specifically AWS.
Experience with container orchestration using Kubernetes.
Familiarity with monitoring tools such as Dynatrace, Apica, and Grafana.
