Site Reliability Engineer

Unclassified - San Jose, CA

posted 2 months ago

Full-time

About the position

The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of our infrastructure and services. The SRE creates and supports automation scripts in shell, Ansible, and Python for infrastructure deployments, validations, and monitoring, reducing manual operational work. The SRE also schedules monitoring scripts with cron and Airflow so that systems are continuously monitored and issues are addressed promptly.

Beyond automation and monitoring, the SRE handles incident management and problem resolution, working closely with other teams to troubleshoot and resolve issues as they arise. The role requires extensive IT infrastructure experience, particularly with Linux distributions such as RHEL and CentOS, and a strong understanding of distributed computing and container orchestration frameworks, including Kubernetes. It also involves database management, which requires knowledge of both SQL and NoSQL databases. The ideal candidate has a strong background in building CI/CD pipelines and is familiar with cloud platforms, specifically AWS.

This is a hybrid position, with two days in the San Jose, CA office and three days of remote work.

Responsibilities

Creating and supporting automation scripts (shell, Ansible, Python) for infrastructure deployments, validations, and monitoring to improve operational tasks
Scheduling monitoring scripts using cron and Airflow
Monitoring using tools including Dynatrace, Apica, Grafana, etc.
Database handling
Building CI/CD pipelines
Incident handling and problem management

Requirements

14+ years of IT infrastructure experience
Extensive experience working with Linux distributions such as RHEL and CentOS, including shells, filesystems, and utilities
Experience with Python programming and Ansible automation
Knowledge of distributed computing and experience working with container orchestration frameworks, including on-premises and Rancher Kubernetes
Good knowledge of Kubernetes objects
Experience working with storage (ONTAP preferred): volumes, aggregates, backups, and DR planning
Experience scheduling monitoring scripts using cron and Airflow
Experience with monitoring tools including Dynatrace, Apica, Grafana, etc.
Database knowledge including SQL and NoSQL databases
Experience building CI/CD pipelines (preferred)
Cloud platform knowledge (specifically AWS) is required
