Unclassified - San Jose, CA

posted 2 months ago

Full-time
San Jose, CA

About the position

The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of our infrastructure and services. The SRE creates and supports automation scripts in shell, Ansible, and Python for infrastructure deployments, validations, and monitoring, with the aim of streamlining operational tasks and improving the overall efficiency of our systems. The SRE also schedules monitoring scripts using cron and Airflow so that systems are continuously monitored and issues are addressed promptly.

Beyond automation and monitoring, the SRE handles incident management and problem resolution, working closely with other teams to troubleshoot and resolve issues as they arise. The role requires extensive IT infrastructure experience, particularly with Linux distributions such as RHEL and CentOS, and a strong understanding of distributed computing and container orchestration frameworks, including Kubernetes. It also involves database management, which requires knowledge of both SQL and NoSQL databases. The ideal candidate has a strong background in building CI/CD pipelines and is familiar with cloud platforms, specifically AWS.

This is a hybrid position: two days in the San Jose, CA office and three days of remote work, providing flexibility while maintaining collaboration with the team.

Responsibilities

  • Creating and supporting automation scripts (shell, Ansible, Python) for infrastructure deployments, validations, and monitoring to improve operational tasks
  • Scheduling monitoring scripts using cron and Airflow
  • Monitoring using tools including Dynatrace, Apica, Grafana, etc.
  • Database handling
  • Building CI/CD pipelines
  • Incident handling and problem management

Requirements

  • 14+ years of IT infrastructure experience
  • Extensive experience working with Linux distributions such as RHEL and CentOS, including shells, filesystems, and utilities
  • Experience with Python programming and Ansible automation
  • Knowledge of distributed computing and experience working with container orchestration frameworks, including on-prem and Rancher-managed Kubernetes
  • Good knowledge of Kubernetes objects
  • Experience working with storage (ONTAP preferred): volumes, aggregates, backups, DR planning
  • Experience scheduling monitoring scripts using cron and Airflow
  • Experience with monitoring tools including Dynatrace, Apica, Grafana, etc.
  • Database knowledge including SQL and NoSQL databases
  • Experience building CI/CD pipelines (preferred)
  • Cloud platform knowledge (specifically AWS) is required