Site Reliability Engineer (SRE) - San Jose, CA - Onsite

Elliott Moss Consulting - San Jose, CA

posted 3 months ago

Full-time

About the position

We are seeking a highly skilled DevOps Engineer / Site Reliability Engineer (SRE) with expertise in managing complex environments and a deep understanding of Linux-based systems, Kubernetes, automation, and cloud platforms. The ideal candidate has strong experience deploying, managing, and troubleshooting large-scale applications, with a focus on automation, monitoring, and cloud services. This position is based in San Jose, CA and requires onsite presence. The role is expected to last 12 months, and we welcome candidates with various visa statuses, excluding OPT and CPT.

As a Site Reliability Engineer, you will be responsible for the reliability and performance of our systems and applications. You will work closely with development teams to implement best practices in automation and monitoring, and you will deploy and manage applications in a cloud environment. Your expertise in Kubernetes and cloud platforms will be crucial to keeping our infrastructure healthy and our services available and performant.

The ideal candidate will also have a solid understanding of storage technologies, particularly NetApp ONTAP, and will be proficient in scripting and automation with Shell, Ansible, and Python. You will use monitoring and performance tools such as Dynatrace, Apica, and Grafana to keep our systems operating optimally. Familiarity with CI/CD pipelines and DevOps tools such as Jenkins and GitLab CI is also essential. Strong problem-solving skills and the ability to troubleshoot complex systems are a must, as you will resolve issues that arise in our production environment.

Responsibilities

Deploy, manage, and troubleshoot large-scale applications.
Implement automation and monitoring best practices.
Work closely with development teams to ensure system reliability.
Utilize Kubernetes for cluster management and orchestration.
Manage cloud infrastructure, particularly in AWS.
Monitor system performance using tools like Dynatrace and Grafana.
Develop scripts for automation and configuration management using Shell, Ansible, and Python.
Ensure the health and performance of storage technologies, especially NetApp ONTAP.
Collaborate on CI/CD pipeline implementations using tools like Jenkins and GitLab CI.

Requirements

Proficiency in Linux OS, particularly RHEL/CentOS.
Hands-on experience with Kubernetes, specifically using Rancher for cluster management.
Solid understanding of storage technologies, especially NetApp ONTAP.
Strong scripting skills in Shell, Ansible, and Python for automation and configuration management.
Experience with monitoring and performance tools such as Dynatrace, Apica, and Grafana.
Proficiency in working with both SQL and NoSQL databases.
Familiarity with CI/CD pipelines and DevOps tools (Jenkins, GitLab CI, etc.).
Cloud platform experience, particularly with AWS (EC2, S3, RDS, Lambda, etc.).
Strong problem-solving skills and the ability to troubleshoot complex systems.
