The Site Reliability Engineer (SRE) position in San Jose, CA, is a critical role that combines software engineering and systems engineering to build and run large-scale, distributed, fault-tolerant systems. The SRE will be responsible for ensuring the reliability, availability, and performance of the company's services and infrastructure. This role requires a deep understanding of cloud computing, particularly AWS, and experience with container orchestration using Kubernetes. The SRE will also automate operational tasks and improve system performance through coding and scripting, primarily with Ansible and Python.

In this position, the SRE will work closely with development teams to design and implement monitoring solutions using tools such as Dynatrace, Apica, and Grafana. The goal is to identify and resolve issues proactively, before they impact customers. The SRE will also participate in on-call rotations, respond to incidents and outages, and contribute to post-mortem analyses to prevent recurrences. This role is essential for maintaining the high standards of service reliability that our customers expect.

The ideal candidate will have over 14 years of IT experience, with a strong background in coding and scripting, cloud computing, and monitoring tools. The SRE will be expected to work onsite 2-3 days a week, collaborating with team members to enhance the reliability and performance of our systems. This position offers an exciting opportunity to work with cutting-edge technologies and contribute to the overall success of the organization.