Workday - McLean, VA
posted 4 months ago
At Workday, we are looking for a dedicated DevOps/Site Reliability Engineer (SRE) who is passionate about automating, operating, and improving our pioneering cloud-native service platforms. This role supports one or more contracts with the U.S. Federal Government, which require all personnel working on them to be U.S. citizens.
The primary function of our DevOps/SRE team is to ensure the reliability and availability of our platform: meeting Service Level Agreements (SLAs), reducing operational load, and scaling sustainably in line with business growth. As a key member of the team, you will work across software engineering and operations, focusing on reducing operational toil and improving the overall customer experience. Our team operates in a scrum environment, planning automation and improvements in two-week sprints. We are autonomous, with a follow-the-sun on-call rotation that provides continuous support. Our tech stack is entirely cloud-native and includes Kubernetes, Istio, OPA, GoLang, Ruby/Groovy, ArgoCD, Jenkins, Prometheus, and Grafana.
You will be responsible for ensuring that changes to customer environments are safe and that those environments stay reliable, implementing multi-stage deployment automation gated on Service Level Objectives (SLOs), and improving platform reliability and observability. This includes developing effective Service Level Indicators (SLIs) to verify that SLOs are met, building an extensible observability architecture, and establishing new processes in collaboration with platform service teams.
We are looking for someone who is passionate about identifying and solving problems in distributed environments, particularly problems that span configuration, Linux operating systems, and networks. You should have hands-on experience with distributed environments, especially Kubernetes, and a strong belief in the importance of automation for operating large-scale systems.
Your drive for customer success will be key in this role, as will your ability to work both independently and in collaboration with diverse global teams. Excellent documentation skills and experience developing detailed runbooks and processes are also essential.