System One - Bethesda, MD
posted 3 months ago
The Site Reliability Engineer (SRE) at ALTA IT Services is responsible for the reliability, performance, and stability of production systems. The SRE deploys builds into production environments and draws on a programming background to read and understand code; no code remediation is required.

A significant part of the role is automating routine tasks to eliminate manual intervention, particularly in areas such as access management. The SRE will also establish and enhance operational capabilities from the ground up so that the platform runs efficiently.

Beyond deployment and automation, the SRE triages and troubleshoots issues and identifies the root causes of errors such as 403 errors. Incident management is central to the role: the SRE oversees the development, testing, and staging environments and keeps them functioning properly.

The ideal candidate has relevant education and experience in Site Reliability Engineering, a proactive mindset focused on automation and efficiency, and a technical stack that includes AWS, Kubernetes, Python, shell scripting, and GitHub Actions or Jenkins for automated test scripts.

This is a contract-to-hire position, offering a path to a permanent role for the right candidate.