System One - Bethesda, MD

posted 3 months ago

Full-time
Bethesda, MD
Administrative and Support Services

About the position

The Site Reliability Engineer (SRE) at ALTA IT Services is responsible for the reliability, performance, and stability of production systems. The SRE deploys builds into production and draws on a programming background to read and understand code; no code remediation is required. A significant part of the role is automating routine tasks, such as access management, to eliminate manual intervention. The SRE will also establish and enhance operational capabilities from the ground up so that the platform runs efficiently and effectively.

Beyond deployment and automation, the SRE triages and troubleshoots issues (for example, identifying the root causes of 403 errors), manages incidents, and oversees the development, testing, and staging environments. The role calls for a proactive mindset focused on automation and efficiency, using modern technologies and practices to improve system reliability and performance.

The ideal candidate has relevant education and experience in Site Reliability Engineering and a technical stack that includes AWS, Kubernetes, Python, Shell scripting, and GitHub Actions or Jenkins for automated test scripts. This is a contract-to-hire position, with a pathway to a permanent role for the right candidate.

Responsibilities

  • Deploy builds into production
  • Leverage programming background to read and understand code (no code remediation required)
  • Automate routine tasks to eliminate manual intervention (e.g., access management)
  • Ensure platform performance and stability
  • Establish and enhance operational capabilities from the ground up
  • Triage and troubleshoot issues (e.g., identify root causes of 403 errors)
  • Manage incidents effectively
  • Oversee development, testing, and staging environments

Requirements

  • Relevant education and experience in Site Reliability Engineering
  • Experience with AWS
  • Proficiency in Kubernetes
  • Strong programming skills in Python
  • Experience with Shell scripting
  • Familiarity with GitHub Actions or Jenkins for automated test scripts

Benefits

  • Health and welfare coverage options, including medical, dental, and vision
  • Spending accounts
  • Life insurance
  • Voluntary plans
  • Participation in a 401(k) plan