Steampunk - McLean, VA

posted 2 months ago

Full-time - Mid Level
McLean, VA
Food Services and Drinking Places

About the position

The Sr. Site Reliability Engineer (SRE) at Steampunk is responsible for ensuring the reliability, availability, and performance of cloud infrastructure and large-scale software systems. This role combines software engineering with traditional operations to minimize downtime and mitigate potential failures, while collaborating with various teams to implement best practices in reliability and performance optimization.

Responsibilities

  • Conduct in-depth analyses of infrastructure to identify areas for improvement in performance, scalability, and resource utilization.
  • Define and implement key reliability metrics, service-level objectives (SLOs), and service-level indicators (SLIs) to measure system health.
  • Design and implement automation tools to reduce manual toil and enhance operational efficiency.
  • Collaborate with software development teams to optimize performance and resilience of services through code improvements and architectural enhancements.
  • Forecast capacity requirements and implement strategies for auto-scaling and load balancing.
  • Provide mentorship and training on SRE principles to cross-functional teams.
  • Lead the development and implementation of incident response procedures to minimize user impact.
  • Monitor systems to maintain visibility into performance, health, and availability.

Requirements

  • Master's degree and 8 years of experience, or Bachelor's degree and 10 years of IT experience.
  • Eligible to obtain and maintain a government security clearance.
  • Knowledge and experience with Agile and DevSecOps methodologies.
  • Experience in system engineering in areas including telecommunications, computer languages, operating systems, and database management systems.
  • Experience with source code and binary repository products (e.g., GitHub, GitLab).
  • Experience with infrastructure and cloud management tools (e.g., AWS CloudWatch).
  • Experience with log management and analysis tools (e.g., Splunk).
  • Experience with automation and configuration management tools (e.g., Terraform, Puppet).

Nice-to-haves

  • Knowledge and experience with New Relic and/or other AIOps platforms.
  • Programming skills in JavaScript, Ruby, and/or Go.
  • Experience with Nginx, HAProxy, Docker, Kubernetes, or similar technologies.
  • Experience with messaging systems and collaboration software.
  • Experience with Linux and Windows operating systems, along with scripting tools (e.g., Bash, PowerShell).
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).

Benefits

  • Telework/flex scheduling
  • Health, dental, and vision insurance upon hire
  • Paid time off with sell-back benefit and carryover option
  • 11 Federal Holidays
  • 100% paid military leave
  • 100% 401(k) plan match upon hire
  • Professional development/education reimbursement
  • Flexible spending accounts