Site Reliability Engineer

HiveWatch • Posted 25 days ago

$120,000 - $140,000/Yr

Full-time • Mid Level

El Segundo, CA

About the position

As a Site Reliability Engineer (SRE), you will bridge the gap between development and operations by deploying updates to HiveWatch products, serving in an on-call rotation, and ensuring our services are reliable, scalable, and performant. You'll play a critical role in designing, implementing, and maintaining our infrastructure while collaborating closely with operations engineers and software engineers to solve complex technical challenges.

Responsibilities

Plan and execute software deployments to production environments with minimal customer impact
Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
Respond to alerts and resolve incidents within defined SLA timeframes
Develop and maintain Terraform and GitHub Actions automation for deployment processes
Manage feature flags and configuration changes during deployments
Communicate deployment status to stakeholders when necessary
Maintain deployment documentation and runbooks
Design, build, and maintain scalable and reliable infrastructure
Implement and manage CI/CD pipelines for efficient and reliable deployments
Monitor system performance and respond to incidents in a timely manner
Collaborate with development teams to optimize application performance
Implement security best practices and ensure compliance with SOC 2 and other security requirements
Document processes, configurations, and troubleshooting procedures
Continuously improve system reliability through post-incident reviews and proactive improvements
Develop and improve monitoring and alerting based on operational experience
Balance on-call responsibilities with regular work duties

Requirements

Bachelor's degree in Computer Science or related field, or equivalent practical experience
3+ years of experience in systems administration, DevOps, or similar roles
Strong knowledge of the AWS cloud platform
Expertise in managing AWS resources with Terraform
Proficiency in scripting languages (Python, Bash, etc.)
Familiarity with containerization and orchestration (Docker, Kubernetes)
Experience with monitoring and logging systems (Prometheus, Grafana)
Knowledge of networking concepts and security principles
Excellent problem-solving skills and attention to detail
Strong communication skills and ability to work in a collaborative environment