Site Reliability Engineer

$176,800/Yr

Wolf Works - Mountain View, CA

posted 8 days ago

Full-time

About the position

The Site Reliability Engineer (SRE) designs, implements, and maintains complex data systems that support millions of customers. This hands-on position applies cloud-native principles to keep database systems highly available, secure, performant, and scalable, while working with cutting-edge technologies and maintaining critical infrastructure.

Responsibilities

  • Design, build, and maintain CI/CD pipelines in Jenkins.
  • Deploy services in Kubernetes clusters using Helm, Kustomize, and similar tools.
  • Implement infrastructure changes in AWS with a deep understanding of AWS services.
  • Participate in on-call duties for pre-production and production systems that support millions of users.
  • Write and review RCA (Root Cause Analysis) documentation to prevent the recurrence of incidents and share learnings.
  • Contribute to system upgrades, deployment automation, monitoring enhancements, and production changes.
  • Create operational playbooks, write how-to articles, and gain domain knowledge to drive team improvements.
  • Participate in FMEA (Failure Mode and Effects Analysis) testing, chaos testing, and security remediation efforts.
  • Share best practices for operational excellence and cost optimization.
  • Automate processes to reduce manual effort and increase efficiency.
  • Continuously look for opportunities to increase developer velocity and productivity.

Requirements

  • Bachelor's or master's degree in Computer Science or a related technical field, or equivalent experience.
  • 4+ years of hands-on experience with development and operations in AWS environments.
  • Expertise in performance monitoring, troubleshooting, and tuning.
  • Experience with AWS services and cloud hosting.
  • Proficiency in DevOps automation using scripting languages.
  • Experience with programming languages such as Java, Python, or Ruby.
  • Knowledge of Docker, Kubernetes, and ArgoCD.
  • Experience with monitoring and observability tools such as Splunk, Wavefront, AppDynamics, and Prometheus, as well as tracing tools.