Varo • posted about 1 month ago
$180,000 - $220,000/Yr
Full-time • Mid Level
New York City, NY
Credit Intermediation and Related Activities

About the position

As a Staff Site Reliability Engineer (SRE), you will play a pivotal role in ensuring the reliability, scalability, and performance of our cloud-based services. You will drive best practices, contribute to the design and implementation of robust cloud infrastructure, and actively participate in shaping the technical roadmap.

Responsibilities

  • Lead and mentor a team of SREs, driving best practices and fostering a culture of reliability and performance
  • Provide strategic direction in the design, implementation, and management of scalable and resilient cloud-based infrastructure on AWS
  • Oversee the implementation and optimization of observability solutions using OpenTelemetry for distributed tracing and monitoring
  • Supervise the use of Prometheus and Grafana for effective monitoring and visualization of system metrics and health
  • Manage the design and implementation of service meshes with Istio to ensure secure and reliable service communication
  • Develop and enforce SRE best practices, including incident response, post-mortem analysis and capacity planning
  • Collaborate with development teams and service owners to ensure alignment with reliability and performance goals
  • Contribute to the technical roadmaps and Objectives and Key Results (OKRs) by providing leadership and insights on improving system reliability, scalability, and performance
  • Write and maintain infrastructure as code for core systems (Terraform, Terraform modules, and Kubernetes Helm charts); build and maintain CI/CD pipelines
  • Collaborate with development teams to implement and improve SLIs and SLOs for their services and to promote service ownership
  • Automate operational tasks to save time and improve accuracy
  • Write clean and scalable scripts, software, and systems to manage platform infrastructure and applications

Requirements

  • Minimum 12 years of experience as a Site Reliability, DevOps, or Software Engineer with proficiency in one or more high-level languages (such as Python, Go, Ruby, Java, or JavaScript)
  • Proven leadership with demonstrated experience in SRE team settings, with a focus on driving and architecting projects
  • Expert Linux and troubleshooting skills
  • Experience in building and supporting high-availability cloud environments in AWS
  • Expertise in infrastructure as code (IaC) and deployment automation with tools such as Terraform, Helm, GitLab, or equivalent
  • Experience running Kubernetes and Istio in production
  • Advanced observability skills with monitoring, logging, and tracing tools such as Prometheus, Grafana, Jaeger/Tempo, ELK/Loki, and OpenTelemetry
  • Experience instrumenting code (Java/Kotlin, Python, Go, etc.) and creating simple instrumentation frameworks for developers to adopt where auto-instrumentation may fall short
  • Willingness to participate in an on-call rotation for after-hours production infrastructure incidents
  • Experience with SDLC, CI/CD, and related tooling

Nice-to-haves

  • Experience with Kafka and message streaming

Benefits

  • $180,000 - $220,000 a year

Job Keywords

Hard Skills
  • GitLab
  • Istio
  • OpenTelemetry
  • Prometheus
  • Python
© 2024 Teal Labs, Inc