Site Reliability Engineer

$135,000 - $350,000/Yr

Alchemy.Agency - New York, NY

posted 11 days ago

Full-time - Mid Level
New York, NY
Professional, Scientific, and Technical Services

About the position

As a Site Reliability Engineer at Alchemy, you will play a crucial role in enhancing developer productivity and ensuring the reliability of our globally used developer platform. You will collaborate with the engineering team to design, deploy, and continuously improve the infrastructure that supports our onchain products, focusing on high reliability, low latency, and cost efficiency.

Responsibilities

  • Set high standards for Reliability at Alchemy
  • Develop and own company-wide Reliability best practices such as SLO definition, incident management, postmortem reviews, launch readiness reviews, and change management
  • Architect production infrastructure and tools that encourage and enforce high reliability
  • Inspire the broader engineering organization to ensure Reliability is a first-class citizen in the products we build
  • Collaborate with, advise, review, and mentor engineering teams on Reliability topics such as high-reliability architecture, observability, and safe change management
  • Improve the critical infrastructure and systems used to operate at scale (e.g., compute, networking, deployment, observability, code tooling/libraries)
  • Develop and own best practices for managing production infrastructure: provisioning, application scaling, configuration management, capacity planning, monitoring, etc.
  • Develop and own best practices for developer processes: CI/CD, dev and staging environments, etc.
  • Provide input into long-term platform requirements and operational guidelines with a focus on reliability
  • Continuously raise our standard of engineering excellence by implementing best practices for coding, testing, and deployment
  • Build and maintain documentation around process and workflows

Requirements

  • 6+ years of experience as an Infrastructure Engineer focused on Reliability (e.g., Site Reliability Engineer, Production Engineer, Platform Engineer)
  • Experience leading and driving company-wide reliability efforts and engineering initiatives
  • Experience with observability best practices and tooling like Prometheus, Grafana and Datadog
  • Experience designing and operating large-scale, multi-region production systems
  • Experience working with AWS or other cloud infrastructures
  • Experience with container schedulers and runtimes such as Kubernetes and Docker
  • Experience building deployment pipelines leveraging common CI/CD and GitOps tools (e.g., Argo, Flux)
  • Experience with Infrastructure-as-Code (e.g., Terraform, Pulumi, Chef, Puppet)
  • Strong communication and collaboration skills

Nice-to-haves

  • Experience running production services on bare metal
  • Experience with TypeScript and Python
  • Excellent understanding of web applications and architecture

Benefits

  • Comprehensive medical, dental, and vision coverage
  • 401k
  • Unlimited flexible time off