Staff Site Reliability Engineer (SRE)

$144,000 - $278,000/Yr

Cribl - Washington, DC

posted 6 months ago

Full-time - Senior

Remote - Washington, DC

About the position

Cribl is on a mission to unlock the value of all observability data, and we are seeking a Staff Site Reliability Engineer (SRE) to join our dynamic team. As a remote-first company, we empower our employees to perform their best work from anywhere. In this role, you will be part of a collaborative engineering organization dedicated to creating, deploying, testing, and shipping high-quality software that meets the needs of our customers. You will have the opportunity to work with some of the biggest names in the industry, helping them solve their most pressing data challenges.

As a Staff Site Reliability Engineer, you will engage with various teams to enhance service delivery and reliability throughout the entire lifecycle of our products. Your responsibilities will include measuring and monitoring production systems to ensure availability, latency, and overall system health. You will investigate the root causes of errors and instability in our production cloud services, driving teams towards operational excellence. Your role will also involve collaborating with product and platform teams to advocate for changes that improve reliability, resilience, and observability.

We are looking for individuals who are passionate about reliability and have strong opinions on how to improve systems. You will be involved in all aspects of our cloud services, from conception to design to development and production. If you enjoy fixing things and have a creative approach to reducing toil through innovation and automation, this position is for you. You will also have on-call responsibilities, ensuring that our systems remain reliable and efficient.

Responsibilities

Engage with teams to improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with a focus on availability, latency, and overall system health
Investigate the causes of errors and instability in production cloud services and drive teams towards operational excellence
Collaborate with product and platform teams to advocate for changes that enhance reliability, resilience, and observability
Identify and reduce toil through creative innovation and automation
Participate in on-call responsibilities to maintain system reliability

Requirements

Extensive experience with enterprise-scale continuous delivery environments
8+ years of experience in a DevOps or SRE role
Proficient in development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with infrastructure-as-code and configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
Knowledge of sustainable incident response in a blameless environment
Familiarity with cloud platforms (preferably AWS) and container orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools like PagerDuty, FireHydrant, or Blameless
Ability to work autonomously in a distributed team environment

Nice-to-haves

Knowledge of cloud and application security
Strong understanding of cloud design patterns for scale, data management, and resiliency
A passion for high quality and a knack for testing
Opinions about dashboards, metrics, and SLOs

Benefits

Health insurance
Dental insurance
Vision insurance
Short-term disability insurance
Life insurance
Paid holidays
Paid time off
Fertility treatment benefit
401(k) plan
Equity options
Eligibility for a discretionary company-wide bonus
