Staff Site Reliability Engineer (SRE)

$152,000 - $230,500/Yr

Cribl - Lansing, MI

posted 2 months ago

Full-time - Mid Level

Remote - Lansing, MI

About the position

Cribl is a forward-thinking company that values both productivity and a light-hearted work environment. As a remote-first organization, we empower our employees to excel in their roles from anywhere. We are currently seeking a Senior Site Reliability Engineer (SRE) to join our mission of unlocking the value of observability data. This role is crucial as it involves working with a team of technical engineers dedicated to delivering high-quality software while enjoying a fun and collaborative atmosphere.

In this position, you will be integral to the engineering organization, contributing to the envisioning, creation, deployment, testing, and shipping of Cribl products. You will have the opportunity to be part of a transformative technology that enhances customer control over their observability data. Your role will involve engaging with various teams to improve service delivery and reliability throughout the product lifecycle.

You will measure and monitor production systems, focusing on availability, latency, and overall system health, while also identifying and addressing the root causes of errors and instability in our cloud services. As a Senior SRE, you will advocate for improvements in reliability, resilience, and observability by collaborating with product and platform teams. You will also be responsible for driving down operational toil through innovative solutions and automation. On-call responsibilities will be part of your role, ensuring that you are actively involved in maintaining the reliability of our systems.

Responsibilities

Engage with teams to improve service delivery and reliability across their entire lifecycle.
Measure and monitor all production systems with a focus on availability, latency, and overall system health.
Identify the causes of errors and instability in production cloud services and drive teams towards operational excellence.
Collaborate with product and platform teams to improve and evolve systems by advocating for changes that enhance reliability, resilience, and observability.
Identify and reduce operational toil through creative innovation and automation.
Participate in on-call responsibilities.

Requirements

Extensive experience with enterprise-scale continuous delivery environments.
5+ years of experience in a DevOps or SRE role.
Proficiency in development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
Experience with infrastructure-as-code and configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
Knowledge of sustainable incident response in a blameless environment.
Familiarity with cloud platforms, preferably AWS, and container orchestration technologies.
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
Background in Linux Systems Engineering.
Experience with incident response tools like PagerDuty, FireHydrant, or Blameless.
Ability to work autonomously in a distributed team.

Nice-to-haves

Knowledge of cloud and application security.
Strong understanding of cloud design patterns for scale, data management, and resiliency.
A passion for high-quality software and a knack for testing.
Strong opinions about dashboards, metrics, and SLOs.

Benefits

Health insurance
Dental insurance
Vision insurance
Short-term disability insurance
Life insurance
Paid holidays
Paid time off
Fertility treatment benefit
401(k) plan
Equity options
Eligibility for discretionary company-wide bonus
