Cribl - Washington, DC

posted 6 months ago

Full-time - Senior
Remote - Washington, DC

About the position

Cribl is on a mission to unlock the value of all observability data, and we are seeking a Staff Site Reliability Engineer (SRE) to join our dynamic team. As a remote-first company, we empower our employees to perform their best work from anywhere. In this role, you will be part of a collaborative engineering organization dedicated to creating, deploying, testing, and shipping high-quality software that meets the needs of our customers. You will have the opportunity to work with some of the biggest names in the industry, helping them solve their most pressing data challenges.

As a Staff Site Reliability Engineer, you will engage with various teams to enhance service delivery and reliability throughout the entire lifecycle of our products. Your responsibilities will include measuring and monitoring production systems to ensure availability, latency, and overall system health. You will investigate the root causes of errors and instability in our production cloud services, driving teams towards operational excellence. Your role will also involve collaborating with product and platform teams to advocate for changes that improve reliability, resilience, and observability.

We are looking for individuals who are passionate about reliability and have strong opinions on how to improve systems. You will be involved in all aspects of our cloud services, from conception to design to development and production. If you enjoy fixing things and have a creative approach to reducing toil through innovation and automation, this position is for you. You will also have on-call responsibilities, ensuring that our systems remain reliable and efficient.

Responsibilities

  • Engage with teams to improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with a focus on availability, latency, and overall system health
  • Investigate the causes of errors and instability in production cloud services and drive teams towards operational excellence
  • Collaborate with product and platform teams to advocate for changes that enhance reliability, resilience, and observability
  • Identify and reduce toil through creative innovation and automation
  • Participate in on-call responsibilities to maintain system reliability

Requirements

  • Extensive experience with enterprise-scale continuous delivery environments
  • 8+ years of experience in a DevOps or SRE role
  • Proficient in development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with infrastructure-as-code and configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible
  • Knowledge of sustainable incident response in a blameless environment
  • Familiarity with cloud platforms (preferably AWS) and container orchestration technologies
  • Experience with APM and Observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, Sentry, etc.
  • Background in Linux Systems Engineering
  • Experience with incident response tools like PagerDuty, FireHydrant, or Blameless
  • Ability to work autonomously in a distributed team environment

Nice-to-haves

  • Knowledge of cloud and application security
  • Strong understanding of cloud design patterns for scale, data management, and resiliency
  • A passion for high quality and a knack for testing
  • Opinions about dashboards, metrics, and SLOs

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • Short-term disability insurance
  • Life insurance
  • Paid holidays
  • Paid time off
  • Fertility treatment benefit
  • 401(k) plan
  • Equity options
  • Eligibility for a discretionary company-wide bonus