Cribl - Lansing, MI

posted 2 months ago

Full-time - Mid Level
Remote - Lansing, MI

About the position

Cribl is a forward-thinking company that values both productivity and a light-hearted work environment. As a remote-first organization, we empower our employees to excel in their roles from anywhere. We are seeking a Senior Site Reliability Engineer (SRE) to join our mission of unlocking the value of observability data. You will work with a team of engineers dedicated to delivering high-quality software in a fun and collaborative atmosphere.

In this position, you will be integral to the engineering organization, contributing to the envisioning, creation, deployment, testing, and shipping of Cribl products. You will be part of a transformative technology that gives customers more control over their observability data.

You will engage with teams across the product lifecycle to improve service delivery and reliability. You will measure and monitor production systems, focusing on availability, latency, and overall system health, and identify and address the root causes of errors and instability in our cloud services. As a Senior SRE, you will collaborate with product and platform teams to advocate for improvements in reliability, resilience, and observability, and drive down operational toil through innovative solutions and automation. The role includes on-call responsibilities.

Responsibilities

  • Engage with teams to improve the delivery and reliability of their services across the entire service lifecycle.
  • Measure and monitor all production systems with a focus on availability, latency, and overall system health.
  • Identify the causes of errors and instability in production cloud services and drive teams towards operational excellence.
  • Collaborate with product and platform teams to improve and evolve systems by advocating for changes that enhance reliability, resilience, and observability.
  • Identify and reduce operational toil through creative innovation and automation.
  • Participate in on-call responsibilities.

Requirements

  • Extensive experience with enterprise-scale continuous delivery environments.
  • 5+ years of experience in a DevOps or SRE role.
  • Proficiency in development with JavaScript/Node.js/TypeScript in a Linux/Mac environment.
  • Experience with infrastructure-as-code and configuration management tools such as Terraform (preferred), Puppet, Chef, or Ansible.
  • Knowledge of sustainable incident response in a blameless environment.
  • Familiarity with cloud platforms, preferably AWS, and container orchestration technologies.
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry.
  • Background in Linux Systems Engineering.
  • Experience with incident response tools like PagerDuty, FireHydrant, or Blameless.
  • Ability to work autonomously in a distributed team.

Nice-to-haves

  • Knowledge of cloud and application security.
  • Strong understanding of cloud design patterns for scale, data management, and resiliency.
  • A passion for high-quality software and a knack for testing.
  • Strong opinions about dashboards, metrics, and SLOs.

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • Short-term disability insurance
  • Life insurance
  • Paid holidays
  • Paid time off
  • Fertility treatment benefit
  • 401(k) plan
  • Equity options
  • Eligibility for discretionary company-wide bonus