Splunk

posted 17 days ago

Full-time - Senior
Remote
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Senior Site Reliability Engineer at Splunk will play a crucial role in managing and improving the reliability and resiliency of SRE-managed services and infrastructure within FedRAMP environments. This position focuses on enabling developers to operate highly available, scalable, and cost-efficient applications while minimizing operational burdens. The role involves working with a team of engineers to deliver quality products, mentoring new engineers, and leading various reliability projects.

Responsibilities

  • Own Splunk Cloud Observability in FedRAMP environments.
  • Work across the organization to deliver quality products that delight Splunk's passionate users.
  • Collaborate with teams to build a state-of-the-art, cloud-based environment for massive-scale data processing.
  • Mentor new engineers to achieve more than they thought possible.
  • Work on reliability projects including high availability (HA), business continuity planning, disaster recovery, backup/restore, and RTO/RPO targets.
  • Implement chaos engineering practices.
  • Ensure application uptime and performance.
  • Manage capacity planning and monitoring.
  • Establish SLIs, SLOs, error budgets, and monitoring dashboards.
  • Oversee deployment and operations of large-scale distributed data stores and streaming services.
  • Establish design patterns for monitoring and benchmarking.
  • Document production runbooks and guidelines for developers.
  • Reduce toil through tooling, runbooks, and automation for production environments.
  • Manage incidents and improve MTTD/MTTR for services.
  • Optimize cloud costs.

Requirements

  • 7+ years of experience managing large-scale cloud-native microservices platforms.
  • 3+ years of strong hands-on experience deploying, managing, and monitoring large-scale Kubernetes clusters in the public cloud (AWS or GCP).
  • Experience with infrastructure automation and scripting using Python and/or Golang.
  • Experience developing, deploying, and maintaining Java services.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack.
  • Experience with deployment, operations, and performance management of large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, and Redis.
  • Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.

Nice-to-haves

  • AWS Solutions Architect certification.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certification.
  • Experience with Infrastructure-as-Code tools such as Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, and ARM.
  • Experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, Spinnaker, GitLab, Argo, and Artifactory.
  • Proven ability to work effectively across teams and functions to influence the design, operations, and deployment of highly available software.

Benefits

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • 401(k) plan and match
  • Paid time off
  • Incentive compensation
  • Equity or long-term cash awards