Splunk

posted 2 months ago

Full-time - Senior
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Infrastructure Software Engineer role at Splunk focuses on building and maintaining a sophisticated cloud-scale, big data, and microservices platform. The position emphasizes automation, reliability engineering, and infrastructure-as-code to enhance the operational efficiency of SRE-managed services. Engineers in this role will design new services, mentor junior engineers, and work on various reliability projects to ensure high availability and performance of applications.

Responsibilities

  • Design new services, tools, and monitoring to be implemented by the entire team.
  • Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
  • Mentor new engineers, helping them achieve more than they thought possible.
  • Work on reliability projects, including high availability (HA), business continuity planning, disaster recovery, backup/restore, and recovery time/point objectives (RTO/RPO).
  • Implement chaos engineering practices.
  • Manage application uptime and performance.
  • Conduct capacity management & planning.
  • Establish SLIs, SLOs, error budgets, and monitoring dashboards.
  • Deploy and operate large-scale distributed data stores and streaming services.
  • Establish design patterns for monitoring and benchmarking.
  • Document production runbooks and guidelines for developers.
  • Reduce toil in production environments through tooling, runbooks, and automation.
  • Manage incident response and improve MTTD/MTTR (mean time to detect/repair) for services.
  • Optimize cloud costs.
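The SLO and error-budget work in the bullets above ultimately reduces to simple arithmetic. A minimal Python sketch, where the SLO value, period, and function names are illustrative assumptions rather than anything from the posting:

```python
# Minimal error-budget arithmetic for an availability SLO.
# All numbers and names here are illustrative, not Splunk specifics.

def error_budget(slo: float, period_minutes: int) -> float:
    """Allowed downtime (minutes) in a period for a given availability SLO."""
    return period_minutes * (1.0 - slo)

def budget_remaining(slo: float, period_minutes: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    budget = error_budget(slo, period_minutes)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over a 30-day window allows about 43.2 minutes of downtime.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
print(round(error_budget(0.999, MONTH_MINUTES), 1))            # 43.2
print(round(budget_remaining(0.999, MONTH_MINUTES, 21.6), 2))  # 0.5
```

In practice these numbers would be derived from an SLI measured by a monitoring stack such as Prometheus or Splunk, but the budget math itself is this simple.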

Requirements

  • 7+ years of SRE experience operating large-scale cloud-native microservices platforms.
  • 3+ years of strong hands-on experience deploying, operating, and monitoring large-scale Kubernetes clusters in the public cloud (AWS or GCP).
  • Experience with infrastructure automation and scripting using Python and/or Bash.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, the ELK stack, etc.
  • Experience with deployment, operations, and performance management of large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
  • Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
  • Candidates must be US citizens or permanent residents and must reside in the United States.

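Several of these requirements (Python scripting, triaging, and the MTTR tracking mentioned under responsibilities) often come together in small internal tools. A hedged sketch of one such tool, assuming a simple (start, resolved) timestamp pair per incident rather than any particular incident tracker's schema:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to repair, in minutes, over (start, resolved) datetime pairs.

    The (start, resolved) tuple shape is an assumed record format for
    illustration, not a specific incident-tracker schema.
    """
    durations = [
        (resolved - start).total_seconds() / 60
        for start, resolved in incidents
    ]
    return sum(durations) / len(durations)

# Two example incidents: one resolved in 30 minutes, one in 60.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 30)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 15, 0)),
]
print(mttr_minutes(incidents))  # 45.0
```

A real version would pull incident records from a ticketing or paging system and track MTTD separately (detection time versus resolution time), but the aggregation logic is the same.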
Nice-to-haves

  • AWS Solutions Architect certification preferred.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications are preferred.
  • Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
  • Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
  • Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
  • Experience handling cloud infrastructure and operations in strict security, compliance, and regulatory environments such as FedRAMP.
  • Bachelor's/Master's in Computer Science, Engineering, or a related technical field, or equivalent practical experience.