Splunk

posted 2 months ago

Full-time - Senior
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Infrastructure Software Engineer role at Splunk focuses on building and maintaining a sophisticated cloud-scale, big data, and microservices platform. The position emphasizes automation, reliability engineering, and infrastructure-as-code to enhance the operational efficiency of SRE-managed services. Engineers in this role will design new services, mentor junior engineers, and work on various reliability projects to ensure high availability and performance of applications.

Responsibilities

  • Design new services, tools, and monitoring to be implemented by the entire team.
  • Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
  • Mentor new engineers, helping them achieve more than they thought possible.
  • Work on reliability projects, including high availability (HA), business continuity planning, disaster recovery, backup/restore, and recovery time/point objectives (RTO/RPO).
  • Implement chaos engineering practices.
  • Manage application uptime and performance.
  • Conduct capacity management & planning.
  • Establish SLIs, SLOs, error budgets, and monitoring dashboards.
  • Deploy and operate large-scale distributed data stores and streaming services.
  • Establish design patterns for monitoring and benchmarking.
  • Document production runbooks and guidelines for developers.
  • Reduce toil in production environments through tooling, runbooks, and automation.
  • Manage incident response and improve MTTD/MTTR (mean time to detect/repair) for services.
  • Optimize cloud costs.
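The SLO and error-budget work in the bullets above ultimately reduces to simple arithmetic. A minimal Python sketch, where the SLO value, period, and function names are illustrative assumptions rather than anything from the posting:

```python
# Minimal error-budget arithmetic for an availability SLO.
# All numbers and names here are illustrative, not Splunk specifics.

def error_budget(slo: float, period_minutes: int) -> float:
    """Allowed downtime (minutes) in a period for a given availability SLO."""
    return period_minutes * (1.0 - slo)

def budget_remaining(slo: float, period_minutes: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    budget = error_budget(slo, period_minutes)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over a 30-day window allows about 43.2 minutes of downtime.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
print(round(error_budget(0.999, MONTH_MINUTES), 1))            # 43.2
print(round(budget_remaining(0.999, MONTH_MINUTES, 21.6), 2))  # 0.5
```

In practice these numbers would be derived from an SLI measured by a monitoring stack such as Prometheus or Splunk, but the budget math itself is this simple.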

Requirements

  • 7+ years of SRE experience operating large-scale cloud-native microservices platforms.
  • 3+ years of strong hands-on experience deploying, operating, and monitoring large-scale Kubernetes clusters in the public cloud (AWS or GCP).
  • Experience with infrastructure automation and scripting using Python and/or Bash.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, the ELK stack, etc.
  • Experience with deployment, operations, and performance management of large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
  • Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
  • Candidates must be US citizens or permanent residents and must reside in the United States.

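Several of these requirements (Python scripting, triaging, and the MTTR tracking mentioned under responsibilities) often come together in small internal tools. A hedged sketch of one such tool, assuming a simple (start, resolved) timestamp pair per incident rather than any particular incident tracker's schema:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to repair, in minutes, over (start, resolved) datetime pairs.

    The (start, resolved) tuple shape is an assumed record format for
    illustration, not a specific incident-tracker schema.
    """
    durations = [
        (resolved - start).total_seconds() / 60
        for start, resolved in incidents
    ]
    return sum(durations) / len(durations)

# Two example incidents: one resolved in 30 minutes, one in 60.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 30)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 15, 0)),
]
print(mttr_minutes(incidents))  # 45.0
```

A real version would pull incident records from a ticketing or paging system and track MTTD separately (detection time versus resolution time), but the aggregation logic is the same.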
Nice-to-haves

  • AWS Solutions Architect certification preferred.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications are preferred.
  • Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
  • Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
  • Proven skills to effectively work across teams and functions to influence the design, operations, and deployment of highly available software.
  • Experience handling cloud infrastructure and operations in strict security, compliance, and regulatory environments such as FedRAMP.
  • Bachelor's/Master's in Computer Science, Engineering, or a related technical field, or equivalent practical experience.