Visa • posted about 1 month ago
$160,600 - $232,900/Yr
Full-time • Senior
Ashburn, VA

About the position

The Lead Site Reliability Engineer (SRE) is a critical part of our Visa Cloud Platform strategy. In this role, you will ensure that Visa’s development platform and processes let our software engineers focus more on innovation than on infrastructure. You will drive the adoption of observability best practices and build automation to resolve recurring issues. You must be comfortable working with software engineering teams and supporting their demanding needs to ensure the security, availability and performance of the platform. This engineer must be capable of triaging issues on the front line as well as framing strategic initiatives from leadership. Being hands-on-keyboard is a must for this role, with a focus on developing reliability engineering for the Visa Cloud Platform.

Responsibilities

  • Guide the instrumentation of monitoring for the Visa Cloud Platform (IaaS/PaaS/Container as a service)
  • Ensure the platform’s target SLAs are met and implement appropriate SLIs for supporting services
  • Work with developers during service transition, evaluating reliability and operability of the applications and ensuring adequate monitoring, alerting and observability
  • Partner with peers within Operations & Infrastructure supporting ongoing maintenance and enhancement of the platform
  • Set standards for automating routine tasks and workflows in support of the larger DevEx SRE team
  • Support multiple internal stakeholders with a variety of technical challenges
  • Analyze and discern patterns in the myriad of issues that arise and propose solutions to these problems

Requirements

  • 10+ years of relevant work experience with a Bachelor’s Degree or at least 7 years of work experience with an Advanced degree (e.g. Masters, MBA, JD, MD) or 4 years of work experience with a PhD, OR 13+ years of relevant work experience.
  • Hands-on experience with Linux and Windows systems and a good understanding of distributed computing environments.
  • Advanced-level programming and/or scripting in 3 or more of the following: Python, Java, Go, PowerShell, JavaScript, Terraform, Ansible, Helm, Chef, CloudFormation
  • 3+ years of experience managing CI/CD tooling such as Jenkins, GitHub, Bitbucket, ArgoCD, Artifactory, Azure DevOps in a large-scale environment
  • 5+ years of experience managing observability tooling such as Grafana, Prometheus, Splunk, Datadog, New Relic, Dynatrace, Sentry, etc. in a large-scale environment
  • Advanced understanding of YAML, JSON, HTML, XML.
  • 5+ years of work experience supporting relational and non-relational databases (MySQL, MongoDB, PostgreSQL, etc.), including creating and running queries, managing performance and scaling
  • Experience managing container infrastructure and supporting development transformation to a container-first model
  • 3 or more years in an SRE or Platform Engineering group supporting high-availability, critical platforms or applications
  • Exposure to virtualization (Hyper-V, VMware, SCVMM, etc.)
  • Experience managing a distributed container platform including but not limited to deployment/release management, provisioning, capacity management, workload management

Nice-to-haves

  • 12 or more years of work experience with a Bachelor’s Degree or 8-10 years of experience with an Advanced Degree (e.g. Masters, MBA, JD, MD) or 6+ years of work experience with a PhD
  • Master’s Degree in IT, CS or related field and/or 10+ years relevant work experience

Benefits

  • Medical
  • Dental
  • Vision
  • 401(k)
  • FSA/HSA
  • Life Insurance
  • Paid Time Off
  • Wellness Program