Lead Site Reliability Engineer

Visa • posted about 1 month ago

$160,600 - $232,900/Yr

Full-time • Senior

Ashburn, VA

About the position

The Lead Site Reliability Engineer (SRE) is a critical part of our Visa Cloud platform strategy. In this role, you will ensure that Visa’s development platform and processes let our software engineers focus more on innovation than on infrastructure. You will drive the adoption of observability best practices and build automation for resolving recurring issues. You must be comfortable working with software engineering teams and supporting their demanding needs to ensure the security, availability, and performance of the platform. This engineer must be capable of triaging issues on the front line as well as framing strategic initiatives from leadership. Hands-on keyboard work is a must for this role, with a focus on developing reliability engineering for the Visa Cloud Platform.

Responsibilities

Guide the instrumentation of monitoring for the Visa Cloud Platform (IaaS/PaaS/Container as a service)
Ensure the platform target SLAs are met and implement appropriate SLIs for supporting services
Work with developers during service transition, evaluating reliability and operability of the applications and ensuring adequate monitoring, alerting and observability
Partner with peers within Operations & Infrastructure supporting ongoing maintenance and enhancement of the platform
Set standards for automating routine tasks and workflows in support of the larger DevEx SRE team
Support multiple internal stakeholders with a variety of technical challenges
Analyze and discern patterns in the myriad of issues that arise and propose solutions to these problems

Requirements

10+ years of relevant work experience with a Bachelor’s Degree, or at least 7 years of work experience with an Advanced Degree (e.g. Master’s, MBA, JD, MD), or 4 years of work experience with a PhD, or 13+ years of relevant work experience.
Hands-on experience with Linux and Windows systems and a good understanding of distributed computing environments
Advanced-level programming and/or scripting in 3 or more of the following: Python, Java, Go, PowerShell, JavaScript, Terraform, Ansible, Helm, Chef, CloudFormation
3+ years of experience managing CI/CD tooling such as Jenkins, GitHub, Bitbucket, ArgoCD, Artifactory, and Azure DevOps in a large-scale environment
5+ years of experience managing observability tooling such as Grafana, Prometheus, Splunk, Datadog, New Relic, Dynatrace, Sentry, etc. in a large-scale environment
Advanced understanding of YAML, JSON, HTML, XML.
5+ years of work experience supporting relational and non-relational databases (MySQL, MongoDB, PostgreSQL, etc.), including creating and running queries, managing performance and scaling
Experience managing container infrastructure and supporting development teams' transformation to a container-first model
3 or more years in an SRE or Platform Engineering group supporting high-availability, critical platforms and applications
Exposure to virtualization (Hyper-V, VMware, SCVMM, etc.)
Experience managing a distributed container platform including but not limited to deployment/release management, provisioning, capacity management, workload management

Nice-to-haves

12 or more years of work experience with a Bachelor’s Degree, or 8-10 years of experience with an Advanced Degree (e.g. Master’s, MBA, JD, MD), or 6+ years of work experience with a PhD
Master’s Degree in IT, CS or related field and/or 10+ years relevant work experience