Site Reliability Engineer, VP - Scheduling Platform Dallas - Vice President

Goldman Sachs - Dallas, TX

posted 2 months ago

Full-time - Senior

Securities, Commodity Contracts, and Other Financial Investments and Related Activities

About the position

As a Site Reliability Engineer at Goldman Sachs, you will hold the title of Vice President in the Engineering Division, working on the Scheduling Platform team in Dallas, Texas. The role is central to the reliability and scalability of the Procmon platform, which schedules tens of millions of jobs daily across business units including Global Banking & Markets and Asset & Wealth Management. You will manage technical operations for systems that handle hundreds of thousands of compute cores, keeping the infrastructure capable of meeting the demands of a fast-paced financial environment.

You will build observability for new deployments so that systems are robust from day one, and identify areas for improvement in mature deployments. You will troubleshoot and resolve complex issues involving block devices, file descriptors, and packet loss, lead real-time outage investigations, and present postmortems to senior management. You will also define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), collaborating closely with development teams so that systems are well designed and instrumented for performance and reliability.

Your responsibilities also include planning and managing deployments and migrations, implementing robust business continuity and security programs, and providing regional coverage for the Procmon platform, including on-call support. The role requires a proactive approach to problem-solving and a deep understanding of the technical landscape in which Goldman Sachs operates, particularly as a highly regulated financial services environment.

Responsibilities

Own technical operations for systems that manage hundreds of thousands of compute cores
Build observability for new deployments to ensure robustness from day one, as well as mature deployments to identify and implement improvements
Troubleshoot and resolve issues with block devices, file descriptors, and packet loss
Lead real-time outage investigations and present postmortems to senior management
Define SLIs and SLOs and partner with development teams to ensure systems are sufficiently well designed and instrumented
Partner with our development team throughout development and operations
Plan and manage deployments and migrations (including end-of-life programs)
Plan and implement robust business continuity and security programs
Provide regional coverage for the Procmon platform and participate in on-call support

Requirements

5+ years of relevant professional experience
3+ years of experience with Linux fundamentals and system administration
3+ years of networking experience (familiarity with TCP/IP, IP routing, firewalls, secure tunneling protocols)
3+ years of experience working with distributed computing systems and cloud computing environments
Excellent problem-solving and automation skills
Proficiency in at least one programming language; the team uses a mix of Go, Python and Erlang
Able to operate effectively in a mission-critical, highly regulated financial services environment

Benefits

Training and development opportunities
Firmwide networks
Wellness programs
Personal finance offerings
Mindfulness programs
