Goldman Sachs - Dallas, TX
posted 2 months ago
As a Site Reliability Engineer at Goldman Sachs, you will hold the title of Vice President and work in the Engineering Division on the Scheduling Platform team. The role is based in Dallas, Texas, and is responsible for the reliability and scalability of the Procmon Platform, which schedules tens of millions of jobs daily for business units including Global Banking & Markets and Asset & Wealth Management. You will manage technical operations for systems that span hundreds of thousands of compute cores, keeping the infrastructure robust enough to meet the demands of a fast-paced financial environment.

In this position, you will build observability for new deployments so that systems are robust from day one, and identify areas for improvement in mature deployments. You will troubleshoot and resolve complex issues involving block devices, file descriptors, and packet loss, lead real-time outage investigations, and present postmortems to senior management. You will also define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), collaborating closely with development teams to ensure that systems are well designed and instrumented for performance and reliability.

Your responsibilities will also include planning and managing deployments and migrations, implementing robust business continuity and security programs, and providing regional coverage for the Procmon Platform, including participation in on-call support. The role requires a proactive approach to problem-solving and a deep understanding of the technical landscape in which Goldman Sachs operates, particularly in a highly regulated financial services environment.