Goldman Sachs - Dallas, TX

posted 2 months ago

Full-time - Senior
Dallas, TX
Securities, Commodity Contracts, and Other Financial Investments and Related Activities

About the position

As a Site Reliability Engineer at Goldman Sachs, you will hold the title of Vice President and be a key player in the Engineering Division, specifically within the Scheduling Platform team. This role is based in Dallas, Texas, and is pivotal in ensuring the reliability and scalability of the Procmon Platform, which is responsible for scheduling tens of millions of daily jobs across various business units including Global Banking & Markets and Asset & Wealth Management. Your work will involve managing technical operations for systems that handle hundreds of thousands of compute cores, ensuring that our infrastructure is robust and capable of meeting the demands of a fast-paced financial environment.

In this position, you will be tasked with building observability for new deployments, ensuring that systems are robust from day one, and identifying areas for improvement in mature deployments. You will troubleshoot and resolve complex issues related to block devices, file descriptors, and packet loss, and lead real-time outage investigations, presenting postmortems to senior management. Additionally, you will define Service Level Indicators (SLIs) and Service Level Objectives (SLOs), collaborating closely with development teams to ensure that systems are well-designed and instrumented for performance and reliability.

Your responsibilities will also include planning and managing deployments and migrations, implementing robust business continuity and security programs, and providing regional coverage for the Procmon platform, which includes participating in on-call support. This role requires a proactive approach to problem-solving and a deep understanding of the technical landscape in which Goldman Sachs operates, particularly in a highly regulated financial services environment.

Responsibilities

  • Own technical operations for systems that manage hundreds of thousands of compute cores
  • Build observability for new deployments to ensure robustness from day one, and for mature deployments to identify and implement improvements
  • Troubleshoot and resolve issues with block devices, file descriptors, and packet loss
  • Lead real-time outage investigations and present postmortems to senior management
  • Define SLIs and SLOs and partner with development teams to ensure systems are sufficiently well designed and instrumented
  • Partner with our development team throughout development and operations
  • Plan and manage deployments and migrations (including end-of-life programs)
  • Plan and implement robust business continuity and security programs
  • Provide regional coverage for the Procmon platform and participate in on-call support

Requirements

  • 5+ years of relevant professional experience
  • 3+ years of experience with Linux fundamentals and system administration
  • 3+ years of networking experience (familiarity with TCP/IP, IP routing, firewalls, secure tunneling protocols)
  • 3+ years of experience working with distributed computing systems and cloud computing environments
  • Excellent problem-solving and automation skills
  • Proficiency in at least one programming language; the team uses a mix of Go, Python and Erlang
  • Able to operate effectively in a mission-critical, highly regulated financial services environment

Benefits

  • Training and development opportunities
  • Firmwide networks
  • Wellness programs
  • Personal finance offerings
  • Mindfulness programs