Global Payments - Richardson, TX
posted 2 months ago
As a Site Reliability Engineer II, you will play a crucial role in the availability, latency, performance, and efficiency of our systems, and in their change management, monitoring, emergency response, and capacity planning. This position bridges development and operations by applying a software engineering mindset to system administration. You will split your time between operations and on-call duties on one side and developing systems and software that improve site reliability and performance on the other.

Your responsibilities will include chaos engineering: thinking laterally about how our systems might fail, designing tests to observe how they behave in practice, and formulating and implementing remediation plans as necessary. You will push our systems to their limits and design solutions that take them to the next performance tier. Using DevOps and GitOps practices, you will improve automation and processes to enable self-service capabilities.

Safeguarding reliability is paramount; you will ensure that our services are highly available, resilient against disasters, self-monitoring, and self-healing. You will run "game days" to test assumptions about reliability and identify what might break before it impacts our customers. You will also review designs with a focus on the holistic stability of our platform and on identifying potential risks.

You will build systems that proactively monitor the health, performance, and security of our production and non-production virtualized infrastructure. You will improve our monitoring and alerting so that engineers are paged when necessary and not otherwise. You will troubleshoot system and network issues in collaboration with our Technical Operations Team and evolve our Software Development Life Cycle (SDLC), practices, and tooling to incorporate site reliability considerations and best practices. Lastly, you will develop runbooks and enhance documentation to support these efforts.