Site Reliability Engineer II

Global Payments - Richardson, TX

Full-time - Mid Level

Credit Intermediation and Related Activities

About the position

As a Site Reliability Engineer II, you will play a crucial role in ensuring the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our systems. This position is designed to create a bridge between development and operations by applying a software engineering mindset to system administration topics. You will split your time between operations/on-call duties and developing systems and software that enhance site reliability and performance.

Your responsibilities will include chaos engineering, where you will think laterally about potential system failures, design tests to observe their behavior in practice, and formulate and implement remediation plans as necessary. You will be tasked with pushing our systems to their limits and designing solutions to elevate them to the next performance tier. Utilizing practices from DevOps and GitOps, you will improve automation and processes to enable self-service capabilities. Safeguarding reliability is paramount; you will ensure that our services are highly available, resilient against disasters, self-monitoring, and self-healing. You will run "game days" to test assumptions about reliability and identify what might break before it impacts our customers. Additionally, you will review designs with a focus on enhancing the holistic stability of our platform and identifying potential risks.

Building systems to proactively monitor the health, performance, and security of our production and non-production virtualized infrastructure will also be a key part of your role. Improving our monitoring and alerting systems is essential to ensure that engineers are paged when necessary and not disturbed unnecessarily. You will troubleshoot systems and network issues in collaboration with our Technical Operations Team and evolve our Software Development Life Cycle (SDLC), practices, and tooling to incorporate Site Reliability considerations and best practices. Lastly, you will be responsible for developing runbooks and enhancing documentation to support these efforts.

Responsibilities

Conduct chaos engineering to test system failures and implement remediation plans.
Push systems to their limits and design solutions for improved performance.
Utilize DevOps and GitOps practices to enhance automation and self-service capabilities.
Ensure high availability and resilience of services against disasters.
Run "game days" to test reliability assumptions and identify potential failures.
Review designs to enhance platform stability and identify risks.
Build systems for proactive monitoring of health, performance, and security.
Improve monitoring and alerting systems to optimize engineer response.
Troubleshoot systems and network issues with the Technical Operations Team.
Evolve SDLC practices and tooling to incorporate Site Reliability best practices.
Develop runbooks and improve documentation.

Requirements

Bachelor's degree in Computer Science, Information Technology, Business/Management Information Systems, or a related field.
Minimum of 2 years of relevant experience in a similar role.
Experience with Linux and UNIX, specifically RHEL and AIX systems.
Ability to read, write, and update shell scripts.
Understanding of PCI requirements and workflows.
Familiarity with Key Encryption processes and procedures.

Nice-to-haves

Experience in Public and Private Clouds, Jenkins, Terraform, Ansible, OpenShift, Kubernetes, or AWS EKS.
