Site Reliability Engineer II

$145,600/Yr

ManpowerGroup - Chandler, AZ

posted 3 months ago

Full-time - Mid Level

Chandler, AZ

11-50 employees

Administrative and Support Services

About the position

Our trusted client, a Fortune 50 financial services firm, is seeking a Site Reliability Engineer II for a long-term contract. This position is crucial for ensuring the reliability and support of the Container Platform across both on-premises and external cloud environments, including Azure, AWS, and Google Cloud. The role focuses primarily on operations and requires a strong background in managing applications on container platforms, as well as automation and platform management skills.

The ideal candidate will have extensive experience with Kubernetes, OpenShift, and various monitoring tools, and will be responsible for troubleshooting performance, connectivity, and security issues within the container platform. The Site Reliability Engineer will perform deep dives into systemic and latent reliability issues, manage incidents effectively, and conduct blameless Root Cause Analysis (RCA) in collaboration with engineering and operations teams. The role also involves identifying and driving opportunities for automation to enhance operational excellence, ensuring resiliency during implementation, and working closely with architecture, engineering, and product teams to design cloud services.

The position requires participation in 24x7 on-call coverage following a "follow the sun" model. The candidate must also be adaptable and responsive to changing project scopes and priorities.

Responsibilities

Ensure reliability and support of the Container Platform across on-prem and external clouds (Azure, AWS, Google).
Monitor and troubleshoot performance, connectivity, and security issues in the Container Platform, including OpenShift, Rancher (RKE), and Azure (AKS) environments.
Perform deep dives into systemic and latent reliability issues, managing incidents and problems effectively.
Identify, analyze, and resolve infrastructure vulnerabilities and application deployment issues.
Conduct blameless Root Cause Analysis (RCA) and collaborate with engineering and operations teams to roll out fixes.
Provide troubleshooting support throughout the lifecycle of applications on the container platform, including application onboarding.
Identify and drive opportunities for automation to reduce toil and improve operational excellence.
Collaborate with risk and compliance teams to ensure visibility, implement controls, and remediate vulnerabilities.
Ensure resiliency during implementation and work with engineering teams to identify and fix resiliency issues.
Act as a key stakeholder in the design of cloud services, working closely with architecture, engineering, and product teams.
Participate in 24x7 on-call coverage following a "follow the sun" model.

Requirements

Bachelor's or Master's degree in Computer Science or a related technical field involving systems, or equivalent practical experience.
5+ years of hands-on experience supporting Kubernetes/OpenShift/RKE/EKS container platforms.
Proficiency in Python, Ansible, Golang, and Shell scripting.
Strong experience with major services related to Compute, Storage, Network, and Security.
Experience with monitoring tools like Prometheus and Dynatrace, and cloud-native tools such as Azure Monitor and Log Analytics.
Strong understanding and experience working with complex IAM infrastructures, including Active Directory, Azure AD Connect, Azure AD, and Ping Identity or other SSO solutions.
Advanced knowledge of Linux OS, DNS, DHCP, Kerberos, and Windows Authentication.
Experience with CI/CD tools like Git/Jenkins and the GitOps model.
Excellent understanding of Linux/Windows operating systems administration.
Experience in container security and vulnerability remediation.
Systematic problem-solving approach with a strong sense of ownership and drive.
Ability to juggle competing priorities and adapt to changes in project scope.
Excellent interpersonal, organizational, and communication skills (written, verbal, and presentation).
Proven ability to work independently with minimal supervision, as well as part of a team with direct responsibilities.

Nice-to-haves

Kubernetes, OpenShift, and Terraform certifications are a plus.
Experience with Terraform, ArgoCD, Tekton, and Knative technologies.
Familiarity with agile deployment methodologies (GitOps) and various container runtimes.
Understanding of the operator deployment pattern.
Experience working in a highly available multi-datacenter environment.
Understanding of cost management, inventory management, and the FinOps model.

Benefits

Long-term contract opportunity
Competitive pay rate
Hybrid work model (3 days in office, 2 days work from home)
