Sapient Razorfish - Irving, TX

posted 22 days ago

Full-time - Mid Level
Irving, TX
10,001+ employees

About the position

The Site Reliability Engineer (SRE) will ensure the reliability, scalability, and availability of services across cloud and on-prem platforms, focusing on OpenShift and Grafana. This role combines expertise in automation, observability, and infrastructure management to optimize resource allocation and maintain service uptime, particularly for AI/ML and GPU-based workloads.

Responsibilities

  • Use tools like Ansible and Python to automate provisioning, monitoring, and scaling tasks.
  • Set up Grafana dashboards and Prometheus alerts to track service health, uptime, and performance metrics across platforms.
  • Deploy and manage applications on OpenShift or other Kubernetes-based platforms, ensuring efficient application lifecycle management.
  • Implement and automate monitoring for both cloud and on-prem environments, ensuring compliance with SLA requirements.
  • Monitor and optimize GPU and CPU utilization, ensuring resources are allocated efficiently across workloads.
  • Participate in Agile/Scrum sprint planning, collaborating with other teams to ensure tasks are delivered on time and aligned with service-level objectives.
  • Automate manual processes such as resource requests, tenant onboarding, and lifecycle management for AI/ML platforms and other workloads.

Requirements

  • Strong experience with automation tools such as Ansible, and with Python scripting for infrastructure management.
  • Proficiency in Grafana and Prometheus for monitoring and setting up alerting mechanisms.
  • Hands-on experience managing applications in OpenShift or other Kubernetes-based platforms.
  • Ability to automate service monitoring and infrastructure scaling in both cloud and on-prem environments, ensuring SLA compliance.
  • Experience with infrastructure management for cloud (GCP) and hybrid environments.
  • Experience with infrastructure as code (IaC) tools (Terraform).

Benefits

  • Flexible vacation policy; time is not limited, allocated, or accrued
  • 16 paid holidays throughout the year
  • Generous parental leave and new parent transition program
  • Tuition reimbursement
  • Corporate gift matching program