Site Reliability Engineer

Trillium Staffing
Santa Clara, CA
425d
$166,400 - $187,200

About The Position

Trillium Professional is seeking a Site Reliability Engineer to join their team in Santa Clara. This role focuses on monitoring and recovering assets in a private cloud environment, with a specific emphasis on building and stabilizing virtualization infrastructure. The engineer will deploy and maintain a large farm of machines using configuration management and infrastructure automation tools, while also participating in on-call support for infrastructure issues.

Requirements

  • Bachelor's or Master's Degree in Computer Science or Software Engineering, or equivalent experience.
  • 5+ years of professional experience required.
  • Strong system and platform debugging skills.
  • Virtualization experience (vSphere, Hyper-V, KVM, XenServer).
  • Familiarity with configuration management tools (Chef preferred, or Ansible).
  • Experience working in large-scale enterprise production systems.
  • Ability to debug and analyze system issues and code in order to triage, root-cause, and resolve infrastructure problems.
  • Familiar with maintenance and setup of Linux and Windows hosts.
  • Scripting proficiency in Python, Go, or Unix shell.
  • Experience with version control systems such as Perforce and Git.

Nice To Haves

  • Familiar with private cloud setups (VMware, Dell, Apple).
  • Experience with virtualization and container technologies such as VMware, KVM, Hyper-V, Docker, and Kubernetes.
  • Background in automating bare-metal and VM provisioning.
  • Experience supporting GPUs, embedded device development, driver development, and CUDA/TensorRT applications.
  • Development experience in Chef, Ansible, and infrastructure orchestration.

Responsibilities

  • Fleet monitoring & recovery of assets in a private cloud environment.
  • Building and stabilizing virtualization infrastructure of ESXi, KVM, and Hyper-V.
  • Deploying and maintaining a large farm of machines using Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).
  • Participating in on-call & rotational L1 support for round-the-clock monitoring and remediation of infrastructure issues (PagerDuty).
  • Analyzing and debugging operating system, networking, configuration, and performance problems.
  • Assisting in the roll-out and deployment of infrastructure configurations to support the latest hardware and technologies.
  • Contributing to the development of monitoring systems for real-time infrastructure monitoring (Zabbix, BigPanda, Grafana).

Benefits

  • Competitive pay rate of $80 - $90/hour.